Earlier this year, the Canadian A.I. startup Dessa brought a Joe Rogan doppelganger to life. Echoing the podcaster’s emphatic tone and swear-laden delivery, his voice clone described—in a script written by Dessa engineers—a chimpanzee hockey team. If you didn’t know better, you might think it was just another episode of The Joe Rogan Experience. But it was a digital puppet, a vocal deepfake that also recited a series of tongue twisters with robotic perfection.
Like your fingerprint, your voice is unique. Your vocal signature has a fundamental connection with your body, emerging from an idiosyncratic mix of your physiology, biology, habits, and personal and social history. At the same time, the human voice has historically been understood as an expression of the soul. It marks our privileged status as a species—after all, we are among the few animals, alongside parrots, songbirds, dolphins, whales, and elephants, capable of vocal learning.
Technologies that reproduce the human voice have always stirred existential concerns. In the late 19th century, Thomas Edison first split the voice from the human body. In 1877, the inventor announced his phonograph, a machine that could record and play back sound. For its earliest listeners, sound recording heralded a new era in which the voice would not die with the body. By preserving the voice, the phonograph promised to sustain life beyond death. According to an 1877 Scientific American article, the machine presented “the illusion of a real presence” that, like contemporary deepfakes, was difficult to distinguish from a human speaker. Others, like the composer John Philip Sousa, rued the day these “talking machines” came into being. In a well-known essay, he decried the “menace of mechanical music,” casting sound recording as a “substitute for human skill, intelligence, and soul.”
Vocal deepfakes—like deepfake videos and photos—are now poised to intensify an already alarming crisis around evidence, trust, and authenticity. Certainly, it’s worrying that vocal avatars could be deployed in the manner that deepfake videos and photos have been. For critics, a foreboding future looms where vocal deepfakes erode trust in traditional forms of evidence (and herald even more annoying robocalls and phone scams). For others, this nascent technology holds great promise, offering realistic vocal models for people with speech impairments, more convincing voice assistants, intimate chatbots, and myriad uses in the entertainment industry. Motivated more by artistic interests than commercial applications, musicians in particular envision different possibilities for the future of human and machine collaboration.
Corporate initiatives in A.I. voice synthesis have proliferated over the last few years. By drawing on existing audio archives, as Dessa did with Joe Rogan’s podcasts, these projects have tended to replicate existing cultural figures. In June, a pair of Facebook A.I. researchers, Mike Lewis and Sean Vasquez, released the results of their speech synthesizer, MelNet. Trained on a 452-hour data set including more than 2000 TED talks, the machine learning system generated uncanny vocal clones of Bill Gates, Jane Goodall, and George Takei, among other famous voices.
While fortune cookie sound bites of Bill Gates advising a listener to “pluck the bright rose without leaves” are novel, such vocal clones are not brand-new. In 2016, WaveNet, a project from Google DeepMind, synthesized voices by sampling existing human speech. Since then, a number of international startups and research groups have continued to develop the technology and its applications in ways that test traditional boundaries of identity. Cambridge-based Modulate builds voice skins that allow you to cloak yourself in someone else’s voice. Baidu’s Deep Voice can swap a voice’s gender or accent. Other projects are more altruistic. Through Project Revoice, a partnership with the ALS Association, Montreal-based A.I. startup Lyrebird, named for the Australian bird with the uncanny ability to mimic natural and artificial sounds, aims to restore digital voices to people with the disease who might lose their own.
These systems all learn to speak by analyzing human vocal nuance in massive amounts of audio data. But while earlier programs, including Dessa’s, were trained directly on audio waveforms, MelNet instead uses spectrograms, visual representations of audio. More informationally dense than a waveform, spectrograms can capture orders of magnitude more data. In their paper, Vasquez and Lewis emphasize MelNet’s superior capture of “high level structure”—the subtleties of accent, pitch, and cadence that imbue a voice with its identity. Though difficult to describe, these are the features to which the human ear is highly attuned. Lyrebird co-founder Jose Sotelo calls these audible signatures the “DNA of the voice.”
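The difference between the two representations is easy to see with standard signal-processing tools. Here is a minimal illustrative sketch in Python using NumPy and SciPy; the synthetic 440 Hz tone and all parameter choices are assumptions for demonstration, not MelNet’s or Dessa’s actual pipeline:

```python
import numpy as np
from scipy.signal import spectrogram

# Synthesize one second of a 440 Hz tone sampled at 16 kHz. The raw
# waveform is just a 1-D array of 16,000 amplitude values over time.
sample_rate = 16_000
t = np.linspace(0, 1, sample_rate, endpoint=False)
waveform = np.sin(2 * np.pi * 440 * t)

# A spectrogram re-expresses the same audio as a 2-D time-frequency
# image: each column is the spectrum of one short window of samples.
frequencies, times, spec = spectrogram(
    waveform, fs=sample_rate, nperseg=256, noverlap=128
)

print(waveform.shape)  # 1-D array of samples
print(spec.shape)      # 2-D array: frequency bins x time frames
```

Each column of the 2-D array summarizes the frequency content of a short stretch of the waveform, which is one reason a model reading spectrograms can pick up longer-range structure, like pitch contour and cadence, more directly than one reading raw samples.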
By reproducing these qualities, A.I. speech synthesis may threaten the unique status of the human voice. But it could also help us to find new ways of expressing our humanity. For her latest album, Proto, the experimental composer Holly Herndon collaborated with her partner, artist Mat Dryhurst, and A.I. expert Jules LaPlace to build an A.I. “baby” called Spawn. Trained on folksong choruses, Spawn helps her create music that blurs human and nonhuman voices in gorgeous, haunting, and sometimes jarring compositions. Where other musicians have used A.I. techniques, they’ve often turned the neural networks on their own catalogs or existing musical sources. Rather than automating the composition process, Herndon instead used the technology to foster new creative approaches. As something of a techno-utopianist, Herndon uses A.I. vocal technologies to discover the human within the machine.
Over two years, Herndon trained Spawn with a community of people, including a 300-person singing session at Berlin exhibition hall Martin-Gropius-Bau. I watched a recording of this session at the ISM Hexadome in San Francisco, a six-channel media installation Herndon also used for this live group training. Blending human and machine performances in dynamic call-and-response, the piece brought audience members, video recordings, and Spawn together in a profoundly moving chorus of hymns. In this way, Herndon folded this emergent technology into a quintessentially human activity. According to ethnomusicologist Gary Tomlinson, singing is intertwined with human history, culture, and evolution.
Much of the media conversation around A.I. rehearses a troubled vision of the technology’s terrible effects on human culture and society. As the story goes, machines are coming for our jobs, eventually to automate us into obsolescence. At the same time, media narratives about A.I. tend to erase the human labor that drives these machine learning processes, including arduous coding, training, data curation, and composition. By using A.I. to affirm our connections with each other, Herndon models an ethics of engagement that celebrates humans co-evolving with technology. As she tells Loud and Quiet, “The human body has been like a machine since industrialization, so how can technology get the body out of these machine-like motions so we can be more human together. That’s the vision.”
As they built Spawn, Herndon and her collaborators were highly aware that technologies encode values. In Proto, they’re thinking about protocol not only in terms of technological infrastructure, but as a “baseline set of rules that a community agrees upon.” As she tells the Fader, “What kind of values [do] we want to instill at the protocol layer before things get out of hand? What are we going to take as a shared truth? It’s not just a technology question—it’s political and social.”
In this sense, Herndon’s collaboration with her A.I. baby—and the many humans who gave birth to it—embodies a crucial approach to using A.I. as what MIT Media Lab Director Joi Ito calls extended, rather than artificial, intelligence. In Wired, he says, “Instead of trying to control or design or even understand systems, it is more important to design systems that participate as responsible, aware, and robust elements of even more complex systems.”
Voice clones threaten to legitimate fake news. Yet they can also augment, rather than replace, individual humans and the complex adaptive systems in which we work, play, and dwell. A.I. reflects our voices, and our values. It shows us the most automated parts of ourselves and challenges us to find more conscious expression. Used in that way, A.I. is not simply a mimic, but an improvisational partner. Rather than supplanting the human voice, a singing A.I. might join the chorus.
Future Tense is a partnership of Slate, New America, and Arizona State University that examines emerging technologies, public policy, and society.