Now You’re Talking!

Google has developed speech-recognition technology that actually works.

If you’ve got an Android phone, try this: Hit the microphone icon on the home screen, then ask, “How many angstroms in a mile?” Use your normal speaking voice; don’t speak slowly or strain to over-pronounce “angstrom.” So long as you have a good Internet connection, the phone shouldn’t take more than a second to recognize your question and shoot back a reply: 1.609344 × 10^13.
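For the curious, the figure is easy to check by hand. The sketch below is only an illustration of the arithmetic, not a description of how Google computes its answer. It assumes the standard definitions of one mile as exactly 1,609.344 meters and one angstrom as one ten-billionth of a meter; the function and constant names are made up for this example.

```python
from fractions import Fraction

# Assumed definitions (standard values, not taken from the article):
# 1 international mile = 1609.344 meters, 1 angstrom = 1e-10 meters.
METERS_PER_MILE = Fraction("1609.344")
METERS_PER_ANGSTROM = Fraction(1, 10**10)


def angstroms_per_mile() -> Fraction:
    """Return the exact number of angstroms in one mile."""
    return METERS_PER_MILE / METERS_PER_ANGSTROM


result = angstroms_per_mile()
# Format in scientific notation with six decimal places.
print(f"{float(result):.6e}")
```

Using exact fractions instead of floating-point numbers keeps the division free of rounding error, which is why the result comes out as a whole number.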

This works with all kinds of queries. Say “what’s 10 times 10 divided by 5 billion” and the phone will do math for you. Say “directions to McDonald’s” or read out an address—even a vague one like “33rd and Sixth, NYC”—and Android will pull up a map showing where you want to go. It works for other languages, too: Android’s Translate app (also available for the iPhone) will not only convert your English into spoken French (among several other languages) but also has a “conversation mode” that will translate the French waiter’s response back into English. And if that’s not enough, Android lets you dictate your e-mail and text messages, too.

If you’ve tried speech-recognition software in the past, you may be skeptical of Android’s capabilities. Older speech software required you to talk in a stilted manner, and it was so prone to error that it was usually easier just to give up and type. Today’s top-of-the-line systems, like software made by Dragon, don’t ask you to talk funny, but they tend to be slow and use up a lot of your computer’s power when deciphering your words. Google’s system, on the other hand, offloads its processing to the Internet cloud. Everything you say to Android goes back to Google’s data centers, where powerful servers apply statistical modeling to determine what you’re saying. The process is fast, can be done from anywhere, and is uncannily accurate. You can speak normally (though if you want punctuation in your e-mail, you’ve got to say “period” and “comma”), you can speak for as long as you’d like, and you can use the biggest words you can think of. It even works if you’ve got an accent.

How does Android’s speech system work so well? The magic of data. Speech recognition is one of a handful of Google’s artificial intelligence programs—the others are language translation and image search—that get their power by analyzing impossibly huge troves of information. For the speech system, the data are a large number of voice recordings. If you’ve used Android’s speech recognition system, Google Voice’s e-mail transcription service, Goog411 (a now-defunct information service), or some other Google speech-related service, there’s a good chance that the company has your voice somewhere on its servers. And it’s only because Google has your voice—and millions of others—that it can recognize mine.

Unless you’ve turned on Android’s “personalized voice recognition” system, your recordings are stored anonymously; that is, Google can’t tie your voice to your name. Still, the privacy implications of building a huge database of millions of people’s utterances are fascinating, so fascinating that I’ll devote my next column to discussing them. Leaving aside privacy concerns for a moment, it’s undeniable that speech recognition is one of a number of programs that could only have come about because of our newfound capacity to store and analyze lots and lots of information. In some ways the future of software, and thus of the computer industry, depends on such databases. If The Graduate were filmed today, the job advice to Benjamin Braddock would go like this: “One word: data.”

To understand why Google’s stash of recorded voice snippets is necessary for speech recognition, it helps to understand the history of creating machines that can decipher speech. Late last year, I met Mike Cohen, the head of Google’s speech system, in a nondescript conference room at Google’s Mountain View, Calif., headquarters. Cohen is one of the world’s experts in voice-recognition systems; he’s been in the business for decades, and he’s seen it evolve from a field dominated by linguists who were interested in computers to one dominated by engineers who are interested in linguistics.

“In the 1970s, there were two camps that didn’t really talk to one another,” Cohen says. The linguists believed, more or less, that the number of distinct sounds in human speech could ultimately be analyzed and turned into a set of computational rules. All you needed to do, they thought, was listen to enough human speech and then map, in painstaking detail, the frequencies of the sounds you heard. Once all the different sounds were analyzed and stored in a reference library, a computer would be able to recognize a given sound just by looking it up.

While this seemed to make intuitive sense, the engineers saw one glaring problem: It would never scale. The engineers believed they could get much further with computational analysis—if you gave a powerful computer enough audio samples, it would eventually be able to find all sorts of nuances that human linguists would never be able to identify. This was best expressed in a famous quote by Frederick Jelinek, one of the field’s pioneering computer scientists: “Every time I fire a linguist, the performance of the speech recognizer goes up.”

Over the years both sides bridged their differences, Cohen says, and today’s speech-recognition systems use deep insights from linguistics and engineering. Still, it turned out that the engineers were right on the fundamental problem: There are too many different possible sounds in human speech to be described by explicit linguistic rules. Cohen points out one small example. To most people the “a” sound in the words “map,” “tap,” and “cat” seems identical. In fact, there are very subtle differences. To create the “m” sound in “map,” you bring your lips together, forming a long closed tube in your vocal tract. This affects the “a” sound that follows: Since your throat is transitioning from the low-frequency “m” sound, the first 10 to 30 milliseconds of the “a” in “map” include many low-frequency notes that aren’t found in the early part of the “a” in “tap.” Now imagine how many such nuances there are for all the different words and combinations of words in every different language. “There’s no way we could do it by writing explicit rules,” Cohen says. The only way to find all these differences is through large-scale data analysis, by having lots of computers scrutinize lots and lots of examples of human speech.

But where to get all that speech? “A big bottleneck in the field has been data,” Cohen says. For many years researchers knew the theoretical process for building speech-recognition systems, but they had no idea how to get enough human chatter, or enough computing power, to actually do it. Then came Google. It turns out that the very same infrastructure that Google needed to build a fantastic search engine—acres and acres of data centers to store and analyze Web sites, and a range of internal processes that are specifically tuned to managing large amounts of information—would also be effective for solving speech recognition and other artificial intelligence problems, Cohen says.

There’s a lot of overlap between search and speech. To decipher your speech, Google’s system doesn’t just use recorded voices. It also relies on a host of other data, including billions of written search queries that it uses to predict the words you’re most probably saying. If you say “33rd and Sixth, NYC,” your “NYC” might sound like “and I see,” but Google knows that you’re probably saying “NYC,” because that’s what a lot of other people mean when they say that phrase. Altogether, Google’s speech recognition program comprises many billions of pieces of text and audio; Cohen says that building just one part of the speech-recognition system required “roughly 70 CPU-years” of computer time. Google’s cloud of processors can do that amount of crunching in a single day. “This is one of the things that brought me to Google,” Cohen says. “We can now iterate much more quickly, experiment much more quickly, to train these enormous models and see what works.”
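To make the “NYC” example concrete, here is a toy sketch of the general idea: combine a score for how well each candidate phrase matches the audio with a score for how often people actually type that phrase. This is not Google’s system. The query counts, the acoustic probabilities, and the function names are all invented for illustration, and the simple add-one smoothing is an assumption of this sketch.

```python
import math

# Hypothetical counts of how often each phrase shows up in typed queries.
QUERY_COUNTS = {
    "33rd and sixth nyc": 900,
    "33rd and sixth and i see": 1,
}


def language_score(phrase, counts, smoothing=1.0):
    """Log-probability of a phrase from query counts, with add-one style
    smoothing so that unseen phrases never get a probability of zero."""
    total = sum(counts.values()) + smoothing * (len(counts) + 1)
    return math.log((counts.get(phrase, 0) + smoothing) / total)


def pick_transcript(acoustic_log_scores, counts):
    """Return the candidate with the best combined acoustic + language score."""
    return max(
        acoustic_log_scores,
        key=lambda p: acoustic_log_scores[p] + language_score(p, counts),
    )


# Made-up acoustic scores: the audio alone slightly favors "and i see".
acoustic = {
    "33rd and sixth nyc": math.log(0.45),
    "33rd and sixth and i see": math.log(0.55),
}

best = pick_transcript(acoustic, QUERY_COUNTS)
print(best)
```

In this toy setup the audio on its own leans toward “and I see,” but the query counts tip the decision to “NYC.” With no query data at all, the acoustic favorite would win.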

Speech recognition is still a very young field. “We don’t do well enough at anything right now,” Cohen says. He notes that the system keeps getting better, and more and more people keep using Android’s voice search, but we’re still many years (and maybe even decades) away from what Cohen says is Google’s long-term vision for speech recognition. “We want it to be totally ubiquitous,” he says. “No matter what the application is, no matter what you’re trying to do with your phone, we want you to be able to talk to your phone.”