Amazon has been working on its voice-enabled virtual assistant, Alexa, for years. It’s how the company’s Fire TV, its Echo smart speakers, and other Amazon devices are able to answer our questions—and know what we want.
While we know Alexa is always listening, how does she think? And how much of what she hears does she “remember”? In this week’s episode of Slate’s tech podcast If Then, I chatted with Al Lindsay, vice president of Alexa Engine Software at Amazon. He’s been at Amazon since 2004 and has led the Alexa team since 2011. In other words, he’s the guy in charge of building Amazon’s version of the all-knowing Star Trek computer. In our interview, we discussed how Alexa understands our commands, what he makes of users’ growing concerns over privacy, and why Alexa was laughing so creepily for some users recently.
You can read an abridged version of our conversation below, or stream or download the full discussion and episode via iTunes, Stitcher, Spotify, the Google Play store, or wherever you get your podcasts.
Will Oremus: Let’s start with the basics. Pretend I know almost nothing about technology. You probably won’t be too far from the truth. Explain to me, in the simplest possible terms, what happens when I ask my Amazon Echo, “Alexa, what’s the weather today?”
Al Lindsay: Sure. The first thing that happens is the local software running on your device, which is able to recognize the wake word “Alexa,” detects that you said the word “Alexa,” wakes up, alerts you that it’s now listening by lighting up the ring in blue, opens a connection to the cloud, and streams the rest of your request to Alexa in the cloud. That’s the first stage, which is understanding, “Hey, you’re talking to me. I need to do something with this, get it over to the cloud, so that Alexa can process it.”
The next part is understanding the words that you said, so speech recognition, or what we call ASR, which stands for automatic speech recognition. That’s understanding the words: based on the language, trying to work out what strings of words you might have said. Then we get those over to a system we call natural language understanding, which tries to make sense of the meaning of those words. It might be able to understand that you said, “What’s the weather?” Recognizing that those words translate to a request that needs to be routed to a piece of software that understands weather, and location, and those types of things is the understanding layer.
Then there’s weather itself, which I think of as an application. We call them skills. Skills are like applications, and they handle your request. You get that request over to the weather skill, and it’s able to figure out the answer to your query and speak it back to you using the text-to-speech engine.
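The stages Lindsay describes (wake word, speech recognition, natural language understanding, skill, spoken reply) can be pictured as a chain of small functions. What follows is only a toy sketch of that flow, not Amazon's implementation: it works on text rather than audio, and every name in it (`detect_wake_word`, `recognize_speech`, `understand`, `SKILLS`, `handle`) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

WAKE_WORD = "alexa"  # assumption: a single fixed wake word


@dataclass
class Intent:
    skill: str                      # which skill should handle the request
    slots: dict = field(default_factory=dict)


def detect_wake_word(utterance: str) -> bool:
    """Stand-in for the on-device detector (matches text, not sound)."""
    return utterance.lower().startswith(WAKE_WORD)


def recognize_speech(utterance: str) -> str:
    """Stand-in for ASR: return the words that follow the wake word."""
    return utterance[len(WAKE_WORD):].strip(" ,.?!").lower()


def understand(text: str) -> Intent:
    """Stand-in for NLU: simple keyword matching picks a skill."""
    if "weather" in text:
        day = "today" if "today" in text else "unspecified"
        return Intent(skill="weather", slots={"day": day})
    return Intent(skill="fallback")


def weather_skill(intent: Intent) -> str:
    return f"Here is the weather for {intent.slots['day']}."


def fallback_skill(intent: Intent) -> str:
    return "Sorry, I don't understand your question."


SKILLS = {"weather": weather_skill, "fallback": fallback_skill}


def handle(utterance: str) -> Optional[str]:
    """Run the pipeline; return the reply text, or None if never woken."""
    if not detect_wake_word(utterance):
        return None                      # nothing leaves the device
    text = recognize_speech(utterance)   # "cloud" side starts here
    intent = understand(text)
    return SKILLS[intent.skill](intent)  # a real system would pass this to text-to-speech
```

In this sketch, `handle("Alexa, what's the weather today?")` is routed to the weather skill, while an utterance without the wake word returns `None` and is never processed further.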
What about the question of whether Alexa is “always listening”? You hear that sometimes. You have this device in your living room, it’s listening to you all the time. If nothing else, it’s listening to make sure you’re not saying the word Alexa, right?
What’s happening is the software runs locally on the device itself, and it’s listening locally for the word Alexa. There’s no connectivity or streaming, at that time. The sound is passing through the microphones. The engine is simply looking for that one pattern. Do I see the pattern Alexa? If not, it’s just passing through. It’s only when we detect the phrase Alexa, locally, that we then wake up and say, “Hey, this was meant for me. Now I need to take further action and actually start to listen and stream to the cloud.”
What about this fear that, over time, this device that you have in your living room, that is listening to you and analyzing what you say, will just get to know so much about you that you should really be concerned about privacy? I’ve actually heard people say, “I would never get an Amazon Echo, or I would never get a Google Home because I’m just concerned about the privacy. I don’t like the idea of inviting something into my living room that’s going to be learning all this stuff about me.”
Right. Well, we take privacy really seriously at Amazon. It’s something that we thought deeply about right from the beginning. The whole product and the whole Alexa experience is designed around being thoughtful about privacy and customers’ concerns about that. A lot of the decisions that we’ve made to date, and will continue to make, are centered around being transparent in what we do. Right from the experience I’ve described, where you say the wake-word, and then the blue ring lights up. The blue ring is there to reassure you, by letting you know, like, “Hey, I think I heard my name. I’ve just opened a connection to the cloud, so that I can try and follow-up on what I heard after and, hopefully, help you.”
All the way through to the transparent way in which you can go into the app, see your entire history of utterances, delete them, or see how Alexa interpreted them. You said something and Alexa thought you said, “Play music.” She got it wrong. You can see that right in your utterance history. That was all about being transparent and showing people, “Hey look, here’s what Alexa is doing,” and giving you some control over it. Now, go ahead and delete it, make it go away.
OK, got it. I think what you said is that when you haven’t said the wake word, “Alexa,” when you’re just in your kitchen, or in your living room, with the device and it’s listening for the word Alexa, it is, in a sense, listening to everything you say, but that stuff is not going to some remote server. That stuff is all staying on the device. Is there a way that we know that the device then disposes of it, so that it can’t get hacked and nobody can get into your Echo and see all the stuff that it recorded when you weren’t trying to talk to it?
Sure. I think I used the phrase “passes through” because, without getting too deep into the technology, it literally is inspecting the acoustic pattern. So it has no notion, or sense, of words or meaning. All of that’s in the cloud, right? It’s looking for a pattern that matches “Alexa.” Everything else is just sound waves passing through. They’re not recorded; they’re literally passing through a buffer and disappearing as they flow through if that pattern isn’t recognized.
It’s not until you snap onto that pattern, you’re like, “There’s that pattern. That’s Alexa.” Now, it opens a channel, takes everything from this point forward, and streams it to the cloud.
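The “passing through a buffer” idea can be illustrated with a small fixed-size window: frames enter, the oldest ones fall out, and nothing is kept unless the window matches the wake pattern. This is a simplified sketch under loud assumptions. Real detectors score acoustic features with a trained model rather than comparing tokens for equality, and the function name `process_audio` and the symbolic frames here are hypothetical.

```python
from collections import deque
from typing import List, Sequence


def process_audio(frames: Sequence[str], wake_pattern: Sequence[str]) -> List[str]:
    """Return only the frames that would be streamed after the wake pattern.

    Assumption: each frame is a symbolic token standing in for a slice of
    audio, and a "match" is exact equality with wake_pattern.
    """
    # Bounded window: appending to a full deque silently drops the oldest frame.
    window = deque(maxlen=len(wake_pattern))
    streamed: List[str] = []
    awake = False

    for frame in frames:
        if awake:
            streamed.append(frame)   # after waking, everything is streamed
            continue
        window.append(frame)         # before waking, frames just pass through
        if list(window) == list(wake_pattern):
            awake = True
            window.clear()           # the buffered frames are not retained

    return streamed
```

With the pattern `["a", "lex", "a"]`, background frames before the match are discarded as the window slides, and only the frames after the match are returned.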
Could there be a positive side, though, to better understanding some of the stuff that Alexa, in theory, could pick up? I mean, I know, for instance, if you do certain searches in Google that suggest you might be considering an act of terrorism, or that you might be considering suicide, Google would take certain steps that it wouldn’t normally take to try to protect you and/or others. Is that something that your team has had to think about?
Not my team specifically, but I do believe we’ve done work with the national crisis center. When we do get those utterances, where someone’s expressed that they’re struggling with something in their life, we’ve carefully crafted responses to try to be helpful and direct them to places they can get help. There definitely are instances of that in our experience.
What’s a privacy concern about voice A.I. software, and about Alexa, that you think is valid, one that worries you too and that you’re actually trying to tackle, or that’s a challenge for your team?
I don’t really have a response for that. I feel there isn’t really anything that falls into that category, that I’m aware of, or focused on.
A lot of people, recently, became familiar with the “creepy laugh” that emanated from some Alexa devices, where it just seemed to start laughing randomly in people’s living rooms. Obviously, people were freaked out by that. I think we came to learn that maybe it wasn’t as creepy as people might have thought. I think people were saying it thought it heard the word “laugh,” or the command, “Alexa laugh,” but it didn’t actually hear that. Is that what was happening?
Yeah, the shorter the utterance, the harder it is to get accurate ASR. I mean, if you say, “Play supercalifragilisticexpialidocious,” that’s probably one of the easiest words to snap to, because nothing else sounds like it. But short, one-syllable commands, “laugh” being one of them, are easily confusable with other things. I think the combination of either an intended wake-up, but then a misunderstanding of the command, or an unintended wake-up and a misunderstanding of whatever noise came after it, can result in a misrecognition, and in this case, it just happened to be misrecognized as “laugh.”
The editorial that we had, had Alexa just laugh when you asked her to laugh. I think that, that second part may have been jarring to people and so, now, she’ll say something along the lines of, “Sure, I can laugh,” and then laugh. At least you have a little bit of context for what just happened.
But it’s a hard job that you’ve set for yourself. I’ve heard you describe the vision, again, as building a Star Trek computer, something that can answer any question that you might have, or help you out with whatever you might need. That would seem to suggest that you’d need something like what is sometimes called in the industry, general A.I., or hard A.I.—an artificial intelligence that understands a lot about the world and is not just smart about one particular thing, like telling you the weather, or ordering you a Domino’s pizza.
That said, you guys, your team, seems to have taken a little bit of a piecemeal approach, where you’re not trying to build an artificial intelligence genius from the ground up, you’re trying to work on one problem at a time and maybe if you figure out how to solve enough little problems, over time, then eventually they could add up to something that can answer almost anything you’d ask. Is that a fair description of the approach you’re taking to this problem?
Let me turn it around the other way, because when I think about invention and these large changes in technology, the mental model, I think, lay people tend to have is that there’s the genius in the corner that has an epiphany, or falls in the bathtub and bumps her head and has a vision for the flux capacitor that’s making time-travel possible. But often the way it works is 99 percent perspiration and 1 percent inspiration. It’s just a lot of hard work.
I don’t know that it’s necessarily incremental. I feel that when Echo came into the market, with solutions to far-field speech recognition and highly accurate wake word technology and really good natural language understanding, all of those were really large leaps. I don’t think of them as incremental. Some of those were just accepted in the science community to be intractable, unsolvable problems.
I do feel we try to point ourselves at the hardest problems first and go after those.
[Alexa] really is a magical experience, that first time you get an Echo in your home and you ask it a question and it just answers you. I mean, it’s amazing, or you ask it to do something, and it just does it. In the long run, if Alexa becomes as successful as your team hopes, and becomes this entry point toward buying things, toward taking all sorts of online actions, toward learning about the world, what about the concern that this gives Amazon a lot of control over the flow of information? A lot of control over who buys what? I mean, if I ask to order a pizza through my Alexa, Amazon, in some sense, is getting to choose. Who’s the default pizza vendor for Alexa? Which companies get to partner with Amazon? What are the terms of that?
Is there any concern? Is there any validity to the concern that Amazon is inserting itself as a very powerful middleman in all kinds of transactions between people and the online world?
Well, I think, when I think about Alexa, I think about user-interface paradigms. I think about technology growing up from command line interfaces through the ’70s, to the invention of the graphical user interface, with the mouse and the keyboard. Then you had the onset of the internet, and browsers, and search engines. Then touch screens, iPhones, and tablets.
I think about the voice interface as a natural evolution of those technology interfaces, and only as a way to interact with technology, your platform, or a service that underlies it, more so than, I think, the way you’ve presented it. I mean, Amazon today is an awesome retailing platform that allows other third-party merchants to sell, just as we sell products to our customers directly on our own platform.
I think adding a voice capability to something like shopping just removes friction and it makes things easier for customers. It’s a much more natural way to interface with technology. In that way, I view it as a positive advancement in the way that we interface with technology.
One way of looking at it, as you’ve pointed out, is that in the past people have had to learn to speak computer and now you’re teaching the computer to speak our language, so that it can relate to us on our own terms.
That’s a good way to look at it.
What’s the single biggest technical challenge to making Alexa as smart, and as capable, as you want it to be? You mentioned a couple of big advances that made the Amazon Echo possible. One of them was just the ability to hear somebody from across the room and isolate that sound above all the other ambient noise. Another was the natural language understanding, the baseline ability to parse your words and figure out what it is you’re saying. What, in the next five years, or 10 years, constrains a device like Alexa, because as wonderful and powerful as it is, you still … I mean, anybody who buys one, you quickly learn that you can’t just ask it anything and have it answer.
I mean a lot of stuff it’s going to say, “Sorry, I don’t understand your question,” or that sort of thing. Why is that and what might change that in the future?
I think context is an important challenge. When we converse with other humans, there’s all kinds of nonverbal clues that you pick up on, or you have a history with the other individual, things you’ve done over your lives. There’s where you are, what you’re doing at that moment. When you say something to a friend, like, “I wonder what’s up with that guy?” As words, it’s hard to parse that and make sense of it, from a natural language understanding perspective, but you, as a human, have no problem at all understanding the inferences that are hidden in that question.
Computers can’t do this today. They struggle to bring to bear all of the context and nonverbal clues, environmental clues, what’s going on in the world clues, to be able to sort through those shortcuts and be able to really get at the heart of what you meant. It’s more hear what I mean, not what I say. I think that’s the biggest challenge facing all of the A.I. providers in the next, as you said, five years. How do we become more contextually aware?
In Alexa, we do it in the small. If you’ve got multiple devices nearby, and you say, “Stop.” The one that heard you, there’s nothing going on, on that device, but elsewhere in the room there’s another device that’s playing some music, that context is useful and we can figure out, “Hey, we should stop the music on the other device,” or the video that’s playing on your Fire TV. These are nice smaller examples, but in the large, I think, having context over multiple interactions, and understanding your environment, and who’s present, and who’s not present, and where you are geographically, or physically, or things that you’ve had affinity to in the past will allow a more natural conversational interface for you with your artificial agent. Where you won’t feel like it’s limited in what you can ask. Maybe you can ask anything. Maybe you can have a hypothetical conversation about current events in the world, like, “Hey, what do you think about what’s going on in the Middle East?” I think that’s the meat of the challenge to get to the next level.
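The multi-device “Stop” example is a small case of using context to resolve an ambiguous command. A minimal sketch of that decision might look like the following. It is an illustration under assumptions, not a description of how Alexa actually arbitrates between devices, and the `Device` class and `resolve_stop_target` function are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Device:
    name: str
    activity: Optional[str] = None  # e.g. "music" or "video"; None when idle


def resolve_stop_target(heard_by: Device, nearby: List[Device]) -> Optional[Device]:
    """Pick the device that "Stop" most plausibly refers to.

    Assumed rule: prefer the device that heard the command if it is doing
    something; otherwise fall back to the first nearby device that is.
    """
    if heard_by.activity is not None:
        return heard_by
    for device in nearby:
        if device.activity is not None:
            return device
    return None  # nothing is playing, so there is nothing to stop


kitchen = Device("kitchen Echo")                     # idle, but it heard "Stop"
living_room = Device("living room Echo", "music")    # playing music
target = resolve_stop_target(kitchen, [kitchen, living_room])
```

Here `target` is the living room device, because the device that heard the command had nothing to stop. The broader challenge Lindsay describes, carrying context across many interactions, is much harder than this single rule suggests.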