This week could safely be described as the week of Laurel vs. Yanny, the week that the “audio version of the dress” took over the internet. After the ambiguous shoe, the Adidas jacket, and the dresser, not to forget the jumping pylons, you could be forgiven for suffering from #thedress fatigue. And yet, this one spread like wildfire because it is something new. This time, you could hear it!
Briefly, this new internet sensation was created a couple of days ago in collaboration between a student—Katie Hetzel, who looked up the definition of the word laurel on Vocabulary.com, but heard yanny when listening to the pronunciation—and several others who popularized the story on social media, including Reddit and Twitter.
The result? A several-day-long internet frenzy with people arguing whether the audio recording says “Laurel,” “Yanny,” or something else entirely, with several major celebrities chiming in, recalling the dress situation from three years ago (and then people co-opted laurel vs. yanny for their own agendas, sort of ruining the fun).
Regardless, amid all of this excitement and despite many efforts, two major questions have remained unanswered: “What is happening, and why is it happening?”
As with the dress, there is no shortage of wild speculation and bold hypotheses as to what might be going on. Just as we speculated that the kind of device or level of brightness might play a decisive role in how people saw the dress, some suggested that speaker quality determined whether you heard Laurel or Yanny. But just as in the case of the dress, the suggestion that such a phenomenon can be so easily explained, and that this is essentially an exercise in mass hysteria, doesn’t hold water. People deserve more credit than that.
The most compelling evidence that this can’t be easily dismissed or explained, just as with the dress, is that two people who hear the recording on the same device can hear two radically divergent interpretations. Laurel and Yanny are not close or easy to mistake for each other. I, like so many people, experienced the strangeness of this effect by listening with my teaching assistant Julie Cachia: She heard “Laurel”; I heard “Yanny.” We were not alone.
So something interesting is going on. What is it?
First, it is important to realize that auditory perception, and speech perception in particular, is extremely malleable and open to interpretation. You essentially always experience a highly edited version of the physical sound signal. For instance, those pauses between words that you are hearing, that allow you to parse the meaning of a sentence? Yeah, they are not present in the audio stream but added by the listener. A.I. systems have a real problem parsing natural language when it is spoken very fast, because they can’t do this as we can. If you want to experience this effect for yourself, listen to a foreign language that you have no knowledge of (I recommend Italian) and see if you can make out the pauses between the words. Now you know what babies are going through when trying to pick up their first language.
Expectations can shape all kinds of experienced meaning. Even radically impoverished sound signals (in the extreme case, just a few repeating speech fragments) can be “heard” in any number of ways. Consider this example of “phantom words,” which were pioneered by Diana Deutsch.
How many interpretations can you come up with? I counted more than 50, from “real well” to “bueno” to “no way,” “nowhere,” “railway,” “mailman,” “random,” and so on. The reason for this remarkable ability to find meaning in poorly constrained stimuli is that the physical stimulus is consistent with many interpretations, which gives expectations and subjective biases the degrees of freedom to take over.
If a word is presented in isolation, these expectations can even be the decisive factor in what is experienced. Consider this recent tweet that illustrates this principle nicely. Whether you hear “brainstorm” or “green needle” depends mostly on your expectations, because the isolated and impoverished audio signal is consistent with both interpretations.
What prevents this from happening in real life is that words are rarely presented in isolation. The sentence a word is embedded in usually provides a sufficiently disambiguating context. Moreover, the speech quality is usually not so bad that you have to guess entirely.
If you do, it can be extremely helpful to know what someone is talking about to actually hear the sounds as meaningful speech. In other words, experience matters. For instance, listen to this clip:
Can you make out the words? Are these even words? It is speech, but the sounds have been digitally impoverished. Now try this one, the unaltered original speech:
If you now click on the first one again, your experience should have been fundamentally altered. We presumably just performed precision microsurgery in your brain. You cannot unhear this—it would still influence how you heard the altered speech if you were to play it many years from now.
We still don’t understand exactly what happens in the underlying brain that brings about these rapid, dramatic, and everlasting changes. All of this is simply meant to illustrate that the physical stimulus in (auditory) perception is usually extremely open to interpretation and disambiguated by expectations that usually derive from experience.
Does this mean that the Laurel/Yanny thing is uninteresting and old hat? Not at all; on the contrary. I must confess that this is the first time in my life that I’m genuinely excited about speech perception. The reason for this is that in all the examples we gave so far, the experience is fairly predictable: An impoverished physical stimulus is largely open to interpretation and disambiguated by expectations stemming from context and experience. But why does the subjective experience here cluster into such radically different experiences? And why can they shift back and forth? This is where the lessons from the dress come in again. In 2015, the idiosyncratic experience of radically divergent percepts was puzzling. It took years to establish that there are commonly three steps to perception that can account for the perception of the dress, and probably other phenomena:
1. Perception is fundamentally a guess. It just doesn’t feel like that because the brain tells you that it is “sure” about its interpretation at any given point.
2. Because it is a guess, it can be wrong. In particular, if the sensory evidence is weak, the brain will rely more on assumptions (derived from prior experience). For instance, it can be shown that people rely on stereotypes to predict behavior most strongly when they know nothing else about a person. Once they get to know them, the new evidence dominates, a process known as “individuation.”
3. Because life experience differs between people, these assumptions—and consequently the conclusions or interpretations—differ too.
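One way to make these three steps concrete is to treat the guess as a simple probability update. The sketch below is an illustration under my own assumptions, not a model taken from the research on the dress: The function name, the two hypothetical listeners, and all of the numbers are made up. The “prior” stands for assumptions derived from experience, and the “likelihood ratio” stands for how strongly the sound itself favors one word over the other.

```python
def posterior_laurel(prior_laurel, likelihood_ratio):
    """Probability of settling on 'Laurel' after hearing the sound.

    prior_laurel:     expectation of 'Laurel' before hearing anything (0 to 1).
    likelihood_ratio: P(sound | Laurel) / P(sound | Yanny). A value near 1
                      means the sound barely favors either word (weak evidence).
    """
    prior_odds = prior_laurel / (1.0 - prior_laurel)
    posterior_odds = prior_odds * likelihood_ratio  # Bayes' rule in odds form
    return posterior_odds / (1.0 + posterior_odds)


# Two hypothetical listeners whose life experience gave them different priors.
listeners = {"listener_a": 0.8, "listener_b": 0.2}

for label, ratio in [("ambiguous clip", 1.0), ("clear clip", 50.0)]:
    for name, prior in listeners.items():
        p = posterior_laurel(prior, ratio)
        print(f"{label:15s} {name}: P(Laurel) = {p:.2f}")
```

With the ambiguous clip, each listener simply ends up where their prior started, so the two disagree. With the clear clip, the evidence swamps the prior and both land on the same word.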
In the case of the dress, this takes the form of color constancy, an ongoing process in the brain by which the organism ensures that the interpretation of the color of objects is invariant despite changing wavelengths in the illuminant. For instance, the spectral composition of daylight changes dramatically throughout the day. In the dress photograph, the illumination was ill-defined, so people had to rely on their assumptions to infer which illumination was more likely, so that they could then discount it to achieve color constancy. And people assume what they have experienced more of in the past. (If you have only ever seen horses, but were to encounter a unicorn once, you would probably conclude that you are looking at a horse with a particularly disfiguring facial tumor. Story of my life.) What we learned about the dress is that the bottom line seems to be that people who rise early—everything else being equal (the fact that everything else is not equal necessitates large sample sizes for studies of this effect)—are likely to encounter more daylight relative to night owls, over the course of a lifetime. Night owls will discount yellowish artificial light and perceive the #dress as black and blue whereas early risers will discount bluish daylight and perceive it as white and gold, which is what we have found empirically.
This helps us understand this Laurel/Yanny phenomenon because it provides an analogy. And sound stimuli are arguably easier to conceptualize parametrically than visual stimuli. Physically, sounds correspond to vibrations of (air) molecules, which in the case of speech are brought about by the movements of the vocal cords. One can visualize the relative amplitude of these vibrations at different frequencies over time. Such a visualization is called a spectrogram.
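For readers who want to see how such a picture is built, here is a minimal sketch in Python. It is not the tool used to make the figures in this article, and it runs on a synthetic signal (a 500-hertz tone followed by a 2,000-hertz tone) rather than on the actual recording. The function names, the frame size, and the sample rate are all assumptions chosen for illustration. The idea is to chop the sound into short overlapping frames and measure how strong each frequency is within each frame.

```python
import cmath
import math


def spectrogram(samples, frame_size, hop):
    """Split the signal into overlapping frames and return, for each frame,
    the magnitude at every frequency bin from 0 Hz up to half the sample rate."""
    # Hann window: tapers each frame to zero at its edges to limit smearing.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        chunk = [samples[start + n] * window[n] for n in range(frame_size)]
        magnitudes = []
        for k in range(frame_size // 2 + 1):
            # Plain discrete Fourier transform for bin k (slow but transparent).
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


def peak_frequency(frame, sample_rate, frame_size):
    """Frequency (in hertz) of the loudest bin in one frame."""
    loudest_bin = max(range(len(frame)), key=lambda k: frame[k])
    return loudest_bin * sample_rate / frame_size


# Synthetic test signal (not the real recording): 500 Hz, then 2,000 Hz.
SAMPLE_RATE = 8000
FRAME_SIZE = 64
signal = [math.sin(2 * math.pi * 500 * n / SAMPLE_RATE) for n in range(400)]
signal += [math.sin(2 * math.pi * 2000 * n / SAMPLE_RATE) for n in range(400)]

frames = spectrogram(signal, FRAME_SIZE, hop=32)
first_peak = peak_frequency(frames[0], SAMPLE_RATE, FRAME_SIZE)
last_peak = peak_frequency(frames[-1], SAMPLE_RATE, FRAME_SIZE)
print(first_peak, last_peak)
```

Each entry of `frames` is one vertical slice of the picture: time runs along the list of frames, frequency along the bins within a frame, and the magnitude is what gets mapped to color. In practice you would use an optimized library routine rather than this slow hand-rolled transform.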
If you look at the spectrogram of the Laurel/Yanny recording, something interesting emerges: There seem to be two clusters, one with low frequencies at the bottom, one with high frequencies at the top.
You are looking at a series of spectrograms. The central one is the original ambiguous recording itself. Those to the left of it have been “low-pass filtered.” In a low-pass filter, vibrations at high frequencies are suppressed. Those to the right of it have been “high-pass filtered,” which means that vibrations at low frequencies have been suppressed. Red color means a stronger amplitude of the vibrations at that frequency and blue means weaker vibrations. This amplitude corresponds to the loudness of that frequency (on the y-axis) at a given time (on the x-axis). The combination of all these vibrations at any given point in time corresponds to the sound you hear.
Interestingly—and you can confirm this for yourself—most people seem to hear the low-pass filtered versions of the recording (those on the left) as a hard and deep “Laurel” whereas they hear the high-pass filtered versions of the recording (those on the right) as a wispier “Yanny.” The farther from the center, the more extreme the filtering that was applied.
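To get a feel for what this kind of filtering does, here is a small Python sketch. It uses a very simple first-order filter, which is my assumption for illustration and not necessarily the kind of filter used to produce the panels described here. The input is a synthetic mix of a 200-hertz tone and a 3,000-hertz tone rather than the actual recording, and the function names and the 800-hertz cutoff are made up for the example.

```python
import cmath
import math


def low_pass(samples, cutoff_hz, sample_rate):
    """First-order low-pass: attenuates components above cutoff_hz."""
    dt = 1.0 / sample_rate
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)  # smoothing factor between 0 and 1
    out, prev = [], 0.0
    for x in samples:
        prev = prev + alpha * (x - prev)  # move part of the way toward the input
        out.append(prev)
    return out


def high_pass(samples, cutoff_hz, sample_rate):
    """Complementary high-pass: whatever the low-pass filter removed."""
    low = low_pass(samples, cutoff_hz, sample_rate)
    return [x - l for x, l in zip(samples, low)]


def tone_amplitude(samples, freq_hz, sample_rate):
    """Estimate the amplitude of one frequency component by correlation."""
    acc = sum(x * cmath.exp(-2j * math.pi * freq_hz * n / sample_rate)
              for n, x in enumerate(samples))
    return 2.0 * abs(acc) / len(samples)


# Synthetic mix (not the real recording): a low tone plus a high tone.
SAMPLE_RATE = 16000
LOW_HZ, HIGH_HZ, CUTOFF_HZ = 200, 3000, 800
mix = [math.sin(2 * math.pi * LOW_HZ * n / SAMPLE_RATE)
       + math.sin(2 * math.pi * HIGH_HZ * n / SAMPLE_RATE)
       for n in range(1600)]

lows_only = low_pass(mix, CUTOFF_HZ, SAMPLE_RATE)
highs_only = high_pass(mix, CUTOFF_HZ, SAMPLE_RATE)

for name, sig in [("original", mix), ("low-pass", lows_only), ("high-pass", highs_only)]:
    print(name,
          round(tone_amplitude(sig, LOW_HZ, SAMPLE_RATE), 2),
          round(tone_amplitude(sig, HIGH_HZ, SAMPLE_RATE), 2))
```

In the original mix both tones are equally strong. After low-pass filtering the low tone dominates, and after high-pass filtering the high tone dominates. A filter this gentle only tilts the balance. Sharper filters suppress the unwanted frequencies more completely.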
So the “Yanny” sound is contained in the high frequencies, and the “Laurel,” in the low ones. This mirrors a similar situation in vision, where the elusive smile of the Mona Lisa seems to be contained entirely in the low spatial frequencies, which are best perceived in peripheral vision. That’s why the smile tends to disappear when you look directly at it.
What remains to be explained is why some people seem to focus on higher frequencies whereas others expect lower ones. If the dress is any guide, prior life experience will play a major role in setting these differential expectations.
Some people have theorized that age can account for which word you’re more likely to hear first. The range of human hearing comprises frequencies from 20 to 20,000 hertz, or three orders of magnitude. An important caveat is that this is the hearing range of young people. As people age, their hearing ranges shrink, and they do so starting at the higher end of the range, falling off dramatically with increasingly advanced age. Thus, it is quite possible that—particularly for people at the opposite ends of the age range—this effect dominates the percept. Old people might be physically unable to perceive “Yanny” because their auditory system attenuates all high frequencies, and the low frequencies that carry the “Laurel” part of the recording (the panels on the left) are all that they are left with. Conversely, the experience of young people might be dominated by high frequencies, which would lead them to predominantly perceive it as “Yanny.”
This is a real possibility, but there are three reasons why I believe that this account is too simplistic:
1. This is extremely anecdotal, but my 5-year-old son, Karl, said he heard “Laurel,” with supreme confidence.
2. The experience has switched for individuals, including me, in the course of a single day. We age, but we don’t age that fast. Plus, it went in the wrong direction for me. I heard “Laurel” first, then “Yanny.” And just now, I’m hearing “Laurel” again.
3. Even the high frequency ranges of the stimulus are not all that high. The recording seems to have been done with relatively low quality, and as far as I can tell, it cuts off at 6,500 hertz. Most people should have no problem hearing that well into the high middle ages. Indeed, my own cutoff seems to still sit at a happy 15,000 hertz. (You can find your cutoff here.) But if I’m still able to hear frequencies much higher than those that carry “Yanny,” why is my experience dominated by “Laurel”?
All of these points pose challenges to a pure age account of the phenomenon. There seems to be more going on than just that.
It is important to note that we are mostly dealing with hypotheses at this point. It is unreasonable to expect science to be able to offer a comprehensive explanation for a new phenomenon within hours (or days) of it surfacing. This again echoes the situation with the dress in 2015. Then as now, there is no shortage of ideas and hypotheses (educated guesses) as to what might be going on. But it takes time to establish which are more or less likely to account for the phenomena (and by how much). Science is a process. Knowledge creation (“Wissenschaft”) takes time. It sometimes feels like magic, but it’s not magical. (The good news is that science can actually narrow the space of possible hypotheses down over time. This sharply contrasts with other fields—in history, for example, the number of viable hypotheses as to what brought down the Roman empire has only ever grown over time, because there is no way to resolve or adjudicate any of them, and yet we are still adding new ones.)
So what have we learned so far? The dress answered the age-old question of whether you are seeing the same color red as me with an astonishing “not necessarily.” The Laurel/Yanny phenomenon vividly illustrates the solution to another philosophical conundrum: If a tree falls in the forest, it will create a pressure wave in the air. But the sound that is produced by this depends entirely on who is listening, and specifically, the status of their auditory system, which in turn is dependent on the rest of the brain, with all of its expectations. Put differently, the same tree falling might sound different to two observers who are both present and witnessing the same event.
This brings us to the last take-home message: We live in a more idiosyncratic subjective world (one that allows for radically different interpretations) than we commonly realize. The dress suggested that this is true for color vision, and Laurel/Yanny suggests that this is more generally true.
The worrisome part about all of this is subjective overconfidence. If you are like most people, you were 100 percent sure that the recording said one thing. Even considering the other option seemed absurd … until you suddenly weren’t any more. And then it sounded or looked like the other option, without any ambiguity. In other words, the brain doesn’t allow much room for epistemic humility. This is a feature because it needs to maintain action potential—eternal doubt inhibits action, which most organisms, locked in an evolutionary death struggle, can ill afford. The purpose of other people, then, is to check this subjective overconfidence of the individual—to provide different interpretations of the same evidence as a means of reality check. This is a great strength of society as a social group—it can keep individual brains modest. The great struggle of modern society, of course, is that modesty is difficult, and in the age of social media particularly, this cognitive diversity has been weaponized to the point of extreme tribalism, which can be dangerous.
Regardless of these considerations, we are obviously at the very beginning when it comes to a full understanding of the Laurel/Yanny phenomenon, and that is OK. To really nail down what might separate the Laurels from the Yannys, we first need to get the data. If you want to help us with that and have a couple of minutes to spare, click here.
Correction, May 21, 2018: The image showing the range of spectrograms originally included the wrong spectrogram for moderate Laurel. It has been updated.