Al Jazeera television aired an audio tape this week it claimed was recorded by Osama Bin Laden. Administration sources told MSNBC Wednesday that they believe the tape is authentic. How do you authenticate a sound recording?
The feds have audio recordings analyzed by both human experts and machines. Human analysts are very good at doing the kind of thing most people do subconsciously—telling if someone comes from a particular region by recognizing basic vowel and consonant qualities. For example, a human analyst can tell whether the “Ye” sound in “Yemen” is of the right length and stress for Bin Laden’s dialect. The expert would listen to previous recordings of Bin Laden’s voice and painstakingly compare words—syllable by syllable—to those on the current tape. The feds might also bring in a linguist to verify whether the words on the tape generally match those uttered by someone of Bin Laden’s age and educational background.
For a machine analysis, the feds use voice-authentication software, which measures the acoustic qualities of the voice (pitch, loudness, basic resonances) that a human expert can't estimate. This kind of analysis can produce basic spectrographic information (indicating overall intonation and loudness), or it can look for specific features of the voice, such as whether Bin Laden's voice is a bit on the nasal side. Voice-authentication software is also excellent for cleaning up bad recordings; the latest tape is allegedly very noisy and possibly went down a phone line at some point. Such a system can also tell if different samples of the voice were recorded on different microphones and in different locations.
Once the recording is cleaner, the software can deconstruct each individual sound. Every person creates the same sounds using a slightly different set of basic pitches. So, the set of frequencies in Bin Laden’s vowels, like the “ea” in “fear,” will be marginally different from anyone else’s. By examining this frequency detail for every vowel and comparing it with earlier samples of his voice, a machine analysis can tell whether the vowels match and were all made by him. In cases where two examples of a word, like “bombing” and “bombing,” sound exactly the same to a human expert, a machine can sometimes pick out frequency differences that indicate the words were spoken by two different people.
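For readers curious what such a frequency comparison might look like, here is a toy sketch in Python. It is not the software the analysts use. The "vowels" are synthetic tones, the frequencies are invented for illustration and are not measurements of anyone's voice, and the function names are hypothetical. It only shows the basic idea of pulling the strongest frequencies out of two samples and measuring how far apart they are.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second (an assumption, roughly phone quality)

def synth_vowel(freqs, num_samples=400):
    """Build a toy 'vowel' as a sum of pure tones at the given frequencies (Hz)."""
    return [sum(math.sin(2 * math.pi * f * t / SAMPLE_RATE) for f in freqs)
            for t in range(num_samples)]

def magnitude_spectrum(samples):
    """Naive discrete Fourier transform: list of (frequency_hz, magnitude)."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2):
        coeff = sum(s * cmath.exp(-2j * math.pi * k * t / n)
                    for t, s in enumerate(samples))
        spectrum.append((k * SAMPLE_RATE / n, abs(coeff)))
    return spectrum

def strongest_frequencies(samples, count=3):
    """Return the `count` frequencies with the largest magnitude, low to high."""
    top = sorted(magnitude_spectrum(samples), key=lambda p: p[1], reverse=True)
    return sorted(freq for freq, _ in top[:count])

def max_frequency_gap(freqs_a, freqs_b):
    """Largest difference (Hz) between corresponding strongest frequencies."""
    return max(abs(a - b) for a, b in zip(freqs_a, freqs_b))

# Invented example values: an earlier sample and a new one to compare.
earlier = strongest_frequencies(synth_vowel([300, 2300, 3000]))
new_tape = strongest_frequencies(synth_vowel([340, 2200, 3000]))
print(earlier, new_tape, max_frequency_gap(earlier, new_tape))
```

A real system would work on cleaned-up speech instead of pure tones and would weigh many vowels statistically before reaching any conclusion; a single gap like this one proves nothing on its own.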
What if analysts are pretty sure the voice on a tape is Bin Laden’s, but want to make sure it hasn’t been spliced together from Osama’s Greatest Hits? In that case, man and machine would look for tell-tale signs of fraud. The first red flag is any hitch in Bin Laden’s timing. It’s almost impossible to fake a speaker’s rhythm, to make sure every syllable in an utterance matches the overall length and structure of that utterance. So, if the word “Kuwait” were inserted from a previous recording by Bin Laden, it would jar the basic rhythm of the rest of his speech.
Another sign of fakery is background noise. It’s quite difficult to remove the original sound context from a voice recording. And even if you could, you’d still have to deal with the fact that speakers unconsciously pitch their voice to accommodate background noise. A giveaway sign might show up in the basic frequencies of one of Bin Laden’s “kills” versus another of his “kills.” If these pitches were different enough, this would be cause for suspicion.
Together, human and machine can provide formidable testimony in court, but neither type of analysis can say with 100 percent certainty that the speaker on the tape is Bin Laden or anyone else.
Explainer thanks Dr. Francis Nolan of the Linguistics Department, Cambridge University, and Judith Markowitz of J Markowitz Consultants and Speech Technology Magazine in Chicago.