Future Tense

The Communities That Live Captioning Leaves Behind

If 13-year-olds in a spelling bee can learn “chaebol” and “marmennill,” transcription software can, too.

A woman sits at a table in a park, looking with consternation at her iPad screen.
Real accessibility needs to be really inclusive. Stefan Vladimirov on Unsplash

During our synagogue’s Zoom services last week, my family and I found ourselves giggling when we should have been serious. Auto-captions were turned on, and they kept botching the rabbi’s Hebrew-laced English.

Mourner’s Kaddish (memorial prayer) was transcribed as mourner Scottish, and refua shlema (wish for a “full recovery”) became with flu wash Emma. Some of the transcriptions bordered on offensive, like when Torah became terrorism and yasher koach (great job!) became wish a cough.


We weren’t relying on the captions and could laugh at these mistakes. But I kept thinking of the many people—especially those with hearing impairments—who can’t participate without auto-captions and were probably confused, at times totally lost. This technology is a real boon and certainly a step toward a more inclusive internet, but it’s a shame that it works well only for those who speak “standard” English. As immigrant, Indigenous, and religious groups conduct their activities online, millions of people are affected by the software’s shortcomings. This is clearly an issue of equity and inclusion, and tech companies like Facebook, Google, and Zoom must address it.


As a linguist, I study how Americans infuse elements of their ancestral and sacred languages into English. There are two types of language mixing: the use of elements from two languages by bilingual speakers; and the injection of loanwords, words from another language used even by people who cannot speak that language. I tested both of these with multiple captioning services, including Zoom, YouTube, Facebook, and Google Chrome’s newly released live-captioning tool. I found that the captioning problems are not limited to my rabbi’s Hebrew words, but occur with many instances of language mixing, including Spanglish entertainment and informational videos with loanwords from Arabic, Punjabi, Vietnamese, and Amharic.


In some cases, bilingual speech is relatively compartmentalized, a phenomenon known as code-switching. Code-switching isn’t always between languages—a teenager might code-switch between talking to her teacher and talking to her peers, or a Black person might code-switch between conversations with white co-workers and with Black friends. But code-switching does often involve multiple languages, such as when a Korean-American woman speaks Korean to her parents and then English to her brother, or in Islamic religious services, when Arabic prayers are followed by an English sermon. Transcription software cannot yet capture code-switching, but it could—if the software knew both languages and recognized which sections were which.
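
To make the “recognized which sections were which” step concrete: one naive approach is to tag each stretch of a transcript with its likely language before handing it to the right recognizer. The sketch below is a toy illustration of that idea, not any captioning vendor’s actual pipeline; the tiny function-word lists are my own invented stand-ins for a real language-identification model.

```python
# Toy language tagger for compartmentalized code-switching: guess whether a
# sentence is English or Spanish by counting hits against small
# function-word lists. A real system would use a trained classifier.
ENGLISH_HINTS = {"the", "and", "is", "we", "to", "of"}
SPANISH_HINTS = {"el", "la", "y", "es", "que", "de"}

def tag_language(sentence: str) -> str:
    """Return 'en', 'es', or 'unknown' for one sentence of a transcript."""
    words = sentence.lower().split()
    en_hits = sum(w in ENGLISH_HINTS for w in words)
    es_hits = sum(w in SPANISH_HINTS for w in words)
    if en_hits == es_hits:
        return "unknown"
    return "en" if en_hits > es_hits else "es"
```

A captioning pipeline could run a tagger like this over each segment and route English stretches to an English recognizer and Spanish stretches to a Spanish one—which is exactly why this strategy breaks down for the word-by-word blending described next.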


However, most bilingual speech is not separated so clearly. Software would have a harder time parsing a Spanish-English sentence like “estamos typeando el paper” (we are typing the paper), where one language blends into another from word to word and even syllable to syllable. This is translanguaging—speakers using a single linguistic system that includes elements of what outsiders consider separate languages. Translanguaging is ubiquitous in everyday conversation in immigrant communities and in plenty of recordings that need captions, such as a comedy routine about Spanglish and the new Spanish-English audio production Romeo y Julieta. Transcription software might be years away from capturing these events well.


In contrast, software should have a much easier time representing loanwords. The Hebrew words my rabbi used are loanwords—many Jews use them frequently within English even if they cannot understand biblical texts or converse in Hebrew. So are Islamic words like haram (sinful) and Jumu‘ah (Friday prayer) and food words from many cultures like banh mi (Vietnamese sandwich) and injera (Ethiopian flatbread).

When I tested cultural and religious videos with these words, human-edited captions recognized them, but auto-captions did not. For example, in an informational video about Sikh worship, Chrome botched most of the Punjabi words: Sangat (congregation) became summit, and Siri Guru Granth Sahib (holy scripture) became city good at ground side.


I’ve been contacting companies that offer auto-transcription, pointing out this problem and offering a list of loanwords commonly used by Jews (the community I’ve researched most). A few companies replied that they can’t help because they only support English. My response is that words like Torah, sangat, and injera have become part of English. Although some are used primarily in specific communities, they are still featured in English dictionaries and even spelling bees. If 13-year-olds can be trained to spell chaebol and marmennill, so can transcription software.


Other companies—and some of my colleagues who are experts in natural language processing—said I could address this problem by editing transcripts, as the machine learning improves through corrections. I started to assemble volunteers and arrange for access to synagogue YouTube accounts. But then I realized that communities should not shoulder this burden alone.

Software developers should enable individuals to submit loanwords not just within their own accounts but to the company’s general dictionary. Then, in consultation with community members, companies should use this data to train their transcription tools to accurately represent diversity within English. If companies are concerned that adding words will interfere with the software’s overall accuracy, they can enable add-on dictionaries tagged for particular religious and ethnic groups. Ideally tech companies will also tackle code-switching and translanguaging, but loanwords are a logical first step.
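
One simple way such an add-on dictionary could work is as a post-processing pass: after the recognizer produces a caption, fuzzy-match each word against a community-submitted loanword list and restore the canonical spelling. The sketch below is a minimal illustration of that idea, not any company’s actual API; the dictionary contents and the `correct_caption` helper are hypothetical.

```python
import difflib

# Hypothetical community-submitted add-on dictionary of loanwords.
LOANWORD_DICTIONARY = ["Torah", "Kaddish", "sangat", "injera", "banh", "haram"]

def correct_caption(caption: str, dictionary: list[str],
                    cutoff: float = 0.75) -> str:
    """Replace caption words that closely resemble a dictionary entry
    with the entry's canonical spelling; leave everything else alone."""
    lowered = [w.lower() for w in dictionary]
    corrected = []
    for word in caption.split():
        matches = difflib.get_close_matches(word.lower(), lowered,
                                            n=1, cutoff=cutoff)
        if matches:
            # Restore the dictionary's canonical casing (e.g. "Torah").
            canonical = next(w for w in dictionary
                             if w.lower() == matches[0])
            corrected.append(canonical)
        else:
            corrected.append(word)
    return " ".join(corrected)
```

A real deployment would bias the recognizer itself rather than patch its output, and would need community review of the similarity threshold—“kadish” should become “Kaddish,” but ordinary English words must pass through untouched.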

The exclusion of language mixing from auto-transcription is part of a broader issue of equity. Research on voice recognition is often conducted among white middle-class Americans. This leaves many—such as speakers of African American English, Hawaiian Pidgin, and Spanglish—struggling to communicate with Siri, Alexa, and service bots. Academic research has documented this bias, as have opinion pieces and comedy.

New technologies make life easier in so many ways, and we’re all grateful for them. But they should do so equitably—regardless of ethnicity, religion, or disability.

Future Tense is a partnership of Slate, New America, and Arizona State University that examines emerging technologies, public policy, and society.
