The Complicated Decisions That Come With Digitizing Indigenous Languages

Technology is helping preserve endangered languages. But this development comes with challenges.

[Image: A blinking cursor on a screen that then shows an indigenous family, one member at a time. Credit: Franco Zacharzewski]

Imagine: Hundreds of years of environmental, social, and political changes have left the English language with only a handful of speakers. Linguists and community members are trying to rebuild, but nearly the entirety of the written and audio record is gone. All that remains are 8,000 tweets from the year 2019, audio recordings of African American Vernacular and Irish English, a 1923 edition of the Oxford English Dictionary, and nine Shakespeare plays.

How do you measure what’s missing? How do you bring English back?

This, of course, is the reality for scores of Native North American languages. When Europeans first made contact with tribes across the continent, more than 2,000 languages were being spoken. Today, after centuries of forced relocations, broken treaties, abusive residential schools, and other discriminatory practices, only 256 languages are spoken. A full 199 are endangered, according to the Catalogue of Endangered Languages. Yet even after everything those communities endured, they’re fighting for their words—and the ability to protect them. New technology like smartphone keyboards, language-learning apps, and digital databases makes revitalization work easier than ever, but it also requires hard conversations about which parts of a language must be kept offline.

At a recent conference called Breath of Life 2.0, held at Miami University in Ohio, participants explored the possibilities and pitfalls of archival databases. For the people involved in language revitalization work, whether they’re awakening a dormant language (one that no longer has fluent speakers) or trying to prevent an existing language from losing all its speakers, one of the major obstacles is a lack of digital resources.

Jerome Viles experienced that firsthand. An enrolled member of the Confederated Tribes of Siletz Indians and part of the Southwest Oregon Dene Languages Project team, Viles helps organize archival materials produced by linguists and community members over the past 100 years. First, his community had to find the documents, which were scattered in institutions across the country. But the other problem was organizing and reconciling them. For instance, some records of language may have been written by French Jesuits who came to North America to convert Native American people. But even if three Jesuits worked with the same tribe, they could all have different spelling systems. Then there are audio recordings from the 1800s and 1900s, plus ethnographic material collected by anthropologists.

What Viles’ group and many others needed was a way to collect and compare thousands of words across multiple generations and several dialects. And that’s where the Indigenous Language Digital Archive came in. ILDA is a database built by engineers at Miami University in collaboration with the Miami Tribe of Oklahoma (which built an earlier form of the database specifically for its own language revitalization work). At the Breath of Life 2.0 conference, participants learned how to navigate the software and move their documents into one big digital space. By the end of the five days, a few surprising discoveries had emerged. Community members from the Northern Paiute, working on the Numu language, were tickled to find seven different ways to say “our husband.”

“I was getting a kick out of that,” said Nicholas Cortez. “I was like, where’s this ‘our’ coming from?”

Each group gained unique insights from using the database, but all were united in one concern: protecting language documents from people outside the community. As documents are entered into ILDA, a dictionary is created from that information. But what if the files contained sensitive information? For some groups, it was words and stories that should only appear at certain times of the year. For others, it was medical information that could be misused if the reader didn’t fully understand it.

Mark Pearson, who works on the technology side of the Osage Language Department, says even putting up videos of Osage on YouTube was debated, because not all knowledge of the language is meant for the general public. (Understandably, he did not want to give me examples of this.) At a recent gathering of First Nations in Canada, Māori language specialist Te Taka Keegan noted, “We can be colonized through data. We need to be aware of that, and we need to take steps to make sure we’re not.” Keegan explained that Google Translate has continued to change how it interprets Māori phrases over the past 10 years, and he’s not sure those changes are always for the better, since the system automatically collects data instead of working with the community.

The ILDA engineering team is still working on security features that would allow a user to make some information private for a specific length of time or make it visible only to certain people. Daryl Baldwin, one of the project directors for the Breath of Life conference and a member of the Miami Tribe, and his team have grappled with how to handle documents that contain stories. They held onto a collection for 20 years before finally releasing it to the public, because they wanted to be sure community members were in a place to understand its contents as “not bedtime stories.” Today, many of the stories are only told at certain times of the year to respect the tribe’s storytelling tradition.
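The two kinds of restriction described above, keeping a record private for a set length of time and limiting it to certain people, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the design of any real archive: the `Record` class, its `embargo_until` and `allowed_viewers` fields, and the `can_view` function are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional, Set


@dataclass
class Record:
    """A hypothetical archive entry with two optional restrictions."""
    title: str
    # If set, the record stays hidden until this date.
    embargo_until: Optional[date] = None
    # If non-empty, only these viewers may see the record.
    allowed_viewers: Set[str] = field(default_factory=set)


def can_view(record: Record, viewer: str, today: date) -> bool:
    """Return True only if neither restriction blocks this viewer today."""
    if record.embargo_until is not None and today < record.embargo_until:
        return False  # still inside the private period
    if record.allowed_viewers and viewer not in record.allowed_viewers:
        return False  # viewer is not on the approved list
    return True


# Placeholder records; the titles and viewer names are made up.
embargoed = Record("story collection", embargo_until=date(2030, 1, 1))
restricted = Record("sensitive notes", allowed_viewers={"community_member"})

print(can_view(embargoed, "guest", date(2025, 6, 1)))            # inside the embargo
print(can_view(embargoed, "guest", date(2030, 1, 1)))            # embargo has ended
print(can_view(restricted, "guest", date(2025, 6, 1)))           # not on the list
print(can_view(restricted, "community_member", date(2025, 6, 1)))
```

In this sketch the two rules are independent, so a community could apply either one or both to a given record. A seasonal rule, like the storytelling tradition mentioned above, would need a further check that this example does not attempt.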

Despite these complicated decisions, those involved with ILDA and other projects believe it is important to make information available to community members hoping to learn their languages. The Osage Nation, today located in Oklahoma, includes about 20,000 members, some of whom live far from the tribe headquarters. With no more fluent speakers left, figuring out how to share materials with people outside Oklahoma was once a major challenge. They’d tried video conferences but were looking to expand.

Pearson suggested building language apps. The tricky part was getting the Osage script recognized. Like the Cherokee, who developed a syllabary for their language in the early 1800s, the Osages use their own alphabet. Pearson says that alphabet was a necessary tool for learning Osage, since the language has sounds with no counterpart in English.

But getting that alphabet into software was its own challenge. Unicode is the standard for representing text on computers. When a script isn’t encoded in Unicode, or a device has no font that covers it, the characters appear as empty white boxes (known as tofu). For a while, the Osages could only use images of their script in software, a rough workaround that didn’t give them much access to digital tools. But then Craig Cornelius, who works in the international engineering department at Google, learned about Osage through his work with the Cherokee. Over the past few years, he’s helped the Osages get their script accepted by the Unicode Consortium and build keyboards featuring it.
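To make the encoding point concrete, here is a minimal Python sketch. It assumes an interpreter whose bundled Unicode database is version 9.0 or later, the release that added the Osage block (code points U+104B0 through U+104FF). The helper name `is_osage` is hypothetical and exists only for this illustration.

```python
import unicodedata

# Osage block in Unicode: U+104B0..U+104FF (added in Unicode 9.0).
OSAGE_START, OSAGE_END = 0x104B0, 0x104FF


def is_osage(ch: str) -> bool:
    """Return True if a single character falls inside the Osage block."""
    return OSAGE_START <= ord(ch) <= OSAGE_END


first_letter = chr(0x104B0)  # first code point in the block

# Look up the character's official name. The fallback string is returned
# if the local Unicode database predates the Osage block.
name = unicodedata.name(first_letter, "<not in this Unicode database>")

print(f"U+{ord(first_letter):04X}", name)
print(is_osage(first_letter), is_osage("A"))
```

Once a script has code points like these, Osage text can be stored, searched, and typed like any other text instead of being passed around as images. Whether the letters display properly or as tofu still depends on the device having a font that covers the block.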

“In many cases, Google being involved has been a catalyst,” Cornelius says. He adds that other tech companies like Microsoft see the work being done and decide to hop on board with their own software, as with the Microsoft Office Suite in Cherokee. Today, the Osages have two language apps available to community members anywhere in the world, and they can use the Osage keyboard on their phones, as long as the phone is a recent enough model.

In Cornelius’ work at least, Google lets communities take the lead when deciding what to put into the world. “Many Native American groups have been victimized by the larger society and are understandably wary,” he says.

Those questions of privacy and security will vary from one community to the next and across the different databases and apps they use. It might mean holding back on releasing some pieces of the language. But even having these kinds of conversations is proof of how much has changed in the past decade.

“The world that our community’s language was used in has been under attack for 160 years, and I’d like to see that world rebuilt,” Viles says. “Our languages are for us to speak.”

Future Tense is a partnership of Slate, New America, and Arizona State University that examines emerging technologies, public policy, and society.