Sanjib Chaudhary chanced upon StoryWeaver, a multilingual children’s storytelling platform, while searching for books he could read to his 7-year-old daughter. Chaudhary’s mother tongue is Kochila Tharu, a language with about 250,000 speakers in eastern Nepal. (Nepali, Nepal’s official language, has 16 million speakers.) Languages with a relatively small number of speakers, like Kochila Tharu, do not have enough digitized material for linguistic communities to thrive—no Google Translate, no film or television subtitles, no online newspapers. In industry parlance, these languages are “underserved” and “underresourced.”
This is where StoryWeaver comes in. Founded by the Indian education nonprofit Pratham Books, StoryWeaver currently hosts more than 50,000 open-licensed stories across reading levels in more than 300 languages from around the world. Users can explore the repository by reading level, language, and theme. Once they select a story, they can click through illustrated slides, each presented like the page of a book, in the selected language. There are also bilingual options, where two languages are shown side by side, as well as download and read-along audio options. “Smile Please,” a short tale about a fawn’s ramblings in the forest, is currently the “most read” story. Originally written in Hindi for beginners, it has since been translated into 147 languages and read 281,000 times.
A majority of the languages represented on the platform are from Africa and Asia, and many are Indigenous, in danger of losing speakers in a world of almost complete English hegemony. Chaudhary’s experience as a parent reflects this tension. “The problem with children is that they prefer to read storybooks in English rather than in their own language because English is much, much easier. With Kochila Tharu, the spelling is difficult, the words are difficult, and you know, they’re exposed to English all the time, in schools, on television,” Chaudhary said.
Artificial intelligence-assisted translation tools like StoryWeaver can bring more languages into conversation with one another—but the tech is still new, and it depends on data that only speakers of underserved languages can provide. This raises concerns about how the labor of the native speakers powering A.I. tools will be valued and how repositories of linguistic data will be commercialized.
To understand how A.I.-assisted translation tools like StoryWeaver work, it’s helpful to look at neighboring India: With 22 official languages and more than 780 spoken languages, it is no accident that the country is a hub of innovation for multilingual tech. StoryWeaver’s inner core is inspired by a natural language processing tool developed at Microsoft Research India called interactive neural machine translation prediction technology, or INMT.
Unlike most A.I.-powered commercial translation tools, INMT doesn’t do away with a human intermediary altogether. Instead, it assists humans with hints in the language they’re translating into. For example, if you begin typing, “It is raining” in the target language, the model working on the back-end supplies “tonight,” “heavily,” and “cats and dogs” as options for completing your sentence, based on the context and the previous word or set of words. During translation, the tool accounts for meaning in the original language and what the target language allows, and then generates possibilities for the translator to choose from, said Kalika Bali, principal researcher at Microsoft and one of INMT’s main architects.
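The interaction Bali describes can be sketched in a few lines of code. This is a toy illustration, not INMT itself: a real system generates candidates with a neural translation model, while here the candidate sentences are hard-coded, hypothetical stand-ins for model output. The function name `suggest_next` and the sample sentences are assumptions made for the example.

```python
from collections import Counter


def suggest_next(prefix, candidates, k=3):
    """Return up to k next-word hints that are consistent with the typed prefix.

    `candidates` stands in for full-sentence translations a model might
    propose; a hint is the word that follows the prefix in any candidate
    that begins with exactly the words typed so far.
    """
    typed = prefix.split()
    counts = Counter()
    for sentence in candidates:
        words = sentence.split()
        # Keep only candidates that agree with what the translator typed.
        if words[:len(typed)] == typed and len(words) > len(typed):
            counts[words[len(typed)]] += 1
    # Most frequently proposed continuations come first.
    return [word for word, _ in counts.most_common(k)]


# Hypothetical candidate translations (hard-coded for illustration).
candidates = [
    "It is raining tonight",
    "It is raining heavily",
    "It is raining heavily",
    "It is snowing",
]

hints = suggest_next("It is raining", candidates)
print(hints)
```

The translator stays in control throughout: the tool only ranks continuations that match what has already been typed, and the human picks one or keeps typing.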
Tools like INMT allow StoryWeaver’s cadre of volunteers to generate translations of existing stories quickly. The user interface is easy to master even for amateur translators, many of whom, like Chaudhary, are either volunteering their time or already working for nonprofits in early childhood education. The latter is the case for Churki Hansda. Working in Kora and Santali, two underserved Indigenous languages spoken in eastern India, she is an employee at Suchana Uttor Chandipur Community Society, one of StoryWeaver’s many partner organizations scattered all over the world. “We didn’t really have storybooks growing up. Our school textbooks were in Bengali [the dominant regional language], and we would end up memorizing everything because we didn’t understand what we were reading,” Hansda told me. “It’s a good feeling to be able to create books in our languages for our children.”
Amna Singh, Pratham Books’ content and partnerships manager, estimates that 58 percent of the languages represented on StoryWeaver are underserved, a status quo that has cascading consequences for early childhood learning outcomes. But attempts to undo the neglect of underserved language communities are also closely linked with unlocking their potential as consumers, and A.I.-powered translation technology is a big part of this shift. Voice recognition tools and chat bots in regional Indian languages aim to woo customers outside metropolitan cities, a market that is expected to expand as cellular data usage becomes even cheaper.
These tools are only as good as their training data, and sourcing is a major challenge. For sustained multilingualism on the internet, machine translation models require large volumes of training data generated in two languages parallel to one another. Parliamentary proceedings and media publications are common sources of publicly available data that can be scraped for training purposes. However, both these sources—according to Microsoft’s researcher Bali—are too specific, and do not encompass a wide enough range in terms of topics and vocabulary to be properly representative of human speech. (This is why StoryWeaver isn’t a good source for training data, either, because sentences in children’s books are fairly simple and the reading corpus only goes up to fourth-grade reading levels.)
Technical requirements aside, data work is also often invisible and poorly compensated, and it takes place in unregulated environments. There’s increasing concern over what we owe the behind-the-scenes human workers compiling data sets to train A.I. systems.
Known as crowdworkers, these people perform rote, piecemeal tasks that range from labeling images of trees and pedestrians for self-driving cars to spotting signs of disease in medical scans. This type of monotonous “ghost work” takes on an emotional dimension in the context of language preservation. Language data workers contributing to machine translation models are so motivated by the prospect of linguistic dignity on the internet that fair compensation and data stewardship issues get jettisoned in favor of discussions that foreground why this work is important from a cultural perspective.
The cultural value, after all, is enormous: Sanjib Chaudhary’s daughter understands more Kochila Tharu than she did even a few years ago, and Chaudhary’s involvement with StoryWeaver has since grown. Over the past year and a half, he and two friends worked on generating Nepali equivalents for a total of 40,000 English words. But they were paid only $243 for the project, or less than 1 cent per English word, divided three ways. According to Microsoft’s Bali, models need 100,000 paired sentences to start generating acceptable translations.
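The pay figures above can be checked with simple arithmetic. The short calculation below uses only the numbers reported here ($243, 40,000 words, three translators) and assumes, for illustration, that the payment was split evenly; the variable names are chosen for the example.

```python
total_usd = 243        # total payment for the project, as reported
words = 40_000         # English words translated
translators = 3        # Chaudhary and two friends

# Rate per word, expressed in cents.
cents_per_word = total_usd / words * 100

# Each person's share, assuming an even three-way split.
usd_per_person = total_usd / translators

print(f"{cents_per_word:.2f} cents per word")
print(f"${usd_per_person:.2f} per translator")
```

That works out to roughly six-tenths of a cent per word, or $81 per person for a year and a half of work.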
Despite the repetitiveness and poorly compensated nature of the work, Chaudhary sees himself not as a crowdworker but a language steward. “We have many homophonic words in Kochila Tharu which aren’t there in English. Take the names of different fish … we have so many words for fish, fishing equipment, and fish preparations that you wouldn’t find in other languages,” he said. “If our language dies, we will lose them. I want to collect these words before they disappear.”
The hope for a future when marginal linguistic identities can thrive online is a powerful incentive for stewards like Chaudhary and Hansda. Hansda’s stint with StoryWeaver led to a paid opportunity at AI4Bharat (or A.I. for India), an initiative at the Indian Institute of Technology in Chennai that collects data in labeled pairs for English and 12 Indian languages. The 100,000 sentences Hansda will add to the AI4Bharat dataset for Santali over 18 months span Indigenous oral histories, folktales, literature, sentences, and words. Hansda is paid $1.66 per hour for this work as a “language expert.”
To be truly innovative, and accountable, A.I.-assisted language research must ensure native speakers and their communities aren’t merely contributing data, but also helping to determine what this data will be used for. For now, AI4Bharat seeks to “bring parity with respect to English in A.I. technologies for Indian languages with open-source contributions.” That assumes openness will automatically lead to inclusion. But in practice, there are no clear guidelines preventing companies developing A.I. technologies from using datasets collected, and models trained, by noncommercial research entities like universities or nonprofits. AI4Bharat, for example, categorizes its crowd-sourced datasets as open-source, meaning Hansda’s contributions could be commercialized for profit in the future. There’s precedent for that: Announced last fall, Meta’s not-yet-public Make-a-Video A.I. tool was trained on datasets compiled from publicly available video clips on YouTube and Shutterstock. Calling the practice “A.I. data laundering,” technologist Andy Baio wrote that “Outsourcing the heavy lifting of data collection and model training to non-commercial entities allows corporations to avoid accountability and potential legal liability.”
For now, the push toward linguistic inclusion—whether motivated by commercial profit, social impact, technological innovation, or a mix of all three—is exciting for speakers of minority languages. Hansda hopes for a day when her grandchildren can live their online lives in Santali. “They’ll say, ‘Our grandmother did this,’ ” she said.