Welsh Wikipedia Gives Me Hope

Welsh and other smaller language movements on Wikimedia projects suggest there may be ways to train technology to allow for cultural differences.

Welcome to Source Notes, a Future Tense column about the internet’s knowledge ecosystem.

If you say, “Alexa, faint o’r gloch yw hi?” the smart speaker will not understand that you are asking for the time of day. That’s because Welsh is not one of the eight languages currently supported by Amazon’s Alexa-enabled devices. Gareth Morlais, a Welsh language and digital media specialist for the Welsh government, has argued for years that this language gap is disturbing. In a 2017 presentation, Morlais noted that the Welsh language, then ranked 172nd in the world by number of speakers, was not supported by Alexa, Twitter, or Google’s search interface. At the time, Alexa only spoke and understood two languages: English and German. “The technology actually tells you which language your family can speak at home, which is a horror story,” Morlais said. “What we need to do here is try to shape the technology so that it speaks the same language that we want to speak.”

Although Alexa still does not speak or understand Welsh, the Celtic language’s presence in tech has increased dramatically within a short period. Google announced in February that it had expanded its offerings in Docs, Sheets, Slides, and Drive to include Welsh. And Google Translate—infamous since 2009 for its Scymraeg, or scummy Welsh—has, according to the BBC, recently taken a great leap forward in terms of the accuracy and quality of its Welsh translations. Morlais and others attribute this in part to the fact that there are now more than 100,000 articles on the Welsh version of Wikipedia, known as Wicipedia.

Like other language editions, Wicipedia is a separate website with its own content, not simply a translation of English Wikipedia, a distinction that matters for both users and big tech companies. Back in 2017, Morlais observed, “There appears to be an indication that there is a link between the languages with the most Wikipedia articles or pages and the languages that are supported by the digital giants.” Google Translate and other technologies use artificial neural networks to learn by example, training themselves on language data from rich internet sources like Welsh Wikipedia.

The Welsh community is not alone in using wiki-technology to promote its language. This year’s Celtic Knot conference in Cornwall, England, drew speakers of several indigenous languages that have their own Wikipedia editions. The original idea, as the name suggests, was to focus on Celtic languages, including Irish, Breton, Scottish Gaelic, Welsh, and Cornish (which was declared extinct only a decade ago), as well as Scots.* But as word got out about a Wikipedia minority language conference, others began to join, representing, for example, the Sámi language spoken in parts of Norway, Finland, Sweden, and Russia; the Berber family of languages spoken in Northern Africa; and the Basque and Catalan communities. (In his 2017 presentation, Morlais noted that Catalan was one of the few minority languages supported by Google search, an accomplishment he linked to the fact that Catalan already had more than 500,000 articles on its language edition of Wikipedia.)

At the Celtic Knot conference, these smaller language communities gathered to discuss strategies for improving the content in their specific editions of the Wikimedia projects, such as making more medical content available in their local languages. One popular session this year involved setting up Wikidata infoboxes so that these smaller language encyclopedias could source common structured data from the shared Wikidata hub. So, for example, an English Google search for a “list of largest cities” would return the English Wikipedia article on that topic, a Welsh search would surface the article “Dinasoedd mwyaf y byd,” and both language editions would pull their data from this central repository.
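For readers who want to see the idea in miniature, here is a rough sketch of how two language editions might draw on one shared record. It is not the real Wikidata API: the `SHARED_HUB` dictionary, the `item-1` identifier, the `render_infobox` and `update_population` helpers, and the population figure are all made-up placeholders for illustration.

```python
# Hypothetical in-memory stand-in for a shared structured-data hub.
# The item ID, field names, and population figure are placeholders,
# not real Wikidata identifiers or statistics.
SHARED_HUB = {
    "item-1": {
        "labels": {"en": "London", "cy": "Llundain"},
        "population": 1_000_000,  # placeholder value
    },
}

# Each language edition supplies its own wording for the field name.
FIELD_NAMES = {
    "en": {"population": "Population"},
    "cy": {"population": "Poblogaeth"},
}


def render_infobox(item_id: str, lang: str) -> str:
    """Build a one-line infobox in the given language from the shared record."""
    item = SHARED_HUB[item_id]
    title = item["labels"][lang]
    field = FIELD_NAMES[lang]["population"]
    return f"{title} | {field}: {item['population']}"


def update_population(item_id: str, value: int) -> None:
    """Change the figure once, in the hub, for every language edition."""
    SHARED_HUB[item_id]["population"] = value


if __name__ == "__main__":
    print(render_infobox("item-1", "en"))
    print(render_infobox("item-1", "cy"))
```

The design point is that the figure lives in one place. A call to `update_population` changes what both the English and the Welsh infobox display, while each edition keeps its own labels and its own surrounding article text.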

The National Library of Wales appointed Jason Evans to the position of National Wikimedian nearly two years ago in a formal announcement that recognized Wicipedia as the most popular Welsh website. (The U.S. National Archives also has a Wikipedian in residence to foster collaboration between the National Archives and the English Wikipedia community.) Since his appointment, Evans has developed collaborations and partnerships to help advance the representation of Wales and the Welsh language on Wikimedia projects. These include sharing open-access content such as the Peniarth manuscript collection of Welsh history, culture, and verse; thousands of landscape prints; and the laws of Hywel Dda, an influential Welsh king of the 10th century.

I asked Evans why he believed it was so important that Welsh have its own language version of the encyclopedia. With the updated translation technology, wasn’t it technically feasible to have a single encyclopedia, and then translate that centralized information into the reader’s specific language? Would that not be more scalable?

“If everyone just had the same generic article, it wouldn’t promote cultural diversity—you know, all the things that make us human,” Evans said. He provided some examples: The English-language Wikipedia article about the Game of Thrones television series notes that the fictional languages Dothraki and Valyrian have reportedly been heard by more people than the Welsh, Irish, and Scots Gaelic languages combined, which suggests the popularity of the show and the relative tininess of Celtic languages. But the Welsh Wikipedia article on the same topic exhibits considerably more Welsh pride. It highlights the roles of two Welsh actors (Iwan Rheon as Ramsay Bolton, Owen Teale as Ser Alliser Thorne) and the series’s use of Welsh and Welsh-sounding names.

The different language editions also reflect ideological differences. English Wikipedia states that Catalonia is an autonomous community in Spain. But Welsh Wikipedia describes Catalonia as a European country, based on its declaration of independence in 2017. Where information is consistent across cultures, like the populations of the world’s largest cities, it perhaps makes sense to pull from a standardized source. But philosophers make fine distinctions among data, information, and knowledge, and knowledge is not necessarily machine-readable. It’s understandable that the Welsh language community would support the Catalans’ independence if this position better reflects the Welsh community’s universe of knowledge.

Harnessing technology is an important part of the Welsh government’s goal to have 1 million Welsh speakers by 2050. (In the most recent census, in 2011, there were 562,000 Welsh speakers in Wales.) Children in the 19th century were punished for speaking Welsh by being forced to wear a shameful wooden plaque around their necks called a Welsh Not. “There was this phase in Welsh history where it was seen as the language of poor, uneducated people, so people tried desperately to rid themselves of their Welsh,” Evans said. “Now bilingualism is generally encouraged through the educational system.”

That has also created an unusual challenge for Wicipedia: Many of the contributors are children learning Welsh in bilingual programs. “They change things, or write a rude word in, just to be funny,” Evans said. “Most of the IP addresses that are blocked [from editing] in Wales are schools.” But the occasional prank job is not a terrible problem (and it’s also something that happens on lots of other language editions). When Evans or his colleagues visit a school to teach students digital literacy skills and how to edit the site, they can lift the IP address ban.

Evans notes that, as in other Wikipedia communities, most of Wicipedia’s contributors are men. Still, less than 20 percent of the English Wikipedia’s biographies are about women, while Welsh Wikipedia has approximately a 50/50 split. Evans believes this gender balance stems from the fact that Welsh Wikipedia has not had access to much freely available content, meaning the site has largely been written from scratch. It’s a striking contrast with English Wikipedia, which in 2006 began incorporating information from the 1911 edition of Encyclopædia Britannica once it entered the public domain. But importing from this traditional source meant that historical gaps in coverage were simply carried forward into the digital space.

There’s a popular pessimistic view that technology will always be biased because it is learning from biased source material and limited data sets. Yet Welsh and other smaller language movements on Wikimedia projects suggest there may be ways to train technology to allow for cultural differences. Further efforts to “fix the machine” may in fact be a deeply human experience, filled with national pride, teenage pranks, and the rest.

Correction, Aug. 12, 2019: This article originally misstated that Scots is a Celtic language. While Scottish Gaelic is, Scots is not. Both have been discussed at the Celtic Knot conference.

Future Tense is a partnership of Slate, New America, and Arizona State University that examines emerging technologies, public policy, and society.