Future Historians Will Rely on Wikipedia’s COVID-19 Coverage

The deletions, the editor fights—all of it will be important to researchers studying this period.

In March, Facebook was filled with posts that claimed that 5G networks, not a novel coronavirus, were making people sick. Yet searching for those same posts today leads to an error message: “Sorry, this content isn’t available right now.” That’s because Facebook and other social media companies have removed many conspiracy-type posts from their platforms, including the thoroughly debunked 5G connection. But some internet activists are concerned that this pandemic-related content is not only being removed but erased, leaving future researchers with a gap-filled historical record.

Enter Wikipedia. In April, 75 signatory organizations sent a letter asking social media companies and content-sharing platforms to preserve all data that they have blocked or removed during the COVID-19 pandemic and make it available for future research. The letter’s recipients included Facebook, Twitter, Google, and the Wikimedia Foundation, the parent organization of Wikipedia. When Wikipedia editors discussed the letter among themselves in forums like Wikipedia Weekly, the most common reaction was, Don’t we already do this?

Over the past few months, Wikipedia’s coverage of the COVID-19 pandemic has been widely praised for its breadth and relative trustworthiness. To date, the main English Wikipedia article about the pandemic has been viewed more than 67 million times, and COVID-19 articles exist in 175 languages. The 5,000 articles related to COVID-19 cover everything from Anthony Fauci’s peers across the world, to the resulting global economic crisis (e.g., German Wirtschaftskrise and its Arabic counterpart), to a somewhat circular Wikipedia article about Wikipedia’s own response to the pandemic.

But today’s wealth of Wikipedia content will also be valuable to future parties. As scholar and Wikimedia program coordinator Liam Wyatt writes, the “text in Wikipedia’s archive will be of interest to linguists, historians or sociologists of the year 4000.” In an interview, Katherine Maher, chief executive officer and executive director of the Wikimedia Foundation, told me, “One of the things that historians will find valuable is the way Wikipedia documents the rate of acceleration of understanding the virus itself.”

For example, a future historian looking back on Wikipedia’s coverage of the COVID-19 pandemic this year would likely review the relevant “diffs.” Every Wikipedia article, and every revision to it, is saved even if the edit is relatively minor or short-lived. The diff shows the difference between one version and another of a Wikipedia page, allowing anybody to see exactly what changed between two precisely time-stamped moments. The diffs for the Wikipedia article about the COVID-19 pandemic include this one on Jan. 7 noting the first suspicions that the virus had an animal source, and this one on Jan. 8 with the first use of “novel coronavirus.” More recently, this diff shows the first insertion of the word bleach on April 29, after comments from President Donald Trump. A historian could use Wikipedia’s diffs to construct a case about how knowledge about COVID-19 evolved throughout 2020.
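To make the idea of a diff concrete, here is a minimal sketch in Python that compares two versions of a short passage using the standard library's difflib module. The two revision texts are invented for illustration and are not actual Wikipedia content, and the function name revision_diff is hypothetical. Wikipedia's own diff view is produced by its own software, not by this code.

```python
import difflib

# Hypothetical revision texts, invented for illustration only;
# these are NOT quotations from any real Wikipedia revision.
old_revision = [
    "An outbreak of pneumonia of unknown cause was reported.",
    "The source of the illness is under investigation.",
]
new_revision = [
    "An outbreak of pneumonia of unknown cause was reported.",
    "The illness is suspected to have an animal source.",
]


def revision_diff(old_lines, new_lines):
    """Return the unified-diff lines between two versions of a text."""
    return list(
        difflib.unified_diff(
            old_lines,
            new_lines,
            fromfile="revision A",
            tofile="revision B",
            lineterm="",  # inputs have no trailing newlines
        )
    )


diff_lines = revision_diff(old_revision, new_revision)
for line in diff_lines:
    print(line)
```

Lines beginning with a minus sign appear only in the earlier version, lines beginning with a plus sign appear only in the later one, and unmarked lines are unchanged. That is the same basic information a historian would read off a Wikipedia diff.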

Researchers in the future could also learn from debates among editors. Each Wikipedia article has a discussion page where editors can participate in conversations about building the encyclopedia. Throughout April and early May, Wikipedia’s volunteer editors engaged in a lengthy discussion about renaming the article from “2019–2020 coronavirus pandemic” to its current name “COVID-19 pandemic.” Notice how the new name identifies the virus specifically and drops the time range. What might this renaming signify to a future historian? It’s impossible to know, of course, but one interpretation is that this was an early recognition that this pandemic could last until 2021 and beyond.

Future researchers will struggle more with historical data from social media companies. In March, Facebook, Twitter, and YouTube removed videos from their platforms in which Brazilian President Jair Bolsonaro said that the drug hydroxychloroquine was an effective treatment for COVID-19. While this helped stop the spread of medical misinformation on those platforms, the deletion of the posts (and all associated comments and metadata) makes it more difficult for researchers to understand how the public engaged with that misleading content before it was taken down. In the past, these companies have not disclosed data on deleted posts, even after the fact, as they consider such information proprietary.

And it’s not just big tech companies that are purging the future historical record. Woody Harrelson and John Cusack posted support for the 5G coronavirus conspiracy theory before voluntarily deleting those posts from Twitter and Instagram. And some journalists have begun routinely deleting their old tweets in order to reduce the risk of online harassment, a practice the Columbia Journalism Review characterized as “erasing the first draft of history.” But Wikipedia is less likely to be accused of this historical erasure since, with few exceptions, the software preserves the project’s entire edit history.

Preservation is Wikipedia’s strong suit, but a long-term challenge for the project is the issue of systemic bias. Largely unintentional bias can be seen in the encyclopedia’s biographical articles (more than 80 percent male) and the disproportionate number of articles about sci-fi and technical topics (mirroring the preferences of the site’s earliest contributors). We know that when original source material is biased, this limits the understanding of future researchers, who will ask questions millennia later like “Where are all the women in ancient philosophy?”

But today’s Wikipedia supercontributors are keen to ensure that future historians will have access to a better archive. Comprehensive coverage was a recurring theme at this month’s virtual symposium on Wikipedia and COVID-19 organized by Wikimedia NYC, which featured prolific volunteers like Jason Moore. Moore has been documenting the pandemic in real time from many viewpoints, starting articles about the pandemic’s impacts in various U.S. states, the LGBTQ community, and discrete sectors like the cannabis industry. Another presenter at the symposium, Lane Rasberry of the University of Virginia, demonstrated how Wikidata can visually represent outbreaks of the virus on a world map. Because this data is machine-readable, it can flow immediately from the central hub of Wikidata into the various language editions of Wikipedia. But Rasberry cautioned that this wiki outbreak data overrepresented North America and Europe and underrepresented places with fewer wiki editors. “That’s just the way it’s working for now,” he said.

Then again, future researchers may be able to account for some geographic distortions so long as the original record is still accessible. After the symposium, presenter Netha Hussain described an article she started on English-language Wikipedia called “Misinformation related to the 2020 coronavirus pandemic in India.” But if you search Wikipedia for that article today, you will not find it. That’s because other Wikipedia editors voted to delete the page on May 6. (The pro-delete group argued that it was improper for India to have a separate article for misinformation when other countries did not.) The article about misinformation in India is not completely lost to posterity, however, and unlike social media companies, Wikipedia is not claiming that it retains ownership of deleted content. These deleted articles can be viewed by Wikipedia’s volunteer administrators, and Hussain has also saved a copy of the deleted article about India and the editorial discussion about its deletion. Perhaps a future historian will someday comb through this discussion to better understand how editors responded to allegations of a “corona jihad,” a false narrative that has led to persecution of India’s Muslim minority.

Interestingly (at least to me!), these hypothetical future researchers would be using Wikipedia as a primary source. That may sound heretical, given that librarians and educators have been reminding us for nearly 20 years that Wikipedia is not a primary source, not a secondary source, but a tertiary source. That’s why Wikipedia has a handy help page to remind readers that “you probably shouldn’t be citing Wikipedia.” But citing Wikipedia as a primary source makes sense in a future state where enough time has passed that today’s Wikipedia revisions have become a historical artifact. Imagining this distant future presents an interesting thought exercise not only for Wikipedians but for other creators of online content: How might this digital media someday be interpreted as a revealing artifact from this period of distress and disease?

