In September, Google became aware of anti-Semitic memes appearing as top image search results for the terms “Jewish baby stroller” and “Jewish bunk bed.” (Don’t search for these terms; doing so can amplify the memes.) The results show photos of portable ovens and oven racks, respectively, referring to the crematoria the Nazis used to incinerate Jews during the Holocaust. Many of these memes were a few years old, but they likely surfaced in bulk only recently because anti-Semitic extremists coordinated to share them. Some of the memes appeared on Reddit in recent months.
“For ‘baby strollers,’ there’s lots of helpful content. For this, there’s not,” Google’s public search liaison Danny Sullivan wrote on Twitter. “It’s not likely a topic normally searched for, nor an actual product that’s marketed. There’s a ‘void’ of good content to surface that matches what was asked for.” Sullivan said Google would not remove the anti-Semitic results but would look for ways to surface “more helpful content.”
A data void emerges when there’s a lack of information. For a clear example, try searching an uncommon combination of keywords in Google, and you may well find one. Google will say, “It looks like there aren’t any great matches for your search.” When little content matches a set of keywords, bad actors can optimize their own content to rank well for that particular search.
This is an example of an empty data void. But, as the “Jewish baby stroller” example shows, a data void doesn’t have to be entirely empty of information. It just has to be empty of good, trustworthy information. Sometimes the manipulation of a data void is improvised, done by populating searches as news breaks. Sometimes it is planned out in advance by covertly filling voids with disinformation, then executing online campaigns to prompt a spike in searches for those keywords. Other times, data voids appear organically (say, filled with a fringe conspiracy theory), and if a troll finds one where misinformation ranks higher than truthful information, they can purposefully direct unsuspecting viewers to the search terms.
The term data void was coined in May 2018, just six months before the midterms, by Microsoft’s Michael Golebiewski in a report published with Danah Boyd, president of Data & Society. “A rising trend in online misinformation is to encourage users to search for a topic for which the motivated manipulator knows that only one point of view will be represented,” they wrote. They pointed to the June 2015 shooting at Emanuel African Methodist Episcopal Church in Charleston, South Carolina, when Dylann Roof opened fire, killing nine Black church attendees and injuring one. In his manifesto, Roof wrote that his life was changed after Googling “black on white crime.” The first website he came to was one by a white supremacist group.
“The term ‘black on white crimes’ is not a popular search term, but the results provided on the first page of major search engines are very problematic,” wrote Golebiewski and Boyd. “This is a classic example of a data void.”
Soon after Golebiewski and Boyd published their report, Claudia Flores-Saviaga, a Facebook research fellow and Ph.D. candidate studying computer science at West Virginia University, noticed the same dynamic in connection with the election. Flores-Saviaga and her adviser Saiph Savage were studying political discussions on Reddit about Latinos and immigration issues leading into the November 2018 midterm elections. Their data analysis found that the most active users were political trolls sharing propaganda intended to mobilize Latinos to vote Republican, and they were drowning out the few anti-Republican and neutral actors discussing Latino politics. Many of the most popular posts about Latinos in political subreddits touted stories of Latinos across the United States who voted for President Donald Trump and anecdotes purporting that Trump’s immigration policies were beneficial for Latinos who were U.S. citizens.
Latino voters have long been neglected in American politics. Politicians often dismiss Latinos as unreliable voters and avoid doing much Latino outreach. Relatively little information about American political candidates, especially Democratic ones, is available in Latino news outlets. But Latinos are expected to be the largest nonwhite voting group in 2020, accounting for a record 13.3 percent of eligible voters. At the same time, according to MIT Technology Review, they are being targeted with record-breaking levels of disinformation favoring conservatives. In September, Politico reported that in Florida, a critical swing state, Latinos have been inundated with QAnon conspiracy theories that are damaging Joe Biden’s lead in the polls.
Flores-Saviaga and Savage conducted a new data analysis in the weeks leading up to the 2020 election to see what data voids could be affecting the Latino vote now. They collected all of the Facebook posts from the top three Latino newspapers in each state between Aug. 18 (the day Biden officially became the Democratic nominee) and Oct. 20. What they found was a disparity in the amount of coverage former Vice President Joe Biden and Trump have been getting in Latino news outlets: There were 9,898 posts about Trump and Vice President Mike Pence and just 3,255 posts about Biden and his running mate, Sen. Kamala Harris. Flores-Saviaga said there could also be data voids where Latino newspapers and Facebook groups have light coverage of the Democratic candidates. And this leaves the door wide open for trolls to fill that space with propaganda and false information directly targeting Latino voters. Without enough high-quality information to refute misinformation, many readers have no clue they’ve stumbled into a data void.
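To put that disparity in proportion, here is a minimal back-of-the-envelope sketch in Python. It uses only the two post counts reported above. The variable names and the `coverage_share` helper are illustrative, not part of the researchers' analysis, and the sketch assumes the two counts do not overlap (the analysis as described does not say how posts mentioning both tickets were tallied).

```python
# Post counts reported in the analysis above (Aug. 18 to Oct. 20).
# Assumption: the two counts are disjoint, so they can simply be summed.
trump_pence_posts = 9_898
biden_harris_posts = 3_255


def coverage_share(count: int, total: int) -> float:
    """Return count as a fraction of total."""
    return count / total


total_posts = trump_pence_posts + biden_harris_posts
ratio = trump_pence_posts / biden_harris_posts
biden_share = coverage_share(biden_harris_posts, total_posts)

print(f"Total posts about either ticket: {total_posts:,}")
print(f"Trump/Pence posts per Biden/Harris post: {ratio:.1f}")
print(f"Biden/Harris share of candidate coverage: {biden_share:.0%}")
```

Under that assumption, the Republican ticket drew roughly three posts for every one about the Democratic ticket, leaving Biden and Harris with about a quarter of the candidate coverage in these outlets.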
It’s hard to do anything about this. As Golebiewski and Boyd wrote in their report on data voids, “Search engines aren’t human.” They don’t understand human intention. They assume that for every question there must be a relevant answer, drawing on keywords, recency, number of clicks, and various other factors (masked from the public as trade secrets) to determine relevance. It’s not difficult to push disinformation to the first page of search results, either by publishing new content when all other results are dated or by optimizing keywords in posts to match odd phrases. And I don’t only mean Google, Bing, and DuckDuckGo. I’m also talking about social media search functions on sites like Reddit, YouTube, and Facebook. According to Boyd, a surprising number of people use YouTube’s search bar, which processes billions of searches every month, as a search engine, specifically seeking video-format answers instead of the mishmash of links that traditional search engines aggregate. Reddit, Twitter, and Facebook are also popular platforms to search for news and political information. Data voids can appear on all of these sites and more.
There are an infinite number of data voids out there, though most will never be a problem. For example, a random string of numbers and letters probably won’t match any results, but it’s unlikely that multiple people will ever search those exact same characters. The catch is that we don’t know in advance which data voids will end up being weaponized, so it’s hard to proactively fill voids with high-quality content in an attempt to combat disinformation.
This is particularly problematic when it comes to breaking news, especially when it’s something that has never happened before—say, the president getting COVID-19. Legitimate news happens slowly. It needs to be reported, fact-checked, and edited before publication. Conspiracy can be whipped up with minimal effort and spread across social media in a matter of seconds. It didn’t take long for baseless allegations to appear claiming Democrats had deliberately poisoned Trump with the coronavirus. In this case, legitimate news filled the void pretty fast. But filling voids after misinformation has been able to spread in any capacity can only curb further misinformation; it can’t entirely stop what’s already out there. Research suggests that facts aren’t very effective in combating conspiracy. “Even after all the explainers, debunks, and stakes-laying, QAnon hasn’t receded in popularity—it’s exploded,” wrote Whitney Phillips, assistant professor of communication and rhetorical studies at Syracuse University. It’s easy to see this in conspiracy-focused Facebook groups, where posters make fun of the platform’s fact-checking system or contend it’s biased or funded by billionaires with agendas. It’s hard to put out the fire that is misinformation once it spreads, while it’s not hard to get misinformation to rank well in search results during the first few hours of breaking news.
Some of the most successful disinformation campaigns expressly lead the masses to problematic search terms. Media manipulators coordinate in private chats on platforms like WhatsApp and Discord. They prepare by creating content—websites, blog posts, social media pages, memes—using search engine optimization techniques to fill data voids for particular phrases. Then they make those phrases go viral by posting them in memes on message boards, incorporating the words into their in-group lingo and even bringing the phrases into interviews, getting the media to spread their content for them.
The only way to counter data voids is to produce enough high-quality content to saturate the vulnerable ones. But neutral content is not always enough. Flores-Saviaga said trolls have very sophisticated techniques for creating engaging content that can go viral ahead of neutral content about the same topic. For Reddit posts about Latinos and U.S. politics, Flores-Saviaga saw that right-wing trolls were able to get far more engagement than neutral posters were.
According to Flores-Saviaga, the strategies trolls use include insider slang that creates a sense of community; viral microtasks like sharing memes and tweeting certain hashtags; and historical context that frames how people should understand the political landscape, delivered through conspiracy theories and explainers (like manuals on how to red-pill a liberal). These strategies work exceptionally well at holding people’s attention, and they make it easy to take advantage of lapses in the information ecosystem.
We may live in an era where any question can be asked and answered with a few clicks, but not every answer is a good answer. As far as search algorithms are concerned, any “relevant” answer—no matter how wrong—is good enough. When it comes to truth and fair democracy, though, bad answers are simply not acceptable.