Sometimes it isn’t enough for data to be big. Consider Google Books, a searchable digital archive of millions of texts spanning the history of the printed word. This enormous corpus has inspired researchers to rethink the ways we map the history of language, allowing them to make sweeping claims about the evolution of language and culture. Researchers have, for example, used it to chart changing patterns of celebrity culture or to propose that moral language is waning. The trouble is, we may be arriving at those assertions a little too glibly.
According to a paper published in PLOS One last week by three data scientists from the University of Vermont, the basic design of Google Books threatens to undermine its ability to map cultural trends. The paper’s authors, Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds, came to the site hoping that it would allow them to examine linguistic evolution. Instead, as Dodds told me, they concluded that it is “difficult to say anything at all” on the basis of the results Google Books renders up.
According to the paper, the problems begin with a basic disjunction between the way culture works and the way Google collects materials for its database. Generally speaking, once a book has been scanned, it’s entirely accessible, meaning that Google has no reason to index it again. There are, of course, exceptions (as the researchers note, “new editions and reprints allow some books to appear more than once”), but, for the most part, each book apparently appears in the database only once.
Sounds reasonable, right? But let’s say you go to your local library’s online catalog and search for any reasonably popular book: Harry Potter, The Hunger Games, or whatever’s perched atop the best-seller lists this month. In all likelihood, you’ll find that the library has numerous copies scattered throughout its system. That multiplicity reflects the book’s hold on the wider public imagination. If the library owned only one copy of each of these volumes (as is the case with Google Books), it would find itself badly out of sync with the reading public and the cultural zeitgeist more generally. Without information about how widely a book was read, the University of Vermont researchers suggest, it’s not possible to make strong claims about the ways people were living at a given time. As they snarkily put it, “Evidently, incorporating popularity in any useful fashion would be an extremely difficult undertaking on the part of Google.”
By not taking into account the relative popularity of texts, Google Books leaves itself open to disproportionate influence from less widely recognized sources. “It’s as if you’re giving every work in a library the same weight,” Dodds said. When an author publishes numerous books about a single character, for example, that character’s name may appear to be far more central to an era’s discourse than it actually was. Dodds pointed me to the example of Star Trek novelizations, which made names like Spock appear with improbable frequency. By contrast, Dodds noted, a long-standing best-seller like A Tale of Two Cities has trouble making a dent at all, even in eras when everyone was reading it.
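Dodds’ point about equal weighting can be made concrete with a toy calculation. The Python sketch below is purely illustrative: the titles, word counts, and circulation figures are invented, and `word_share` is a hypothetical helper, not anything taken from the paper or from Google. It compares how often a name appears when every book counts once with how often it appears when each book is weighted by the number of copies in readers’ hands.

```python
# Toy corpus. Every number here is made up for illustration.
# Three novels from one series each mention "spock" 5 times per 100 words
# and have 1 copy in circulation; one best-seller never mentions the name
# but has 97 copies in circulation.
books = [
    {"title": "Series novel 1", "hits": 5, "length": 100, "copies": 1},
    {"title": "Series novel 2", "hits": 5, "length": 100, "copies": 1},
    {"title": "Series novel 3", "hits": 5, "length": 100, "copies": 1},
    {"title": "Best-seller", "hits": 0, "length": 100, "copies": 97},
]


def word_share(books, weight_by_copies):
    """Return the fraction of all words that are the tracked name.

    With weight_by_copies=False, every book counts once (one scan per book).
    With weight_by_copies=True, each book counts once per copy in circulation.
    """
    hits = 0
    total = 0
    for book in books:
        weight = book["copies"] if weight_by_copies else 1
        hits += weight * book["hits"]
        total += weight * book["length"]
    return hits / total


equal_weight = word_share(books, weight_by_copies=False)
popularity_weight = word_share(books, weight_by_copies=True)

print(f"one copy per book:      {equal_weight:.4f}")       # 0.0375
print(f"weighted by popularity: {popularity_weight:.4f}")  # 0.0015
```

In this made-up case, counting each book once makes the name look 25 times more common than it would be in what people were actually reading. Real circulation figures would be far harder to come by, which is part of the researchers’ complaint.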
Part of the problem is that Google isn’t particularly interested in what’s popular. It draws the majority of its volumes from university libraries, meaning that it ends up relying heavily on academic—and especially scientific—literature. Consequently, it can give the mistaken impression that our language is gradually becoming more abstract and science-y.
Not everyone agrees that this is really an issue. Daniel Shore, an associate professor of English at Georgetown University (and former colleague of mine), who is working on a book about digital archives and the history of linguistic forms, isn’t convinced that Google archives only a single copy of each book. He says Google Books isn’t thoroughgoing enough in its collection strategies to guarantee that it avoids duplicates. The real trouble, Shore suggested, is that Google Books is effectively a black box. We ultimately don’t know what it contains, which makes it difficult to draw strong conclusions on the basis of the data it furnishes.
For Shore, then, the real issues begin with Google’s little-understood archival methodology. Its approach, he noted, has been to simply “scan it all.” Because of this, “It’s often difficult to figure out what you’re reading, where it came from, who published it.”
Ted Underwood, an English professor at the University of Illinois at Urbana-Champaign who has published and blogged about these issues for years, takes a similarly moderate approach. He pointed to one study that used Google Books to make broad claims about the changing nature of childhood in the mid-20th century, a study that failed to acknowledge that parenting manuals emerged as a genre during that era. With such bumbling scholarship in mind, Underwood noted that he was grateful to Pechenick, Danforth, and Dodds for substantiating “some of what we’ve suspected” about Google Books. Nevertheless, he suggested that the problem might not be with the database itself. Many researchers examine Google Books as a fortuneteller might read tea leaves: “You can come up with any story you want for why the apparent popularity of a word is going up or down.”
Ultimately, Underwood suggests, Google Books may help us come to such conclusions, but it won’t do so on its own. “It’s not just that Google Books can’t support claims about cultural evolution,” he said. “It’s not that it’s a bad corpus for that; it’s that you’d need more than one corpus for that.” If it’s going to be truly helpful, we’ll need to juxtapose it with other, more thoughtfully assembled collections. For now, it may have limited uses. As Underwood pointed out, for example, we can still use it as a rough tool that allows us to ask, “When did this word or phrase start appearing in the print record?”
But Dodds still believes that Google Books as it stands now is largely useless to researchers. If it’s going to be functional, he told me, “It needs a lot more metadata.” There should, for example, be a way to separate out the scientific literature or to classify fictional works by genre. And even if we could do that, he noted, the issue of relative popularity might remain a concern. Ultimately, we might have to recognize that Google Books simply isn’t a great research tool, however appealing it might be.
This article is part of Future Tense, a collaboration among Arizona State University, New America, and Slate. Future Tense explores the ways emerging technologies affect society, policy, and culture.