The Big Data Paradox

It’s never complete, and it’s always messy—and if it’s not, you can’t trust it.

World of Warcraft’s XT-002 Deconstructor. WoW players wrote automated tracking systems to gather data on where treasure items and monsters popped up in the game.

Courtesy of Blizzard Entertainment, Inc./Creative Commons

Big data is messy data. It’s not enough just to collect it and count it, because there is never just one way to count it. Big data certainly doesn’t mean “the end of theory,” as Wired editor Chris Anderson notoriously put it in 2008. I came down hard on big data last week while discussing the Facebook and OkCupid experiments on users and their supposed revelations about human nature. These revelations turned out to be founded on sloppy analysis. That said, big data is undeniably important and is already responsible for great gains in efficiency and knowledge. Big data is not a miracle worker, but it is changing our lives.

One question to ask is how big data differs from regular data, other than there just being a lot more of it. Regular data doesn’t magically become big data just because you’ve got 100 million data points instead of a thousand. While computers have made large-scale number crunching far easier and faster than it was 20 or 30 years ago, that doesn’t mean that weather reports or graphs of seismic activity suddenly qualify as big data.

Contrariwise, big data is never complete either. It’s easy to think of big data as simply including all the data, but as Rachel Schutt and Cathy O’Neil put it in their excellent and skeptical book Doing Data Science, “It’s pretty much never all.” This is not a bad thing. As Jorge Luis Borges put it in “On Exactitude in Science,” a perfect map of a country—one necessarily as big as the country itself—is perfectly useless. But you must remain aware of what’s being excluded.

Aside from sheer quantity, there are three defining characteristics of big data. The first is megasourcing: taking data from huge numbers of distributed sources. If these sources are people, you can call it “crowdsourcing,” but the sources don’t need to be people. Every online ranking system, from Facebook “likes” to Reddit reputation systems, is an example of megasourcing, but so is Google Maps, which aggregates data from thousands of cars and satellites and third-party data sources around the world.

Another is automation. The ability to analyze data as fast as it can be collected means that the results can be put in play automatically, without anyone having to examine the data manually. This is not just a benefit, but a necessity, as the sheer quantity of data is becoming too great for humans to analyze even with the benefit of extra time. Hence the danger of big data: that the analyses are garbage, as we saw in the case of Facebook’s mood experiment, where “not happy” and “happy” both got treated as positive mood indicators. There’s so much data that there’s not enough time to validate the results (unless there’s a public outcry).
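To see how an automated analysis can go wrong in this way, here is a minimal Python sketch of keyword-based mood scoring. The word lists and function names are invented for illustration and are not the tooling used in Facebook's study; the sketch only shows how a check that ignores negation scores "not happy" the same as "happy."

```python
# Hypothetical word lists, for illustration only.
POSITIVE_WORDS = {"happy", "glad", "great"}
NEGATIONS = {"not", "never", "no"}

def naive_is_positive(text):
    """Flag a post as positive if it contains any positive word."""
    words = text.lower().split()
    return any(w in POSITIVE_WORDS for w in words)

def negation_aware_is_positive(text):
    """Same check, but ignore a positive word directly preceded by a negation."""
    words = text.lower().split()
    for i, w in enumerate(words):
        negated = i > 0 and words[i - 1] in NEGATIONS
        if w in POSITIVE_WORDS and not negated:
            return True
    return False

for post in ["I am happy", "I am not happy"]:
    print(post, naive_is_positive(post), negation_aware_is_positive(post))
```

Even the second function is crude (it would miss "not at all happy"), which is the point: someone has to look at the output to notice what the rule gets wrong.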

Finally, there is the issue of feedback. If an automated ad system decides you should see an ad for diapers because you recently “liked” a stroller on Facebook and bought wipes on Amazon, then further data on you—such as whether you clicked on that diaper ad—is interpreted as a consequence of the analysis that’s already been performed. Big data does not measure static or pristine systems; it puts its results back into these systems and changes their behavior. (This, naturally, makes the effects of big data that much more complicated and dependent.) “We’re witnessing the beginning of a massive, culturally saturated feedback loop,” write Schutt and O’Neil, “where our behavior changes the product and the product changes our behavior.”

A perfect example of these three features of big data comes from online multiplayer game World of Warcraft. (As usual, computer gamers got here first.) To figure out how often certain rare treasure items drop, how strong certain monsters are, and where items and monsters pop up in the game world, players wrote external, automated tracking systems like Wowhead that could be installed on their computers. Anyone who used these extensions while playing would automatically upload data of all of their encounters, pickups, and statistics to a central third-party server, which would aggregate them into a searchable database and generate stats. So if you wanted to know where to find a particular monster in WoW, you could get a breakdown of probabilities, down to the specific in-world coordinates.
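The aggregation step described above can be sketched in a few lines of Python. This is an assumption-laden illustration, not how Wowhead or any real add-on works: the report format, the monster names, and the `location_breakdown` function are all made up.

```python
from collections import Counter, defaultdict

# Made-up sighting reports, as many players' clients might upload them:
# (monster name, (x, y) map coordinates).
reports = [
    ("monster_a", (48, 12)),
    ("monster_a", (48, 12)),
    ("monster_a", (50, 14)),
    ("monster_b", (10, 80)),
]

def location_breakdown(reports):
    """Return {monster: {coords: share of that monster's reported sightings}}."""
    counts = defaultdict(Counter)
    for monster, coords in reports:
        counts[monster][coords] += 1
    breakdown = {}
    for monster, by_coords in counts.items():
        total = sum(by_coords.values())
        breakdown[monster] = {c: n / total for c, n in by_coords.items()}
    return breakdown

print(location_breakdown(reports)["monster_a"])
```

The shares are only as good as the reports behind them: they describe where players who ran the tracker happened to go, not every place a monster can appear.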

That’s great for a discrete, artificial, coded fantasy world, but what about the real world? Here the issue of messiness re-enters and dominates. If you look at the most successful case studies in Viktor Mayer-Schönberger and Kenneth Cukier’s sensible 2013 book Big Data—from Amazon’s recommendation engine to New York’s search for illegally converted buildings to predicting exploding manholes—they are all cases of selection optimization. That is, the data is used to help select and prioritize the most relevant and crucial data points, whether those points are books you are likely to buy or manholes that are likely to explode. Big data is suited to optimization problems because such problems are generally error-tolerant: If the analysis points to some safe manholes or some books you don’t want to buy, that’s fine. Think of it as analogous to Google’s search results and ads: It’s fine if some irrelevant results or ads show up, as long as there are enough good results and ads to keep people clicking.
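A toy Python sketch makes the error tolerance concrete. The records, risk scores, and function names below are invented for illustration and do not come from the New York manhole project; the idea is simply to rank items by a predicted score, inspect the top few, and compare the hit rate with the rate across everything.

```python
# Made-up records: (manhole id, predicted risk score, whether it really was a problem).
manholes = [
    ("A", 0.91, True),
    ("B", 0.85, False),
    ("C", 0.40, True),
    ("D", 0.77, True),
    ("E", 0.10, False),
    ("F", 0.05, False),
]

def top_k(records, k):
    """Pick the k records with the highest predicted score."""
    return sorted(records, key=lambda r: r[1], reverse=True)[:k]

def hit_rate(selected):
    """Fraction of the selected records that were real problems."""
    return sum(1 for r in selected if r[2]) / len(selected)

inspected = top_k(manholes, 3)
print([r[0] for r in inspected], hit_rate(inspected), hit_rate(manholes))
```

In this made-up data the shortlist includes one safe manhole and misses one dangerous one, yet inspecting it still turns up problems more often than picking manholes at random would. That is all an optimization problem asks for.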

Similarly, such analysis can identify anomalous correlations that would otherwise go unnoticed. Statistician Andrew Gelman’s analysis of New York’s stop-and-frisk policy concluded, “The differences in stop rates among ethnic groups are real, they are substantial, and they are not explained by previous arrest rates or precincts.” Having found a meaningful correlation where, in principle, there should not have been one, Gelman could then show a meaningful disparity in stop rates based on race. Such analyses would only improve with finer-grained and more comprehensive data; as long as the interpretation is well-grounded, individual errors and incompleteness should not corrupt the results.

For contrast, consider problems where error tolerance is extremely low. In speech recognition, language translation, medical diagnosis, and many other fields, analysis and results must be complete, exhaustive, and almost perfectly fine-tuned. If you translate a sentence from Japanese to English, your margin for error is pretty much zero: Any mistake could create a total misunderstanding. This isn’t to underestimate how often Google Translate does produce sensible results, due to the corpus of megasourced translation data available to it. But it also shows why human translators won’t be out of business anytime soon.

Likewise, while I’m happy to have my doctor use big data–style analyses, such as those offered by gene analyzer 23andMe, to find potential trouble spots in my health, I only want that data to supplement my doctor’s skills, not replace them. Only in cases such as spell-checking, where the megasourced data is remarkably coherent and precise, can you reach a degree of certainty that you would feel comfortable turning over much responsibility to a computer. Even crowdsourced spam-filtering, while impressively reliable, produces nontrivial numbers of false negatives and positives.

This is also why the government’s “vacuum cleaner” approach to collecting data rings somewhat hollow. We now know that the FBI had been warned multiple times about Boston Marathon bombing suspect Tamerlan Tsarnaev back in 2011, but the FBI never put him under surveillance, possibly due to lack of coordination and a spelling mistake. The next time a terrorist attack happens on their watch, I guarantee you that there will have been signals in their data that their analyses missed. Given a huge haystack, big data will find some needles pretty quickly, but it will never guarantee you that it’s found them all.

Big data, then, is good for when you want incremental optimization rather than a killer paradigm shift. The sorts of “discoveries” you see Facebook, OkCupid, and even Google trumpeting from big data should be greeted with caution. The real gains come in degrees of quantity rather than quality: saving time, identifying potential trouble spots, and identifying the biggest bang for your buck. These gains can save huge amounts of money, time, and even lives. But they do lack some of the flashiness of, for example, Google Flu Trends, where we thought one kind of data could be conjured out of another through magically emergent correlations, only to find that the correlations were a lot less solid than they seemed. Ironically, the great increase in data only makes the failings of its imprecision more noticeable and problematic. Though disappointing, this imprecision is also reassuring. Messiness is how you know that your data really does reflect real life.