I am a data junkie. As I researched my column last week about the implications of Facebook’s study on the sharing of political content, I practically drooled over the data the social network had released as part of it. In order to identify patterns in how stories from various news websites are shared on Facebook—and conclude that the far left and far right are particularly siloed in their exposure to political ideas—I had to both draw upon qualitative background knowledge (about politics and political websites) and assess ambiguous quantitative measures (like Facebook shares). While that background knowledge is available in libraries and on Amazon, the data was only accessible because a private company allowed it to be. For an increasing amount of data-driven research, whether it’s me sifting through social shares or A.I. researchers trying to train a computer to identify a cat, we need companies’ data. And for the most part, they don’t want to give it to us.
Historically, it hasn’t worked like this. During the Cold War, huge Defense Department grants went to research labs, think tanks, and universities in the hope that the work would result in traditional research output, like papers, prototypes, conferences, and the like. Corporations and startups were beneficiaries of the public-private partnerships, but not the only ones. We got the corporate product Unix, but it was coupled with the more basic research that informed its design. Whether you were in academia, industry, or your garage, you could help advance the field.
With the rise of Silicon Valley as we know it today, the private sector has gained pre-eminence in driving technology forward, with massive tech revenues supplanting military-industrial complex dollars in many areas of computer technology. (There are some exceptions, like cryptography.) While the coupling of defense concerns and technological innovation is inherently problematic, one side effect of these public-private partnerships was a traditionally collective approach to scientific research; the government might have held some of the purse strings, but it didn’t micromanage the research. But in the case of a company like Facebook, the research takes place behind closed corporate doors and materials are released only at Facebook’s discretion. While Facebook did a much better job this time than in last year’s disastrous “emotional contagion” experiments—in which both the methodology and ethics were problematic—it’s still not doing enough.
This latest Facebook study, published in Science, came in for a fair bit of criticism from academic researchers such as Eszter Hargittai and Zeynep Tufekci. Some of this criticism, about the study’s “conclusive” (not really) findings and the playing down of Facebook’s interventionist News Feed ranking algorithm, was wholly justified. The study is much closer to being “one data point in a long line of research,” as Facebook’s William Nevius described it to me, than it is to being a conclusive demonstration of anything. But some of the pushback, such as Nathan Jurgenson’s tendentious critique, tried to have it both ways, condemning the study as inconclusive and then using the conclusions to bash on Facebook anyway, all before descending into sub-Foucauldian jabs like “Power and control are most efficiently maintained when they are made invisible.”
Facebook’s research is troubling not because of issues of power and control (spooky!), but because of problems with transparency and science. We should be concerned about the increasing transfer of a portion of science and social science research away from the public eye and into the corporate sector. This shift has been happening for a while; the Internet boom has caused a lot more research to follow from innovation rather than the other way around. Consider an example from the field of computer science systems research. The open-source Apache Hadoop data processing framework was created in 2005 by open-source software engineers Doug Cutting (who joined Yahoo in 2006 and is now chief architect of Cloudera) and Michael Cafarella, then a graduate student in computer science at the University of Washington. It was not an academic project, but one supported by Yahoo, with the company going on to use it from 2006 on and deploying the first large-scale Hadoop clusters in 2008. This sort of public-private partnership may not look so distant from the sorts of corporate/educational partnerships that created the Internet and Unix, but there are two important differences. First, the government played a negligible role, in contrast to the DARPA-funded projects of the Cold War era that created Route 128 and fueled the prehistory of Silicon Valley. And second, the pure research output of Hadoop is considerably lower. Hadoop was created more to increase corporate revenue than to attract research grants or obtain tenure. That shift is important to remember.
Hadoop was originally inspired by Google’s MapReduce framework, which had been described in a 2004 paper by Google engineers Jeffrey Dean and Sanjay Ghemawat. At the time, a friend told me that his Ph.D. adviser had sneered at the paper, calling it a bit amateurish and not theoretically rigorous, and not really fit to be published in USENIX’s prestigious Operating Systems Design and Implementation symposium proceedings. I had already had my own tanglings with grouchy Ph.D. researchers who looked down on the menial business of software engineering, and I disapproved of the condescension. Moreover, MapReduce was absurdly useful. (Disclosure: I used to work at Google, during which time I lived and breathed MapReduce, and my wife still does. I like MapReduce. I like Hadoop.) And in retrospect, the paper signaled a real shift toward systems research geared toward large-scale data processing frameworks and cluster management—research for which it was useful to have large-scale data and clusters. Today, Dean and Ghemawat’s paper has more than 13,000 citations, while the winner of that year’s symposium, “Using Model Checking to Find Serious File System Errors,” has about 250.
Research is not a popularity contest, but you can see why there was so much interest in the Google paper: Universities were not being confronted with the set of practical problems that Google was running into at the time, and so their work tended to proceed from more theoretical assumptions rather than corporate demands. This had also been true at the older research laboratories like Bell and DEC in the 20th century, but it was not especially true at newer companies. While Google and Microsoft had research departments, there was (and remains) a much tighter focus on producing work that could be used now.
Today the rise of “big data” has completed the reversal: Data is acquired for commercial purposes and then put to research use, as we saw with this month’s Facebook paper. And unlike the earlier case with MapReduce and Hadoop, this research is not reproducible. Anyone can build a Hadoop cluster, and the MapReduce paper didn’t reference any Google secret sauce. But Facebook’s study is all about Facebook’s algorithmic secret sauce. Facebook isn’t researching computer science, it’s researching itself. In his Science essay accompanying the Facebook study, political and computer scientist David Lazer, a professor at Northeastern University, describes the urgent “need to create a new field around the social algorithm, which examines the interplay of social and computational code.” But that leaves us with the question, How and where is this research to be performed when all the data is locked up?
On the one hand, it’s nice to have research that’s not driven by the overarching fear of the Soviet Union getting a technological lead and blowing us off the face of the planet. On the other, the frequently competing impulses toward knowledge and profit can cause some real dilemmas, like the ethical problems with social media experiments raised by law professor James Grimmelmann, or the non-disinterested interest Facebook has in showing itself not to be excessively curating your feed. We can see such dangers with Big Pharma, whose research seems to show that drugs work surprisingly well an awful lot of the time. The 2014 Sony Pictures hack revealed a Motion Picture Association of America official trying to order up some custom academic research to encourage Internet service providers to block websites linked to piracy. But that research is (supposedly) transparent, whereas here we are dealing with data that we only have access to when Facebook or Google lets us. Those companies can then set the terms for how the research is to be done, with the only pushback coming from prestigious publishing institutions like Science. Science can normally reject papers in favor of better research, but in this case, no one outside Facebook can do this research. If Facebook submits a paper about its data, the implicit message is, Take it or leave it.
This is fundamentally why we’re seeing the level of criticism of the Facebook study that we have, even though it is hardly inferior to much scientific work out there. As Lancet editor Richard Horton wrote in a scathing editorial last month, “much of the scientific literature, perhaps half, may simply be untrue. Afflicted by studies with small sample sizes, tiny effects, invalid exploratory analyses, and flagrant conflicts of interest, together with an obsession for pursuing fashionable trends of dubious importance, science has taken a turn towards darkness.” Rather, the Facebook study has a unique problem: By publishing the study, Science is partly ceding one of the few chips that the institutions of science have left to prevent corporations having a full veto over research processes into social media. With the data as tantalizing as it is, it seems unlikely that scientific institutions will be able to resist Facebook’s allure. That leaves a couple of options: government regulation of large corporate data stores (a messy business unlikely to happen soon), or the creation of open-source social networks that would allow for transparent and replicable research in a way that corporations most likely never will. I have a dream of a decentralized and federated social network that looks more like the Internet than Facebook, which would allow for people to opt in and opt out of research with fine-grained controls and no centralized authority with a profit motive. Until that dream comes true, alas, those of us who wish to understand the impact that filtering algorithms have on our opinions and knowledge are stuck hoping that Facebook keeps sharing its data. Please?
This article is part of Future Tense, a collaboration among Arizona State University, New America, and Slate. Future Tense explores the ways emerging technologies affect society, policy, and culture.