In a 2013 paper, psychologist Michal Kosinski and collaborators from the University of Cambridge in the United Kingdom warned that “the predictability of individual attributes from digital records of behavior may have considerable negative implications,” posing a threat to “well-being, freedom, or even life.” This warning followed their striking findings about how accurately the personal attributes of a person (from political leanings to intelligence to sexual orientation) could be inferred from nothing but their Facebook likes. Kosinski and his colleagues gained access to this information through the voluntary participation of Facebook users, whom they offered the results of a personality quiz, a method that can drive viral engagement. Of course, one person’s warning may be another’s inspiration.
Kosinski’s original research really was an important scientific finding. The paper has been cited more than 1,000 times and the dataset has spawned many other studies. But the potential uses for it go far beyond academic research. In the past few days, the Guardian and the New York Times have published a number of new stories about Cambridge Analytica, the data mining and analytics firm best known for aiding President Trump’s campaign and the pro-Brexit campaign. This trove of reporting shows how Cambridge Analytica allegedly relied on the psychologist Aleksandr Kogan (who also goes by Aleksandr Spectre), a colleague of the original researchers at Cambridge, to gain access to profiles of around 50 million Facebook users.
This case raises numerous complicated ethical and political issues, but to us, as data ethicists, one issue stands out: Both Facebook and its users are exposed to the downstream consequences of unethical research practices precisely because, like other major platforms, the social network does not proactively facilitate ethical research practices in exchange for access to data that users have consented to share. (Disclosure: One of us, Jacob Metcalf, has worked in a consulting capacity for Facebook on topics not directly related to this story.)
Cambridge Analytica has been profiled relentlessly in the media and subjected to congressional and parliamentary probes, creating a swirl of sometimes conflicting views about what it did for these political campaigns and whether its models are even effective. But the emergence of Christopher Wylie, a young data scientist at Cambridge Analytica, as a whistleblower has blown the story wide open. And it has illuminated a number of critical issues in digital research ethics.
According to the Guardian’s and New York Times’ reporting, the data that was used to build these models came from a rough duplicate of that personality quiz method used legitimately for scientific research. Kogan, a lecturer in another department, reportedly approached Kosinski and his Cambridge colleagues in the Psychometric Centre to discuss commercializing the research. To his credit, Kosinski declined. However, Kogan built an app named thisismydigitallife for his own startup, Global Science Research, which collected the same sorts of data. GSR paid Mechanical Turk workers (contrary to the terms of Mechanical Turk) to take a psychological quiz and provide access to their Facebook profiles. In 2014, under a contract with the parent company of Cambridge Analytica, SCL, that data was harvested and used to build a model of 50 million U.S. Facebook users that allegedly included 5,000 data points on each user.
How was a small app-based quiz used to harvest comprehensive data about that many people? At that point in history, Facebook’s API (the portal that allows third parties to make use of Facebook software and data) by default allowed third parties to access not only your own profile with permission, but also the full profiles of all of your friends.
Thus, a quiz app with 270,000 users could easily provide access to 50,000,000 full profiles. That represents only 185 friends per user, which is below average. According to the Times, 30 million of those profiles had enough information in them to correlate with other real-world data points held by data brokers and commonly used by political campaigns. This enabled Cambridge Analytica to connect these psychometric Facebook profiles to actual voters and offer its clients the ability to tailor advertisements to detailed psychometric profiles.
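As a rough check on the arithmetic above, here is a minimal sketch that divides the reported number of profiles by the reported number of quiz takers. The function and variable names are ours, chosen for illustration. The calculation assumes that no two quiz takers share a friend, so in practice each user would need to expose somewhat more friends to reach the same total.

```python
# Figures as reported in the article.
quiz_takers = 270_000
profiles_reached = 50_000_000


def friends_needed_per_user(profiles: int, users: int) -> float:
    """Average number of friend profiles each quiz taker must expose.

    Simplifying assumption (ours): friend lists do not overlap, so every
    exposed friend counts as a new, distinct profile.
    """
    return profiles / users


avg_friends = friends_needed_per_user(profiles_reached, quiz_takers)
print(f"About {avg_friends:.0f} friends per quiz taker")
```

Because overlapping friend lists would only raise the number of friends required per user, this figure is best read as a lower bound on what the reported totals imply.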
Facebook no longer allows such expansive access to friends’ profiles via the API and requires clearer explanations about what data APIs request access to.
So if the Facebook API allowed Kogan access to this data, what did he do wrong? This is where things get murky, but bear with us. It appears that Kogan deceitfully used his dual roles as a researcher and an entrepreneur to move data between an academic context and a commercial context, although exactly how he did so is unclear. The Guardian claims that Kogan “had a licence from Facebook to collect profile data, but it was for research purposes only” and “[Kogan’s] permission from Facebook to harvest profiles in large quantities was specifically restricted to academic use.” Transferring the data this way would already violate Facebook’s API policies, which barred commercial use of the data outside of Facebook, but we are not aware of Facebook offering a “license” or special “permission” for researchers to collect greater amounts of data via the API.
The Times tells a slightly different story. Its reporters state that Cambridge Analytica funded and managed Kogan’s work and “allowed him to keep a copy for his own research, according to company emails and financial records,” and that he divulged to Facebook and users only “that he was collecting information for academic purposes.” Digital advertising expert Jay Pinho speculates that Kogan registered the app during the window of several months in 2014 between Facebook’s announcement of stricter API terms and the date those terms took effect for new apps.
Regardless, it does appear that the amount of data thisismydigitallife was vacuuming up triggered a security review at Facebook and an automatic shutdown of its API access. Relying on Wylie’s narrative, the Guardian claims that Kogan “spoke to an engineer” and resumed access:
“Facebook could see it was happening,” says Wylie. “Their security protocols were triggered because Kogan’s apps were pulling this enormous amount of data, but apparently Kogan told them it was for academic use. So they were like, ‘Fine’.”
Kogan claims that he had a close working relationship with Facebook and that it was familiar with his research agendas and tools.
Almost four years later, after the models generated by this data have arguably influenced a U.S. presidential election, Facebook has now declared Kogan, Wylie, Global Science Research, and Cambridge Analytica personae non gratae on its platform, and its chief counsel Paul Grewal has declared this was a “scam” and “fraud.” Cambridge Analytica has pushed back by throwing the “seemingly reputable academic” Kogan under the bus for supposedly deceiving it about the illicitness of his data gathering operation, although earlier reporting indicated that the firm understood it was doing something fishy.
From the perspective of research ethicists, the system most clearly failed when Kogan was able to use the credential of “researcher” to persuade someone at Facebook to restore access to the API despite major red flags. Ultimately, it doesn’t matter how the deceit occurred if the research community has no uniform, clear route to accessing data from the big platforms. We simply don’t know how Facebook would assess the question of whether Kogan was conducting legitimate research, let alone whether it did.
We also don’t know whether Facebook allows some researchers greater access via the API, although we do know that it allows some researchers special access to backend datasets when doing so is mutually beneficial. Facebook also has an internal ethics review process for its own research and product design, which is unique in the industry and should be emulated more widely. However, there is typically no mutual visibility between academic researchers and platforms that would demonstrate trustworthiness, such as public registration of research projects, evidence of funding sources, or a record of consent and/or terms and conditions. If such a system were established, academic societies, publishers, and university ethics review boards would immediately be able to require that all research pass through it.
Without that transparency, we cannot help protect users of the platforms from abuse like that perpetrated by Cambridge Analytica. A great deal of research confirms that most people don’t pay attention to permissions and privacy policies for the apps they download and the services they use—and the notices are often too vague or convoluted to clearly understand anyway. How many Facebook users give third parties access to their profile so that they can get a visualization of the words they use most, or to find out which Star Wars character they are? It isn’t surprising that Kosinski’s original recruitment method—a personality quiz that provided you with a psychological profile of yourself based on a common five-factor model—resulted in more than 50,000 volunteers providing access to their Facebook data. Indeed, Kosinski later co-authored a paper detailing how to use viral marketing techniques to recruit study participants, and he has written about the ethical dynamics of utilizing friend data.
We don’t wish to reduce the amount of access researchers have to platforms. Quite the opposite. Increased access is necessary to build the infrastructures and norms of ethical research in this new frontier of science. Genuinely academic research shouldn’t need to rely on manipulative viral quizzes to study these technologies at the heart of society. We should have a portal that allows registered researchers to query anonymized data from users who have consented to have their data offered to us, without ever needing the data to leave Facebook’s own servers. This would open up doors for more research while also giving users more control over who has access to their data. Hopefully the lesson that the major platforms take from this scandal is that lack of open dialog about ethical research practices is precisely why they are exposed to the consequences of unethical research practices by third parties.
This explosive story comes at a moment when platforms have been starting to commit to opening up to more research. Just last week, Facebook, Twitter, and Google signed on to an EU Commission report on coordinated disinformation campaigns that called for greater research access. Recent op-eds have argued that the platforms are stymying essential research into the dynamics of propaganda by keeping researchers behind restrictive nondisclosure agreements and forcing others to use kludgy workarounds like user surveys or hand coding.
Ultimately, researchers and platforms need each other. Platforms have a vast, unprecedented trove of data about human behavior, but they cannot understand it and build the best possible products without external researchers’ critical insights. The worst possible result of this scandal is a reduction of access. The best possible result is the development of equitable, open, and transparent access to research data with user consent.