Personal medical information from 1 in every 4 Americans has been stolen. On average, there were three data breaches in the U.S. every day during 2016. People outed by trails of left-behind data have taken their own lives. “Better” outcomes of doxxing include relentless abuse and death threats.
Against this backdrop comes the United Kingdom’s wise move toward new data protection laws. As part of this process, a proposed law would ban “intentionally or recklessly re-identifying individuals from anonymised or pseudonymised data.” Digital Minister Matt Hancock told the Guardian the law, if implemented, will “give people more control over their data, [and] require more consent for its use.” This shift recognizes that the threat of doxxing can chill your comfortable internet browsing. But, counterintuitively, the proposed ban would also make your data less secure.
Whether your data can be truly anonymous is up for debate. But “anonymous data” usually refers to information that cannot be associated with the person who generated it. This kind of data is vital to scientific research. When I conducted cancer research, I used free, open-access, genetic data from real people with real diseases. Having the same, constantly updated, genetic data freely available to all scientists creates a baseline, which can help verify results, allow researchers to avoid echo chambers, and aid reproducibility.
The issue is that “anonymous” data can often be de-anonymized. In 2006, Netflix released the data of about 500,000 customers in the interest of crowdsourcing improvements to its prediction algorithm. As was the standard at the time, the company removed all personally identifiable information from the data and released only customers’ movie ratings, thinking this would keep their identities hidden. It was wrong. Arvind Narayanan and Vitaly Shmatikov, researchers from the University of Texas at Austin, compared movie ratings from the Netflix dataset with publicly available IMDb data. Movie ratings are very personal (I can’t imagine anyone other than me 5-starring The Man With the Iron Fists, Hunter x Hunter, and My Cousin Vinny), so they were able to match names of IMDb users with Netflix accounts.
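To make the idea of matching one dataset against another concrete, here is a toy sketch in Python. It is not the researchers' actual method, which had to cope with noisy, partial data at a far larger scale. The names, ratings, and the `min_score` threshold are all invented for illustration.

```python
from typing import Dict, Optional

Ratings = Dict[str, int]  # movie title -> star rating


def overlap_score(a: Ratings, b: Ratings) -> int:
    """Count the movies that both profiles rated with the same number of stars."""
    return sum(1 for movie, stars in a.items() if b.get(movie) == stars)


def link(anon: Ratings, public: Dict[str, Ratings], min_score: int = 3) -> Optional[str]:
    """Return the public name whose ratings best match the anonymous record.

    Returns None when the best match is too weak or tied with the runner-up.
    The min_score threshold is an arbitrary choice for this toy example.
    """
    scored = sorted(
        ((overlap_score(anon, ratings), name) for name, ratings in public.items()),
        reverse=True,
    )
    if not scored:
        return None
    best_score, best_name = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else 0
    if best_score >= min_score and best_score > runner_up:
        return best_name
    return None


# Hypothetical "anonymized" record: the name is gone, the ratings remain.
anonymous_record: Ratings = {
    "The Man With the Iron Fists": 5,
    "Hunter x Hunter": 5,
    "My Cousin Vinny": 5,
    "Titanic": 3,
}

# Hypothetical public profiles, each with a name attached.
public_profiles: Dict[str, Ratings] = {
    "alice": {"The Man With the Iron Fists": 5, "Hunter x Hunter": 5, "My Cousin Vinny": 5},
    "bob": {"Titanic": 3, "My Cousin Vinny": 4},
}

print(link(anonymous_record, public_profiles))  # prints "alice"
```

Three unusual ratings in common are enough to single out one profile here, even though the "anonymized" record never contained a name.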
This research demonstrates two things: 1) Anonymous data often isn’t, and 2) it can be critically important for researchers to—as the U.K. might put it—“intentionally … re-identify individuals from anonymised or pseudonymised data.” Researchers need the ability to break privacy systems. When the options are a good guy picking your lock to convince you it’s broken, or a bad guy picking your lock to steal your passport, the choice is clear. The analysis from the University of Texas at Austin led to lawsuits against Netflix. More importantly, it warned the entire data industry that the privacy methods Netflix was using at the time were insufficient and should no longer be used.
The U.K.’s Information Commissioner’s Office anonymization code of practice considers data “anonymized” if it is not “reasonably likely” to be traced back to an individual. “Reasonably likely” isn’t well defined, but if only one research team in the world can de-anonymize the data, the data probably falls under their definition of anonymous. Under this definition, and the newly proposed laws, the Texas researchers would have committed a crime punishable by an apparently unlimited fine. Shmatikov, who is now a professor at Cornell University, views the U.K.’s proposed law as perfectly wrong. He told me that the kind of research that will keep people safe is “exactly the kind of activity that [the U.K.] is trying to penalize.” He later said he would not have conducted his research if it were penalized by this sort of law.
To be clear, I do not unequivocally support all research that seeks to break security systems. The leadership of the unnamed company that sold the FBI the software used to break into the San Bernardino shooter’s iPhone should be tarred, feathered, and marched down a busy street by a matronly nun ringing a bell. The company didn’t require the FBI to release the details of that security flaw, so the exploit might be sitting in your pocket right now, waiting to be abused. A white-hat hacker always releases his or her secrets.
But laws are not the cure for doxxing. The ongoing work of white-hat researchers like Shmatikov will not stop people from invading each other’s privacy, but it will keep all of our data safer.
And the new laws do offer protection for research in other related areas. For instance, citizens may be barred from having their data removed from university datasets if the institution can argue the data is “essential to the research.” It would be extremely easy to create a similar research exception in the de-anonymization statutes. Now, Shmatikov doesn’t see this as a perfect solution. In his eyes, good security research can be done by people outside academia. He also believes the harm from de-anonymization occurs when personal information is shared across the internet, not when someone is de-anonymized.
Still, the ICO has two options. It could add an exception to the de-anonymization laws for researchers who sequester the data they de-anonymize, whether the researchers are academic or not. Or it could penalize the sharing of de-anonymized data, rather than its creation. (Remember, the U.K. doesn’t have the First Amendment.) Both paths would discourage internet hordes from making the private public, without preventing researchers from improving the technology that will actually help.