Who Trained Your A.I.?

Artificial intelligence systems are only as good as the data used to teach them. A lot of that data is old and biased—and quietly shaping our future.

Emails from the defunct Enron Corp., once headed by CEO Kenneth Lay, have become a commonly used data set for training A.I. systems.

Photo illustration by Slate. Photo by Stephen Jaffe/Getty Images. Illustration by Lisa Larson Walker.

The problem with artificial intelligence isn’t that it’s going to get too smart and take over the world, as Elon Musk likes to warn. The problem, rather, is that A.I. is too dumb and kind of already has.

Artificial-intelligence software, which automates decision-making, is used by judges to help inform criminal sentencing. It’s used by Google to surface the address of your favorite pharmacy when you type “pharmacy” into your smartphone. Hospitals use A.I. to design treatment plans, and the Associated Press uses it to pen articles about minor league sports.

But even as A.I. nears ubiquity, it’s not clear that the technology is ready for prime time. Accounts of biases and inequities perpetuated by these systems abound. Take what happened last year when Kabir Alli, a high school student in Virginia, used Google to search for “three black teenagers” and the algorithm surfaced an array of mug shots. (When he searched for “three white teenagers,” the search engine returned a page of smiling young people.) Also in 2016, a study by ProPublica found that popular A.I. software used to rate criminal defendants’ likelihood of committing future crimes, a rating that is then used to help determine bail and sentencing decisions, had extreme racial biases built in; the software, called COMPAS, was far more likely to incorrectly label a black defendant as prone to recidivism than a white defendant. The year before that, an online advertising study found that Google’s A.I. system showed fewer ads for high-paying jobs to women than it did to men.

These systems are often built in ways that unintentionally reflect larger societal prejudices, and one reason is that, in order to learn, these smart machines must first be fed large sets of data. And it’s humans, with all their own biases, who are doing the feeding. But it turns out that it’s not just individuals’ biases that are causing these systems to become bigoted; it’s also a matter of what data is legally available to teams building A.I. systems to feed their thinking machines. And A.I. data sets may have a serious copyright problem that exacerbates the bias problem.

There are two main ways of acquiring data to build A.I. systems. One requires constructing a platform that collects the data itself, like how people give over their personal information to Facebook for free. (Facebook, for example, probably has one of the best collections in the world of information about the ways people communicate, which could be used to build an incredible natural-language A.I. system.) The other way to get data to build artificial intelligence software is to buy or acquire it from somewhere else, which can lead to a whole host of problems, including trying to obfuscate the use of unlicensed data, as well as falling back on publicly available data sets that are rife with historical biases, according to a recent paper from Amanda Levendowski, a fellow at the NYU School of Law.

When a company depends on data it didn’t collect itself, it has less incentive to open its A.I. systems to scrutiny, explains Levendowski, because a company found to be making its A.I. systems smarter with unlicensed data could be held liable. What’s more, because large data sets are often copyrighted, data is regularly pulled from collections that are either in the public domain or have otherwise been made public, through WikiLeaks or in the course of an investigation, for example. Works that are in the public domain are not subject to copyright restrictions and are available for anyone to use without paying. The problem with turning to public-domain data, though, is that it is generally old, which means it may reflect the mores and biases of its time. From what books got published to what subjects doctors chose to study, the history of racism and sexism in America is, in a sense, mirrored in old published data that is now available for free. And when it comes to data sets that were leaked or released during a criminal investigation, the problem is that the data is often publicly available precisely because it is so controversial and problematic.

A perfect example of this, according to Levendowski, is the Enron email corpus, one of the most influential data sets for training A.I. systems in the world. The data set is composed of 1.6 million emails sent between Enron employees that were publicly released by the Federal Energy Regulatory Commission in 2003. “If you think there might be significant biases embedded in emails sent among employees of [a] Texas oil-and-gas company that collapsed under federal investigation for fraud stemming from systemic, institutionalized unethical culture, you’d be right,” writes Levendowski. “Researchers have used the Enron emails specifically to analyze gender bias and power dynamics.” In other words, the most popular email data set for training A.I. has also been recognized by researchers as something that’s useful for studying misogyny, and our machines may be learning to display the same toxic masculinity as the Enron execs.

Likewise, it wouldn’t be surprising if the 20,000 hacked emails from John Podesta, Hillary Clinton’s campaign chairman and a former White House chief of staff, which WikiLeaks published in a machine-readable format last year, become A.I. training data, too, since using the data is unlikely to draw legal pushback, according to Levendowski.

Teams building artificially intelligent software have to train their software on something, and that usually means whatever data is legally available to them—even if that data isn’t ideal. So what can be done?

One thing that would help would be clarity about whether using copyrighted data to build A.I. systems should be considered fair use and therefore not a violation of copyright law. That’s an issue that hasn’t been litigated in the courts yet, and until that happens, Levendowski says, A.I. makers are likely to continue to fall back on biased, easily accessible, and legally noncontroversial data sets, and to go to great pains to make sure the inner workings of their A.I. systems are locked down to prevent potential copyright violations from surfacing.

Now that data is being used to train machines to think for us, transparency and accountability are not the only things we need to worry about. We also need to ensure that the future our A.I. systems are helping us build doesn’t repeat the injustices of the past.