Data is the lifeblood of artificial intelligence, and despite estimates that the world will generate more data over the next three years than it has in the previous 30, there still isn’t enough of it to supply the booming A.I. industry.
Amazon can predict your buying habits because its algorithms are trained on the data collected from its 112 million Prime subscribers in the U.S. and the tens of millions of other people around the world who visit the site and use its other products on a regular basis. Google’s advertising business depends on predictive models fueled by the billions of internet searches it processes each day and data from the 2.5 billion devices running the Android operating system. The tech giants have carved out these massive data monopolies, and that gives them near-impenetrable advantages in the field of A.I.
So how is a small A.I. startup to train its models to compete? Data collection is a time-consuming and expensive process. What about a hospital chain that wants to harness A.I. to better diagnose diseases but can’t use its own patient data due to federal privacy laws and cybersecurity concerns? Or a credit scoring agency seeking to model risky behavior that doesn’t want to use sensitive consumer information?
The answer, increasingly, is to use synthetic data—created by A.I., for A.I. In many cases, it’s a cheaper and faster option, but it carries a risk: The techniques used to generate realistic-looking data can also exacerbate harmful biases in that data.
Synthetic data comes in many forms, from images of fake faces that are indistinguishable from real ones to statistically realistic purchasing patterns for thousands of fictional customers. Executives at multiple synthetic data companies—including established firms like GenRocket and startups such as Mostly AI, Hazy, and AI Reverie—said they’ve seen a huge growth in demand for boutique data sets over just the past two years. Companies can also turn to open-source tools like Synthea, which researchers at institutions including the U.S. Department of Veterans Affairs use to create realistic medical histories for thousands of fake patients in order to study disease patterns and treatment paths.
Executives at multiple for-profit synthetic data companies, as well as at Mitre Corp., which created Synthea, have seen an explosion of interest in their services over the past several years. With that growth, though, comes potential peril for algorithms that are increasingly used to make life-changing decisions—and increasingly shown to amplify racism, sexism, and other harmful biases in high-impact areas like facial recognition, criminality prediction, and health care decision-making. Researchers say that in many cases, training an algorithm on algorithmically generated data increases the risk that an artificial intelligence system will perpetuate harmful discrimination.
“That process of creating a synthetic data set, depending on what you’re extrapolating from and how you’re doing that, can actually exacerbate the biases,” says Deb Raji, a technology fellow at the AI Now Institute. “Synthetic data can be useful for assessment and evaluation [of algorithms], but dangerous and ultimately misleading when it comes to training [them].”
One of the most common ways to create synthetic data is with a generative adversarial network, or GAN, a method developed in 2014 whereby two neural networks are pitted against each other. The first network, the generator, starts from random noise and attempts to synthesize data realistic enough to fool the second network, the discriminator, which is shown both real training data and the generator’s output and tries to tell them apart. The generator never sees the real data directly; it learns only from how well it fooled the discriminator. The more the two networks compete in this feedback loop, the better they each get at their task, resulting in a synthetic data set that can be, statistically and to the naked eye, nearly indistinguishable from the real thing.
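The competition between the two networks can be written down as a pair of scores, one for each side. What follows is a minimal sketch in Python, not a working GAN: it assumes the judging network (the discriminator, in the usual terminology) outputs a probability between 0 and 1 that a sample is real, and it uses the common "non-saturating" form of the generator's loss. The function names and the example probabilities are made up for illustration.

```python
import math

def discriminator_loss(p_real, p_fake):
    """Cross-entropy loss for the judging network.

    p_real: probabilities it assigned to real samples (it wants these near 1).
    p_fake: probabilities it assigned to generated samples (it wants these near 0).
    """
    real_term = -sum(math.log(p) for p in p_real) / len(p_real)
    fake_term = -sum(math.log(1.0 - p) for p in p_fake) / len(p_fake)
    return real_term + fake_term

def generator_loss(p_fake):
    """Loss for the generating network: low when its fakes are scored as real."""
    return -sum(math.log(p) for p in p_fake) / len(p_fake)

# Hypothetical snapshot early in training: the fakes are easy to spot.
early = {"p_real": [0.9, 0.95, 0.9], "p_fake": [0.1, 0.05, 0.1]}
# Hypothetical snapshot later: the judge can only guess 50/50.
late = {"p_real": [0.5, 0.5, 0.5], "p_fake": [0.5, 0.5, 0.5]}

for name, snap in (("early", early), ("late", late)):
    print(name,
          "discriminator:", round(discriminator_loss(**snap), 3),
          "generator:", round(generator_loss(snap["p_fake"]), 3))
```

In the early snapshot the discriminator's loss is low and the generator's is high; by the late snapshot the positions have shifted, which is the sense in which each network's improvement pressures the other.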
The GAN method can be problematic, though, because “algorithms are lazy—they always try to find the easiest way to make a prediction,” says Harry Keen, CEO of Hazy, a London-based company that creates synthetic data for financial services companies, telecoms, and governments. And when it comes to extrapolating from data sets about real people, GANs often achieve their goal by following the path of least resistance and ignoring outliers (women and people of color in a data set of Fortune 500 CEOs, for example). That kind of algorithmic discrimination can occur with real data—take, for example, the automated hiring system Amazon had to scrap after discovering it favored men over women due to the historical employment data it was trained on—but GAN-generated synthetic data can amplify the bias.
In a study from January, researchers at Arizona State University demonstrated this phenomenon. (Disclosure: ASU is a partner with Slate and New America in Future Tense.) They started with a data set composed of 17,245 images of engineering professors from universities across the country, 80 percent of whom were male and 76 percent of whom were white. They then trained a GAN on that data set to create synthetic images. The result? A data set of highly realistic faces that were 93 percent male and 99 percent white.
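The scale of that shift is easier to see from the minority side of the ledger. The short Python calculation below uses only the four percentages reported above; the helper name is illustrative, and it treats everyone outside the majority group as a single "minority" share, which is a simplification.

```python
def minority_share_drop(majority_before_pct, majority_after_pct):
    """Relative drop in the minority group's share between real and synthetic data."""
    minority_before = 100 - majority_before_pct
    minority_after = 100 - majority_after_pct
    return (minority_before - minority_after) / minority_before

# Not male: 20 percent of the real images, 7 percent of the synthetic ones.
gender_drop = minority_share_drop(80, 93)
# Not white: 24 percent of the real images, 1 percent of the synthetic ones.
race_drop = minority_share_drop(76, 99)

print(f"Share of non-male faces fell by {gender_drop:.0%}")
print(f"Share of non-white faces fell by {race_drop:.0%}")
```

Put that way, the share of faces that were not male fell by about 65 percent, and the share that were not white fell by about 96 percent.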
In the language of the industry, the synthetic photos the GAN generated were “accurate”: They looked to the human eye and statistical models like adult human faces, rather than random assortments of pixels or some other object. But in a real-world sense, the data set as a whole was misleading because the existing bias was amplified. Had it been used for a purpose like hiring new engineering professors, it would have perpetuated real-world discrimination.
Julia Stoyanovich, a computer science professor at New York University, says the debate in the industry shouldn’t be “accuracy versus fairness.” That is, companies don’t have to choose. Instead, “the data should represent the world how it should be.”
Very recently, some synthetic data companies have turned their attention toward generating data sets that are just that: both accurate and fair. Hazy and Mostly AI, a Vienna-based company, have experimented with methods for controlling the biases of data in ways that can actually reduce harm—“distorting reality,” as Keen calls it, to ensure that a particularly harmful pattern contained in the real-world data doesn’t make its way into the synthetic data set.
In May, Mostly AI published a discussion of two of its experiments. In the first, researchers started with income data from the 1994 U.S. census and sought to generate a synthetic data set in which the proportion of men and women who earned more than $50,000 a year was more equal than in the original data. In the second, they used data from a controversial recidivism prediction program to generate a synthetic data set in which criminality was less linked to gender and skin color. The resulting data sets aren’t strictly “accurate” (women did earn less in 1994, and still do, and Black men are arrested at a higher rate than other groups), but they are far more useful in contexts where the goal is not to perpetuate sexism and racism. A synthetic data set generated to equalize the income gap between men and women, for example, could help a company make fairer decisions about how much to compensate its employees.
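Mostly AI's own approach builds the fairness constraint into the generator, and its details are not reproduced here. A much simpler, older technique gives the flavor of the idea: "reweighing," which assigns each record a weight so that the outcome becomes statistically independent of the group. The Python sketch below swaps in that technique as an illustration; the records are invented toy numbers, not census figures, and the function names are hypothetical.

```python
from collections import Counter

def reweigh(records):
    """Return a weight per (group, label) pair that makes label independent of group.

    Weight = P(group) * P(label) / P(group, label), estimated from counts.
    """
    n = len(records)
    group_counts = Counter(g for g, _ in records)
    label_counts = Counter(y for _, y in records)
    pair_counts = Counter(records)
    return {
        (g, y): (group_counts[g] * label_counts[y]) / (n * pair_counts[(g, y)])
        for (g, y) in pair_counts
    }

def weighted_positive_rate(records, weights, group):
    """Share of a group's total weight that falls on label 1 (high earner)."""
    total = sum(weights[(g, y)] for g, y in records if g == group)
    positive = sum(weights[(g, y)] for g, y in records if g == group and y == 1)
    return positive / total

# Invented toy data: (group, 1 if high earner else 0).
# 30 of 100 men and 10 of 100 women are high earners.
records = ([("M", 1)] * 30 + [("M", 0)] * 70 +
           [("F", 1)] * 10 + [("F", 0)] * 90)

weights = reweigh(records)
rate_m = weighted_positive_rate(records, weights, "M")
rate_f = weighted_positive_rate(records, weights, "F")
print(f"weighted high-earner rate, men: {rate_m:.2f}, women: {rate_f:.2f}")
```

After reweighing, both groups show the same high-earner rate as the data set overall (20 percent in this toy case), even though the raw counts are unchanged. A model trained on the weighted records no longer has a statistical reason to link the outcome to gender, which is the kind of deliberate distortion Keen describes.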
These experiments are in their early stages, and even if the methods are perfected, there remains a significant barrier to their widespread adoption: Companies don’t seem to care as much about fairness as they do about accuracy to the original data. “There’s always another priority, it seems,” says Daniel Soukup, a data scientist leading Mostly AI’s fairness research. “You’re trading off revenue against making fair predictions, and I think that’s a very hard sell for these institutions and these organizations. … At the end of the day, this company [Mostly AI], in addition to being a small startup, is for-profit.”
The small group of academics who research bias in synthetic data hold out hope that new techniques will lead to A.I. models that reflect (and manifest) the world in which we want to live, rather than perpetuating centuries of systemic racism and sexism. “I’m really optimistic,” says Bill Howe, a University of Washington professor who studies synthetic data. “There doesn’t seem to be any reason why we can’t use these methods to do a better job than we do now.” Except that, at the moment, synthetic data buyers aren’t asking for fairer data, and companies aren’t inclined to invest in developing the methods to create it without that financial incentive.
Months after the Arizona State University researchers released a study demonstrating how a GAN exacerbated the racial and gender biases in a facial image data set, a group of Ph.D. candidates at Stanford University proved they could do the opposite. In their paper, which they presented July 14 at the International Conference on Machine Learning, the Stanford group outlined a method that allowed them to weight certain features—in this case gender and hair color—as more important than others in order to generate a more diverse set of facial images.
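The general idea of weighting some attributes more heavily can be illustrated with a toy resampling exercise. This is only a sketch under stated assumptions, not the Stanford group's method: it uses plain inverse-frequency weights over combinations of gender and hair color, and the attribute counts are invented.

```python
import random
from collections import Counter

def inverse_frequency_weights(attributes):
    """Weight each item by 1 / (number of items sharing its attribute combination)."""
    counts = Counter(attributes)
    return [1.0 / counts[a] for a in attributes]

def resample(items, weights, k, seed=0):
    """Draw k items with replacement, in proportion to their weights."""
    rng = random.Random(seed)
    return rng.choices(items, weights=weights, k=k)

# Invented toy data set of 100 faces, described only by (gender, hair color).
faces = ([("male", "dark")] * 70 + [("male", "blond")] * 20 +
         [("female", "dark")] * 8 + [("female", "blond")] * 2)

weights = inverse_frequency_weights(faces)
balanced = resample(faces, weights, k=4000)

before = Counter(faces)
after = Counter(balanced)
for combo in before:
    print(combo,
          f"before: {before[combo] / len(faces):.0%}",
          f"after: {after[combo] / len(balanced):.0%}")
```

Because each combination's weights add up to the same total, the resampled set draws from the four combinations at roughly equal rates, so the rarest group is no longer a rounding error. The trade-off is visible in the toy data too: the two blond women in the original set are drawn again and again, which is one reason reweighting alone cannot manufacture diversity that the source data lacks.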
The Stanford group told Slate that more work needs to be done before the method is ready to generate data that could responsibly be used to train algorithms with real-world impacts. But their success is evidence that, should the commercial synthetic data industry and its customers decide to do so, it’s possible to use synthetic data as a tool that purposefully combats harmful bias rather than one that unintentionally feeds it. It’s not a problem that can be solved by a single algorithm or technique. Doing it correctly will require continual attention from the end users of synthetic data, who are best positioned to know the biases likely to crop up in their field, and a willingness on their part to combat those biases.
Future Tense is a partnership of Slate, New America, and Arizona State University that examines emerging technologies, public policy, and society.