The Myth of China’s Big A.I. Advantage

Calling data “the new oil” could hurt efforts to protect privacy.

Illustration by Slate. Photo by Getty Images Plus.

We hear all the time that “data is the new oil.” It’s the hottest new analogy to describe the ways in which data—primarily, access to data for training machine learning and artificial intelligence systems—is an important strategic resource of the 21st century. The use of the analogy also extends into the so-called A.I. race between the United States and China. According to Kai-Fu Lee, CEO of Sinovation Ventures and author of the book AI Superpowers, “If data is the new oil, China is the new OPEC.”

But the analogy of data as the new oil is seriously flawed, as some technologists have begun to point out. The conclusion policymakers are drawing from it, that access to the most data in and of itself provides the greatest advantage for A.I. development, threatens to produce bad policies based on misunderstandings of data’s role in fueling A.I. The risk is that policymakers could shy away from potential privacy legislation out of fear that putting checks on access to data will disadvantage the U.S.

Data, unlike oil, is not a scarce resource. Characterizing it as such ignores how quickly data is proliferating around us and how easily governments, corporations, research institutions, universities, and even individuals can find and collect it. Furthermore, oil is not reusable, whereas data can be used, copied, and modified many times over. This is not to say that the “data is the new oil” analogy is utterly unfounded. You could argue that the internet is an extractive industry, sucking value out of consumers’ data much as oil companies extract crude from the earth. But by and large, the analogy falls seriously short of reality.

The main issue with “data is the new oil,” however, lies not just with the figurative descriptor. Many have taken this analogy and derived from it the same conclusion: that access to the most data in and of itself provides the greatest strategic and commercial advantage because of the role data plays in fueling A.I. development. From Trump administration policy to media reports to op-eds, there is much talk in particular about “China’s data advantage” in this regard. Because of China’s large population and lax rules about government surveillance, the narrative goes, the country as a whole has a strategic A.I. advantage over the United States. In large part, this narrative has been fueled by Kai-Fu Lee’s AI Superpowers, in which he argues that access to data for training machine learning systems is the biggest deciding factor in global A.I. dominance. The problem is that exactly how important data is in A.I. development, let alone whether it is the single deciding factor, remains to be seen.

There are other key drivers of A.I. development and implementation, such as the hardware on which machine learning algorithms are developed. “One of the greatest limitations of progress in deep learning is the amount of computation available,” says MIT’s Vivienne Sze. After all, it was advancement in computing hardware that enabled decades-old machine learning techniques to power the A.I. revolution we see today. And this will hold going forward. Hardware with high throughput and energy efficiency is “critical to expanding the deployment of [deep neural networks] in both existing and new domains,” Sze and her colleagues found. Because “the most creative machine learning algorithms are hamstrung by machines that can’t harness their power,” an IBM vice president wrote, “if we’re to make great strides in AI, our hardware must change, too.” And software matters too. Using lots of data won’t fix a bad algorithm.

Many have also argued that talent is another driving force in A.I. development. “As adoption of AI gathers pace, the value of skills that can’t be replicated by machines is also increasing,” says a report by PricewaterhouseCoopers. “People will need to be responsible for determining the strategic application of AI and providing challenge and oversight to decisions.” Relatedly, a “lack of capable talent” (that is, of “people skilled in deep learning technology and analytics”) “may well turn out to be the biggest obstacle for large companies,” writes the former chief research scientist of the Australian Artificial Intelligence Institute. Human skill is required not just to design the technologies but also to implement them successfully (which China well recognizes, as shown by its investment in cultivating domestic A.I. talent). And like data, this human factor is not necessarily concentrated in one place. According to one study, China ranks higher than the U.S. for quality and volume of citations of A.I. research, but according to another, the U.S. ranks much higher for A.I. workforce talent writ large.

Even if data is the principal deciding factor in developing superior artificial intelligence applications, the question remains of what kind of data is most important. This is because systems trained on one kind of data do not generalize well to another. Machine learning systems trained on white faces don’t work well on darker ones. Likewise, a spoken-language-processing algorithm built for French doesn’t automatically understand every other language. American tech companies’ aggressive efforts to break into foreign markets and collect data, and some countries’ pushback over data sovereignty, are just more evidence of this fact. Different kinds of data, and data from different places in the world, are important should companies or other entities want to develop globally superior A.I. systems.

Here, China’s internet platforms have a distinct disadvantage given the challenges they face expanding outside of China’s market. The factors that have made China’s tech titans successful inside China, namely a closed and controlled system, may not translate well internationally. Having lots of data about what Chinese teenagers are buying online, for example, may have little bearing on developing A.I. applications that can compete in the rest of the world. That may be one of the reasons Chinese researchers (like their American counterparts) are using datasets from all over the globe.

It’s worth questioning whom this misleading data-as-oil analogy serves, as Graham Webster and Scarlet Kim have argued. (Webster is our colleague at New America; New America is a partner with Slate and Arizona State University in Future Tense.) Facebook has used this “China’s data advantage” rhetoric to argue against its own regulation, for instance. Should Facebook be subject to certain data collection rules, CEO Mark Zuckerberg has said, “we’re going to fall behind Chinese competitors” and others who lack such rules.

In addition, such a claim about data being the most important factor in A.I. development feeds into the flawed notion of an A.I. “arms race.” It implies that getting somewhere first is better, when first deployments of A.I. applications are in fact often unpredictable, chaotic, or error-prone—which is the case with many emerging technologies. To think that collecting the most data as soon as possible matters most is therefore to reach a flawed conclusion.

But many U.S. policymakers are already headed this way, and the results could be harmful. The belief that collecting the most data matters most, that volume in and of itself is the most important factor, risks leading U.S. regulators to abandon privacy rules for big tech out of fear that those rules will hurt national competitiveness. It also ignores the fact that competition over access to A.I. training data should not be prioritized in the same way for every kind of data. Priorities differ both by what the data is for (facial recognition, natural language processing, etc.) and by where it comes from (region, country, whom it describes, etc.).

At its core, the “data is the new oil” idea suggests that there has to be a trade-off between privacy and innovation. That’s simply not the case. Crafting policy that reflects this will go a long way toward restoring trust in U.S. big tech at home and around the world, while also helping fuel advances in A.I. that benefit humanity, from disease diagnosis to transportation safety. But writing bad policy based on misconceptions about the importance and nature of data in our increasingly A.I.-driven and A.I.-competitive world will only serve to undermine U.S. technological competitiveness and democratic protections around technology.

Future Tense is a partnership of Slate, New America, and Arizona State University that examines emerging technologies, public policy, and society.