When the richest man in the world is being sued by one of the most popular social media companies, it’s news. But while most of the conversation about Elon Musk’s attempt to cancel his $44 billion contract to buy Twitter focuses on the legal, social, and business components, we need to keep an eye on how the discussion relates to one of the tech industry’s buzziest products: artificial intelligence.
The lawsuit shines a light on one of the most essential issues for the industry to tackle: What can and can’t AI do, and what should and shouldn’t AI do? The Twitter v. Musk contretemps reveals a lot about how AI is understood in tech and startup land, and it raises issues about how we understand the deployment of the technology in areas ranging from credit checks to policing.
At the core of Musk’s claim for why he should be allowed out of his contract with Twitter is an allegation that the platform has done a poor job of identifying and removing spam accounts. Twitter has consistently claimed in quarterly filings that less than 5 percent of its active accounts are spam; Musk thinks it’s much higher than that. From a legal standpoint, it probably doesn’t really matter if Twitter’s spam estimate is off by a few percent, and Twitter’s been clear that its estimate is subjective and that others could come to different estimates with the same data. That’s presumably why Musk’s legal team lost in a hearing on Tuesday when they asked for more time to perform detailed discovery on Twitter’s spam-fighting efforts, suggesting that likely isn’t the question on which the trial will turn.
Regardless of the legal merits, it’s important to scrutinize the statistical and technical thinking from Musk and his allies. Musk’s position is best summarized in his filing from July 15, which states: “In a May 6 meeting with Twitter executives, Musk was flabbergasted to learn just how meager Twitter’s process was.” Namely: “Human reviewers randomly sampled 100 accounts per day (less than 0.00005% of daily users) and applied unidentified standards to somehow conclude every quarter for nearly three years that fewer than 5% of Twitter users were false or spam.” The filing goes on to express the flabbergastedness of Musk by adding, “That’s it. No automation, no AI, no machine learning.” Perhaps the most prominent endorsement of Musk’s argument here came from venture capitalist David Sacks, who quoted it while declaring, “Twitter is toast.” But there’s an irony in Musk’s complaint here: If Twitter were using machine learning for the audit as he seems to think they should, and only labeling spam that was similar to old spam, it would actually produce a lower, less-accurate estimate than it has now.
There are three components to Musk’s assertion that deserve examination: his basic statistical claim about what a representative sample looks like, his claim that the spam-level auditing process should be automated or use “AI” or “machine learning,” and an implicit claim about what AI can actually do.
The statistical question is one that any professional anywhere near the machine learning space should be able to answer (as could many high school students). Twitter uses a daily sampling of accounts to scrutinize a total of 9,000 accounts per quarter (averaging about 100 per calendar day) to arrive at its under-5 percent spam estimate. Though that sample of 9,000 users per quarter is, as Musk notes, a very small portion of the 229 million active users the company reported in early 2022, a statistics professor (or student) would tell you that that’s very much not the point. The precision of an estimate drawn from a random sample is determined almost entirely by the absolute size of the sample, not by what percentage of the population is sampled. As Facebook whistleblower Sophie Zhang put it, you can make the comparison to soup: It “doesn’t matter if you have a small or giant pot of soup, if it’s evenly mixed you just need a spoonful to taste-test.”
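For readers who want to check the soup analogy with arithmetic, here is a minimal sketch. It assumes the textbook normal approximation for a sampled proportion, with the finite population correction included so that the effect of population size is visible. The function name and the smaller population figures are illustrative choices; only the 9,000-account sample and the 229 million users come from the figures above.

```python
import math


def margin_of_error(sample_size, population_size, proportion=0.05, z=1.96):
    """Approximate 95% margin of error for a proportion estimated from a random sample.

    Normal approximation with the finite population correction (FPC).
    The 5% proportion and z = 1.96 are assumptions made for illustration.
    """
    standard_error = math.sqrt(proportion * (1 - proportion) / sample_size)
    fpc = math.sqrt((population_size - sample_size) / (population_size - 1))
    return z * standard_error * fpc


def main():
    # Same 9,000-account sample, wildly different "pots of soup".
    for population in (100_000, 1_000_000, 229_000_000):
        moe = margin_of_error(9_000, population)
        print(f"population {population:>11,}: +/- {moe * 100:.3f} percentage points")

    # Same 229-million population, different sample sizes.
    for sample in (90, 900, 9_000):
        moe = margin_of_error(sample, 229_000_000)
        print(f"sample {sample:>5,}: +/- {moe * 100:.3f} percentage points")


if __name__ == "__main__":
    main()
```

Under these assumptions, growing the population from 1 million to 229 million barely moves the margin of error, while shrinking the sample from 9,000 to 90 makes it roughly ten times wider.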
The whole point of statistical sampling is that you can learn most of what you need to know about the variety of a larger population by studying a much-smaller but decently sized portion of it. Whether the person drawing the sample is a scientist studying bacteria, or a factory quality inspector checking canned vegetables, or a pollster asking about political preferences, the question isn’t “what percentage of the overall whole am I checking,” but rather “how much should I expect my sample to look like the overall population for the characteristics I’m studying?” If you had to crack open a large percentage of your cans of tomatoes to check for their quality, you’d have a hard time making a profit, so you want to check the fewest possible to get within a reasonable range of confidence in your findings.
While this thinking does go against the grain of certain impulses (there’s a reason why many people make this mistake), there is also a way to make this approach to sampling more intuitive. Think of the goal in setting sample size as getting a reasonable answer to the question, “If I draw another sample of the same size, how different would I expect it to be?” A classic approach to explaining this problem is to imagine you’ve bought a great mass of marbles that are supposed to come in a specific ratio: 95 percent purple marbles and 5 percent yellow marbles. You want to do a quality inspection to ensure the delivery is good, so you load them into one of those bingo game hoppers, turn the crank, and start counting the marbles you draw, in each color. Let’s say your first sample of 20 marbles has 19 purple and one yellow; should you be confident that you got the right mix from your vendor? You can probably intuitively understand that the next 20 random marbles you draw could end up being very different, with zero yellows or seven. But what if you draw 1,000 marbles, around the same as the typical political poll? What if you draw 9,000 marbles? The more marbles you draw, the more you’d expect the next drawing to look similar, because random fluctuations tend to average out in larger samples.
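The marble thought experiment is easy to simulate. The sketch below is a rough illustration, not a rigorous analysis: it assumes a hopper so large that each draw is effectively an independent 5 percent chance of yellow, and the number of repeats and the random seed are arbitrary choices made here.

```python
import random


def yellow_share(sample_size, rng, true_yellow_rate=0.05):
    """Draw sample_size marbles and return the fraction that came up yellow."""
    yellows = sum(1 for _ in range(sample_size) if rng.random() < true_yellow_rate)
    return yellows / sample_size


def spread_of_repeated_samples(sample_size, repeats=200, seed=0):
    """Return the (lowest, highest) yellow share seen across repeated samples."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    shares = [yellow_share(sample_size, rng) for _ in range(repeats)]
    return min(shares), max(shares)


def main():
    for sample_size in (20, 1_000, 9_000):
        low, high = spread_of_repeated_samples(sample_size)
        print(f"{sample_size:>5} marbles per draw: "
              f"yellow share ranged from {low:.1%} to {high:.1%}")


if __name__ == "__main__":
    main()
```

The gap between the lowest and highest share should shrink sharply as the sample grows, which is the "how different would the next sample be?" intuition in numerical form.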
There are online calculators that can let you run the numbers yourself. If you only draw 20 marbles and get one yellow, you can have 95 percent confidence that the yellows would be between 0.13 percent and 24.9 percent of the total, which is not very exact. If you draw 1,000 marbles and get 50 yellows, you can have 95 percent confidence that yellows would be between 3.7 percent and 6.5 percent of the total; closer, but perhaps not something you’d sign your name to in a quarterly filing. At 9,000 marbles with 450 yellow, you can have 95 percent confidence the yellows are between 4.56 percent and 5.47 percent; your estimate is now accurate to within about half a percentage point in either direction, and at that point Twitter’s lawyers presumably told them they’d done enough for their public disclosure.
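The same numbers can be reproduced without an online calculator. The sketch below assumes the calculator uses the "exact" (Clopper-Pearson) method for a binomial proportion, which is an assumption on my part; other methods give slightly different endpoints. It uses only the Python standard library and finds each endpoint by bisection.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with terms computed in log space."""
    if k < 0:
        return 0.0
    if k >= n or p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)


def _bisect(func, lo=0.0, hi=1.0, iterations=60):
    """Find where an increasing function of p crosses zero on [lo, hi]."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if func(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(yellows, n, confidence=0.95):
    """Exact (Clopper-Pearson) confidence interval for a binomial proportion."""
    tail = (1 - confidence) / 2
    # Lower end: the rate at which seeing this many yellows or more has probability `tail`.
    lower = 0.0 if yellows == 0 else _bisect(
        lambda p: (1 - binom_cdf(yellows - 1, n, p)) - tail)
    # Upper end: the rate at which seeing this many yellows or fewer has probability `tail`.
    upper = 1.0 if yellows == n else _bisect(
        lambda p: tail - binom_cdf(yellows, n, p))
    return lower, upper


def main():
    for yellows, n in ((1, 20), (50, 1_000), (450, 9_000)):
        low, high = clopper_pearson(yellows, n)
        print(f"{yellows} yellow out of {n}: {low:.2%} to {high:.2%}")


if __name__ == "__main__":
    main()
```

If the assumption about the method holds, the three intervals should come out close to the ranges quoted above.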
This reality—that statistical sampling works to tell us about large populations based on much-smaller samples—underpins every area where statistics is used, from checking the quality of the concrete used to make the building you’re currently sitting in, to ensuring the reliable flow of internet traffic to the screen you’re reading this on.
It’s also what drives all current approaches to artificial intelligence. Specialists in the field almost never use the term “artificial intelligence” to describe their work, preferring to use “machine learning.” But another common way to describe the entire field as it currently stands is “applied statistics.” Machine learning today isn’t really computers “thinking” in anything like what we assume humans do (to the degree we even understand how humans think, which isn’t a great degree); it’s mostly pattern matching and pattern identification, based on statistical optimization. If you feed a convolutional neural network thousands of images of dogs and cats and then ask the resulting model to determine if the next image is of a dog or a cat, it’ll probably do a good job, but you can’t ask it to explain what makes a cat different from a dog on any broader level; it’s just recognizing the patterns in pictures, using a layering of statistical formulas.
Stack up statistical formulas in specific ways, and you can build a machine learning algorithm that, fed enough pictures, will gradually build up a statistical representation of edges, shapes, and larger forms until it recognizes a cat, based on the similarity to thousands of other images of cats it was fed. There’s also a way in which statistical sampling plays a role: You don’t need pictures of all the dogs and cats, just enough to get a representative sample, and then your algorithm can infer what it needs to about all the other pictures of dogs and cats in the world. And the same goes for every other machine learning effort, whether it’s an attempt to predict someone’s salary using everything else you know about them, with a gradient-boosted trees or random forest algorithm, or to break down a list of customers into distinct groups, with a clustering algorithm like k-means.
You don’t absolutely have to understand statistics as well as a student who’s recently taken a class in order to understand machine learning, but it helps. Which is why the statistical illiteracy paraded by Musk and his acolytes here is at least somewhat surprising.
But more important, it’s hard to see how anyone could successfully oversee the creation of a machine-learning product, or have a sound rationale for investing in a machine-learning company, without a decent grounding in the rudiments of machine learning and in where and how it is best applied to solve a problem. And yet Team Musk is suggesting here that they lack that knowledge.
Once you understand that all machine learning today is essentially pattern-matching, it becomes clear why you wouldn’t rely on it to conduct an audit such as the one Twitter performs to check for the proportion of spam accounts. “They’re hand-validating so that they ensure it’s high-quality data,” explained security professional Leigh Honeywell, who’s been a leader at firms like Slack and Heroku, in an interview. She added, “any data you pull from your machine learning efforts will by necessity be not as validated as those efforts.” If you only rely on patterns of spam you’ve already identified in the past and already engineered into your spam-detection tools, in order to find out how much spam there is on your platform, you’ll only recognize old spam patterns, and fail to uncover new ones.
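A toy simulation makes the underestimate concrete. Everything in it is hypothetical: the population size, the 8 percent true spam rate, the share of spam that matches known patterns, and the idealized assumptions that the pattern-matching detector never flags a legitimate account and that human reviewers always label correctly. It is a sketch of the argument, not a model of Twitter’s actual systems.

```python
import random


def simulate_audit(population=200_000, spam_rate=0.08, known_pattern_share=0.6,
                   sample_size=9_000, seed=1):
    """Compare two ways of estimating how much spam is on a made-up platform.

    Returns (true_rate, detector_rate, human_rate). All parameters are
    hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)

    # Each account is legitimate, spam that resembles past spam, or novel spam.
    accounts = []
    for _ in range(population):
        if rng.random() < spam_rate:
            kind = "old_spam" if rng.random() < known_pattern_share else "new_spam"
        else:
            kind = "legit"
        accounts.append(kind)

    true_rate = sum(kind != "legit" for kind in accounts) / population

    # Pattern-matching audit: scans every account but only recognizes old patterns.
    detector_rate = sum(kind == "old_spam" for kind in accounts) / population

    # Human audit: reviewers inspect a random sample and catch both kinds of spam.
    sample = rng.sample(accounts, sample_size)
    human_rate = sum(kind != "legit" for kind in sample) / sample_size

    return true_rate, detector_rate, human_rate


def main():
    true_rate, detector_rate, human_rate = simulate_audit()
    print(f"true spam rate:            {true_rate:.2%}")
    print(f"pattern-matching estimate: {detector_rate:.2%}")
    print(f"human-sample estimate:     {human_rate:.2%}")


if __name__ == "__main__":
    main()
```

Even though the detector examines every account and the reviewers see only 9,000, the detector’s estimate sits well below the true rate because it cannot count spam it was never built to recognize, while the small random sample lands close to the truth.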
Where Twitter should be using automation and machine learning is in identifying and removing spam outside of this audit function, which the company seems to do. It wouldn’t otherwise be possible to suspend half a million accounts every day and lock millions of accounts each week, as CEO Parag Agrawal claims. In conversations I’ve had with cybersecurity workers in the field, it’s quite clear that a large amount of automation is used at Twitter (though machine learning specifically is actually relatively rare in the field because the results often aren’t as good as other methods, marketing claims by allegedly AI-based security firms to the contrary).
At least in public claims related to this lawsuit, prominent Silicon Valley figures are suggesting they have a different understanding of what machine learning can do, and when it is and isn’t useful. This disconnect between how many nontechnical leaders in that world talk about “AI,” and what it actually is, has significant implications for how we will ultimately come to understand and use the technology.
The general disconnect between the actual work of machine learning and how it’s touted by many company and industry leaders is something data scientists often chalk up to marketing. It’s very common to hear data scientists in conversation among themselves declare that “AI is just a marketing term.” It’s also quite common to have companies using no machine learning at all describe their work as “AI” to investors and customers, who rarely know the difference or even seem to care.
This is a basic reality in the world of tech. In my own experience talking with investors who make investments in “AI” technology, it’s often quite clear that they know almost nothing about these basic aspects of how machine learning works. I’ve even spoken to CEOs of rather large companies that rely at their core on novel machine learning efforts to drive their product, who also clearly have no understanding of how the work actually gets done.
Not knowing or caring how machine learning works, what it can or can’t do, and where its application can be problematic could lead society to significant peril. Machine learning most often works by identifying a pattern in some dataset and applying that pattern to new data. If we don’t understand that, we can be led deep down a path in which machine learning, relying almost entirely on biased data or wrong-headed claims, is wrongly said to measure someone’s face for trustworthiness (when this is entirely based on surveys in which people reveal their own prejudices) or to predict crime (when many hyperlocal crime numbers are highly correlated with more police officers being present in a given area, who then make more arrests there).
If we’re going to properly manage the influence of machine learning on our society—on our systems and organizations and our government—we need to make sure these distinctions are clear. It starts with establishing a basic level of statistical literacy, and moves on to recognizing that machine learning isn’t magic—and that it isn’t, in any traditional sense of the word, “intelligent”—that it works by pattern-matching to data, that the data has various biases, and that the overall project can produce many misleading and/or damaging outcomes.
It’s an understanding one might have expected, or at least hoped, to find among some of those investing most of their life, effort, and money into machine-learning-related projects. If even people that deeply invested aren’t making the effort to sort fact from fiction, it’s a poor omen for the rest of us, and for the regulators and other officials who might be charged with keeping them in check.