Science

The Supreme Court Isn’t Equipped to Judge Harvard’s Discrimination Lawsuit

Courts have always been bad at statistics, and statistics is getting harder to adjudicate.

In a French courtroom in 2014, a Labrador named Tango was called to the stand to serve as a witness to the murder of his owner. Tango’s demeanor was carefully observed as the judge ordered the suspect to approach Tango with a bat. The prosecutor hoped the approach would elicit some dramatic response from Tango: Ferocious barking would have been ideal. But the dog appeared entirely uninterested. The absurdity of the stunt wasn’t lost on the defense lawyer, who inquired after how the humans in the court were supposed to interpret the canine’s response: “So if Tango lifted his right paw, moved his mouth or his tail, is he recognizing my client?”

We rarely call dogs to the witness stand because we don’t speak dog. But judges and juries are frequently expected to communicate fluently in almost equally foreign languages: the unnervingly specific languages of highly technical fields. The ongoing lawsuit alleging racial discrimination in Harvard College’s admissions process offers a particularly dubious example of how this plays out, and of how the expert witnesses courts rely on to bridge this knowledge gap don’t necessarily solve the problem.

To determine whether or not Harvard discriminates against Asian-American applicants, the court will have to assess some 700 pages of intricate analyses from two dueling, highly funded statisticians who arrived at diametrically opposed conclusions. In the likely scenario that the case makes its way to the Supreme Court, the hefty stack of reports will be plopped before nine justices who have virtually no formal statistical or scientific training. Sure, the Supreme Court justices have impressive academic credentials, but it is fair to wonder whether their cumulative expertise in law, history, English literature, political science, government, and philosophy has prepared them to evaluate a statistical dispute. And it’s likely that situations like this one will become both more frequent and more dubious in the future. With increasingly complex data flooding this “big-data” world, courts will need to evaluate ever-subtler statistical arguments. Are they prepared?

The question is more troubling when we consider that even simple statistical fallacies have duped the courts in the past. Consider this case from 1964: In the wake of a robbery in Los Angeles, someone reported seeing a blond woman with a ponytail commit the deed before jumping into a yellow getaway car driven by a bearded and mustachioed black man. Malcolm Collins and his wife Janet Collins, who matched the descriptions but could not be identified by the eyewitnesses, were initially convicted of the crime. The conviction hinged largely on the testimony of a mathematics instructor at a local state college who made the damning assertion that the probability of the Collinses’ innocence was 1 in 12 million. He came to that number by simply multiplying together some rough guesses for the probabilities of each of the separate attributes: About 1 in 4 men had a mustache. About 1 in 10 women had a ponytail. And so forth.

The case went to California’s Supreme Court and was overturned when the judges recognized subtle but crucial flaws in the mathematician’s argument. For instance, he didn’t take into account the fact that men with beards are more likely to have mustaches than men without beards, an error which skewed his calculation against Collins. The biggest fallacy at the root of it all was what is (fittingly) called the “Prosecutor’s Fallacy.” Even if a concordance of all those attributes (blond hair, ponytail, yellow car, etc.) is unlikely, innocence isn’t necessarily equally unlikely. Consider the fact that any lottery winner could be accused of cheating by the same logic: John must have cheated because the probability of him guessing 25, 28, 29, 47, 51, and 58 by chance alone is only 1 in 1 billion. A more revealing statistic was computed in the appeal: The probability that there were two couples in the Los Angeles area that both matched the description was 40 percent. A subtlety in statistical interpretation was the difference between near certain guilt and an easily conceivable coincidence.
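To make the arithmetic concrete, here is a rough sketch of both calculations. The six trait frequencies are the ones commonly reported from the trial record (only the mustache and ponytail figures are quoted above), and the pool of 12 million candidate couples is an illustrative assumption, not a census figure. The second function reflects one common reading of the appeal's calculation: given that one matching couple exists, how likely is it that at least one more does?

```python
from fractions import Fraction
import math

# Trait frequencies as commonly reported from the Collins trial record.
# Assumption: only the mustache and ponytail figures appear in the article.
trait_odds = {
    "yellow car": Fraction(1, 10),
    "man with mustache": Fraction(1, 4),
    "woman with ponytail": Fraction(1, 10),
    "woman with blond hair": Fraction(1, 3),
    "black man with beard": Fraction(1, 10),
    "interracial couple in a car": Fraction(1, 1000),
}

# The instructor's step: multiply the frequencies as if the traits were
# independent (they are not: beards and mustaches tend to go together).
match_probability = Fraction(1)
for odds in trait_odds.values():
    match_probability *= odds


def prob_second_match_given_one(p, n_couples):
    """P(at least two matching couples | at least one), for n_couples
    independent couples that each match with probability p."""
    log_q = math.log1p(-p)
    p_none = math.exp(n_couples * log_q)
    p_exactly_one = n_couples * p * math.exp((n_couples - 1) * log_q)
    return (1 - p_none - p_exactly_one) / (1 - p_none)


N_COUPLES = 12_000_000  # hypothetical pool size, chosen to equal 1/p
p_second = prob_second_match_given_one(float(match_probability), N_COUPLES)

print(match_probability)   # 1/12000000
print(round(p_second, 2))  # about 0.42
```

Under these assumptions, a 1-in-12-million description still leaves odds of a bit over 40 percent that some other couple fits it too, which is a long way from 1-in-12-million odds of innocence.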

The Collins case had a relatively happy ending. The system corrected its statistical misstep, and the case became a standard example in legal pedagogy for the misuse of statistical evidence. But courts continued to make the same elementary statistical errors, sometimes with tragic outcomes. For example, in 1996 and 1998, each of Sally Clark’s two sons died shortly after birth. In 1999, she was charged with double murder when pediatrician Roy Meadow stated that the probability of both sons dying of sudden infant death syndrome was 1 in 73 million. In his calculation, Meadow had in fact made the same “beard and mustache” error that the mathematician in the Collins case made 30 years earlier. Just as a mustache is more likely on a man who already has a beard, a second case of SIDS is much more likely for a mother who has already had a first. The jury had also fallen for the prosecutor’s fallacy: The fact that two cases of SIDS are rare didn’t imply that Clark was guilty. Clark spent three years in prison before the ruling was overturned on the basis of exculpatory forensic evidence. After her release, her psychiatric health worsened, and she died of alcohol poisoning in 2007.
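A short sketch shows how much the independence assumption matters here. The single-death figure of 1 in 8,543 is the one widely reported as the basis for Meadow's number and is not given in this article; the 1-in-100 risk of a second death after a first is purely hypothetical, chosen only to show the direction and size of the effect.

```python
from fractions import Fraction

# Widely reported single-SIDS risk behind the "1 in 73 million" figure.
# Assumption: this number is not stated in the article itself.
P_SIDS_SINGLE = Fraction(1, 8543)

# Squaring treats the two deaths as independent events.
p_independent = P_SIDS_SINGLE ** 2

# Hypothetical conditional risk of a second SIDS death given a first.
# Purely illustrative: not an estimate from any study.
P_SECOND_GIVEN_FIRST = Fraction(1, 100)
p_dependent = P_SIDS_SINGLE * P_SECOND_GIVEN_FIRST

print(1 / p_independent)                   # 72982849
print(1 / p_dependent)                     # 854300
print(float(p_dependent / p_independent))  # 85.43
```

With these made-up inputs, allowing the two deaths to be related makes a double SIDS case about 85 times more likely than the courtroom figure suggested. Even the corrected number would say nothing on its own about guilt, since that depends on how rare the alternative explanation is.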

And there’s another component to the problem. Even as courts have kept falling into the same old statistical potholes, statistics has undergone a major transformation. Modern computers have enabled us to create and collect unprecedented amounts of data at unprecedented speeds. Statistical software has advanced to the point where many common statistical tests are now trivial to perform; tasks that once took painful days now take seconds.

Paradoxically, this newfound agility actually opens up a big can of worms. A FiveThirtyEight article by Christie Aschwanden and Ritchie King, “Science Isn’t Broken,” provides a beautiful illustration of why. Aschwanden and King include an interactive graphic that I’d recommend you fiddle around with. It puts you in the shoes of a modern statistician trying to determine whether Democrats are good or bad for the American economy. This exercise helps you realize that in order to assess this question, a few judgment calls need to be made. How do I define “good for the economy?” How do I measure how “Democratic” or “Republican” a government is? After making these decisions, the results of standard statistical tests pop up on your screen. The tool makes clear just how critical these subjective decisions are to the end result you get. It soon becomes obvious that you can point and click your way to any conclusion you like. For example, Democratic presidents are good for employment, but Democratic governors are bad for inflation. Depending on precisely how you squint at the data, you can make a case that Democrats are either better or worse for the economy.
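The same effect can be reproduced with pure noise. The simulation below is a simplified stand-in for the FiveThirtyEight tool, not a reconstruction of it: each hypothetical analyst tries 20 ways of defining the comparison, every one of which is random data with no real effect, and stops at the first "statistically significant" result. The function names, the sample sizes, and the treatment of each specification as an independent draw are all assumptions made for illustration (real specifications share data and are correlated, which shrinks the effect somewhat).

```python
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic for the difference in means of two samples."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se


def analyst_finds_effect(rng, n_specs=20, n_per_group=50, threshold=1.96):
    """Try n_specs analyses of pure noise; report whether any one of them
    crosses the conventional significance threshold."""
    for _ in range(n_specs):
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if abs(welch_t(group_a, group_b)) > threshold:
            return True
    return False


def share_of_analysts_with_a_finding(n_analysts=300, seed=0):
    """Fraction of simulated analysts who can report a 'significant' effect."""
    rng = random.Random(seed)
    hits = sum(analyst_finds_effect(rng) for _ in range(n_analysts))
    return hits / n_analysts


share = share_of_analysts_with_a_finding()
print(f"{share:.0%} of analysts found an effect in data that contains none")
```

A single honest test of noise comes up significant about 5 percent of the time. With 20 tries, the expected share is roughly 1 - 0.95^20, or about 64 percent, so most of these analysts walk away with a publishable story.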

People often talk about “cooking the data.” I think we can go a long way with this culinary analogy. Statisticians are like chefs. With only fire, salt, and pepper, an ancient chef handed a rancid steak had few options. But a modern chef, with kitchen gadgets and a pantry full of produce and spices, might transform a shoe into a delightful beef bourguignon. Likewise, newfound access to rich, detailed data sets combined with the tools to quickly navigate them allows statisticians to poke and prod their data until they find a recipe they like. In careful, balanced hands, complex data provides an opportunity for real insight. But a motivated statistician can also easily take advantage of complex data to cook any story he likes. In a world swimming with data, statisticians have emerged as the sages who can tell you anything you want to hear.

It’s therefore unsurprising that statistics would be a tempting component of legal arguments. In a case like Harvard’s discrimination lawsuit, the tome of detailed admissions data supplied is more than any statistician would need to craft a convincing case in either direction.

I’m not saying expert witnesses are intentionally lying. For one thing, like all witnesses, they are sworn to tell “the truth, the whole truth, and nothing but the truth,” and breaching that oath is perjury. For another, the enterprise of “cooking data” is entirely antithetical to the craft of a statistician. Statisticians are more like judges by nature: The central objective of their craft is to give a fair and unbiased assessment of a complicated corpus of data.

The problem is that there are ways around these hurdles, ones that often have nothing to do with a statistician’s integrity. Prosecutors and defendants are free to shop around until they find an expert witness whose analysis aligns with their needs. And then, of course, there is the fact of exorbitant compensation. For instance, David Card, Harvard’s expert witness, is compensated $750 per hour for his services. Peter Arcidiacono, the expert witness hired by the Students for Fair Admissions, is compensated $450 per hour. Legally, an expert witness’s compensation cannot depend on the outcome of the case: Expert witnesses are formally compensated only for their time. But if an expert witness’s analysis doesn’t support his client, that client won’t require any more of the expert’s time, perhaps creating not-so-subtle incentives.

Regardless of what the statistician’s intentions are, the fact remains that when two expert witnesses butt heads in a case, they dependably line up on opposite sides of the argument. This is clearly the case in the Harvard discrimination lawsuit: Skimming the two reports side by side, it’s almost inconceivable that the expert witnesses, both senior statistics professors at prestigious universities, are looking at the same data.

A few choice snapshots. Arcidiacono claims: “There is strong statistical evidence that Harvard employed a floor for African-American admits … ” Card rebuts: “ … a ‘floor’ for the admission rate of African-American applicants is not supported by available data.” When Arcidiacono claims the data are “ … indicative of a penalty against Asian-American applicants in the scoring of the personal ratings,” Card maintains that there is “ … no reliable evidence that the personal rating is biased against Asian-American applicants.” “Asian-American applicants also suffer a statistically significant penalty relative to white applicants,” claims Arcidiacono. To which Card doubles down: “ … Asian-American applicants are admitted at a slightly higher rate than White applicants.” Up is down.

The central question: “Does Harvard discriminate against Asian-American applicants?” It should be entirely unsurprising that Card arrives at an emphatic “No” and Arcidiacono arrives at an emphatic “Yes.” But it’s unlikely that either is outright lying. The “truth” and “nothing but the truth” clauses are probably upheld. It’s “the whole truth” that’s the rub. As Joseph Kadane, a professor of statistics at Carnegie Mellon and a longtime expert statistical witness, points out, that bar is very different from “only those truths that help my client.”

One of the most striking claims in Arcidiacono’s report was that an Asian-American male applicant with a 25 percent chance of acceptance would have a 95 percent chance if he were African-American. The claim isn’t exactly a lie, but it is substantially misleading. For each of the 160,000 applicants in the available data, Arcidiacono’s model allows him to estimate the change in acceptance probability that an applicant would see if he or she were a different race. Arcidiacono simply cherry-picked the single Asian-American in the entire applicant pool for whom a hypothetical change of race would have had the largest effect. When he refers to “an Asian-American male applicant,” he doesn’t mean “a typical Asian-American male applicant.” Arcidiacono’s statement is true for a single, and probably exceedingly unusual, individual.

To be sure, “the whole truth” might be an unreasonable bar. The trove of data given to Card and Arcidiacono is incredibly complex—years of admissions data with incredibly detailed information on each and every applicant. Any summary statement about the sprawling data will inevitably require some level of omission. Card and Arcidiacono needed to make some discretionary choices, just as Aschwanden and King made you do. Whether those choices are well-reasoned, or whether they are trimming, seasoning, and marinating the data until it serves their purposes, is the question. And it’s entirely likely that the courts—long fumbling over basic probability—are ill-equipped to answer it.

It doesn’t have to be like this: There’s more than one possible way to tame this circus. A particularly straightforward fix would be to have a third, truly neutral expert witness participate in the proceedings. When a seemingly irreconcilable disagreement arises between the defense and prosecuting witnesses, this third expert witness could be consulted. Rather than being paid by one side, the neutral expert could be compensated equally by both. This arrangement isn’t entirely impractical: It already happens sometimes, though rarely.

There’s nothing special about the ability of statisticians to develop entirely different opinions about the same data. This is precisely the craft of a lawyer: Wrestle with the facts of a case until you find the most compelling possible position for your client. But there might be a larger negative externality in scientists abiding by a lawyer’s rules of engagement: If science can be wielded as a weapon sold to the highest bidder, then the public may rightly begin to question the credibility of science as a path to truth.

This exact type of damage has already been done to the credibility of psychiatrists, often called to the stand to attest to sanity or insanity. The title of Margaret Hagen’s book sums up a growing sentiment: Whores of the Court: The Fraud of Psychiatric Testimony and the Rape of American Justice. It would be destructive to the enterprise of science if statisticians started developing such a caricature in the public eye. Statisticians are supposed to be the sober, level-headed referees of science. When the referees can’t be trusted, the game’s outcome is meaningless.

Thanks to professors Joseph Kadane and Bruce Levin for their invaluable insights.