We’re just days away from kickoff, folks, and the Super Bowl of sports analysis is already underway.
It’s a drama of rival against rival and a race against the clock. It’s precision, persistence, power.
A gritty band of physicists and data specialists is locked in smash-mouth combat, a battle etched in p’s and chi’s. Some will emerge victorious, with calculations made eternal. The rest will save their spreadsheets for another day.
They call it statistical analysis. And they do it under the autumn moon, in the heat of a Texas afternoon, in the ice-bucket chill of a Wisconsin winter. And thousands come to watch or sleep, to cheer or stand in silent adulation.
In the past few days, we’ve seen an epic struggle over numbers, a fierce and unrelenting war of rapid peer review. If there’s a lesson to be learned from all this pregame hype—if it is indeed a “teachable moment,” as some have lately claimed—then let it be for scientists, not screaming sports fans. If nothing else, #deflategate shows us how the feisty mores of football-talk can be applied to research.
In case you somehow missed it, coverage of this weekend’s Super Bowl has been hijacked by a technical debate. The favored New England Patriots stand accused of illegally deflating game equipment to gain a competitive advantage. League investigators found that the team had been using floppy footballs—filled to an air pressure of 10.5 pounds per square inch instead of the minimum 12.5—for the first half of its semifinal game against the Indianapolis Colts. While no one thinks this could have changed the outcome of that game—the Patriots prevailed by a score of 45–7—sideline experts noted that a limp football might be easier to throw and catch, especially in the slick, rainy weather in Massachusetts that day.
That first round of chatter, more punditry than science, soon gave way to a surreal and wonderful debate. At a news conference Saturday, Patriots coach Bill Belichick blamed the loss in air pressure on a change in temperature, and suggested that his team should not be charged, under the auspices of the ideal gas law. A physicist at MIT said that the coach had the science right, as did a former researcher at NASA. (High school teachers used his comments as a teaching tool.) But the cosmologist Neil deGrasse Tyson went the other way on social media, in a post retweeted 12,000 times, and Bill Nye the Science Guy agreed. (Tyson later walked back his claims, at least a bit, under cross-examination.) Befuddled by the dueling claims, lawyers hired by the NFL sent a plaintive email to Columbia University: “[We] would like to discuss engaging a professor of physics to consult on matters relating to gas physics and environmental impacts on inflated footballs.”
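For readers who want to check the coach's physics at home, here is a minimal sketch of the ideal gas law argument. It assumes the ball's volume stays fixed, so that absolute pressure scales with absolute temperature. The locker-room and field temperatures (70 and 50 degrees Fahrenheit) are illustrative guesses, not measurements reported anywhere in this debate, and the function names are invented for the example.

```python
# Sketch of the fixed-volume ideal gas law: P1 / T1 = P2 / T2,
# using absolute pressure and absolute temperature.
# ASSUMPTIONS: constant ball volume, dry air, and hypothetical temperatures.

ATMOSPHERIC_PSI = 14.7  # standard sea-level pressure; a gauge reads pressure above this


def fahrenheit_to_kelvin(temp_f):
    """Convert degrees Fahrenheit to kelvins."""
    return (temp_f - 32.0) * 5.0 / 9.0 + 273.15


def gauge_pressure_after_cooling(gauge_psi, start_temp_f, end_temp_f):
    """Return the gauge pressure once the air inside reaches end_temp_f."""
    absolute_start = gauge_psi + ATMOSPHERIC_PSI
    ratio = fahrenheit_to_kelvin(end_temp_f) / fahrenheit_to_kelvin(start_temp_f)
    absolute_end = absolute_start * ratio
    return absolute_end - ATMOSPHERIC_PSI


if __name__ == "__main__":
    # Hypothetical scenario: inflated to the 12.5 psi minimum indoors at 70 F,
    # then carried out to a 50 F field.
    field_psi = gauge_pressure_after_cooling(12.5, 70.0, 50.0)
    print(f"Gauge pressure on the field: {field_psi:.2f} psi")
```

Under those assumed temperatures, the drop comes to about 1 psi, which is only part of the 2 psi gap in question. Different starting temperatures, wet footballs, or gauge error would all change the answer, which is more or less what the experts were arguing about.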
But even better than this public tussle over thermodynamics has been the numbers fight that followed. An engineer and freelance statistician named Warren Sharp went digging through the numbers for statistical evidence that New England cheated. He found more than he expected: The Patriots were outperforming bettors’ expectations in games played in lousy weather, he noted in a Jan. 21 post. Then he did a follow-up analysis, showing that, weather aside, New England players fumbled far less often than players on any other team.
That second post was republished in Slate, among other places. Its claims were bold: “The New England Patriots’ prevention of fumbles is nearly impossible,” wrote Sharp. “This is an extremely abnormal occurrence and is not simply random fluctuation.” The graphs were indeed alarming, especially the one that showed New England’s fumble rate had improved dramatically in 2007. As he detailed in a subsequent post, also republished in Slate, that happens to be one year after the NFL changed its rules (at the Patriots’ behest, in part) so that each team could use its own footballs while on offense. Had Sharp’s analysis proved the case against Belichick?
It seemed like a strong argument to me, but within a day or two, Sharp’s forensics had been subjected to a swift and vicious round of peer review. A graduate student in Philadelphia named Bill Herman pointed out that Sharp used some lousy data, including fumbles that occurred on kickoffs. (Any deflated footballs would not have been in play for those.) Sharp had also flipped his fractions: Instead of taking fumbles per play, he used plays per fumble, an inverted fraction that puffed up his results. (The same distortion confuses measurements of fuel economy. We talk about miles per gallon, but we should be more concerned about gallons per mile.)
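To see how a flipped fraction can puff up a result, consider a rough sketch with made-up numbers. The six fumble rates below are hypothetical and are not taken from Sharp's or Herman's data. The code simply measures how far the best team sits from the average of the others, first in fumbles per play and then in plays per fumble.

```python
# Sketch: the same teams look different depending on which way the fraction runs.
# ASSUMPTION: these fumble rates are invented for illustration only.

hypothetical_fumbles_per_play = {
    "Team A": 0.020,
    "Team B": 0.019,
    "Team C": 0.018,
    "Team D": 0.017,
    "Team E": 0.016,
    "Standout": 0.010,  # the team under suspicion
}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def relative_gap(standout_value, other_values):
    """Absolute gap between the standout and the others' mean, as a fraction of that mean."""
    baseline = mean(other_values)
    return abs(standout_value - baseline) / baseline


def compare_metrics(rates, standout="Standout"):
    """Return (gap in fumbles per play, gap in plays per fumble) for the standout team."""
    others = [rate for team, rate in rates.items() if team != standout]
    gap_rate = relative_gap(rates[standout], others)
    # Invert every team's rate to get plays per fumble, then repeat the comparison.
    gap_inverted = relative_gap(1.0 / rates[standout], [1.0 / rate for rate in others])
    return gap_rate, gap_inverted


if __name__ == "__main__":
    gap_rate, gap_inverted = compare_metrics(hypothetical_fumbles_per_play)
    print(f"Fumbles per play: standout differs from the rest by {gap_rate:.0%}")
    print(f"Plays per fumble: standout differs from the rest by {gap_inverted:.0%}")
```

With these invented numbers, the standout fumbles roughly 44 percent less often than the rest, yet the inverted measure shows it going roughly 79 percent more plays between fumbles. Both describe the same team. The second number just sounds more damning, because inverting a small rate stretches the low end of the scale.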
There were other problems, too. A self-described math nerd and analyst for GrubHub questioned Sharp’s choice of a normal distribution to describe his data, and thus his conclusions about the fumble numbers’ “abnormality.” Reanalyzing the stats, this critic found that the Patriots may fumble less than other teams, but their success is not spectacular, “and not a convincing case for cheating.”
Other critics were somewhat less polite. A Patriots fan and physicist named Drew Fustin hinted that Sharp may have cherry-picked his data to support the case against New England. Sharp did make some odd decisions: He excluded from his analysis any team that plays in an indoor stadium, on the theory that deflated balls wouldn’t help them quite so much. But that approach ignores the fact that indoor teams play only half their games at home, and that outdoor teams sometimes play their away games in domes. When Fustin tweaked the study to account for this and other “convenient omissions,” the effect on fumbles shrank.
That was far from the most aggressive trashing of Sharp’s work. A management consultant named Daryl Sng accused him Tuesday of “fumbling the data” and said the analysis was “garbage.” (He also gave another useful reappraisal of the stats.) On the same day, a pair of statistics professors posted a nasty screed on Deadspin, in which they mocked Sharp for his sloppy science, as well as his promiscuous use of all-caps, decimal points, and axis labels. “Statistics can say whatever you want it to when it’s used irresponsibly or haphazardly,” they wrote. (For a detailed summary of this back-and-forth, read Neil Paine’s post on FiveThirtyEight.)
That’s when the fighting really started. Brian Burke, an NFL-stats guru who often writes for Slate, called the Deadspin critique “so disingenuous it makes stat guys look bad.” It’s true the fumble numbers don’t “prove” cheating on the part of the Patriots, but they certainly are suggestive. “More than warrants more digging,” he tweeted.
Indeed, several of the follow-up analyses did find a hint of something weird. New England’s fumble rates may not have been as nutty as Sharp thought, but they weren’t totally normal, either. In the cleverest of his analyses, Sharp had compared players who were traded to or from the Patriots in recent years, then asked whether they’d had a better grip on the football while playing for New England. That promised some control for each player’s individual ability, and for the possibility that Belichick only puts in guys who aren’t prone to fumbling. Sharp’s version of this study found a huge effect of 88 percent. But after others fixed his data, some of the same players still showed an improvement when playing for New England, of 23 percent.
By that point, though, a Gamergate mentality had taken hold online. Patriots fans alleged a plot against their team, abetted by scientific sophistry and a lack of journalism ethics. Some very smart and thoughtful observers also felt the media had failed. “It’s been an embarrassing few days for data journalism,” wrote economist Justin Wolfers. “Many outlets [such as Slate] mindlessly reprinted [a] shoddy analysis of fumble data.”
I had nothing at all to do with Slate’s choice to go with Sharp’s analysis. But at the risk of sounding like a shameless home-team fan—or worse, a shameless Jets fan—I think it was the right decision. The analysis may not have been Good Science, in the sense of having come through careful vetting, ahead of time, by experts in the field. In retrospect we know that Sharp’s approach was somewhat slapdash, and his conclusions overblown. But it also seems as if he turned up some suggestive data. His approach was interesting—clever, even—and it led to further work.
If Sharp had behaved more like a scientist—and if Slate had acted more like a scientific journal—then his analysis would not have made it into print. There would have been no fierce debates on methodology in the comments of his post, no more hypotheses to help explain the data. Fellow geeks would not have checked his numbers and corrected his results. They would not have given us the last few days’ entertaining and enlightening dispute. We would not have a better, deeper understanding of the stats.
This sort of back-and-forth happens all the time in science, but it’s almost always hidden from the public and conducted in slow motion. Formal peer review does not allow for free and quick exchange; it pushes scientists behind a veil of anonymity so they can snipe in secret or logroll for their friends. It’s said that this uptight approach helps to make the scientific process work, but the data don’t support that claim. So why not sports it up a bit? This week’s controversy shows what happens when you let someone fire off a quick analysis: Other people get to shoot it down or do it better. Maybe things should get a little ugly in the lab, a little gladiator, a little NFL. A scientist should have the right to trash-talk her opponents. A scientist should have the right to grab his crotch in public.
That’s the lesson of #deflategate and the egghead-meathead nexus it revealed. Football-talk could use a bit more science. But science-talk could use a bit more football.