In the chilly, rainy winter of 2007, crowds poured into Aiello’s Italian Restaurant, between Syracuse and Binghamton, New York, to make the most of its all-you-can-eat buffet of pizza, salad, breadsticks, pasta, and soup. The diners tucked in at the buffet without realizing that when they reached the cash register to pay, they’d be given a short questionnaire probing them about their restaurant experience and their food choices.
Four studies resulted from this restaurant data, generating the buzz you would expect from new research on humans’ constantly examined eating habits. The one that got the most pickup in mainstream media suggested that men ate 93 percent more pizza and 86 percent more salad in the presence of women, a result that the study’s authors attributed to the men’s desire to “show off.” The other three papers, all about food costs, drew on data collected from participants who dined at the restaurant after receiving coupons that discounted their meals to either $4 or $8. The findings generally suggested that paying less for your food is correlated with lower satisfaction and even with stronger feelings of having overeaten.
The studies were all co-authored by Brian Wansink, a professor at Cornell University best known for his popular books on eating behavior and for his previous position as the executive director of the U.S. Department of Agriculture’s Center for Nutrition Policy and Promotion. You’ve probably even heard of one of his more famous findings: that people eat more from larger plates because portions look smaller, which suggests that downsizing your plate could help you reduce calorie intake by around 22 percent. If you’re looking to understand eating habits, Wansink’s work has been considered a place to start.
Wansink’s name lent these studies a bit of credibility. But they also seem to have been conducted using flawed research techniques, far outside the gold standard for research. That gold standard insists that scientists make a hypothesis about a given topic and then test to see if their hypothesis still stands. (It is also common for scientists to collect exploratory data that, in turn, are used to create hypotheses that are then tested by more focused follow-up experiments.) These studies did not follow that model. According to a November blog post by Wansink himself, the data used in these papers were initially part of a “self-funded, failed study which had null results.” The resulting data were later handed to an unpaid visiting Ph.D. student from a Turkish university for further investigation, with instructions from Wansink on “what the analyses should be and what the tables should look like.” It was her reworking that led to the results that were ultimately published.
The practice of searching for trends and hypotheses in already collected data is known in academia as HARKing, or hypothesizing after the results are known. It is generally frowned upon. The curious thing about this case is that it was brought to light by Wansink himself. In his original blog post, Wansink even noted that when he passed off his data, he told the Ph.D. student: “This cost us a lot of time and our own money to collect. There’s got to be something here we can salvage because it’s a cool (rich & unique) data set.” When online commenters were quick to criticize the research methods laid out in the post, Wansink was just as quick to reply to them in defense.
The debacle eventually prompted three researchers to take a closer look at the papers in question. Using online tools, the trio found approximately 150 statistical inconsistencies in the four papers and reported them in a study published without peer review on PeerJ PrePrints on Jan. 25 (such informal post-publication peer review of studies is now common thanks to forums such as PubPeer). The assessment shows that the means, standard deviations, sample sizes, and related values do not hold up as originally reported. One of the authors of the preprint, Nicholas Brown, a Ph.D. student at the University of Groningen in the Netherlands, said he didn’t think the errors were a result of fraud; he thinks they resulted from cherry-picking results. “The errors in the preprint are probably due to the HARKing having been done very sloppily,” Brown told me.
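To give a sense of what a statistical inconsistency between a mean and a sample size can look like, here is a minimal sketch of one simple arithmetic check. It is an illustration only: the article does not say which checks the online tools ran, and the function name and example numbers below are hypothetical. The sketch assumes the underlying responses are whole numbers (a 1-to-9 rating, say), so the reported mean must equal some integer total divided by the sample size, and it assumes half-up rounding in the reported figure.

```python
from decimal import Decimal, ROUND_HALF_UP


def mean_is_possible(reported_mean: str, n: int) -> bool:
    """Return True if some integer total, divided by n, rounds to reported_mean.

    Hypothetical helper. Assumes every response is a whole number and that
    the paper rounded half-up to the number of decimals shown.
    """
    target = Decimal(reported_mean)
    # Smallest step at the reported precision, e.g. 0.01 for "3.45".
    quantum = Decimal(1).scaleb(target.as_tuple().exponent)
    half = quantum / 2
    # Only totals in this narrow window could possibly round to the target.
    lowest = max(int((target - half) * n), 0)
    highest = int((target + half) * n) + 1
    for total in range(lowest, highest + 1):
        mean = (Decimal(total) / Decimal(n)).quantize(quantum, rounding=ROUND_HALF_UP)
        if mean == target:
            return True
    return False


if __name__ == "__main__":
    # Made-up figures: a mean of 3.45 is achievable with 20 whole-number
    # responses (total of 69) but not with 19.
    print(mean_is_possible("3.45", 20))
    print(mean_is_possible("3.45", 19))
```

A mean that fails a check like this is not evidence of wrongdoing by itself; it may simply reflect a typo, a different rounding rule, or a misreported sample size.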
Brown did not think that correcting the results would mean that the findings would no longer be statistically significant. Wansink has not released his data because he says they contain confidential information about the diners and the restaurant’s profits, so Brown wasn’t able to make the changes himself and rerun the analyses. But Wansink said in an addendum to his blog post that he is going to do so. He admitted to finding “minor inconsistencies” when rescanning the papers, and on Tuesday, he contacted all four journals asking if they would be willing to publish errata addressing the problems after he has thoroughly reanalyzed the data. If the conclusions significantly change, which Wansink, like Brown, thinks is unlikely, he plans to retract the relevant studies. Wansink also confessed to mistakenly using the same data set for four different papers without declaration, another no-no in the academic world, known as “salami-slicing.”
Wansink denies that the papers were HARKed. He said that the authors went into the experiment with a primary hypothesis (to assess whether people who paid more ate more to get their money’s worth) as well as a few secondary interests. Once his initial hypotheses didn’t work, Wansink delved into what he calls “deep data dives,” which essentially means looking for trends in a collection of data with lots of variables. As Wansink puts it: “Perhaps your hypo worked during lunches but not dinners, or with small groups but not large groups. You don’t change your hypothesis, but you figure out where it worked and where it didn’t.”
And this is where these four studies and their possible errors run up against the larger problem that plagues academic research, scientific journals, and science journalism. As Jeff Rouder, a professor at the University of Missouri in Columbia, told me, Wansink is not really the one who deserves the full blame here. The errors that were made are at the very least understandable given the current academic pressure to publish; Wansink’s original post was in praise of his industrious grad students who made something out of nothing. And speaking of something out of nothing, the reason that the original study failed and the data were handed off in the first place is that it produced a null result: in other words, it didn’t show the effect the researchers thought it might. A null result often means that an experiment is doomed to go unpublished, even though null and negative results can teach us just as much as positive results. But journals have a bias toward publishing positive results.
Perhaps as a result of these unfair conditions, Wansink understandably sent the visiting Ph.D. student to spelunk through data that had been very costly to collect. And she turned up statistically significant results, which are the bar for entry at many scientific journals. Unfortunately, just because a correlation is statistically significant does not necessarily mean that it tells us something true about the world. Statisticians are generally skeptical of the techniques employed in this work. Slicing the same data in many different ways and subdividing variables will inevitably surface spurious correlations: trends that arise by chance and have no causal connection. It’s perfectly possible that the men in the study did eat significantly more in front of women. It’s also perfectly possible that they did so for reasons entirely unrelated to the women’s presence.
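The point about subdividing data can be made concrete with a small simulation. The sketch below is an illustration only, not a reconstruction of the restaurant analysis: the function names, the 200 simulated diners, and the 40 arbitrary yes/no splits are all hypothetical, and the “slices eaten” figures are pure random noise by construction. It splits the same noise into two groups over and over, runs an approximate large-sample test on each split, and counts how many comparisons clear the conventional p < 0.05 bar even though no real effect exists.

```python
import math
import random
import statistics


def two_sided_p(a, b):
    """Approximate two-sided p-value for a difference in means (large-sample z-test)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2))


def count_spurious_hits(n_diners=200, n_splits=40, alpha=0.05, seed=0):
    """Count 'significant' group differences found in data that is pure noise.

    All parameters are hypothetical. Each split stands in for one arbitrary
    way of subdividing the diners (lunch vs. dinner, small vs. large group).
    """
    rng = random.Random(seed)
    # One outcome per diner, unrelated to any grouping by construction.
    slices_eaten = [rng.gauss(3.0, 1.0) for _ in range(n_diners)]
    hits = 0
    for _ in range(n_splits):
        # Assign each diner to one of two groups at random.
        labels = [rng.random() < 0.5 for _ in range(n_diners)]
        group_a = [x for x, flag in zip(slices_eaten, labels) if flag]
        group_b = [x for x, flag in zip(slices_eaten, labels) if not flag]
        if len(group_a) < 2 or len(group_b) < 2:
            continue  # too few diners in a group to estimate a variance
        if two_sided_p(group_a, group_b) < alpha:
            hits += 1
    return hits


if __name__ == "__main__":
    print(count_spurious_hits())
```

With a 0.05 threshold, about one comparison in 20 is expected to look significant by chance alone, so a search across dozens of subgroups will usually turn up something publishable-looking. That is why a significant result found this way says little unless the hypothesis was fixed in advance or the number of comparisons is accounted for.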
Slate contributor Andrew Gelman, a statistician at Columbia University, was one of the first to raise the alarm after encountering Wansink’s post; he blogged about it back in December. He has since followed up with a nuanced assessment of the likelihood that the results, even if they maintain statistical significance, indicate something true.
While his assessment is meticulous and quite fair, Gelman also describes one nightmare scenario in which the research, despite being catastrophically flawed, ends up influencing public policy (given Wansink’s prominence, this seems possible). One result of such a situation could be that “people follow bad nutrition advice and their health is harmed or, at the very least, their quality of life is harmed because they feel they should be following some arbitrary rules.” Gelman’s scenario assumes the research has made its way into policy. But remember the fanfare that the men-eat-more-around-women study received from the general media? This research has already made its way into headlines. And that means it’s likely to have affected people’s lives, for better or worse.