Everyone Is Missing the Point About Brian Wansink and P-Hacking

There are more crucial lessons to learn from the replication crisis.

Brian Wansink during the 2013 Discovery Vitality Summit in Johannesburg, South Africa. Lefty Shivambu/Gallo Images/Getty Images

This post originally appeared on the author’s website.

Last month, Brian Wansink, the Cornell University researcher and retraction king, retired. As David Randall wrote in the Wall Street Journal, “Over a 25-year career, Mr. Wansink developed an international reputation as an expert on eating behavior. He was the main popularizer of the notion that large portions lead inevitably to overeating. But Mr. Wansink resigned last week … after an investigative faculty committee found he had committed a litany of academic breaches. … A generation of Mr. Wansink’s journal editors and fellow scientists failed to notice anything wrong with his research—a powerful indictment of the current system of academic peer review.”

But some news reports missed the point, in a way that I’ve discussed before: They focused on “p-hacking” and bad behavior rather than the larger problem of researchers expecting routine discovery. At NPR, Brett Dahlberg wrote: “The fall of a prominent food and marketing researcher may be a cautionary tale for scientists who are tempted to manipulate data and chase headlines.”

Sure, manipulating data is bad. But there’s nothing wrong with chasing headlines if you think you have an important message to share. And I suspect that the big problem is not overt “manipulation” but researchers fooling themselves.

A focus on misdeeds will have two negative consequences. First, when people point out flawed statistical practices being done by honest and well-meaning scientists, critics might associate this with p-hacking and cheating, unfairly implying that these errors are the result of bad intent. Second, various researchers who are using poor scientific practices but are not cheating will think that their research is fine just because they’re not falsifying data or p-hacking. So, for both these reasons, I like the framing of the garden of forking paths: why multiple comparisons can be a problem, even when there is no fishing expedition or p-hacking and the research hypothesis was posited ahead of time.
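
To make the forking-paths point concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than a description of any real study: there is no true effect at all, each study offers eight reasonable-looking comparisons (`n_paths`), there are 20 people per group, and a simple z-test with known standard deviation stands in for the usual t-test. The analyst never falsifies anything and reports only one comparison, but which one gets reported depends on the data.

```python
import random
from statistics import NormalDist, mean


def two_sided_p(xs, ys, sd=1.0):
    """Two-sided z-test p-value for a difference in means (sd assumed known)."""
    se = sd * (1 / len(xs) + 1 / len(ys)) ** 0.5
    z = (mean(xs) - mean(ys)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def significant_rate(n_paths, n_sims=2000, n_per_group=20, alpha=0.05, seed=1):
    """Share of null studies whose most promising comparison has p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _sim in range(n_sims):
        best_p = 1.0
        for _path in range(n_paths):
            # Both groups come from the same distribution: no real effect.
            treated = [rng.gauss(0, 1) for _ in range(n_per_group)]
            control = [rng.gauss(0, 1) for _ in range(n_per_group)]
            best_p = min(best_p, two_sided_p(treated, control))
        if best_p < alpha:
            hits += 1
    return hits / n_sims


one_path = significant_rate(n_paths=1)
eight_paths = significant_rate(n_paths=8)
print(f"one preplanned comparison: {one_path:.2f}")
print(f"best of eight plausible comparisons: {eight_paths:.2f}")
```

With a single preplanned comparison, the rate of "significant" results should land near the nominal 5 percent. With eight independent paths, theory puts it at 1 − 0.95^8, or about 34 percent, even though each individual test is perfectly valid and nobody went fishing on purpose.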

Additionally, Dahlberg writes: “The gold standard of scientific studies is to make a single hypothesis, gather data to test it, and analyze the results to see if it holds up. By Wansink’s own admission in the blog post, that’s not what happened in his lab.”

No. Sometimes it’s fine to make a single hypothesis and test it. But work in a field such as eating behavior is inherently speculative. We’re not talking about general relativity or other theories of physics that make precise predictions that can be tested; this is a field where the theories are fuzzy, and we can learn from data. This is not a criticism of behavioral research; it’s just the way it is.

One problem with various flawed statistical methods associated with hypothesis testing is the mistaken identification of vague hypotheses, such as “People eat more when they’re served in large bowls,” with specific statistical models of particular experiments.

We need to move beyond the middle-school science-lab story of the precise hypothesis and accept that, except in rare instances, a single study in social science will not be definitive. The gold standard is not a test of a single hypothesis; the gold standard is a clearly defined and executed study with high-quality data that leads to an increase in our understanding.

Dahlberg continues: “P-hacking is when researchers play with the data, often using complex statistical models, to arrive at results that look like they’re not random.” My problem here is with the implication of intentionality. And most of the examples of unreplicable research I’ve seen used simple statistical comparisons and tests—nothing complex at all, maybe the occasional regression model. Wansink was mostly using t-tests and chi-squared tests.

The article continues, quoting a university administrator: “We believe that the overwhelming majority of scientists are committed to rigorous and transparent work of the highest caliber.” Two things are being conflated here: procedural characteristics (rigor and transparency) and quality of research (work of the highest caliber). Unfortunately, honesty and transparency are not enough. And again, there’s a problem when scientific errors are framed as moral errors. Sure, Wansink’s work had tons of problems, including a nearly complete lack of rigor and transparency. But lots of people run studies that are rigorous (in the sense of being controlled experiments) and transparent yet still do low-quality research because they have noisy data and bad theories. It’s fine to encourage good behavior and slam bad behavior, but let’s remember that lots of bad work is being done by good people.

My point here is not to bang on Dahlberg. I’m writing because I keep seeing this moralistic framing of the replication crisis that I think is unhelpful.

At some point, Wansink must have realized he was doing something wrong, or he worked really hard to avoid confronting his errors. He made lots of claims where he had no data, and he continues to drastically minimize the problems that have been found with his work. But Wansink is an unusual case. Lots of people out there are trying their best but still are doing junk science. And even Wansink might feel that his sloppiness and minimization of errors are in the cause of a greater good of improving public health.

The problem is a fundamental lack of understanding. Yes, cheating occurs, but the cheating arises out of statistical confusion. People are taught that they are doing studies with 80 percent power (see here and here), so they think they should be routinely achieving success, and they do what it takes, and what they’ve seen other people do, to get that success. Now, don’t get me wrong: I’m frustrated as hell when researchers hype their claims, dodge criticism, and even attack their critics, but I think this is all coming from these researchers living in a statistical fantasy world.
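
A rough calculation shows how far the "80 percent power" story can sit from reality. The numbers below are hypothetical choices for illustration, not taken from any study: 64 people per group, a planning assumption of a half-standard-deviation effect, and a true effect one-fifth that size. The function uses the normal approximation for a two-sample comparison, which is a simplification of the usual t-test calculation.

```python
from statistics import NormalDist

NORM = NormalDist()


def power_two_sample(effect_sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    effect_sd is the true difference in means in standard-deviation units.
    """
    z_crit = NORM.inv_cdf(1 - alpha / 2)
    shift = effect_sd * (n_per_group / 2) ** 0.5  # true effect / standard error
    return NORM.cdf(shift - z_crit) + NORM.cdf(-shift - z_crit)


n = 64                                      # hypothetical sample size per group
assumed_power = power_two_sample(0.5, n)    # effect assumed when planning
actual_power = power_two_sample(0.1, n)     # smaller effect that may be the truth

# Smallest estimated difference (in sd units) that can reach p < 0.05.
min_significant = NORM.inv_cdf(0.975) * (2 / n) ** 0.5

print(f"power if the effect is 0.5 sd: {assumed_power:.2f}")
print(f"power if the effect is 0.1 sd: {actual_power:.2f}")
print(f"smallest significant estimate: {min_significant:.2f} sd")
```

Under these assumptions the planned power is about 80 percent, but if the true effect is 0.1 standard deviations, the power is under 10 percent. Any estimate that does reach significance has to be about 0.35 standard deviations or more, which is more than three times the true effect. A researcher who expects routine success in that setting is set up to be fooled without any intent to cheat.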

As David Randall wrote, let something good come from Mr. Wansink’s downfall. Let’s hope that Wansink, too, can use his energy and creativity in a way that can benefit society. And sure, it’s good for researchers to know that if you publish papers where the numbers don’t add up and you can’t produce your data, eventually your career may suffer. But what I’d really like is for researchers and decision-makers to recognize fundamental difficulties of science, to realize that statistics is not just a bunch of paperwork, and that following the conventional steps of research without sensible theory or good measurement is a recipe for disaster.

With clear enough thinking, the replication crisis never needed to have happened because people would’ve realized that so many of these studies would be hopeless. But in the world as it is, we’ve learned a lot from failed replication attempts and careful examinations of particular research projects. Let that be the lesson: Even if you are an honest and well-meaning researcher who’s never p-hacked, you can fool yourself, along with thousands of colleagues, news reporters, and funders, into believing things that aren’t so.