Future Tense

The Unintended Consequences of Trying to Replicate Research

And how to fix them.


In 2012, a researcher from the pharmaceutical company Amgen and another from the MD Anderson Cancer Center in Houston dropped a bombshell in the pages of Nature: When Amgen had tried to confirm the results of 53 landmark studies in preclinical cancer research, they could only verify six. The paper, which has since been cited nearly 500 times by other scientific articles, focused attention on scientific reproducibility and replication, and how often the results we hear about hold up to further scrutiny.

Reproducibility has become a shibboleth of sorts, sparking conferences, sessions at major medical meetings, and grant funding from foundations interested in research integrity. (Disclosure: Some of those foundations have also funded the Center for Scientific Integrity, the parent nonprofit of Retraction Watch, which we created to promote transparency in science.)

But while the reproduction of studies should be a much higher priority for science than it is even now, there’s a “careful what you wish for” moment happening. Making replication a goal in itself may, paradoxically, make the literature even less reliable.

How’s that, you say? The answer lies in another problem facing science, and biomedicine in particular: positive publication bias, which is another way of saying that journals would rather publish studies that find a positive result than studies that find a negative one.

In a 2015 paper in the Review of General Psychology, Michèle B. Nuijten of Tilburg University and her colleagues argue that a rush to replicate may prove to be an epistemological debacle. (Incidentally, Tilburg University was the last academic home of Diederik Stapel, the Netherlands’ most “successful” fraudster to date, if retractions are any indication. Stapel fabricated data in nearly 60 social psychology studies that have since been retracted and, in the process, did grave damage to his field.)

Here’s the problem: Let’s say a study finds that people who drink two Starbucks lattes a day are wealthier than those who drink Dunkin’ Donuts coffee. (The cynic might say that sure, they start out that way, but the expensive habit drains their savings.) Suspend your disbelief for a moment and assume that’s an interesting and important result, so a journal publishes it. Then, some other researchers decide to try to replicate the study, and find that the results are even stronger. Another journal is happy to publish that study, too.

After this happens a few more times, another group decides to gather all of the studies into something called a meta-analysis, which, if done properly, should tell us what’s really happening by strengthening the signals of the individual studies. Nuijten’s team surveyed psychology researchers at various levels to see whether they would agree with that characterization, and in general, they did: Combining smaller and larger studies, they said, would yield a more robust result than any of the individual studies alone. And voilà! The meta-analysis shows that indeed, lattes equal wealth.

However, the real story, as with icebergs, is what’s happening out of view. It turns out that a bunch of other teams had tried to replicate the original work but couldn’t. Perhaps the real link between coffee drinking and wealth was Keurigs. They may have tried to publish those findings, but few journals were interested. (More on that in a moment.) Or, anticipating that the study would be rejected, they didn’t bother, just shoving the work into a virtual file drawer and forgetting about it.

That means the vaunted meta-analysis doesn’t paint the full picture. It’s like looking at a detail of a canvas—say, a bit of an island in the middle of a lake—and thinking the painting is of land instead of a body of water.
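
To make the file-drawer effect concrete, here is a minimal Python simulation. It is not taken from Nuijten’s paper, and every parameter (number of studies, sample sizes, the “positive and significant” filter) is an illustrative assumption: it generates many small two-group studies of a nonexistent latte effect, “publishes” only the ones that come out positive and statistically significant, and then averages the published subset the way a naive meta-analysis would.

```python
# Illustrative simulation of how selective publication skews a pooled estimate.
# All parameters are assumptions for demonstration, not values from Nuijten et al.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_studies, n_per_group = 1000, 30   # many small studies, 30 people per group
true_effect = 0.0                   # lattes have no real link to wealth

published = []    # effect sizes that clear the "positive and significant" bar
all_effects = []  # every effect size, including the file drawer

for _ in range(n_studies):
    latte = rng.normal(true_effect, 1.0, n_per_group)   # standardized "wealth"
    dunkin = rng.normal(0.0, 1.0, n_per_group)
    diff = latte.mean() - dunkin.mean()                  # observed effect
    _, p = stats.ttest_ind(latte, dunkin)
    all_effects.append(diff)
    if diff > 0 and p < 0.05:                            # the journals' filter
        published.append(diff)

print(f"True effect:                  {true_effect:+.2f}")
print(f"Average over ALL studies:     {np.mean(all_effects):+.2f}")
print(f"Average over PUBLISHED ones:  {np.mean(published):+.2f} "
      f"({len(published)} of {n_studies} studies)")
```

In a typical run of this toy setup, the average over all studies hovers near zero, while the average over the published handful looks like a respectable effect—the iceberg’s underwater bulk in miniature.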

Of course, the Dutch researchers—who aren’t the first to warn of the unintended consequences of replication drives—aren’t calling for a moratorium on replication studies. (Others are fighting replication efforts, but unsuccessfully.) Far from it. What they’d like to see out of such work is what they’d like generally out of science: a reliance on more rigorous statistics from the get-go. In particular, they focus on a measure called “power” that, in essence, tells you how likely a study of a given size is to detect a real effect, and therefore how many subjects you need to demonstrate that whatever you’re studying truly matters. In general, studies of common events—say, high blood pressure or diabetes—require larger numbers to be relevant than studies of rare events, like severe birth defects.
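
For readers curious about what a power calculation looks like in practice, here is a minimal sketch using the statsmodels library. The effect sizes and targets below are illustrative assumptions (standard benchmarks for small, medium, and large standardized effects), not figures from any study discussed here.

```python
# Rough sketch of an a priori power calculation for a two-group comparison.
# The effect sizes (Cohen's d) are illustrative assumptions.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

for d in (0.2, 0.5, 0.8):  # small, medium, large standardized effects
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.8,
                             alternative='two-sided')
    print(f"Effect size d={d}: about {n:.0f} subjects per group "
          f"for 80% power at alpha=0.05")
```

The pattern the sketch makes visible is the one that matters: the smaller the effect you’re chasing, the more subjects you need, which is why so many underpowered studies end up as noise that publication bias then amplifies.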

Nuijten and her colleagues think the biggest dragon to slay here is publication bias. Doing so won’t be easy; journals have been aware of the issue for years and have done little to change course. The incentives to publish positive, “groundbreaking” results are intense. Those are the papers that other scientists will cite, and since journals are ranked based on the average citation rate, you can see how what they publish ends up skewed. In many ways, positive publication bias is also responsible for so many irreproducible studies being published: The results are buffed and cherry-picked to satisfy editors’ thirst for big news.

But it’s not impossible to change positive publication bias. One proposed remedy is to modify the peer review process so that reviewers grade manuscripts on the quality of their introductions and methods sections rather than on the novelty of the findings. Similarly, journals could assess submissions based solely on the rigor of their methods—as PLOS ONE does.

“Another way,” writes Nuijten elsewhere, “is for journals to commit to publishing replication results independent of whether the results are significant. Indeed, this is the stated replication policy of some journals already.” Uri Simonsohn, of the University of Pennsylvania, and his colleagues have a slightly different solution, which relies on something they call the p-curve, an attempt to correct for publication bias by taking natural statistical variation into account. And one journal has even created a “negative results” section.
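
The intuition behind the p-curve is easier to see with a toy simulation. The sketch below is only an illustration of that intuition, not Simonsohn and colleagues’ actual procedure, and every parameter is an assumption: when an effect is real, the statistically significant p-values pile up close to zero; when there is no effect, the significant p-values that slip through are spread evenly between 0 and 0.05.

```python
# Toy illustration of the intuition behind p-curve analysis: the shape of the
# distribution of *significant* p-values differs when an effect is real vs. null.
# Parameters are illustrative assumptions, not Simonsohn et al.'s procedure.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def significant_pvalues(true_effect, n_studies=5000, n_per_group=30):
    """Run many two-group studies and return the p-values that came out < .05."""
    pvals = []
    for _ in range(n_studies):
        a = rng.normal(true_effect, 1.0, n_per_group)
        b = rng.normal(0.0, 1.0, n_per_group)
        _, p = stats.ttest_ind(a, b)
        if p < 0.05:
            pvals.append(p)
    return np.array(pvals)

for label, effect in (("real effect (d=0.5)", 0.5), ("no effect (d=0.0)", 0.0)):
    p = significant_pvalues(effect)
    share_tiny = np.mean(p < 0.025)   # right-skew check: how many are very small?
    print(f"{label}: {share_tiny:.0%} of significant p-values fall below .025")
```

In this simplified setup, a real effect pushes well over half of the significant p-values below .025, while a null effect leaves them split roughly fifty-fifty—the kind of signature a p-curve analysis looks for.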

Clearly, what we need now is a study of all of these interventions, to see which can cut down on publication bias and make replications more useful. But that study would need to be replicated, too, and we’d want all the results published, be they positive, negative, or ambiguous. Science takes one step forward, but only after walking in circles first.
