Psychologists are up in arms over, of all things, the editorial process that led to the recent publication of a special issue of the journal Social Psychology. This may seem like a classic case of ivory tower navel gazing, but its impact extends far beyond academia. The issue attempts to replicate 27 “important findings in social psychology.” Replication—repeating an experiment as closely as possible to see whether you get the same results—is a cornerstone of the scientific method. Replication of experiments is vital not only because it can detect the rare cases of outright fraud, but also because it guards against uncritical acceptance of findings that were actually inadvertent false positives, helps researchers refine experimental techniques, and affirms the existence of new facts that scientific theories must be able to explain.
One of the articles in the special issue reported a failure to replicate a widely publicized 2008 study by Simone Schnall, now tenured at Cambridge University, and her colleagues. In the original study, two experiments measured the effects of people’s thoughts or feelings of cleanliness on the harshness of their moral judgments. In the first experiment, 40 undergraduates were asked to unscramble sentences, with one-half assigned words related to cleanliness (like pure or pristine) and one-half assigned neutral words. In the second experiment, 43 undergraduates watched the truly revolting bathroom scene from the movie Trainspotting, after which one-half were told to wash their hands while the other one-half were not. All subjects in both experiments were then asked to rate the moral wrongness of six hypothetical scenarios, such as falsifying one’s résumé and keeping money from a lost wallet. The researchers found that priming subjects to think about cleanliness had a “substantial” effect on moral judgment: The hand washers and those who unscrambled sentences related to cleanliness judged the scenarios to be less morally wrong than did the other subjects. The implication was that people who feel relatively pure themselves are—without realizing it—less troubled by others’ impurities. The paper was covered by ABC News, the Economist, and the Huffington Post, among other outlets, and has been cited nearly 200 times in the scientific literature.
However, the replicators, David Johnson, Felix Cheung, and Brent Donnellan of Michigan State University (two graduate students and their adviser), found no such difference, despite testing about four times as many subjects as the original studies.
Aggrieved about several aspects of the replication process, Schnall aired her concerns, first to a journalist covering the special issue for Science, and then on her personal blog. When the Michigan State researchers told her that they planned to replicate her study, Schnall had gladly provided them with the materials she used in her experiments (the moral dilemmas she asked subjects to consider, the procedures she followed, and so on). She had also accepted the journal editors’ invitation to peer review the experimental protocol and statistical analysis the replicators planned to follow. But after that, she felt shut out. Although Schnall had approved of their proposed method of collecting and analyzing data, neither she nor anyone other than the editors reviewed the results of the replication. When the replicators did share their data and analysis with her, she asked for two weeks to review it to try to determine why they had failed to reproduce her original findings, but the manuscript had already been submitted for publication.
Once the journal accepted the paper, Donnellan reported, in a much-tweeted blog post, that his team had failed to replicate Schnall’s results. Although the vast majority of the post consisted of sober academic analysis, he titled it “Go Big or Go Home”—a reference to the need for bigger sample sizes to reduce the chances of accidentally finding positive results—and at one point characterized their study as an “epic fail” to replicate the original findings.
After reviewing the new data, Schnall developed an explanation for why the Johnson group failed to replicate her study. But the guest editors of the special issue (social psychologists Brian Nosek of the University of Virginia and Daniel Lakens of the Eindhoven University of Technology in the Netherlands), having initially said that the original authors of some of the replicated studies “may” be invited to respond, now told her there was no space in the issue for responses by any original authors. The editors also disagreed with her argument that she had found an error in the replication that rendered it “invalid” and warranted editorial intervention. (Schnall’s claim of error involves some technical issues regarding measurement and statistics, and has been analyzed by several respected methodologists. At present, the consensus is running against her on this point.)
The editor in chief of Social Psychology later agreed to devote a follow-up print issue to responses by the original authors and rejoinders by the replicators, but as Schnall told Science, the entire process made her feel “like a criminal suspect who has no right to a defense and there is no way to win.” The Science article covering the special issue was titled “Replication Effort Provokes Praise—and ‘Bullying’ Charges.” Both there and in her blog post, Schnall said that her work had been “defamed,” endangering both her reputation and her ability to win grants. She feared that by the time her formal response was published, the conversation might have moved on, and her comments would get little attention.
How wrong she was. In countless tweets, Facebook comments, and blog posts, several social psychologists seized upon Schnall’s blog post as a cri de coeur against the rising influence of “replication bullies,” “false positive police,” and “data detectives.” For “speaking truth to power,” Schnall was compared to Rosa Parks. The “replication police” were described as “shameless little bullies,” “self-righteous, self-appointed sheriffs” engaged in a process “clearly not designed to find truth,” “second stringers” who were incapable of making novel contributions of their own to the literature, and—most succinctly—“assholes.” Meanwhile, other commenters stated or strongly implied that Schnall and other original authors whose work fails to replicate had used questionable research practices to achieve sexy, publishable findings. At one point, these insinuations were met with threats of legal action.
Brent Donnellan apologized for his use of “go big or go home” and “epic fail,” and another researcher apologized for comments that seemed to imply that Schnall’s original work might not have been “honest.” But for too long, the discussion continued to focus on who chose the wrong words or took the wrong tone, whose career and reputation were most likely to be hurt, whose research plans have been most chilled, and who did what to whom, and what their real motives were.
* * *
This all may seem like little more than a reminder of the adage that the politics of academia are so nasty because the stakes are so small. The #repligate controversy spiked to an unusual intensity, even for academia. But the stakes for the rest of us are anything but low. Scientific knowledge is not produced by scientists alone, and it certainly doesn’t affect only them.
Much science, including psychological science, wouldn’t be possible without funding from governments, foundations, and universities. Funders clearly have a stake in the quality and validity of research results. In recent years, many members of Congress (most of them Republican) have expressed deep skepticism about the value of behavioral science. The National Science Foundation, which finances a lot of behavioral science (including Schnall’s original 2008 study), is a frequent target of their criticism. Last month, the House of Representatives passed an amendment that would reallocate $15 million within the NSF budget from social, behavioral, and economic research to the fields of physical sciences, biology, computer science, math, and engineering.
Those who oppose funding for behavioral science make a fundamental mistake: They assume that valuable science is limited to the “hard sciences.” Social science can be just as valuable, but it’s difficult to demonstrate that an experiment is valuable when you can’t even demonstrate that it’s replicable.
Science often depends on the blood, sweat, tears, and other biospecimens of human subjects, as well as on their time, their willingness to assume privacy risks or relive traumatic experiences, and so on. International guidelines and prominent scholarly frameworks state that in order for research involving human subjects to be ethical, it must be well designed to answer an important question. Subjects are seen as altruists who assume risks and costs, sometimes great and sometimes small, in order to help advance science. Neither a poorly designed study nor a study aimed at answering a trivial question is said to be capable of producing social benefit that offsets those risks and costs. Our own unorthodox view is that, since subjects are in fact often motivated by goals (such as payment, free medical attention, or scientific curiosity) that they will achieve regardless of whether the study advances science, it’s not automatically unethical to invite them to participate in a study that might be poorly designed or trivial. There’s also a danger in insisting that ethics boards enforce this rule too rigidly, because a lot of what passes for criticism of a proposed study’s design or importance is actually intramural disagreement among scientists, and scientists shouldn’t be allowed to block one another’s research on parochial grounds. Still, many subjects volunteer for research with the expectation that all reasonable efforts will be made to ensure that the results are correct. Replication is science’s most basic way of verifying correctness.
Perhaps most importantly, science is intellectually ascendant. Research in the natural and social sciences increasingly—and rightfully—influences scholarship in other areas and reaches the ears of policymakers and leaders. Since leaving Freud behind and turning decisively to empirical research, psychology has told us much about human nature that we did not previously understand. Social psychology has an especially distinguished history of discovering fundamental and inconvenient truths. To name just two examples, the studies on conformity conducted by Solomon Asch in the 1950s and the studies on obedience to authority conducted by Stanley Milgram in the 1960s showed how our behavior can be influenced to a frightening degree by the actions of those around us. Milgram’s famous finding that a majority of research volunteers would follow instructions to the point of shocking a fellow human being with 450 volts of electricity was not predicted in advance, even by a group of psychiatrists.
Social priming, the field that is near the center of the replication debate, is like Asch and Milgram on steroids. According to social priming and related “embodiment” theories, overt instructions or demonstrations from others aren’t necessary to influence our behavior; subtle environmental cues we aren’t even aware of can have large effects on our decisions. If washing our hands can affect our moral judgments, then moral “reasoning” is much less rational and under our deliberate control than we think. Some social priming researchers have even proposed that their findings could underpin a new type of psychotherapy. What’s at stake in this research, then, is far from trivial—it is our most basic understanding of human nature. And law, economics, philosophy, management, political science, and other fields now rightly turn to psychology to ensure their own work reflects accurate and up-to-date accounts of human nature.
But if the funders, human subjects, and consumers who enable psychology research can’t trust that its veracity will be confirmed, then the broader social trust necessary to sustain that research enterprise will erode. Unfortunately, published replications have been distressingly rare in psychology. A 2012 survey of the top 100 psychology journals found that barely 1 percent of papers published since 1900 were purely attempts to reproduce previous findings. Some of the most prestigious journals have maintained explicit policies against replication efforts; for example, the Journal of Personality and Social Psychology published a paper purporting to support the existence of ESP-like “precognition,” but would not publish papers that failed to replicate that (or any other) discovery. Science publishes “technical comments” on its own articles, but only if they are submitted within three months of the original publication, which leaves little time to conduct and document a replication attempt.
* * *
The “replication crisis” is not at all unique to social psychology, to psychological science, or even to the social sciences. As Stanford epidemiologist John Ioannidis famously argued almost a decade ago, “Most research findings are false for most research designs and for most fields.” Failures to replicate and other major flaws in published research have since been noted throughout science, including in cancer research, research into the genetics of complex diseases like obesity and heart disease, stem cell research, and studies of the origins of the universe. Earlier this year, the National Institutes of Health stated, “The complex system for ensuring the reproducibility of biomedical research is failing and is in need of restructuring.”
Given the stakes involved and its centrality to the scientific method, it may seem perplexing that replication is the exception rather than the rule. The reasons why are varied, but most come down to the perverse incentives driving research. Scientific journals typically view “positive” findings that announce a novel relationship or support a theoretical claim as more interesting than “negative” findings that say that things are unrelated or that a theory is not supported. The more surprising the positive finding, the better, even though surprising findings are statistically less likely to be accurate. Since journal publications are valuable academic currency, researchers—especially those early in their careers—have strong incentives to conduct original work rather than to replicate the findings of others. Replication efforts that do happen but fail to find the expected effect are usually filed away rather than published. That makes the scientific record look more robust and complete than it is—a phenomenon known as the “file drawer problem.”
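A small simulation can make the file drawer problem concrete. The sketch below is an illustration under stated assumptions, not an analysis of any real literature: the true effect (0.2 standard deviations), the 20 subjects per group, and the rule that only studies crossing the conventional 1.96 significance threshold get "published" are all hypothetical choices, and the function name is ours.

```python
import random
import statistics


def simulate_file_drawer(true_effect=0.2, n_per_group=20, n_studies=2000, seed=1):
    """Simulate many small two-group studies of one modest true effect.

    Outcomes are standard-normal noise, shifted by `true_effect` in the
    treated group. A study is "published" only if its z statistic exceeds
    1.96; the rest go in the file drawer. All values are illustrative
    assumptions.
    """
    rng = random.Random(seed)
    # Standard error of a difference in means when each group has sd = 1.
    se = (2 / n_per_group) ** 0.5
    all_estimates, published = [], []
    for _ in range(n_studies):
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        treated = [rng.gauss(true_effect, 1) for _ in range(n_per_group)]
        diff = statistics.fmean(treated) - statistics.fmean(control)
        all_estimates.append(diff)
        if diff / se > 1.96:  # only "positive" findings reach print
            published.append(diff)
    return all_estimates, published


all_estimates, published = simulate_file_drawer()
print("share of studies published:", len(published) / len(all_estimates))
print("mean effect, all studies:  ", statistics.fmean(all_estimates))
print("mean effect, published:    ", statistics.fmean(published))
```

Under these assumptions, every published estimate has to exceed roughly 0.62, about three times the true effect, simply because nothing smaller could clear the significance bar with samples this small. The average across all studies, filed away or not, stays close to the true value.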
The emphasis on positive findings may also partly explain the fact that when original studies are subjected to replication, so many turn out to be false positives. The near-universal preference for counterintuitive, positive findings gives researchers an incentive to manipulate their methods or poke around in their data until a positive finding crops up, a common practice known as “p-hacking” because it can result in p-values, or measures of statistical significance, that make the results look stronger, and therefore more believable, than they really are.
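To see how much room p-hacking leaves for false positives, consider one simple form of it: measuring several outcomes and reporting whichever one reaches significance. The sketch below assumes there is no real effect at all, that the outcomes are statistically independent (real outcomes are usually correlated, which would shrink the inflation somewhat), and that each test statistic is a standard-normal z score. The function name and the numbers of outcomes are hypothetical.

```python
import random


def false_positive_rate(n_outcomes, n_studies=20000, seed=2):
    """Estimate how often a study of a nonexistent effect finds "something".

    Each study tests `n_outcomes` independent outcomes. Under the null
    hypothesis each z statistic is drawn from a standard normal, and the
    study counts as a positive finding if any |z| exceeds 1.96.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        if any(abs(rng.gauss(0, 1)) > 1.96 for _ in range(n_outcomes)):
            hits += 1
    return hits / n_studies


for k in (1, 3, 5, 10):
    simulated = false_positive_rate(k)
    analytic = 1 - 0.95 ** k  # chance that at least one of k tests is significant
    print(f"{k:2d} outcomes: simulated {simulated:.3f}, analytic {analytic:.3f}")
```

With a single pre-specified outcome, the false-positive rate sits near the nominal 5 percent. With five outcomes to choose from, the analytic rate is 1 - 0.95^5, or about 23 percent, even though no individual test was run improperly.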
A few years ago, researchers managed to publish a few failures to replicate prominent social priming experiments. Partly in response, Nobel Prize–winning cognitive psychologist Daniel Kahneman, whose work has had far-reaching influence in law, government, economics, and many other fields, wrote an open letter to social psychologists working in this area. He counted himself a “general believer” in the effects and noted that he had cited them in his own work. But Kahneman warned of “a train wreck looming” for social priming research. “Colleagues who in the past accepted your surprising results as facts when they were published … have now attached a question mark to the field,” he said, “and it is your responsibility to remove it.”
The recent special issue of Social Psychology was an unprecedented collective effort by social psychologists to do just that—by altering researchers’ and journal editors’ incentives in order to check the robustness of some of the most talked-about findings in their own field. Any researcher who wanted to conduct a replication was invited to preregister: Before collecting any data from subjects, they would submit a proposal detailing precisely how they would repeat the original study and how they would analyze the data. Proposals would be reviewed by other researchers, including the authors of the original studies, and once approved, the study’s results would be published no matter what. Preregistration of the study and analysis procedures should deter p-hacking, guaranteed publication should counteract the file drawer effect, and a requirement of large sample sizes should make it easier to detect small but statistically meaningful effects.
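The sample-size point can be put in rough numbers. The sketch below uses the standard normal approximation for comparing two group means. The effect size of 0.3 standard deviations is a hypothetical value chosen for illustration, and the group sizes of 20 and 80 only loosely echo an original study and a replication about four times as large.

```python
from statistics import NormalDist


def power_two_group(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided comparison of two group means.

    `effect_size` is the true difference in standard-deviation units.
    Uses the normal approximation, so it slightly overstates power for
    very small samples.
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    # Expected z statistic when the true effect is `effect_size`.
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability of landing beyond either critical value.
    return (1 - nd.cdf(z_crit - shift)) + nd.cdf(-z_crit - shift)


for n in (20, 80, 320):
    print(f"n = {n:3d} per group: power = {power_two_group(0.3, n):.2f}")
```

With these assumed inputs, the smaller design would detect the effect well under a quarter of the time, while the design four times larger would detect it a little under half the time, which is still far from certain.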
The results were sobering. At least 10 of the 27 “important findings” in social psychology were not replicated at all. In the social priming area, only one of seven replications succeeded.
* * *
The incivility and personal attacks surrounding this latest replication attempt (and prior ones) may draw the attention of researchers away from where it belongs: on producing the robust science that everyone needs and deserves. Of course, researchers are human beings, not laboratory-dwelling robots, so it’s entirely understandable that some will be disappointed or even feel persecuted when others fail to replicate their research. For that matter, it’s understandable that some replicators will take pride and satisfaction in contributing to the literature by challenging the robustness of a celebrated finding.
But worry over these natural emotional responses should not lead us to rewrite the rules of science. To publish a scientific result is to make a claim about reality. Reality doesn’t belong to researchers, much less to any single researcher, and claims about it need to be verified. Critiques or attempts to replicate scientific claims should always be—and usually are—about reality, not about the researchers who made the claim. In science, as in The Godfather: It’s not personal, it’s business.
One way to keep things in perspective is to remember that scientific truth is created by the accretion of results over time, not by the splash of a single study. A single failure-to-replicate doesn’t necessarily invalidate a previously reported effect, much less imply fraud on the part of the original researcher—or the replicator. Researchers are most likely to fail to reproduce an effect for mundane reasons, such as insufficiently large sample sizes, innocent errors in procedure or data analysis, and subtle factors about the experimental setting or the subjects tested that alter the effect in question in ways not previously realized.
Caution about single studies should go both ways, though. Too often, a single original study is treated—by the media and even by many in the scientific community—as if it definitively establishes an effect. Publications like Harvard Business Review and idea conferences like TED, both major sources of “thought leadership” for managers and policymakers all over the world, emit a steady stream of these “stats and curiosities.” Presumably, the HBR editors and TED organizers believe this information to be true and actionable. But most novel results should be initially regarded with some skepticism, because they too may have resulted from unreported or unnoticed methodological quirks or errors. Everyone involved should focus their attention on developing a shared evidence base that consists of robust empirical regularities—findings that replicate not just once but routinely—rather than of clever one-off curiosities.
Those who create the incentives to produce particular kinds of research should do their part to reset expectations and realign priorities. Funders, for example, should earmark some money for confirming the science they’ve already funded and, as NIH has recently done, consider changes in the way they review grant proposals. Journals, science writers, and academic hiring and promotion committees should view well-conducted exploratory research and well-conducted confirmatory research, including replications, as contributing more equally to our shared knowledge base, and therefore as both worthy of attention. Original authors whose work doesn’t replicate might feel less threatened by that outcome, and those who fail to replicate canonical work might feel less need to take victory laps, if single data points were regarded as just that, and if the rewards for producing these data points—whether through original research or replication, and whether the results are positive or negative—were more evenly distributed.
Another key to depersonalizing the replication wars is to recognize that, contrary to some claims that original authors are unfairly targeted, there are legitimate reasons for focusing limited replication resources on some findings rather than others. There is little point in replicating a random sample of the studies conducted within a single field, since science doesn’t advance by forming general judgments about whole areas of research. It advances by establishing whether a specific effect is true or false—or, more precisely, by establishing the size and reliability of specific effects. As Carl Sagan famously warned, and as students are now taught, extraordinary claims require extraordinary evidence. This means that original findings that are surprising, that feature large effects, or that seem to conflict with other established findings should be subject to greater scrutiny—including replication.
Of course, reasonable people may disagree about what claims are extraordinary. Many social psychologists find social priming results unsurprising, and some of the impetus to replicate them has come from people in a related field, cognitive psychology. This is not because cognitive psychologists habitually prefer tearing things down to building them up, or because they have run out of creative ideas of their own. Rather, some cognitive psychologists find social priming claims to be extraordinary because they seem to conflict with what their own field has found about how the mind and brain work. Science is a grand web of cause-and-effect relationships between concepts, and these relationships must, in the long run, be mutually consistent between fields as well as within them.
Doesn’t this mean that some replicators will be motivated, consciously or not, to fail to replicate original research that conflicts with their favored theory of reality? Sure. But most replication attempts in fact arise from a sincere interest in an original finding and a desire to understand and expand upon it. There’s no reason to think that replicators are any more motivated to “fail” to replicate original findings than are original authors, as a group, to “succeed” in finding evidence of an effect in the first place. Scholars, especially scientists, are supposed to be skeptical about received wisdom, develop their views based solely on evidence, and remain open to updating those views in light of changing evidence. But as psychologists know better than anyone, scientists are hardly free of human motives that can influence their work, consciously or unconsciously. It’s easy for scholars to become professionally or even personally invested in a hypothesis or conclusion. These biases are addressed partly through the peer review process, and partly through the marketplace of ideas—by letting researchers go where their interest or skepticism takes them, encouraging their methods, data, and results to be made as transparent as possible, and promoting discussion of differing views. The clashes between researchers of different theoretical persuasions that result from these exchanges should of course remain civil; but the exchanges themselves are a perfectly healthy part of the scientific enterprise.
This is part of the reason why we cannot agree with a more recent proposal by Kahneman, who had previously urged social priming researchers to put their house in order. He contributed an essay to the special issue of Social Psychology in which he proposed a rule—to be enforced by reviewers of replication proposals and manuscripts—that authors “be guaranteed a significant role in replications of their work.” Kahneman proposed a specific process by which replicators should consult with original authors, and told Science that in the special issue, “the consultations did not reach the level of author involvement that I recommend.”
Collaboration between opposing sides would probably avoid some ruffled feathers, and in some cases it could be productive in resolving disputes. With respect to the current controversy, given the potential impact of an entire journal issue on the robustness of “important findings,” and the clear desirability of buy-in by a large portion of psychology researchers, it would have been better for everyone if the original authors’ comments had been published alongside the replication papers, rather than left to appear afterward. But consultation or collaboration is not something replicators owe to original researchers, and a rule to require it would not be particularly good science policy.
Replicators have no obligation to routinely involve original authors because those authors are not the owners of their methods or results. By publishing their results, original authors state that they have sufficient confidence in them that they should be included in the scientific record. That record belongs to everyone. Anyone should be free to run any experiment, regardless of who ran it first, and to publish the results, whatever they are.
Moreover, there may be downsides to collaboration between originators and replicators of important findings. According to that 2012 survey of the top 100 psychology journals over the past century or so, “replications were significantly less likely to be successful when there was no overlap in authorship between the original and replicating articles.” It’s possible that original authors might correct errors in the replicators’ methodology or data analysis that would otherwise have resulted in a non-replication. But it seems just as likely that original authors could bias the outcome in a variety of ways, mostly pointing toward success. Independent, arm’s-length replication is the best test of the validity and reproducibility of a scientific result; all else equal, replication by or in league with the original researchers will have lesser evidentiary value.
But is all else equal? Kahneman rightly notes that the methods sections of most articles reporting original research are insufficiently detailed to allow third parties to conduct precise replications. As a result, if the goal were to replicate what the original researcher actually did, initial communication with original authors to learn more about their methods would indeed often be necessary for a replication to be scientifically valid.
However, an alternative legitimate goal that does not require consultation with original authors is replicating what the authors reported. That, after all, is what has entered the public record as a claim about reality. If an original study reports that, “if X and Y are done, A occurs,” it will not do for its author to respond to a replicator’s failed attempt to produce A from X and Y by suddenly insisting that another condition, Z, is also critical to reproducing the effect. A good-faith attempt to replicate exactly what was published cannot be criticized as invalid.
Rather than compromise the independence of replications by requiring their authors to consult with original study authors, the scientific standards of the future should require original authors to make all materials necessary for a precise replication of their study publicly available upon publication. This is analogous to the bargain that lies at the heart of U.S. patent law. An inventor is granted a limited monopoly in exchange for publishing a sufficiently precise description of the process of making the invention, and of the “best mode” of using it, so as to enable “any person of ordinary skill in the relevant art” to reproduce and use it herself without “undue experimentation.” In principle, a patent that fails to meet this requirement violates the social bargain and may be invalidated. Scientists who publish their results profit in the form of jobs, promotions, recognition by their peers and the public, and grants. In exchange, they, too, ought to disclose the “best mode” of reproducing the effect they claim to have found.
As in patent law, where it is not always clear who counts as a “person of ordinary skill” in the relevant art, scientists may disagree about who is qualified to conduct a proper replication. For brain imaging, say, years of training are required. But some critics of replication drives have been too quick to suggest that replicators lack the subtle expertise to reproduce the original experiments. One prominent social psychologist has even argued that tacit methodological skill is such a large factor in getting experiments to work that failed replications have no value at all (since one can never know if the replicators really knew what they were doing, or knew all the tricks of the trade that the original researchers did), a surprising claim that drew sarcastic responses. It’s true that brain imaging is a lot more complicated than turning on an MRI scanner and sending your manuscript to Nature. But many psychology experiments are so non-technical, and some findings are so robust, that they can be easily replicated by college and even high school students.
Like all researchers, replicators and those who publish their work have obligations to adhere to standard—not special—procedures of peer-review, information disclosure and sharing, and so on. Replicators and journals should not replace a bias in favor of positive results of original research with a bias in favor of failed replications. Such a “reverse file drawer problem” would be unfair not only to original researchers but to everyone with an interest in the accuracy of the scientific record.
* * *
Psychology has long been a punching bag for critics of “soft science,” but the field is actually leading the way in tackling a problem that is endemic throughout science. The replication issue of Social Psychology is just one example. The Association for Psychological Science is pushing for better reporting standards and more study of research practices, and at its annual meeting in May in San Francisco, several sessions on replication were filled to overflowing. International collaborations of psychologists working on replications, such as the Reproducibility Project and the Many Labs Replication Project (which was responsible for 13 of the 27 replications published in the special issue of Social Psychology), are springing up.
Even the most tradition-bound journals are starting to change. The Journal of Personality and Social Psychology (the same journal that, in 2011, refused to even consider replication studies) recently announced that although replications are “not a central part of its mission,” it’s reversing this policy. We wish that JPSP would see replications as part of its central mission and not relegate them, as it has, to an online-only ghetto, but this is a remarkably nimble change for a 50-year-old publication. Other top journals, most notable among them Perspectives on Psychological Science, are devoting space to systematic replications and other confirmatory research. The leading journal in behavior genetics, a field that has been plagued by unreplicable claims that particular genes are associated with particular behaviors, has gone even further: It now refuses to publish original findings that do not include evidence of replication.
A final salutary change is an overdue shift of emphasis among psychologists toward establishing the size of effects, as opposed to disputing whether or not they exist. The very notion of “failure” and “success” in empirical research is urgently in need of refinement. When applied thoughtfully, this dichotomy can be useful shorthand (and we’ve used it here). But there are degrees of replication between success and failure, and these degrees matter.
For example, suppose an initial study of an experimental drug for cardiovascular disease suggests that it reduces the risk of heart attack by 50 percent compared to a placebo pill. The most meaningful question for follow-up studies is not the binary one of whether the drug’s effect is 50 percent or not (did the first study replicate?), but the continuous one of precisely how much the drug reduces heart attack risk. In larger subsequent studies, this number will almost inevitably drop below 50 percent, but if it remains above 0 percent for study after study, then the best message should be that the drug is in fact effective, not that the initial results “failed to replicate.”
Maybe a drug that consistently reduces cardiovascular risk by, say, 2 percent lacks enough practical value to offset its costs; that’s for patients, doctors, and payers to decide. And not every scientific question concerns effect size; sometimes the point of an experiment is just to show that two processes or relationships are not identical, in which case it doesn’t much matter exactly how different they are. But measuring and calibrating effects is a vital task for any science that aspires to real-world relevance, as psychology should. Of the 17 studies in the special issue where replication “succeeded,” five found smaller effect sizes than the original studies reported. A 2 percent effect can have a much different meaning from a 10 percent or 50 percent effect. If Milgram had shown that only 2 percent of people were willing to shock to 450 volts, rather than more than 50 percent, would we care as much?
Besides muddying the scientific waters, excessive focus on the binary question of failure versus success may generate more heat than light, as #repligate suggests. According to the Science article on the special replication issue, several authors of original studies described the replication process as “bullying.” But a different view was offered by another researcher, Eugene Caruso of the University of Chicago, who reported in 2013 that priming subjects by exposing them to the sight of money made them more accepting of societal norms. This result also “failed” to replicate. Caruso acknowledged that the outcome “was certainly disappointing at a personal level,” but added, “when I take a broader perspective, it’s apparent that we can always learn something from a carefully designed and executed study.” This is exactly the broader view of success that everyone with a stake in good science should keep in mind.