Sloppy Science

Are sketchy practices in the lab to blame for the replication crisis in psychology research?


David Peterson would like to know why psychology has fallen into crisis. The discipline now seems rife with shoddy data. A recent, large-scale effort to reproduce experiments found that more than half of 100 major papers could not be replicated. Even certain bedrock findings—including those that spawned entire subfields of research—now appear to be unstable. What, exactly, led us to this point?

Peterson has some clues. The graduate student in sociology at Northwestern University has spent parts of the past four years conducting an ethnographic study of about a dozen different research labs. The subjects of his research were at times distressing in their honesty. “You want to know how it works?” one graduate student told him. “We have a bunch of half-baked ideas. We run a bunch of experiments. Whatever data we get, we pretend that’s what we were looking for.”

That quote comes from Peterson’s most recent academic paper, published in January, which describes the 16 months he spent visiting three developmental psychology labs. The behavior he documents provides some useful insight into how rank-and-file scientists bend the rules of research in subtle and unsubtle ways.

Much of Peterson’s paper focuses on a common experimental method in which researchers attempt to confuse a baby. In a typical setup, a mother sits with her child on her lap as various toys or pictures are presented on a makeshift stage or computer screen. Some of these scenarios make sense: For example, two 4-inch Mickey Mouse dolls might be placed behind a screen, one after another, before the screen is pulled away to reveal both dolls. Other scenarios are meant to maximize baby befuddlement: The same two dolls get placed behind the screen, let’s say, but when the screen is pulled away, only one remains. If the baby gawks at the latter version of the stimulus—if he stares longer at the doll behind the screen—then the researchers infer that he’s confused. Confusion here functions as a sign of comprehension, since the baby would only be thrown off if he understood that something weird had taken place.

More than 20 years ago, a developmental psychologist named Karen Wynn ran a study very much like the one described above, with little Mickey Mouse dolls, and showed that even 5-month-olds have a grasp of simple arithmetic. (The babies knew that something was amiss, i.e., they gawked when one of the dolls disappeared.) As Peterson points out, this work has proved reliable: It’s been replicated many times since then. But his observations suggest that even this venerable laboratory protocol requires more futzing and interpretation than one might assume from reading published papers.

For one thing, it’s very hard to track a baby’s gaze. In the labs where Peterson embedded, researchers trained cameras on the babies’ faces, and then two people were supposed to analyze the footage independently, to gauge how much time each infant spent looking at the stimuli. Peterson found that these “independent coders” were nearly always sitting in the same room, and they often conferred with each other. The video footage could be ambiguous: Was the baby gazing at the stimuli or just zoning out while looking straight ahead? At times the babies shifted on their parents’ laps such that their eyes were out of frame, and coders had to figure out where they were looking by, for instance, the angle of their chins. So they’d check in with each other to make sure they were on the same page.

The babies would have tantrums, too, and bring experiments to an early finish or a temporary pause. Or else they’d fall asleep. Either way, researchers had to figure out—or decide ad hoc—when to toss the data from a session and when to try to make the best of it. There were other problems, too. In principle, the parents were supposed to keep their eyes closed, so as not to influence their children with their own reactions to the dolls and pictures. But according to Peterson, this instruction could be overlooked without consequence. He quotes one experimenter’s instruction to a mother: “During the trial, we ask you to close your eyes. That’s just for the journals so we can say you weren’t directing her attention. But you can peek if you want to. It’s not a big deal. But there’s not much to see.”

The psychologists also started analyzing data before their studies were complete. They’d run a handful of babies through the test and then check for “significant results.” If things were looking good, the study would proceed; otherwise, the scientists would change things up or move on to another project. And Peterson observed that the scientists sometimes came up with stories to explain their data after the fact. They’d get a certain finding, then reverse-engineer a hypothesis to explain it. In one meeting, he reports, a student told her mentor that she’d forgotten her original motivation for the study. “You don’t have to reconstruct your logic,” the mentor said. “You have the results now. If you can come up with an interpretation that works, that will motivate the hypothesis.”
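To get a feel for why that kind of early peeking matters, consider a minimal simulation sketch in Python. It is not Peterson's analysis, and it is not any lab's actual procedure. It assumes a world in which the effect under study does not exist, treats each baby as one standardized looking-time score, and uses a simple z-test. The function names, the pilot size of 8, the full sample of 24, and the "looking good" cutoff are all hypothetical choices made for illustration.

```python
import math
import random


def p_value_two_sided(z):
    """Two-sided p-value for a z statistic under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))


def z_stat(samples):
    """z statistic for the mean of unit-variance samples tested against zero."""
    return sum(samples) / math.sqrt(len(samples))


def simulate(n_studies=20000, pilot_n=8, full_n=24,
             promising_z=1.0, alpha=0.05, seed=0):
    """Return (fixed_rate, screened_rate) when no true effect exists.

    fixed_rate:    share of all studies that reach p < alpha when every
                   study runs to full_n babies and is tested once.
    screened_rate: share reaching p < alpha among only those studies whose
                   first pilot_n babies looked promising and were continued.
    """
    rng = random.Random(seed)
    fixed_hits = 0
    completed = 0
    screened_hits = 0
    for _ in range(n_studies):
        # Assumption: each baby contributes one standardized looking-time
        # difference, drawn from N(0, 1) because there is no real effect.
        data = [rng.gauss(0.0, 1.0) for _ in range(full_n)]
        significant = p_value_two_sided(z_stat(data)) < alpha
        if significant:
            fixed_hits += 1
        # Peek at the pilot babies; carry on only if the trend looks good.
        if z_stat(data[:pilot_n]) > promising_z:
            completed += 1
            if significant:
                screened_hits += 1
    return fixed_hits / n_studies, screened_hits / completed


if __name__ == "__main__":
    fixed_rate, screened_rate = simulate()
    print(f"false positives, every study run to completion: {fixed_rate:.3f}")
    print(f"false positives, only promising pilots continued: {screened_rate:.3f}")
```

Under these assumptions, the studies that survive the early check should turn up "significant" results noticeably more often than the nominal 5 percent, even though nothing real is being measured. The pilot babies are counted twice, once to decide whether the study goes on and again in the final test.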

I called up Peterson to ask about—and, let’s be honest, to lament—all these cringe-y observations. But he was reluctant to construe his paper as an indictment of psychology. “To some degree the article has been taken as a gotcha piece of journalism, but I’m a little uncomfortable with that designation,” he told me. Take the so-called independent coders, who talked to each other as they analyzed the video. You could describe that approach as being “completely illegitimate,” he said, but you could also see it as a necessary form of calibration for a very messy kind of measurement. “That’s a more nuanced way of understanding it,” he said.

You could view the other sketchy laboratory practices with similar generosity. Babies do not make easy research subjects: Unlike adults, they won’t follow your instructions, but unlike, say, lab mice, there are very strict limits on how you can interact with them. (Peterson notes that in the labs he visited, the researchers almost never even touched the infants they were studying.) Given these constraints, it’s only natural that more formal rules of research—such as they’re understood—would have to be adapted or adjusted.

Indeed, an earlier generation of sociologists had come to similar conclusions based on their own ethnographic forays into labs. That first wave of science-studies scholars, who did their research in the 1970s and 1980s, was more interested in the harder sciences. Even there, they found, research practice could seem a little sloppy. The sociologists observed that lab work is often artisanal and based in part on informal skills and modes of thinking. What’s more, different communities of scientists, in different fields of study, might develop their own standards for what counts as evidence and proof. For example, a study of genetics using mice might require larger sample sizes, according to the customs of that field, than a study of behavior using monkeys.

For a study from 2001, researchers in Wales interviewed dozens of biochemists to better understand the shift from classroom science, where all the rules of research seem very clear, to laboratory work, where rules are ill-defined and answers often indistinct. The authors argue that this creates a kind of “reality shock” for young scientists, who must learn to grapple with the practice of science in the real world.

I had that experience myself, during my brief career as a graduate student in neurobiology. Before I started in my program, I’d imagined research as a cerebral exercise—a test of judgment, expertise, and critical analysis. Those were all important factors in the work, but they often felt secondary to the learning of less lofty skills: how to train a kitten, how to carve the cerebellum from a bird (without killing it), how to keep a plate of cells alive in culture, how to build a psychophysics rig from 80/20. Even after months of practice and no small amount of progress, I could tell that many of my classmates would always outperform me. It was as if they had a golden touch for doing research, or I had a leaden one. They knew how to make things work by intuition. They had a better feel for when and how to tweak a recipe or let things get a little fuzzy.

It could be that the same implicit knowledge, the subtle art of doing science, is at work in psychology labs. Earlier this month, I wrote about the psychologist Roy Baumeister, whose famous work on “ego depletion” recently failed to replicate in a massive, global study. Baumeister didn’t think the replication study meant that much, since it had been done on a computer. For his original research, he said, he’d had his subjects complete their tasks with pencil and paper. When you strip away that artisanal, hands-on approach, you’re much more likely to mess up. “If you go back to the early days of social psychology,” he said, “there was quite a bit of engineering the procedures just right to get people to perceive things and experiences things in a certain way.”

Are these hand-made “calibrations” in psychology any more egregious than the futzing that goes on in other kinds of science labs? For a paper published last fall, Peterson compared his experience in the three developmental psychology labs with studies of three labs in other disciplines—two in social psychology and one in the molecular biology of vision. In the end he found that the molecular biology lab performed its work with more scientific rigor. “There’s this ongoing, low-level replication that’s constantly happening,” he explained. As each experiment developed, the researchers would continually update their techniques; they would test and then apply new methods to their work, and then they would verify their progress as they went.

The psychologists didn’t have the same opportunity to fiddle with technology or to refine their work with better methods. They couldn’t intervene with their human subjects in the same way that a biologist could mess around with mice or plated cells. Peterson found that both the molecular biologists and psychologists confronted uncertainties in the course of doing research, but it was the biologists, he said, who had the tools to engage with that confusion and then to help resolve it.

In other words, he’s saying that familiar biases about the differences between “hard” and “soft” science should be taken seriously. Psychology is not the only field to suffer from replication problems—they’ve cropped up lately in cancer research, among other fields—but Peterson’s work suggests that there could be good reasons why the failures of psychology have been the most apparent to this point.

But even if Peterson is wrong and psychology isn’t any more likely than other fields to generate false positives in its research literature, we might still find psychologists at the center of the replication crisis. Psychology research is more likely to be caught failing in replication for the simple reason that psychology studies are relatively cheap and easy to reproduce. In many cases (to Roy Baumeister’s chagrin), all you need to attempt a replication is access to a personal computer and a willing pool of Internet volunteers. By comparison, replicating a study in molecular biology is extremely expensive and time-consuming.

Psychologists may also be better equipped to diagnose and understand the replication problem. As a rule, they’re more invested in the study of statistics and more inclined to pioneer new ways of evaluating the research literature for signs of widespread bias. And they’re trained from early on to understand how individuals might fall victim to subtle social cues and hidden motivations. Perhaps psychology is at the center of this crisis because psychologists are themselves so highly trained at introspection.