I don’t know how else to put it. David Peterson, a sociologist, recently published an ethnographic study of 3 infant cognition labs. Titled “The Baby Factory: Difficult Research Objects, Disciplinary Standards, and the Production of Statistical Significance,” it recounts his time spend as a participant observer in those labs, attending lab meetings and running subjects.
In his own words, Peterson “shows how psychologists produce statistically significant results under challenging circumstances by using strategies that enable them to bridge the distance between an uncontrollable research object and a professional culture that prizes methodological rigor.” The account of how the labs try to “bridge the distance” reveals one problematic practice after another, in a way that sometimes makes them seem like normal practice and no big deal to the people in the labs. Here are a few examples.
Protocol violations that break blinding and independence:
…As a routine part of the experiments, parents are asked to close their eyes to prevent any unconscious influence on their children. Although this was explicitly stated in the instructions given to parents, during the actual experiment, it was often overlooked; the parents’ eyes would remain open. Moreover, on several occasions, experimenters downplayed the importance of having one’s eyes closed. One psychologist told a mother, “During the trial, we ask you to close your eyes. That’s just for the journals so we can say you weren’t directing her attention. But you can peek if you want to. It’s not a big deal. But there’s not much to see.”
Optional stopping based on data peeking:
Rather than waiting for the results from a set number of infants, experimenters began “eyeballing” the data as soon as babies were run and often began looking for statistical significance after just 5 or 10 subjects. During lab meetings and one-on-one discussions, experiments that were “in progress” and still collecting data were evaluated on the basis of these early results. When the preliminary data looked good, the test continued. When they showed ambiguous but significant results, the test usually continued. But when, after just a few subjects, no significance was found, the original protocol was abandoned and new variations were developed.
Invalid comparisons of significant to nonsignificant:
Because experiments on infant subjects are very costly in terms of both time and money, throwing away data is highly undesirable. Instead, when faced with a struggling experiment using a trusted experimental paradigm, experimenters would regularly run another study that had higher odds of success. This was accomplished by varying one aspect of the experiment, such as the age of the participants. For instance, when one experiment with 14-month-olds failed, the experimenter reran the same study with 18-month-olds, which then succeeded. Once a significant result was achieved, the failures were no longer valueless. They now represented a part of a larger story: “Eighteen-month-olds can achieve behavior X, but 14-month-olds cannot.” Thus, the failed experiment becomes a boundary for the phenomenon.
When a clear and interesting story could be told about significant findings, the original motivation was often abandoned. I attended a meeting between a graduate student and her mentor at which they were trying to decipher some results the student had just received. Their meaning was not at all clear, and the graduate student complained that she was having trouble remembering the motivation for the study in the first place. Her mentor responded, “You don’t have to reconstruct your logic. You have the results now. If you can come up with an interpretation that works, that will motivate the hypothesis.”
A blunt explanation of this strategy was given to me by an advanced graduate student: “You want to know how it works? We have a bunch of half-baked ideas. We run a bunch of experiments. Whatever data we get, we pretend that’s what we were looking for.” Rather than stay with the original, motivating hypothesis, researchers in developmental science learn to adjust to statistical significance. They then “fill out” the rest of the paper around this necessary core of psychological research.
Peterson discusses all this in light of recent discussions about replicability and scientific practices in psychology. He says that researchers have basically 3 choices: limit the scope of your questions to what you can do well with available methods, relax our expectations of what a rigorous study looks like, or engage in QRPs. I think that is basically right. It is why I believe that any attempt to reduce QRPs has to be accompanied by changes to incentive structures, which govern the first two.
Peterson also suggests that QRPs are “becoming increasingly unacceptable.” That may be true in public discourse, but the inside view presented by his ethnography suggests that unless more open practices become standard, labs will continue to have lots of opportunity to engage in them and little incentive not to.
UPDATE: More of my own reactions in this followup post: Reading “The Baby Factory” in context.