An eye-popping ethnography of three infant cognition labs

I don’t know how else to put it. David Peterson, a sociologist, recently published an ethnographic study of three infant cognition labs. Titled “The Baby Factory: Difficult Research Objects, Disciplinary Standards, and the Production of Statistical Significance,” it recounts his time spent as a participant observer in those labs, attending lab meetings and running subjects.

In his own words, Peterson “shows how psychologists produce statistically significant results under challenging circumstances by using strategies that enable them to bridge the distance between an uncontrollable research object and a professional culture that prizes methodological rigor.” The account of how the labs try to “bridge the distance” reveals one problematic practice after another, practices that the people in the labs often seem to regard as normal and no big deal. Here are a few examples.

Protocol violations that break blinding and independence:

…As a routine part of the experiments, parents are asked to close their eyes to prevent any unconscious influence on their children. Although this was explicitly stated in the instructions given to parents, during the actual experiment, it was often overlooked; the parents’ eyes would remain open. Moreover, on several occasions, experimenters downplayed the importance of having one’s eyes closed. One psychologist told a mother, “During the trial, we ask you to close your eyes. That’s just for the journals so we can say you weren’t directing her attention. But you can peek if you want to. It’s not a big deal. But there’s not much to see.”

Optional stopping based on data peeking:

Rather than waiting for the results from a set number of infants, experimenters began “eyeballing” the data as soon as babies were run and often began looking for statistical significance after just 5 or 10 subjects. During lab meetings and one-on-one discussions, experiments that were “in progress” and still collecting data were evaluated on the basis of these early results. When the preliminary data looked good, the test continued. When they showed ambiguous but significant results, the test usually continued. But when, after just a few subjects, no significance was found, the original protocol was abandoned and new variations were developed.
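This kind of data peeking is not harmless. A quick simulation (with made-up numbers, not from the paper: a one-sample t-test on a looking-time difference, no true effect, peeking after every 5 babies up to 30) sketches how optional stopping inflates the false-positive rate well past the nominal 5%:

```python
# Sketch: optional stopping inflates the false-positive rate.
# Hypothetical setup: true effect is zero, but we "eyeball" the data
# after every 5 subjects and stop as soon as p < .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def peeking_finds_significance(max_n=30, peek_every=5):
    data = rng.standard_normal(max_n)  # the null is true: no real effect
    for n in range(peek_every, max_n + 1, peek_every):
        p = stats.ttest_1samp(data[:n], 0).pvalue
        if p < .05:
            return True  # stop early and declare success
    return False

n_sims = 2000
hits = sum(peeking_finds_significance() for _ in range(n_sims))
rate = hits / n_sims
print(f"False-positive rate with peeking: {rate:.3f}")  # well above .05
```

With six looks at the data, the long-run false-positive rate runs at roughly two to three times the advertised 5%, even though every individual test “looks” legitimate.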

Invalid comparisons of significant to nonsignificant:

Because experiments on infant subjects are very costly in terms of both time and money, throwing away data is highly undesirable. Instead, when faced with a struggling experiment using a trusted experimental paradigm, experimenters would regularly run another study that had higher odds of success. This was accomplished by varying one aspect of the experiment, such as the age of the participants. For instance, when one experiment with 14-month-olds failed, the experimenter reran the same study with 18-month-olds, which then succeeded. Once a significant result was achieved, the failures were no longer valueless. They now represented a part of a larger story: “Eighteen-month-olds can achieve behavior X, but 14-month-olds cannot.” Thus, the failed experiment becomes a boundary for the phenomenon.
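The trouble with that story is that a significant result in one group and a nonsignificant result in another is weak evidence of any real difference between the groups. A simulation sketch (again with hypothetical numbers: both “age groups” share the same true effect, d = 0.4, n = 16 per study) shows how often the pattern arises by chance alone:

```python
# Sketch: "significant in one group, not the other" happens frequently
# even when the two groups are identical. Hypothetical parameters:
# same true effect (d = 0.4) and same sample size (n = 16) in both studies.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def study_succeeds(d=0.4, n=16):
    # One underpowered study: one-sample t-test against zero
    sample = rng.standard_normal(n) + d
    return stats.ttest_1samp(sample, 0).pvalue < .05

n_sims = 2000
# How often does one study "succeed" while its twin "fails"?
split = sum(study_succeeds() != study_succeeds() for _ in range(n_sims))
frac = split / n_sims
print(f"One significant, one not: {frac:.2f} of identical pairs")
```

At this power level, a substantial fraction of pairs of identical studies produce the “18-month-olds can, 14-month-olds can’t” pattern, so the boundary claim needs a direct test of the difference, not a comparison of p-values.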

And HARKing:

When a clear and interesting story could be told about significant findings, the original motivation was often abandoned. I attended a meeting between a graduate student and her mentor at which they were trying to decipher some results the student had just received. Their meaning was not at all clear, and the graduate student complained that she was having trouble remembering the motivation for the study in the first place. Her mentor responded, “You don’t have to reconstruct your logic. You have the results now. If you can come up with an interpretation that works, that will motivate the hypothesis.”

A blunt explanation of this strategy was given to me by an advanced graduate student: “You want to know how it works? We have a bunch of half-baked ideas. We run a bunch of experiments. Whatever data we get, we pretend that’s what we were looking for.” Rather than stay with the original, motivating hypothesis, researchers in developmental science learn to adjust to statistical significance. They then “fill out” the rest of the paper around this necessary core of psychological research.

Peterson discusses all this in light of recent debates about replicability and scientific practices in psychology. He says that researchers basically have three choices: limit the scope of our questions to what we can do well with available methods, relax our expectations of what a rigorous study looks like, or engage in QRPs. I think that is basically right. It is why I believe that any attempt to reduce QRPs has to be accompanied by changes to the incentive structures that govern the first two.

Peterson also suggests that QRPs are “becoming increasingly unacceptable.” That may be true in public discourse, but the inside view presented by his ethnography suggests that unless more open practices become standard, labs will continue to have lots of opportunity to engage in them and little incentive not to.

UPDATE: I discuss what all this means in a follow-up post: Reading “The Baby Factory” in context.

7 thoughts on “An eye-popping ethnography of three infant cognition labs”

  1. This reminds me of Bruno Latour’s “Laboratory Life” in which he similarly “embedded” himself in a lab, and subsequently made clear the gaps between protocol and practice. I think a lot of those critiques from Latour and his field get ignored by scientists who tend to think of them as the ravings of nutbar French postmodernists, so I’m curious to see whether this paper gets any traction, coming as it does from an “actual scientist.”

    I coded looking time data once, and it was certainly unpleasant, but we never used the chin as a proxy for the eyes.

  2. Without a doubt, a few things that are described in this article are appalling and clearly do not reflect good research practices. Yet for plenty of what this author describes, it is entirely unclear whether anything problematic actually occurred – he appears to dislike piloting as a practice, implies that any imperfect data necessarily makes one more likely to achieve statistical significance (I wish), and several times seems to just assume something nefarious was going on when he had no clear reason to know whether it was or not. The whole thing reads as so biased and one-sided that I have trouble believing anyone would give it any credence at all.

    My lab, and most baby labs that I know, spend hours and hours and hours each year discussing bias and how to prevent it – including big discussions on how the fact that we keep everyone blind to everything means it’s critical that no one be afraid to mention when something seems “off” to them, as even the “experts” in the lab are always ignorant of something. That one person with a clear axe to grind would be given license to trash an entire field based on assumptions of nefariousness and a few crass comments by a grad student, and that this trashing would be posted and re-posted and assumed to be true, is to me far more troubling than the few clear errors that were raised in this article.

    1. This ethnography isn’t trying to present new evidence about what’s typical in a population. That’s not really what ethnographies are good for and I don’t see that as the author’s aim. We already have all kinds of information available elsewhere about the prevalence of problematic practices in psychology. Some of that is referenced in the article.

      What this article is doing is what ethnographies do: thick description, context, and interpretation. It’s showing what problematic practices look like when they happen. And it is interpreting them in relation to the larger forces around them – the expectations and incentives that scientists face, the practical challenges of research, etc.

      I do think it’s reasonable to ask how much this is really specific to infant research. Personally I suspect that the relevant factors — like having “difficult research objects,” scarce resources, and facing unrealistic expectations of what good research looks like — apply to a lot of areas of psychology and other fields as well.

  3. The real problem here is the use of the “scientific method” with/upon objects of study that aren’t amenable to that form of inquiry. Psychology, sociology, and anthropology each posit the existence of some structure that has no “real” substance as do the objects of the “hard” sciences such as physics or chemistry. The mind, society, and “man” are useful constructs no doubt, but these constructs cannot bear analyses devised to study the physical universe. Furthermore, it has been suggested by no less than Foucault that psychology, sociology, anthropology and the rest of the “human sciences” are really no more than inadvertent studies into the structures of the human subconscious and/or larger culture, given that they are entirely dependent upon language in the deployment of their purported “scientific” methods. Finally, experimental psychology tends to stubbornly model itself after the physical sciences, which compounds these errors, while some in the sociological and anthropological disciplines at least recognize the “failings” of their domains of knowledge in terms of the matters discussed above.

  4. It strikes me that the p < .05 really is causing problems here. I feel like most of the scientific stories people tell around their results are unreliable–that actually seems like the most unreliable part of the replication crisis. But if experiments were designed thoughtfully, then it would be interesting to know what happens either way. Rich descriptions of what happened might be helpful, and having some confidence interval around an effect–even if it includes zero on one side–might give useful information. Maybe it contributes to a meta-analysis down the line. Maybe it helps you rule out certain magnitudes of effects. But the p < .05 barrier to publication does seem to be leading to the shifty behavior. I'm not typically a fan of banning p-values, but it does seem like they are the ultimate funnel here.

    1. Daniël Lakens just put up an interesting analysis of the BASP p-value ban. He concludes that it isn’t helping:

      If you just ban reporting of p-values, but don’t change other things, then you’re probably just providing less information about the same research as before. I think you nailed 2 other important factors: good design that’s going to be informative either way, and loosening the dependence of reporting and publication decisions on significance. (My guess is that BASP didn’t do away with that dependence, it just created a new variety: “if .05 < p < .25 send to BASP”). If you do both of those things, then I bet reporting p-values is likely to be much less of an issue.

  5. Thanks for posting this, Sanjay — I wouldn’t have come across this (fascinating) paper otherwise. The description of dodgy research practices in the article is certainly chilling, but I also agree with your assessment in your later post that the author is overall sympathetic to the challenges faced by these researchers. Reading the paper, it seemed to me that the underlying problem, in this and other areas, is that people aren’t content with descriptive or qualitative studies of difficult systems, but desperately want quantification, even if it isn’t warranted. (“People” doesn’t necessarily refer to the scientists, but the larger scientific community.) A noisy, small-N system (like a bunch of infants) may be great to carefully observe and think deeply about, and one could presumably learn a lot from this, but this isn’t considered sufficient — rather people want p-values, “significance,” etc., and are surprised when this leads to bad science.

    The other thing I liked about the paper is simply that the author embedded himself among scientists for a while. I’ve long felt that there should be a lot more of this sort of anthropological study of the day-to-day operation of scientific “culture,” especially given the importance of science in our society. I forwarded your post to some people I know who might get a chance to be studied in a similar way, encouraging them to do it! (Though I realize the airing of dirty laundry described here might turn them off!)
