Some reflections on the Bargh-Doyen elderly walking priming brouhaha

Recently a controversy broke out over the replicability of a study John Bargh et al. published in 1996. The study reported that unconsciously priming a stereotype of elderly people caused subjects to walk more slowly. A recent replication attempt by Stephane Doyen et al., published in PLoS ONE, was unable to reproduce the results. (Less publicized, but surely relevant, is another non-replication by Hal Pashler et al.) Ed Yong wrote up an article about it in Discover, which last week drew a sharp response from Bargh.

The broader context is that there has been a large and ongoing discussion about replication in psychology (i.e., that there isn’t enough of it). I don’t have much to say about whether the elderly-walking effect is real. But this controversy has raised a number of issues about scientific discourse online as well as about how we think about replication.

The discussion has been unnecessarily inflammatory – on all sides. Bargh has drawn a lot of criticism for his response, which among other things included factual errors about PLoS ONE, suggestions that Doyen et al. were “incompetent or ill-informed,” and a claim that Yong was practicing irresponsible journalism. The PLoS ONE editors posted a strongly worded but civil response in the comments, and Yong has written a rebuttal. As for the scientific issue — is the elderly-priming effect real? — Daniel Simons has written an excellent post on the many, many reasons why an effect might fail to replicate. A failure to replicate does not need to impeach the honesty or scientific skills of either the original researcher or the replicator. It does not even mean the effect is not real. In an ideal world, Bargh should have treated the difference between his results and those of Doyen et al. as a puzzle to be worked out, not as a personal attack to be responded to in kind.

But… it’s not as though Bargh went bananas over a dispassionate report of a non-replication. Doyen et al. strongly suggested that Bargh et al.’s procedure had been contaminated by expectancy effects. Since expectancy effects are widely known about in behavioral science (raise your hand if you have heard the phrase “double-blind”), the implication was that Bargh had been uncareful. And Ed Yong ran with that interpretation by leading off his original piece with the tale of Clever Hans. I don’t know whether Doyen or Yong meant to be inflammatory: I know nothing about Doyen; and in Yong’s case, based on his journalistic record, I doubt it (and he apparently gave Bargh plenty of opportunity to weigh in before his original post went live). But wherever you place the blame, a scientifically unfortunate result is that all of the other reasonable possibilities that Simons lists have been mostly ignored by the principals in this discussion.

Are priming effects hard to produce or easy? A number of priming researchers have suggested that priming effects are hard to get reliably. This doesn’t mean they aren’t important — experiments require isolation of the effect of interest, and the ease of isolating a phenomenon is not the same thing as its importance. (Those Higgs bosons are so hard to detect — so even if they exist they must not matter, right?) Bargh makes this point in his response too, suggesting that if Doyen et al. accidentally called subjects’ conscious attention to the elderly stereotype, that could wash out the effect (because conscious attention can easily interfere with automatic processes).

That being said… the effects in the original Bargh et al. report were big. Really big, by psychology standards. In experiment 2a, Bargh et al. report t(28) = 2.86, which corresponds to an effect size of d = 1.08. And in their replication, experiment 2b, they report t(28) = 2.16, which translates to d = 0.82. So even if we account for some shrinkage, under the right conditions it should not be hard for somebody to reproduce the elderly-walking priming effect in a new study.
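For readers who want to check the arithmetic, here is a minimal sketch of the t-to-d conversion. It assumes two independent groups of equal size, in which case d is approximately 2t divided by the square root of the degrees of freedom. The function name is my own, chosen for illustration.

```python
import math

def cohens_d_from_t(t, df):
    """Approximate Cohen's d from an independent-samples t statistic.

    Assumes two groups of equal size, so d = 2t / sqrt(df).
    """
    return 2 * t / math.sqrt(df)

# t values and degrees of freedom as reported for Bargh et al. experiments 2a and 2b
for label, t in [("2a", 2.86), ("2b", 2.16)]:
    print(f"Experiment {label}: d = {cohens_d_from_t(t, 28):.2f}")
```

With unequal group sizes the conversion would need the actual group ns, so treat these as approximations.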

The expectancy effects study is rhetorically powerful but proves little. In their Experiment 1, Doyen et al. tested the same hypothesis about priming stereotypes that Bargh tested. But in Experiment 2, Doyen et al. tested a hypothesis about experimenter expectancies. That is a completely different hypothesis. The second study tells us that experimenter expectancies can affect walking speed. But walking speed surely can be affected by more than one thing. So Experiment 2 does not tell us to what extent, if any at all, differences in walking speed were caused by experimenter expectancies in Bargh’s experiment (or for that matter, anywhere else in the natural world outside of Doyen’s lab). This is the inferential error of confusing causes of effects with effects of causes. Imagine that Doyen et al. had clubbed the subjects in the elderly-prime condition in the knee; most likely that would have slowed them down. But would we take that as evidence that Bargh et al. had done the same?

The inclusion of Experiment 2 served a strong rhetorical function, by planting in the audience’s mind the idea that the difference between Bargh vs Doyen Exp 1 was due to expectancy effects (and Ed Yong picked up and ran with this suggestion by referring to Clever Hans). But scientifically, all it shows is that expectancy effects can influence the dependent variable in the Bargh experiment. That’s not nothing, but anybody who already believes that experiments need to be double-blind should have seen that coming. If we had documentary evidence that in the actual 1996 studies Bargh et al. did not actually eliminate expectancy effects, that would be relevant. (We likely never will have such evidence; see next point.) But Experiment 2 does not shed nearly as much light as it appears to.

We need more openness with methods and materials. When I started off in psychology, someone once told me that a scientific journal article should contain everything you need to reproduce the experiment (either directly or via references to other published materials). That, of course, is almost never true and maybe is unrealistic. Especially when you factor in things like lab skills, many of which are taught via direct apprenticeship rather than in writing, and which matter just as much in behavioral experiments as they do in more technology-heavy areas of science.

But with all that being said, I think we could do a lot better. A big part of the confusion in this controversy is over the details of methods — what exactly did Bargh et al. do in the original study, and how closely did Doyen et al. reproduce the procedure? The original Bargh et al. article followed the standards of its day in how much methodological detail it reported. Bargh later wrote a methods chapter that described more details of the priming technique (and which he claims Doyen et al. did not follow). But in this era of unlimited online supplements, there is no reason why in future studies, all of the stimuli, instructions, etc. could not be posted. That would enormously aid replication attempts.

What makes for a “failed” replication? This turns out to be a small point in the present context but an important one in a more general sense, so I couldn’t help but make it. We should be very careful about the language of “successful” and “failed” replications when it is based on the difference between p<.05 and p>.05. That is, just because the original study could reject the null and the replication could not, that doesn’t mean that the replication is significantly different from the original study. If you are going to say you failed to replicate the original result, you should conduct a test of that difference.

As far as I can tell neither Doyen et al. nor Pashler et al. did that. So I did. I converted each study’s effect to an r effect size and then compared the studies with a z test of the difference between independent rs, and indeed Doyen et al. and Pashler et al. each differed from Bargh’s original experiments. So this doesn’t alter the present discussion. But as good practice, the replication reports should have reported such tests.
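Here is a rough sketch of what that kind of test looks like, using Fisher's r-to-z transformation. The original-study numbers are the t and df reported for Bargh et al. experiment 2a. The replication numbers are made-up placeholders, not the actual values from Doyen et al. or Pashler et al., and the function names are my own.

```python
import math

def r_from_t(t, df):
    """Convert a t statistic to an r effect size: r = sqrt(t^2 / (t^2 + df))."""
    return math.copysign(math.sqrt(t * t / (t * t + df)), t)

def fisher_z_difference(r1, n1, r2, n2):
    """z test of the difference between two independent correlations.

    Each r is Fisher-transformed (atanh); the standard error of the
    difference is sqrt(1/(n1 - 3) + 1/(n2 - 3)). Returns (z, two-tailed p).
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-tailed normal p-value
    return z, p

# Original study: t(28) = 2.86, so N = 30 across two groups
r_original = r_from_t(2.86, 28)

# HYPOTHETICAL replication: t(118) = 0.20, N = 120 (placeholder values only)
r_replication = r_from_t(0.20, 118)

z, p = fisher_z_difference(r_original, 30, r_replication, 120)
print(f"r_original = {r_original:.3f}, r_replication = {r_replication:.3f}")
print(f"z = {z:.2f}, two-tailed p = {p:.4f}")
```

The point of the exercise is that the comparison is between the two effect sizes, not between each study and zero. A replication with p > .05 can still be statistically indistinguishable from the original if its sample is small.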

How a sexist environment affects women in engineering

Women in traditionally male-dominated fields like math and engineering face the extra burden that their performance, beyond reflecting on them individually, might be taken as broader confirmation of stereotypes if they perform poorly. A newly published series of experiments by Christine Logel and colleagues tested the effects of such stereotype threat among engineering students.

Standardized observations showed that male engineering students who had previously expressed subtle sexist attitudes on a pretest were more likely, when talking with a female engineering student about work issues, to adopt a domineering posture and to display signs of sexual interest (such as noticeably looking at the woman’s body).

In the next two experiments, female engineering students were randomly assigned to an interaction partner: in one experiment, males who had endorsed different levels of subtle sexism, and in the other, an actor who either displayed or did not display the domineering/sexual nonverbal behaviors. Women performed worse on an engineering test after interacting with the randomly assigned sexist males (or males simulating sexists’ nonverbal behavior).

In another experiment, women’s poorer performance was shown to be limited to stereotype-related tests, not a broad cognitive deficit. In a final experiment, interacting with a domineering/sexually interested male caused women to have temporarily elevated concern about negative stereotypes, which they subsequently attempted to suppress (thought suppression being a well-known resource hog).

The results further support the idea that even subtle sexism can be toxic in workplace environments where women are traditionally targets of discrimination.