Some reflections on the Bargh-Doyen elderly walking priming brouhaha

Recently a controversy broke out over the replicability of a study John Bargh et al. published in 1996. The study reported that unconsciously priming a stereotype of elderly people caused subjects to walk more slowly. A recent replication attempt by Stephane Doyen et al., published in PLoS ONE, was unable to reproduce the results. (Less publicized, but surely relevant, is another non-replication by Hal Pashler et al.) Ed Yong wrote an article about it in Discover, which last week drew a sharp response from Bargh.

The broader context is that there has been a large and ongoing discussion about replication in psychology (i.e., that there isn’t enough of it). I don’t have much to say about whether the elderly-walking effect is real. But this controversy has raised a number of issues about scientific discourse online as well as about how we think about replication.

The discussion has been unnecessarily inflammatory – on all sides. Bargh has drawn a lot of criticism for his response, which among other things included factual errors about PLoS ONE, suggestions that Doyen et al. were “incompetent or ill-informed,” and a claim that Yong was practicing irresponsible journalism. The PLoS ONE editors posted a strongly worded but civil response in the comments, and Yong has written a rebuttal. As for the scientific issue — is the elderly-priming effect real? — Daniel Simons has written an excellent post on the many, many reasons why an effect might fail to replicate. A failure to replicate does not need to impeach the honesty or scientific skills of either the original researcher or the replicator. It does not even mean the effect is not real. In an ideal world, Bargh should have treated the difference between his results and those of Doyen et al. as a puzzle to be worked out, not as a personal attack to be responded to in kind.

But… it’s not as though Bargh went bananas over a dispassionate report of a non-replication. Doyen et al. strongly suggested that Bargh et al.’s procedure had been contaminated by expectancy effects. Since expectancy effects are widely known in behavioral science (raise your hand if you have heard the phrase “double-blind”), the implication was that Bargh had been careless. And Ed Yong ran with that interpretation by leading off his original piece with the tale of Clever Hans. I don’t know whether Doyen or Yong meant to be inflammatory: I know nothing about Doyen; and in Yong’s case, based on his journalistic record, I doubt it (and he apparently gave Bargh plenty of opportunity to weigh in before his original post went live). But wherever you place the blame, a scientifically unfortunate result is that all of the other reasonable possibilities that Simons lists have been mostly ignored by the principals in this discussion.

Are priming effects hard to produce or easy? A number of priming researchers have suggested that priming effects are hard to get reliably. This doesn’t mean they aren’t important — experiments require isolation of the effect of interest, and the ease of isolating a phenomenon is not the same thing as its importance. (Those Higgs bosons are so hard to detect — so even if they exist they must not matter, right?) Bargh makes this point in his response too, suggesting that if Doyen et al. accidentally called subjects’ conscious attention to the elderly stereotype, that could wash out the effect (because conscious attention can easily interfere with automatic processes).

That being said… the effects in the original Bargh et al. report were big. Really big, by psychology standards. In experiment 2a, Bargh et al. report t(28) = 2.86, which corresponds to an effect size of d = 1.08. And in their replication, experiment 2b, they report t(28) = 2.16, which translates to d = 0.82. So even if we account for some shrinkage, under the right conditions it should not be hard for somebody to reproduce the elderly-walking priming effect in a new study.
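For a two-group design with equal group sizes, d can be recovered from a reported t statistic using the standard conversion d = 2t/√df. A quick sketch of that calculation, using the t values reported in the original paper:

```python
import math

def d_from_t(t, df):
    """Cohen's d from an independent-samples t with equal ns: d = 2t / sqrt(df)."""
    return 2 * t / math.sqrt(df)

# Experiment 2a: t(28) = 2.86
print(round(d_from_t(2.86, 28), 2))  # 1.08
# Experiment 2b: t(28) = 2.16
print(round(d_from_t(2.16, 28), 2))  # 0.82
```

By conventional benchmarks anything above 0.8 is a “large” effect, which is why these numbers stand out.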

The expectancy effects study is rhetorically powerful but proves little. In their Experiment 1, Doyen et al. tested the same hypothesis about priming stereotypes that Bargh tested. But in Experiment 2, Doyen et al. tested a hypothesis about experimenter expectancies. That is a completely different hypothesis. The second study tells us that experimenter expectancies can affect walking speed. But walking speed surely can be affected by more than one thing. So Experiment 2 does not tell us to what extent, if any at all, differences in walking speed were caused by experimenter expectancies in Bargh’s experiment (or for that matter, anywhere else in the natural world outside of Doyen’s lab). This is the inferential error of confusing causes of effects with effects of causes. Imagine that Doyen et al. had clubbed the subjects in the elderly-prime condition in the knee; most likely that would have slowed them down. But would we take that as evidence that Bargh et al. had done the same?

The inclusion of Experiment 2 served a strong rhetorical function, by planting in the audience’s mind the idea that the difference between Bargh’s results and Doyen et al.’s Experiment 1 was due to expectancy effects (and Ed Yong picked up and ran with this suggestion by referring to Clever Hans). But scientifically, all it shows is that expectancy effects can influence the dependent variable in the Bargh experiment. That’s not nothing, but anybody who already believes that experiments need to be double-blind should have seen that coming. If we had documentary evidence that in the actual 1996 studies Bargh et al. did not actually eliminate expectancy effects, that would be relevant. (We likely never will have such evidence; see next point.) But Experiment 2 does not shed nearly as much light as it appears to.

We need more openness with methods and materials. When I started off in psychology, someone once told me that a scientific journal article should contain everything you need to reproduce the experiment (either directly or via references to other published materials). That, of course, is almost never true and maybe is unrealistic. Especially when you factor in things like lab skills, many of which are taught via direct apprenticeship rather than in writing, and which matter just as much in behavioral experiments as they do in more technology-heavy areas of science.

But with all that being said, I think we could do a lot better. A big part of the confusion in this controversy is over the details of methods — what exactly did Bargh et al. do in the original study, and how closely did Doyen et al. reproduce the procedure? The original Bargh et al. article followed the standards of its day in how much methodological detail it reported. Bargh later wrote a methods chapter that described more details of the priming technique (and which he claims Doyen et al. did not follow). But in this era of unlimited online supplements, there is no reason why in future studies, all of the stimuli, instructions, etc. could not be posted. That would enormously aid replication attempts.

What makes for a “failed” replication? This turns out to be a small point in the present context but an important one in a more general sense, so I couldn’t help but make it. We should be very careful about the language of “successful” and “failed” replications when it is based on the difference between p<.05 and p>.05. That is, just because the original study could reject the null and the replication could not, that doesn’t mean that the replication is significantly different from the original study. If you are going to say you failed to replicate the original result, you should conduct a test of that difference.

As far as I can tell neither Doyen et al. nor Pashler et al. did that. So I did. I converted each study’s effect to an r effect size and then compared the studies with a z test of the difference between independent rs, and indeed Doyen et al. and Pashler et al. each differed significantly from Bargh’s original experiments. So this doesn’t alter the present discussion. But as good practice, the replication reports should have reported such tests.
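For the curious, here is a sketch of that kind of test: convert each t to an r (r = t/√(t² + df)), apply Fisher’s r-to-z transform, and compare the two studies with a z statistic. The replication’s effect size and sample size below are illustrative placeholders, not the actual Doyen et al. numbers:

```python
import math

def r_from_t(t, df):
    """Convert an independent-samples t to an r effect size."""
    return t / math.sqrt(t**2 + df)

def z_diff(r1, n1, r2, n2):
    """z test for the difference between two independent rs
    (Fisher r-to-z transform)."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (z1 - z2) / se

# Bargh et al. Exp 2a: t(28) = 2.86, n = 30
r_bargh = r_from_t(2.86, 28)
# Hypothetical replication with a near-zero effect and n = 120
# (illustrative numbers only)
z = z_diff(r_bargh, 30, 0.0, 120)
print(round(z, 2))  # comes out around 2.4 with these inputs
```

With |z| > 1.96 the two studies differ at p < .05, two-tailed.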

Journals can be groundbreaking or definitive, not both

I was recently invited to contribute to Personality and Social Psychology Connections, an online journal of commentary (read: fancy blog) run by SPSP. Don Forsyth is the editor, and the contributors include David Dunning, Harry Reis, Jennifer Crocker, Shige Oishi, Mark Leary, and Scott Allison. My inaugural post is titled “Groundbreaking or definitive? Journals need to pick one.” Excerpt:

Do our top journals need to rethink their missions of publishing research that is both groundbreaking and definitive? And as a part of that, do they — and we scientists — need to reconsider how we engage with the press and the public?…

In some key ways groundbreaking is the opposite of definitive. There is a lot of hard work to be done between scooping that first shovelful of dirt and completing a stable foundation. And the same goes for science (with the crucial difference that in science, you’re much more likely to discover along the way that you’ve started digging on a site that’s impossible to build on). “Definitive” means that there is a sufficient body of evidence to accept some conclusion with a high degree of confidence. And by the time that body of evidence builds up, the idea is no longer groundbreaking.

Read it here.


The brain scans, they do nothing

Breaking news: New brain scan reveals nothing at all.

‘This is an amazing discovery’, said leading neuroscientist Baroness Susan Greenfield, ‘the pictures tell us nothing about how the brain works, provide us with no insights into the nature of human consciousness, and all with such lovely colours.’ …

The development, which has been widely reported around the world, is also significant because it allows journalists to publish big fancy pictures of the brain that look really impressive while having little or no explanatory value.

I’ve previously mentioned the well documented bias to think that brain pictures automatically make research more sciencey, even if the pictures are irrelevant to the conclusions. Satire makes that point a lot better though.

In science, rejection is normal

In the news: A coupla guys played around with some #2 pencils and Scotch tape and won a Nobel Prize in physics. Talk about easy science! This is what happens when you work in a field with such low-hanging fruit that you run out of testable hypotheses.

Okay, kidding aside…

The initial NY Times report noted that the first paper on graphene that the researchers wrote was rejected by Nature before later being published in Science. [1]

It would be easy to fit that into a narrative that is common in movies and in science journalism: the brilliant iconoclasts rejected by the hidebound scientific establishment.

Far more likely though is a much more mundane explanation: scientists see their work rejected all the time. It’s just part of how science works. The review process is not perfect, and sometimes you have to shop a good idea around for a while before you can convince people of its merit. And the more productive you are, the more rejection experiences you will accumulate over a career.

It’s a good reminder that if you’re a working scientist (or trying to start a career as one), don’t get too worked up about rejection.

[1] Puzzling sidenote: For some reason that part no longer appears in the article on the NY Times website, but since there’s no correction statement I’ll still assume that it’s true and they just edited it out of a later edition for some reason. The rejection anecdote still appears on the PBS website.

Pretty pictures of brains are more convincing

This study seemed like it was begging to be done, so I figured somebody must have done it already. Thank you Google Scholar for helping me find it…

Seeing is believing: The effect of brain images on judgments of scientific reasoning [pdf]

David P. McCabe and Alan D. Castel

Brain images are believed to have a particularly persuasive influence on the public perception of research on cognition. Three experiments are reported showing that presenting brain images with articles summarizing cognitive neuroscience research resulted in higher ratings of scientific reasoning for arguments made in those articles, as compared to articles accompanied by bar graphs, a topographical map of brain activation, or no image. These data lend support to the notion that part of the fascination, and the credibility, of brain imaging research lies in the persuasive power of the actual brain images themselves. We argue that brain images are influential because they provide a physical basis for abstract cognitive processes, appealing to people’s affinity for reductionistic explanations of cognitive phenomena.

For a few years now I’ve been joking that I should end every talk with a slide of a random brain image, and conclude, “Aaaannnd… all of this happens in the brain!” This is solid evidence that doing so would help my credibility.

Now, the next big question is: who’s going to replicate this with psychologists and neuroscientists as the subjects?

Apparently I’m on a blogging break

I just noticed that I haven’t posted in over a month. Don’t fear, loyal readers (am I being presumptuous with that plural? hi Mom!). I haven’t abandoned the blog, apparently I’ve just been too busy or preoccupied to flesh out any coherent thoughts.

So instead, here are some things that, over the last month, I’ve thought about posting but haven’t summoned up the wherewithal to turn into anything long enough to be interesting:

  • Should psychology graduate students routinely learn R in addition to, or perhaps instead of, other statistics software? (I used to think SPSS or SAS was capable enough for the modal grad student and R was too much of a pain in the ass, but I’m starting to come around. Plus R is cheaper, which is generally good for graduate students.)
  • What should we do about gee-whiz science journalism covering social neuroscience that essentially reduces to, “Wow, can you believe that X happens in the brain?” (Still working on that one. Maybe it’s too deeply ingrained to do anything.)
  • Reasons why you should read my new commentary in Psychological Inquiry. (Though really, if it takes a blog post to explain why an article is worth reading, maybe the article isn’t worth reading. I suggest you read it and tell me.)
  • A call for proposals for what controversial, dangerous, or weird research I should conduct now that I just got tenure.
  • Is your university as sketchy as my university? (Okay, my university probably isn’t really all that sketchy. And based on the previous item, you know I’m not just saying that to cover my butt.)
  • My complicated reactions to the very thought-provoking Bullock et al. “mediation is hard” paper in JPSP.

Our spring term is almost over, so maybe I’ll get to one of these sometime soon.

On base rates and the “accuracy” of computerized Facebook gaydar

I never know what to make of reports stating the “accuracy” of some test or detection algorithm. Take this example, from a New York Times article by Steve Lohr titled How Privacy Vanishes Online:

In a class project at the Massachusetts Institute of Technology that received some attention last year, Carter Jernigan and Behram Mistree analyzed more than 4,000 Facebook profiles of students, including links to friends who said they were gay. The pair was able to predict, with 78 percent accuracy, whether a profile belonged to a gay male.

I have no idea what “78 percent accuracy” means in this context. The most obvious answer would seem to be that of all 4,000 profiles analyzed, 78% were correctly classified as gay versus not gay. But if that’s the case, I have an algorithm that beats the pants off of theirs. Are you ready for it?

Say that everybody is not gay.

Figure that around 5 to 10 percent of the population is gay. If these 4,000 students are representative of that, then saying not gay every time will yield an “accuracy” of 90-95%.

But wait — maybe by “accuracy” they mean what percentage of gay people are correctly identified as such. In that case, I have an algorithm that will be 100% accurate by that standard. Ready?

Say that everybody is gay.

You can see how silly this gets. To understand how good the test is, you need two numbers: sensitivity and specificity. My algorithms each turn out to be 100% on one and 0% on the other. Which means that they’re both crap. (A good test needs to be high on both.) I am hoping that the MIT class’s algorithm was a little better, and the useful numbers just didn’t get translated. But this news report tells us nothing that we need to know to evaluate it.
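The point is easy to verify with a toy population at a 5% base rate: each degenerate “algorithm” scores perfectly on one measure and zero on the other. (The base rate and population size here are made up for illustration.)

```python
def evaluate(predictions, labels):
    """Sensitivity, specificity, and accuracy of binary predictions
    against true labels."""
    tp = sum(p and y for p, y in zip(predictions, labels))
    fn = sum((not p) and y for p, y in zip(predictions, labels))
    tn = sum((not p) and (not y) for p, y in zip(predictions, labels))
    fp = sum(p and (not y) for p, y in zip(predictions, labels))
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    accuracy = (tp + tn) / len(labels)
    return sensitivity, specificity, accuracy

# Toy population: 5% base rate for the target class
labels = [True] * 50 + [False] * 950

# "Say that everybody is not gay": 95% accurate, 0% sensitive
all_no = [False] * 1000
print(evaluate(all_no, labels))   # (0.0, 1.0, 0.95)

# "Say that everybody is gay": 100% sensitive, 0% specific
all_yes = [True] * 1000
print(evaluate(all_yes, labels))  # (1.0, 0.0, 0.05)
```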

Say it again

When students learn writing, they are often taught that if you have to say the same kind of thing more than once, you should word it in a slightly different way each time. The idea is to add interest through variety.

But when I work with psychology students on their writing, I often have to work hard to break them of that habit. In scientific writing, precision and clarity come first. This doesn’t mean that scientific writing cannot also be elegant and interesting (the vary-the-wording strategy is often just a cheap trick anyhow). But your first priority is to make sure that your reader knows exactly what you mean.

Problems arise when journalists trained in vary-the-wording write about statistics. Small thing, but take this sentence from a Slate piece (in the oft-enlightening Explainer column) about the Fort Hood shooting:

Studies have shown that the suicide rate among male doctors is 40 percent higher than among men overall and that female doctors take their own lives at 130 percent the rate of women in general.

The same comparison is being made for men and for women: how does the suicide rate among doctors compare to the general population? But the numbers are not presented in parallel. For men, the number presented is 40, as in “40 percent higher than” men in general. For women, the number is 130, as in “130 percent the rate of” women in general.

The prepositions are the tipoff that the writer is doing different things, and a careful reader can probably figure that out. But the attempt to add variety just bogs things down. A reader will have to slow down and possibly re-read once or twice to figure out that 40% and 130% are both telling us that doctors commit suicide more often than others.
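Put in parallel terms, the arithmetic is simple: both figures are just multiples of the general-population rate, stated two different ways.

```python
# "40 percent higher than" the baseline = 1.40 times the baseline rate
male_doctor_multiple = 1 + 40 / 100

# "130 percent the rate of" the baseline = 1.30 times the baseline rate
female_doctor_multiple = 130 / 100
```

Written in parallel (“40 percent higher” and “30 percent higher”), the comparison would have been instantly clear.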

Separately: why break it out by gender? In context, the writer is trying to make a point about doctors versus everybody else. Not male doctors versus female doctors. We often reflexively categorize things by gender (I’m using “we” in a society-wide sense) when it’s unnecessary and uninformative.

Improving the grant system ain’t so easy

Today’s NY Times has an article by Gina Kolata about how the National Cancer Institute plays it safe with grant funding. The main point of the article is that NCI funds too many “safe” studies — studies that promise a high probability of making a modest, incremental discovery. This is done at the expense of more speculative and exploratory studies that take bigger risks but could lead to greater leaps in knowledge.

The article, and by and large the commenters on it, seem to assume that things would be better if the NCI funded more high-risk research. Missing is any analysis of what might be the downsides of adopting such a strategy.

By definition, a high-risk proposal has a lower probability of producing usable results. (That’s what people mean by “risk” in this context.) So for every big breakthrough, you’d be funding a larger number of dead ends. That raises three problems: a substantive policy problem, a practical problem, and a political problem.

1. The substantive problem is in knowing what would be the net effect of changing the system. If you change the system so that you invest grant dollars in research that pays off half as often, but when it does the findings are twice as valuable, it’s a wash — you haven’t made things better or worse overall. So it’s a problem of adjusting the system to optimize the risk × reward payoffs. I’m not saying the current situation is optimal; but nobody is presenting any serious analysis of whether an alternative investment strategy would be better.
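The “wash” is just expected-value arithmetic (the payoff probabilities and values here are made up for illustration):

```python
# Safe portfolio: 80% of funded projects pay off, each payoff worth 1 unit
safe_ev = 0.80 * 1.0    # expected value: 0.80 units per grant

# Risky portfolio: pays off half as often, but each payoff is worth twice as much
risky_ev = 0.40 * 2.0   # expected value: 0.80 units per grant -- a wash
```

Shifting funding toward risk only helps if the expected value of the risky portfolio actually exceeds the safe one, which is exactly the analysis that’s missing.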

2. The practical problem is that we would have to find some way to choose among high-risk studies. The problem everybody is pointing to is that in the current system, scientists have to present preliminary studies, stick to incremental variations on well-established paradigms, reassure grant panels that their proposal is going to pay off, etc. Suppose we move away from that… how would you choose amongst all the riskier proposals?

People like to point to historical breakthroughs that never would have been funded by a play-it-safe NCI. But it may be a mistake to believe those studies would have been funded by a take-a-risk NCI, because we have the benefit of hindsight and a great deal of forgetting. Before the research was carried out — i.e., at the time it would have been a grant proposal — every one of those would-be-breakthrough proposals would have looked just as promising as a dozen of their contemporaries that turned out to be dead-ends and are now lost to history. So it’s not at all clear that all of those breakthroughs would have been funded within a system that took bigger risks, because they would have been competing against an even larger pool of equally (un)promising high-risk ideas.

3. The political problem is that even if we could solve #1 and #2, we as a society would have to have the stomach for putting up with a lot of research that produces no meaningful results. The scientific community, politicians, and the general public would have to be willing to constantly remind themselves that scientific dead ends are not a “waste” of research dollars — they are the inevitable consequence of taking risks. There would surely be resistance, especially at the political level.

So what’s the solution? I’m sure there could be some improvements made within the current system, especially in getting review panels and program officers to reorient to higher-risk studies. But I think the bigger issue has to do with the overall amount of money available. As the top-rated commenter on Kolata’s article points out, the FY 2010 defense appropriation is more than 6 times what we have spent at NCI since Nixon declared a “war” on cancer 38 years ago. If you make resources scarce, of course you’re going to make people cautious about how they invest those resources. There’s a reason angel investors are invariably multimillionaires. If you want to inspire the scientific equivalent of angel investing, then the people giving out the money are going to have to feel like they’ve got enough money to take risks with.

Taking aim at evolutionary psychology

Sharon Begley has a doozy of an article in Newsweek taking aim at evolutionary psychology. The article is a real mixed bag and is already starting to generate vigorous rebuttals.

As background, the term “evolutionary psychology” tends to confuse outsiders because it sounds like a catchall for any approach to psychology that incorporates evolutionary theory and principles. But that’s not how it’s used by insiders. Rather, evolutionary psychology (EP) refers to one specific way (of many) of thinking about evolution and human behavior. (This article by Eric Alden Smith contrasts EP with other evolutionary approaches.) EP can be differentiated from other evolutionary approaches on at least 3 different levels. There are the core scientific propositions, assumptions, and methods that EPs use. There are the particular topics and conclusions that EP has most commonly been associated with. And there is a layer of politics and extra-scientific discourse regarding how EP is discussed and interpreted by its proponents, its critics, and the media.

Begley makes clear that EP is not the only way of applying evolutionary principles to understanding human behavior. (In particular, she contrasts it with human behavioral ecology). Thus, hopefully most readers won’t take this as a ding on evolutionary theory broadly speaking. But unfortunately, she cherrypicks her examples and conflates the controversies at different levels — something that I suspect is going to drive the EP folks nuts.

At the core scientific level, one of the fundamental debates is over modularity versus flexibility. EP posits that the ancestral environment presented our forebears with specific adaptive problems that were repeated over multiple generations, and as a result we evolved specialized cognitive modules that help us solve those problems. Leda Cosmides’s work on cheater detection is an example of this — she has proposed that humans have specialized cognitive mechanisms for detecting when somebody isn’t holding up their obligations in a social exchange. Critics of EP argue that our ancestors faced a wide and unpredictable range of adaptive problems, and as a result our minds are more flexible — for example they say that we detect cheaters by applying a general capacity for reasoning, not through specialized cheater-detecting skills. This is an important, serious scientific debate with broad implications.

Begley discusses the modularity versus flexibility debate — and if her article stuck to the deep scientific issues, it could be a great piece of science journalism. But it is telling what topics and examples she uses to flesh out her arguments. Cosmides’s work on cheater detection would have been a great topic to focus on: Cosmides has found support across multiple methods and levels of analysis, and at the same time critics like David Buller have presented serious challenges. That could have made for a thoughtful but still dramatic presentation. But Begley never mentions cheater detection. Instead, she picks examples of proposed adaptations that (a) have icky overtones, like rape or the abuse of stepchildren; and (b) do not have widespread support even among EPs. (Daly and Wilson, the researchers who originally suggested that stepchild abuse might be an adaptation, no longer believe that the evidence supports that conclusion.) Begley wants to leave readers with the impression that EP claims are falling apart left and right because of fundamental flaws in the underlying principles (as opposed to narrower instances of particular arguments or evidence falling through). To make her case, she cherrypicks the weakest and most controversial claims. She never mentions less-controversial EP research on topics like decision-making, emotions, group dynamics, etc.

Probably the ugliest part of the article is the way that Begley worms ad hominem attacks into her treatment of the science, and then accuses EPs of changing topics when they defend themselves. A major point of Begley’s is that EP is used to justify horrific behavior like infidelity, rape, and child abuse. Maybe the findings are sometimes used that way — but in my experience that is almost never done by the scientists themselves, who are well aware of the difference between “is” and “ought.” (If Begley wants to call somebody out on committing the naturalistic fallacy, she should be taking aim at mass media, not science.) Begley also seems to play a rhetorical “I’m not touching you” baiting game. Introducing EP research on jealousy she writes, “Let’s not speculate on the motives that (mostly male) evolutionary psychologists might have in asserting that their wives are programmed to not really care if they sleep around…” Then amazingly a few paragraphs later she writes, “Evolutionary psychologists have moved the battle from science, where they are on shaky ground, to ideology, where bluster and name-calling can be quite successful.” Whahuh? Who’s moving what battle now?

The whole thing is really unfortunate, because evolutionary psychology deserves serious attention by serious science journalists (which Begley can sometimes be). David Buller’s critique a few years ago raised some provocative challenges and earned equally sharp rebuttals, and the back-and-forth continues to reverberate. That makes for a potentially gripping story. And EP claims frequently get breathless coverage and oversimplified interpretations in the mass media, so a nuanced and thoughtful treatment of the science (with maybe a little media criticism thrown in) would play a needed corrective role. I’m no EP partisan — I tend to take EP on a claim-by-claim basis, and I find the evidence for some EP conclusions to be compelling and others poorly supported. I just wish the public was getting a more informative and more scientifically grounded view of the facts and controversies.