Is there p-hacking in a new breastfeeding study? And is disclosure enough?

There is a new study out about the benefits of breastfeeding on eventual adult IQ, published in The Lancet Global Health. It’s getting lots of news coverage, for example in NPR, BBC, New York Times, and more.

A friend shared a link and asked what I thought of it. So I took a look at the article and came across this (emphasis added):

We based statistical comparisons between categories on tests of heterogeneity and linear trend, and we present the one with the lower p value. We used Stata 13·0 for the analyses. We did four sets of analyses to compare breastfeeding categories in terms of arithmetic means, geometric means, median income, and to exclude participants who were unemployed and therefore had no income.

Yikes. The description of the analyses is frankly a little telegraphic. But unless I’m misreading it, or they did some kind of statistical correction that they forgot to mention, it sounds like they had flexibility in the data analyses (I saw no mention of pre-registration in the analysis plan), they used that flexibility to test multiple comparisons, and they’re openly disclosing that they used p-values for model selection – which is a more technical way of saying they engaged in p-hacking. (They don’t say how they selected among the 4 sets of analyses with different kinds of means etc.; was that based on p-values too?)*
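To make the concern concrete, here is a minimal simulation sketch of what "present the one with the lower p value" does to the false positive rate. It is not a reanalysis of the study: two correlated z-tests stand in for the heterogeneity and trend tests, and the correlation `rho`, the number of simulations, and the function names are all assumptions chosen for illustration.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(n_sims=20000, rho=0.5, alpha=0.05, seed=1):
    """Share of null simulations where the smaller of two p-values is below alpha.

    rho is an assumed correlation between the two test statistics,
    standing in for two tests computed on the same data.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Both statistics are drawn under the null: there is no true effect.
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        # Report whichever test gives the lower p-value.
        if min(two_sided_p(z1), two_sided_p(z2)) < alpha:
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    print("one test only (rho = 1):", false_positive_rate(rho=1.0))
    print("lower of two tests (rho = 0.5):", false_positive_rate(rho=0.5))
```

With `rho=1.0` the two tests are the same test and the rate stays near the nominal 0.05. If the two tests were fully independent, the rate would be 1 - 0.95^2 = 0.0975; correlated tests land somewhere in between, which is still above the level the reported p-value claims.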

From time to time students ask, Am I allowed to do x statistical thing? And my standard answer is, in the privacy of your office/lab/coffeeshop/etc. you are allowed to do whatever you want! Exploratory data analysis is a good thing. Play with your data and learn from it.** But if you are going to publish the results of your exploration, then disclose. If you did something that could bias your p-values, let readers know and they can make an informed evaluation.***

But that advice assumes that you are talking to a sophisticated reader. When it comes time to talk to the public, via the press, you have a responsibility to explain yourself. “We used a statistical approach that has an increased risk of producing false positives when there is no effect, or overestimating the size of effects when they are real.”

And if that weakens your story too much, well, that’s valid. Your story is weaker. Scientific journals are where experts communicate with other experts, and it could still be interesting enough to publish for that audience, perhaps to motivate a more definitive followup study. But if it’s too weak to go to the public and tell mothers what to do with their bodies… Maybe save the press release for the pre-registered Study 2.


* The study has other potential problems which are pretty much par for the course in these kinds of observational studies. They try to statistically adjust for differences between kids who were breastfed and those who weren’t, but that assumes that you have a complete and precisely measured set of all relevant covariates. Did they? It’s not a testable assumption, though it’s one that experts can make educated guesses at. On the plus side, when they added potentially confounding variables to the models the effects got stronger, not weaker. On the minus side, as Michelle Meyer pointed out on Twitter, they did not measure or adjust for parental IQ, which will definitely be associated with child IQ and for which the covariates they did use (like parental education and income) are only rough proxies.

** Though using p-values to guide your exploratory data analysis isn’t the greatest idea.

*** Some statisticians will no doubt disagree and say you shouldn’t be reporting p-values with known bias. My response is (a) if you want unbiased statistics then you shouldn’t be reading anything that’s gone through pre-publication review, and (b) that’s what got us into this mess in the first place. I’d rather make it acceptable for people to disclose everything, as opposed to creating an expectation and incentive for people to report impossibly clean results.

Some reflections on the Bargh-Doyen elderly walking priming brouhaha

Recently a controversy broke out over the replicability of a study John Bargh et al. published in 1996. The study reported that unconsciously priming a stereotype of elderly people caused subjects to walk more slowly. A recent replication attempt by Stephane Doyen et al., published in PLoS ONE, was unable to reproduce the results. (Less publicized, but surely relevant, is another non-replication by Hal Pashler et al.) Ed Yong wrote up an article about it in Discover, which last week drew a sharp response from Bargh.

The broader context is that there has been a large and ongoing discussion about replication in psychology (i.e., that there isn’t enough of it). I don’t have much to say about whether the elderly-walking effect is real. But this controversy has raised a number of issues about scientific discourse online as well as about how we think about replication.

The discussion has been unnecessarily inflammatory – on all sides. Bargh has drawn a lot of criticism for his response, which among other things included factual errors about PLoS ONE, suggestions that Doyen et al. were “incompetent or ill-informed,” and a claim that Yong was practicing irresponsible journalism. The PLoS ONE editors posted a strongly worded but civil response in the comments, and Yong has written a rebuttal. As for the scientific issue — is the elderly-priming effect real? — Daniel Simons has written an excellent post on the many, many reasons why an effect might fail to replicate. A failure to replicate does not need to impeach the honesty or scientific skills of either the original researcher or the replicator. It does not even mean the effect is not real. In an ideal world, Bargh should have treated the difference between his results and those of Doyen et al. as a puzzle to be worked out, not as a personal attack to be responded to in kind.

But… it’s not as though Bargh went bananas over a dispassionate report of a non-replication. Doyen et al. strongly suggested that Bargh et al.’s procedure had been contaminated by expectancy effects. Since expectancy effects are widely known about in behavioral science (raise your hand if you have heard the phrase “double-blind”), the implication was that Bargh had been uncareful. And Ed Yong ran with that interpretation by leading off his original piece with the tale of Clever Hans. I don’t know whether Doyen or Yong meant to be inflammatory: I know nothing about Doyen; and in Yong’s case, based on his journalistic record, I doubt it (and he apparently gave Bargh plenty of opportunity to weigh in before his original post went live). But wherever you place the blame, a scientifically unfortunate result is that all of the other reasonable possibilities that Simons lists have been mostly ignored by the principals in this discussion.

Are priming effects hard to produce or easy? A number of priming researchers have suggested that priming effects are hard to get reliably. This doesn’t mean they aren’t important — experiments require isolation of the effect of interest, and the ease of isolating a phenomenon is not the same thing as its importance. (Those Higgs bosons are so hard to detect — so even if they exist they must not matter, right?) Bargh makes this point in his response too, suggesting that if Doyen et al. accidentally called subjects’ conscious attention to the elderly stereotype, that could wash out the effect (because conscious attention can easily interfere with automatic processes).

That being said… the effects in the original Bargh et al. report were big. Really big, by psychology standards. In experiment 2a, Bargh et al. report t(28) = 2.86, which corresponds to an effect size of d = 1.08. And in their replication, experiment 2b, they report t(28) = 2.16, which translates to d = 0.82. So even if we account for some shrinkage, under the right conditions it should not be hard for somebody to reproduce the elderly-walking priming effect in a new study.
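The post does not say which conversion was used, but the usual formula for two independent groups of roughly equal size, d = 2t / sqrt(df), reproduces both numbers. A minimal sketch, with the function name chosen here for illustration:

```python
import math


def d_from_t(t, df):
    """Cohen's d from an independent-samples t statistic.

    Assumes two groups of roughly equal size, so that d = 2t / sqrt(df).
    """
    return 2 * t / math.sqrt(df)


if __name__ == "__main__":
    print(round(d_from_t(2.86, 28), 2))  # experiment 2a -> 1.08
    print(round(d_from_t(2.16, 28), 2))  # experiment 2b -> 0.82
```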

The expectancy effects study is rhetorically powerful but proves little. In their Experiment 1, Doyen et al. tested the same hypothesis about priming stereotypes that Bargh tested. But in Experiment 2, Doyen et al. tested a hypothesis about experimenter expectancies. That is a completely different hypothesis. The second study tells us that experimenter expectancies can affect walking speed. But walking speed surely can be affected by more than one thing. So Experiment 2 does not tell us to what extent, if any at all, differences in walking speed were caused by experimenter expectancies in Bargh’s experiment (or for that matter, anywhere else in the natural world outside of Doyen’s lab). This is the inferential error of confusing causes of effects with effects of causes. Imagine that Doyen et al. had clubbed the subjects in the elderly-prime condition in the knee; most likely that would have slowed them down. But would we take that as evidence that Bargh et al. had done the same?

The inclusion of Experiment 2 served a strong rhetorical function, by planting in the audience’s mind the idea that the difference between Bargh vs Doyen Exp 1 was due to expectancy effects (and Ed Yong picked up and ran with this suggestion by referring to Clever Hans). But scientifically, all it shows is that expectancy effects can influence the dependent variable in the Bargh experiment. That’s not nothing, but anybody who already believes that experiments need to be double-blind should have seen that coming. If we had documentary evidence that in the actual 1996 studies Bargh et al. did not actually eliminate expectancy effects, that would be relevant. (We likely never will have such evidence; see next point.) But Experiment 2 does not shed nearly as much light as it appears to.

We need more openness with methods and materials. When I started off in psychology, someone once told me that a scientific journal article should contain everything you need to reproduce the experiment (either directly or via references to other published materials). That, of course, is almost never true and maybe is unrealistic. Especially when you factor in things like lab skills, many of which are taught via direct apprenticeship rather than in writing, and which matter just as much in behavioral experiments as they do in more technology-heavy areas of science.

But with all that being said, I think we could do a lot better. A big part of the confusion in this controversy is over the details of methods — what exactly did Bargh et al. do in the original study, and how closely did Doyen et al. reproduce the procedure? The original Bargh et al. article followed the standards of its day in how much methodological detail it reported. Bargh later wrote a methods chapter that described more details of the priming technique (and which he claims Doyen et al. did not follow). But in this era of unlimited online supplements, there is no reason why in future studies, all of the stimuli, instructions, etc. could not be posted. That would enormously aid replication attempts.

What makes for a “failed” replication? This turns out to be a small point in the present context but an important one in a more general sense, so I couldn’t help but make it. We should be very careful about the language of “successful” and “failed” replications when it is based on the difference between p<.05 and p>.05. That is, just because the original study could reject the null and the replication could not, that doesn’t mean that the replication is significantly different from the original study. If you are going to say you failed to replicate the original result, you should conduct a test of that difference.

As far as I can tell neither Doyen et al. nor Pashler et al. did that. So I did. I converted each study’s effect to an r effect size and then compared the studies with a z test of the difference between independent rs, and indeed Doyen et al. and Pashler et al. each differed significantly from Bargh’s original experiments. So this doesn’t alter the present discussion. But as good practice, the replication reports should have reported such tests.
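For readers who want to run this kind of comparison themselves, here is a minimal sketch of the procedure described above: convert each t to r, apply Fisher's r-to-z transformation, and test the difference. The original-study numbers come from experiment 2a as quoted earlier (n = 30 is inferred from df = 28 with two groups). The replication numbers are hypothetical placeholders, not the values from Doyen et al. or Pashler et al., and the function names are chosen here for illustration.

```python
import math


def r_from_t(t, df):
    """Convert a t statistic to an r effect size, keeping the sign of t."""
    r = math.sqrt(t ** 2 / (t ** 2 + df))
    return math.copysign(r, t)


def compare_independent_rs(r1, n1, r2, n2):
    """z test of the difference between two independent correlations.

    Uses Fisher's r-to-z transformation; returns (z, two-sided p).
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


if __name__ == "__main__":
    # Original study: t(28) = 2.86, so n = 30 assuming two independent groups.
    r_original = r_from_t(2.86, 28)
    # HYPOTHETICAL replication: a null result, t(118) = 0.0 with n = 120.
    r_replication = r_from_t(0.0, 118)
    z, p = compare_independent_rs(r_original, 30, r_replication, 120)
    print(f"z = {z:.2f}, p = {p:.3f}")
```

The point of the design is that the test asks whether the two effect sizes differ from each other, rather than whether each one separately crosses p < .05.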

Journals can be groundbreaking or definitive, not both

I was recently invited to contribute to Personality and Social Psychology Connections, an online journal of commentary (read: fancy blog) run by SPSP. Don Forsyth is the editor, and the contributors include David Dunning, Harry Reis, Jennifer Crocker, Shige Oishi, Mark Leary, and Scott Allison. My inaugural post is titled “Groundbreaking or definitive? Journals need to pick one.” Excerpt:

Do our top journals need to rethink their missions of publishing research that is both groundbreaking and definitive? And as a part of that, do they — and we scientists — need to reconsider how we engage with the press and the public?…

In some key ways groundbreaking is the opposite of definitive. There is a lot of hard work to be done between scooping that first shovelful of dirt and completing a stable foundation. And the same goes for science (with the crucial difference that in science, you’re much more likely to discover along the way that you’ve started digging on a site that’s impossible to build on). “Definitive” means that there is a sufficient body of evidence to accept some conclusion with a high degree of confidence. And by the time that body of evidence builds up, the idea is no longer groundbreaking.

Read it here.


The brain scans, they do nothing

Breaking news: New brain scan reveals nothing at all.

‘This is an amazing discovery’, said leading neuroscientist Baroness Susan Greenfield, ‘the pictures tell us nothing about how the brain works, provide us with no insights into the nature of human consciousness, and all with such lovely colours.’ …

The development, which has been widely reported around the world, is also significant because it allows journalists to publish big fancy pictures of the brain that look really impressive while having little or no explanatory value.

I’ve previously mentioned the well documented bias to think that brain pictures automatically make research more sciencey, even if the pictures are irrelevant to the conclusions. Satire makes that point a lot better though.

In science, rejection is normal

In the news: A coupla guys played around with some #2 pencils and Scotch tape and won a Nobel Prize in physics. Talk about easy science! This is what happens when you work in a field with such low-hanging fruit that you run out of testable hypotheses.

Okay, kidding aside…

The initial NY Times report noted that the first paper on graphene that the researchers wrote was rejected by Nature before later being published in Science. [1]

It would be easy to fit that into a narrative that is common in movies and in science journalism: the brilliant iconoclasts rejected by the hidebound scientific establishment.

Far more likely though is a much more mundane explanation: scientists see their work rejected all the time. It’s just part of how science works. The review process is not perfect, and sometimes you have to shop a good idea around for a while before you can convince people of its merit. And the more productive you are, the more rejection experiences you will accumulate over a career.

It’s a good reminder that if you’re a working scientist (or trying to start a career as one), don’t get too worked up about rejection.

[1] Puzzling sidenote: That part no longer appears in the article on the NY Times website, but since there’s no correction statement I’ll still assume that it’s true and that they just edited it out of a later edition for some reason. The rejection anecdote still appears on the PBS website.

Pretty pictures of brains are more convincing

This study seemed like it was begging to be done, so I figured somebody must have done it already. Thank you Google Scholar for helping me find it…

Seeing is believing: The effect of brain images on judgments of scientific reasoning [pdf]

David P. McCabe and Alan D. Castel

Brain images are believed to have a particularly persuasive influence on the public perception of research on cognition. Three experiments are reported showing that presenting brain images with articles summarizing cognitive neuroscience research resulted in higher ratings of scientific reasoning for arguments made in those articles, as compared to articles accompanied by bar graphs, a topographical map of brain activation, or no image. These data lend support to the notion that part of the fascination, and the credibility, of brain imaging research lies in the persuasive power of the actual brain images themselves. We argue that brain images are influential because they provide a physical basis for abstract cognitive processes, appealing to people’s affinity for reductionistic explanations of cognitive phenomena.

For a few years now I’ve been joking that I should end every talk with a slide of a random brain image, and conclude, “Aaaannnd… all of this happens in the brain!” This is solid evidence that doing so would help my credibility.

Now, the next big question is: who’s going to replicate this with psychologists and neuroscientists as the subjects?

Apparently I’m on a blogging break

I just noticed that I haven’t posted in over a month. Don’t fear, loyal readers (am I being presumptuous with that plural? hi Mom!). I haven’t abandoned the blog; apparently I’ve just been too busy or preoccupied to flesh out any coherent thoughts.

So instead, here are some things that, over the last month, I’ve thought about posting but haven’t summoned up the wherewithal to turn into anything long enough to be interesting:

  • Should psychology graduate students routinely learn R in addition to, or perhaps instead of, other statistics software? (I used to think SPSS or SAS was capable enough for the modal grad student and R was too much of a pain in the ass, but I’m starting to come around. Plus R is cheaper, which is generally good for graduate students.)
  • What should we do about gee-whiz science journalism covering social neuroscience that essentially reduces to, “Wow, can you believe that X happens in the brain?” (Still working on that one. Maybe it’s too deeply ingrained to do anything.)
  • Reasons why you should read my new commentary in Psychological Inquiry. (Though really, if it takes a blog post to explain why an article is worth reading, maybe the article isn’t worth reading. I suggest you read it and tell me.)
  • A call for proposals for what controversial, dangerous, or weird research I should conduct now that I just got tenure.
  • Is your university as sketchy as my university? (Okay, my university probably isn’t really all that sketchy. And based on the previous item, you know I’m not just saying that to cover my butt.)
  • My complicated reactions to the very thought-provoking Bullock et al. “mediation is hard” paper in JPSP.

Our spring term is almost over, so maybe I’ll get to one of these sometime soon.