Evaluating a new critique of the Reproducibility Project

Over the last five years psychologists have been paying more and more attention to issues that could be diminishing the quality of our published research — things like low power, p-hacking, and publication bias. We know these things can affect reproducibility, but it can be hard to gauge their practical impact. The Reproducibility Project: Psychology (RPP), published last year in Science, was a massive, coordinated effort to produce an estimate of where several of the field’s top journals stood in 2008 before all the attention and concerted improvement began.

The RPP is not perfect, and the paper is refreshingly frank about its limitations and nuanced about its conclusions. But all science proceeds on fallible evidence (there isn’t any other kind), and it has been welcomed by many psychologists as an informative examination of the reliability of our published findings.

Welcomed by many, but not welcomed by all.

In a technical commentary released today in Science, Dan Gilbert, Gary King, Stephen Pettigrew, and Tim Wilson take exception to the conclusions that the RPP authors and many scientists who read it have reached. They offer re-analyses of the RPP, some incorporating outside data. They maintain that the RPP authors’ conclusions are wrong, and on re-examination the data tell us that “the reproducibility of psychological science is quite high.” (The RPP authors published a reply.)

What should we make of it? I read the technical comment, the supplement (as you’ll see there were some surprises in it), the Open Science Collaboration’s reply, Gilbert et al.’s unpublished response to the reply, and I re-read the Many Labs report that plays a critical role in the commentary. Here are my thoughts.

Unpacking the critics’ replication metric

To start with, a key to understanding Gilbert et al.’s critique is understanding the metric of replicability that it uses.

There are many reasons why original and replication studies might get different results. Some are neutral and unavoidable, like sampling error. Some are signs of good things, like scientists pushing into the unknown. But some are problems. That can include a variety of errors and biases in original studies, errors and biases in replications, and systemic problems like publication bias. Just like there are many reasons for originals and replications to get different results, there are many ways to index those differences. Different metrics are sensitive to different things about original and replication studies. Whereas the RPP looked at a number of different metrics, the critique focuses on one: whether the point estimate of the replication effect size falls within the confidence interval of the original.
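To make that metric concrete, here is a minimal sketch (with made-up effect sizes and standard errors, not RPP values) of how a single original-replication pair would be scored under it:

```python
# A minimal sketch of the confidence-interval replication metric: score a
# replication as a "success" if its point estimate falls inside the original
# study's 95% confidence interval. All numbers here are hypothetical.

def ci_replication_success(orig_effect, orig_se, rep_effect, z=1.96):
    """True if the replication point estimate lies within the original 95% CI."""
    lower, upper = orig_effect - z * orig_se, orig_effect + z * orig_se
    return lower <= rep_effect <= upper

# Hypothetical original (d = 0.45, SE = 0.10) and replication (d = 0.20):
print(ci_replication_success(0.45, 0.10, 0.20))   # False: 0.20 is outside [0.254, 0.646]
```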

But in justifying this choice, the critique’s authors misstate what confidence intervals are. They write that 95% of replications should fall within the original studies’ confidence intervals. That just isn’t true – a P% confidence interval does not predict P% success in future replications. To be fair, almost everyone misinterprets confidence intervals. But when they are pivotal to your sole metric of reproducibility and your interpretation hinges on them, it would be good to get them right.
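A quick simulation makes the point without any appeal to the RPP data. Even in the best case, where the original and the replication estimate the exact same true effect with no bias and equal sample sizes, the replication's point estimate lands inside the original's 95% CI only about 83% of the time, not 95%. (The effect size, sample size, and standard-error formula below are illustrative assumptions.)

```python
# Simulation: how often does an unbiased replication's point estimate fall
# inside the original study's 95% CI when both estimate the same true effect?
import numpy as np

rng = np.random.default_rng(0)
true_d, n, sims = 0.4, 50, 100_000        # assumed true effect, per-group n, simulations
se = np.sqrt(2 / n)                       # rough standard error of Cohen's d

orig = rng.normal(true_d, se, sims)       # original study estimates
rep = rng.normal(true_d, se, sims)        # replication estimates of the same effect
inside = np.abs(rep - orig) <= 1.96 * se  # inside the original 95% CI?
print(f"capture rate: {inside.mean():.1%}")   # roughly 83%, not 95%
```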

Another issue that is critical to interpreting intervals is knowing that intervals get wider the less data you have. This is never addressed, but the way Gilbert et al. use original studies’ confidence intervals to gauge replicability means that the lower an original study’s power, the easier it will be to “successfully” replicate it. Conversely, a very high-powered original study can “fail to replicate” because of trivial heterogeneity in the effect. Not all replication metrics are vulnerable to this problem. But if you are going to use a replication metric that is sensitive to power in this way, you need to present it alongside other information that puts it in context. Otherwise you can be led seriously astray.
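Here is an illustrative simulation (assumed parameters, not RPP data) of that sensitivity to power: with a small amount of real heterogeneity in the effect, low-powered originals have wide intervals and "replicate" easily, while very high-powered originals have narrow intervals and "fail", even though nothing else has changed.

```python
# Success rate of the original-CI metric as a function of the original study's
# sample size, holding a small amount of effect heterogeneity constant.
import numpy as np

rng = np.random.default_rng(1)
true_d, tau, rep_n, sims = 0.4, 0.1, 100, 50_000   # mean effect, heterogeneity SD, replication n, sims

for orig_n in (20, 80, 320, 1280):                 # original per-group sample sizes
    se_o, se_r = np.sqrt(2 / orig_n), np.sqrt(2 / rep_n)
    orig = rng.normal(true_d, tau, sims) + rng.normal(0, se_o, sims)
    rep = rng.normal(true_d, tau, sims) + rng.normal(0, se_r, sims)
    rate = (np.abs(rep - orig) <= 1.96 * se_o).mean()
    print(f"original n per group = {orig_n:4d}: 'successful replication' rate = {rate:.0%}")
```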

A limited scope with surprising omissions

The RPP is descriptive, observational data about replications. Gilbert et al. try to model the underlying causes. If there are many reasons why original and replication studies can differ, it would make sense to try to model as many of them as possible, or at least the most important ones. Unfortunately, the critique takes a quite narrow, confirmatory approach to modeling differences between original and replication studies. Of all the possible reasons why original and replication studies can differ, it only looks for random error and flaws in replication studies.

This leads to some striking omissions. For example, any scientist can tell you that publication bias is ubiquitous. It creates biases in the results of original published studies, which can make it harder to reproduce their (biased) effects. But it would not have affected the replications in the RPP. Nor would it affect the comparisons among Many Labs replications that Gilbert et al. use as benchmarks (more on that in a moment). Yet the commentary’s re-analyses of replicability make no attempt to detect or account for publication bias anywhere.
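To see how much this one omission can matter, here is a toy simulation (assumed effect and sample sizes, not RPP data): if only statistically significant originals get published, their published effect estimates are inflated, and unbiased replications of the very same true effect fall outside the original CIs more often.

```python
# Toy simulation of publication bias: keep only originals that reached p < .05
# and see how often replications land inside those originals' 95% CIs.
import numpy as np

rng = np.random.default_rng(2)
true_d, n, sims = 0.2, 50, 200_000
se = np.sqrt(2 / n)

orig = rng.normal(true_d, se, sims)                 # all original estimates
rep = rng.normal(true_d, se, sims)                  # unbiased replications
published = np.abs(orig) / se > 1.96                # 'published' = significant originals only
inside = np.abs(rep - orig) <= 1.96 * se

print(f"capture rate, all originals:         {inside.mean():.0%}")
print(f"capture rate, 'published' originals: {inside[published].mean():.0%}")
```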

If you want to know how much something varies, calculate its variance

Gilbert et al. propose that replications might have variable effects because of differences in study populations or procedures. This is certainly an important issue, and one that has been raised before in interpreting replications.

In order to offer new insight on this issue, Gilbert et al. re-analyze data from Klein et al.’s (2014) Many Labs 1 study to see how often pairs of studies trying to get the same effect had a “successful” replication by the original-study-confidence-interval criterion. Unfortunately, that analysis mixes together power and effect size heterogeneity – they are very different things, and both higher power in the original studies and greater effect size heterogeneity will lower replication success in this kind of analysis. It does not provide a clean estimate of effect variability.

There is a more straightforward way to know if effects varied across Many Labs replication sites: calculate the variance in the effects. Klein et al. report this in their Table 3. The data show that big effects tended to vary across sites but more modest ones did not. And by big I mean big – there are 5 effects in Many Labs 1 with a Cohen’s d greater than 1.0. Four of them are variations on the anchoring effect. Effect sizes that big are quite unusual in social psychology – they were probably included by Klein et al. to make sure there were some slam-dunk effects in the Many Labs project, not because they are representative. But effects that are more typical in size are not particularly variable in Many Labs 1. Nor is there much variance in any of the effects examined in the similar Many Labs 3.
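For readers who want the direct route, a standard way to quantify between-site variability is to estimate tau-squared, the between-study variance, from the per-site effect estimates, for example with the DerSimonian-Laird method. The sketch below uses placeholder numbers rather than the actual Many Labs estimates.

```python
# DerSimonian-Laird (method-of-moments) estimate of between-site variance (tau^2)
# from per-site effect sizes and their standard errors. Numbers are placeholders.
import numpy as np

def dersimonian_laird_tau2(effects, ses):
    """Estimate tau^2, the variance of true effects across sites."""
    effects, ses = np.asarray(effects, float), np.asarray(ses, float)
    w = 1.0 / ses**2                             # fixed-effect (inverse-variance) weights
    pooled = np.sum(w * effects) / np.sum(w)     # fixed-effect pooled estimate
    q = np.sum(w * (effects - pooled) ** 2)      # Cochran's Q statistic
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    return max(0.0, (q - (len(effects) - 1)) / c)

site_d = [0.31, 0.42, 0.28, 0.55, 0.36]          # hypothetical per-site Cohen's d values
site_se = [0.12, 0.10, 0.15, 0.11, 0.13]         # hypothetical standard errors
print(f"estimated tau^2 = {dersimonian_laird_tau2(site_d, site_se):.3f}")
```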

Apples-to-oranges comparisons of replicability from RPP to Many Labs

Another argument Gilbert et al. make is that with enough power, most RPP replications would have been successful. To support this argument they look again at Many Labs to see how often the combined sample of 6000+ participants could replicate the original studies. Here is how they describe it:

OSC attempted to replicate each of 100 studies just once, and that attempt produced an unsettling result: Only 47% of the original studies were successfully replicated (i.e., produced effects that fell within the confidence interval of the original study). In contrast, MLP [Many Labs] attempted to replicate each of its studies 35 or 36 times and then pooled the data. MLP’s much more powerful method produced a much more heartening result: A full 85% of the original studies were successfully replicated. What would have happened to MLP’s heartening result if they had used OSC’s method? Of MLP’s 574 replication studies, only 195 produced effects that fell within the confidence interval of the original, published study. In other words, if MLP had used OSC’s method, they would have reported an unsettling replication rate of 34% rather than the heartening 85% they actually reported.

Three key numbers stand out in this paragraph. The RPP replication rate was 47%. The high-powered (N>6000) Many Labs pooled-sample replication rate was 85%. But if the RPP approach is applied to Many Labs (i.e. looking at single samples instead of the pooled sample), the rate drops to 34%. On its face, that sounds like a problem for the RPP.

Except when I actually looked at Table 2 of Many Labs and tried to verify the 85% number for the pooled sample, I couldn’t. There are 15 original studies where a confidence interval could be calculated. Only 6 of the pooled replication effects landed inside the intervals. So the correct number is 40%. Where did 85% come from? Although it’s virtually impossible to tell in the paragraph I quoted above, I found buried in the supplement the key detail that Gilbert et al. got their “heartening” 85% from a totally different replication metric — the tally of replications that got p < .05 (if you treat the anchoring effects as one, there are 11 significant effects out of 13). Instead of making an apples-to-apples comparison, they switch to a different metric exactly once in their critique, on only one side of this key comparison.
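The two metrics really can pull in opposite directions on the same data. Here is a toy example (made-up numbers) of a large pooled replication that is overwhelmingly significant, and so counts as a success on the p < .05 tally, while still falling outside the original study's confidence interval, and so counting as a failure on the CI-based metric:

```python
# One hypothetical original/pooled-replication pair scored two ways:
# (a) does the replication estimate fall in the original's 95% CI?
# (b) is the replication itself significant at p < .05?
from scipy import stats

orig_d, orig_se = 0.80, 0.10      # hypothetical original: large, precisely estimated effect
rep_d, rep_se = 0.55, 0.03        # hypothetical pooled replication: thousands of participants

in_ci = abs(rep_d - orig_d) <= 1.96 * orig_se
p_rep = 2 * stats.norm.sf(abs(rep_d) / rep_se)    # two-sided p-value for the replication

print(f"inside original 95% CI: {in_ci}")         # False -> 'failure' on the CI metric
print(f"replication p-value: {p_rep:.1e}")        # far below .05 -> 'success' on the p < .05 metric
```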

What if instead you calculate the replicability rate using the same metric for both sides of the comparison? Using the confidence interval metric that Gilbert et al. use everywhere else, you get 47% in the RPP versus 40% in the pooled analysis of Many Labs. So the RPP actually did better than Many Labs with its N > 6000 sample sizes. How could that be?

It turns out that the confidence interval metric can lead you to some surprising conclusions. Because larger effects were more variable in Many Labs 1, the effects that did the worst job “replicating” by Gilbert et al.’s original-study-confidence-interval criterion are the biggest ones. Thus anchoring – yes, anchoring – “failed to replicate” three out of four times. Gain vs. loss framing failed too. (Take that, Kahneman and Tversky!) By contrast, flag priming would appear to have replicated successfully – even though the original authors themselves have said that Many Labs did not successfully replicate it.

In addition to completely undermining the critique’s conclusion about power, all of this goes back to my earlier point that the confidence-interval metric needs to be interpreted with great caution. In their reply, the RPP authors bring up differences among replication metrics. In an unpublished response, Gilbert et al. write: “This is a red herring. Neither we nor the authors of OSC2015 found any substantive differences in the conclusions drawn from the confidence interval measure versus the other measures.” I don’t know what to make of that. How can they think 85% versus 40% is not a substantive difference?

Flaws in a fidelity metric

Another issue raised by the critique is what its authors call the “fidelity” of the replications: how well the replication protocols got the original studies’ methods right. As with variability in populations and procedures, this is an important issue that merits a careful look in any replication study.

The technical comment gives a few examples of differences between original and replication protocols that sound like they could have mattered in some cases. How did these issues play out in the RPP as a whole? Unfortunately, the critique uses a flawed metric to quantify the effects of fidelity: the original authors’ endorsement of the replication protocol.

There are two problems with their approach. First, original study authors have expertise in the methods, of course. But they also have inside knowledge about flaws in their original studies. The critique acknowledges this problem but makes no attempt to account for it in the analyses.

Second, Gilbert et al. compared “endorsements” to “nonendorsements,” but a majority of the so-called nonendorsements were cases where original authors simply did not respond – an important detail that is again only found in the supplement. Original authors only registered concerns in 11 out of 100 replications, versus 18 nonresponses. Like with any missing-data problem, we do not know what the nonresponders would have said if they had responded. But the analysis assumes that none of the 18 would have endorsed the replication protocols.
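A simple sensitivity check shows how much rides on that assumption. Using the 11 concerns, 18 nonresponses, and (by subtraction) 71 endorsements described above, here is how the apparent endorsement rate moves under different assumptions about what the nonresponders would have said (the scenarios themselves are of course hypothetical):

```python
# Sensitivity of the endorsement rate to assumptions about the 18 nonresponders.
concerns, nonresponses, total = 11, 18, 100
endorsed = total - concerns - nonresponses        # 71 explicit endorsements

scenarios = [
    ("all 18 nonresponders would have endorsed", 18),
    ("half would have endorsed", 9),
    ("none would have endorsed (the assumption described above)", 0),
]
for label, extra in scenarios:
    rate = (endorsed + extra) / total
    print(f"{label:60s} -> endorsement rate: {rate:.0%}")
```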

A cleaner fidelity metric would have helped. But ultimately, these kinds of indirect analyses can only go so far. Gilbert et al. claim that original studies would replicate just fine if only replicators would get the procedures right. This is an empirical question with a very direct way of getting an answer: go run a replication the way you think it ought to be done. I suspect that some of the studies probably would successfully replicate, either because of Type II error or substantive differences. We could learn a tremendous amount from direct empirical tests of hypotheses about replication fidelity and other hidden moderators, far more than we can from these kinds of indirect analyses with weak proxies.

We can move the conversation forward

In the last 5 years there have been a lot of changes in psychology. We now know that there are problems with how we have sometimes done research in the past. For example, it was long considered okay to analyze small, noisy datasets with a lot of flexibility to look around for patterns that supported a publishable conclusion. There is a lot more awareness now that these practices will lead to lower reproducibility, and the field is starting to do something about that. The RPP came around after we already knew that. But it added meaningfully to that discussion by giving us an estimate of reproducibility in several top journals. It gave us a sense, however rough, of where the field stood in 2008 before we started making changes.

That does not mean psychologists are all of one mind about where psychology is at on reproducibility and what we ought to do about it. There has been a lot of really fruitful discussion recently coming from different perspectives. Some of the critical commentaries raise good concerns and have a lot of things I agree with.

The RPP was a big and complicated project, and given its impact it warrants serious critical analysis from multiple perspectives. I agree with Uri Simonsohn that some of the protocol differences between originals and replications deserve closer scrutiny, and it is good that Gilbert et al. brought them to our attention. I found myself less enthusiastic about their analyses, for the reasons I have outlined here.

But the discussion will continue to move forward. The RPP dataset is still open, and I know there are other efforts under way to draw new insights from it. Even better, there is lots of other, new meta-science happening too. I remain optimistic that as we continue to learn more, we will keep making things better in our field.

* * * * *

UPDATE (3/8/2016): There has been a lot of discussion about the Gilbert et al. technical comment since I put up this blog. Gilbert et al. have written a reply that responds to some of the issues that I and others have raised.

Here are some other relevant discussions in the academic blogosphere:

8 thoughts on “Evaluating a new critique of the Reproducibility Project”

  1. Great post.

    It would be interesting to do a poll of psychologists to find out what people mean when they say “replicated”. To try to analyze the ‘folk statistics’ people hold. Because I think “replicated” is really a folk statistical concept – it’s like “average” (rather than the precisely defined “mean”). I wonder if some of the debate over RPP is because people hold different concepts of replication.

  2. Very nice. Regarding the previous poster’s question, I’ve always considered a result to replicate if my parameter estimates are very similar across studies. I am always beating the drum to graduate students that it is not sufficient to simply note whether the significance replicates.

    1. Mike: Right. But I think many people do feel that significance is the essence of replication. And then there’s the question of consistency: if I try to replicate a result twice, and I get one very close replication, and one complete failure to replicate, did the original result “replicate” or not? Does “replicate” mean “consistently replicate”?

      I think people differ in how they use these terms and this might be driving some of the controversy.

  3. I find the Gilbert et al. study ill-conceived. The main point of the RPP (as I see it) is to bring to the forefront the same thing that statisticians have been saying for years: The goal of any scientific study should be to produce results that are reproducible, and not only to produce results that are p < 0.05. Achieving this goal requires major changes in the mindset of scientists, academic publishers, and academic evaluators, and if successful, will benefit research quality. Why one would want to undermine such a positive endeavor is unclear to me.

  4. Hi, first of all thank you for posting your analysis. I must admit that reading the original articles you mentioned got me quite confused about what is going on, and a bit of clarity is always refreshing! I don’t know if this is the right place to share what I’m about to write but, since you’re well aware of the methodological problems we are facing (and of Meehl’s, Rozeboom’s, Cohen’s and Cumming’s work) and since you seem quite an experienced researcher, I wanted to discuss some points that (I might be wrong though) in my opinion seem to be completely overshadowed by this whole replication crisis.

    Besides the problems caused by misusing NHST and by underpowered studies involving WEIRD subjects, what has to be addressed is the more fundamental question of psychology as a science. In fact, even improving current techniques will not help us much with something that (in my experience, and I don’t have much) has never even been part of our training, that is, theory building and conceptual analysis. When (and how) will we overcome the massive data collection routine? The whole point of quantitative research is to test hypotheses derived from logical reasoning, which can be semantic or mathematical, but what we have most of the time are ordinal predictions that, under the best possible designs (as shown by Meehl), will be right 50% of the time no matter what theory is being tested. In fact, given the ‘crud factor’, demonstrating a non-null effect of some variable on another can hardly be called ‘theory testing’, and replication won’t be very informative…

    There have been successful efforts to derive quantitative predictions (instead of nil hypothesis testing), for instance in the case of the transtheoretical model, but, as far as I know, there has not been any attempt to define a unified theoretical framework for social psychologists to work within (besides meta-theories like I3 and such). Everyone goes by his favourite concept, and there is a significant amount of either relabeling similar concepts or testing platitudes under the disguise of ‘science’ because it is quantitative (I’m thinking about all the ‘need for xxx’ ones, which amount to saying that the storm is there because we can see rain and lightning, but I will do no better). Methodology, no matter how complex and rigorous, will not think for us…

    That has been noted several times (there is even a book from the 70’s called Social science as sorcery, some claims of which might unfortunately ring true today), but it doesn’t seem to bother anyone. I’m new to the field (I will start a PhD in September) and sometimes feel powerless with these issues (and with the mass of published papers I can’t read), since I’m about to follow that same path. Milgram didn’t have any control group, nor did he use anything but descriptive statistics, but his studies can still be (and are) replicated today. This is also true of Festinger. Heider wrote only one book but laid the foundation for modern social cognition… In his 1931 paper “The conflict between Aristotelian and Galileian modes of thought in contemporary psychology”, Lewin called for a Galilean revolution in our field… 85 years later, I wonder where our cumulative knowledge is. Our theory of evolution? Do you think such things would even be possible?

    I am not naive about the problems caused both by the publication system as it is and by the need for funds, which further narrows the focus onto micro-contextual phenomena that might yield some practical results (in terms of intervention, for instance), but maybe these profit-driven constraints are part of what truly impedes psychology’s progress (and not its youth); in fact, no other science had to face such obstacles in its early years. What can be done about it (if, of course, this is relevant, but I may be too uninformed or pessimistic)?

    1. Jay, I think these are great points. I guess I would start by saying that I don’t think anybody thinks the only stuff that matters is replicability and the family of issues that usually get discussed with it (like power, publication bias, openness, etc.). The focus is on those issues right now because it’s an identified problem and people think we can fix it. Replicability plays a critical role in making evidence useful for testing and developing theories, but that doesn’t mean it’s sufficient.

      As far as better theory goes, I don’t think we are going to be able to jump straight from where we are right now to mathematized theories that make quantitative point predictions. That may be more tractable in some areas than others but it’s not a cure-all solution or a next step. Where it’s not what we need, I like the ideas Paul Rozin has discussed about the importance of research that describes important phenomena. It needs to be acceptable to be motivated by “informed curiosity” rather than deductions from formal theory — which he argues is a common way of doing things in other sciences, like biology (https://sites.sas.upenn.edu/rozin/files/socpsysci195pspr2001pap_0.pdf). Lynne Cooper’s recent editorial for JPSP:PPID specifically cites Rozin and invites this kind of research, which I think is an important step.

      I also think that, even without making quantitative point predictions, we can develop and test theories by looking for otherwise-improbable patterns across diverse forms of evidence — the “damn strange coincidences” that Meehl talks about in his 1990 Psych Inquiry paper (http://rhowell.ba.ttu.edu/Meehl1.pdf).

      I don’t think it’s the case that we have not been doing these things at all, but I think we could think about ways to do them better. And I don’t think they’re disconnected from the present discussion, because having more replicable, less biased evidence will help with that.

      1. I guess you’re right when you say there is no cure-all solution. Thank you for the papers; I’ve already read Rozin’s, and he makes a great point when he says we could use fewer ANOVA designs and more qualitative data (this echoes the plea against ‘methodological idolatry’ and reminds me of Campbell & Fiske, 1959). I’ll read Meehl’s!!

        Anyway, I was not arguing for a purely mathematical theorization, which, given the probabilistic nature of the phenomena we investigate, would be equivalent to chasing butterflies, but there are definitely some ‘tougher’ tests we could put theories to (see for instance http://www.ncbi.nlm.nih.gov/pubmed/22837590; sorry, but I don’t have a link to a full-access online pdf).

        I’m glad you answered back,
        Best regards

  5. This whole discussion would be a lot clearer if people would frame it in terms of Gelman’s Type M and Type S errors.
