An open review of Many Labs 3: Much to learn

A pre-publication manuscript for the Many Labs 3 project has been released. The project, with 64 authors and supported by the Center for Open Science, ran replications of 10 previously published effects on diverse topics. The research was conducted in 20 different university subject pools plus an Mturk comparison sample, with very high statistical power (over 3,000 subjects total). The project was pre-registered, and wherever possible the replicators worked with original authors or other experts to make the replications faithful to the originals. Big kudos to the project coordinating team and all the researchers who participated in this important work, as well as all the original authors who worked with the replicators.

A major goal was to examine whether time of semester moderates effect sizes, testing the common intuition among researchers that subjects are “worse” (less attentive) at the end of the term. But really, there is much more to it than that:

Not much replicated. The abstract says that 7 of 10 effects did not replicate. But dig a little deeper and the picture is more complicated. For starters, only 9 of those 10 effects were direct (or if you prefer, “close”) replications. The other was labeled a conceptual replication and deserves separate handling. More on that below; for now, let’s focus on the 9 direct replications.

Of the 3 effects that were tagged as “successful” replications, one — the Stroop effect — is so well-established that it is probably best thought of as a validity check on the project. If the researchers could not obtain a Stroop effect, I would have been more inclined to believe that they screwed something up than to believe that the Stroop isn’t real. But they got the Stroop effect, quite strongly. Test passed. (Some other side findings from manipulation checks or covariates could also be taken as validation tests. Strong arguments were judged as more persuasive; objectively warmer rooms were judged as warmer; etc.)

One major point where I depart from the authors is how they defined a successful replication: as a significant effect in the same direction as the original significant effect. Tallying significance tests is not a good metric because, as the saying goes, the difference between significant and not significant is not necessarily significant. And perhaps less obviously, two significant effects, even in the same direction, can be very different.

I have argued before that the core question of replication is really this: If I do the same experiment, will I observe the same result? So the appropriate test is whether the replication effect equals the original effect, plus or minus error. It was not always possible to obtain an effect size from the original published report. But in many instances where it was, the replication effects were much smaller than the original ones. A particularly striking case of this was the replication of Tversky and Kahneman’s original, classic demonstration of the availability heuristic. The original effect was d = 0.82, 95% CI = [0.47, 1.17]. The replication was d = 0.09, 95% CI = [0.02, 0.16]. So the ML3 authors counted it as a “successful” replication, but the replication study obtained a substantially different effect than the original study.
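To make the "equals the original, plus or minus error" test concrete, here is a rough sketch using the two availability-heuristic estimates quoted above. It assumes the reported intervals are symmetric normal-approximation 95% CIs (so the standard error can be backed out from the interval width) and that the two studies are independent. The function names are my own, and this is a back-of-the-envelope check, not the analysis the ML3 authors ran.

```python
import math

def se_from_ci(lower, upper, z_crit=1.96):
    """Back out a standard error from a symmetric 95% CI (assumed normal approximation)."""
    return (upper - lower) / (2 * z_crit)

def difference_test(d_orig, ci_orig, d_rep, ci_rep):
    """z-test of whether two independent effect size estimates differ."""
    se_orig = se_from_ci(*ci_orig)
    se_rep = se_from_ci(*ci_rep)
    diff = d_orig - d_rep
    # Variances add for the difference of two independent estimates.
    se_diff = math.sqrt(se_orig ** 2 + se_rep ** 2)
    z = diff / se_diff
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return diff, se_diff, z, p

# Values as quoted in the text: original study vs. ML3 replication.
diff, se_diff, z, p = difference_test(0.82, (0.47, 1.17), 0.09, (0.02, 0.16))
print(f"difference in d = {diff:.2f}, SE = {se_diff:.3f}, z = {z:.2f}, p = {p:.5f}")
```

Under those assumptions the difference of 0.73 is roughly four standard errors from zero, so the two studies disagree by far more than sampling error alone would suggest, even though both are "significant" in the same direction.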

Does it matter if the results are different but in the same direction? I say it does. Effect sizes matter, for a variety of reasons. Lurking in the background is an important set of meta-science questions about the quality of evidence in published reports. Remember, out of a set of 9 direct replications, 6 could not produce the original effect at all. Against that backdrop, what does it mean when a 7th one’s conclusion was right but the data to support it were not reproducible? You start to wonder if it’s a win for Tversky and Kahneman’s intellects rather than for the scientific process. I don’t want to count on my colleagues’ good guesswork (not all of us are as insightful as Tversky and Kahneman). I want to be persuaded by good data.

For what it’s worth, the other “successful” replication (metaphoric structuring, from a 2000 study by Lera Boroditsky) had a smaller point estimate than the original study, but the original had low power and thus a huge confidence interval (d = 0.07 to 1.20!). Without running the numbers, my guess is that a formal test of the difference between the original and replication effects would not be significant for that reason.

So what’s the box score for the direct replications? On the plus side, 1 gimme and 1 newish effect. In the middle, 1 replication that agreed with the original result in direction but sharply differed in magnitude. And then 6 blanks.

None of the interactions replicated. Of 3 hypothesized interactions, none of them replicated. None, zip, nada. All three were interactions between a manipulated variable and an individual difference. Interactions like that require a lot of statistical power to detect, and the replication effort had much more power than the original. We now know that under conditions of publication bias, published underpowered studies are especially likely to be biased (i.e., false positives). So add this to the growing pile of evidence that researchers and journals need to pay more attention to power.
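The link between low power and false positives can be sketched with a standard positive-predictive-value calculation. The numbers below are purely hypothetical: I assume that 25% of tested hypotheses are true, that only significant results get published, and that there is no other source of bias. None of these figures come from the ML3 report.

```python
def positive_predictive_value(power, prior, alpha=0.05):
    """Share of significant results that reflect a true effect.

    power: probability of a significant result when the effect is real
    prior: assumed proportion of tested hypotheses that are true (hypothetical)
    alpha: false positive rate when the effect is not real
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

PRIOR = 0.25  # hypothetical assumption, chosen only for illustration

ppv_low = positive_predictive_value(power=0.20, prior=PRIOR)
ppv_high = positive_predictive_value(power=0.95, prior=PRIOR)

print(f"PPV at 20% power: {ppv_low:.2f}")
print(f"PPV at 95% power: {ppv_high:.2f}")
```

With these made-up inputs, a significant result from the low-powered design is much more likely to be a false positive than one from the high-powered design. The exact values depend entirely on the assumed prior, but the direction of the comparison does not.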

A particularly interesting case is the replication of a moral credentialing study. In the original study, the authors had hypothesized a main effect of a credentialing manipulation. They didn’t find the main effect, but they did find (unexpectedly) an interaction with gender. By contrast, Many Labs 3 found the originally hypothesized main effect but not the gender interaction. So it is quite possible (if you think ML3 is the more reliable dataset) that the original study obtained both a false negative and a false positive, both potentially costly consequences of running with too little power.

Emblematic problems with a “conceptual replication.” An oddball in the mix was the one “conceptual replication.” The original study looked at associations between two self-report instruments, the NEO PI-R and Cloninger’s TCI, in a sample of psychiatric patients admitted to a hospital emergency unit. The replicators picked out one correlation to focus on — the one between NEO Conscientiousness and TCI Persistence. But they subbed in a 2-item scale from Gosling et al.’s TIPI for the 48-item NEO Conscientiousness scale, and instead of using the TCI they used time spent on an unsolvable anagram task as a measure of “persistence.” And they looked at all this in university subject pools and an Mturk sample.

The NEO-to-TIPI exchange is somewhat defensible, as the TIPI has been validated against the NEO PI-R.* But the shorter measure should be expected to underestimate any effects.
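Why a shorter measure should underestimate effects can be sketched with two textbook psychometric formulas: the Spearman-Brown formula for how reliability changes with scale length, and the classical attenuation formula for how unreliability shrinks an observed correlation. Every number below is hypothetical. In particular, Spearman-Brown assumes parallel items, which a deliberately broad 2-item scale does not satisfy, so treat this as a rough illustration of the direction of the effect rather than an estimate for the TIPI.

```python
import math

def spearman_brown(reliability, length_factor):
    """Predicted reliability when a scale's length is multiplied by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

def attenuated_r(true_r, rel_x, rel_y):
    """Expected observed correlation given the reliabilities of both measures."""
    return true_r * math.sqrt(rel_x * rel_y)

# Hypothetical inputs, chosen only for illustration.
LONG_RELIABILITY = 0.90   # assumed reliability of a 48-item scale
OUTCOME_RELIABILITY = 0.80  # assumed reliability of the other measure
TRUE_R = 0.30             # assumed correlation between the true scores

short_reliability = spearman_brown(LONG_RELIABILITY, length_factor=2 / 48)
r_long = attenuated_r(TRUE_R, LONG_RELIABILITY, OUTCOME_RELIABILITY)
r_short = attenuated_r(TRUE_R, short_reliability, OUTCOME_RELIABILITY)

print(f"predicted reliability of 2-item version: {short_reliability:.2f}")
print(f"expected observed r with 48 items: {r_long:.2f}")
print(f"expected observed r with 2 items:  {r_short:.2f}")
```

Under these assumptions the same underlying correlation shows up as a noticeably smaller observed correlation with the short scale, which is the sense in which the swap stacks the deck against finding the original effect.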

Swapping in the unsolvable anagrams for the TCI Persistence scale is a bit more mystifying. Sure, people sometimes call the anagram task a measure of “persistence” but (a) it’s a one-off behavioral measure not optimized for measuring stable individual differences, and (b) since there is no literature cited (or that I know of) connecting the anagram task to the TCI, there is a serious risk of falling prey to the jingle fallacy. In other words, just because the thing measured by the TCI and the thing measured by the anagrams are called “persistence” by different researchers, that does not mean they are the same construct. (For example, goals are hierarchical. Planful “persistence” at higher-order goals might be very different from wheel-spinning “persistence” on narrow tasks. Indeed, one of the 2 cited studies on the anagrams task used disengagement from the anagrams as an index of adaptive behavior.)

These problems, I think, are emblematic of problems with how “conceptual replication” is often treated. First, direct and conceptual replication are not the same thing and should not be treated as such. Direct replication is about the reproducibility of scientific observations; conceptual replication is about putting a theory to the test in more than one way. Both are a necessary part of science but they are not substitutable. Second, in this case the conceptual replication is problematic on its own terms. So many things changed from the original to the “replication” that it is difficult to interpret the difference in results. Is the issue that the TIPI was swapped in for the NEO? The anagrams for the TCI? Is it an issue of going from self-report to tested behavior? From an aggregate measure of personality to a one-shot measure? Psychiatric emergency patients to undergrads? Conceptual replication is much more useful when it is incremental and systematic.

The same interpretive difficulty would be present if the ML3 result had appeared to replicate the original, but I suspect we would be much less likely to notice it. We might just say hooray, it replicated — when it is entirely possible that we would actually be looking at 2 completely unrelated effects that just happened to have gotten the same labels.

Another blow against the “hidden moderator” defense. So far I haven’t touched on the stated purpose of this project, which was to look at time-of-semester as a moderator of effects. The findings are easy to summarize: it turns out there was no there there. The authors also tested for heterogeneity of effects due to order (since subjects got the tasks for the 10 replications in random order) or across samples and labs. There was no there there either. Not even in the 2 tasks that were administered in person rather than by computers and thus were harder to standardize.

The authors put the interpretation so well that I’ll quote them at length here [emphasis added]:

A common explanation for the challenges of replicating results across samples and settings is that there are many seen and unseen moderators that qualify the detectability of effects (Cesario, 2014). As such, when differences are observed across study administrations, it is easy to default to the assumption that it must be due to features differing between the samples and settings. Besides time of semester, we tested whether the site of data collection, and the order of administration during the study session moderated the effects. None of these had a substantial impact on any of the investigated effects. This observation is consistent with the first “Many Labs” study (Klein et al., 2014) and is the focus of the second (Klein et al., 2015). The present study provides further evidence against sample and setting differences being a default explanation for variation in replicability. That is not to deny that such variation occurs, just that direct evidence for a given effect is needed to demonstrate that it is a viable explanation.

When effects do not replicate, “maybe there were moderators” needs to be treated as what it is – a post hoc speculation for future testing. Moderator explanations are worth consideration on a case-by-case basis, and hunting for them could lead to genuine discoveries. But as evidence keeps accruing that sample and setting moderators are not commonplace, we should be more and more skeptical of those kinds of explanations offered post hoc. High-powered pre-registered replications are very credible. And when the replication results disagree with the original study, our interpretations should account for what we know about the statistical properties of the studies and the way we came to see the evidence, and give less weight to untested speculation.


* An earlier version incorrectly stated that the TIPI had only been validated indirectly against the NEO. Thanks to Sam Gosling for pointing out the error.