An open review of Many Labs 3: Much to learn

A pre-publication manuscript for the Many Labs 3 project has been released. The project, with 64 authors and supported by the Center for Open Science, ran replications of 10 previously published effects on diverse topics. The research was conducted in 20 different university subject pools plus an Mturk comparison sample, with very high statistical power (over 3,000 subjects total). The project was pre-registered, and wherever possible the replicators worked with original authors or other experts to make the replications faithful to the originals. Big kudos to the project coordinating team and all the researchers who participated in this important work, as well as all the original authors who worked with the replicators.

A major goal was to examine whether time of semester moderates effect sizes, testing the common intuition among researchers that subjects are “worse” (less attentive) at the end of the term. But really, there is much more to it than that:

Not much replicated. The abstract says that 7 of 10 effects did not replicate. But dig a little deeper and the picture is more complicated. For starters, only 9 of those 10 effects were direct (or if you prefer, “close”) replications. The other was labeled a conceptual replication and deserves separate handling. More on that below; for now, let’s focus on the 9 direct replications.

Of the 3 effects that were tagged as “successful” replications, one — the Stroop effect — is so well-established that it is probably best thought of as a validity check on the project. If the researchers could not obtain a Stroop effect, I would have been more inclined to believe that they screwed something up than to believe that the Stroop isn’t real. But they got the Stroop effect, quite strongly. Test passed. (Some other side findings from manipulation checks or covariates could also be taken as validation tests. Strong arguments were judged as more persuasive; objectively warmer rooms were judged as warmer; etc.)

One major departure I have from the authors is how they defined a successful replication — as a significant effect in the same direction as the original significant effect. Tallying significance tests is not a good metric because, as the saying goes, the difference between significant and not significant is not necessarily significant. And perhaps less obviously, two significant effects — even in the same direction — can be very different.

I have argued before that the core question of replication is really this: If I do the same experiment, will I observe the same result? So the appropriate test is whether the replication effect equals the original effect, plus or minus error. It was not always possible to obtain an effect size from the original published report. But in many instances where it was, the replication effects were much smaller than the original ones. A particularly striking case of this was the replication of Tversky and Kahneman’s original, classic demonstration of the availability heuristic. The original effect was d = 0.82, 95% CI = [0.47, 1.17]. The replication was d = 0.09, 95% CI = [0.02, 0.16]. So the ML3 authors counted it as a “successful” replication, but the replication study obtained a substantially different effect than the original study.
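For the curious, here is a rough sketch in Python of what such a test could look like for the availability result. It backs out standard errors from the reported 95% CIs and runs a z-test on the difference between the two effects. It assumes the intervals are symmetric normal-approximation intervals and that the two estimates are independent; the function names are mine, not anything from the ML3 materials.

```python
import math

def se_from_ci(lower, upper, z_crit=1.96):
    """Recover a standard error from a symmetric 95% CI (an assumption)."""
    return (upper - lower) / (2 * z_crit)

def z_test_difference(d1, se1, d2, se2):
    """z-test of H0: the two studies estimate the same effect size."""
    se_diff = math.sqrt(se1 ** 2 + se2 ** 2)  # independent estimates
    z = (d1 - d2) / se_diff
    p = math.erfc(abs(z) / math.sqrt(2))      # two-sided normal p-value
    return z, p

# Effect sizes and intervals as reported in the text above.
se_original = se_from_ci(0.47, 1.17)
se_replication = se_from_ci(0.02, 0.16)
z, p = z_test_difference(0.82, se_original, 0.09, se_replication)
print(f"z = {z:.2f}, two-sided p = {p:.5f}")
```

Under those assumptions the two estimates sit about 4 standard errors apart, which is hard to square with calling them the same result.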

Does it matter if the results are different but in the same direction? I say it does. Effect sizes matter, for a variety of reasons. Lurking in the background is an important set of meta-science questions about the quality of evidence in published reports. Remember, out of a set of 9 direct replications, 6 could not produce the original effect at all. Against that backdrop, what does it mean when a 7th one’s conclusion was right but the data to support it were not reproducible? You start to wonder if it’s a win for Tversky and Kahneman’s intellects rather than for the scientific process. I don’t want to count on my colleagues’ good guesswork (not all of us are as insightful as Tversky and Kahneman). I want to be persuaded by good data.

For what it’s worth, the other “successful” replication (metaphoric structuring, from a 2000 study by Lera Boroditsky) had a smaller point estimate than the original study, but the original had low power and thus a huge confidence interval (d = 0.07 to 1.20!). Without running the numbers, my guess is that a formal test of the difference between the original and replication effects would not be significant for that reason.

So what’s the box score for the direct replications? On the plus side, 1 gimme and 1 newish effect. In the middle, 1 replication that agreed with the original result in direction but sharply differed in magnitude. And then 6 blanks.

None of the interactions replicated. Of 3 hypothesized interactions, none of them replicated. None, zip, nada. All three were interactions between a manipulated variable and an individual difference. Interactions like that require a lot of statistical power to detect, and the replication effort had much more power than the original. We now know that under conditions of publication bias, published underpowered studies are especially likely to be biased (i.e., false positives). So add this to the growing pile of evidence that researchers and journals need to pay more attention to power.
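A toy simulation illustrates the point. This is a sketch under simplifying assumptions: a plain two-group comparison rather than an interaction, a normal-approximation cutoff of 1.96 instead of the exact t critical value, and a hypothetical "journal" that prints only significant results in the predicted direction. The true effect, the sample sizes, and the function names are all made up for illustration.

```python
import math
import random

def cohens_d(a, b):
    """Cohen's d for two equal-sized groups, using the pooled SD."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    return (mean_a - mean_b) / math.sqrt((var_a + var_b) / 2)

def published_effects(true_d, n_per_group, n_studies, seed=0):
    """Simulate studies and keep only those that pass the significance filter."""
    rng = random.Random(seed)
    published = []
    for _ in range(n_studies):
        treatment = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        d = cohens_d(treatment, control)
        t = d * math.sqrt(n_per_group / 2)  # t statistic for equal group sizes
        if t > 1.96:                        # approximate cutoff (assumption)
            published.append(d)
    return published

# Hypothetical true effect of d = 0.2, studied with small and large samples.
for n in (20, 200):
    pub = published_effects(true_d=0.2, n_per_group=n, n_studies=1000, seed=1)
    print(f"n per group = {n}: {len(pub)} published, "
          f"mean published d = {sum(pub) / len(pub):.2f}")
```

With only 20 subjects per group, a study has to observe a d of roughly 0.62 or more to clear the cutoff, so every "published" estimate overshoots the true 0.2 by a wide margin. With 200 per group the cutoff falls to about 0.2, and the filter distorts the published record far less.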

A particularly interesting case is the replication of a moral credentialing study. In the original study, the authors had hypothesized a main effect of a credentialing manipulation. They didn’t find the main effect, but they did find (unexpectedly) an interaction with gender. By contrast, Many Labs 3 found the originally hypothesized main effect but not the gender interaction. So it is quite possible (if you think ML3 is the more reliable dataset) that the original study obtained both a false negative and a false positive, both potentially costly consequences of running with too little power.

Emblematic problems with a “conceptual replication.” An oddball in the mix was the one “conceptual replication.” The original study looked at associations between two self-report instruments, the NEO PI-R and Cloninger’s TCI, in a sample of psychiatric patients admitted to a hospital emergency unit. The replicators picked out one correlation to focus on — the one between NEO Conscientiousness and TCI Persistence. But they subbed in a 2-item scale from Gosling et al.’s TIPI for the 48-item NEO Conscientiousness scale, and instead of using the TCI they used time spent on an unsolvable anagram task as a measure of “persistence.” And they looked at all this in university subject pools and an Mturk sample.

The NEO-to-TIPI exchange is somewhat defensible, as the TIPI has been validated against the NEO PI-R.* But the shorter measure should be expected to underestimate any effects.
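The expected underestimation follows from the classical attenuation formula: an observed correlation shrinks by the square root of the product of the two measures' reliabilities. Here is a minimal sketch, with the Spearman-Brown formula added to show how reliability grows with scale length. The reliability values below are hypothetical placeholders, not published figures for the TIPI or the NEO PI-R, and Spearman-Brown assumes parallel items, which real scales only approximate.

```python
import math

def attenuated_r(true_r, rel_x, rel_y):
    """Observed correlation expected when both measures are unreliable."""
    return true_r * math.sqrt(rel_x * rel_y)

def spearman_brown(reliability, length_factor):
    """Projected reliability when a scale is lengthened by length_factor."""
    k = length_factor
    return k * reliability / (1 + (k - 1) * reliability)

# Hypothetical numbers for illustration only.
short_rel = 0.50                           # a 2-item scale
long_rel = spearman_brown(short_rel, 24)   # same items, 24x longer (48 items)
criterion_rel = 0.80                       # reliability of the other measure
true_r = 0.40                              # assumed true-score correlation

print(f"projected 48-item reliability: {long_rel:.2f}")
print(f"expected r with long scale:  {attenuated_r(true_r, long_rel, criterion_rel):.2f}")
print(f"expected r with short scale: {attenuated_r(true_r, short_rel, criterion_rel):.2f}")
```

The particular numbers matter less than the direction: holding the underlying relationship fixed, the shorter and less reliable scale yields a smaller expected correlation.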

Swapping in the unsolvable anagrams for the TCI Persistence scale is a bit more mystifying. Sure, people sometimes call the anagram task a measure of “persistence” but (a) it’s a one-off behavioral measure not optimized for measuring stable individual differences, and (b) since there is no literature cited (or that I know of) connecting the anagram task to the TCI, there is a serious risk of falling prey to the jingle fallacy. In other words, just because the thing measured by the TCI and the thing measured by the anagrams are called “persistence” by different researchers, that does not mean they are the same construct. (For example, goals are hierarchical. Planful “persistence” at higher-order goals might be very different from wheel-spinning “persistence” on narrow tasks. Indeed, one of the 2 cited studies on the anagrams task used disengagement from the anagrams as an index of adaptive behavior.)

These problems, I think, are emblematic of problems with how “conceptual replication” is often treated. First, direct and conceptual replication are not the same thing and should not be treated as such. Direct replication is about the reproducibility of scientific observations; conceptual replication is about putting a theory to the test in more than one way. Both are a necessary part of science but they are not substitutable. Second, in this case the conceptual replication is problematic on its own terms. So many things changed from the original to the “replication” that it is difficult to interpret the difference in results. Is the issue that the TIPI was swapped in for the NEO? The anagrams for the TCI? Is it an issue of going from self-report to tested behavior? From an aggregate measure of personality to a one-shot measure? Psychiatric emergency patients to undergrads? Conceptual replication is much more useful when it is incremental and systematic.

The same interpretive difficulty would be present if the ML3 result had appeared to replicate the original, but I suspect we would be much less likely to notice it. We might just say hooray, it replicated — when it is entirely possible that we would actually be looking at 2 completely unrelated effects that just happened to have gotten the same labels.

Another blow against the “hidden moderator” defense. So far I haven’t touched on the stated purpose of this project, which was to look at time-of-semester as a moderator of effects. The findings are easy to summarize: it turns out there was no there there. The authors also tested for heterogeneity of effects due to order (since subjects got the tasks for the 10 replications in random order) or across samples and labs. There was no there there either. Not even in the 2 tasks that were administered in person rather than by computers and thus were harder to standardize.

The authors put the interpretation so well that I’ll quote them at length here [emphasis added]:

A common explanation for the challenges of replicating results across samples and settings is that there are many seen and unseen moderators that qualify the detectability of effects (Cesario, 2014). As such, when differences are observed across study administrations, it is easy to default to the assumption that it must be due to features differing between the samples and settings. Besides time of semester, we tested whether the site of data collection, and the order of administration during the study session moderated the effects. None of these had a substantial impact on any of the investigated effects. This observation is consistent with the first “Many Labs” study (Klein et al., 2014) and is the focus of the second (Klein et al., 2015). The present study provides further evidence against sample and setting differences being a default explanation for variation in replicability. That is not to deny that such variation occurs, just that direct evidence for a given effect is needed to demonstrate that it is a viable explanation.

When effects do not replicate, “maybe there were moderators” needs to be treated as what it is – a post hoc speculation for future testing. Moderator explanations are worth consideration on a case-by-case basis, and hunting for them could lead to genuine discoveries. But as evidence keeps accruing that sample and setting moderators are not commonplace, we should be more and more skeptical of those kinds of explanations offered post hoc. High-powered pre-registered replications are very credible. And when the replication results disagree with the original study, our interpretations should account for what we know about the statistical properties of the studies and the way we came to see the evidence, and give less weight to untested speculation.


* An earlier version incorrectly stated that the TIPI had only been validated indirectly against the NEO. Thanks to Sam Gosling for pointing out the error.

What counts as a successful or failed replication?

Let’s say that some theory states that people in psychological state A1 will engage in behavior B more than people in psychological state A2. Suppose that, a priori, the theory allows us to make this directional prediction, but not a prediction about the size of the effect.

A researcher designs an experiment — call this Study 1 — in which she manipulates A1 versus A2 and then measures B. Consistent with the theory, the result of Study 1 shows more of behavior B in condition A1 than A2. The effect size is d=0.8 (a large effect). A null hypothesis significance test shows that the effect is significantly different from zero, p<.05.

Now Researcher #2 comes along and conducts Study 2. The procedures of Study 2 copy Study 1 as closely as possible — the same manipulation of A, the same measure of B, etc. The result of Study 2 shows more of behavior B in condition A1 than in A2 — same direction as Study 1. In Study 2, the effect size is d=0.3 (a smallish effect). A null hypothesis significance test shows that the effect is significantly different from zero, p<.05. But a comparison of the Study 1 effect to the Study 2 effect (d=0.8 versus d=0.3) is also significant, p<.05.

Here’s the question: did Study 2 successfully replicate Study 1?

My answer is no. Here’s why. When we say “replication,” we should be talking about whether we can reproduce a result. A statistical comparison of Studies 1 and 2 shows that they gave us significantly different results. We should be bothered by the difference, and we should be trying to figure out why.
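To make the thought experiment concrete, here is a small sketch of the three tests involved. The scenario above gives only the effect sizes, so the sample sizes (50 per group in Study 1, 500 per group in Study 2) are hypothetical, picked so that all three tests come out significant as described. The code uses the usual large-sample approximation for the sampling variance of d and treats the two studies as independent.

```python
import math

def d_variance(d, n1, n2):
    """Large-sample approximation to the sampling variance of Cohen's d."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))

def two_sided_p(z):
    """Two-sided p-value from a standard normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# Effect sizes from the scenario; sample sizes are hypothetical.
d1, n1 = 0.8, 50    # Study 1, n per group
d2, n2 = 0.3, 500   # Study 2, n per group

var1 = d_variance(d1, n1, n1)
var2 = d_variance(d2, n2, n2)

p_study1 = two_sided_p(d1 / math.sqrt(var1))              # Study 1 vs. zero
p_study2 = two_sided_p(d2 / math.sqrt(var2))              # Study 2 vs. zero
p_diff = two_sided_p((d1 - d2) / math.sqrt(var1 + var2))  # Study 1 vs. Study 2

print(f"Study 1 vs. zero:    p = {p_study1:.4f}")
print(f"Study 2 vs. zero:    p = {p_study2:.6f}")
print(f"Study 1 vs. Study 2: p = {p_diff:.4f}")
```

The third test is the one that speaks to replication: it asks whether the two studies produced the same result, not whether each one differs from zero.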

People who would call Study 2 a “successful” replication of Study 1 are focused on what it means for the theory. The theoretical statement that inspired the first study only spoke about direction, and both results came out in the same direction. By that standard you could say that it replicated.

But I have two problems with defining replication in that way. My first problem is that, after learning the results of Study 1, we had grounds to refine the theory to include statements about the likely range of the effect’s size, not just its direction. Those refinements might be provisional, and they might be contingent on particular conditions (i.e., the experimental conditions under which Study 1 was conducted), but we can and should still make them. So Study 2 should have had a different hypothesis, a more focused one, than Study 1. Theories should be living things, changing every time they encounter new data. If we define replication as testing the theory twice then there can be no replication, because the theory is always changing.

My second problem is that we should always be putting theoretical statements to multiple tests. That should be such normal behavior in science that we shouldn’t dilute the term “replication” by including every possible way of doing it. As Michael Shermer once wrote, “Proof is derived through a convergence of evidence from numerous lines of inquiry — multiple, independent inductions all of which point to an unmistakable conclusion.” We should all be working toward that goal all the time.

This distinction — between empirical results vs. conclusions about theories — goes to the heart of the discussion about direct and conceptual replication. Direct replication means that you reproduce, as faithfully as possible, the procedures and conditions of the original study. So the focus should rightly be on the result. If you get a different result, it either means that despite your best efforts something important differed between the two studies, or that one of the results was an accident.

By contrast, when people say “conceptual replication” they mean that they have deliberately changed one or more major parts of the study — like different methods, different populations, etc. Theories are abstractions, and in a “conceptual replication” you are testing whether the abstract theoretical statement (in this case, B|A1 > B|A2) is still true under a novel concrete realization of the theory. That is important scientific work, but it differs in huge, qualitative ways from true replication. As I’ve said, it’s not just a difference in empirical procedures; it’s a difference in what kind of inferences you are trying to draw (inferences about a result vs. inferences about a theoretical statement). Describing those simply as 2 varieties of the same thing (2 kinds of replication) blurs this important distinction.

I think this means a few important things for how we think about replications:

1. When judging a replication study, the correct comparison is between the original result and the new one. Even if the original study ran a significance test against a null hypothesis of zero effect, that isn’t the test that matters for the replication. There are probably many ways of making this comparison, but within the NHST framework that is familiar to most psychologists, the proper “null hypothesis” to test against is the one that states that the two studies produced the same result.

2. When we observe a difference between a replication and an original study, we should treat that difference as a problem to be solved. Not (yet) as a conclusive statement about the validity of either study. Study 2 didn’t “fail to replicate” Study 1; rather, Studies 1 and 2 produced different results when they should have produced the same, and we now need to figure out what caused that difference.

3. “Conceptual replication” should depend on a foundation of true (“direct”) replicability, not substitute for it. The logic for this is very much like how validity is strengthened by reliability. It doesn’t inspire much confidence in a theory to say that it is supported by multiple lines of evidence if all of those lines, on their own, give results of poor or unknown consistency.