Moderator interpretations of the Reproducibility Project

The Reproducibility Project: Psychology (RPP) was published in Science last week. There has been some excellent coverage and discussion since then. If you haven’t heard about it,* Ed Yong’s Atlantic coverage will catch you up. And one of my favorite commentaries so far is on Michael Frank’s blog, with several very smart and sensible ways the field can proceed next.

Rather than offering a broad commentary, in this post I’d like to discuss one possible interpretation of the results of the RPP, which is “hidden moderators.” Hidden moderators are unmeasured differences between original and replication experiments that would result in differences in the true, underlying effects and therefore in the observed results of replications. Things like differences in subject populations and experimental settings. Moderator interpretations were the subject of a lengthy discussion on the ISCON Facebook page recently, and are the focus of an op-ed by Lisa Feldman Barrett.

In the post below, I evaluate the hidden-moderator interpretation. The tl;dr version is this: Context moderators are probably common in the world at large and across independently-conceived experiments. But an explicit design goal of direct replication is to eliminate them, and there’s good reason to believe they are rare in replications.

1. Context moderators are probably not common in direct replications

Many social and personality psychologists believe that lots of important effects vary by context out in the world at large. I am one of those people — subject and setting moderators are an important part of what I study in my own work. William McGuire discussed the idea quite eloquently, and it can be captured in an almost koan-like quote from Niels Bohr: “The opposite of one profound truth may very well be another profound truth.” It is very often the case that support for a broad hypothesis will vary and even be reversed over traditional “moderators” like subjects and settings, as well as across different methods and even different theoretical interpretations.

Consider as a thought experiment** what would happen if you told 30 psychologists to go test the deceptively simple-seeming hypothesis “Happiness causes smiling” and turned them loose. You would end up with 30 different experiments that would differ in all kinds of ways that we can be sure would matter: subjects (with different cultural norms of expressiveness), settings (e.g., subjects run alone vs. in groups), manipulations and measures of the IV (film clips, IAPS pictures, frontal asymmetry, PANAS?) and DV (FACS, EMG, subjective ratings?), and even construct definitions (state or trait happiness? eudaimonic or hedonic? Duchenne or social smiles?). You could learn a lot by studying all the differences between the experiments and their results.

But that stands in stark contrast to how direct replications are carried out, including the RPP. Replicators aren’t just turned loose with a broad hypothesis. In direct replication, the goal is to test the hypothesis “If I do the same experiment, I will get the same result.” Sometimes a moderator-ish hypothesis is built in (“this study was originally done with college students, will I get the same effect on Mturk?”). But such differences from the original are planned in. The explicit goal of replication design is for any other differences to be controlled out. Well-designed replication research makes a concerted effort to faithfully repeat the original experiments in every way that documentation, expertise, and common sense say should matter (and often in consultation with original authors too). The point is to squeeze out any room for substantive differences.

Does it work? In a word, yes. We now have data telling us that the squeezing can be very effective. In Many Labs 1 and Many Labs 3 (which I reviewed here), different labs followed standardized replication protocols for a series of experiments. In principle, different experimenters, different lab settings, and different subject populations could have led to differences between lab sites. But in analyses of heterogeneity across sites, that was not the result. In ML1, some of the very large and obvious effects (like anchoring) varied a bit in just how large they were (from “kinda big” to “holy shit”). Across both projects, more modest effects were quite consistent. Nowhere was there evidence that interesting effects wink in and out of detectability for substantive reasons linked to sample or setting.
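
To make "analyses of heterogeneity across sites" concrete, here is a minimal sketch in R of the kind of check involved: pool per-site effect estimates and compute Cochran's Q and I². The numbers below are made up for illustration; they are not the Many Labs data.

```r
# Minimal sketch with made-up numbers: per-site effects pooled by inverse-variance
# weighting, with Cochran's Q and I^2 as indices of between-site heterogeneity.
d  <- c(0.42, 0.35, 0.51, 0.38, 0.47, 0.40)   # per-site standardized effects (hypothetical)
se <- c(0.11, 0.13, 0.12, 0.10, 0.14, 0.12)   # per-site standard errors (hypothetical)

w      <- 1 / se^2                  # inverse-variance weights
d_pool <- sum(w * d) / sum(w)       # pooled (fixed-effect) estimate
Q      <- sum(w * (d - d_pool)^2)   # Cochran's Q
df     <- length(d) - 1
p_Q    <- pchisq(Q, df, lower.tail = FALSE)
I2     <- 100 * max(0, (Q - df) / Q)  # % of variability beyond sampling error

round(c(pooled_d = d_pool, Q = Q, p_Q = p_Q, I2 = I2), 3)
```

A nonsignificant Q (and an I² near zero) is the pattern described above: sites differ from each other no more than sampling error alone would predict.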

We will continue to learn more as our field gets more experience with direct replication. But right now, a reasonable conclusion from the good, systematic evidence we have available is this: When some researchers write down a good protocol and other researchers follow it, the results tend to be consistent. In the bigger picture this is a good result for social psychology: it is empirical evidence that good scientific control is within our reach, neither beyond our experimental skills nor intractable for the phenomena we study.***

But it also means that when replicators try to clamp down on potential moderators, it is reasonable to think that they usually do a good job. Remember, the Many Labs labs weren’t just replicating the original experiments (from which their results sometimes differed – more on that in a moment). They were very successfully and consistently replicating each other. There could be individual exceptions here and there, but on the whole our field’s experience with direct replication so far tells us that it should be unusual for unanticipated moderators to escape replicators’ diligent efforts at standardization and control.

2. A comparison of a published original and a replication is not a good way to detect moderators

Moderation means there is a substantive difference between 2 or more (true, underlying) effects as a function of the moderator variable. When you design an experiment to test a moderation hypothesis, you have to set things up so you can make a valid comparison. Your observations should ideally be unbiased, or failing that, the biases should be the same at different levels of the moderator so that they cancel out in the comparison.

With the RPP (and most replication efforts), we are trying to interpret observed differences between published original results and replications. The moderator interpretation rests on the assumption that observed differences between experiments are caused by substantive differences between them (subjects, settings, etc.). An alternative explanation is that there are different biases. And that is almost certainly the case. The original experiments are generally noisier because of modest power, and that noise is then passed through a biased filter (publication bias for sure — these studies were all published at selective journals — and perhaps selective reporting in some cases too). By contrast, the replications are mostly higher powered, the analysis plans were pre-registered, and the replicators committed from the outset to publish their findings no matter what the results.

That means that a comparison of published original studies and replication studies in the RPP is a poor way to detect moderators, because you are comparing a noisy and biased observation to one that is much less so.**** And such a comparison would be a poor way to detect moderators even if you were quite confident that moderators were out there somewhere waiting to be found.
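
A toy simulation can make the point. In the sketch below (written in R, with made-up sample sizes and a made-up true effect), the originals and the replications have exactly the same underlying effect; the originals just have less power and only get "published" when they come out significant. The published originals then look systematically larger than the replications even though no moderator exists anywhere in the simulation.

```r
# Toy simulation (assumed numbers): same true effect for originals and replications;
# only power and a publication filter differ.
set.seed(1)
true_d <- 0.25
n_orig <- 20    # per-group n for a low-powered original
n_rep  <- 100   # per-group n for a high-powered replication

one_study <- function(n) {
  x <- rnorm(n, 0, 1)
  y <- rnorm(n, true_d, 1)
  c(est = mean(y) - mean(x),        # difference in means (~ d, since SD = 1)
    p   = t.test(y, x)$p.value)
}

orig <- replicate(10000, one_study(n_orig))
repl <- replicate(10000, one_study(n_rep))

published <- orig["est", orig["p", ] < .05]   # only significant originals get "published"
mean(published)                       # inflated average "original" effect
mean(repl["est", ])                   # replications recover roughly the true 0.25
mean(published) - mean(repl["est", ]) # an apparent "difference" with no moderator at all
```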

3. Moderator explanations of the Reproducibility Project are (now) post hoc

The Reproducibility Project has been conducted with an unprecedented degree of openness. It was started 4 years ago. Both the coordinating plan and the protocols of individual studies were pre-registered. The list of selected studies was open. Original authors were contacted and invited to consult.

What that means is that anyone could have looked at an original study and a replication protocol, applied their expert judgment, and made a genuinely a priori prediction of how the replication results would differ from the original. Such a prediction could have been put out in the open at any time, or it could have been pre-registered and embargoed so as not to influence the replication researchers.

Until last Friday, that is.

Now the results of the RPP are widely known. And although it is tempting to now look back selectively at “failed” replications and generate substantively interesting reasons, such explanations have to be understood for what they are: untested post hoc speculation. (And if someone now says they expected a failure all along, they’re possibly HARKing too.)

Now, don’t get me wrong — untested post hoc speculation is often what inspires new experiments. So if someone thinks they see an important difference between an original result and a replication and gets an idea for a new study to test it out, more power to them. Get thee to the lab.

But as an interpretation of the data we have in front of us now, we should be clear-eyed in appraising such explanations, especially as an across-the-board factor for the RPP. From a bargain-basement Bayesian perspective, context moderators in well-controlled replications have a low prior probability (#1 above), and comparisons of original and replication studies have limited evidential value because of unequal noise and bias (#2). Put those things together and the clear message is that we should be cautious about concluding that there are hidden moderators lurking everywhere in the RPP. Here and there, there might be compelling, idiosyncratic reasons to think there could be substantive differences to motivate future research. But on the whole, as an explanation for the overall pattern of findings, hidden moderators are not a strong contender.
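
For readers who want the bargain-basement Bayes spelled out, here is a back-of-the-envelope version in R. Every number in it is a made-up illustration of the argument (a low prior on context moderators surviving careful standardization, and weakly diagnostic evidence from an original-vs-replication gap); nothing here is an estimate of anything.

```r
# Back-of-the-envelope Bayes with invented numbers, just to show how a low prior
# and weak evidence combine.
prior_moderator <- 0.10  # low prior: moderators rarely survive standardized protocols (#1)
bayes_factor    <- 2     # weak evidence: original-vs-replication gaps are noisy and biased (#2)

prior_odds     <- prior_moderator / (1 - prior_moderator)
posterior_odds <- prior_odds * bayes_factor
posterior      <- posterior_odds / (1 + posterior_odds)
round(posterior, 2)  # about 0.18 -- still well short of "probably a hidden moderator"
```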

Instead, we need to face up to the very well-understood and very real differences that we know about between published original studies and replications. The noxious synergy between low power and selective publication is certainly a big part of the story. Fortunately, psychology has already started to make changes since 2008 when the RPP original studies were published. And positive changes keep happening.

Would it be nice to think that everything was fine all along? Of course. And moderator explanations are appealing because they suggest that everything is fine, we’re just discovering limits and boundary conditions like we’ve always been doing.***** But it would be counterproductive if that undermined our will to continue to make needed improvements to our methods and practices. Personally, I don’t think everything in our past is junk, even post-RPP – just that we can do better. Constant self-critique and improvement are an inherent part of science. We have diagnosed the problem and we have a good handle on the solution. All of that makes me feel pretty good.


* Seriously though?

** A thought meta-experiment? A gedankengedankenexperiment?

*** I think if you asked most social psychologists, divorced from the present conversation about replicability and hidden moderators, they would already have endorsed this view. But it is nice to have empirical meta-scientific evidence to support it. And to show the “psychology isn’t a science” ninnies.

**** This would be true even if you believed that the replicators were negatively biased and consciously or unconsciously sandbagged their efforts. You’d think the bias was in the other direction, but it would still be unequal and therefore make comparisons of original vs. replication a poor empirical test of moderation. (You’d also be a pretty cynical person, but anyway.)

***** For what it’s worth, I’m not so sure that the hidden moderator interpretation would actually be all that reassuring under the cold light of a rational analysis. This is not the usual assumption that moderators are ubiquitous out in the world. We are talking about moderators that pop up despite concerted efforts to prevent them. Suppose that ubiquitous occult moderators were the sole or primary explanation for the RPP results — so many effects changing when we change so little, going from one WEIRD sample to another, with maybe 5-ish years of potential for secular drift, and using a standardized protocol. That would suggest that we have a poor understanding of what is going on in our labs. It would also suggest that it is extraordinarily hard to study main effects or try to draw even tentative generalizable conclusions about them. And given how hard it is to detect interactions, that would mean that our power problem would be even worse than people think it is now.

What should SPSP do about APA and the Hoffman report?

I am a member-at-large in the Society for Personality and Social Psychology. We will be having our semiannual board meeting in a couple of weeks. On the agenda is a discussion of the Hoffman Report, which details collusion between American Psychological Association officials and the U.S. military to enable and support abusive interrogations.

I have had several discussions with people about what, if anything, SPSP should be doing about its relationship with APA. But I’d really like to open up the discussion and get feedback from more people, especially SPSP members. In this post I’d like to lay out some background about SPSP’s relationship with APA, and bring up some possibilities about what to do next.

What is SPSP’s current legal and financial relationship with APA?

It’s easy to get confused about this. Heck, I’m on the SPSP board and I still find it a bit confusing. (If I get any of this wrong I hope somebody corrects me.) Here goes.

SPSP is legally its own organization. We have our own articles of incorporation and bylaws. Membership in SPSP and APA are separate processes and you can be a member of (and pay dues to) either one without joining the other. We operate our own annual conference and our own non-APA journals.

So they’re totally independent, right? Well, not totally.

SPSP also operates APA Division 8. That means that if you are an APA Division 8 member, the SPSP Board is in charge of running your division. And conversely, 1 of the 11 voting members of the SPSP Board of Directors is the APA Division 8 council representative, who is elected by APA Division 8 members under APA’s bylaws. So our governance is at least somewhat intertwined.

And the APA website certainly blurs the distinction – if you navigate to their Division 8 page, it says that Division 8 is SPSP, and it links to SPSP’s website. (Contrast that to the SPSP website’s page about Division 8, which goes to pains to emphasize the difference.)

Financially, some money changes hands. On the income side, we get a bit of a cut from [corrected] APA Division 8 membership dues and from the fact that some APA Division 8 conference programming provides continuing education credits. On the expenses side, SPSP spends money on the APA conference. We pay to send the SPSP president, SPSP-selected APA program chair, and the Division 8 council representative to the APA conference, and we pay for some APA conference programming (a social hour and travel for an invited speaker). In our 2014 budget under “Division 8,” the numbers were roughly $3k in income from APA, and $13k in expenses, for a net of $10k being spent. On top of that we reimburse some travel expenses for the Division 8 representative and program chair to attend SPSP board meetings (these travel expenses are in a different part of the budget and not broken out, but it’s probably a few thousand bucks a year). In the context of SPSP’s overall budget those are modest numbers — SPSP’s total revenues for 2014 were over $2 million. If you are an SPSP member, it would technically not be correct to say that exactly $0 of your membership dues goes toward APA-related stuff. But the percentage is small.

So like I said, it’s easy to get confused. Legally and financially, SPSP is largely separate but not 100% independent. And by operating Division 8 for APA, we certainly have a real, ongoing relationship.

What are the reasons for maintaining our relationship or for ending it?

That’s really what I’m hoping to hear feedback from people about. I’ll summarize some of the arguments I’ve heard so far, but I hope people will elaborate on them.

On the “keep it” side, people point out that APA is the only organization that represents all of psychology in the United States. APA thus connects us to other psychological disciplines, including the practitioner disciplines. And they are going to continue to have influence on policy and lobbying (including things SPSP members are likely to care about like social science funding). Another thing people bring up is that many of the top-tier journals that SPSP members publish in (JPSP etc.) are APA journals, and we should keep our involvement and influence in that. And lastly, as a result of the Hoffman Report APA may make some fundamental reforms – so this may be an important time to stay involved to shape those reforms and make sure they are implemented.

On the “end it” side, people argue that APA is drifting further and further away from relevance to psychological scientists (aside from its journal publishing business), and that we have relatively little influence with them anyway (a single Division 8 member on the APA council). APA gets its credibility and influence in part from being associated with scientific organizations like us, and they have leveraged that credibility to do great damage. And this wasn’t just a few individuals abusing their offices – it was organized, institutional corruption. So maybe SPSP (as an organization and members as individuals) is getting too little and giving up too much in our relationship – materially, reputationally, or morally.

These are just brief statements of the arguments, in part because they are not all my own. And like I said, this is what I want to hear people discuss.

Realistically what might SPSP do in the immediate future?

As I see it there are 3 avenues:

1. Decide we should stay involved and keep our relationship.

2. Decide we’re past the point of no return with APA and end our relationship. As I understand it, the SPSP Board is not empowered to end our relationship at our meeting. Doing so would require amending our bylaws and articles of incorporation, and that requires a referendum of the SPSP membership. But such a referendum could be initiated by either the Board or by a petition of 100 SPSP members. [UPDATE AND CLARIFICATION: The SPSP Board has budgetary authority to stop spending money on Division 8-related expenses, without a full referendum. Removing the Division 8 representative position from the SPSP board would require amending the bylaws via referendum, but not the articles of incorporation. Thanks to Chad Rummel for the corrections.]

3. Wait and see. Many of the individuals implicated in the Hoffman Report have resigned, but the fallout continues to fall out. Many questions remain to be addressed: How (if at all) will APA change its governance system (which, according to the Hoffman report, was able to be manipulated and circumvented pretty easily toward horrific ends)? Change its ethics policy? Change its stance toward military and operational psychology? Will it reopen and investigate old ethics charges? Will it make restitution for the damage it has done? These questions are likely to be taken up at the APA conference and especially the council meeting in Toronto next week. And the SPSP board meeting will be immediately afterward. So if you think our continued relationship with APA should be contingent on any of those things (and if you view APA’s rehabilitation as a still-open question), it may be too soon to make a decision just yet.

In addition to the 3 items above, there may be other things I am not considering. If so, I hope people will bring those up.

Like I said, I would like to hear from SPSP members – and any others who have a stake in SPSP’s relationship with APA. You can comment below. I’ll post a link to this blog post on social media (Facebook, Twitter) in places where SPSP members are likely to see it and have a chance to comment. If you would like to reach me privately, my email is my first name AT uoregon DOT edu.

Replicability in personality psychology, and the symbiosis between cumulative science and reproducible science

There is apparently an idea going around that personality psychologists are sitting on the sidelines having a moment of schadenfreude during the whole social psychology Replicability Crisis thing.

Not true.

The Association for Research in Personality conference just wrapped up in St. Louis. It was a great conference, with lots of terrific research. (Highlight: watching three of my students give kickass presentations.) And the ongoing scientific discussion about openness and reproducibility had a definite, noticeable effect on the program.

The most obvious influence was the (packed) opening session on reproducibility. First, Rich Lucas talked about the effects of JRP’s recent policy of requiring authors to explicitly talk about power and sample size decisions. The policy has had a noticeable impact on sample sizes of published papers, without major side effects like tilting toward college samples or cheap self-report measures.

Second, Simine Vazire talked about the particular challenges of addressing openness and replicability in personality psychology. A lot of the discussion in psychology has been driven by experimental psychologists, and Simine talked about how the general issues that cut across all of science play out when applied specifically to personality psychology. One cool recommendation she had (not just for personality psychologists) was to imagine that you had to include a “Most Damning Result” section in your paper, where you had to report the one result that looked worst for your hypothesis. How would that change your thinking?*

Third, David Condon talked about particular issues for early-career researchers, though really it was for anyone who wants to keep learning – he had a charming story of how he was inspired by seeing one of his big-name intellectual heroes give a major award address at a conference, then show up the next morning for an “Introduction to R” workshop. He talked a lot about tools and technology that we can use to help us do more open, reproducible science.

And finally, Dan Mroczek talked about research he has been doing with a large consortium to try to do reproducible research with existing longitudinal datasets. They have been using an integrated data analysis framework as a way of combining longitudinal datasets to test novel questions, and to look at issues like generalizability and reproducibility across existing data. Dan’s talk was a particularly good example of why we need broad participation in the replicability conversation. We all care about the same broad issues, but the particular solutions that experimental social psychologists identify aren’t going to work for everybody.

In addition to its obvious presence in the plenary session, reproducibility and openness seemed to suffuse the conference. As Rick Robins pointed out to me, there seemed to be a lot more people presenting null findings in a more open, frank way. And talk of which findings were replicated and which weren’t, people tempering conclusions from initial data, etc. was common and totally well received like it was a normal part of science. Imagine that.

One thing that stuck out to me in particular was the relationship between reproducible science and cumulative science. Usually I think of the first helping the second; you need robust, reproducible findings as a foundation before you can either dig deeper into process or expand out in various ways. But in many ways, the conference reminded me that the reverse is true as well: cumulative science helps reproducibility.

When people are working on the same or related problems, using the same or related constructs and measures, etc. then it becomes much easier to do robust, reproducible science. In many ways structural models like the Big Five have helped personality psychology with that. For example, the integrated data analysis that Dan talked about requires you to have measures of the same constructs in every dataset. The Big Five provide a common coordinate system to map different trait measures onto, even if they weren’t originally conceptualized that way. Psychology needs more models like that in other domains – common coordinate systems of constructs and measures that help make sense of how different research programs fit together.
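
To illustrate the general idea (a hypothetical sketch, not the consortium's actual pipeline), here is roughly what it looks like in R to map two differently measured conscientiousness scores onto a common scale and pool them. The datasets, variable names, and numbers are all invented.

```r
# Hypothetical sketch: two datasets measure conscientiousness with different
# instruments, get put on a common (z-score) scale, and are pooled with a
# dataset indicator so effects can be compared across datasets.
set.seed(2)
study1 <- data.frame(dataset = "study1",
                     consc   = rnorm(200, mean = 170, sd = 20),  # e.g., a long-form domain score
                     outcome = rnorm(200))
study2 <- data.frame(dataset = "study2",
                     consc   = rnorm(150, mean = 5.1, sd = 1.1), # e.g., a brief Big Five scale
                     outcome = rnorm(150))

harmonize <- function(d) { d$consc_z <- scale(d$consc)[, 1]; d }
pooled <- rbind(harmonize(study1), harmonize(study2))

# One pooled model; the interaction term lets the effect differ by dataset if it needs to
summary(lm(outcome ~ consc_z * dataset, data = pooled))
```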

And Simine talked about (and has blogged about) the idea that we should collect fewer but better datasets, with more power and better but more labor-intensive methods. If we are open with our data, we can do something really well, and then combine or look across datasets better to take advantage of what other people do really well – but only if we are all working on the same things so that there is enough useful commonality across all those open datasets.

That means we need to move away from a career model of science where every researcher is supposed to have an effect, construct, or theory that is their own little domain that they’re king or queen of. Personality psychology used to be that way, but the Big Five has been a major counter to that, at least in the domain of traits. That kind of convergence isn’t problem-free — the model needs to evolve (Big Six, anyone?), which means that people need the freedom to work outside of it; and it can’t try to subsume things that are outside of its zone of relevance. Some people certainly won’t love it – there’s a certain satisfaction to being the World’s Leading Expert on X, even if X is some construct or process that only you and maybe your former students are studying. But that’s where other fields have gone, even going as far as expanding beyond the single-investigator lab model: Big Science is the norm in many parts of physics, genomics, and other fields. With the kinds of problems we are trying to solve in psychology – not just our reproducibility problems, but our substantive scientific ones — that may increasingly be a model for us as well.



* Actually, I don’t think she was only imagining. Simine is the incoming editor at SPPS.** Give it a try, I bet she’ll desk-accept the first paper that does it, just on principle.

** And the main reason I now have footnotes in most of my blog posts.

ASA releases consensus statement

Several months ago, the journal Basic and Applied Social Psychology published an editorial announcing a “ban” on p-values and confidence intervals, and treating Bayesian inferential methods with suspicion as well. The editorial generated quite a bit of buzz among scientists and statisticians alike.

In response the American Statistical Association released a letter expressing concern about the prospect of doing science without any inferential statistics at all. It announced that it would assemble a blue-ribbon panel of statisticians to issue recommendations.

That statement has now been completed, and I got my hands on an advance copy. Here it is:

We, the undersigned statisticians, represent the full range of statistical perspectives, Bayesian and frequentist alike. We have come to full agreement on the following points:

1. Regarding guiding principles, we all agree that statistical inference is an essential part of science and should not be dispensed with under any circumstances. Whenever possible you should put one of us on your grant to do it for you.

2. As to specific recommendations on how to do statistical inference, we are in full agreement that you should do it our way. “Our” in this context is understood to represent a subset of us to be worked out at a later date.

Thank you for your consideration.

Is there p-hacking in a new breastfeeding study? And is disclosure enough?

There is a new study out about the benefits of breastfeeding for eventual adult IQ, published in The Lancet Global Health. It’s getting lots of news coverage, for example in NPR, BBC, New York Times, and more.

A friend shared a link and asked what I thought of it. So I took a look at the article and came across this (emphasis added):

We based statistical comparisons between categories on tests of heterogeneity and linear trend, and we present the one with the lower p value. We used Stata 13·0 for the analyses. We did four sets of analyses to compare breastfeeding categories in terms of arithmetic means, geometric means, median income, and to exclude participants who were unemployed and therefore had no income.

Yikes. The description of the analyses is frankly a little telegraphic. But unless I’m misreading it, or they did some kind of statistical correction that they forgot to mention, it sounds like they had flexibility in the data analyses (I saw no mention of pre-registration in the analysis plan), they used that flexibility to test multiple comparisons, and they’re openly disclosing that they used p-values for model selection – which is a more technical way of saying they engaged in p-hacking. (They don’t say how they selected among the 4 sets of analyses with different kinds of means etc.; was that based on p-values too?)*
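
To see why that reporting rule is a problem, here is a quick simulation in R of a simplified version of it under a true null: run both a between-category (heterogeneity) test and a linear-trend test, keep whichever p-value is lower, and count how often “significance” turns up. The setup (four exposure categories, 100 people each, no true effect) is hypothetical, not the study’s actual design.

```r
# Simulate the "report the lower p-value" rule under the null.
set.seed(3)
min_p <- replicate(5000, {
  category <- factor(rep(1:4, each = 100))
  y <- rnorm(400)                                              # no true effect of category
  p_het   <- anova(lm(y ~ category))$"Pr(>F)"[1]               # heterogeneity across categories
  p_trend <- summary(lm(y ~ as.numeric(category)))$coef[2, 4]  # linear trend across categories
  min(p_het, p_trend)
})
mean(min_p < .05)  # noticeably above the nominal .05, with no correction applied
```

Picking the better-looking of two tests is the mild version of the problem; if the choice among the four sets of analyses was also guided by p-values, the inflation only grows.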

From time to time students ask, Am I allowed to do x statistical thing? And my standard answer is, in the privacy of your office/lab/coffeeshop/etc. you are allowed to do whatever you want! Exploratory data analysis is a good thing. Play with your data and learn from it.** But if you are going to publish the results of your exploration, then disclose. If you did something that could bias your p-values, let readers know and they can make an informed evaluation.***

But that advice assumes that you are talking to a sophisticated reader. When it comes time to talk to the public, via the press, you have a responsibility to explain yourself. “We used a statistical approach that has an increased risk of producing false positives when there is no effect, or overestimating the size of effects when they are real.”

And if that weakens your story too much, well, that’s valid. Your story is weaker. Scientific journals are where experts communicate with other experts, and it could still be interesting enough to publish for that audience, perhaps to motivate a more definitive followup study. But if it’s too weak to go to the public and tell mothers what to do with their bodies… Maybe save the press release for the pre-registered Study 2.


* The study has other potential problems which are pretty much par for the course in these kinds of observational studies. They try to statistically adjust for differences between kids who were breastfed and those who weren’t, but that assumes that you have a complete and precisely measured set of all relevant covariates. Did they? It’s not a testable assumption, though it’s one that experts can make educated guesses at. On the plus side, when they added potentially confounding variables to the models the effects got stronger, not weaker. On the minus side, as Michelle Meyer pointed out on Twitter, they did not measure or adjust for parental IQ, which will definitely be associated with child IQ and for which the covariates they did use (like parental education and income) are only rough proxies.

** Though using p-values to guide your exploratory data analysis isn’t the greatest idea.

*** Some statisticians will no doubt disagree and say you shouldn’t be reporting p-values with known bias. My response is (a) if you want unbiased statistics then you shouldn’t be reading anything that’s gone through pre-publication review, and (b) that’s what got us into this mess in the first place. I’d rather make it acceptable for people to disclose everything, as opposed to creating an expectation and incentive for people to report impossibly clean results.

An open review of Many Labs 3: Much to learn

A pre-publication manuscript for the Many Labs 3 project has been released. The project, with 64 authors and supported by the Center for Open Science, ran replications of 10 previously-published effects on diverse topics. The research was conducted in 20 different university subject pools plus an Mturk comparison sample, with very high statistical power (over 3,000 subjects total). The project was pre-registered, and wherever possible the replicators worked with original authors or other experts to make the replications faithful to the originals. Big kudos to the project coordinating team and all the researchers who participated in this important work, as well as all the original authors who worked with the replicators.

A major goal was to examine whether time of semester moderates effect sizes, testing the common intuition among researchers that subjects are “worse” (less attentive) at the end of the term. But really, there is much more to it than that:

Not much replicated. The abstract says that 7 of 10 effects did not replicate. But dig a little deeper and the picture is more complicated. For starters, only 9 of those 10 effects were direct (or if you prefer, “close”) replications. The other was labeled a conceptual replication and deserves separate handling. More on that below; for now, let’s focus on the 9 direct replications.

Of the 3 effects that were tagged as “successful” replications, one — the Stroop effect — is so well-established that it is probably best thought of as a validity check on the project. If the researchers could not obtain a Stroop effect, I would have been more inclined to believe that they screwed something up than to believe that the Stroop isn’t real. But they got the Stroop effect, quite strongly. Test passed. (Some other side findings from manipulation checks or covariates could also be taken as validation tests. Strong arguments were judged as more persuasive; objectively warmer rooms were judged as warmer; etc.)

One major departure I have from the authors is how they defined a successful replication — as a significant effect in the same direction as the original significant effect. Tallying significance tests is not a good metric because, as the saying goes, the difference between significant and not significant is not necessarily significant. And perhaps less obviously, two significant effects — even in the same direction — can be very different.

I have argued before that the core question of replication is really this: If I do the same experiment, will I observe the same result? So the appropriate test is whether the replication effect equals the original effect, plus or minus error. It was not always possible to obtain an effect size from the original published report. But in many instances where it was, the replication effects were much smaller than the original ones. A particularly striking case of this was the replication of Tversky and Kahneman’s original, classic demonstration of the availability heuristic. The original effect was d = 0.82, 95% CI = [0.47, 1.17]. The replication was d = 0.09, 95% CI = [0.02, 0.16]. So the ML3 authors counted it as a “successful” replication, but the replication study obtained a substantially different effect than the original study.
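
To make that comparison concrete, here is a rough version of the check in R, using the intervals quoted above: back the standard errors out of the reported 95% confidence intervals and z-test the difference between the original and replication d’s. This is a sketch of the comparison described in this post, not the analysis reported in the ML3 paper.

```r
# Availability-heuristic effect: original vs. replication, using the CIs quoted above.
d_orig <- 0.82; se_orig <- (1.17 - 0.47) / (2 * 1.96)  # SE recovered from the 95% CI
d_rep  <- 0.09; se_rep  <- (0.16 - 0.02) / (2 * 1.96)

z <- (d_orig - d_rep) / sqrt(se_orig^2 + se_rep^2)
p <- 2 * pnorm(-abs(z))
round(c(z = z, p = p), 4)  # z is about 4, p < .001: the two results clearly differ
```

By this criterion the replication did not reproduce the original result, even though both effects are positive and significant.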

Does it matter if the results are different but in the same direction? I say it does. Effect sizes matter, for a variety of reasons. Lurking in the background is an important set of meta-science questions about the quality of evidence in published reports. Remember, out of a set of 9 direct replications, 6 could not produce the original effect at all. Against that backdrop, what does it mean when a 7th one’s conclusion was right but the data to support it were not reproducible? You start to wonder if it’s a win for Tversky and Kahneman’s intellects rather than for the scientific process. I don’t want to count on my colleagues’ good guesswork (not all of us are as insightful as Tversky and Kahneman). I want to be persuaded by good data.

For what it’s worth, the other “successful” replication (metaphoric structuring, from a 2000 study by Lera Boroditsky) had a smaller point estimate than the original study, but the original had low power and thus a huge confidence interval (d = 0.07 to 1.20!). Without running the numbers, my guess is that a formal test of the difference between the original and replication effects would not be significant for that reason.

So what’s the box score for the direct replications? On the plus side, 1 gimme and 1 newish effect. In the middle, 1 replication that agreed with the original result in direction but sharply differed in magnitude. And then 6 blanks.

None of the interactions replicated. Of 3 hypothesized interactions, none of them replicated. None, zip, nada. All three were interactions between a manipulated variable and an individual difference. Interactions like that require a lot of statistical power to detect, and the replication effort had much more power than the original. We now know that under conditions of publication bias, published underpowered studies are especially likely to be biased (i.e., false positives). So add this to the growing pile of evidence that researchers and journals need to pay more attention to power.
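
A small simulation shows the power gap. The effect sizes below are assumptions chosen purely for illustration (not estimates from the original studies or from ML3); the point is how much harder a manipulation-by-individual-difference interaction is to detect than a main effect in the very same design.

```r
# Power for a main effect vs. a manipulation x individual-difference interaction
# (all effect sizes are illustrative assumptions).
set.seed(4)
power_sim <- function(n, b_main = 0.6, b_int = 0.3, reps = 2000) {
  hits <- replicate(reps, {
    cond  <- rep(c(-0.5, 0.5), each = n / 2)  # manipulated variable, effect-coded
    trait <- rnorm(n)                         # continuous individual difference
    y     <- b_main * cond + b_int * cond * trait + rnorm(n)
    p     <- summary(lm(y ~ cond * trait))$coef
    c(main = p["cond", 4] < .05, int = p["cond:trait", 4] < .05)
  })
  rowMeans(hits)  # proportion of significant results = power
}
power_sim(100)  # the main effect is decently powered; the interaction is not
power_sim(400)  # it takes several times the sample before the interaction catches up
```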

A particularly interesting case is the replication of a moral credentialing study. In the original study, the authors had hypothesized a main effect of a credentialing manipulation. They didn’t find the main effect, but they did find (unexpectedly) an interaction with gender. By contrast, Many Labs 3 found the originally hypothesized main effect but not the gender interaction. So it is quite possible (if you think ML3 is the more reliable dataset) that the original study obtained both a false negative and a false positive — both potentially costly consequences of running with too little power.

Emblematic problems with a “conceptual replication.” An oddball in the mix was the one “conceptual replication.” The original study looked at associations between two self-report instruments, the NEO PI-R and Cloninger’s TCI, in a sample of psychiatric patients admitted to a hospital emergency unit. The replicators picked out one correlation to focus on — the one between NEO Conscientiousness and TCI Persistence. But they subbed in a 2-item scale from Gosling et al.’s TIPI for the 48-item NEO Conscientiousness scale, and instead of using the TCI they used time spent on an unsolvable anagram task as a measure of “persistence.” And they looked at all this in university subject pools and an Mturk sample.

The NEO-to-TIPI exchange is somewhat defensible, as the TIPI has been validated against the NEO PI-R.* But the shorter measure should be expected to underestimate any effects.

Swapping in the unsolvable anagrams for the TCI Persistence scale is a bit more mystifying. Sure, people sometimes call the anagram task a measure of “persistence” but (a) it’s a one-off behavioral measure not optimized to measuring stable individual differences, and (b) since there is no literature cited (or that I know of) connecting the anagram task to the TCI, there is a serious risk of falling prey to the jingle fallacy. In other words, just because the thing measured by the TCI and the thing measured by the anagrams are called “persistence” by different researchers, that does not mean they are the same construct. (For example, goals are hierarchical. Planful “persistence” at higher-order goals might be very different than wheel-spinning “persistence” on narrow tasks. Indeed, one of the 2 cited studies on the anagrams task used disengagement from the anagrams as an index of adaptive behavior.)

These problems, I think, are emblematic of problems with how “conceptual replication” is often treated. First, direct and conceptual replication are not the same thing and should not be treated as such. Direct replication is about the reproducibility of scientific observations; conceptual replication is about putting a theory to the test in more than one way. Both are a necessary part of science but they are not substitutable. Second, in this case the conceptual replication is problematic on its own terms. So many things changed from the original to the “replication” that it is difficult to interpret the difference in results. Is the issue that the TIPI was swapped in for the NEO? The anagrams for the TCI? Is it an issue of going from self-report to tested behavior? From an aggregate measure of personality to a one-shot measure? Psychiatric emergency patients to undergrads? Conceptual replication is much more useful when it is incremental and systematic.

The same interpretive difficulty would be present if the ML3 result had appeared to replicate the original, but I suspect we would be much less likely to notice it. We might just say hooray, it replicated — when it is entirely possible that we would actually be looking at 2 completely unrelated effects that just happened to have gotten the same labels.

Another blow against the “hidden moderator” defense. So far I haven’t touched on the stated purpose of this project, which was to look at time-of-semester as a moderator of effects. The findings are easy to summarize: it turns out there was no there there. The authors also tested for heterogeneity of effects due to order (since subjects got the tasks for the 10 replications in random order) or across samples and labs. There was no there there either. Not even in the 2 tasks that were administered in person rather than by computers and thus were harder to standardize.

The authors put the interpretation so well that I’ll quote them at length here [emphasis added]:

A common explanation for the challenges of replicating results across samples and settings is that there are many seen and unseen moderators that qualify the detectability of effects (Cesario, 2014). As such, when differences are observed across study administrations, it is easy to default to the assumption that it must be due to features differing between the samples and settings. Besides time of semester, we tested whether the site of data collection, and the order of administration during the study session moderated the effects. None of these had a substantial impact on any of the investigated effects. This observation is consistent with the first “Many Labs” study (Klein et al., 2014) and is the focus of the second (Klein et al., 2015). The present study provides further evidence against sample and setting differences being a default explanation for variation in replicability. That is not to deny that such variation occurs, just that direct evidence for a given effect is needed to demonstrate that it is a viable explanation.

When effects do not replicate, “maybe there were moderators” needs to be treated as what it is – a post hoc speculation for future testing. Moderator explanations are worth consideration on a case by case basis, and hunting for them could lead to genuine discoveries. But as evidence keeps accruing that sample and setting moderators are not commonplace, we should be more and more skeptical of those kinds of explanations offered post hoc. High-powered pre-registered replications are very credible. And when the replication results disagree with the original study, our interpretations should account for what we know about the statistical properties of the studies and the way we came to see the evidence, and give less weight to untested speculation.


* An earlier version incorrectly stated that the TIPI had only been validated indirectly against the NEO. Thanks to Sam Gosling for pointing out the error.

Top 10 signs you are a statistics maven

I previously referenced Donald Sharpe’s idea of a statistics maven: people with one foot in a science field and one foot in statistics, who frequently act as a conduit for new quantitative innovations. Afterward I had an email exchange with someone who wanted to know how to become a maven, and I had to pass along the news that he probably already was. As a public service to others with similar concerns, I thought I should gather together the most probable symptoms (pending a comprehensive program of construct validation research, of course). Here are the top ten signs that you are a statistics maven:

10. You have installed R packages just to see what they do.

9. Your biggest regret from undergrad is a tossup between that person you never asked out and not taking more math.

8. You call the statistics you learned in grad school “frequentist statistics” and not just “statistics.”

7. People who are not quantitative psychologists call you a quantitative psychologist.

6. But you would be embarrassed if an actual quantitative psychologist overheard them.

5. You have a dead-tree subscription to Psychological Methods delivered to home so you can read it in bed.

4. You are thanked in the acknowledgements sections of your entire cohort’s dissertations.

3. You have a Keep Calm and Read Meehl poster in your office.

2. You once ran an entire study just to have the right kind of data for an analysis you wanted to try.

1. You have strong opinions about bar graphs and you are not afraid to share them.

 (p.s. Shoutout to Aaron Weidman for #5.)