Bold changes at Psychological Science

Style manuals sound like they ought to be boring things, full of arcane points about commas and whatnot. But Wikipedia’s style manual has an interesting admonition: Be bold. The idea is that if you see something that could be improved, you should dive in and start making it better. Don’t wait until you are ready to be comprehensive, don’t fret about getting every detail perfect. That’s the path to paralysis. Wikipedia is an ongoing work in progress, your changes won’t be the last word but you can make things better.

In a new editorial at Psychological Science, interim editor Stephen Lindsay is clearly following the be bold philosophy. He lays out a clear and progressive set of principles for evaluating research. Beware the “troubling trio” of low power, surprising results, and just-barely-significant results. Look for signs of p-hacking. Care about power and precision. Don’t confuse nonsignificant for null.

To people who have been paying attention to the science reform discussion of the last few years (and its longstanding precursors), none of this is new. What is new is that an editor of a prominent journal has clearly been reading and absorbing the last few years’ wave of careful and thoughtful scholarship on research methods and meta-science. And he is boldly acting on it.

I mean, yes, there are some things I am not 100% in love with in that editorial. Personally, I’d like to see more value placed on good exploratory research.* I’d like to see him discuss whether Psychological Science will be less results-oriented, since that is a major contributor to publication bias.** And I’m sure other people have their objections too.***

But… Improving science will forever be a work in progress. Lindsay has laid out a set of principles. In the short term, they will be interpreted and implemented by humans with intelligence and judgment. In the longer term, someone will eventually look at what is and is not working and will make more changes.

Are Lindsay’s changes as good as they could possibly be? The answers are (1) “duh” because obviously no and (2) “duh” because it’s the wrong question. Instead let’s ask, are these changes better than things have been? I’m not going to give that one a “duh,” but I’ll stand behind a considered “yes.”


* Part of this is because in psychology we don’t have nearly as good a foundation of implicit knowledge and accumulated wisdom for differentiating good from bad exploratory research as we do for hypothesis-testing. So exploratory research gets a bad name because somebody hacks around in a tiny dataset and calls it “exploratory research,” and nobody has the language or concepts to say why they’re doing it wrong. I hope we can fix that. For starters, we could start stealing more ideas from the machine learning and genomics people, though we will need to adapt them for the particular features of our scientific problems. But that’s a blog post for another day.

** There are some nice comments about this already on the ISCON facebook page. Dan Simons brought up the exploratory issue; Victoria Savalei the issue about results-focus. My reactions on these issues are in part bouncing off of theirs.

*** When I got to the part about using confidence intervals to support the null, I immediately had a vision of steam coming out of some of the Twitter Bayesians’ ears.

Moderator interpretations of the Reproducibility Project

The Reproducibility Project: Psychology (RPP) was published in Science last week. There has been some excellent coverage and discussion since then. If you haven’t heard about it,* Ed Yong’s Atlantic coverage will catch you up. And one of my favorite commentaries so far is on Michael Frank’s blog, with several very smart and sensible ways the field can proceed next.

Rather than offering a broad commentary, in this post I’d like to discuss one possible interpretation of the results of the RPP, which is “hidden moderators.” Hidden moderators are unmeasured differences between original and replication experiments that would result in differences in the true, underlying effects and therefore in the observed results of replications. Things like differences in subject populations and experimental settings. Moderator interpretations were the subject of a lengthy discussion on the ISCON Facebook page recently, and are the focus of an op-ed by Lisa Feldman Barrett.

In the post below, I evaluate the hidden-moderator interpretation. The tl;dr version is this: Context moderators are probably common in the world at large and across independently-conceived experiments. But an explicit design goal of direct replication is to eliminate them, and there’s good reason to believe they are rare in replications.

1. Context moderators are probably not common in direct replications

Many social and personality psychologists believe that lots of important effects vary by context out in the world at large. I am one of those people — subject and setting moderators are an important part of what I study in my own work. William McGuire discussed the idea quite eloquently, and it can be captured in an almost koan-like quote from Niels Bohr: “The opposite of one profound truth may very well be another profound truth.” It is very often the case that support for a broad hypothesis will vary and even be reversed over traditional “moderators” like subjects and settings, as well as across different methods and even different theoretical interpretations.

Consider as a thought experiment** what would happen if you told 30 psychologists to go test the deceptively simple-seeming hypothesis “Happiness causes smiling” and turned them loose. You would end up with 30 different experiments that would differ in all kinds of ways that we can be sure would matter: subjects (with different cultural norms of expressiveness), settings (e.g., subjects run alone vs. in groups), manipulations and measures of the IV (film clips, IAPS pictures, frontal asymmetry, PANAS?) and DV (FACS, EMG, subjective ratings?), and even construct definitions (state or trait happiness? eudaimonic or hedonic? Duchenne or social smiles?). You could learn a lot by studying all the differences between the experiments and their results.

But that stands in stark contrast to how direct replications are carried out, including the RPP. Replicators aren’t just turned loose with a broad hypothesis. In direct replication, the goal is to test the hypothesis “If I do the same experiment, I will get the same result.” Sometimes a moderator-ish hypothesis is built in (“this study was originally done with college students, will I get the same effect on Mturk?”). But such differences from the original are planned in. The explicit goal of replication design is for any other differences to be controlled out. Well-designed replication research makes a concerted effort to faithfully repeat the original experiments in every way that documentation, expertise, and common sense say should matter (and often in consultation with original authors too). The point is to squeeze out any room for substantive differences.

Does it work? In a word, yes. We now have data telling us that the squeezing can be very effective. In Many Labs 1 and Many Labs 3 (which I reviewed here), different labs followed standardized replication protocols for a series of experiments. In principle, different experimenters, different lab settings, and different subject populations could have led to differences between lab sites. But in analyses of heterogeneity across sites, that was not the result. In ML1, some of the very large and obvious effects (like anchoring) varied a bit in just how large they were (from “kinda big” to “holy shit”). Across both projects, more modest effects were quite consistent. Nowhere was there evidence that interesting effects wink in and out of detectability for substantive reasons linked to sample or setting.

We will continue to learn more as our field gets more experience with direct replication. But right now, a reasonable conclusion from the good, systematic evidence we have available is this: When some researchers write down a good protocol and other researchers follow it, the results tend to be consistent. In the bigger picture this is a good result for social psychology: it is empirical evidence that good scientific control is within our reach, neither beyond our experimental skills nor intractable for the phenomena we study.***

But it also means that when replicators try to clamp down potential moderators, it is reasonable to think that they usually do a good job. Remember, the Many Labs labs weren’t just replicating the original experiments (from which their results sometimes differed – more on that in a moment). They were very successfully and consistently replicating each other. There could be individual exceptions here and there, but on the whole our field’s experience with direct replication so far tells us that it should be unusual for unanticipated moderators to escape replicators’ diligent efforts at standardization and control.

2. A comparison of a published original and a replication is not a good way to detect moderators

Moderation means there is a substantive difference between 2 or more (true, underlying) effects as a function of the moderator variable. When you design an experiment to test a moderation hypothesis, you have to set things up so you can make a valid comparison. Your observations should ideally be unbiased, or failing that, the biases should be the same at different levels of the moderator so that they cancel out in the comparison.

With the RPP (and most replication efforts), we are trying to interpret observed differences between published original results and replications. The moderator interpretation rests on the assumption that observed differences between experiments are caused by substantive differences between them (subjects, settings, etc.). An alternative explanation is that there are different biases. And that is almost certainly the case. The original experiments are generally noisier because of modest power, and that noise is then passed through a biased filter (publication bias for sure — these studies were all published at selective journals — and perhaps selective reporting in some cases too). By contrast, the replications are mostly higher powered, the analysis plans were pre-registered, and the replicators committed from the outset to publish their findings no matter what the results.

That means that a comparison of published original studies and replication studies in the RPP is a poor way to detect moderators, because you are comparing a noisy and biased observation to one that is much less so.**** And such a comparison would be a poor way to detect moderators even if you were quite confident that moderators were out there somewhere waiting to be found.

3. Moderator explanations of the Reproducibility Project are (now) post hoc

The Reproducibility Project has been conducted with an unprecedented degree of openness. It was started 4 years ago. Both the coordinating plan and the protocols of individual studies were pre-registered. The list of selected studies was open. Original authors were contacted and invited to consult.

What that means is that anyone could have looked at an original study and a replication protocol, applied their expert judgment, and made a genuinely a priori prediction of how the replication results would have differed from the original. Such a prediction could have been put out in the open at any time, or it could have been pre-registered and embargoed so as not to influence the replication researchers.

Until last Friday, that is.

Now the results of the RPP are widely known. And although it is tempting to now look back selectively at “failed” replications and generate substantively interesting reasons, such explanations have to be understood for what they are: untested post hoc speculation. (And if someone now says they expected a failure all along, they’re possibly HARKing too.)

Now, don’t get me wrong — untested post hoc speculation is often what inspires new experiments. So if someone thinks they see an important difference between an original result and a replication and gets an idea for a new study to test it out, more power to them. Get thee to the lab.

But as an interpretation of the data we have in front of us now, we should be clear-eyed in appraising such explanations, especially as an across-the-board factor for the RPP. From a bargain-basement Bayesian perspective, context moderators in well-controlled replications have a low prior probability (#1 above), and comparisons of original and replication studies have limited evidential value because of unequal noise and bias (#2). Put those things together and the clear message is that we should be cautious about concluding that there are hidden moderators lurking everywhere in the RPP. Here and there, there might be compelling, idiosyncratic reasons to think there could be substantive differences to motivate future research. But on the whole, as an explanation for the overall pattern of findings, hidden moderators are not a strong contender.

Instead, we need to face up to the very well-understood and very real differences that we know about between published original studies and replications. The noxious synergy between low power and selective publication is certainly a big part of the story. Fortunately, psychology has already started to make changes since 2008 when the RPP original studies were published. And positive changes keep happening.

Would it be nice to think that everything was fine all along? Of course. And moderator explanations are appealing because they suggest that everything is fine, we’re just discovering limits and boundary conditions like we’ve always been doing.***** But it would be counterproductive if that undermined our will to continue to make needed improvements to our methods and practices. Personally, I don’t think everything in our past is junk, even post-RPP – just that we can do better. Constant self-critique and improvement are an inherent part of science. We have diagnosed the problem and we have a good handle on the solution. All of that makes me feel pretty good.


* Seriously though?

** A thought meta-experiment? A gedankengedankenexperiment?

*** I think if you asked most social psychologists, divorced from the present conversation about replicability and hidden moderators, they would already have endorsed this view. But it is nice to have empirical meta-scientific evidence to support it. And to show the “psychology isn’t a science” ninnies.

**** This would be true even if you believed that the replicators were negatively biased and consciously or unconsciously sandbagged their efforts. You’d think the bias was in the other direction, but it would still be unequal and therefore make comparisons of original vs. replication a poor empirical test of moderation. (You’d also be a pretty cynical person, but anyway.)

***** For what it’s worth, I’m not so sure that the hidden moderator interpretation would actually be all that reassuring under the cold light of a rational analysis. This is not the usual assumption that moderators are ubiquitous out in the world. We are talking about moderators that pop up despite concerted efforts to prevent them. Suppose that ubiquitous occult moderators were the sole or primary explanation for the RPP results — so many effects changing when we change so little, going from one WEIRD sample to another, with maybe 5-ish years of potential for secular drift, and using a standardized protocol. That would suggest that we have a poor understanding of what is going on in our labs. It would also suggest that it is extraordinarily hard to study main effects or try to draw even tentative generalizable conclusions about them. And given how hard it is to detect interactions, that would mean that our power problem would be even worse than people think it is now.

Replicability in personality psychology, and the symbiosis between cumulative science and reproducible science

There is apparently an idea going around that personality psychologists are sitting on the sidelines having a moment of schadenfreude during the whole social psychology Replicability Crisis thing.

Not true.

The Association for Research in Personality conference just wrapped up in St. Louis. It was a great conference, with lots of terrific research. (Highlight: watching three of my students give kickass presentations.) And the ongoing scientific discussion about openness and reproducibility had a definite, noticeable effect on the program.

The most obvious influence was the (packed) opening session on reproducibility. First, Rich Lucas talked about the effects of JRP’s recent policy of requiring authors to explicitly talk about power and sample size decisions. The policy has had a noticeable impact on sample sizes of published papers, without major side effects like tilting toward college samples or cheap self-report measures.

Second, Simine Vazire talked about the particular challenges of addressing openness and replicability in personality psychology. A lot of the discussion in psychology has been driven by experimental psychologists, and Simine talked about what the general issues that cut across all of science look like when applied in particular to personality psychology. One cool recommendation she had (not just for personality psychologists) was to imagine that you had to include a “Most Damning Result” section in your paper, where you had to report the one result that looked worst for your hypothesis. How would that change your thinking?*

Third, David Condon talked about particular issues for early-career researchers, though really it was for anyone who wants to keep learning – he had a charming story of how he was inspired by seeing one of his big-name intellectual heroes give a major award address at a conference, then show up the next morning for an “Introduction to R” workshop. He talked a lot about tools and technology that we can use to help us do more open, reproducible science.

And finally, Dan Mroczek talked about research he has been doing with a large consortium to try to do reproducible research with existing longitudinal datasets. They have been using an integrated data analysis framework as a way of combining longitudinal datasets to test novel questions, and to look at issues like generalizability and reproducibility across existing data. Dan’s talk was a particularly good example of why we need broad participation in the replicability conversation. We all care about the same broad issues, but the particular solutions that experimental social psychologists identify aren’t going to work for everybody.

In addition to its obvious presence in the plenary session, reproducibility and openness seemed to suffuse the conference. As Rick Robins pointed out to me, there seemed to be a lot more people presenting null findings in a more open, frank way. And talk of which findings were replicated and which weren’t, people tempering conclusions from initial data, etc. was common and totally well received like it was a normal part of science. Imagine that.

One things that stuck out at me in particular was the relationship between reproducible science and cumulative science. Usually I think of the first helping the second; you need robust, reproducible findings as a foundation before you can either dig deeper into process or expand out in various ways. But in many ways, the conference reminded me that the reverse is true as well: cumulative science helps reproducibility.

When people are working on the same or related problems, using the same or related constructs and measures, etc. then it becomes much easier to do robust, reproducible science. In many ways structural models like the Big Five have helped personality psychology with that. For example, the integrated data analysis that Dan talked about requires you to have measures of the same constructs in every dataset. The Big Five provide a common coordinate system to map different trait measures onto, even if they weren’t originally conceptualized that way. Psychology needs more models like that in other domains – common coordinate systems of constructs and measures that help make sense of how different research programs fit together.

And Simine talked about (and has blogged about) the idea that we should collect fewer but better datasets, with more power and better but more labor-intensive methods. If we are open with our data, we can do something really well, and then combine or look across datasets better to take advantage of what other people do really well – but only if we are all working on the same things so that there is enough useful commonality across all those open datasets.

That means we need to move away from a career model of science where every researcher is supposed to have an effect, construct, or theory that is their own little domain that they’re king or queen of. Personality psychology used to be that way, but the Big Five has been a major counter to that, at least in the domain of traits. That kind of convergence isn’t problem-free — the model needs to evolve (Big Six, anyone?), which means that people need the freedom to work outside of it; and it can’t try to subsume things that are outside of its zone of relevance. Some people certainly won’t love it – there’s a certain satisfaction to being the World’s Leading Expert on X, even if X is some construct or process that only you and maybe your former students are studying. But that’s where other fields have gone, even going as far as expanding beyond the single-investigator lab model: Big Science is the norm in many parts of physics, genomics, and other fields. With the kinds of problems we are trying to solve in psychology – not just our reproducibility problems, but our substantive scientific ones — that may increasingly be a model for us as well.



* Actually, I don’t think she was only imagining. Simine is the incoming editor at SPPS.** Give it a try, I bet she’ll desk-accept the first paper that does it, just on principle.

** And the main reason I now have footnotes in most of my blog posts.

An open review of Many Labs 3: Much to learn

A pre-publication manuscript for the Many Labs 3 project has been released. The project, with 64 authors and supported by the Center for Open Science, ran replications of 10 previously-published effects on diverse topics. The research was conducted in 20 different university subject pools plus an Mturk comparison sample, with very high statistical power (over 3,000 subjects total). The project was pre-registered, and wherever possible the replicators worked with original authors or other experts to make the replications faithful to the originals. Big kudos to the project coordinating team and all the researchers who participated in this important work, as well as all the original authors who worked with the replicators.

A major goal was to examine whether time of semester moderates effect sizes, testing the common intuition among researchers that subjects are “worse” (less attentive) at the end of the term. But really, there is much more to it than that:

Not much replicated. The abstract says that 7 of 10 effects did not replicate. But dig a little deeper and the picture is more complicated. For starters, only 9 of those 10 effects were direct (or if you prefer, “close”) replications. The other was labeled a conceptual replication and deserves separate handling. More on that below; for now, let’s focus on the 9 direct replications.

Of the 3 effects that were tagged as “successful” replications, one — the Stroop effect — is so well-established that it is probably best thought of as a validity check on the project. If the researchers could not obtain a Stroop effect, I would have been more inclined to believe that they screwed something up than to believe that the Stroop isn’t real. But they got the Stroop effect, quite strongly. Test passed. (Some other side findings from manipulation checks or covariates could also be taken as validation tests. Strong arguments were judged as more persuasive; objectively warmer rooms were judged as warmer; etc.)

One major departure I have from the authors is how they defined a successful replication — as a significant effect in the same direction as the original significant effect. Tallying significance tests is not a good metric because, as the saying goes, the difference between significant and not significant is not necessarily significant. And perhaps less obviously, two significant effects — even in the same direction — can be very different.

I have argued before that the core question of replication is really this: If I do the same experiment, will I observe the same result? So the appropriate test is whether the replication effect equals the original effect, plus or minus error. It was not always possible to obtain an effect size from the original published report. But in many instances where it was, the replication effects were much smaller than the original ones. A particularly striking case of this was the replication of Tversky and Kahneman’s original, classic demonstration of the availability heuristic. The original effect was d = 0.82, 95% CI = [0.47, 1.17]. The replication was d = 0.09, 95% CI = [0.02, 0.16]. So the ML3 authors counted it as a “successful” replication, but the replication study obtained a substantially different effect than the original study.

Does it matter if the results are different but in the same direction? I say it does. Effect sizes matter, for a variety of reasons. Lurking in the background is an important set of meta-science questions about the quality of evidence in published reports. Remember, out of a set of 9 direct replications, 6 could not produce the original effect at all. Against that backdrop, what does it mean when a 7th one’s conclusion was right but the data to support it were not reproducible? You start to wonder if it’s a win for Tversky and Kahneman’s intellects rather than for the scientific process. I don’t want to count on my colleagues’ good guesswork (not all of us are as insightful as Tversky and Kahneman). I want to be persuaded by good data.

For what it’s worth, the other “successful” replication (metaphoric structuring, from a 2000 study by Lera Boroditsky) had a smaller point estimate than the original study, but the original had low power and thus a huge confidence interval (d = 0.07 to 1.20!). Without running the numbers, my guess is that a formal test of the difference between the original and replication effects would not be significant for that reason.

So what’s the box score for the direct replications? On the plus side, 1 gimme and 1 newish effect. In the middle, 1 replication that agreed with the original result in direction but sharply differed in magnitude. And then 6 blanks.

None of the interactions replicated. Of 3 hypothesized interactions, none of them replicated. None, zip, nada. All three were interactions between a manipulated variable and an individual difference. Interactions like that require a lot of statistical power to detect, and the replication effort had much more power than the original. We now know that under conditions of publication bias, published underpowered studies are especially likely to be biased (i.e., false positives). So add this to the growing pile of evidence that researchers and journals need to pay more attention to power.

A particularly interesting case is the replication of a moral credentialing study. In the original study, the authors had hypothesized a main effect of a credentialing manipulation. They didn’t find the main effect, but they did find (unexpectedly) an interaction with gender. By contrast, Many Labs 3  found the originally hypothesized main effect but not the gender interaction. So it is quite possible (if you think ML3 is the more reliable dataset) that the original study obtained both a false negative and a false positive — both potentially costly consequences of running with too little power.

Emblematic problems with a “conceptual replication.” An oddball in the mix was the one “conceptual replication.” The original study looked at associations between two self-report instruments, the NEO PI-R and Cloninger’s TCI, in a sample of psychiatric patients admitted to a hospital emergency unit. The replicators picked out one correlation to focus on — the one between NEO Conscientiousness and TCI Persistence. But they subbed in a 2-item scale from Gosling et al.’s TIPI for the 48-item NEO Conscientiousness scale, and instead of using the TCI they used time spent on an unsolvable anagram task as a measure of “persistence.” And they looked at all this in university subject pools and an Mturk sample.

The NEO-to-TIPI exchange is somewhat defensible, as the TIPI has been validated against the NEO PI-R.* But the shorter measure should be expected to underestimate any effects.

Swapping in the unsolvable anagrams for the TCI Persistence scale is a bit more mystifying. Sure, people sometimes call the anagram task a measure of “persistence” but (a) it’s a one-off behavioral measure not optimized to measuring stable individual differences, and (b) since there is no literature cited (or that I know of) connecting the anagram task to the TCI, there is a serious risk of falling prey to the jingle fallacy. In other words, just because the thing measured by the TCI and the thing measured by the anagrams are called “persistence” by different researchers, that does not mean they are the same construct. (For example, goals are hierarchical. Planful “persistence” at higher-order goals might be very different than wheel-spinning “persistence” on narrow tasks. Indeed, one of the 2 cited studies on the anagrams task used disengagement from the anagrams as an index of adaptive behavior.)

These problems, I think, are emblematic of problems with how “conceptual replication” is often treated. First, direct and conceptual replication are not the same thing and should not be treated as such. Direct replication is about the reproducibility of scientific observations; conceptual replication is about putting a theory to the test in more than one way. Both are a necessary part of science but they are not substitutable. Second, in this case the conceptual replication is problematic on its own terms. So many things changed from the original to the “replication” that it is difficult to interpret the difference in results. Is the issue that the TIPI was swapped in for the NEO? The anagrams for the TCI? Is it an issue of going from self-report to tested behavior? From an aggregate measure of personality to a one-shot measure? Psychiatric emergency patients to undergrads? Conceptual replication is much more useful when it is incremental and systematic.

The same interpretive difficulty would be present if the ML3 result had appeared to replicate the original, but I suspect we would be much less likely to notice it. We might just say hooray, it replicated — when it is entirely possible that we would actually be looking at 2 completely unrelated effects that just happened to have gotten the same labels.

Another blow against the “hidden moderator” defense. So far I haven’t touched on the stated purpose of this project, which was to look at time-of-semester as a moderator of effects. The findings are easy to summarize: it turns out it there was no there there. The authors also tested for heterogeneity of effects due to order (since subjects got the tasks for the 10 replications in random order) or across samples and labs. There was no there there either. Not even in the 2 tasks that were administered in person rather than by computers and thus were harder to standardize.

The authors put the interpretation so well that I’ll quote them at length here [emphasis added]:

A common explanation for the challenges of replicating results across samples and settings is that there are many seen and unseen moderators that qualify the detectability of effects (Cesario, 2014). As such, when differences are observed across study administrations, it is easy to default to the assumption that it must be due to features differing between the samples and settings. Besides time of semester, we tested whether the site of data collection, and the order of administration during the study session moderated the effects. None of these had a substantial impact on any of the investigated effects. This observation is consistent with the first “Many Labs” study (Klein et al., 2014) and is the focus of the second (Klein et al., 2015). The present study provides further evidence against sample and setting differences being a default explanation for variation in replicability. That is not to deny that such variation occurs, just that direct evidence for a given effect is needed to demonstrate that it is a viable explanation.

When effects do not replicate, “maybe there were moderators” needs to be treated as what it is – a post hoc speculation for future testing. Moderator explanations are worth consideration on a case by case basis, and hunting for them could lead to genuine discoveries. But as evidence keeps accruing that sample and setting moderators are not commonplace, we should be more and more skeptical of those kinds of explanations offered post hoc. High-powered pre-registered replications are very credible. And when the replication results disagree with the original study, our interpretations should account for what we know about the statistical properties of the studies and the way we came to see the evidence, and give less weight to untested speculation.


* An earlier version incorrectly stated that the TIPI had only been validated indirectly against the NEO. Thanks to Sam Gosling for pointing out the error.

Popper on direct replication, tacit knowledge, and theory construction

I’ve quoted some of this before, but it was buried in a long post and it’s worth quoting at greater length and on its own. It succinctly lays out his views on several issues relevant to present-day discussions of replication in science. Specifically, Popper makes clear that (1) scientists should replicate their own experiments; (2) scientists should be able to instruct other experts how to reproduce their experiments and get the same results; and (3) establishing the reproducibility of experiments (“direct replication” in the parlance of our times) is a necessary precursor for all the other things you do to construct and test theories.

Kant was perhaps the first to realize that the objectivity of scientific statements is closely connected with the construction of theories — with the use of hypotheses and universal statements. Only when certain events recur in accordance with rules or regularities, as is the case with repeatable experiments, can our observations be tested — in principle — by anyone. We do not take even our own observations quite seriously, or accept them as scientific observations, until we have repeated and tested them. Only by such repetitions can we convince ourselves that we are not dealing with a mere isolated ‘coincidence’, but with events which, on account of their regularity and reproducibility, are in principle inter-subjectively testable.

Every experimental physicist knows those surprising and inexplicable apparent ‘effects’ which in his laboratory can perhaps even be reproduced for some time, but which finally disappear without trace. Of course, no physicist would say in such a case that he had made a scientific discovery (though he might try to rearrange his experiments so as to make the effect reproducible). Indeed the scientifically significant physical effect may be defined as that which can be regularly reproduced by anyone who carries out the appropriate experiment in the way prescribed. No serious physicist would offer for publication, as a scientific discovery, any such ‘occult effect,’ as I propose to call it — one for whose reproduction he could give no instructions. The ‘discovery’ would be only too soon rejected as chimerical, simply because attempts to test it would lead to negative results. (It follows that any controversy over the question whether events which are in principle unrepeatable and unique ever do occur cannot be decided by science: it would be a metaphysical controversy.)

– Karl Popper (1959/2002), The Logic of Scientific Discovery, pp. 23-24.

Failed experiments do not always fail toward the null

There is a common argument among psychologists that null results are uninformative. Part of this is the logic of NHST – failure to reject the null is not the same as confirmation of the null. Which is an internally valid statement, but ignores the fact that studies with good power also have good precision to estimate effects.

However there is a second line of argument which is more procedural. The argument is that a null result can happen when an experimenter makes a mistake in either the design or execution of a study. I have heard this many times; this argument is central to an essay that Jason Mitchell recently posted arguing that null replications have no evidentiary value. (The essay said other things too, and has generated some discussion online; see e.g., Chris Said’s response.)

The problem with this argument is that experimental errors (in both design and execution) can produce all kinds of results, not just the null. Confounds, artifacts, failures of blinding procedures, demand characteristics, outliers and other violations of statistical assumptions, etc. can all produce non-null effects in data. When it comes to experimenter error, there is nothing special about the null.

Moreover, we commit a serious oversight when we use substantive results as the sole evidence of procedures. Say that the scientific hypothesis is that X causes Y. So we design an experiment with an operationalization of X, O_X, and an operationalization of Y, O_Y. A “positive” result tells us O_X -> O_Y. But unless we can say something about the relationships between O_X and X and between O_Y and Y, the result tells us nothing about X and Y.

We have a well established framework for doing that with measurements: construct validation. We expect that measures can and should be validated independent of results to document that Y -> O_Y (convergent validity) and P, Q, R, etc. !-> O_Y (discriminant validity). We have papers showing that measurement procedures are generally valid (in fact these are some of our most-cited papers!). And we typically expect papers that apply previously-established measurement procedures to show that the procedure worked in a particular sample, e.g. by reporting reliability, factor structure, correlations with other measures, etc.

Although we do not seem to publish as many validation papers on experimental manipulations as on measurements, the logic of validation applies just as well. We can obtain evidence that O_X -> X, for example by showing that experimental O_X affects already-established measurements O_X2, O_X3, etc. And in a sufficiently powered design we can show that O_X does not meaningfully influence other variables that are known to affect Y or O_Y. Just as with measurements, we can accumulate this evidence in systematic investigations to show that procedures are generally effective, and then when labs use the procedures to test substantive hypotheses they can run manipulation checks to show that they are executing a procedure correctly.

Programmatic validation is not always necessary — some experimental procedures are so face-valid that we are willing to accept that O_X -> X without a validation study. Likewise for some measurements. That is totally fine, as long as there is no double standard. But in situations where we would be willing to question whether a null result is informative, we should also be willing to question whether a non-null is. We need to evaluate methods in ways that do not depend on whether those methods give us results we like — for experimental manipulations and measurements alike.

Some thoughts on replication and falsifiability: Is this a chance to do better?

Most psychologists would probably endorse falsification as an important part of science. But in practice we rarely do it right. As others have observed before me, we do it backwards. Instead of designing experiments to falsify the hypothesis we are testing, we look for statistical evidence against a “nil null” — the point prediction that the true effect is zero. Sometimes the nil null is interesting, sometimes it isn’t, but it’s almost never a prediction from the theory that we are actually hoping to draw conclusions about.

The more rigorous approach is to derive a quantitative prediction from a theory. Then you design an experiment where the prediction could fail if the theory is wrong. Statistically speaking, the null hypothesis should be the prediction from your theory (“when dropped, this object will accelerate toward the earth at 9.8 m/s^2”). Then if a “significant” result tells you that the data are inconsistent with the theory (“average measured acceleration was 8.6 m/s^2, which differs from 9.8 at p < .05″), you have to either set aside the theory itself or one of the supporting assumptions you made when you designed the experiment. You get some leeway to look to the supporting assumptions (“oops, 9.8 assumes no wind resistance”), but not endless leeway — if the predictions keep failing, eventually you have to face facts and walk away from your theory. On the flip side, a theory is corroborated when it survives many risky opportunities to fail.

The problem in psychology — and many other sciences, including quite a bit of biology and medicine — is that our theories rarely make specific enough quantitative predictions to do hypothesis testing the “right” way. Few of our theories lead to anything remotely close to “g = 9.8 m/s^2” in specificity. People sometimes suggest this is a problem with psychologists’ acumen as theorists. I am more inclined to think it is a function of being a young science and having chosen very difficult problems to solve. So in the grand scheme, I don’t think we should self-flagellate too much about being poor theorists or succumb to physics envy. Most of the time I am inclined to agree with people Paul Rozin (who was agreeing with Solomon Asch) and William McGuire that instead we need to adapt our approach to our scientific problems and current state of knowledge, rather than trying to ape a caricature of “hard” science. That requires changes in how we do science: we need more exploration and discovery to accumulate interesting knowledge about our phenomena, and we need to be more modest and conditional in our theories. It would be a mistake to say we need to simply double down on the caricature.

So with all this being said, there is something really interesting and I think under-appreciated about the recent movement toward replication, and it is this: This may be a great opportunity to do falsification better.

The repeatability theory

Every results section says some version of, “We did this experiment and we observed these results.”[1] It is a specific statement about something that happened in the past. But hand-in-hand with that statement is, implicitly, another claim: “If someone does the same experiment again, they will get the same results.” The second claim is a mini-theory: it is a generalization of the first claim. Call it the repeatability theory. Every experimental report comes with its own repeatability theory. It is a necessary assumption of inferential statistics. And if we did not make it, we would be doing history rather than science.

And here’s the thing: the repeatability theory is very falsifiable. The rigorous, strong kind of falsifiable. We just need to clarify what it means to (A) do the same experiment again and (B) observe the same or different results.

Part B is a little easier. “The same results” does not mean exactly the same results to infinite precision. It means “the same results plus or minus error.” The hypothesis is that Experiment 1 (the original) and Experiment 2 (the replication) are observations with error of the same underlying effect, so any observed differences between experiments are just noise. If you are using NHST[2] that leads to a straightforward “strong” null hypothesis: effectsize_1 = effectsize_2. If you have access to all the raw data, you can combine both experiments into a single dataset, create an indicator variable for which study the effect came from, and test the interaction of that indicator with the effect. The null hypothesis is no interaction, which sounds like the old fashioned nil-null but in fact “interaction = 0” is the same as saying the effects are equal, which is the very specific quantitative hypothesis derived from the repeatability theory. If you don’t have the raw data, don’t despair. You can calculate an effect from each experiment and then compare them, like with a test of independent correlations. You can and should also estimate the difference between effects (effectsize_1 – effectsize_2) and an associated confidence interval. That difference is itself an effect size: it quantifies whatever difference there is between the studies, and can tell you if the difference is large or trivial.

Part A, “do the same experiment again,” is more complicated. Literalists like to point out that you will never be in the same room, with the same weather outside, with the same RA wearing the same shirt, etc. etc. They are technically right about all of that.[3]

But the realistic answer is that “the same experiment” just has to repeat the things that matter. “What matters” has been the subject of some discussion recently, for example in a published commentary by Danny Kahneman and a blog post by Andrew Wilson. In my thinking you can divide “what matters” into 3 categories: the original researchers’ specification of the experiment, technical skills in the methods used, and common sense. The onus is on the original experimenter to be able to tell a competent colleague what is necessary to repeat the experiment. In the old days of paper journals and page counts, it was impossible for most published papers to do this completely and you needed a lot of backchannel communication. With online supplements the gap is narrowing, but I still think it can’t hurt for a replicator to reach out to an original author. (Though in contrast to Kahneman, I would describe this as a methodological best practice, neither a matter of etiquette nor an absolute requirement.) If researchers say they do not know what conditions are necessary to produce an effect, that is no defense. It should undermine our faith in the original study. Don’t take my word for it, here’s Sir Karl (whose logic is better than his language – this is [hopefully obviously] limited neither to men nor physicists):

Every experimental physicist knows those surprising and inexplicable apparent ‘effects’ which in his laboratory can perhaps even be reproduced for some time, but which finally disappear without trace. Of course, no physicist would say in such a case that he had made a scientific discovery (though he might try to rearrange his experiments so as to make the effect reproducible). Indeed the scientifically significant physical effect may be defined as that which can be regularly reproduced by anyone who carries out the appropriate experiment in the way prescribed. No serious physicist would offer for publication, as a scientific discovery, any such ‘occult effect,’ as I propose to call it – one for whose reproduction he could give no instructions. (Karl Popper, The Logic of Scientific Discovery, pp. 23-24)

Interpreting results

What happens when the data are inconsistent with the repeatability theory – original != replication? As with all empirical results, we have to consider multiple interpretations. This is true in all of science and has been recognized for a long time; replications are not special in this regard. An observed discrepancy between the original result and a replication[4] is an empirical finding that needs to be interpreted like any other empirical finding. However, a few issues come up commonly in interpreting replications:

First vs. latest. There is nothing special about an experiment being either the first or the latest, ceteris paribus. However ceteris is rarely paribus. If the replication has more power or if the scientific community gets to see its results through a less biased process than the original (e.g., due to pre-registration or a results-independent publication process), those things should give it more weight.

Technical skills. A technical analysis of the methods used and labs’ track records with them is appropriate. I am not much swayed by broad appeals to experimental “artistry.” Instead, I find these interpretations more persuasive when someone can put forward a plausible candidate for something important in the original that is not easy to standardize or carry off without specific skills. For example, a computer-administered experiment is possible to standardize and audit (and in some cases the code and digital stimuli can be reproduced exactly). But an experiment that involves confederates or cover stories might be harder to pull off for a lab that does not do that routinely. When that is the case, manipulation checks, lab visits/exchanges (in person or through video), and other validation procedures become important.

Moderators. Replications can never reproduce every single aspect of the original study. They do their best to reproduce everything that the original specification, technical knowledge, and common sense say should matter. But they can and will still depart from original studies in any number of ways: the subject pool being drawn from, the local social and cultural context, procedural changes made for practical reasons, etc. When the replication departs substantially from the original, it is fair to consider possible moderators. But moderator interpretations are nearly always post hoc, and should be weighed accordingly until we have more data.

I think it’s also important to point out that the possibility of unanticipated moderators is not a problem with replications; rather, if you are interested in discovery it is a very good reason to run them. Consider a hypothetical example from a recent blog post by Tim Wilson: a study originally run in the laboratory that produces a smaller effect in an online replication. Wilson imagines this is an outcome that a replicator with improbable amounts of both malevolence and prescience might arrange on purpose. But a far more likely scenario is that if the original specification, technical knowledge, and common sense all say that offline-online shouldn’t matter but it turns out that it does, that could actually be a very interesting discovery! People are living more of their lives online, and it is important to know how social cognition and behavior work in virtual spaces. And a discovery like that might also save other scientists a lot of wasted effort and resources, if for example they thought the experiment would work online and planned to run replicate-and-extend studies or adapt parts of the original procedure for new studies. In the end, Wilson’s example of replication gone wrong looks more like a useful discovery.

Discovery and replication need each other

Discovery and replication are often contrasted with each other. Discovery is new and exciting; replication is dull “duplication.” But that is silly. Replication separates real discoveries from noise-surfing, and as just noted it can itself lead to discoveries. We can and should do both. And not just in some sort of division of labor arrangement, but in an integrated way as part of our science. Exciting new discoveries need to be replicated before we take them as definitive. Replication within and between labs should be routine and normal.

An integrated discovery-replication approach is also an excellent way to build theories. Both Rozin and McGuire criticize psychology’s tendency to equate “theory” with broad, decontextualized statements – pronouncements that almost invariably get chipped away in subsequent studies as we discover moderators and boundary conditions. This kind of “overclaim first, then back away slowly” approach supports the hype cycle and means that a tendency to make incorrect statements is baked in to our research process. Instead, Rozin wants us to accumulate interesting descriptive facts about the phenomena we are studying; McGuire wants us to study how effects vary over populations and contexts. A discovery-replication approach allows us to do this both of these things. We can use discovery-oriented exploratory research to derive truly falsifiable predictions to then be tested. That way we will amass a body of narrow but well-corroborated theoretical statements (the repeatability theories) to assemble into bigger theories from the foundations up, rather than starting with bold pronouncements. We will also build up knowledge about quantitative estimates of effects, which we can use to start to make interval and even point predictions. That kind of cumulative science is likely to generate fewer sexy headlines in the short run, but it will be a whole lot more durable.


1. I am using “experiment” in the very broad sense here of a structured scientific observation, not the more limited sense of a study that involves randomized manipulation by an experimenter.[5]

2. I’m sure the Bayesians have an answer for the statistical problem too. It is probably a good one. But c’mon, this is a chance to finally do NHST right!

3. Literalists also like to say it’s a problem that you will never have the exact same people as subjects again. They are technically wrong about that being a problem. “Drawing a sample” is part of what constitutes the experiment. But pressing this point will get you into an argument with a literalist over a technicality, which is never fun, so I suggest letting it drop.

4. “Discrepancy” = “failed replication” in the parlance of our time, but I don’t like that phrase. Who/what failed? Totally unclear, and the answer may be nobody/nothing.

5. I am totally ripping this footnote thing off of Simine Vazire but telling myself I’m ripping off David Foster Wallace.