Some thoughts on replication and falsifiability: Is this a chance to do better?

Most psychologists would probably endorse falsification as an important part of science. But in practice we rarely do it right. As others have observed before me, we do it backwards. Instead of designing experiments to falsify the hypothesis we are testing, we look for statistical evidence against a “nil null” — the point prediction that the true effect is zero. Sometimes the nil null is interesting, sometimes it isn’t, but it’s almost never a prediction from the theory that we are actually hoping to draw conclusions about.

The more rigorous approach is to derive a quantitative prediction from a theory. Then you design an experiment where the prediction could fail if the theory is wrong. Statistically speaking, the null hypothesis should be the prediction from your theory (“when dropped, this object will accelerate toward the earth at 9.8 m/s^2”). Then if a “significant” result tells you that the data are inconsistent with the theory (“average measured acceleration was 8.6 m/s^2, which differs from 9.8 at p < .05”), you have to set aside either the theory itself or one of the supporting assumptions you made when you designed the experiment. You get some leeway to look to the supporting assumptions (“oops, 9.8 assumes no wind resistance”), but not endless leeway — if the predictions keep failing, eventually you have to face facts and walk away from your theory. On the flip side, a theory is corroborated when it survives many risky opportunities to fail.
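To make this concrete, here is a minimal sketch in Python of what it looks like to make the theory’s prediction the null hypothesis. The numbers are invented for illustration; the point is only that a “significant” result counts against the theory (or one of its supporting assumptions), not for it.

```python
# A minimal sketch with invented numbers: the theory's prediction is the null.
import numpy as np
from scipy import stats

predicted_g = 9.8                                    # point prediction from the theory
measured = np.array([8.7, 8.5, 8.6, 8.8, 8.4, 8.6])  # hypothetical drop-test data (m/s^2)

t_stat, p_value = stats.ttest_1samp(measured, popmean=predicted_g)
print(f"mean = {measured.mean():.2f} m/s^2, t = {t_stat:.2f}, p = {p_value:.4f}")
# A significant result is evidence against the theory (or an assumption such
# as "no wind resistance"), not evidence for it.
```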

The problem in psychology — and many other sciences, including quite a bit of biology and medicine — is that our theories rarely make specific enough quantitative predictions to do hypothesis testing the “right” way. Few of our theories lead to anything remotely close to “g = 9.8 m/s^2” in specificity. People sometimes suggest this is a problem with psychologists’ acumen as theorists. I am more inclined to think it is a function of being a young science and having chosen very difficult problems to solve. So in the grand scheme, I don’t think we should self-flagellate too much about being poor theorists or succumb to physics envy. Most of the time I am inclined to agree with people like Paul Rozin (who was agreeing with Solomon Asch) and William McGuire that instead we need to adapt our approach to our scientific problems and current state of knowledge, rather than trying to ape a caricature of “hard” science. That requires changes in how we do science: we need more exploration and discovery to accumulate interesting knowledge about our phenomena, and we need to be more modest and conditional in our theories. It would be a mistake to say we need to simply double down on the caricature.

So with all this being said, there is something really interesting and I think under-appreciated about the recent movement toward replication, and it is this: This may be a great opportunity to do falsification better.

The repeatability theory

Every results section says some version of, “We did this experiment and we observed these results.”[1] It is a specific statement about something that happened in the past. But hand-in-hand with that statement is, implicitly, another claim: “If someone does the same experiment again, they will get the same results.” The second claim is a mini-theory: it is a generalization of the first claim. Call it the repeatability theory. Every experimental report comes with its own repeatability theory. It is a necessary assumption of inferential statistics. And if we did not make it, we would be doing history rather than science.

And here’s the thing: the repeatability theory is very falsifiable. The rigorous, strong kind of falsifiable. We just need to clarify what it means to (A) do the same experiment again and (B) observe the same or different results.

Part B is a little easier. “The same results” does not mean exactly the same results to infinite precision. It means “the same results plus or minus error.” The hypothesis is that Experiment 1 (the original) and Experiment 2 (the replication) are observations with error of the same underlying effect, so any observed differences between experiments are just noise. If you are using NHST[2] that leads to a straightforward “strong” null hypothesis: effectsize_1 = effectsize_2. If you have access to all the raw data, you can combine both experiments into a single dataset, create an indicator variable for which study the effect came from, and test the interaction of that indicator with the effect. The null hypothesis is no interaction, which sounds like the old-fashioned nil null, but in fact “interaction = 0” is the same as saying the effects are equal, which is the very specific quantitative hypothesis derived from the repeatability theory. If you don’t have the raw data, don’t despair. You can calculate an effect size from each experiment and then compare them, for example with a test of independent correlations. You can and should also estimate the difference between effects (effectsize_1 – effectsize_2) and an associated confidence interval. That difference is itself an effect size: it quantifies whatever difference there is between the studies, and can tell you whether the difference is large or trivial.
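If you only have summary statistics, here is a rough sketch (my own illustration, with hypothetical numbers) of that second route for two independent correlations: Fisher’s r-to-z transformation gives a test of the repeatability null effectsize_1 = effectsize_2 and a confidence interval for the difference.

```python
# A rough sketch: compare an original and a replication correlation from
# summary statistics alone, and put a confidence interval on the difference.
import numpy as np
from scipy import stats

def compare_independent_correlations(r1, n1, r2, n2, alpha=0.05):
    """Test H0: effectsize_1 = effectsize_2 and give a CI for the difference
    (on the Fisher-z scale)."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)        # Fisher r-to-z
    se = np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # SE of z1 - z2
    z_stat = (z1 - z2) / se
    p = 2 * stats.norm.sf(abs(z_stat))
    crit = stats.norm.ppf(1 - alpha / 2)
    ci = (z1 - z2 - crit * se, z1 - z2 + crit * se)
    return z_stat, p, ci

# Hypothetical example: original r = .45 (n = 60), replication r = .20 (n = 200)
z_stat, p, ci = compare_independent_correlations(0.45, 60, 0.20, 200)
print(f"z = {z_stat:.2f}, p = {p:.3f}, "
      f"95% CI for the difference (z scale) = ({ci[0]:.2f}, {ci[1]:.2f})")
```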

Part A, “do the same experiment again,” is more complicated. Literalists like to point out that you will never be in the same room, with the same weather outside, with the same RA wearing the same shirt, etc. etc. They are technically right about all of that.[3]

But the realistic answer is that “the same experiment” just has to repeat the things that matter. “What matters” has been the subject of some discussion recently, for example in a published commentary by Danny Kahneman and a blog post by Andrew Wilson. In my thinking you can divide “what matters” into 3 categories: the original researchers’ specification of the experiment, technical skills in the methods used, and common sense. The onus is on the original experimenter to be able to tell a competent colleague what is necessary to repeat the experiment. In the old days of paper journals and page counts, it was impossible for most published papers to do this completely and you needed a lot of backchannel communication. With online supplements the gap is narrowing, but I still think it can’t hurt for a replicator to reach out to an original author. (Though in contrast to Kahneman, I would describe this as a methodological best practice, neither a matter of etiquette nor an absolute requirement.) If researchers say they do not know what conditions are necessary to produce an effect, that is no defense. It should undermine our faith in the original study. Don’t take my word for it, here’s Sir Karl (whose logic is better than his language – this is [hopefully obviously] limited neither to men nor physicists):

Every experimental physicist knows those surprising and inexplicable apparent ‘effects’ which in his laboratory can perhaps even be reproduced for some time, but which finally disappear without trace. Of course, no physicist would say in such a case that he had made a scientific discovery (though he might try to rearrange his experiments so as to make the effect reproducible). Indeed the scientifically significant physical effect may be defined as that which can be regularly reproduced by anyone who carries out the appropriate experiment in the way prescribed. No serious physicist would offer for publication, as a scientific discovery, any such ‘occult effect,’ as I propose to call it – one for whose reproduction he could give no instructions. (Karl Popper, The Logic of Scientific Discovery, pp. 23-24)

Interpreting results

What happens when the data are inconsistent with the repeatability theory – original != replication? As with all empirical results, we have to consider multiple interpretations. This is true in all of science and has been recognized for a long time; replications are not special in this regard. An observed discrepancy between the original result and a replication[4] is an empirical finding that needs to be interpreted like any other empirical finding. However, a few issues come up commonly in interpreting replications:

First vs. latest. There is nothing special about an experiment being either the first or the latest, ceteris paribus. However ceteris is rarely paribus. If the replication has more power or if the scientific community gets to see its results through a less biased process than the original (e.g., due to pre-registration or a results-independent publication process), those things should give it more weight.

Technical skills. A technical analysis of the methods used and labs’ track records with them is appropriate. I am not much swayed by broad appeals to experimental “artistry.” Instead, I find these interpretations more persuasive when someone can put forward a plausible candidate for something important in the original that is not easy to standardize or carry off without specific skills. For example, a computer-administered experiment is possible to standardize and audit (and in some cases the code and digital stimuli can be reproduced exactly). But an experiment that involves confederates or cover stories might be harder to pull off for a lab that does not do that routinely. When that is the case, manipulation checks, lab visits/exchanges (in person or through video), and other validation procedures become important.

Moderators. Replications can never reproduce every single aspect of the original study. They do their best to reproduce everything that the original specification, technical knowledge, and common sense say should matter. But they can and will still depart from original studies in any number of ways: the subject pool being drawn from, the local social and cultural context, procedural changes made for practical reasons, etc. When the replication departs substantially from the original, it is fair to consider possible moderators. But moderator interpretations are nearly always post hoc, and should be weighed accordingly until we have more data.

I think it’s also important to point out that the possibility of unanticipated moderators is not a problem with replications; rather, if you are interested in discovery it is a very good reason to run them. Consider a hypothetical example from a recent blog post by Tim Wilson: a study originally run in the laboratory that produces a smaller effect in an online replication. Wilson imagines this is an outcome that a replicator with improbable amounts of both malevolence and prescience might arrange on purpose. But consider a far more likely scenario: if the original specification, technical knowledge, and common sense all say that the offline/online difference shouldn’t matter but it turns out that it does, that could actually be a very interesting discovery! People are living more of their lives online, and it is important to know how social cognition and behavior work in virtual spaces. And a discovery like that might also save other scientists a lot of wasted effort and resources, if for example they thought the experiment would work online and planned to run replicate-and-extend studies or adapt parts of the original procedure for new studies. In the end, Wilson’s example of replication gone wrong looks more like a useful discovery.

Discovery and replication need each other

Discovery and replication are often contrasted with each other. Discovery is new and exciting; replication is dull “duplication.” But that is silly. Replication separates real discoveries from noise-surfing, and as just noted it can itself lead to discoveries. We can and should do both. And not just in some sort of division of labor arrangement, but in an integrated way as part of our science. Exciting new discoveries need to be replicated before we take them as definitive. Replication within and between labs should be routine and normal.

An integrated discovery-replication approach is also an excellent way to build theories. Both Rozin and McGuire criticize psychology’s tendency to equate “theory” with broad, decontextualized statements – pronouncements that almost invariably get chipped away in subsequent studies as we discover moderators and boundary conditions. This kind of “overclaim first, then back away slowly” approach supports the hype cycle and means that a tendency to make incorrect statements is baked into our research process. Instead, Rozin wants us to accumulate interesting descriptive facts about the phenomena we are studying; McGuire wants us to study how effects vary over populations and contexts. A discovery-replication approach allows us to do both of these things. We can use discovery-oriented exploratory research to derive truly falsifiable predictions to then be tested. That way we will amass a body of narrow but well-corroborated theoretical statements (the repeatability theories) to assemble into bigger theories from the foundations up, rather than starting with bold pronouncements. We will also build up knowledge about quantitative estimates of effects, which we can use to start to make interval and even point predictions. That kind of cumulative science is likely to generate fewer sexy headlines in the short run, but it will be a whole lot more durable.

—–

1. I am using “experiment” in the very broad sense here of a structured scientific observation, not the more limited sense of a study that involves randomized manipulation by an experimenter.[5]

2. I’m sure the Bayesians have an answer for the statistical problem too. It is probably a good one. But c’mon, this is a chance to finally do NHST right!

3. Literalists also like to say it’s a problem that you will never have the exact same people as subjects again. They are technically wrong about that being a problem. “Drawing a sample” is part of what constitutes the experiment. But pressing this point will get you into an argument with a literalist over a technicality, which is never fun, so I suggest letting it drop.

4. “Discrepancy” = “failed replication” in the parlance of our time, but I don’t like that phrase. Who/what failed? Totally unclear, and the answer may be nobody/nothing.

5. I am totally ripping this footnote thing off of Simine Vazire but telling myself I’m ripping off David Foster Wallace.

10 thoughts on “Some thoughts on replication and falsifiability: Is this a chance to do better?”

  1. Thanks for this post, very interesting. But my understanding is that in frequentist statistics, as you describe, it is the alternative hypothesis that should be drawn from the theory you wish to prove, and not the null. The null hypothesis is never “accepted,” as the alternative can be; it is instead “not rejected.” Not rejecting the null hypothesis does not mean the data provide evidence for the null, but that they do not provide sufficient evidence for the alternative hypothesis!

    1. In NHST you can reject the null or fail to reject the null; the data do not confirm the null. That corresponds very nicely to falsificationism, which says that observations can falsify a theory or not falsify a theory, but observations do not confirm or “prove” a theory. That is why falsificationism lines up so much better with NHST when the null hypothesis is an interesting theoretical prediction, rather than the nil null.

      The problem with the framework of rejecting a null to support an alternative is that it is very weak evidence. Consider the toy example I used in the text: the theoretical statement that at the Earth’s surface, gravitational acceleration is 9.8 m/s^2. If we make that the alternative hypothesis, what is the null? The nil null would be “gravitational acceleration is 0.” In addition to the null being an absurd prediction, rejection of that null is consistent with an infinite number of theories. Why do we take it as evidence in favor of 9.8? And how does falsifying a theoretical statement that nobody would have believed (“objects do not fall”) help us?

      1. That is all very well, but mathematically speaking the null and alternative hypotheses are not interchangeable. In your set-up we would be reliant on the statistical power of the test, rather than the statistical significance. Our standard arsenal of tests are arranged to optimise the significance, not the power.

        You are right that it is difficult to determine the value of a physical constant using “point hypotheses” (e.g. null = 0, alt = 9.8, sometimes called “simple hypotheses”). However, in practice, physicists determine these values by using successively smaller ranges for the alternative hypothesis. In your example you might begin with the alternative hypothesis that 9 < g < 10, and the null that g lies outside that range. Once acceptable evidence has accumulated for this alternative, you test the alternative that 9.7 < g < 9.9, with the null that g lies outside that range. You continue to narrow down the ranges until you are satisfied with the precision you have achieved. These “range hypotheses” (more usually called “composite hypotheses”) are acceptable mathematically, see for example http://en.wikipedia.org/wiki/Likelihood-ratio_test#Definition_.28likelihood_ratio_test_for_composite_hypotheses.29
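        To see the narrowing-ranges idea in action, here is a small Python sketch with simulated (not real) measurements: estimate g from noisy drop tests and check whether the 95% confidence interval fits inside successively tighter candidate ranges.

```python
# Simulated illustration of successively narrower ranges for g.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
measurements = rng.normal(loc=9.81, scale=0.15, size=50)  # simulated drop tests

mean = measurements.mean()
sem = stats.sem(measurements)
lo, hi = stats.t.interval(0.95, df=len(measurements) - 1, loc=mean, scale=sem)

for low, high in [(9.0, 10.0), (9.7, 9.9), (9.79, 9.83)]:
    inside = (low < lo) and (hi < high)
    print(f"range ({low}, {high}): 95% CI ({lo:.3f}, {hi:.3f}) inside? {inside}")
# With n = 50 the estimate typically satisfies the wide ranges but not the
# narrowest one; narrowing further takes more (or more precise) data.
```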

      2. We may be talking about somewhat different things: situations where you are estimating an unknown value of a parameter, versus when prior theory gives a specific prediction about a parameter’s value. g may not have been the best example in that respect (I don’t know the history of physics well enough to be sure, but I suspect g was originally estimated, not predicted from pre-existing theory).

        In psychology we are often in the situation of studying effects with unknown parameters. The old way of doing NHST was to reject the nil null and not even really think about the parameter value apart from its sign. Increasingly people are starting to advocate for estimation and meta-analysis (as in Cumming’s “new statistics”), which is somewhat similar to the process of successively narrowing down ranges (though Cumming dispenses with statistical hypothesis-testing altogether).

        But what do you do when a theory makes a point prediction, such as predicting a specific value of theta a priori, or predicting theta1 = theta2? How do you reject a null hypothesis that includes every value except a single point?

      3. Thank you for the reference to Cumming; it looks very interesting! As someone whose background is in pure statistics, I find it very pertinent to see the issues that crop up during application.

        Theoretical point predictions in physics do indeed tend to be tested by the narrowing of ranges. As a recent example, the Higgs boson had a predicted mass from theory, which was then chased down experimentally, as you can see here: http://en.wikipedia.org/wiki/Higgs_boson#Experimental_search

        I guess this is because in physics they are interested in a certain useful precision, and they stop when they reach that point (or run out of money!). We can never measure anything absolutely exactly, so I suppose every point prediction is a prediction that, given a certain inevitable measurement error, the majority of measurements will fall within a predicted range. So in the end it isn’t necessary to test an all-but-one-point null, which is fortunate, as that presents a technical mathematical difficulty. Instead, we may choose our desired precision (perhaps in terms of significant figures?) and test on that range.

  2. Nice post, here are a few thoughts I had.

    1. The statistical procedure that you propose for evaluating replication attempts is sensible on the face of it, but it seems that it would lead to some ironies or paradoxes in practice due to statistical power issues. (I suspect this is what James was thinking of when he mentions that “In your set-up we would be reliant on the statistical power of the test, rather than the statistical significance. Our standard arsenal of tests are arranged to optimise the significance, not the power.”) The basic problem is the fact that we will tend to evaluate replication attempts as successful (that is, we will fail to reject the null of equivalent effect sizes) when either or both of the effect sizes in question are estimated with a lot of uncertainty. And conversely, when either or both of the effect sizes are estimated very precisely, we will tend to evaluate replication attempts as unsuccessful (that is, reject the null of equivalent effect sizes), even when there are only trivial differences in the two effect sizes.

    One of the ironies that follows from this is that if an experimenter wants the result of their study to be significant and also to have a high chance of being successfully replicated in the future according to the proposed criterion, then they are actually best off aiming for an effect size that is just barely significant and estimated with a ton of uncertainty. As they design their study to have greater and greater power, they are (by definition) increasing the chance of observing a significant effect, but *decreasing* the odds that subsequent replication attempts will successfully replicate their results! A related irony is that if a replicator is motivated to show that the original effect can be successfully replicated, they are better off designing a low-powered replication attempt than a high-powered replication attempt. So both of these examples suggest that the proposed replication criterion could lead to some perverse incentives in the replication enterprise.
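    Here is a quick simulation sketch of the pattern I mean (made-up numbers, crude large-sample approximations): the replication’s true effect is smaller than the original’s, but an imprecise replication rarely rejects the null of equal effect sizes, so it “looks” like a successful replication.

```python
# Simulation: how often does the equal-effects test reject, as a function of
# the replication's sample size? (Invented effect sizes; var(d) ~ 2/n is a
# crude large-sample approximation.)
import numpy as np

rng = np.random.default_rng(7)
d_orig_true, d_rep_true = 0.5, 0.1   # hypothetical true standardized effects
n_orig, n_sims = 200, 5000           # per-group n in the original; no. of simulations

def rejection_rate(n_rep):
    rejections = 0
    for _ in range(n_sims):
        d1 = rng.normal(d_orig_true, np.sqrt(2 / n_orig))   # original estimate
        d2 = rng.normal(d_rep_true, np.sqrt(2 / n_rep))     # replication estimate
        z = (d1 - d2) / np.sqrt(2 / n_orig + 2 / n_rep)
        rejections += abs(z) > 1.96
    return rejections / n_sims

for n_rep in (25, 100, 1000):
    print(f"replication n per group = {n_rep}: "
          f"P(reject equal effects) = {rejection_rate(n_rep):.2f}")
# The rejection rate climbs from roughly .3 at n = 25 to about .95 at n = 1000:
# the noisier the replication, the more likely it is to "pass."
```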

    A similar, but I think less problematic, proposal has been published recently by Verhagen & Wagenmakers in JEP: General (PDF: http://www.ejwagenmakers.com/inpress/BayesianReplicationTest.pdf ). Their proposal is conceptually similar to yours: it involves comparing a “replication hypothesis,” where the replication attempt is consistent with the original (presumably significantly non-zero) study, with a “skeptic’s hypothesis,” which holds that the true effect size is 0 in both cases. They do this in a Bayes Factor framework that should avoid the power paradoxes I described above. It’s an impressive paper.

    2. Regarding your footnote 3. I totally agree with you here, and — if you’ll forgive just a bit of shameless self promotion — I have a working paper posted on SSRN (link: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2449567 ) where I extend this line of argument to apply as well to the stimuli in the experiment, which very often are sampled in a similar way to the participants (e.g., they are samples of words or photographs or whatever). Another argument that I offer for the same conclusion is based on statistical power problems inherent in exactly duplicating stimulus samples when replicating a study. You or others may find it interesting.

    1. Taking your comments out of order, thanks for that link! That’s a great extension of your earlier paper on treating stimuli as sampled (which I hope gets wide attention).

      Regarding power & precision: I think we need to do a better job as a field of being skeptical of underpowered studies. The example you give, a barely-significant effect in an underpowered design, should be setting off loud alarm bells on its own, regardless of any attempt to replicate it. And a weak replication should do the same. No statistical test is going to capture everything we need to look at to evaluate a replication.

      A falsificationist lens supports skepticism of low-powered studies. A theory is falsifiable to the extent that it prohibits empirical observations — the more things we could potentially observe that would be inconsistent with a theory, the more falsifiable the theory. The repeatability theory – “if you do the same experiment you will get the same result” – is less falsifiable when an underpowered design allows more observations to count as “the same result.” Likewise, an empirical test of a theory is a better test when it is “risky” – i.e., has a better chance of ruling out the theory if it is wrong. So we should prefer high-powered original and replication studies over low-powered ones.

      As for the opposite problem – an original/replication pair with high power to detect differences – we should be so lucky to have that problem! My comment about estimating differences and judging whether they are large or trivial applies to that situation. And when both studies have high power, I would put more stock in moderator explanations (which is consistent with McGuire’s call for studying how effects vary over contexts and populations).

      Re the Bayesian tests, I need to read that paper in more depth. My first reaction is that I am not sure why I’d want a test that adjudicates between 2 hypotheses that in many instances will both be wrong. The repeatability approach does not give special standing to the nil null, and when the repeatability hypothesis is rejected that is taken as an invitation to figure out why, rather than as evidence for the nil null. But I need to read the paper. In general, I am glad to see the field talking about how to evaluate replications (both statistically and scientifically). Uri Simonsohn has also been contributing to this issue with his detectability test (http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2259879).

      1. Great post as usual. Like Jake, I was also wondering about testing differences in effect sizes via statistical interactions (or any other NHST approach). It seems to me that one potential issue would be that of the statistical power one would need to show “significant” differences in the sizes of the effects of an original study vs. a replication.

        As Uri Simonsohn pointed out in a Data Colada blog post a while back, we’ve been collectively unaware of the huge statistical power that is needed to reliably detect interaction effects. Uri’s main point was that getting a significant interaction is more complicated when both effects are in the same direction but one is slightly smaller than the other (which Simonsohn calls “attenuated interactions”).

        It would seem to me that this would describe the “best-case” scenario for most replication attempts (in that most published effects are probably over-estimates of the true effect size, for a number of different reasons ranging from sampling error to QRPs). So if my replication study revealed a smaller effect than the original study, but was still in the same direction as the original finding, it would be hard for me to use an interaction test to demonstrate that the size of the effects was different (given sample sizes typical in most psych research).
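        Here is a back-of-the-envelope calculation (normal approximation, hypothetical effect sizes) of just how much power that takes:

```python
# Rough required sample sizes (normal approximation, 80% power, alpha = .05):
# detecting an original effect vs. detecting that a replication's effect is smaller.
from scipy import stats

alpha, power = 0.05, 0.80
k = (stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(power)) ** 2  # ~7.85

d_original = 0.50       # hypothetical effect reported in the original study
d_replication = 0.25    # hypothetical attenuated effect in the replication

# Two-sample comparison: n per group needed to detect d_original on its own
n_detect_original = 2 * k / d_original ** 2

# Testing original vs. replication effect when each study uses n per group:
# var(d1 - d2) is roughly 2/n + 2/n = 4/n, so
n_detect_attenuation = 4 * k / (d_original - d_replication) ** 2

print(f"n per group to detect d = {d_original}: ~{n_detect_original:.0f}")
print(f"n per group in EACH study to detect the attenuation: ~{n_detect_attenuation:.0f}")
# The attenuated difference needs roughly 8x the sample size of the original
# effect -- the "huge statistical power" problem described above.
```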

        I guess this is one more reason for everyone to increase the statistical power of all of our studies. As you mention in your reply to Jake, if both an original and a replication study have high power, then we’re in much better shape to make inferences about the data and the effects under study.

        Based on all of this, one might give a slightly modified version of Rozin/Funder’s third law: “Something [demonstrated with one or more high-powered studies] beats nothing, two times out of three”

        Also, I love the term “noise-surfing.” Wonderful metaphor! (I sometimes call it “chasing noise”).

  3. You talk about “psychology” when you are talking about what seems to be Cognitive Psychology. Behavioural Psychology uses different methods. There’s a lot to be said for single subject research designs.

  4. To add another ingredient/perspective to the discussion between James and Sanjay: One consideration that often seems to be neglected in psychology especially (but also often in other fields) is to consider what difference might be of practical significance, and to think in terms of practical significance as well as statistical significance in drawing inferences. For more discussion, see the July 3 and July 6 posts at http://www.ma.utexas.edu/blogs/mks/.
    I hope you will also read some of the other posts in that series, which was prompted by the recent special issue of Social Psychology on replications.
