Replicability in personality psychology, and the symbiosis between cumulative science and reproducible science

There is apparently an idea going around that personality psychologists are sitting on the sidelines having a moment of schadenfreude during the whole social psychology Replicability Crisis thing.

Not true.

The Association for Research in Personality conference just wrapped up in St. Louis. It was a great conference, with lots of terrific research. (Highlight: watching three of my students give kickass presentations.) And the ongoing scientific discussion about openness and reproducibility had a definite, noticeable effect on the program.

The most obvious influence was the (packed) opening session on reproducibility. First, Rich Lucas talked about the effects of JRP’s recent policy of requiring authors to explicitly talk about power and sample size decisions. The policy has had a noticeable impact on sample sizes of published papers, without major side effects like tilting toward college samples or cheap self-report measures.

Second, Simine Vazire talked about the particular challenges of addressing openness and replicability in personality psychology. A lot of the discussion in psychology has been driven by experimental psychologists, and Simine talked about how the general issues that cut across all of science look when applied specifically to personality psychology. One cool recommendation she had (not just for personality psychologists) was to imagine that you had to include a “Most Damning Result” section in your paper, where you had to report the one result that looked worst for your hypothesis. How would that change your thinking?*

Third, David Condon talked about particular issues for early-career researchers, though really it was for anyone who wants to keep learning – he had a charming story of how he was inspired by seeing one of his big-name intellectual heroes give a major award address at a conference, then show up the next morning for an “Introduction to R” workshop. He talked a lot about tools and technology that we can use to help us do more open, reproducible science.

And finally, Dan Mroczek talked about research he has been doing with a large consortium to try to do reproducible research with existing longitudinal datasets. They have been using an integrated data analysis framework as a way of combining longitudinal datasets to test novel questions, and to look at issues like generalizability and reproducibility across existing data. Dan’s talk was a particularly good example of why we need broad participation in the replicability conversation. We all care about the same broad issues, but the particular solutions that experimental social psychologists identify aren’t going to work for everybody.

In addition to their obvious presence in the plenary session, reproducibility and openness seemed to suffuse the conference. As Rick Robins pointed out to me, there seemed to be a lot more people presenting null findings in an open, frank way. And talk of which findings had replicated and which hadn’t, of people tempering conclusions from initial data, and so on was common and well received, as if it were a normal part of science. Imagine that.

One thing that stuck out to me in particular was the relationship between reproducible science and cumulative science. Usually I think of the first as helping the second: you need robust, reproducible findings as a foundation before you can either dig deeper into process or expand out in various ways. But in many ways, the conference reminded me that the reverse is true as well: cumulative science helps reproducibility.

When people are working on the same or related problems, using the same or related constructs and measures, etc. then it becomes much easier to do robust, reproducible science. In many ways structural models like the Big Five have helped personality psychology with that. For example, the integrated data analysis that Dan talked about requires you to have measures of the same constructs in every dataset. The Big Five provide a common coordinate system to map different trait measures onto, even if they weren’t originally conceptualized that way. Psychology needs more models like that in other domains – common coordinate systems of constructs and measures that help make sense of how different research programs fit together.

And Simine talked about (and has blogged about) the idea that we should collect fewer but better datasets, with more power and better but more labor-intensive methods. If we are open with our data, we can do something really well, and then combine or look across datasets better to take advantage of what other people do really well – but only if we are all working on the same things so that there is enough useful commonality across all those open datasets.

That means we need to move away from a career model of science where every researcher is supposed to have an effect, construct, or theory that is their own little domain that they’re king or queen of. Personality psychology used to be that way, but the Big Five has been a major counter to that, at least in the domain of traits. That kind of convergence isn’t problem-free — the model needs to evolve (Big Six, anyone?), which means that people need the freedom to work outside of it; and it can’t try to subsume things that are outside of its zone of relevance. Some people certainly won’t love it – there’s a certain satisfaction to being the World’s Leading Expert on X, even if X is some construct or process that only you and maybe your former students are studying. But that’s where other fields have gone, even going as far as expanding beyond the single-investigator lab model: Big Science is the norm in many parts of physics, genomics, and other fields. With the kinds of problems we are trying to solve in psychology – not just our reproducibility problems, but our substantive scientific ones — that may increasingly be a model for us as well.

 

———-

* Actually, I don’t think she was only imagining. Simine is the incoming editor at SPPS.** Give it a try, I bet she’ll desk-accept the first paper that does it, just on principle.

** And the main reason I now have footnotes in most of my blog posts.

ASA releases consensus statement

Several months ago, the journal Basic and Applied Social Psychology published an editorial announcing a “ban” on p-values and confidence intervals, and treating Bayesian inferential methods with suspicion as well. The editorial generated quite a bit of buzz among scientists and statisticians alike.

In response, the American Statistical Association released a letter expressing concern about the prospect of doing science without any inferential statistics at all. It announced that it would assemble a blue-ribbon panel of statisticians to issue recommendations.

That statement has now been completed, and I got my hands on an advance copy. Here it is:

We, the undersigned statisticians, represent the full range of statistical perspectives, Bayesian and frequentist alike. We have come to full agreement on the following points:

1. Regarding guiding principles, we all agree that statistical inference is an essential part of science and should not be dispensed with under any circumstances. Whenever possible you should put one of us on your grant to do it for you.

2. As to specific recommendations on how to do statistical inference, we are in full agreement that you should do it our way. “Our” in this context is understood to represent a subset of us to be worked out at a later date.

Thank you for your consideration.

Is there p-hacking in a new breastfeeding study? And is disclosure enough?

There is a new study out about the benefits of breastfeeding for eventual adult IQ, published in The Lancet Global Health. It’s getting lots of news coverage, for example from NPR, the BBC, the New York Times, and more.

A friend shared a link and asked what I thought of it. So I took a look at the article and came across this (emphasis added):

We based statistical comparisons between categories on tests of heterogeneity and linear trend, and we present the one with the lower p value. We used Stata 13·0 for the analyses. We did four sets of analyses to compare breastfeeding categories in terms of arithmetic means, geometric means, median income, and to exclude participants who were unemployed and therefore had no income.

Yikes. The description of the analyses is frankly a little telegraphic. But unless I’m misreading it, or they did some kind of statistical correction that they forgot to mention, it sounds like they had flexibility in the data analyses (I saw no mention of pre-registration in the analysis plan), they used that flexibility to test multiple comparisons, and they’re openly disclosing that they used p-values for model selection – which is a more technical way of saying they engaged in p-hacking. (They don’t say how they selected among the 4 sets of analyses with different kinds of means etc.; was that based on p-values too?)*
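To see why reporting whichever analysis yields the lower p-value is a problem, here is a toy simulation. It is an illustration only, not a model of the study’s actual analyses: two correlated standard-normal test statistics stand in for two tests (say, trend and heterogeneity) run on overlapping data under a true null. Reporting one pre-specified test holds the false-positive rate at the nominal 5%; reporting the lower of the two p-values inflates it.

```python
import math
import random

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def two_sided_p(z):
    return 2.0 * (1.0 - phi(abs(z)))

random.seed(1)
reps, alpha, rho = 100_000, 0.05, 0.5  # rho: correlation between the two test statistics
hits_single = hits_min = 0
for _ in range(reps):
    # Two test statistics from overlapping data under a true null:
    # each is standard normal, but they are correlated.
    z1 = random.gauss(0, 1)
    z2 = rho * z1 + math.sqrt(1 - rho**2) * random.gauss(0, 1)
    p1, p2 = two_sided_p(z1), two_sided_p(z2)
    hits_single += p1 < alpha        # report one pre-specified test
    hits_min += min(p1, p2) < alpha  # report whichever p is lower

print(f"pre-specified test:  {hits_single / reps:.3f}")  # close to 0.05
print(f"lower of two p's:    {hits_min / reps:.3f}")     # noticeably above 0.05
```

The exact inflation depends on how correlated the two tests are, but it is above the nominal rate for anything short of perfectly redundant tests.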

From time to time students ask, Am I allowed to do x statistical thing? And my standard answer is, in the privacy of your office/lab/coffeeshop/etc. you are allowed to do whatever you want! Exploratory data analysis is a good thing. Play with your data and learn from it.** But if you are going to publish the results of your exploration, then disclose. If you did something that could bias your p-values, let readers know and they can make an informed evaluation.***

But that advice assumes that you are talking to a sophisticated reader. When it comes time to talk to the public, via the press, you have a responsibility to explain yourself in plain language, something like: “We used a statistical approach that has an increased risk of producing false positives when there is no effect, or of overestimating the size of effects when they are real.”

And if that weakens your story too much, well, that’s valid. Your story is weaker. Scientific journals are where experts communicate with other experts, and it could still be interesting enough to publish for that audience, perhaps to motivate a more definitive followup study. But if it’s too weak to go to the public and tell mothers what to do with their bodies… Maybe save the press release for the pre-registered Study 2.

—–

* The study has other potential problems which are pretty much par for the course in these kinds of observational studies. They try to statistically adjust for differences between kids who were breastfed and those who weren’t, but that assumes that you have a complete and precisely measured set of all relevant covariates. Did they? It’s not a testable assumption, though it’s one that experts can make educated guesses at. On the plus side, when they added potentially confounding variables to the models the effects got stronger, not weaker. On the minus side, as Michelle Meyer pointed out on Twitter, they did not measure or adjust for parental IQ, which will definitely be associated with child IQ and for which the covariates they did use (like parental education and income) are only rough proxies.

** Though using p-values to guide your exploratory data analysis isn’t the greatest idea.

*** Some statisticians will no doubt disagree and say you shouldn’t be reporting p-values with known bias. My response is (a) if you want unbiased statistics then you shouldn’t be reading anything that’s gone through pre-publication review, and (b) that’s what got us into this mess in the first place. I’d rather make it acceptable for people to disclose everything, as opposed to creating an expectation and incentive for people to report impossibly clean results.

An open review of Many Labs 3: Much to learn

A pre-publication manuscript for the Many Labs 3 project has been released. The project, with 64 authors and supported by the Center for Open Science, ran replications of 10 previously-published effects on diverse topics. The research was conducted in 20 different university subject pools plus an Mturk comparison sample, with very high statistical power (over 3,000 subjects total). The project was pre-registered, and wherever possible the replicators worked with original authors or other experts to make the replications faithful to the originals. Big kudos to the project coordinating team and all the researchers who participated in this important work, as well as all the original authors who worked with the replicators.

A major goal was to examine whether time of semester moderates effect sizes, testing the common intuition among researchers that subjects are “worse” (less attentive) at the end of the term. But really, there is much more to it than that:

Not much replicated. The abstract says that 7 of 10 effects did not replicate. But dig a little deeper and the picture is more complicated. For starters, only 9 of those 10 effects were direct (or if you prefer, “close”) replications. The other was labeled a conceptual replication and deserves separate handling. More on that below; for now, let’s focus on the 9 direct replications.

Of the 3 effects that were tagged as “successful” replications, one — the Stroop effect — is so well-established that it is probably best thought of as a validity check on the project. If the researchers could not obtain a Stroop effect, I would have been more inclined to believe that they screwed something up than to believe that the Stroop isn’t real. But they got the Stroop effect, quite strongly. Test passed. (Some other side findings from manipulation checks or covariates could also be taken as validation tests. Strong arguments were judged as more persuasive; objectively warmer rooms were judged as warmer; etc.)

One major departure I have from the authors is how they defined a successful replication — as a significant effect in the same direction as the original significant effect. Tallying significance tests is not a good metric because, as the saying goes, the difference between significant and not significant is not necessarily significant. And perhaps less obviously, two significant effects — even in the same direction — can be very different.

I have argued before that the core question of replication is really this: If I do the same experiment, will I observe the same result? So the appropriate test is whether the replication effect equals the original effect, plus or minus error. It was not always possible to obtain an effect size from the original published report. But in many instances where it was, the replication effects were much smaller than the original ones. A particularly striking case of this was the replication of Tversky and Kahneman’s original, classic demonstration of the availability heuristic. The original effect was d = 0.82, 95% CI = [0.47, 1.17]. The replication was d = 0.09, 95% CI = [0.02, 0.16]. So the ML3 authors counted it as a “successful” replication, but the replication study obtained a substantially different effect than the original study.
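For readers who want to run that comparison themselves, here is a minimal sketch. It assumes the two studies are independent and that the standard error of d can be recovered from a symmetric 95% confidence interval; both are simplifications.

```python
import math

def se_from_ci(lo, hi, crit_z=1.96):
    """Recover the standard error of d from a symmetric 95% CI."""
    return (hi - lo) / (2 * crit_z)

def z_diff(d1, ci1, d2, ci2):
    """z test of the difference between two independent effect sizes."""
    se1, se2 = se_from_ci(*ci1), se_from_ci(*ci2)
    return (d1 - d2) / math.sqrt(se1**2 + se2**2)

# Availability heuristic: original vs. Many Labs 3 (values from the text above)
z = z_diff(0.82, (0.47, 1.17), 0.09, (0.02, 0.16))
print(round(z, 2))  # ~4.0: the replication effect differs sharply from the original
```

A z of about 4 corresponds to p < .0001, so by this criterion the replication did not reproduce the original result, even though both effects are in the same direction.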

Does it matter if the results are different but in the same direction? I say it does. Effect sizes matter, for a variety of reasons. Lurking in the background is an important set of meta-science questions about the quality of evidence in published reports. Remember, out of a set of 9 direct replications, 6 could not produce the original effect at all. Against that backdrop, what does it mean when a 7th one’s conclusion was right but the data to support it were not reproducible? You start to wonder if it’s a win for Tversky and Kahneman’s intellects rather than for the scientific process. I don’t want to count on my colleagues’ good guesswork (not all of us are as insightful as Tversky and Kahneman). I want to be persuaded by good data.

For what it’s worth, the other “successful” replication (metaphoric structuring, from a 2000 study by Lera Boroditsky) had a smaller point estimate than the original study, but the original had low power and thus a huge confidence interval (d = 0.07 to 1.20!). Without running the numbers, my guess is that a formal test of the difference between the original and replication effects would not be significant for that reason.

So what’s the box score for the direct replications? On the plus side, 1 gimme and 1 newish effect. In the middle, 1 replication that agreed with the original result in direction but sharply differed in magnitude. And then 6 blanks.

None of the interactions replicated. Of 3 hypothesized interactions, none of them replicated. None, zip, nada. All three were interactions between a manipulated variable and an individual difference. Interactions like that require a lot of statistical power to detect, and the replication effort had much more power than the original. We now know that under conditions of publication bias, published underpowered studies are especially likely to be biased (i.e., false positives). So add this to the growing pile of evidence that researchers and journals need to pay more attention to power.
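The arithmetic behind that power problem is worth making concrete. As a rough sketch, using a normal approximation to a simple two-group contrast (a stand-in for the messier attribute-by-treatment designs), an interaction contrast half the size of a main effect requires about four times the sample for the same power:

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power_two_group(d, n_per_group, crit_z=1.96):
    """Approximate power for a two-group comparison of standardized effect d."""
    ncp = d * math.sqrt(n_per_group / 2)  # noncentrality of the z statistic
    return phi(ncp - crit_z)

# A main effect of d = 0.5 with 64 per group: the textbook ~80% power case.
p_main = power_two_group(0.5, 64)
# An interaction contrast half that size, at the same n, loses most of that power.
p_int_same_n = power_two_group(0.25, 64)
# Quadrupling the sample restores it.
p_int_4x = power_two_group(0.25, 256)
print(p_main, p_int_same_n, p_int_4x)
```

The numbers here are hypothetical, but the lesson generalizes: interactions are usually smaller than main effects, so a replication powered for the main effect can still be badly underpowered for the interaction.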

A particularly interesting case is the replication of a moral credentialing study. In the original study, the authors had hypothesized a main effect of a credentialing manipulation. They didn’t find the main effect, but they did find (unexpectedly) an interaction with gender. By contrast, Many Labs 3 found the originally hypothesized main effect but not the gender interaction. So it is quite possible (if you think ML3 is the more reliable dataset) that the original study obtained both a false negative and a false positive — both potentially costly consequences of running with too little power.

Emblematic problems with a “conceptual replication.” An oddball in the mix was the one “conceptual replication.” The original study looked at associations between two self-report instruments, the NEO PI-R and Cloninger’s TCI, in a sample of psychiatric patients admitted to a hospital emergency unit. The replicators picked out one correlation to focus on — the one between NEO Conscientiousness and TCI Persistence. But they subbed in a 2-item scale from Gosling et al.’s TIPI for the 48-item NEO Conscientiousness scale, and instead of using the TCI they used time spent on an unsolvable anagram task as a measure of “persistence.” And they looked at all this in university subject pools and an Mturk sample.

The NEO-to-TIPI exchange is somewhat defensible, as the TIPI has been validated against the NEO PI-R.* But the shorter measure should be expected to underestimate any effects.

Swapping in the unsolvable anagrams for the TCI Persistence scale is a bit more mystifying. Sure, people sometimes call the anagram task a measure of “persistence” but (a) it’s a one-off behavioral measure not optimized to measuring stable individual differences, and (b) since there is no literature cited (or that I know of) connecting the anagram task to the TCI, there is a serious risk of falling prey to the jingle fallacy. In other words, just because the thing measured by the TCI and the thing measured by the anagrams are called “persistence” by different researchers, that does not mean they are the same construct. (For example, goals are hierarchical. Planful “persistence” at higher-order goals might be very different than wheel-spinning “persistence” on narrow tasks. Indeed, one of the 2 cited studies on the anagrams task used disengagement from the anagrams as an index of adaptive behavior.)

These problems, I think, are emblematic of problems with how “conceptual replication” is often treated. First, direct and conceptual replication are not the same thing and should not be treated as such. Direct replication is about the reproducibility of scientific observations; conceptual replication is about putting a theory to the test in more than one way. Both are a necessary part of science but they are not substitutable. Second, in this case the conceptual replication is problematic on its own terms. So many things changed from the original to the “replication” that it is difficult to interpret the difference in results. Is the issue that the TIPI was swapped in for the NEO? The anagrams for the TCI? Is it an issue of going from self-report to tested behavior? From an aggregate measure of personality to a one-shot measure? Psychiatric emergency patients to undergrads? Conceptual replication is much more useful when it is incremental and systematic.

The same interpretive difficulty would be present if the ML3 result had appeared to replicate the original, but I suspect we would be much less likely to notice it. We might just say hooray, it replicated — when it is entirely possible that we would actually be looking at 2 completely unrelated effects that just happened to have gotten the same labels.

Another blow against the “hidden moderator” defense. So far I haven’t touched on the stated purpose of this project, which was to look at time-of-semester as a moderator of effects. The findings are easy to summarize: it turns out there was no there there. The authors also tested for heterogeneity of effects due to order (since subjects got the tasks for the 10 replications in random order) or across samples and labs. There was no there there either. Not even in the 2 tasks that were administered in person rather than by computers and thus were harder to standardize.

The authors put the interpretation so well that I’ll quote them at length here [emphasis added]:

A common explanation for the challenges of replicating results across samples and settings is that there are many seen and unseen moderators that qualify the detectability of effects (Cesario, 2014). As such, when differences are observed across study administrations, it is easy to default to the assumption that it must be due to features differing between the samples and settings. Besides time of semester, we tested whether the site of data collection, and the order of administration during the study session moderated the effects. None of these had a substantial impact on any of the investigated effects. This observation is consistent with the first “Many Labs” study (Klein et al., 2014) and is the focus of the second (Klein et al., 2015). The present study provides further evidence against sample and setting differences being a default explanation for variation in replicability. That is not to deny that such variation occurs, just that direct evidence for a given effect is needed to demonstrate that it is a viable explanation.

When effects do not replicate, “maybe there were moderators” needs to be treated as what it is – a post hoc speculation for future testing. Moderator explanations are worth consideration on a case by case basis, and hunting for them could lead to genuine discoveries. But as evidence keeps accruing that sample and setting moderators are not commonplace, we should be more and more skeptical of those kinds of explanations offered post hoc. High-powered pre-registered replications are very credible. And when the replication results disagree with the original study, our interpretations should account for what we know about the statistical properties of the studies and the way we came to see the evidence, and give less weight to untested speculation.

—–

* An earlier version incorrectly stated that the TIPI had only been validated indirectly against the NEO. Thanks to Sam Gosling for pointing out the error.

Top 10 signs you are a statistics maven

I previously referenced Donald Sharpe’s idea of a statistics maven: people with one foot in a science field and one foot in statistics, who frequently act as a conduit for new quantitative innovations. Afterward I had an email exchange with someone who wanted to know how to become a maven, and I had to pass along the news that he probably already was. As a public service to others with similar concerns, I thought I should gather together the most probable symptoms (pending a comprehensive program of construct validation research, of course). Here are the top ten signs that you are a statistics maven:

10. You have installed R packages just to see what they do.

9. Your biggest regret from undergrad is a tossup between that person you never asked out and not taking more math.

8. You call the statistics you learned in grad school “frequentist statistics” and not just “statistics.”

7. People who are not quantitative psychologists call you a quantitative psychologist.

6. But you would be embarrassed if an actual quantitative psychologist overheard them.

5. You have a dead-tree subscription to Psychological Methods delivered to home so you can read it in bed.

4. You are thanked in the acknowledgements sections of your entire cohort’s dissertations.

3. You have a Keep Calm and Read Meehl poster in your office.

2. You once ran an entire study just to have the right kind of data for an analysis you wanted to try.

1. You have strong opinions about bar graphs and you are not afraid to share them.

 (p.s. Shoutout to Aaron Weidman for #5.)

Statistics as math, statistics as tools


How do you think about statistical methods in science? Are statistics a matter of math and logic? Or are they a useful tool? Over time, I have noticed that these seem to be two implicit frames for thinking about statistics. Both are useful, but they tend to be more common in different research communities. And I think sometimes conversations get off track when people are using different ones.

Frame 1 is statistics as math and logic. I think many statisticians and quantitative psychologists work under this frame. Their goal is to understand statistical methods, and statistics are based on math and logic. In math and logic, things are absolute and provable. (Even in statistics, which deals with uncertainty, the uncertainty is almost always quantifiable, and thus subject to analysis.) In math and logic, exceptions and boundary cases are important. If I say “All A are B” and you disagree with me, all you need to do is show me one instance of an A that is not B and you’re done.

In the realm of statistics, that can mean either proving or demonstrating that a method breaks down under some conditions. A good example of this is E. J. Wagenmakers et al.’s recent demonstration that using intervals to do hypothesis testing is wrong. Many people (including me) have assumed that if the 95% confidence interval of a parameter excludes 0, that’s the same as falsifying the hypothesis “parameter = 0.” E. J. and colleagues show an instance where this isn’t true — that is, where the data are uninformative about a hypothesis, but the interval would lead you to believe you had evidence against it. In the example, a researcher is testing a hypothesis about a binomial probability and has a single observation. So the demonstrated breakdown occurs at N = 1, which is theoretically interesting but not a common scenario in real-world research applications.

Frame 2 is statistics as a tool. I think many scientists work under this frame. The scientist’s goal is to understand the natural world, and statistics are a tool that you use as part of the research process. Scientists are pragmatic about tools. None of our tools are perfect – lab equipment generates noisy observations and can break down, questionnaires are only good for some populations, etc. Better tools are better, of course, but since they’re never perfect, at some point we have to decide they’re good enough so we can get out and use them.

Viewing statistics as a tool means that you care whether or not something works well enough under the conditions in which you are likely to use it. A good example of a tool-frame analysis of statistics is Judd, Westfall, and Kenny’s demonstration that traditional repeated-measures ANOVA fails to account for the sampling of stimuli in many within-subjects designs, and that multilevel modeling with random effects is necessary to correctly model those effects. Judd et al. demonstrate this with data from their own published studies, showing in some cases that they themselves would have (and should have) reached different scientific conclusions.

I suspect that this difference in frames relates to the communications gap that Donald Sharpe identified between statisticians and methodologists. (Sharpe’s paper is well worth a read, whether you’re a statistician or a scientist.) Statistical discoveries and innovations often die a lonely death in Psychological Methods because quants prove something under Frame 1 but do not go the next step of demonstrating that it matters under Frame 2, so scientists don’t adopt it. (To be clear, I don’t think the stats people always stay in Frame 1 – as Sharpe points out, some of the most cited papers are in Psychological Methods too. Many of them are ones that speak to both frames.)

I also wonder if this might contribute to the prevalence of less-than-optimal research practices (LTORPs, which include the things sometimes labeled p-hacking or questionable research practices / QRPs). I’m sure plenty of scientists really have (had?) no idea that flexible stopping rules, trying analyses with and without a covariate to see which way works better, etc. are a problem. But I bet others have some general sense that LTORPs are not theoretically correct, perhaps because their first-year grad stats instructors told them so (probably in a Frame 1-ey way). But they also know — perhaps because they have been told by the very same statistics instructors — that there are plenty of statistical practices that are technically wrong but not a big deal (e.g., some departures from distributional assumptions). Tools don’t have to be perfect, they just have to work for the problem at hand. Specifically, I suspect that for a long time, many scientists’ attitude has been that p-values do not have to be theoretically correct, they just have to lead people to make enough right decisions enough of the time. Take them seriously but not that seriously. So when faced with a situation that they haven’t been taught the exact tools for, scientists will weigh the problem as best they can, and sometimes they tell themselves — rightly or wrongly — that what they’re doing is good enough, and they do it.

Sharpe makes excellent points about why there is a communication gap and what to do about it. I hope the 2 frames notion complements that. Scientists have to make progress with limited resources, which means they are constantly making implicit (and sometimes explicit) cost-benefit calculations. If adopting a statistical innovation will require time up front to learn it and perhaps additional time each time you implement it, researchers will ask themselves if it is worth it. Of all the things I need or want to do — writing grants, writing papers, training grad students, running experiments, publishing papers — how much less of the other stuff will I be able to do if I put the time and resources into this? Will this help me do better at my goal of discovering things about the natural world (which is different than the goal of the statistician, which is to figure out new things about statistics), or is this just a lot of headache that’ll lead me to mostly the same place that the simpler way would?

I have a couple of suggestions for better dialogue and progress on both sides. One is that we need to recognize that the 2 frames come from different sets of goals – statisticians want to understand statistics, scientists want to understand the natural world. Statisticians should go beyond showing that something is provably wrong or right, and address whether it is consequentially wrong or right. One person’s provable error is another person’s reasonable approximation. And scientists should consult statisticians about real-world consequences of their decisions. As much as possible, don’t assume good enough, verify it.

Scientists also need usable tools to solve their problems: both conceptual tools, and more concrete things like software and procedures. So scientists and statisticians need to talk more. I think data peeking is a good example of this. To a statistician, setting sample size a priori probably seems like a decent assumption. To a scientist who has just spent two years and six figures of grant money on a study and arrived at suggestive but not conclusive results (a.k.a. p = .11), it is laughable to suggest setting aside that dataset and starting from scratch with a larger sample. If you think your only options are to do that or to run another few subjects and repeat the analysis, and you tell yourself the latter is just a minor fudge (“good enough”), you’re probably going to run the subjects. Sequential analyses solve that problem. They have been around for a while but have languished in a small corner of the clinical trials literature, where there was a pressing ethical reason to use them. Now that scientists are realizing they exist and can solve a wider range of problems, sequential analyses are starting to get much wider attention. They should probably be integrated even more into the data-analytic frameworks (and software) for expensive research areas, like fMRI.
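A quick simulation makes the cost of that “minor fudge” concrete. This is a toy z test with known variance, not real sequential-analysis machinery: test after 50 subjects, and if p > .05, run 50 more and test again. The null is true in every simulated study, yet the single unplanned peek pushes the false-positive rate well above the nominal 5%.

```python
import math
import random

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_from_mean(xs):
    """Two-sided z test of H0: mean = 0 (known sd = 1, for simplicity)."""
    z = (sum(xs) / len(xs)) * math.sqrt(len(xs))
    return 2.0 * (1.0 - phi(abs(z)))

random.seed(7)
reps, n1, n2 = 20_000, 50, 50
false_pos = 0
for _ in range(reps):
    xs = [random.gauss(0, 1) for _ in range(n1)]  # the null is true
    if p_from_mean(xs) < 0.05:
        false_pos += 1  # "significant" at the first look
    else:
        xs += [random.gauss(0, 1) for _ in range(n2)]
        false_pos += p_from_mean(xs) < 0.05  # peek again after adding subjects

rate = false_pos / reps
print(f"false-positive rate with one unplanned peek: {rate:.3f}")  # well above 0.05
```

Proper sequential designs fix this by spending the alpha across looks, e.g. with stricter per-look thresholds such as Pocock or O’Brien–Fleming boundaries, so peeking is built in rather than fudged in.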

Sharpe encourages statisticians to pick real examples. Let me add that they should be examples of research that you are motivated to help. Theorems, simulations, and toy examples are Frame 1 tools. Analyses in real data will hit home with scientists where they live in Frame 2. Picking apart a study in an area you already have distaste for (“I think evolutionary psych is garbage, let me choose this ev psych study to illustrate this statistical problem”) might feel satisfying, but it probably leads to less thorough and less persuasive critiques. Show in real data how the new method helps scientists with their goals, not just what the old one gets wrong according to yours.

I think of myself as one of what Sharpe calls the Mavens – scientists with an extra interest in statistics, who nerd out on quant stuff, often teach quantitative classes within their scientific field, and who like to adopt and spread the word about new innovations. Sometimes Mavens get into something because it just seems interesting. But often they (we) are drawn to things that seem like cool new ways to solve problems in our field. Speaking as a Maven who thinks statisticians can help us make science better, I would love it if you could help us out. We are interested, and we probably want to help too.

 

Popper on direct replication, tacit knowledge, and theory construction

I’ve quoted some of this before, but it was buried in a long post and it’s worth quoting at greater length and on its own. It succinctly lays out Popper’s views on several issues relevant to present-day discussions of replication in science. Specifically, Popper makes clear that (1) scientists should replicate their own experiments; (2) scientists should be able to instruct other experts how to reproduce their experiments and get the same results; and (3) establishing the reproducibility of experiments (“direct replication” in the parlance of our times) is a necessary precursor for all the other things you do to construct and test theories.

Kant was perhaps the first to realize that the objectivity of scientific statements is closely connected with the construction of theories — with the use of hypotheses and universal statements. Only when certain events recur in accordance with rules or regularities, as is the case with repeatable experiments, can our observations be tested — in principle — by anyone. We do not take even our own observations quite seriously, or accept them as scientific observations, until we have repeated and tested them. Only by such repetitions can we convince ourselves that we are not dealing with a mere isolated ‘coincidence’, but with events which, on account of their regularity and reproducibility, are in principle inter-subjectively testable.

Every experimental physicist knows those surprising and inexplicable apparent ‘effects’ which in his laboratory can perhaps even be reproduced for some time, but which finally disappear without trace. Of course, no physicist would say in such a case that he had made a scientific discovery (though he might try to rearrange his experiments so as to make the effect reproducible). Indeed the scientifically significant physical effect may be defined as that which can be regularly reproduced by anyone who carries out the appropriate experiment in the way prescribed. No serious physicist would offer for publication, as a scientific discovery, any such ‘occult effect,’ as I propose to call it — one for whose reproduction he could give no instructions. The ‘discovery’ would be only too soon rejected as chimerical, simply because attempts to test it would lead to negative results. (It follows that any controversy over the question whether events which are in principle unrepeatable and unique ever do occur cannot be decided by science: it would be a metaphysical controversy.)

– Karl Popper (1959/2002), The Logic of Scientific Discovery, pp. 23-24.