An open review of Many Labs 3: Much to learn

A pre-publication manuscript for the Many Labs 3 project has been released. The project, with 64 authors and supported by the Center for Open Science, ran replications of 10 previously-published effects on diverse topics. The research was conducted in 20 different university subject pools plus an Mturk comparison sample, with very high statistical power (over 3,000 subjects total). The project was pre-registered, and wherever possible the replicators worked with original authors or other experts to make the replications faithful to the originals. Big kudos to the project coordinating team and all the researchers who participated in this important work, as well as all the original authors who worked with the replicators.

A major goal was to examine whether time of semester moderates effect sizes, testing the common intuition among researchers that subjects are “worse” (less attentive) at the end of the term. But really, there is much more to it than that:

Not much replicated. The abstract says that 7 of 10 effects did not replicate. But dig a little deeper and the picture is more complicated. For starters, only 9 of those 10 effects were direct (or if you prefer, “close”) replications. The other was labeled a conceptual replication and deserves separate handling. More on that below; for now, let’s focus on the 9 direct replications.

Of the 3 effects that were tagged as “successful” replications, one — the Stroop effect — is so well-established that it is probably best thought of as a validity check on the project. If the researchers could not obtain a Stroop effect, I would have been more inclined to believe that they screwed something up than to believe that the Stroop isn’t real. But they got the Stroop effect, quite strongly. Test passed. (Some other side findings from manipulation checks or covariates could also be taken as validation tests. Strong arguments were judged as more persuasive; objectively warmer rooms were judged as warmer; etc.)

One major departure I have from the authors is how they defined a successful replication — as a significant effect in the same direction as the original significant effect. Tallying significance tests is not a good metric because, as the saying goes, the difference between significant and not significant is not necessarily significant. And perhaps less obviously, two significant effects — even in the same direction — can be very different.

I have argued before that the core question of replication is really this: If I do the same experiment, will I observe the same result? So the appropriate test is whether the replication effect equals the original effect, plus or minus error. It was not always possible to obtain an effect size from the original published report. But in many instances where it was, the replication effects were much smaller than the original ones. A particularly striking case of this was the replication of Tversky and Kahneman’s original, classic demonstration of the availability heuristic. The original effect was d = 0.82, 95% CI = [0.47, 1.17]. The replication was d = 0.09, 95% CI = [0.02, 0.16]. So the ML3 authors counted it as a “successful” replication, but the replication study obtained a substantially different effect than the original study.
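
To make the comparison concrete, here is a rough sketch of one way to test whether the two availability-heuristic estimates differ. It is not the analysis used by ML3 or by the original authors. It assumes the two estimates are independent, that the reported 95% intervals are symmetric normal-theory intervals (so a standard error can be backed out of each one), and the function names are made up for illustration.

```python
import math

def se_from_ci(lower, upper, z_crit=1.96):
    """Back out a standard error from a symmetric 95% confidence interval (assumption)."""
    return (upper - lower) / (2 * z_crit)

def compare_effects(d_orig, ci_orig, d_rep, ci_rep):
    """z test of the difference between two independent effect size estimates."""
    se_diff = math.sqrt(se_from_ci(*ci_orig) ** 2 + se_from_ci(*ci_rep) ** 2)
    z = (d_orig - d_rep) / se_diff
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))  # two-tailed normal p-value
    return z, p_two_sided

# Effect sizes and intervals as quoted above
z, p = compare_effects(0.82, (0.47, 1.17), 0.09, (0.02, 0.16))
print(f"z = {z:.2f}, two-sided p = {p:.5f}")
```

Under those assumptions the difference works out to a z of roughly 4, which is the sense in which a "successful" replication can still disagree with the original.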

Does it matter if the results are different but in the same direction? I say it does. Effect sizes matter, for a variety of reasons. Lurking in the background is an important set of meta-science questions about the quality of evidence in published reports. Remember, out of a set of 9 direct replications, 6 could not produce the original effect at all. Against that backdrop, what does it mean when a 7th one’s conclusion was right but the data to support it were not reproducible? You start to wonder if it’s a win for Tversky and Kahneman’s intellects rather than for the scientific process. I don’t want to count on my colleagues’ good guesswork (not all of us are as insightful as Tversky and Kahneman). I want to be persuaded by good data.

For what it’s worth, the other “successful” replication (metaphoric structuring, from a 2000 study by Lera Boroditsky) had a smaller point estimate than the original study, but the original had low power and thus a huge confidence interval (d = 0.07 to 1.20!). Without running the numbers, my guess is that a formal test of the difference between the original and replication effects would not be significant for that reason.

So what’s the box score for the direct replications? On the plus side, 1 gimme and 1 newish effect. In the middle, 1 replication that agreed with the original result in direction but sharply differed in magnitude. And then 6 blanks.

None of the interactions replicated. Of 3 hypothesized interactions, none of them replicated. None, zip, nada. All three were interactions between a manipulated variable and an individual difference. Interactions like that require a lot of statistical power to detect, and the replication effort had much more power than the original. We now know that under conditions of publication bias, published underpowered studies are especially likely to be biased (i.e., false positives). So add this to the growing pile of evidence that researchers and journals need to pay more attention to power.
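
The link between low power, publication bias, and biased published estimates can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: it uses a plain two-group comparison rather than an interaction, a known SD of 1, a z approximation in place of a t test, and a hypothetical true effect of d = 0.2. "Published" here just means the study reached p < .05.

```python
import math
import random
import statistics

def simulate_study(true_d, n_per_group, rng):
    """Return (estimated effect, significant?) for one two-group study."""
    treat = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
    ctrl = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    d_hat = statistics.fmean(treat) - statistics.fmean(ctrl)  # SD is 1 by construction
    se = math.sqrt(2.0 / n_per_group)
    return d_hat, abs(d_hat / se) > 1.96

def mean_published_effect(true_d, n_per_group, n_studies=2000, seed=1):
    """Power, and the average estimate among studies that reached significance."""
    rng = random.Random(seed)
    results = [simulate_study(true_d, n_per_group, rng) for _ in range(n_studies)]
    published = [d for d, sig in results if sig]
    return len(published) / n_studies, statistics.fmean(published)

for n in (20, 200):
    power, avg = mean_published_effect(0.2, n)
    print(f"n per group = {n}: power ~ {power:.2f}, mean published d ~ {avg:.2f}")
```

In this setup the small studies that clear the significance filter overestimate the true effect by a wide margin, while the large studies overestimate it only slightly.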

A particularly interesting case is the replication of a moral credentialing study. In the original study, the authors had hypothesized a main effect of a credentialing manipulation. They didn’t find the main effect, but they did find (unexpectedly) an interaction with gender. By contrast, Many Labs 3 found the originally hypothesized main effect but not the gender interaction. So it is quite possible (if you think ML3 is the more reliable dataset) that the original study obtained both a false negative and a false positive, both potentially costly consequences of running with too little power.

Emblematic problems with a “conceptual replication.” An oddball in the mix was the one “conceptual replication.” The original study looked at associations between two self-report instruments, the NEO PI-R and Cloninger’s TCI, in a sample of psychiatric patients admitted to a hospital emergency unit. The replicators picked out one correlation to focus on — the one between NEO Conscientiousness and TCI Persistence. But they subbed in a 2-item scale from Gosling et al.’s TIPI for the 48-item NEO Conscientiousness scale, and instead of using the TCI they used time spent on an unsolvable anagram task as a measure of “persistence.” And they looked at all this in university subject pools and an Mturk sample.

The NEO-to-TIPI exchange is somewhat defensible, as the TIPI has been validated against the NEO PI-R.* But the shorter measure should be expected to underestimate any effects.

Swapping in the unsolvable anagrams for the TCI Persistence scale is a bit more mystifying. Sure, people sometimes call the anagram task a measure of “persistence” but (a) it’s a one-off behavioral measure not optimized to measuring stable individual differences, and (b) since there is no literature cited (or that I know of) connecting the anagram task to the TCI, there is a serious risk of falling prey to the jingle fallacy. In other words, just because the thing measured by the TCI and the thing measured by the anagrams are called “persistence” by different researchers, that does not mean they are the same construct. (For example, goals are hierarchical. Planful “persistence” at higher-order goals might be very different than wheel-spinning “persistence” on narrow tasks. Indeed, one of the 2 cited studies on the anagrams task used disengagement from the anagrams as an index of adaptive behavior.)

These problems, I think, are emblematic of problems with how “conceptual replication” is often treated. First, direct and conceptual replication are not the same thing and should not be treated as such. Direct replication is about the reproducibility of scientific observations; conceptual replication is about putting a theory to the test in more than one way. Both are a necessary part of science but they are not substitutable. Second, in this case the conceptual replication is problematic on its own terms. So many things changed from the original to the “replication” that it is difficult to interpret the difference in results. Is the issue that the TIPI was swapped in for the NEO? The anagrams for the TCI? Is it an issue of going from self-report to tested behavior? From an aggregate measure of personality to a one-shot measure? Psychiatric emergency patients to undergrads? Conceptual replication is much more useful when it is incremental and systematic.

The same interpretive difficulty would be present if the ML3 result had appeared to replicate the original, but I suspect we would be much less likely to notice it. We might just say hooray, it replicated — when it is entirely possible that we would actually be looking at 2 completely unrelated effects that just happened to have gotten the same labels.

Another blow against the “hidden moderator” defense. So far I haven’t touched on the stated purpose of this project, which was to look at time-of-semester as a moderator of effects. The findings are easy to summarize: it turns out there was no there there. The authors also tested for heterogeneity of effects due to order (since subjects got the tasks for the 10 replications in random order) or across samples and labs. There was no there there either. Not even in the 2 tasks that were administered in person rather than by computers and thus were harder to standardize.

The authors put the interpretation so well that I’ll quote them at length here [emphasis added]:

A common explanation for the challenges of replicating results across samples and settings is that there are many seen and unseen moderators that qualify the detectability of effects (Cesario, 2014). As such, when differences are observed across study administrations, it is easy to default to the assumption that it must be due to features differing between the samples and settings. Besides time of semester, we tested whether the site of data collection, and the order of administration during the study session moderated the effects. None of these had a substantial impact on any of the investigated effects. This observation is consistent with the first “Many Labs” study (Klein et al., 2014) and is the focus of the second (Klein et al., 2015). The present study provides further evidence against sample and setting differences being a default explanation for variation in replicability. That is not to deny that such variation occurs, just that direct evidence for a given effect is needed to demonstrate that it is a viable explanation.

When effects do not replicate, “maybe there were moderators” needs to be treated as what it is – a post hoc speculation for future testing. Moderator explanations are worth consideration on a case by case basis, and hunting for them could lead to genuine discoveries. But as evidence keeps accruing that sample and setting moderators are not commonplace, we should be more and more skeptical of those kinds of explanations offered post hoc. High-powered pre-registered replications are very credible. And when the replication results disagree with the original study, our interpretations should account for what we know about the statistical properties of the studies and the way we came to see the evidence, and give less weight to untested speculation.

—–

* An earlier version incorrectly stated that the TIPI had only been validated indirectly against the NEO. Thanks to Sam Gosling for pointing out the error.

Top 10 signs you are a statistics maven

I previously referenced Donald Sharpe’s idea of a statistics maven: people with one foot in a science field and one foot in statistics, who frequently act as a conduit for new quantitative innovations. Afterward I had an email exchange with someone who wanted to know how to become a maven, and I had to pass along the news that he probably already was. As a public service to others with similar concerns, I thought I should gather together the most probable symptoms (pending a comprehensive program of construct validation research, of course). Here are the top ten signs that you are a statistics maven:

10. You have installed R packages just to see what they do.

9. Your biggest regret from undergrad is a tossup between that person you never asked out and not taking more math.

8. You call the statistics you learned in grad school “frequentist statistics” and not just “statistics.”

7. People who are not quantitative psychologists call you a quantitative psychologist.

6. But you would be embarrassed if an actual quantitative psychologist overheard them.

5. You have a dead-tree subscription to Psychological Methods delivered to home so you can read it in bed.

4. You are thanked in the acknowledgements sections of your entire cohort’s dissertations.

3. You have a Keep Calm and Read Meehl poster in your office.

2. You once ran an entire study just to have the right kind of data for an analysis you wanted to try.

1. You have strong opinions about bar graphs and you are not afraid to share them.

 (p.s. Shoutout to Aaron Weidman for #5.)

Statistics as math, statistics as tools


How do you think about statistical methods in science? Are statistics a matter of math and logic? Or are they a useful tool? Over time, I have noticed that these seem to be two implicit frames for thinking about statistics. Both are useful, but they tend to be more common in different research communities. And I think sometimes conversations get off track when people are using different ones.

Frame 1 is statistics as math and logic. I think many statisticians and quantitative psychologists work under this frame. Their goal is to understand statistical methods, and statistics are based on math and logic. In math and logic, things are absolute and provable. (Even in statistics, which deals with uncertainty, the uncertainty is almost always quantifiable, and thus subject to analysis.) In math and logic, exceptions and boundary cases are important. If I say “All A are B” and you disagree with me, all you need to do is show me one instance of an A that is not B and you’re done.

In the realm of statistics, that can mean either proving or demonstrating that a method breaks down under some conditions. A good example of this is E. J. Wagenmakers et al.’s recent demonstration that using intervals to do hypothesis testing is wrong. Many people (including me) have assumed that if the 95% confidence interval of a parameter excludes 0, that’s the same as falsifying the hypothesis “parameter = 0.” E. J. and colleagues show an instance where this isn’t true — that is, where the data are uninformative about a hypothesis, but the interval would lead you to believe you had evidence against it. In the example, a researcher is testing a hypothesis about a binomial probability and has a single observation. So the demonstrated breakdown occurs at N = 1, which is theoretically interesting but not a common scenario in real-world research applications.

Frame 2 is statistics as a tool. I think many scientists work under this frame. The scientist’s goal is to understand the natural world, and statistics are a tool that you use as part of the research process. Scientists are pragmatic about tools. None of our tools are perfect – lab equipment generates noisy observations and can break down, questionnaires are only good for some populations, etc. Better tools are better, of course, but since they’re never perfect, at some point we have to decide they’re good enough so we can get out and use them.

Viewing statistics as a tool means that you care whether or not something works well enough under the conditions in which you are likely to use it. A good example of a tool-frame analysis of statistics is Judd, Westfall, and Kenny’s demonstration that traditional repeated-measures ANOVA fails to account for the sampling of stimuli in many within-subjects designs, and that multilevel modeling with random effects is necessary to correctly model those effects. Judd et al. demonstrate this with data from their own published studies, showing in some cases that they themselves would have (and should have) reached different scientific conclusions.

I suspect that this difference in frames relates to the communications gap that Donald Sharpe identified between statisticians and methodologists. (Sharpe’s paper is well worth a read, whether you’re a statistician or a scientist.) Statistical discoveries and innovations often die a lonely death in Psychological Methods because quants prove something under Frame 1 but do not go the next step of demonstrating that it matters under Frame 2, so scientists don’t adopt it. (To be clear, I don’t think the stats people always stay in Frame 1 – as Sharpe points out, some of the most cited papers are in Psychological Methods too. Many of them are ones that speak to both frames.)

I also wonder if this might contribute to the prevalence of less-than-optimal research practices (LTORPs, which includes the things sometimes labeled p-hacking or questionable research practices / QRPs). I’m sure plenty of scientists really have (had?) no idea that flexible stopping rules, trying analyses with and without a covariate to see which way works better, etc. are a problem. But I bet others have some general sense that LTORPs are not theoretically correct, perhaps because their first-year grad stats instructors told them so (probably in a Frame 1-ey way). But they also know — perhaps because they have been told by the very same statistics instructors — that there are plenty of statistical practices that are technically wrong but not a big deal (e.g., some departures from distributional assumptions). Tools don’t have to be perfect, they just have to work for the problem at hand. Specifically, I suspect that for a long time, many scientists’ attitude has been that p-values do not have to be theoretically correct, they just have to lead people to make enough right decisions enough of the time. Take them seriously but not that seriously. So when faced with a situation that they haven’t been taught the exact tools for, scientists will weigh the problem as best as they can, and sometimes they tell themselves — rightly or wrongly — that what they’re doing is good enough, and they do it.

Sharpe makes excellent points about why there is a communication gap and what to do about it. I hope the 2 frames notion complements that. Scientists have to make progress with limited resources, which means they are constantly making implicit (and sometimes explicit) cost-benefit calculations. If adopting a statistical innovation will require time up front to learn it and perhaps additional time each time you implement it, researchers will ask themselves if it is worth it. Of all the things I need or want to do — writing grants, writing papers, training grad students, running experiments, publishing papers — how much less of the other stuff will I be able to do if I put the time and resources into this? Will this help me do better at my goal of discovering things about the natural world (which is different than the goal of the statistician, which is to figure out new things about statistics), or is this just a lot of headache that’ll lead me to mostly the same place that the simpler way would?

I have a couple of suggestions for better dialogue and progress on both sides. One is that we need to recognize that the 2 frames come from different sets of goals – statisticians want to understand statistics, scientists want to understand the natural world. Statisticians should go beyond showing that something is provably wrong or right, and address whether it is consequentially wrong or right. One person’s provable error is another person’s reasonable approximation. And scientists should consult statisticians about real-world consequences of their decisions. As much as possible, don’t assume good enough, verify it.

Scientists also need usable tools to solve their problems. Both conceptual tools, and more concrete things like software, procedures, etc. So scientists and statisticians need to talk more. I think data peeking is a good example of this. To a statistician, setting sample size a priori probably seems like a decent assumption. To a scientist who has just spent two years and six figures of grant money on a study and arrived at suggestive but not conclusive results (a.k.a. p = .11), it is laughable to suggest setting aside that dataset and starting from scratch with a larger sample. If you think your only choice is either do that or run another few subjects and do the analysis again, then if you think it’s just a minor fudge (“good enough”) you’re probably going to run the subjects. Sequential analyses solve that problem. They have been around for a while, but languishing in a small corner of the clinical trials literature where there was a pressing ethical reason to use them. Now that scientists are realizing they exist and can solve a wider range of problems, sequential analyses are starting to get much wider attention. They probably should be integrated even more into the data-analytic frameworks (and software) for expensive research areas, like fMRI.
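
For a sense of how much the "run another few subjects and do the analysis again" fudge costs, here is a small simulation sketch. All the specifics are assumptions chosen for illustration: a true null effect, a one-sample z test with known SD of 1, a first look at 20 subjects, and up to four extra batches of 10. It shows the problem that sequential analyses are designed to correct. It does not implement a sequential analysis itself.

```python
import math
import random

def peeking_study(rng, n_start=20, n_add=10, max_looks=5):
    """One null study with optional stopping; True if any look reaches p < .05."""
    data = [rng.gauss(0.0, 1.0) for _ in range(n_start)]
    for look in range(max_looks):
        if look > 0:
            # Not significant yet, so run a few more subjects and test again
            data.extend(rng.gauss(0.0, 1.0) for _ in range(n_add))
        z = sum(data) / math.sqrt(len(data))  # z test of mean = 0, known SD = 1
        if abs(z) > 1.96:
            return True
    return False

def false_positive_rate(max_looks, n_sims=4000, seed=11):
    """Share of null studies declared significant under a given number of looks."""
    rng = random.Random(seed)
    hits = sum(peeking_study(rng, max_looks=max_looks) for _ in range(n_sims))
    return hits / n_sims

rate_fixed = false_positive_rate(max_looks=1)    # sample size set in advance
rate_peeking = false_positive_rate(max_looks=5)  # test, add subjects, test again
print(f"fixed n: {rate_fixed:.3f}   with peeking: {rate_peeking:.3f}")
```

With a single planned look the false positive rate stays near the nominal 5%. With repeated looks it climbs well above that, which is why the fudge is not as minor as it feels.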

Sharpe encourages statisticians to pick real examples. Let me add that they should be examples of research that you are motivated to help. Theorems, simulations, and toy examples are Frame 1 tools. Analyses in real data will hit home with scientists where they live in Frame 2. Picking apart a study in an area you already have distaste for (“I think evolutionary psych is garbage, let me choose this ev psych study to illustrate this statistical problem”) might feel satisfying, but it probably leads to less thorough and less persuasive critiques. Show in real data how the new method helps scientists with their goals, not just what the old one gets wrong according to yours.

I think of myself as one of what Sharpe calls the Mavens – scientists with an extra interest in statistics, who nerd out on quant stuff, often teach quantitative classes within their scientific field, and who like to adopt and spread the word about new innovations. Sometimes Mavens get into something because it just seems interesting. But often they (we) are drawn to things that seem like cool new ways to solve problems in our field. Speaking as a Maven who thinks statisticians can help us make science better, I would love it if you could help us out. We are interested, and we probably want to help too.

 

Popper on direct replication, tacit knowledge, and theory construction

I’ve quoted some of this passage from Karl Popper before, but it was buried in a long post and it’s worth quoting at greater length and on its own. The passage succinctly lays out his views on several issues relevant to present-day discussions of replication in science. Specifically, Popper makes clear that (1) scientists should replicate their own experiments; (2) scientists should be able to instruct other experts how to reproduce their experiments and get the same results; and (3) establishing the reproducibility of experiments (“direct replication” in the parlance of our times) is a necessary precursor for all the other things you do to construct and test theories.

Kant was perhaps the first to realize that the objectivity of scientific statements is closely connected with the construction of theories — with the use of hypotheses and universal statements. Only when certain events recur in accordance with rules or regularities, as is the case with repeatable experiments, can our observations be tested — in principle — by anyone. We do not take even our own observations quite seriously, or accept them as scientific observations, until we have repeated and tested them. Only by such repetitions can we convince ourselves that we are not dealing with a mere isolated ‘coincidence’, but with events which, on account of their regularity and reproducibility, are in principle inter-subjectively testable.

Every experimental physicist knows those surprising and inexplicable apparent ‘effects’ which in his laboratory can perhaps even be reproduced for some time, but which finally disappear without trace. Of course, no physicist would say in such a case that he had made a scientific discovery (though he might try to rearrange his experiments so as to make the effect reproducible). Indeed the scientifically significant physical effect may be defined as that which can be regularly reproduced by anyone who carries out the appropriate experiment in the way prescribed. No serious physicist would offer for publication, as a scientific discovery, any such ‘occult effect,’ as I propose to call it — one for whose reproduction he could give no instructions. The ‘discovery’ would be only too soon rejected as chimerical, simply because attempts to test it would lead to negative results. (It follows that any controversy over the question whether events which are in principle unrepeatable and unique ever do occur cannot be decided by science: it would be a metaphysical controversy.)

– Karl Popper (1959/2002), The Logic of Scientific Discovery, pp. 23-24.

The selection-distortion effect: How selection changes correlations in surprising ways

A little while back I ran across an idea buried in an old paper of Robyn Dawes that really opened my eyes. It was one of those things that seemed really simple and straightforward once I saw it. But I’d never run across it before.[1] The idea is this: when a sample is selected on a combination of 2 (or more) variables, the relationship between those 2 variables is different after selection than it was before, and not just because of restriction of range. The correlation changes in ways that, if you don’t realize it’s happening, can be surprising and potentially misleading. It can flip the sign of a correlation, or turn a zero correlation into a substantial one. Let’s call it the selection-distortion effect.

First, some background: Dawes was the head of the psychology department at the University of Oregon back in the 1970s. Merging his administrative role with his interests in decision-making, he collected data about graduate admissions decisions and how well they predict future outcomes. He eventually wrote a couple of papers based on that work for Science and American Psychologist. The Science paper, titled “Graduate admission variables and future success,” was about why the variables used to select applicants to grad school do not correlate very highly with the admitted students’ later achievements. Dawes’s main point was to demonstrate why, when predictor variables are negatively correlated with each other, they can be perfectly reasonable predictors as a set even though each one taken on its own has a low predictive validity among selected students.

However, in order to get to his main point Dawes had to explain why the correlations would be negative in the first place. He offered the explanation rather briefly and described it in the context of graduate admissions. But it actually refers to (I believe) a very general phenomenon. The key fact to grasp is this: Dawes found, consistently across multiple cohorts, that the correlation between GRE and GPA was negative among admitted students but positive among applicants.

This isn’t restriction of range. Restriction of range attenuates correlations – it pushes them toward zero. As I’ll show below, this phenomenon can easily flip signs and even make the absolute value of a correlation go from zero to substantial.

Instead, it is a result of a multivariate selection process. Grad school admissions committees select for both GRE and GPA. So the selection process eliminates people who are low on both, or really low on just one. Some people are very high on both, and they get admitted. But a lot of people who pass through the selection process are a bit higher on one than on the other (relative to each variable’s respective distributions). Being really excellent on one can compensate for only being pretty good on the other and get you across the selection threshold. It is this kind of implicitly compensatory relationship that makes the correlation more negative in the post-selection group than in the pre-selection group.

To illustrate, here is a figure from a simulation I ran. On the left, X and Y are sampled from standard normal distributions with a population correlation of rho = .30. The observed correlation among 500 cases is r = .26. On the right I have simulated a hard-threshold selection process designed to select cases in the top 50% of the population. Specifically, cases are selected if X + Y > 0. Among the 239 cases that passed the selection filter, the observed correlation is now r = -.25. The correlation hasn’t been attenuated; it has been flipped!
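
A simulation along these lines can be sketched in a few lines of Python. This is a reconstruction rather than the code behind the figure: the seed is arbitrary, the helper names are made up, and the sample is much larger than 500 so that the pattern is not obscured by sampling noise. The exact correlations will therefore differ from the ones reported here.

```python
import math
import random

def pearson_r(pairs):
    """Pearson correlation for a list of (x, y) tuples."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

def draw_correlated(n, rho, rng):
    """Draw n (x, y) pairs of standard normals with population correlation rho."""
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = rho * x + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        pairs.append((x, y))
    return pairs

rng = random.Random(2015)  # arbitrary seed
population = draw_correlated(20000, 0.30, rng)
selected = [(x, y) for x, y in population if x + y > 0]  # hard selection threshold

r_before = pearson_r(population)
r_after = pearson_r(selected)
print(f"before selection: r = {r_before:.2f} (n = {len(population)})")
print(f"after selection:  r = {r_after:.2f} (n = {len(selected)})")
```

The pre-selection correlation should land near .30 and the post-selection correlation should come out negative, matching the sign flip in the figure.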

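[Figure: selection1. Scatterplots of X and Y before selection (left) and after selection on X + Y > 0 (right).]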

Eyeballing the plot on the right, it’s pretty obvious that a selection process has taken place — you can practically draw a diagonal line along the selection threshold. That’s because I created a hard threshold for illustrative purposes. But that isn’t necessary for the distortion effect to occur. If X and Y are just 2 of several things that predict selection, and/or if they are used in the selection process inconsistently (e.g., with random error as you might expect with human judges), you’ll still get the effect. So you can get it in samples where, if you only had the post-selection dataset to look at, it would not be at all obvious that it had been selected on those variables.

To illustrate, I ran another simulation. This time I set the population correlation to rho = .00 and added another uncorrelated variable, Z, to the selection process (which simulates a committee using things other than GRE and GPA to make its decisions). The observed pre-selection correlation between X and Y is r = .01; in the 253 cases that passed through the selection filter (X + Y + Z > 0), X and Y are correlated r = -.21. The correlation goes from nil to negative, increasing in absolute magnitude; and the scatterplot on the right looks a lot less chopped-off.
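
The three-variable version can be sketched the same way. Again this is a reconstruction under assumptions (arbitrary seed, a larger sample than in the post, hypothetical helper names), so the numbers will not match the reported ones exactly.

```python
import math
import random

def pearson_r(pairs):
    """Pearson correlation for a list of (x, y) tuples."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(7)  # arbitrary seed
# X, Y, and Z are independent standard normals, so rho(X, Y) = 0 in the population
cases = [(rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20000)]

# Z stands in for everything else the committee considers; it is then discarded
admitted = [(x, y) for x, y, z in cases if x + y + z > 0]

r_before = pearson_r([(x, y) for x, y, _ in cases])
r_after = pearson_r(admitted)
print(f"before selection: r = {r_before:.2f}")
print(f"after selection:  r = {r_after:.2f} (n = {len(admitted)})")
```

Because Z absorbs part of the selection, no sharp diagonal edge appears in the selected X and Y values, yet their correlation still moves from about zero to clearly negative.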

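[Figure: selection2. Scatterplots of X and Y before selection (left) and after selection on X + Y + Z > 0 (right).]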

As I mentioned above, once I wrapped my head around this phenomenon I started seeing it in a lot of places. Although Dawes found it among GPA and GRE, it is a statistical issue that is not particular to any one subject-matter domain. You will see it any time there is selection on a combination of 2 variables that allows them to compensate for each other to any degree. Thus both variables have to be part of one selection process: if you run a sample through 2 independent selection filters, one on X while ignoring Y and one on Y while ignoring X (so they cannot compensate for each other), the correlation will be attenuated by restriction of range but you will not observe the selection-distortion effect.[2]
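
The contrast with independent filters can be checked with one more sketch. The cutoffs (X > 0 and Y > 0), the seed, and the sample size are assumptions picked for illustration. The point is only that non-compensatory filters shrink the correlation without flipping it.

```python
import math
import random

def pearson_r(pairs):
    """Pearson correlation for a list of (x, y) tuples."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(42)  # arbitrary seed
rho = 0.30
population = []
for _ in range(40000):
    x = rng.gauss(0.0, 1.0)
    y = rho * x + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
    population.append((x, y))

# Two separate filters: a high X cannot make up for a low Y, or vice versa
selected = [(x, y) for x, y in population if x > 0 and y > 0]

r_before = pearson_r(population)
r_after = pearson_r(selected)
print(f"before selection: r = {r_before:.2f}")
print(f"after two independent filters: r = {r_after:.2f} (n = {len(selected)})")
```

Here the post-selection correlation should be smaller than the original but still positive, which is ordinary restriction of range rather than the selection-distortion effect.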

Here are a few examples where I have started to wonder if something like this might be happening. These are all speculative but they fit the pattern.

1. Studies of intellectual ability and academic motivation among college students. You have to have some combination of intelligence and motivation in order to succeed academically and get into college. So the correlation between those two things is probably different among college students than in the pre-selection pool of applicants (and the general population), especially when looking at selective colleges. For example, in a sample of Penn students, Duckworth et al. (2007) reported that grit was negatively correlated with SAT scores. The authors described the finding as “surprising” and offered some possible explanations for it. I’d add the selection-distortion effect to the list of possible explanations.

To be clear, I am not saying that the negative correlation is “wrong.” That may well be a good unbiased estimate of the correlation at Penn. This is about what populations it would and wouldn’t generalize to. You might find something similar at other selective colleges and universities, but perhaps not in the general population. That’s something that anybody who studies ability and motivation in university subject pools should be aware of.

2. The correlation between research productivity and teaching effectiveness. In a recent op-ed, Adam Grant proposed that universities should create new research-only and teaching-only tenure tracks. Grant drew on sound thinking from organizational psychology that says that jobs should be organized around common skill sets. If you are going to create one job that requires multiple skills, they should be skills that are positively correlated so you can hire people who are good at all parts of their job. Grant combined that argument with evidence from Hattie & Marsh (1996) that among university professors, research productivity and teaching effectiveness have a correlation close to zero. On that basis he argued that we should split research and teaching into different positions.

However, it is plausible that the zero correlation among people who have been hired for R1 tenure-track jobs could reflect a selection-distortion effect. On the surface it may seem to people familiar with that selection process that research and teaching aren’t compensatory. But the studies in the Hattie & Marsh meta-analysis typically measured research productivity with some kind of quantitative metric like number of publications or citations, and overwhelmingly measured teaching effectiveness with student evaluations. Those 2 things are pretty close to 2 of the criteria that weigh heavily in hiring decisions: an established record of scholarly output (the CV) and oral presentation skills (the job talk). The latter is almost certainly related to student evaluations of teaching; indeed, I have heard many people argue that job talks are useful for that reason. Certainly it is plausible that in the hiring process there is some tradeoff between an outstanding written record and a killer job talk. There may be something similar on the self-selection side: Ph.D. grads who aren’t interested in and good at some combination of research and teaching pursue other kinds of jobs. So it seems plausible to me that research and teaching ability (as these are typically indexed in the data Grant cites) could be positively correlated among Ph.D. graduates, and then the selection process is pushing that correlation in a negative direction.

3. The burger-fry tradeoff. Okay, admittedly kinda silly, but hear me out. Back when I was in grad school I noticed that my favorite places for burgers usually weren’t my favorite places for fries, and vice versa. I’m enough of That Guy that I actually thought about it in correlational terms (“Gee, I wonder why there is a negative correlation between burger quality and fry quality”). Well, years later I think I finally found the answer. The set of burger joints I frequented in town was already selected: I avoided the places with both terrible burgers and terrible fries. So yeah, among the selected sample of places I usually went to, there was a negative correlation. But I bet if you randomly sampled all the burger joints in town, you’d find a positive burger-fries correlation.

(Like I said, once I wrapped my head around the selection-distortion effect I started seeing it everywhere.)

What does this all mean? We as psychologists tend to be good at recognizing when we shouldn’t try to generalize about univariate statistics from unrepresentative samples. Like, you would not think that Obama’s approval rating in your subject pool is representative of his national approval. But we often try to draw generalizable conclusions about relationships between variables from unrepresentative samples. The selection-distortion effect is one way (of many) that that can go wrong. Correlations are sample statistics: at best they say something about the population and context they come from. Whether they generalize beyond that is an empirical question. When you have a selected sample, the selection-distortion effect can give you surprising and even counterintuitive results if you are not on the lookout for it.

=====

1. Honestly, I’m more than a little afraid that somebody is going to drop into the comments and say, “Oh that? That’s the blahblahblah effect, everybody knows about that, here’s a link.”

2. Also, this may be obvious to the quantitatively-minded, but “selection” is defined mechanistically, not psychologically: it does not matter whether a human agent deliberately selected on X and Y or whether the selection is just an artifact or side effect of some other process.

Failed experiments do not always fail toward the null

There is a common argument among psychologists that null results are uninformative. Part of this is the logic of NHST – failure to reject the null is not the same as confirmation of the null. Which is an internally valid statement, but ignores the fact that studies with good power also have good precision to estimate effects.
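The point about precision can be sketched with a little arithmetic. The example below is a hypothetical illustration, not data from any study: it takes an observed correlation of 0.02 and computes an approximate 95% confidence interval at several sample sizes using the standard Fisher z transformation. The function name `correlation_ci` and the sample sizes are assumptions chosen for the demonstration.

```python
import math


def correlation_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for a Pearson correlation via the Fisher z transformation."""
    z = math.atanh(r)                 # Fisher z of the observed correlation
    se = 1.0 / math.sqrt(n - 3)       # approximate standard error on the z scale
    lower = math.tanh(z - z_crit * se)
    upper = math.tanh(z + z_crit * se)
    return lower, upper


if __name__ == "__main__":
    # Hypothetical near-zero result observed at three different sample sizes.
    for n in (20, 200, 2000):
        lower, upper = correlation_ci(0.02, n)
        print(f"n = {n:>4}: r = 0.02, 95% CI [{lower:+.2f}, {upper:+.2f}]")
```

With 20 subjects the interval is so wide that the result is compatible with large effects in either direction. With 2,000 subjects the same point estimate rules out everything except very small effects, which is informative even though the null was not rejected.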

However, there is a second line of argument which is more procedural. The argument is that a null result can happen when an experimenter makes a mistake in either the design or execution of a study. I have heard this many times; this argument is central to an essay that Jason Mitchell recently posted arguing that null replications have no evidentiary value. (The essay said other things too, and has generated some discussion online; see, e.g., Chris Said’s response.)

The problem with this argument is that experimental errors (in both design and execution) can produce all kinds of results, not just the null. Confounds, artifacts, failures of blinding procedures, demand characteristics, outliers and other violations of statistical assumptions, etc. can all produce non-null effects in data. When it comes to experimenter error, there is nothing special about the null.

Moreover, we commit a serious oversight when we use substantive results as the sole evidence of procedures. Say that the scientific hypothesis is that X causes Y. So we design an experiment with an operationalization of X, O_X, and an operationalization of Y, O_Y. A “positive” result tells us O_X -> O_Y. But unless we can say something about the relationships between O_X and X and between O_Y and Y, the result tells us nothing about X and Y.

We have a well established framework for doing that with measurements: construct validation. We expect that measures can and should be validated independent of results to document that Y -> O_Y (convergent validity) and P, Q, R, etc. !-> O_Y (discriminant validity). We have papers showing that measurement procedures are generally valid (in fact these are some of our most-cited papers!). And we typically expect papers that apply previously-established measurement procedures to show that the procedure worked in a particular sample, e.g. by reporting reliability, factor structure, correlations with other measures, etc.

Although we do not seem to publish as many validation papers on experimental manipulations as on measurements, the logic of validation applies just as well. We can obtain evidence that O_X -> X, for example by showing that experimental O_X affects already-established measurements O_X2, O_X3, etc. And in a sufficiently powered design we can show that O_X does not meaningfully influence other variables that are known to affect Y or O_Y. Just as with measurements, we can accumulate this evidence in systematic investigations to show that procedures are generally effective, and then when labs use the procedures to test substantive hypotheses they can run manipulation checks to show that they are executing a procedure correctly.

Programmatic validation is not always necessary — some experimental procedures are so face-valid that we are willing to accept that O_X -> X without a validation study. Likewise for some measurements. That is totally fine, as long as there is no double standard. But in situations where we would be willing to question whether a null result is informative, we should also be willing to question whether a non-null is. We need to evaluate methods in ways that do not depend on whether those methods give us results we like — for experimental manipulations and measurements alike.

Some thoughts on replication and falsifiability: Is this a chance to do better?

Most psychologists would probably endorse falsification as an important part of science. But in practice we rarely do it right. As others have observed before me, we do it backwards. Instead of designing experiments to falsify the hypothesis we are testing, we look for statistical evidence against a “nil null” — the point prediction that the true effect is zero. Sometimes the nil null is interesting, sometimes it isn’t, but it’s almost never a prediction from the theory that we are actually hoping to draw conclusions about.

The more rigorous approach is to derive a quantitative prediction from a theory. Then you design an experiment where the prediction could fail if the theory is wrong. Statistically speaking, the null hypothesis should be the prediction from your theory (“when dropped, this object will accelerate toward the earth at 9.8 m/s^2”). Then if a “significant” result tells you that the data are inconsistent with the theory (“average measured acceleration was 8.6 m/s^2, which differs from 9.8 at p < .05”), you have to either set aside the theory itself or one of the supporting assumptions you made when you designed the experiment. You get some leeway to look to the supporting assumptions (“oops, 9.8 assumes no wind resistance”), but not endless leeway: if the predictions keep failing, eventually you have to face facts and walk away from your theory. On the flip side, a theory is corroborated when it survives many risky opportunities to fail.

The problem in psychology (and many other sciences, including quite a bit of biology and medicine) is that our theories rarely make specific enough quantitative predictions to do hypothesis testing the “right” way. Few of our theories lead to anything remotely close to “g = 9.8 m/s^2” in specificity. People sometimes suggest this is a problem with psychologists’ acumen as theorists. I am more inclined to think it is a function of being a young science and having chosen very difficult problems to solve. So in the grand scheme, I don’t think we should self-flagellate too much about being poor theorists or succumb to physics envy. Most of the time I am inclined to agree with people like Paul Rozin (who was agreeing with Solomon Asch) and William McGuire that instead we need to adapt our approach to our scientific problems and current state of knowledge, rather than trying to ape a caricature of “hard” science. That requires changes in how we do science: we need more exploration and discovery to accumulate interesting knowledge about our phenomena, and we need to be more modest and conditional in our theories. It would be a mistake to say we need to simply double down on the caricature.

So with all this being said, there is something really interesting and I think under-appreciated about the recent movement toward replication, and it is this: This may be a great opportunity to do falsification better.

The repeatability theory

Every results section says some version of, “We did this experiment and we observed these results.”[1] It is a specific statement about something that happened in the past. But hand-in-hand with that statement is, implicitly, another claim: “If someone does the same experiment again, they will get the same results.” The second claim is a mini-theory: it is a generalization of the first claim. Call it the repeatability theory. Every experimental report comes with its own repeatability theory. It is a necessary assumption of inferential statistics. And if we did not make it, we would be doing history rather than science.

And here’s the thing: the repeatability theory is very falsifiable. The rigorous, strong kind of falsifiable. We just need to clarify what it means to (A) do the same experiment again and (B) observe the same or different results.

Part B is a little easier. “The same results” does not mean exactly the same results to infinite precision. It means “the same results plus or minus error.” The hypothesis is that Experiment 1 (the original) and Experiment 2 (the replication) are observations with error of the same underlying effect, so any observed differences between experiments are just noise. If you are using NHST[2] that leads to a straightforward “strong” null hypothesis: effectsize_1 = effectsize_2. If you have access to all the raw data, you can combine both experiments into a single dataset, create an indicator variable for which study the effect came from, and test the interaction of that indicator with the effect. The null hypothesis is no interaction, which sounds like the old fashioned nil-null but in fact “interaction = 0” is the same as saying the effects are equal, which is the very specific quantitative hypothesis derived from the repeatability theory. If you don’t have the raw data, don’t despair. You can calculate an effect from each experiment and then compare them, like with a test of independent correlations. You can and should also estimate the difference between effects (effectsize_1 – effectsize_2) and an associated confidence interval. That difference is itself an effect size: it quantifies whatever difference there is between the studies, and can tell you if the difference is large or trivial.
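For the no-raw-data case, one common way to run a test of independent correlations is the Fisher z approach, sketched below. This is a minimal sketch under stated assumptions: the effect is a Pearson correlation, the two samples are independent, and the function name `compare_independent_correlations` and the example numbers (an original with r = .50, n = 40 and a replication with r = .10, n = 400) are hypothetical. The interval it returns is for the difference on the Fisher z scale, not on the raw correlation scale.

```python
import math


def compare_independent_correlations(r1, n1, r2, n2, z_crit=1.96):
    """Test H0: effectsize_1 = effectsize_2 for two independent correlations.

    Uses the Fisher z transformation; the difference and its confidence
    interval are reported on the z scale.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)
    diff = z1 - z2
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z_stat = diff / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z_stat) / math.sqrt(2.0))
    return {
        "diff_z": diff,
        "ci_z": (diff - z_crit * se, diff + z_crit * se),
        "z": z_stat,
        "p": p_value,
    }


if __name__ == "__main__":
    # Hypothetical original study vs. hypothetical replication.
    result = compare_independent_correlations(r1=0.50, n1=40, r2=0.10, n2=400)
    print(f"difference (z scale) = {result['diff_z']:.3f}")
    print(f"95% CI (z scale)     = [{result['ci_z'][0]:.3f}, {result['ci_z'][1]:.3f}]")
    print(f"z = {result['z']:.2f}, p = {result['p']:.4f}")
```

Note what the null hypothesis is here: not that either effect is zero, but that the two effects are equal. The confidence interval on the difference does the estimation job described above, since it shows whether the discrepancy between studies is large, trivial, or too imprecisely estimated to say.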

Part A, “do the same experiment again,” is more complicated. Literalists like to point out that you will never be in the same room, with the same weather outside, with the same RA wearing the same shirt, etc. etc. They are technically right about all of that.[3]

But the realistic answer is that “the same experiment” just has to repeat the things that matter. “What matters” has been the subject of some discussion recently, for example in a published commentary by Danny Kahneman and a blog post by Andrew Wilson. In my thinking you can divide “what matters” into 3 categories: the original researchers’ specification of the experiment, technical skills in the methods used, and common sense. The onus is on the original experimenter to be able to tell a competent colleague what is necessary to repeat the experiment. In the old days of paper journals and page counts, it was impossible for most published papers to do this completely and you needed a lot of backchannel communication. With online supplements the gap is narrowing, but I still think it can’t hurt for a replicator to reach out to an original author. (Though in contrast to Kahneman, I would describe this as a methodological best practice, neither a matter of etiquette nor an absolute requirement.) If researchers say they do not know what conditions are necessary to produce an effect, that is no defense. It should undermine our faith in the original study. Don’t take my word for it, here’s Sir Karl (whose logic is better than his language – this is [hopefully obviously] limited neither to men nor physicists):

Every experimental physicist knows those surprising and inexplicable apparent ‘effects’ which in his laboratory can perhaps even be reproduced for some time, but which finally disappear without trace. Of course, no physicist would say in such a case that he had made a scientific discovery (though he might try to rearrange his experiments so as to make the effect reproducible). Indeed the scientifically significant physical effect may be defined as that which can be regularly reproduced by anyone who carries out the appropriate experiment in the way prescribed. No serious physicist would offer for publication, as a scientific discovery, any such ‘occult effect,’ as I propose to call it – one for whose reproduction he could give no instructions. (Karl Popper, The Logic of Scientific Discovery, pp. 23-24)

Interpreting results

What happens when the data are inconsistent with the repeatability theory – original != replication? As with all empirical results, we have to consider multiple interpretations. This is true in all of science and has been recognized for a long time; replications are not special in this regard. An observed discrepancy between the original result and a replication[4] is an empirical finding that needs to be interpreted like any other empirical finding. However, a few issues come up commonly in interpreting replications:

First vs. latest. There is nothing special about an experiment being either the first or the latest, ceteris paribus. However ceteris is rarely paribus. If the replication has more power or if the scientific community gets to see its results through a less biased process than the original (e.g., due to pre-registration or a results-independent publication process), those things should give it more weight.

Technical skills. A technical analysis of the methods used and labs’ track records with them is appropriate. I am not much swayed by broad appeals to experimental “artistry.” Instead, I find these interpretations more persuasive when someone can put forward a plausible candidate for something important in the original that is not easy to standardize or carry off without specific skills. For example, a computer-administered experiment is possible to standardize and audit (and in some cases the code and digital stimuli can be reproduced exactly). But an experiment that involves confederates or cover stories might be harder to pull off for a lab that does not do that routinely. When that is the case, manipulation checks, lab visits/exchanges (in person or through video), and other validation procedures become important.

Moderators. Replications can never reproduce every single aspect of the original study. They do their best to reproduce everything that the original specification, technical knowledge, and common sense say should matter. But they can and will still depart from original studies in any number of ways: the subject pool being drawn from, the local social and cultural context, procedural changes made for practical reasons, etc. When the replication departs substantially from the original, it is fair to consider possible moderators. But moderator interpretations are nearly always post hoc, and should be weighed accordingly until we have more data.

I think it’s also important to point out that the possibility of unanticipated moderators is not a problem with replications; rather, if you are interested in discovery it is a very good reason to run them. Consider a hypothetical example from a recent blog post by Tim Wilson: a study originally run in the laboratory that produces a smaller effect in an online replication. Wilson imagines this is an outcome that a replicator with improbable amounts of both malevolence and prescience might arrange on purpose. But a far more likely scenario is that if the original specification, technical knowledge, and common sense all say that offline-online shouldn’t matter but it turns out that it does, that could actually be a very interesting discovery! People are living more of their lives online, and it is important to know how social cognition and behavior work in virtual spaces. And a discovery like that might also save other scientists a lot of wasted effort and resources, if for example they thought the experiment would work online and planned to run replicate-and-extend studies or adapt parts of the original procedure for new studies. In the end, Wilson’s example of replication gone wrong looks more like a useful discovery.

Discovery and replication need each other

Discovery and replication are often contrasted with each other. Discovery is new and exciting; replication is dull “duplication.” But that is silly. Replication separates real discoveries from noise-surfing, and as just noted it can itself lead to discoveries. We can and should do both. And not just in some sort of division of labor arrangement, but in an integrated way as part of our science. Exciting new discoveries need to be replicated before we take them as definitive. Replication within and between labs should be routine and normal.

An integrated discovery-replication approach is also an excellent way to build theories. Both Rozin and McGuire criticize psychology’s tendency to equate “theory” with broad, decontextualized statements – pronouncements that almost invariably get chipped away in subsequent studies as we discover moderators and boundary conditions. This kind of “overclaim first, then back away slowly” approach supports the hype cycle and means that a tendency to make incorrect statements is baked into our research process. Instead, Rozin wants us to accumulate interesting descriptive facts about the phenomena we are studying; McGuire wants us to study how effects vary over populations and contexts. A discovery-replication approach allows us to do both of these things. We can use discovery-oriented exploratory research to derive truly falsifiable predictions to then be tested. That way we will amass a body of narrow but well-corroborated theoretical statements (the repeatability theories) to assemble into bigger theories from the foundations up, rather than starting with bold pronouncements. We will also build up knowledge about quantitative estimates of effects, which we can use to start to make interval and even point predictions. That kind of cumulative science is likely to generate fewer sexy headlines in the short run, but it will be a whole lot more durable.

-----

1. I am using “experiment” in the very broad sense here of a structured scientific observation, not the more limited sense of a study that involves randomized manipulation by an experimenter.[5]

2. I’m sure the Bayesians have an answer for the statistical problem too. It is probably a good one. But c’mon, this is a chance to finally do NHST right!

3. Literalists also like to say it’s a problem that you will never have the exact same people as subjects again. They are technically wrong about that being a problem. “Drawing a sample” is part of what constitutes the experiment. But pressing this point will get you into an argument with a literalist over a technicality, which is never fun, so I suggest letting it drop.

4. “Discrepancy” = “failed replication” in the parlance of our time, but I don’t like that phrase. Who/what failed? Totally unclear, and the answer may be nobody/nothing.

5. I am totally ripping this footnote thing off of Simine Vazire but telling myself I’m ripping off David Foster Wallace.