# The selection-distortion effect: How selection changes correlations in surprising ways

A little while back I ran across an idea buried in an old paper of Robyn Dawes that really opened my eyes. It was one of those things that seemed really simple and straightforward once I saw it. But I’d never run across it before.[1] The idea is this: when a sample is selected on a combination of 2 (or more) variables, the relationship between those 2 variables is different after selection than it was before, and not just because of restriction of range. The correlation changes in ways that, if you don’t realize it’s happening, can be surprising and potentially misleading. It can flip the sign of a correlation, or turn a zero correlation into a substantial one. Let’s call it the selection-distortion effect.

First, some background: Dawes was the head of the psychology department at the University of Oregon back in the 1970s. Merging his administrative role with his interests in decision-making, he collected data about graduate admissions decisions and how well they predict future outcomes. He eventually wrote a couple of papers based on that work for Science and American Psychologist. The Science paper, titled “Graduate admission variables and future success,” was about why the variables used to select applicants to grad school do not correlate very highly with the admitted students’ later achievements. Dawes’s main point was to demonstrate why, when predictor variables are negatively correlated with each other, they can be perfectly reasonable predictors as a set even though each one taken on its own has a low predictive validity among selected students.

However, in order to get to his main point Dawes had to explain why the correlations would be negative in the first place. He offered the explanation rather briefly and described it in the context of graduate admissions. But it actually refers to (I believe) a very general phenomenon. The key fact to grasp is this: Dawes found, consistently across multiple cohorts, that the correlation between GRE and GPA was negative among admitted students but positive among applicants.

This isn’t restriction of range. Restriction of range attenuates correlations – it pushes them toward zero. As I’ll show below, this phenomenon can easily flip signs and even make the absolute value of a correlation go from zero to substantial.

Instead, it is a result of a multivariate selection process. Grad school admissions committees select for both GRE and GPA. So the selection process eliminates people who are low on both, or really low on just one. Some people are very high on both, and they get admitted. But a lot of people who pass through the selection process are a bit higher on one than on the other (relative to each variable’s respective distributions). Being really excellent on one can compensate for only being pretty good on the other and get you across the selection threshold. It is this kind of implicitly compensatory relationship that makes the correlation more negative in the post-selection group than in the pre-selection group.

To illustrate, here is a figure from a simulation I ran. On the left, X and Y are sampled from a bivariate standard normal distribution with a population correlation of rho = .30. The observed correlation among 500 cases is r = .26. On the right I have simulated a hard-threshold selection process designed to select cases in the top 50% of the population. Specifically, cases are selected if X + Y > 0. Among the 239 cases that passed the selection filter, the observed correlation is now r = -.25. The correlation hasn’t been attenuated; it has been flipped!
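For readers who want to try this themselves, here is a minimal sketch of that kind of simulation in plain Python. It is not the code behind the figure: the seed, the function names, and the `applicants`/`admitted` labels are assumptions made for illustration, so the exact values it prints will differ from the ones reported above.

```python
import math
import random


def pearson_r(pairs):
    """Pearson correlation for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


def draw_pairs(n, rho, rng):
    """Draw n (x, y) pairs, each standard normal, with population correlation rho."""
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        # Mixing x with independent noise gives y unit variance and corr(x, y) = rho.
        y = rho * x + math.sqrt(1.0 - rho ** 2) * noise
        pairs.append((x, y))
    return pairs


def select_on_sum(pairs):
    """Hard-threshold, compensatory selection: keep cases with x + y > 0."""
    return [(x, y) for x, y in pairs if x + y > 0]


rng = random.Random(1)  # arbitrary seed (an assumption), only for repeatability
applicants = draw_pairs(500, 0.30, rng)
admitted = select_on_sum(applicants)

print(f"pre-selection:  n = {len(applicants)}, r = {pearson_r(applicants):.2f}")
print(f"post-selection: n = {len(admitted)}, r = {pearson_r(admitted):.2f}")
```

With only 500 cases the two correlations bounce around from seed to seed. Raising `n` makes the sign flip easier to see as a property of the selection rule rather than of one lucky draw.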

Eyeballing the plot on the right, it’s pretty obvious that a selection process has taken place — you can practically draw a diagonal line along the selection threshold. That’s because I created a hard threshold for illustrative purposes. But that isn’t necessary for the distortion effect to occur. If X and Y are just 2 of several things that predict selection, and/or if they are used in the selection process inconsistently (e.g., with random error as you might expect with human judges), you’ll still get the effect. So you can get it in samples where, if you only had the post-selection dataset to look at, it would not be at all obvious that it had been selected on those variables.

To illustrate, I ran another simulation. This time I set the population correlation to rho = .00 and added another uncorrelated variable, Z, to the selection process (which simulates a committee using things other than GRE and GPA to make its decisions). The observed pre-selection correlation between X and Y is r = .01; in the 253 cases that passed through the selection filter (X + Y + Z > 0), X and Y are correlated r = -.21. The correlation goes from nil to negative, increasing in absolute magnitude; and the scatterplot on the right looks a lot less chopped-off.
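A sketch of this second setup, under the same caveats: it is an illustrative reconstruction rather than the original simulation, and the seed and function names are assumptions. Here X, Y, and Z are independent standard normals, and selection depends on their sum.

```python
import math
import random


def pearson_r(pairs):
    """Pearson correlation for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


def draw_cases(n, rng):
    """Draw n (x, y, z) triples of independent standard normals (rho = 0)."""
    return [
        (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        for _ in range(n)
    ]


def select_on_three(cases):
    """Keep the (x, y) values of cases where x + y + z > 0.

    z stands in for everything else the committee looks at; the analyst
    who only has x and y never sees it.
    """
    return [(x, y) for x, y, z in cases if x + y + z > 0]


rng = random.Random(2)  # arbitrary seed (an assumption), only for repeatability
cases = draw_cases(500, rng)
everyone = [(x, y) for x, y, _ in cases]
selected = select_on_three(cases)

print(f"pre-selection:  n = {len(everyone)}, r = {pearson_r(everyone):.2f}")
print(f"post-selection: n = {len(selected)}, r = {pearson_r(selected):.2f}")
```

Because Z also feeds the filter, the selected cloud of points has no sharp diagonal edge, yet X and Y still trade off against each other among the cases that get through.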

As I mentioned above, once I wrapped my head around this phenomenon I started seeing it in a lot of places. Although Dawes found it among GPA and GRE, it is a statistical issue that is not particular to any one subject-matter domain. You will see it any time there is selection on a combination of 2 variables that allows them to compensate for each other to any degree. Thus both variables have to be part of one selection process: if you run a sample through 2 independent selection filters, one on X while ignoring Y and one on Y while ignoring X (so they cannot compensate for each other), the correlation will be attenuated by restriction of range but you will not observe the selection-distortion effect.[2]
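The contrast between the two kinds of filter can be sketched as follows. This is an illustration under assumed settings (rho = .30, cutoffs at zero, a large sample so sampling noise stays small, and function names invented for the example), not an analysis from the Dawes paper.

```python
import math
import random


def pearson_r(pairs):
    """Pearson correlation for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


def draw_pairs(n, rho, rng):
    """Draw n (x, y) pairs, each standard normal, with population correlation rho."""
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = rho * x + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        pairs.append((x, y))
    return pairs


def independent_filters(pairs, cutoff=0.0):
    """Two separate hurdles: x must clear the cutoff AND y must clear it.

    A high x cannot make up for a low y, so there is no compensation.
    """
    return [(x, y) for x, y in pairs if x > cutoff and y > cutoff]


def compensatory_filter(pairs, cutoff=0.0):
    """One combined hurdle: only the sum has to clear the cutoff."""
    return [(x, y) for x, y in pairs if x + y > cutoff]


rng = random.Random(3)  # arbitrary seed (an assumption), only for repeatability
pool = draw_pairs(20000, 0.30, rng)

print(f"unselected pool:      r = {pearson_r(pool):.2f}")
print(f"independent filters:  r = {pearson_r(independent_filters(pool)):.2f}")
print(f"compensatory filter:  r = {pearson_r(compensatory_filter(pool)):.2f}")
```

Under these settings the independent filters should leave a smaller but still positive correlation (restriction of range), while the compensatory filter should push it below zero.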

Here are a few examples where I have started to wonder if something like this might be happening. These are all speculative but they fit the pattern.

1. Studies of intellectual ability and academic motivation among college students. You have to have some combination of intelligence and motivation in order to succeed academically and get into college. So the correlation between those two things is probably different among college students than in the pre-selection pool of applicants (and the general population), especially when looking at selective colleges. For example, in a sample of Penn students, Duckworth et al. (2007) reported that grit was negatively correlated with SAT scores. The authors described the finding as “surprising” and offered some possible explanations for it. I’d add the selection-distortion effect to the list of possible explanations.

To be clear, I am not saying that the negative correlation is “wrong.” That may well be a good unbiased estimate of the correlation at Penn. This is about what populations it would and wouldn’t generalize to. You might find something similar at other selective colleges and universities, but perhaps not in the general population. That’s something that anybody who studies ability and motivation in university subject pools should be aware of.

2. The correlation between research productivity and teaching effectiveness. In a recent op-ed, Adam Grant proposed that universities should create new research-only and teaching-only tenure tracks. Grant drew on sound thinking from organizational psychology that says that jobs should be organized around common skill sets. If you are going to create one job that requires multiple skills, they should be skills that are positively correlated so you can hire people who are good at all parts of their job. Grant combined that argument with evidence from Hattie & Marsh (1996) that among university professors, research productivity and teaching effectiveness have a correlation close to zero. On that basis he argued that we should split research and teaching into different positions.

However, it is plausible that the zero correlation among people who have been hired for R1 tenure-track jobs could reflect a selection-distortion effect. On the surface it may seem to people familiar with that selection process that research and teaching aren’t compensatory. But the studies in the Hattie & Marsh meta-analysis typically measured research productivity with some kind of quantitative metric like number of publications or citations, and overwhelmingly measured teaching effectiveness with student evaluations. Those 2 things are pretty close to 2 of the criteria that weigh heavily in hiring decisions: an established record of scholarly output (the CV) and oral presentation skills (the job talk). The latter is almost certainly related to student evaluations of teaching; indeed, I have heard many people argue that job talks are useful for that reason. Certainly it is plausible that in the hiring process there is some tradeoff between an outstanding written record and a killer job talk. There may be something similar on the self-selection side: Ph.D. grads who aren’t interested in and good at some combination of research and teaching pursue other kinds of jobs. So it seems plausible to me that research and teaching ability (as these are typically indexed in the data Grant cites) could be positively correlated among Ph.D. graduates, and then the selection process is pushing that correlation in a negative direction.

3. The burger-fry tradeoff. Okay, admittedly kinda silly, but hear me out. Back when I was in grad school I noticed that my favorite places for burgers usually weren’t my favorite places for fries, and vice versa. I’m enough of That Guy that I actually thought about it in correlational terms (“Gee, I wonder why there is a negative correlation between burger quality and fry quality”). Well, years later I think I finally found the answer. The set of burger joints I frequented in town was already selected: I avoided the places with both terrible burgers and terrible fries. So yeah, among the selected sample of places I usually went to, there was a negative correlation. But I bet if you randomly sampled all the burger joints in town, you’d find a positive burger-fries correlation.

(Like I said, once I wrapped my head around the selection-distortion effect I started seeing it everywhere.)

What does this all mean? We as psychologists tend to be good at recognizing when we shouldn’t try to generalize about univariate statistics from unrepresentative samples. Like, you would not think that Obama’s approval rating in your subject pool is representative of his national approval. But we often try to draw generalizable conclusions about relationships between variables from unrepresentative samples. The selection-distortion effect is one way (of many) that that can go wrong. Correlations are sample statistics: at best they say something about the population and context they come from. Whether they generalize beyond that is an empirical question. When you have a selected sample, the selection-distortion effect can give you surprising and even counterintuitive results if you are not on the lookout for it.

=====

1. Honestly, I’m more than a little afraid that somebody is going to drop into the comments and say, “Oh that? That’s the blahblahblah effect, everybody knows about that, here’s a link.”

2. Also, this may be obvious to the quantitatively-minded but “selection” is defined mechanistically, not psychologically — it does not matter if a human agent deliberately selected on X and Y, or even if it is just an artifact or side effect of some other selection process.

# Failed experiments do not always fail toward the null

There is a common argument among psychologists that null results are uninformative. Part of this is the logic of NHST – failure to reject the null is not the same as confirmation of the null. Which is an internally valid statement, but ignores the fact that studies with good power also have good precision to estimate effects.

However there is a second line of argument which is more procedural. The argument is that a null result can happen when an experimenter makes a mistake in either the design or execution of a study. I have heard this many times; this argument is central to an essay that Jason Mitchell recently posted arguing that null replications have no evidentiary value. (The essay said other things too, and has generated some discussion online; see e.g., Chris Said’s response.)

The problem with this argument is that experimental errors (in both design and execution) can produce all kinds of results, not just the null. Confounds, artifacts, failures of blinding procedures, demand characteristics, outliers and other violations of statistical assumptions, etc. can all produce non-null effects in data. When it comes to experimenter error, there is nothing special about the null.

Moreover, we commit a serious oversight when we use substantive results as the sole evidence of procedures. Say that the scientific hypothesis is that X causes Y. So we design an experiment with an operationalization of X, O_X, and an operationalization of Y, O_Y. A “positive” result tells us O_X -> O_Y. But unless we can say something about the relationships between O_X and X and between O_Y and Y, the result tells us nothing about X and Y.

We have a well established framework for doing that with measurements: construct validation. We expect that measures can and should be validated independent of results to document that Y -> O_Y (convergent validity) and P, Q, R, etc. !-> O_Y (discriminant validity). We have papers showing that measurement procedures are generally valid (in fact these are some of our most-cited papers!). And we typically expect papers that apply previously-established measurement procedures to show that the procedure worked in a particular sample, e.g. by reporting reliability, factor structure, correlations with other measures, etc.

Although we do not seem to publish as many validation papers on experimental manipulations as on measurements, the logic of validation applies just as well. We can obtain evidence that O_X -> X, for example by showing that experimental O_X affects already-established measurements O_X2, O_X3, etc. And in a sufficiently powered design we can show that O_X does not meaningfully influence other variables that are known to affect Y or O_Y. Just as with measurements, we can accumulate this evidence in systematic investigations to show that procedures are generally effective, and then when labs use the procedures to test substantive hypotheses they can run manipulation checks to show that they are executing a procedure correctly.

Programmatic validation is not always necessary — some experimental procedures are so face-valid that we are willing to accept that O_X -> X without a validation study. Likewise for some measurements. That is totally fine, as long as there is no double standard. But in situations where we would be willing to question whether a null result is informative, we should also be willing to question whether a non-null is. We need to evaluate methods in ways that do not depend on whether those methods give us results we like — for experimental manipulations and measurements alike.

# Some thoughts on replication and falsifiability: Is this a chance to do better?

Most psychologists would probably endorse falsification as an important part of science. But in practice we rarely do it right. As others have observed before me, we do it backwards. Instead of designing experiments to falsify the hypothesis we are testing, we look for statistical evidence against a “nil null” — the point prediction that the true effect is zero. Sometimes the nil null is interesting, sometimes it isn’t, but it’s almost never a prediction from the theory that we are actually hoping to draw conclusions about.

The more rigorous approach is to derive a quantitative prediction from a theory. Then you design an experiment where the prediction could fail if the theory is wrong. Statistically speaking, the null hypothesis should be the prediction from your theory (“when dropped, this object will accelerate toward the earth at 9.8 m/s^2”). Then if a “significant” result tells you that the data are inconsistent with the theory (“average measured acceleration was 8.6 m/s^2, which differs from 9.8 at p < .05”), you have to either set aside the theory itself or one of the supporting assumptions you made when you designed the experiment. You get some leeway to look to the supporting assumptions (“oops, 9.8 assumes no wind resistance”), but not endless leeway: if the predictions keep failing, eventually you have to face facts and walk away from your theory. On the flip side, a theory is corroborated when it survives many risky opportunities to fail.
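To make the logic concrete, here is a rough sketch of a test whose null hypothesis is the theory's prediction instead of zero. Everything in it is hypothetical: the measurements are simulated, the function name is made up, and the p-value uses a large-sample normal approximation (with a small sample a t distribution would be the more appropriate reference).

```python
import math
import random
from statistics import NormalDist, mean, stdev


def compare_to_prediction(measurements, predicted):
    """Test H0: the true mean equals the theory's predicted value.

    Returns (estimate, standard_error, z, two_sided_p). The p-value relies on
    a normal approximation, which is an assumption of this sketch.
    """
    n = len(measurements)
    estimate = mean(measurements)
    standard_error = stdev(measurements) / math.sqrt(n)
    z = (estimate - predicted) / standard_error
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return estimate, standard_error, z, p


rng = random.Random(4)  # arbitrary seed (an assumption), only for repeatability
# Hypothetical data: 50 noisy acceleration measurements centered on 8.6 m/s^2.
measurements = [rng.gauss(8.6, 1.0) for _ in range(50)]

estimate, standard_error, z, p = compare_to_prediction(measurements, predicted=9.8)
print(f"estimate = {estimate:.2f} m/s^2 (SE = {standard_error:.2f})")
print(f"z against the predicted 9.8 = {z:.2f}, two-sided p = {p:.4f}")
```

Note the reversal from the usual setup: a small p here counts against the theory (or against one of its supporting assumptions), and a theory earns credit by repeatedly surviving this kind of test.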

The problem in psychology (and many other sciences, including quite a bit of biology and medicine) is that our theories rarely make specific enough quantitative predictions to do hypothesis testing the “right” way. Few of our theories lead to anything remotely close to “g = 9.8 m/s^2” in specificity. People sometimes suggest this is a problem with psychologists’ acumen as theorists. I am more inclined to think it is a function of being a young science and having chosen very difficult problems to solve. So in the grand scheme, I don’t think we should self-flagellate too much about being poor theorists or succumb to physics envy. Most of the time I am inclined to agree with people like Paul Rozin (who was agreeing with Solomon Asch) and William McGuire that instead we need to adapt our approach to our scientific problems and current state of knowledge, rather than trying to ape a caricature of “hard” science. That requires changes in how we do science: we need more exploration and discovery to accumulate interesting knowledge about our phenomena, and we need to be more modest and conditional in our theories. It would be a mistake to say we need to simply double down on the caricature.

So with all this being said, there is something really interesting and I think under-appreciated about the recent movement toward replication, and it is this: This may be a great opportunity to do falsification better.

## The repeatability theory

Every results section says some version of, “We did this experiment and we observed these results.”[1] It is a specific statement about something that happened in the past. But hand-in-hand with that statement is, implicitly, another claim: “If someone does the same experiment again, they will get the same results.” The second claim is a mini-theory: it is a generalization of the first claim. Call it the repeatability theory. Every experimental report comes with its own repeatability theory. It is a necessary assumption of inferential statistics. And if we did not make it, we would be doing history rather than science.

And here’s the thing: the repeatability theory is very falsifiable. The rigorous, strong kind of falsifiable. We just need to clarify what it means to (A) do the same experiment again and (B) observe the same or different results.

Part B is a little easier. “The same results” does not mean exactly the same results to infinite precision. It means “the same results plus or minus error.” The hypothesis is that Experiment 1 (the original) and Experiment 2 (the replication) are observations with error of the same underlying effect, so any observed differences between experiments are just noise. If you are using NHST[2] that leads to a straightforward “strong” null hypothesis: effectsize_1 = effectsize_2. If you have access to all the raw data, you can combine both experiments into a single dataset, create an indicator variable for which study the effect came from, and test the interaction of that indicator with the effect. The null hypothesis is no interaction, which sounds like the old fashioned nil-null but in fact “interaction = 0” is the same as saying the effects are equal, which is the very specific quantitative hypothesis derived from the repeatability theory. If you don’t have the raw data, don’t despair. You can calculate an effect from each experiment and then compare them, like with a test of independent correlations. You can and should also estimate the difference between effects (effectsize_1 - effectsize_2) and an associated confidence interval. That difference is itself an effect size: it quantifies whatever difference there is between the studies, and can tell you if the difference is large or trivial.
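As one way to carry out the no-raw-data version, here is a sketch of the standard Fisher r-to-z comparison of two independent correlations. The function name and the example numbers are hypothetical, and the confidence interval it returns is for the difference on the Fisher z scale, not on the raw correlation scale.

```python
import math
from statistics import NormalDist


def compare_correlations(r1, n1, r2, n2, alpha=0.05):
    """Test H0: the two studies estimate the same population correlation.

    Uses Fisher's r-to-z transformation for independent samples.
    Returns (z_stat, two_sided_p, ci), where ci is a (1 - alpha) confidence
    interval for the difference between the Fisher-transformed correlations.
    """
    z1 = math.atanh(r1)  # Fisher transform of the original effect
    z2 = math.atanh(r2)  # Fisher transform of the replication effect
    se_diff = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    diff = z1 - z2
    z_stat = diff / se_diff
    p = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - crit * se_diff, diff + crit * se_diff)
    return z_stat, p, ci


# Hypothetical numbers: a small original study and a larger replication.
z_stat, p, ci = compare_correlations(r1=0.45, n1=40, r2=0.10, n2=200)
print(f"z = {z_stat:.2f}, two-sided p = {p:.3f}")
print(f"95% CI for the difference (Fisher z units): [{ci[0]:.2f}, {ci[1]:.2f}]")
```

The interval is the part to look at first: a wide interval that includes both zero and large differences means the pair of studies cannot yet say much about the repeatability theory either way.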

Part A, “do the same experiment again,” is more complicated. Literalists like to point out that you will never be in the same room, with the same weather outside, with the same RA wearing the same shirt, etc. etc. They are technically right about all of that.[3]

But the realistic answer is that “the same experiment” just has to repeat the things that matter. “What matters” has been the subject of some discussion recently, for example in a published commentary by Danny Kahneman and a blog post by Andrew Wilson. In my thinking you can divide “what matters” into 3 categories: the original researchers’ specification of the experiment, technical skills in the methods used, and common sense. The onus is on the original experimenter to be able to tell a competent colleague what is necessary to repeat the experiment. In the old days of paper journals and page counts, it was impossible for most published papers to do this completely and you needed a lot of backchannel communication. With online supplements the gap is narrowing, but I still think it can’t hurt for a replicator to reach out to an original author. (Though in contrast to Kahneman, I would describe this as a methodological best practice, neither a matter of etiquette nor an absolute requirement.) If researchers say they do not know what conditions are necessary to produce an effect, that is no defense. It should undermine our faith in the original study. Don’t take my word for it, here’s Sir Karl (whose logic is better than his language – this is [hopefully obviously] limited neither to men nor physicists):

Every experimental physicist knows those surprising and inexplicable apparent ‘effects’ which in his laboratory can perhaps even be reproduced for some time, but which finally disappear without trace. Of course, no physicist would say in such a case that he had made a scientific discovery (though he might try to rearrange his experiments so as to make the effect reproducible). Indeed the scientifically significant physical effect may be defined as that which can be regularly reproduced by anyone who carries out the appropriate experiment in the way prescribed. No serious physicist would offer for publication, as a scientific discovery, any such ‘occult effect,’ as I propose to call it – one for whose reproduction he could give no instructions. (Karl Popper, The Logic of Scientific Discovery, pp. 23-24)

## Interpreting results

What happens when the data are inconsistent with the repeatability theory – original != replication? As with all empirical results, we have to consider multiple interpretations. This is true in all of science and has been recognized for a long time; replications are not special in this regard. An observed discrepancy between the original result and a replication[4] is an empirical finding that needs to be interpreted like any other empirical finding. However, a few issues come up commonly in interpreting replications:

First vs. latest. There is nothing special about an experiment being either the first or the latest, ceteris paribus. However ceteris is rarely paribus. If the replication has more power or if the scientific community gets to see its results through a less biased process than the original (e.g., due to pre-registration or a results-independent publication process), those things should give it more weight.

Technical skills. A technical analysis of the methods used and labs’ track records with them is appropriate. I am not much swayed by broad appeals to experimental “artistry.” Instead, I find these interpretations more persuasive when someone can put forward a plausible candidate for something important in the original that is not easy to standardize or carry off without specific skills. For example, a computer-administered experiment is possible to standardize and audit (and in some cases the code and digital stimuli can be reproduced exactly). But an experiment that involves confederates or cover stories might be harder to pull off for a lab that does not do that routinely. When that is the case, manipulation checks, lab visits/exchanges (in person or through video), and other validation procedures become important.

Moderators. Replications can never reproduce every single aspect of the original study. They do their best to reproduce everything that the original specification, technical knowledge, and common sense say should matter. But they can and will still depart from original studies in any number of ways: the subject pool being drawn from, the local social and cultural context, procedural changes made for practical reasons, etc. When the replication departs substantially from the original, it is fair to consider possible moderators. But moderator interpretations are nearly always post hoc, and should be weighed accordingly until we have more data.

I think it’s also important to point out that the possibility of unanticipated moderators is not a problem with replications; rather, if you are interested in discovery it is a very good reason to run them. Consider a hypothetical example from a recent blog post by Tim Wilson: a study originally run in the laboratory that produces a smaller effect in an online replication. Wilson imagines this is an outcome that a replicator with improbable amounts of both malevolence and prescience might arrange on purpose. But a far more likely scenario is that if the original specification, technical knowledge, and common sense all say that offline-online shouldn’t matter but it turns out that it does, that could actually be a very interesting discovery! People are living more of their lives online, and it is important to know how social cognition and behavior work in virtual spaces. And a discovery like that might also save other scientists a lot of wasted effort and resources, if for example they thought the experiment would work online and planned to run replicate-and-extend studies or adapt parts of the original procedure for new studies. In the end, Wilson’s example of replication gone wrong looks more like a useful discovery.

## Discovery and replication need each other

Discovery and replication are often contrasted with each other. Discovery is new and exciting; replication is dull “duplication.” But that is silly. Replication separates real discoveries from noise-surfing, and as just noted it can itself lead to discoveries. We can and should do both. And not just in some sort of division of labor arrangement, but in an integrated way as part of our science. Exciting new discoveries need to be replicated before we take them as definitive. Replication within and between labs should be routine and normal.

An integrated discovery-replication approach is also an excellent way to build theories. Both Rozin and McGuire criticize psychology’s tendency to equate “theory” with broad, decontextualized statements – pronouncements that almost invariably get chipped away in subsequent studies as we discover moderators and boundary conditions. This kind of “overclaim first, then back away slowly” approach supports the hype cycle and means that a tendency to make incorrect statements is baked in to our research process. Instead, Rozin wants us to accumulate interesting descriptive facts about the phenomena we are studying; McGuire wants us to study how effects vary over populations and contexts. A discovery-replication approach allows us to do both of these things. We can use discovery-oriented exploratory research to derive truly falsifiable predictions to then be tested. That way we will amass a body of narrow but well-corroborated theoretical statements (the repeatability theories) to assemble into bigger theories from the foundations up, rather than starting with bold pronouncements. We will also build up knowledge about quantitative estimates of effects, which we can use to start to make interval and even point predictions. That kind of cumulative science is likely to generate fewer sexy headlines in the short run, but it will be a whole lot more durable.

-----

1. I am using “experiment” in the very broad sense here of a structured scientific observation, not the more limited sense of a study that involves randomized manipulation by an experimenter.[5]

2. I’m sure the Bayesians have an answer for the statistical problem too. It is probably a good one. But c’mon, this is a chance to finally do NHST right!

3. Literalists also like to say it’s a problem that you will never have the exact same people as subjects again. They are technically wrong about that being a problem. “Drawing a sample” is part of what constitutes the experiment. But pressing this point will get you into an argument with a literalist over a technicality, which is never fun, so I suggest letting it drop.

4. “Discrepancy” = “failed replication” in the parlance of our time, but I don’t like that phrase. Who/what failed? Totally unclear, and the answer may be nobody/nothing.

5. I am totally ripping this footnote thing off of Simine Vazire but telling myself I’m ripping off David Foster Wallace.

# Does the replication debate have a diversity problem?

Folks who do not have a lot of experiences with systems that don’t work well for them find it hard to imagine that a well intentioned system can have ill effects. Not work as advertised for everyone. That is my default because that is my experience.
– Bashir, Advancing How Science is Done

A couple of months ago, a tenured white male professor* from an elite research university wrote a blog post about the importance of replicating priming effects, in which he exhorted priming researchers to “Nut up or shut up.”

Just today, a tenured white male professor* from an elite research university said that a tenured scientist who challenged the interpretation and dissemination of a failed replication is a Rosa Parks, “a powerless woman who decided to risk everything.”

Well then.

The current discussion over replicability and (more broadly) improving scientific integrity and rigor is an absolutely important one. It is, at its core, a discussion about how scientists should do science. It therefore should include everybody who does science or has a stake in science.

Yet over the last year or so I have heard a number of remarks (largely in private) from scientists who are women, racial minorities, and members of other historically disempowered groups that they feel like the protagonists in this debate consist disproportionately of white men with tenure at elite institutions. Since the debate is over prescriptions for how science is to be done, it feels a little bit like the structurally powerful people shouting at each other and telling everybody else what to do.

By itself, that is enough to make people with a history of being disempowered wonder if they will be welcome to participate. And when the debate is salted with casually sexist language, and historically illiterate borrowing of other people’s oppression to further an argument, well, that’s going to hammer the point home.

This is not a call for tenured white men to step back from the conversation. Rather, it is a call to bring more people in. Those of us who are structurally powerful in various ways have a responsibility to make sure that people from all backgrounds, all career stages, and all kinds of institutions are actively included and feel safe and welcome to participate. Justice demands it. That’s enough for me, but if you need a bonus, consider that including people with personal experience seeing well-intentioned systems fail might actually produce a better outcome.

-----

* The tenured and professor parts I looked up. White and male I inferred from social presentation.

# What is counterintuitive?

Simine Vazire has a great post contemplating how we should evaluate counterintuitive claims. For me that brings up the question: what do we mean when we say something is “counterintuitive?”

First, let me say what I think counterintuitive isn’t. The “intuitive” part points to the fact that when we label something counterintuitive, we are usually not talking about contradicting a formal, well-specified theory. For example, you probably wouldn’t say that the double-slit experiment was “counterintuitive;” you’d say it falsified classical mechanics.

In any science, though, you have areas of inquiry where there is not an existing theory that makes precise predictions. In social and personality psychology that is the majority of what we are studying. (But it’s true in other sciences too, probably more than we appreciate.) Beyond the reach of formal theory, scientists develop educated guesses, hunches, and speculations based on their knowledge and experience. So the “intuitive” in counterintuitive could refer to the intuitions of experts.

But in social and personality psychology we study phenomena that regular people reflect on and speculate about too. A connection to everyday lived experience is almost definitional to our field, whether you think it is something that we should actively pursue or just inevitably creeps in. So we have an extra source of intuitions – the intuitions of people who are living the phenomena that we study. Which includes ourselves, since social and personality psychologists are all human beings too.

And when you are talking about something that (a) people reflect on and wonder about and (b) is not already well settled, then chances are pretty good that people have had multiple, potentially contradictory ideas about it. Sometimes different people having different ideas; sometimes the same person having different ideas at different times. The contradictory ideas might even have made their way into cultural wisdom – like “birds of a feather flock together” versus “opposites attract.”

What I suspect that means is that “counterintuitive” is often just a rhetorical strategy for writing introduction sections and marketing our work. No matter how your results turned out, you can convince your audience that they once thought the opposite. Because chances are very good that they did. A skilled writer can exploit the same mechanisms that lead to hindsight bias to set people up, and then surprise! show them that the results went the other way.

I would not claim that this describes all instances of counterintuitive, but I think it describes a lot of them. As Simine points out, many people in psychology say that counterintuitive findings are more valuable — so clearly there is an incentive to frame things that way. (Counterintuitive framing is also a great way to sell a lot of books.)

Of course, it does not have to be that way. After all, we are the field that specializes in measuring and explaining people’s intuitions. Why don’t we ask our colleagues to back up their claims of being “counterintuitive” with data? Describe the procedure fully and neutrally to a group of people (experts or nonexperts, depending whose intuitions you want to claim to be counter to) and ask what they think will happen. Milgram famously did that with his obedience experiments.

We should also revisit why we think “counterintuitive” is valuable. Sometimes it clearly is. For example, when intuition systematically leads people to make consequentially bad decisions it can be important to document that and understand why. But being counterintuitive for counterintuitive’s sake? If intuitions vary widely — and so do results, across contexts and populations — then we run the risk that placing too much value on counterintuitive findings will do more to incentivize rhetorical flash than substantive discoveries.

# What did Malcolm Gladwell actually say about the 10,000 hour rule?

A new paper out in Intelligence, from a group of authors led by David Hambrick, is getting a lot of press coverage for having “debunked” the 10,000-hour rule discussed in Malcolm Gladwell’s book Outliers. The 10,000-hour rule is — well, actually, that’s the point of this post: Just what, exactly, is the 10,000-hour rule?

The debate in Intelligence is between Hambrick et al. and researcher K. Anders Ericsson, who studies deliberate practice and expert performance (and wrote a rejoinder to Hambrick et al. in the journal). But Malcolm Gladwell interpreted Ericsson’s work in a popular book and popularized the phrase “the 10,000-hour rule.” And most of the press coverage mentions Gladwell.

Moreover, Gladwell has been the subject of a lot of discussion lately about how he interprets research and presents his conclusions. The 10,000-hour rule has become a runaway meme — there’s even a Macklemore song about it. And if you google it, you’ll find a lot of people talking about it and trying to apply it to their lives. The interpretations aren’t always the same, suggesting there’s been some interpretive drift in what people think the 10,000-hour rule really is. I read Outliers shortly after it came out, but my memory of it has probably been shaped by all of that conversation that has happened since. So I decided it would be interesting to go back to the source and take another look at what Gladwell actually said.

“The 10,000-Hour Rule” is the title of a chapter in Outliers. It weaves together a bunch of stories of how people became wildly successful. The pivotal moment where Gladwell lays out his thesis, the nut graf if you will, is this:

“For almost a generation, psychologists around the world have been engaged in a spirited debate over a question that most of us would consider to have been settled years ago. The question is this: is there such a thing as innate talent? The obvious answer is yes. Not every hockey player born in January ends up playing at the professional level. Only some do—the innately talented ones. Achievement is talent plus preparation. The problem with this view is that the closer psychologists look at the careers of the gifted, the smaller the role innate talent seems to play and the bigger the role preparation seems to play.” (pp. 37-38)

This is classic Gladwell style — setting up the conventional wisdom and then knocking it down. You might think X, but I’m going to show you it’s really not-X. In this case, what is the X that you might think? That there is such a thing as talent and that it matters for success. And Gladwell is promising to challenge that view. Zoom in and it’s laid bare:

“Achievement is talent plus preparation. The problem with this view…”

Some Gladwell defenders have claimed he was just saying that talent isn’t enough by itself and preparation matters too. But that would be a pretty weak assertion for a bestselling book. I mean, who doesn’t think that violin prodigies or hockey players need to practice? And it is clear Gladwell is going for something more extreme than that. “Achievement is talent plus preparation” is not Gladwell’s thesis. To the contrary, that is the conventional wisdom that Gladwell is promising to overturn.

Gladwell then goes on to tell a bunch of stories of successful people who practiced a lot lot lot before they became successful. But that line of argument can only get you so far. Preparation and talent are not mutually exclusive. So saying “preparation matters” over and over really tells you nothing about whether talent matters too. And the difficulty for Gladwell is that, try as he might, he cannot avoid acknowledging a place for talent too. To deny that talent exists and matters would be absurd in the face of both common sense and hard data. And Gladwell can’t go that far:

“If we put the stories of hockey players and the Beatles and Bill Joy and Bill Gates together, I think we get a more complete picture of the path to success. Joy and Gates and the Beatles are all undeniably talented. Lennon and McCartney had a musical gift of the sort that comes along once in a generation, and Bill Joy, let us not forget, had a mind so quick that he was able to make up a complicated algorithm on the fly that left his professors in awe. That much is obvious.” (p. 55)

So “a more complete picture of the path to success” says that talent exists and it matters — a lot. It is actually a big deal if you have a “gift of the sort that comes along once in a generation.” So we are actually back to the conventional wisdom again: Achievement is talent plus preparation. Sure, Gladwell emphasizes the preparation piece in his storytelling. But that difference in emphasis tells us more about what is easier to narrate (nobody is ever going to make an 80’s-style montage about ACE models) than about which is actually the stronger cause. So after all the stories, it looks an awful lot like the 10,000-hour rule is just the conventional wisdom after all.

But wait! In the very next paragraph…

“But what truly distinguishes their histories is not their extraordinary talent but their extraordinary opportunities.” (p. 55)

“Opportunities” doesn’t sound like talent *or* preparation. What’s that about?

This, I think, has been missing from a lot of the popular discussion about the 10,000-hour rule. Narrowly, the 10,000-hour rule is about talent and preparation. But that overlooks the emphasis in Outliers on randomness and luck — being in the right place at the right time. So you might expand the formula: “Achievement is talent plus preparation plus luck.”

Only Gladwell wants his conclusion to be simpler than the conventional wisdom, not more complicated. So he tries to equate luck with preparation, or more precisely with the opportunity to prepare. Be born in the right era, live in the right place, and maybe you’ll get a chance to spend 10,000 hours getting good at something.

The problem with simplifying the formula rather than complicating it is that you miss important things. Gladwell’s point is that you need opportunities to prepare — you can’t become a computer whiz unless you have access to a computer to tinker with (10,000 hours worth of access, to be precise). He notes that a lot of wealthy and famous computer innovators, like Bill Gates, Paul Allen, and Steve Jobs, were born in 1954 or 1955. So when personal computing took off they were just the right age to get to mess around with computers: old enough to start businesses, young enough and unattached enough to have the time to sink into something new and uncertain. Gladwell concludes that the timing of your birth is a sort of cosmically random factor that affects whether you’ll be successful.

But not all opportunities are purely random — in many domains, opportunities are more likely to come to people who are talented or prepared or both. If you show some early potential and dedication to hockey or music, people are more likely to give you a hockey stick or a violin. Sure, you have to live in a time and place where hockey sticks or violins exist, but there’s more to it than that.

And let us not forget one of the most important ways that people end up in the right place at the right time: privilege (turns out Macklemore has a song about that too). That Gates, Allen, and Jobs were all born in 1954-55 may be random in some cosmic sense. But the fact that they are all white dudes from America suggests some sort of pattern, at least to me. Gladwell tells a story about how Bill Hewlett gave a young Steve Jobs spare computer parts to tinker with. The story is told like it’s a lucky opportunity for Jobs, and in a sense it is. But I wonder what would have happened if a poor kid from East Palo Alto had asked Hewlett for the same thing.

So now we are up to 4 things: talent, preparation, luck, and privilege. They all matter, they all affect each other, and I am sure we could add to the list. And you could go even deeper and start questioning the foundations of how we have carved up our list of variables (just what do we mean by “innate talent” anyway, and is it the same thing — innate in the same way — for everybody?). That would be an even more complete picture of the path to success. Not an easy story to tell, I know, but maybe a better one.

# In which we admire a tiny p with complete seriousness

A while back a colleague forwarded me this quote from Stanley Schachter (yes, that Stanley Schachter):

“This is a difference which is significant at considerably better than the p < .0001 level of confidence. If, in reeling off these zeroes, we manage to create the impression of stringing pearls on a necklace, we rather hope the reader will be patient and forbearing, for it has been the very number of zeros after this decimal point that has compelled us to treat these data with complete seriousness.”

The quote comes from a chapter on birth order in Schachter’s 1959 book The Psychology of Affiliation. The analysis was a chi-square test on 76 subjects. The subjects were selected from 3 different experiments for being “truly anxious” and combined for this analysis. True anxiety was determined if the subject scored at one or the other extreme endpoint of an anxiety scale (both complete denial and complete admission were taken to mean that the subject is “truly anxious”), and/or if the subject discontinued participation because the experiment made them feel too anxious.

# Let’s talk about diversity in personality psychology

In the latest issue of the ARP newsletter, Kelci Harris writes about diversity in ARP. You should read the whole thing. Here’s an excerpt:

Personality psychology should be intrinsically interesting to everyone, because, well, everyone has a personality. It’s accessible and that makes our research so fun and an easy thing to talk about with non-psychologists, that is, once we’ve explained to them what we actually do. However, despite what could be a universal appeal, our field is very homogenous. And that’s too bad, because diversity makes for better science. Good research comes from observations. You notice something about the world, and you wonder why that is. It’s probably reasonable to guess that most members of our field have experienced the world in a similar way due to their similar demographic backgrounds. This similarity in experience presents a problem for research because it makes us miss things. How can assumptions be challenged when no one realizes they are being made? What kind of questions will people from different backgrounds have that current researchers could never think of because they haven’t experienced the world in that way?

In response, Laura Naumann posted a letter to the ARP Facebook wall. Read it too. Another excerpt:

I challenge our field to begin to view those who conduct this type of research [on underrepresented groups] as contributing work that is EQUAL TO and AS IMPORTANT AS “traditional” basic research in personality and social psychology. First, this will require editors of “broad impact” journals to take a critical eye to their initial review process in evaluating what manuscripts are worthy of being sent out to reviewers. I’ve experienced enough frustration sending a solid manuscript to a journal only to have it quickly returned praising the work, but suggesting resubmission to a specialty journal (e.g., ethnic minority journal du jour). The message I receive is that my work is not interesting enough for broad dissemination. If we want a more welcoming field on the personal level, we need to model a welcoming field at the editorial level.

This is a discussion we need to be having. Big applause to Kelci and Laura for speaking out.

Now, what should we be doing? Read what Kelci and Laura wrote — they both have good ideas.

I’ll add a much smaller one, which came up in a conversation on my Facebook wall: let’s collect data. My impressions of what ARP conferences look like are very similar to Kelci’s, but not all important forms of diversity are visible, and if we had hard data we wouldn’t have to rely on impressions. How are the members and conference attendees of ARP and other personality associations distributed by racial and ethnic groups, gender, sexual orientation, national origin, socioeconomic background, and other important dimensions? How do those break down by career stage? And if we collect data over time, is better representation moving up the career ladder, or is the pipeline leaking? I hope ARP will consider collecting this data as part of the membership and conference registration processes going forward, and releasing aggregate numbers. (Maybe they already collect this, but if so, I cannot recall ever seeing any report of it.) With data we will have a better handle on what we’re doing well and what we could be doing better.

What else should we be doing — big or small? This is a conversation that is long overdue and that everybody should be involved in. Let’s have it.

# An interesting study of why unstructured interviews are so alluring

A while back I wrote about whether grad school admissions interviews are effective. Following up on that, Sam Gosling recently passed along an article by Dana, Dawes, and Peterson from the latest issue of Judgment and Decision Making:

Belief in the unstructured interview: The persistence of an illusion

Unstructured interviews are a ubiquitous tool for making screening decisions despite a vast literature suggesting that they have little validity. We sought to establish reasons why people might persist in the illusion that unstructured interviews are valid and what features about them actually lead to poor predictive accuracy. In three studies, we investigated the propensity for “sensemaking” – the ability for interviewers to make sense of virtually anything the interviewee says—and “dilution” – the tendency for available but non-diagnostic information to weaken the predictive value of quality information. In Study 1, participants predicted two fellow students’ semester GPAs from valid background information like prior GPA and, for one of them, an unstructured interview. In one condition, the interview was essentially nonsense in that the interviewee was actually answering questions using a random response system. Consistent with sensemaking, participants formed interview impressions just as confidently after getting random responses as they did after real responses. Consistent with dilution, interviews actually led participants to make worse predictions. Study 2 showed that watching a random interview, rather than personally conducting it, did little to mitigate sensemaking. Study 3 showed that participants believe unstructured interviews will help accuracy, so much so that they would rather have random interviews than no interview. People form confident impressions even interviews are defined to be invalid, like our random interview, and these impressions can interfere with the use of valid information. Our simple recommendation for those making screening decisions is not to use them.

It’s an interesting study. In my experience people’s beliefs in unstructured interviews are pretty powerful — hard to shake even when you show them empirical evidence.

I did have some comments on the design and analyses:

1. In Studies 1 and 2, each subject made a prediction about absolute GPA for 1 interviewee. So estimates of how good people are at predicting GPA from interviews are based on entirely between-subjects comparisons. It is very likely that a substantial chunk of the variance in predictions will be due to perceiver variance — differences between subjects in their implicit assumptions about how GPA is distributed. (E.g., Subject 1 might assume most GPAs range from 3 to 4, whereas Subject 2 assumes most GPAs range from 2.3 to 3.3. So even if they have the same subjective impression of the same target — “this person’s going to do great this term” — their numerical predictions might differ by a lot.) That perceiver variance would go into the denominator as noise variance in this study, lowering the interviewers’ predictive validity correlations.

Whether that’s a good thing or a bad thing depends on what situation you’re trying to generalize to. Perceiver variance would contribute to errors in judgment when each judge makes an absolute decision about a single target. On the other hand, in some cases perceivers make relative judgments about several targets, such as when an employer interviews several candidates and picks the best one. In that setting, perceiver variance would not matter, and a study with this design could underestimate accuracy.
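To make the perceiver-variance argument concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than anything taken from the paper: the GPA distribution, the noise levels, and the function names are made up. Each simulated judge forms a noisy but valid impression of a target and then shifts the numerical prediction by a judge-specific offset, standing in for that judge's private assumption about where GPAs fall.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def between_subjects_r(n_judges=20000, perceiver_sd=0.0, seed=1):
    """Each judge predicts GPA for ONE target (absolute judgment)."""
    rng = random.Random(seed)
    actual, predicted = [], []
    for _ in range(n_judges):
        gpa = rng.gauss(3.0, 0.4)                # hypothetical true GPA
        impression = gpa + rng.gauss(0, 0.4)     # valid but noisy impression
        offset = rng.gauss(0, perceiver_sd)      # judge's idiosyncratic scale use
        actual.append(gpa)
        predicted.append(impression + offset)
    return pearson(actual, predicted)


def pick_best_accuracy(n_judges=20000, perceiver_sd=0.0, seed=2):
    """Each judge compares TWO targets and picks the higher (relative judgment)."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_judges):
        offset = rng.gauss(0, perceiver_sd)      # same offset applies to both targets
        gpa_a, gpa_b = rng.gauss(3.0, 0.4), rng.gauss(3.0, 0.4)
        pred_a = gpa_a + rng.gauss(0, 0.4) + offset
        pred_b = gpa_b + rng.gauss(0, 0.4) + offset
        correct += (pred_a > pred_b) == (gpa_a > gpa_b)
    return correct / n_judges


r_without = between_subjects_r(perceiver_sd=0.0)
r_with = between_subjects_r(perceiver_sd=0.5)
acc_without = pick_best_accuracy(perceiver_sd=0.0)
acc_with = pick_best_accuracy(perceiver_sd=0.5)

print(f"absolute judgments: r = {r_without:.2f} without vs {r_with:.2f} with perceiver variance")
print(f"relative judgments: accuracy = {acc_without:.2f} without vs {acc_with:.2f} with")
```

Under these made-up parameters the between-subjects correlation should land near .71 without perceiver variance and near .53 with it, while the pick-the-better-candidate accuracy is essentially unchanged, because a judge's offset is added to both candidates and cancels out of the comparison.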

2. Study 1 had 76 interviewers spread across 3 conditions (n = 25 or 26 per condition), and only 7 interviewees (each of whom was rated by multiple interviewers). Based on 73 degrees of freedom reported for the test of the “dilution” effect, it looks like they treated interviewer as the unit of analysis but did not account for the dependency in interviewees. Study 2 looked to have similar issues (though in Study 2 the dilution effect was not significant).

3. I also had concerns about power and precision of the estimates. Any inferences about who makes better or worse predictions will depend a lot on variance among the 7 interviewees whose GPAs were being predicted (8 interviewees in Study 2). I haven’t done a formal power analysis, but my intuition is that that’s pretty small. You can see a possible sign of this in one key difference between the studies. In Study 1, the correlation between the interviewees’ prior GPA and upcoming GPA was r = .65, but in Study 2 it was r = .37. That’s a pretty big difference between estimates of a quantity that should not be changing between studies.
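As a rough check on the precision worry in comment 3, here is a sketch of approximate 95% confidence intervals for the two prior-GPA correlations, using the standard Fisher z transformation. It is not a substitute for a formal power analysis. It assumes the correlations were computed across the 7 and 8 interviewees, and that the usual bivariate-normal approximation is tolerable at sample sizes this small (it is shaky there). The function name `fisher_ci` is my own.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate CI for a Pearson r via the Fisher z transformation.

    Assumes bivariate normality and independent observations; with n this
    small the approximation is rough, so treat the result as indicative only.
    """
    z = math.atanh(r)                 # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)       # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


ci_study1 = fisher_ci(0.65, 7)   # Study 1: r = .65, assumed n = 7 interviewees
ci_study2 = fisher_ci(0.37, 8)   # Study 2: r = .37, assumed n = 8 interviewees

print("Study 1: r = .65, 95%% CI [%.2f, %.2f]" % ci_study1)
print("Study 2: r = .37, 95%% CI [%.2f, %.2f]" % ci_study2)
```

Under those assumptions the intervals run from roughly -.20 to .94 for Study 1 and from roughly -.45 to .85 for Study 2. Both include zero and they overlap almost entirely, so a gap between .65 and .37 is well within what sampling noise alone would produce with so few interviewees.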

So it’s an interesting study but not one that can give answers I’d call definitive. If that’s well understood by readers of the study, I’m okay with that. Maybe someone will use the interesting ideas in this paper as a springboard for a larger followup. Given the ubiquity of unstructured interviews, it’s something we need to know more about.

# The hotness-IQ tradeoff in academia

The other day I came across a blog post ranking academic fields by hotness. Important data for sure. But something about it was gnawing on me for a while, some connection I wasn’t quite making.

And then it hit me. The rankings looked an awful lot like another list I’d once seen of academic fields ranked by intelligence. Only, you know, upside-down.

Sure enough, when I ran the correlation among the fields that appear on both lists, it came out at r = -.45.

I don’t know what this means, but it seems important. Maybe a mathematician or computer scientist can help me understand it.