Simine Vazire has a great post contemplating how we should evaluate counterintuitive claims. For me that brings up the question: what do we mean when we say something is “counterintuitive?”
First, let me say what I think counterintuitive isn’t. The “intuitive” part points to the fact that when we label something counterintuitive, we are usually not talking about contradicting a formal, well-specified theory. For example, you probably wouldn’t say that the double-slit experiment was “counterintuitive;” you’d say it falsified classical mechanics.
In any science, though, you have areas of inquiry where there is not an existing theory that makes precise predictions. In social and personality psychology that is the majority of what we are studying. (But it’s true in other sciences too, probably more than we appreciate.) Beyond the reach of formal theory, scientists develop educated guesses, hunches, and speculations based on their knowledge and experience. So the “intuitive” in counterintuitive could refer to the intuitions of experts.
But in social and personality psychology we study phenomena that regular people reflect on and speculate about too. A connection to everyday lived experience is almost definitional to our field, whether you think it is something that we should actively pursue or just inevitably creeps in. So we have an extra source of intuitions – the intuitions of people who are living the phenomena that we study. Which includes ourselves, since social and personality psychologists are all human beings too.
And when you are talking about something that (a) people reflect on and wonder about and (b) is not already well settled, then chances are pretty good that people have had multiple, potentially contradictory ideas about it. Sometimes different people having different ideas; sometimes the same person having different ideas at different times. The contradictory ideas might even have made their way into cultural wisdom – like “birds of a feather flock together” versus “opposites attract.”
What I suspect that means is that “counterintuitive” is often just a rhetorical strategy for writing introduction sections and marketing our work. No matter how your results turned out, you can convince your audience that they once thought the opposite. Because chances are very good that they did. A skilled writer can exploit the same mechanisms that lead to hindsight bias to set people up, and then surprise! show them that the results went the other way.
I would not claim that this describes all instances of counterintuitive, but I think it describes a lot of them. As Simine points out, many people in psychology say that counterintuitive findings are more valuable — so clearly there is an incentive to frame things that way. (Counterintuitive framing is also a great way to sell a lot of books.)
Of course, it does not have to be that way. After all, we are the field that specializes in measuring and explaining people’s intuitions. Why don’t we ask our colleagues to back up their claims of being “counterintuitive” with data? Describe the procedure fully and neutrally to a group of people (experts or nonexperts, depending whose intuitions you want to claim to be counter to) and ask what they think will happen. Milgram famously did that with his obedience experiments.
We should also revisit why we think “counterintuitive” is valuable. Sometimes it clearly is. For example, when intuition systematically leads people to make consequentially bad decisions it can be important to document that and understand why. But being counterintuitive for counterintuitive’s sake? If intuitions vary widely (and so do results, across contexts and populations), then we run the risk that placing too much value on counterintuitive findings will do more to incentivize rhetorical flash than substantive discoveries.
A new paper out in Intelligence, from a group of authors led by David Hambrick, is getting a lot of press coverage for having “debunked” the 10,000-hour rule discussed in Malcolm Gladwell’s book Outliers. The 10,000-hour rule is — well, actually, that’s the point of this post: Just what, exactly, is the 10,000-hour rule?
The debate in Intelligence is between Hambrick et al. and researcher K. Anders Ericsson, who studies deliberate practice and expert performance (and wrote a rejoinder to Hambrick et al. in the journal). But Malcolm Gladwell interpreted Ericsson’s work in a popular book and popularized the phrase “the 10,000-hour rule.” And most of the press coverage mentions Gladwell.
Moreover, Gladwell has been the subject of a lot of discussion lately about how he interprets research and presents his conclusions. The 10,000-hour rule has become a runaway meme — there’s even a Macklemore song about it. And if you google it, you’ll find a lot of people talking about it and trying to apply it to their lives. The interpretations aren’t always the same, suggesting there’s been some interpretive drift in what people think the 10,000-hour rule really is. I read Outliers shortly after it came out, but my memory of it has probably been shaped by all of that conversation that has happened since. So I decided it would be interesting to go back to the source and take another look at what Gladwell actually said.
“The 10,000-Hour Rule” is the title of a chapter in Outliers. It weaves together a bunch of stories of how people became wildly successful. The pivotal moment where Gladwell lays out his thesis, the nut graf if you will, is this:
“For almost a generation, psychologists around the world have been engaged in a spirited debate over a question that most of us would consider to have been settled years ago. The question is this: is there such a thing as innate talent? The obvious answer is yes. Not every hockey player born in January ends up playing at the professional level. Only some do—the innately talented ones. Achievement is talent plus preparation. The problem with this view is that the closer psychologists look at the careers of the gifted, the smaller the role innate talent seems to play and the bigger the role preparation seems to play.” (pp. 37-38)
This is classic Gladwell style: setting up the conventional wisdom and then knocking it down. You might think X, but I’m going to show you it’s really not-X. In this case, what is the X that you might think? That there is such a thing as talent and that it matters for success. And Gladwell is promising to challenge that view. Zoom in and it’s laid bare:
“Achievement is talent plus preparation. The problem with this view…”
Some Gladwell defenders have claimed he was just saying that talent isn’t enough by itself and preparation matters too. But that would be a pretty weak assertion for a bestselling book. I mean, who doesn’t think that violin prodigies or hockey players need to practice? And it is clear Gladwell is going for something more extreme than that. “Achievement is talent plus preparation” is not Gladwell’s thesis. To the contrary, that is the conventional wisdom that Gladwell is promising to overturn.
Gladwell then goes on to tell a bunch of stories of successful people who practiced a lot lot lot before they became successful. But that line of argument can only get you so far. Preparation and talent are not mutually exclusive. So saying “preparation matters” over and over really tells you nothing about whether talent matters too. And the difficulty for Gladwell is that, try as he might, he cannot avoid acknowledging a place for talent too. To deny that talent exists and matters would be absurd in the face of both common sense and hard data. And Gladwell can’t go that far:
“If we put the stories of hockey players and the Beatles and Bill Joy and Bill Gates together, I think we get a more complete picture of the path to success. Joy and Gates and the Beatles are all undeniably talented. Lennon and McCartney had a musical gift of the sort that comes along once in a generation, and Bill Joy, let us not forget, had a mind so quick that he was able to make up a complicated algorithm on the fly that left his professors in awe. That much is obvious.” (p. 55)
So “a more complete picture of the path to success” says that talent exists and it matters, a lot. It is actually a big deal if you have a “gift of the sort that comes along once in a generation.” So we are actually back to the conventional wisdom again: Achievement is talent plus preparation. Sure, Gladwell emphasizes the preparation piece in his storytelling. But that difference in emphasis tells us more about what is easier to narrate (nobody is ever going to make an ’80s-style montage about ACE models) than about which is actually the stronger cause. So after all the stories, it looks an awful lot like the 10,000-hour rule is just the conventional wisdom after all.
But wait! In the very next paragraph…
“But what truly distinguishes their histories is not their extraordinary talent but their extraordinary opportunities.” (p. 55)
“Opportunities” doesn’t sound like talent *or* preparation. What’s that about?
This, I think, has been missing from a lot of the popular discussion about the 10,000-hour rule. Narrowly, the 10,000-hour rule is about talent and preparation. But that overlooks the emphasis in Outliers on randomness and luck: being in the right place at the right time. So you might expand the formula: “Achievement is talent plus preparation plus luck.”
Only Gladwell wants his conclusion to be simpler than the conventional wisdom, not more complicated. So he tries to equate luck with preparation, or more precisely with the opportunity to prepare. Be born in the right era, live in the right place, and maybe you’ll get a chance to spend 10,000 hours getting good at something.
The problem with simplifying the formula rather than complicating it is that you miss important things. Gladwell’s point is that you need opportunities to prepare — you can’t become a computer whiz unless you have access to a computer to tinker with (10,000 hours worth of access, to be precise). He notes that a lot of wealthy and famous computer innovators, like Bill Gates, Paul Allen, and Steve Jobs, were born in 1954 or 1955. So when personal computing took off they were just the right age to get to mess around with computers: old enough to start businesses, young enough and unattached enough to have the time to sink into something new and uncertain. Gladwell concludes that the timing of your birth is a sort of cosmically random factor that affects whether you’ll be successful.
But not all opportunities are purely random — in many domains, opportunities are more likely to come to people who are talented or prepared or both. If you show some early potential and dedication to hockey or music, people are more likely to give you a hockey stick or a violin. Sure, you have to live in a time and place where hockey sticks or violins exist, but there’s more to it than that.
And let us not forget one of the most important ways that people end up in the right place at the right time: privilege (turns out Macklemore has a song about that too). The fact that Gates, Allen, and Jobs were all born in 1954-55 may be random in some cosmic sense. But the fact that they are all white dudes from America suggests some sort of pattern, at least to me. Gladwell tells a story about how Bill Hewlett gave a young Steve Jobs spare computer parts to tinker with. The story is told like it’s a lucky opportunity for Jobs, and in a sense it is. But I wonder what would have happened if a poor kid from East Palo Alto had asked Hewlett for the same thing.
So now we are up to 4 things: talent, preparation, luck, and privilege. They all matter, they all affect each other, and I am sure we could add to the list. And you could go even deeper and start questioning the foundations of how we have carved up our list of variables (just what do we mean by “innate talent” anyway, and is it the same thing — innate in the same way — for everybody?). That would be an even more complete picture of the path to success. Not an easy story to tell, I know, but maybe a better one.
A while back a colleague forwarded me this quote from Stanley Schachter (yes that Stanley Schachter):
“This is a difference which is significant at considerably better than the p < .0001 level of confidence. If, in reeling off these zeroes, we manage to create the impression of stringing pearls on a necklace, we rather hope the reader will be patient and forbearing, for it has been the very number of zeros after this decimal point that has compelled us to treat these data with complete seriousness.”
The quote comes from a chapter on birth order in Schachter’s 1959 book The Psychology of Affiliation. The analysis was a chi-square test on 76 subjects. The subjects were selected from 3 different experiments for being “truly anxious” and combined for this analysis. True anxiety was determined if the subject scored at one or the other extreme endpoint of an anxiety scale (both complete denial and complete admission were taken to mean that the subject is “truly anxious”), and/or if the subject discontinued participation because the experiment made them feel too anxious.
In the latest issue of the ARP newsletter, Kelci Harris writes about diversity in ARP. You should read the whole thing. Here’s an excerpt:
Personality psychology should be intrinsically interesting to everyone, because, well, everyone has a personality. It’s accessible and that makes our research so fun and an easy thing to talk about with non-psychologists, that is, once we’ve explained to them what we actually do. However, despite what could be a universal appeal, our field is very homogenous. And that’s too bad, because diversity makes for better science. Good research comes from observations. You notice something about the world, and you wonder why that is. It’s probably reasonable to guess that most members of our field have experienced the world in a similar way due to their similar demographic backgrounds. This similarity in experience presents a problem for research because it makes us miss things. How can assumptions be challenged when no one realizes they are being made? What kind of questions will people from different backgrounds have that current researchers could never think of because they haven’t experienced the world in that way?
In response, Laura Naumann posted a letter to the ARP Facebook wall. Read it too. Another excerpt:
I challenge our field to begin to view those who conduct this type of research [on underrepresented groups] as contributing work that is EQUAL TO and AS IMPORTANT AS “traditional” basic research in personality and social psychology. First, this will require editors of “broad impact” journals to take a critical eye to their initial review process in evaluating what manuscripts are worthy of being sent out to reviewers. I’ve experienced enough frustration sending a solid manuscript to a journal only to have it quickly returned praising the work, but suggesting resubmission to a specialty journal (e.g., ethnic minority journal du jour). The message I receive is that my work is not interesting enough for broad dissemination. If we want a more welcoming field on the personal level, we need to model a welcoming field at the editorial level.
This is a discussion we need to be having. Big applause to Kelci and Laura for speaking out.
Now, what should we be doing? Read what Kelci and Laura wrote — they both have good ideas.
I’ll add a much smaller one, which came up in a conversation on my Facebook wall: let’s collect data. My impressions of what ARP conferences look like are very similar to Kelci’s, but not all important forms of diversity are visible, and if we had hard data we wouldn’t have to rely on impressions. How are the members and conference attendees of ARP and other personality associations distributed by racial and ethnic groups, gender, sexual orientation, national origin, socioeconomic background, and other important dimensions? How do those break down by career stage? And if we collect data over time, is better representation moving up the career ladder, or is the pipeline leaking? I hope ARP will consider collecting this data as part of the membership and conference registration processes going forward, and releasing aggregate numbers. (Maybe they already collect this, but if so, I cannot recall ever seeing any report of it.) With data we will have a better handle on what we’re doing well and what we could be doing better.
What else should we be doing — big or small? This is a conversation that is long overdue and that everybody should be involved in. Let’s have it.
A while back I wrote about whether grad school admissions interviews are effective. Following up on that, Sam Gosling recently passed along an article by Dana, Dawes, and Peterson from the latest issue of Judgment and Decision Making:
Unstructured interviews are a ubiquitous tool for making screening decisions despite a vast literature suggesting that they have little validity. We sought to establish reasons why people might persist in the illusion that unstructured interviews are valid and what features about them actually lead to poor predictive accuracy. In three studies, we investigated the propensity for “sensemaking” – the ability for interviewers to make sense of virtually anything the interviewee says – and “dilution” – the tendency for available but non-diagnostic information to weaken the predictive value of quality information. In Study 1, participants predicted two fellow students’ semester GPAs from valid background information like prior GPA and, for one of them, an unstructured interview. In one condition, the interview was essentially nonsense in that the interviewee was actually answering questions using a random response system. Consistent with sensemaking, participants formed interview impressions just as confidently after getting random responses as they did after real responses. Consistent with dilution, interviews actually led participants to make worse predictions. Study 2 showed that watching a random interview, rather than personally conducting it, did little to mitigate sensemaking. Study 3 showed that participants believe unstructured interviews will help accuracy, so much so that they would rather have random interviews than no interview. People form confident impressions even when interviews are defined to be invalid, like our random interview, and these impressions can interfere with the use of valid information. Our simple recommendation for those making screening decisions is not to use them.
It’s an interesting study. In my experience people’s beliefs in unstructured interviews are pretty powerful — hard to shake even when you show them empirical evidence.
I did have some comments on the design and analyses:
1. In Studies 1 and 2, each subject made a prediction about absolute GPA for 1 interviewee. So estimates of how good people are at predicting GPA from interviews are based on entirely between-subjects comparisons. It is very likely that a substantial chunk of the variance in predictions will be due to perceiver variance — differences between subjects in their implicit assumptions about how GPA is distributed. (E.g., Subject 1 might assume most GPAs range from 3 to 4, whereas Subject 2 assumes most GPAs range from 2.3 to 3.3. So even if they have the same subjective impression of the same target — “this person’s going to do great this term” — their numerical predictions might differ by a lot.) That perceiver variance would go into the denominator as noise variance in this study, lowering the interviewers’ predictive validity correlations.
Whether that’s a good thing or a bad thing depends on what situation you’re trying to generalize to. Perceiver variance would contribute to errors in judgment when each judge makes an absolute decision about a single target. On the other hand, in some cases perceivers make relative judgments about several targets, such as when an employer interviews several candidates and picks the best one. In that setting, perceiver variance would not matter, and a study with this design could underestimate accuracy.
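To make the perceiver-variance point concrete, here is a rough simulation sketch. Everything in it is an assumption for illustration and none of it comes from the paper: the number of judges, the 0.4 slope linking impressions to GPA, the noise levels, and the 0.5-point spread in where judges think GPAs are centered. It compares judges who all share one GPA scale with judges who each carry their own baseline, in a design where every judge predicts for a single target.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(2013)  # fixed seed so the sketch is reproducible
n_judges = 5000            # hypothetical: each judge predicts GPA for one target

# Each judge's subjective impression of their one target (standardized).
impressions = [rng.gauss(0, 1) for _ in range(n_judges)]
# Hypothetical truth: GPA tracks the impression, plus noise.
actual_gpa = [3.0 + 0.4 * z + rng.gauss(0, 0.4) for z in impressions]

# Case 1: every judge translates impressions onto the same GPA scale.
shared_scale = [3.0 + 0.4 * z for z in impressions]
# Case 2: each judge has a private belief about where GPAs are centered.
own_scale = [3.0 + rng.gauss(0, 0.5) + 0.4 * z for z in impressions]

r_shared = pearson(shared_scale, actual_gpa)
r_own = pearson(own_scale, actual_gpa)
print(f"validity with a shared scale:     r = {r_shared:.2f}")
print(f"validity with judge-specific baselines: r = {r_own:.2f}")
```

The impressions are equally good in both cases; only the between-judge baseline differs, and that alone pulls the between-subjects validity correlation down. In a relative-judgment design (one judge ranking several targets) the private baseline would cancel out.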
2. Study 1 had 76 interviewers spread across 3 conditions (n = 25 or 26 per condition), and only 7 interviewees (each of whom was rated by multiple interviewers). Based on 73 degrees of freedom reported for the test of the “dilution” effect, it looks like they treated interviewer as the unit of analysis but did not account for the dependency in interviewees. Study 2 looked to have similar issues (though in Study 2 the dilution effect was not significant).
3. I also had concerns about power and precision of the estimates. Any inferences about who makes better or worse predictions will depend a lot on variance among the 7 interviewees whose GPAs were being predicted (8 interviewees in Study 2). I haven’t done a formal power analysis, but my intuition is that that’s pretty small. You can see a possible sign of this in one key difference between the studies. In Study 1, the correlation between the interviewees’ prior GPA and upcoming GPA was r = .65, but in Study 2 it was r = .37. That’s a pretty big difference between estimates of a quantity that should not be changing between studies.
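A quick way to see how little 7 or 8 data points pin down a correlation is the standard Fisher z confidence interval. This is only a sketch, and it assumes the two correlations were computed across the 7 and 8 interviewees respectively (my reading, not something I have checked against the paper). The function name is made up for illustration.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for a correlation via the Fisher z transform."""
    z = math.atanh(r)               # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)     # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

ci_study1 = fisher_ci(0.65, 7)  # assumed n = 7 interviewees
ci_study2 = fisher_ci(0.37, 8)  # assumed n = 8 interviewees

print(f"Study 1: r = .65, 95% CI [{ci_study1[0]:.2f}, {ci_study1[1]:.2f}]")
print(f"Study 2: r = .37, 95% CI [{ci_study2[0]:.2f}, {ci_study2[1]:.2f}]")
```

Under those assumptions both intervals include zero and each one comfortably contains the other study's estimate, so the gap between .65 and .37 is about what you would expect from samples this small.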
So it’s an interesting study but not one that can give answers I’d call definitive. If that’s well understood by readers of the study, I’m okay with that. Maybe someone will use the interesting ideas in this paper as a springboard for a larger followup. Given the ubiquity of unstructured interviews, it’s something we need to know more about.
The other day I came across a blog post ranking academic fields by hotness. Important data for sure. But something about it was gnawing on me for a while, some connection I wasn’t quite making.
And then it hit me. The rankings looked an awful lot like another list I’d once seen of academic fields ranked by intelligence. Only, you know, upside-down.
Sure enough, when I ran the correlation among the fields that appear on both lists, it came out at r = -.45.
I don’t know what this means, but it seems important. Maybe a mathematician or computer scientist can help me understand it.
“I don’t care about any effect that I need more than 20 subjects per cell to detect.”
I have heard statements to this effect a number of times over the years. Sometimes from the mouths of some pretty well-established researchers, and sometimes from people quoting the well-established researchers they trained under. The idea is that if an effect is big enough — perhaps because of its real-world importance, or because of the experimenter’s skill in isolating and amplifying the effect in the lab — then you don’t need a big sample to detect it.
When I have asked people why they think that, the reasoning behind it goes something like this. If the true effect is large, then even a small sample will have a reasonable chance of detecting it. (“Detecting” = rejecting the null in this context.) If the true effect is small, then a small sample is unlikely to reject the null. So if you only use small samples, you will limit yourself to detecting large effects. And if that’s all you care about detecting, then you’re fine with small samples.
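The premise of that reasoning can be checked with a back-of-the-envelope power calculation. The sketch below uses a normal approximation to a two-sample test (so the numbers are slightly optimistic compared with an exact t-test) and assumes Cohen's conventional benchmarks of d = 0.2, 0.5, and 0.8 for small, medium, and large effects. Those benchmarks and the function name are my choices for illustration.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(d, n_per_cell, alpha=0.05):
    """Normal-approximation power of a two-sided, two-group comparison."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    ncp = d * sqrt(n_per_cell / 2)  # expected z statistic for a true effect d
    # Probability of landing beyond either critical value.
    return (1 - norm.cdf(z_crit - ncp)) + norm.cdf(-z_crit - ncp)

power_at_20 = {
    label: approx_power(d, 20)
    for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]
}
for label, power in power_at_20.items():
    print(f"{label:>6} effect, n = 20 per cell: power ~ {power:.2f}")
```

So the first half of the argument holds up: with 20 per cell a large effect is detected most of the time and a small one rarely.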
On first consideration, that might sound reasonable, and even admirably aware of issues of statistical power. Unfortunately it is completely wrong. Some of the problems with it are statistical and logical. Others are more practical:
1. It involves a classic error in probabilistic thinking. In probability notation, you are incorrectly assuming that P(A|B) = P(B|A). In psychology jargon, you are guilty of base rate insensitivity.
Here’s why. A power analysis tells you the following: for a given sample size, IF the effect is large (or small) THEN you will have a certain probability of rejecting the null. But that probability is not the same as the probability that IF you have rejected the null THEN the effect is large (or small). The latter probability depends both on power and on how common large and small effects are to begin with, which a statistician would call the prior probability and a psychologist would call the base rate.
To put this in context, suppose an experimenter is working in an area where most effects are small, some are medium, and a few are large (which pretty well describes the field of social psychology as a whole). The experimenter does not know in advance which it is, of course. When it turns out that the experimenter has stumbled onto one of the occasional large effects, the test will probably be significant. But more often the true effect will be small. Some of those will be “missed,” but there will be so many of them (relative to the number of experiments run) that the ones that do reach significance will end up being the majority of the significant results.
Consider a simplified numerical example with just 2 possibilities. Suppose that 10% of experiments are chasing an effect that is so big there’s a 90% chance of detecting it (.90 power), and 90% of experiments are chasing smaller effects with a 40% chance. Out of 100 experiments, the experimenter will get 9 significant results from the large effects, and 36 significant results from the small effects. So most of the significant results (80% of them) will come from having gotten a little bit lucky with small effects, rather than having nailed a big effect.
(Of course, that simplified example assumes, probably too generously, that the null is never true or close-to-true. Moreover, with small samples the most common outcome — absent any p-hacking — will be that the results are not significant at all, even when there really is an effect.)
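The arithmetic in the two-category example above is just Bayes' rule, and it is easy to play with. Here is a small sketch using the same base rates and power values as the example; the function name is made up, and you can swap in your own guesses about your research area.

```python
def share_of_significant(effects):
    """Given (label, base_rate, power) triples, return each effect type's
    share of all significant results."""
    # Expected proportion of all experiments that are significant, by type.
    hits = {label: base_rate * power for label, base_rate, power in effects}
    total = sum(hits.values())
    return {label: hit / total for label, hit in hits.items()}

# Numbers from the example: 10% large effects at .90 power,
# 90% small effects at .40 power.
shares = share_of_significant([
    ("large", 0.10, 0.90),
    ("small", 0.90, 0.40),
])
for label, share in shares.items():
    print(f"{label} effects: {share:.0%} of significant results")
```

Like the example, this assumes the null is never exactly true; adding a third row for true nulls (with "power" equal to alpha) would be a simple extension.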
If a researcher really is only interested in identifying effects of a certain size, it might seem reasonable to calculate effect sizes. But with small samples, the researcher will greatly overestimate effect sizes in the subset of results that are significant, and the amount of bias will be the greatest when the true effects are small. That is because in order to show up as significant, those small (true) effects will need to have been helped along by sampling error, giving them a positive bias as a group. So the data won’t be much help in distinguishing truly large effects from the times when chance helped along a small one. They’ll look much the same.
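You can watch this bias appear in a simple simulation. The sketch below assumes a true effect of d = 0.2, two groups of 20, normally distributed scores, and a two-tailed .05 test (critical t of about 2.024 for 38 degrees of freedom). All of those settings are my own choices for illustration. It runs many such studies and compares the average observed effect size overall with the average among only the significant ones.

```python
import math
import random
import statistics

rng = random.Random(20)  # fixed seed so the sketch is reproducible
TRUE_D = 0.2             # assumed true standardized effect (small)
N = 20                   # subjects per cell
T_CRIT = 2.024           # two-tailed .05 critical t for df = 38
N_SIMS = 3000            # number of simulated studies

def one_study():
    """Simulate one two-group study; return (observed d, t statistic)."""
    treatment = [rng.gauss(TRUE_D, 1) for _ in range(N)]
    control = [rng.gauss(0, 1) for _ in range(N)]
    pooled_sd = math.sqrt(
        (statistics.variance(treatment) + statistics.variance(control)) / 2
    )
    d = (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd
    t = d * math.sqrt(N / 2)  # equal-n two-sample t expressed through d
    return d, t

results = [one_study() for _ in range(N_SIMS)]
all_d = [d for d, _ in results]
significant_d = [d for d, t in results if abs(t) > T_CRIT]

mean_all = statistics.mean(all_d)
mean_sig = statistics.mean(significant_d)
print(f"true d = {TRUE_D}")
print(f"mean observed d, all studies:       {mean_all:.2f}")
print(f"mean observed d, significant only:  {mean_sig:.2f}")
print(f"proportion significant:             {len(significant_d) / N_SIMS:.2f}")
```

With n = 20 per cell, an observed d has to exceed roughly 0.64 in absolute value to be significant at all, so the significant subset of truly small effects cannot help but look medium-to-large.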
2. Behind the “I only need small samples” argument is the idea that the researcher attaches some special interpretive value (theoretical meaning or practical significance) to effects that are at least some certain size. But if that is the case, then the researcher needs to adjust how he or she does hypothesis testing. In conventional NHST, the null hypothesis you are trying to rule out states that the effect is zero. But if you only care about medium-and-larger effects, then you need to go a step further and rule out the hypothesis that the effect is small. Which is entirely possible to do, and not that difficult mathematically. But in order to differentiate a medium effect from a small one, you need statistical power. n=20 per cell won’t cut it.
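Here is a rough sketch of what "ruling out a small effect" costs in sample size. It assumes a one-sided test at alpha = .05 of the null hypothesis that d is no bigger than 0.2, a true effect of d = 0.5, and a normal approximation with known unit variance. The specific d values and the function name are illustrative assumptions on my part.

```python
from math import sqrt
from statistics import NormalDist

def power_vs_small(true_d, null_d, n_per_cell, alpha=0.05):
    """Approximate power to reject H0: d <= null_d when the truth is true_d."""
    norm = NormalDist()
    se = sqrt(2 / n_per_cell)          # approximate standard error of d
    z_crit = norm.inv_cdf(1 - alpha)   # one-sided critical value
    return 1 - norm.cdf(z_crit - (true_d - null_d) / se)

power_n20 = power_vs_small(0.5, 0.2, 20)
power_n100 = power_vs_small(0.5, 0.2, 100)
print(f"n = 20 per cell:  power ~ {power_n20:.2f}")
print(f"n = 100 per cell: power ~ {power_n100:.2f}")
```

Under these assumptions, 20 per cell distinguishes a medium effect from a small one only about a quarter of the time, and even 100 per cell falls short of the conventional .80 target.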
(This situation is in some ways the mirror opposite of when researchers say they only care about the direction of an effect, not its size. But even if nothing substantive is riding on the effect size, most effects turn out to be small in size, and the experiment is only worth doing if it is reasonably capable of detecting something.)
3. All of the above assumes optimal scientific practice — no researcher degrees of freedom, no publication bias, etc. In the real world, of course, things are rarely optimal.
When it comes to running studies with small samples, one of the biggest risks is engaging in data churning and capitalizing on chance, even inadvertently. Consider two researchers who are planning to run 2-condition experiments in which a significant result would be theoretically interesting and publishable. Both decide to dedicate the time and resources to running up to 200 subjects on the project before giving up and moving on. Researcher A runs one well-powered study with n=100 per condition. Researcher B decides to run the experiment with n=20 per condition. If it is significant, B will stop and write up the result; if it is not significant, B tweaks the procedure and tries again, for up to 5 attempts in all (5 × 40 = 200 subjects).
Without realizing it, Researcher B is working with an effective Type I error rate that is about 23%.
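That figure comes from a one-line calculation. The sketch below assumes the null is true in every attempt, each attempt is tested at alpha = .05, and the attempts are independent of one another. The function name is made up for illustration.

```python
def chance_of_false_positive(alpha, attempts):
    """Probability of at least one significant result across independent
    attempts when the null hypothesis is true every time."""
    return 1 - (1 - alpha) ** attempts

rate_a = chance_of_false_positive(0.05, 1)  # Researcher A: one study
rate_b = chance_of_false_positive(0.05, 5)  # Researcher B: up to 5 tries
print(f"Researcher A: {rate_a:.1%}")
print(f"Researcher B: {rate_b:.1%}")
```

Researcher B's rate works out to 1 − .95⁵, or roughly 23%, compared with A's nominal 5%.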
Of course, Researcher B should also recognize that this is a cross-validation situation. By tweaking and repeating (or merely being willing to tweak and repeat if the result is not significant), B is engaged in exploratory research. To guard against capitalizing on chance, the variation on the experiment that finally “works” needs to be run again. That is actually true regardless of sample size or power. But the practical consequences of violating it are bigger and bigger as samples get smaller and smaller.
The upshot of all of this is pretty straightforward: you cannot talk your way out of running studies with adequate power. It does not matter if you only care about large effects, if you care about small effects too, or if you do not care about effect sizes at all. It doesn’t even matter if you focus on estimation rather than NHST (which I wholeheartedly support by the way): you still need adequate samples. The only alternatives are (a) to live with a lot of ambiguous (nonsignificant) results until enough accrue to do a meta-analysis, or (b) to p-hack your way out of them.