Failed experiments do not always fail toward the null

There is a common argument among psychologists that null results are uninformative. Part of this is the logic of NHST - failure to reject the null is not the same as confirmation of the null. Which is an internally valid statement, but ignores the fact that studies with good power also have good precision to estimate effects.

However there is a second line of argument which is more procedural. The argument is that a null result can happen when an experimenter makes a mistake in either the design or execution of a study. I have heard this many times; this argument is central to an essay that Jason Mitchell recently posted arguing that null replications have no evidentiary value. (The essay said other things too, and has generated some discussion online; see e.g., Chris Said’s response.)

The problem with this argument is that experimental errors (in both design and execution) can produce all kinds of results, not just the null. Confounds, artifacts, failures of blinding procedures, demand characteristics, outliers and other violations of statistical assumptions, etc. can all produce non-null effects in data. When it comes to experimenter error, there is nothing special about the null.

Moreover, we commit a serious oversight when we use substantive results as the sole evidence of procedures. Say that the scientific hypothesis is that X causes Y. So we design an experiment with an operationalization of X, O_X, and an operationalization of Y, O_Y. A “positive” result tells us O_X -> O_Y. But unless we can say something about the relationships between O_X and X and between O_Y and Y, the result tells us nothing about X and Y.

We have a well established framework for doing that with measurements: construct validation. We expect that measures can and should be validated independent of results to document that Y -> O_Y (convergent validity) and P, Q, R, etc. !-> O_Y (discriminant validity). We have papers showing that measurement procedures are generally valid (in fact these are some of our most-cited papers!). And we typically expect papers that apply previously-established measurement procedures to show that the procedure worked in a particular sample, e.g. by reporting reliability, factor structure, correlations with other measures, etc.

Although we do not seem to publish as many validation papers on experimental manipulations as on measurements, the logic of validation applies just as well. We can obtain evidence that O_X -> X, for example by showing that experimental O_X affects already-established measurements O_X2, O_X3, etc. And in a sufficiently powered design we can show that O_X does not meaningfully influence other variables that are known to affect Y or O_Y. Just as with measurements, we can accumulate this evidence in systematic investigations to show that procedures are generally effective, and then when labs use the procedures to test substantive hypotheses they can run manipulation checks to show that they are executing a procedure correctly.

Programmatic validation is not always necessary — some experimental procedures are so face-valid that we are willing to accept that O_X -> X without a validation study. Likewise for some measurements. That is totally fine, as long as there is no double standard. But in situations where we would be willing to question whether a null result is informative, we should also be willing to question whether a non-null is. We need to evaluate methods in ways that do not depend on whether those methods give us results we like — for experimental manipulations and measurements alike.

Some thoughts on replication and falsifiability: Is this a chance to do better?

Most psychologists would probably endorse falsification as an important part of science. But in practice we rarely do it right. As others have observed before me, we do it backwards. Instead of designing experiments to falsify the hypothesis we are testing, we look for statistical evidence against a “nil null” — the point prediction that the true effect is zero. Sometimes the nil null is interesting, sometimes it isn’t, but it’s almost never a prediction from the theory that we are actually hoping to draw conclusions about.

The more rigorous approach is to derive a quantitative prediction from a theory. Then you design an experiment where the prediction could fail if the theory is wrong. Statistically speaking, the null hypothesis should be the prediction from your theory (“when dropped, this object will accelerate toward the earth at 9.8 m/s^2″). Then if a “significant” result tells you that the data are inconsistent with the theory (“average measured acceleration was 8.6 m/s^2, which differs from 9.8 at p < .05″), you have to either set aside the theory itself or one of the supporting assumptions you made when you designed the experiment. You get some leeway to look to the supporting assumptions (“oops, 9.8 assumes no wind resistance”), but not endless leeway – if the predictions keep failing, eventually you have to face facts and walk away from your theory. On the flip side, a theory is corroborated when it survives many risky opportunities to fail.

The problem in psychology — and many other sciences, including quite a bit of biology and medicine — is that our theories rarely make specific enough quantitative predictions to do hypothesis testing the “right” way. Few of our theories lead to anything remotely close to “g = 9.8 m/s^2″ in specificity. People sometimes suggest this is a problem with psychologists’ acumen as theorists. I am more inclined to think it is a function of being a young science and having chosen very difficult problems to solve. So in the grand scheme, I don’t think we should self-flagellate too much about being poor theorists or succumb to physics envy. Most of the time I am inclined to agree with people Paul Rozin (who was agreeing with Solomon Asch) and William McGuire that instead we need to adapt our approach to our scientific problems and current state of knowledge, rather than trying to ape a caricature of “hard” science. That requires changes in how we do science: we need more exploration and discovery to accumulate interesting knowledge about our phenomena, and we need to be more modest and conditional in our theories. It would be a mistake to say we need to simply double down on the caricature.

So with all this being said, there is something really interesting and I think under-appreciated about the recent movement toward replication, and it is this: This may be a great opportunity to do falsification better.

The repeatability theory

Every results section says some version of, “We did this experiment and we observed these results.”[1] It is a specific statement about something that happened in the past. But hand-in-hand with that statement is, implicitly, another claim: “If someone does the same experiment again, they will get the same results.” The second claim is a mini-theory: it is a generalization of the first claim. Call it the repeatability theory. Every experimental report comes with its own repeatability theory. It is a necessary assumption of inferential statistics. And if we did not make it, we would be doing history rather than science.

And here’s the thing: the repeatability theory is very falsifiable. The rigorous, strong kind of falsifiable. We just need to clarify what it means to (A) do the same experiment again and (B) observe the same or different results.

Part B is a little easier. “The same results” does not mean exactly the same results to infinite precision. It means “the same results plus or minus error.” The hypothesis is that Experiment 1 (the original) and Experiment 2 (the replication) are observations with error of the same underlying effect, so any observed differences between experiments are just noise. If you are using NHST[2] that leads to a straightforward “strong” null hypothesis: effectsize_1 = effectsize_2. If you have access to all the raw data, you can combine both experiments into a single dataset, create an indicator variable for which study the effect came from, and test the interaction of that indicator with the effect. The null hypothesis is no interaction, which sounds like the old fashioned nil-null but in fact “interaction = 0″ is the same as saying the effects are equal, which is the very specific quantitative hypothesis derived from the repeatability theory. If you don’t have the raw data, don’t despair. You can calculate an effect from each experiment and then compare them, like with a test of independent correlations. You can and should also estimate the difference between effects (effectsize_1 – effectsize_2) and an associated confidence interval. That difference is itself an effect size: it quantifies whatever difference there is between the studies, and can tell you if the difference is large or trivial.

Part A, “do the same experiment again,” is more complicated. Literalists like to point out that you will never be in the same room, with the same weather outside, with the same RA wearing the same shirt, etc. etc. They are technically right about all of that.[3]

But the realistic answer is that “the same experiment” just has to repeat the things that matter. “What matters” has been the subject of some discussion recently, for example in a published commentary by Danny Kahneman and a blog post by Andrew Wilson. In my thinking you can divide “what matters” into 3 categories: the original researchers’ specification of the experiment, technical skills in the methods used, and common sense. The onus is on the original experimenter to be able to tell a competent colleague what is necessary to repeat the experiment. In the old days of paper journals and page counts, it was impossible for most published papers to do this completely and you needed a lot of backchannel communication. With online supplements the gap is narrowing, but I still think it can’t hurt for a replicator to reach out to an original author. (Though in contrast to Kahneman, I would describe this as a methodological best practice, neither a matter of etiquette nor an absolute requirement.) If researchers say they do not know what conditions are necessary to produce an effect, that is no defense. It should undermine our faith in the original study. Don’t take my word for it, here’s Sir Karl (whose logic is better than his language – this is [hopefully obviously] limited neither to men nor physicists):

Every experimental physicist knows those surprising and inexplicable apparent ‘effects’ which in his laboratory can perhaps even be reproduced for some time, but which finally disappear without trace. Of course, no physicist would say in such a case that he had made a scientific discovery (though he might try to rearrange his experiments so as to make the effect reproducible). Indeed the scientifically significant physical effect may be defined as that which can be regularly reproduced by anyone who carries out the appropriate experiment in the way prescribed. No serious physicist would offer for publication, as a scientific discovery, any such ‘occult effect,’ as I propose to call it – one for whose reproduction he could give no instructions. (Karl Popper, The Logic of Scientific Discovery, pp. 23-24)

Interpreting results

What happens when the data are inconsistent with the repeatability theory – original != replication? As with all empirical results, we have to consider multiple interpretations. This is true in all of science and has been recognized for a long time; replications are not special in this regard. An observed discrepancy between the original result and a replication[4] is an empirical finding that needs to be interpreted like any other empirical finding. However, a few issues come up commonly in interpreting replications:

First vs. latest. There is nothing special about an experiment being either the first or the latest, ceteris paribus. However ceteris is rarely paribus. If the replication has more power or if the scientific community gets to see its results through a less biased process than the original (e.g., due to pre-registration or a results-independent publication process), those things should give it more weight.

Technical skills. A technical analysis of the methods used and labs’ track records with them is appropriate. I am not much swayed by broad appeals to experimental “artistry.” Instead, I find these interpretations more persuasive when someone can put forward a plausible candidate for something important in the original that is not easy to standardize or carry off without specific skills. For example, a computer-administered experiment is possible to standardize and audit (and in some cases the code and digital stimuli can be reproduced exactly). But an experiment that involves confederates or cover stories might be harder to pull off for a lab that does not do that routinely. When that is the case, manipulation checks, lab visits/exchanges (in person or through video), and other validation procedures become important.

Moderators. Replications can never reproduce every single aspect of the original study. They do their best to reproduce everything that the original specification, technical knowledge, and common sense say should matter. But they can and will still depart from original studies in any number of ways: the subject pool being drawn from, the local social and cultural context, procedural changes made for practical reasons, etc. When the replication departs substantially from the original, it is fair to consider possible moderators. But moderator interpretations are nearly always post hoc, and should be weighed accordingly until we have more data.

I think it’s also important to point out that the possibility of unanticipated moderators is not a problem with replications; rather, if you are interested in discovery it is a very good reason to run them. Consider a hypothetical example from a recent blog post by Tim Wilson: a study originally run in the laboratory that produces a smaller effect in an online replication. Wilson imagines this is an outcome that a replicator with improbable amounts of both malevolence and prescience might arrange on purpose. But a far more likely scenario is that if the original specification, technical knowledge, and common sense all say that offline-online shouldn’t matter but it turns out that it does, that could actually be a very interesting discovery! People are living more of their lives online, and it is important to know how social cognition and behavior work in virtual spaces. And a discovery like that might also save other scientists a lot of wasted effort and resources, if for example they thought the experiment would work online and planned to run replicate-and-extend studies or adapt parts of the original procedure for new studies. In the end, Wilson’s example of replication gone wrong looks more like a useful discovery.

Discovery and replication need each other

Discovery and replication are often contrasted with each other. Discovery is new and exciting; replication is dull “duplication.” But that is silly. Replication separates real discoveries from noise-surfing, and as just noted it can itself lead to discoveries. We can and should do both. And not just in some sort of division of labor arrangement, but in an integrated way as part of our science. Exciting new discoveries need to be replicated before we take them as definitive. Replication within and between labs should be routine and normal.

An integrated discovery-replication approach is also an excellent way to build theories. Both Rozin and McGuire criticize psychology’s tendency to equate “theory” with broad, decontextualized statements - pronouncements that almost invariably get chipped away in subsequent studies as we discover moderators and boundary conditions. This kind of “overclaim first, then back away slowly” approach supports the hype cycle and means that a tendency to make incorrect statements is baked in to our research process. Instead, Rozin wants us to accumulate interesting descriptive facts about the phenomena we are studying; McGuire wants us to study how effects vary over populations and contexts. A discovery-replication approach allows us to do this both of these things. We can use discovery-oriented exploratory research to derive truly falsifiable predictions to then be tested. That way we will amass a body of narrow but well-corroborated theoretical statements (the repeatability theories) to assemble into bigger theories from the foundations up, rather than starting with bold pronouncements. We will also build up knowledge about quantitative estimates of effects, which we can use to start to make interval and even point predictions. That kind of cumulative science is likely to generate fewer sexy headlines in the short run, but it will be a whole lot more durable.

—–

1. I am using “experiment” in the very broad sense here of a structured scientific observation, not the more limited sense of a study that involves randomized manipulation by an experimenter.[5]

2. I’m sure the Bayesians have an answer for the statistical problem too. It is probably a good one. But c’mon, this is a chance to finally do NHST right!

3. Literalists also like to say it’s a problem that you will never have the exact same people as subjects again. They are technically wrong about that being a problem. “Drawing a sample” is part of what constitutes the experiment. But pressing this point will get you into an argument with a literalist over a technicality, which is never fun, so I suggest letting it drop.

4. “Discrepancy” = “failed replication” in the parlance of our time, but I don’t like that phrase. Who/what failed? Totally unclear, and the answer may be nobody/nothing.

5. I am totally ripping this footnote thing off of Simine Vazire but telling myself I’m ripping off David Foster Wallace.

Does the replication debate have a diversity problem?

Folks who do not have a lot of experiences with systems that don’t work well for them find it hard to imagine that a well intentioned system can have ill effects. Not work as advertised for everyone. That is my default because that is my experience.
- Bashir, Advancing How Science is Done

A couple of months ago, a tenured white male professor* from an elite research university wrote a blog post about the importance of replicating priming effects, in which he exhorted priming researchers to “Nut up or shut up.”

Just today, a tenured white male professor* from an elite research university said that a tenured scientist who challenged the interpretation and dissemination of a failed replication is a Rosa Parks, “a powerless woman who decided to risk everything.”

Well then.

The current discussion over replicability and (more broadly) improving scientific integrity and rigor is an absolutely important one. It is, at its core, a discussion about how scientists should do science. It therefore should include everybody who does science or has a stake in science.

Yet over the last year or so I have heard a number of remarks (largely in private) from scientists who are women, racial minorities, and members of other historically disempowered groups that they feel like the protagonists in this debate consist disproportionately of white men with tenure at elite institutions. Since the debate is over prescriptions for how science is to be done, it feels a little bit like the structurally powerful people shouting at each other and telling everybody else what to do.

By itself, that is enough to make people with a history of being disempowered wonder if they will be welcome to participate. And when the debate is salted with casually sexist language, and historically illiterate borrowing of other people’s oppression to further an argument — well, that’s going to hammer the point.

This is not a call for tenured white men to step back from the conversation. Rather, it is a call to bring more people in. Those of us who are structurally powerful in various ways have a responsibility to make sure that people from all backgrounds, all career stages, and all kinds of institutions are actively included and feel safe and welcome to participate. Justice demands it. That’s enough for me, but if you need a bonus, consider that including people with personal experience seeing well-intentioned systems fail might actually produce a better outcome.

—–

* The tenured and professor parts I looked up. White and male I inferred from social presentation.

What is counterintuitive?

Simine Vazire has a great post contemplating how we should evaluate counterintuitive claims. For me that brings up the question: what do we mean when we say something is “counterintuitive?”

First, let me say what I think counterintuitive isn’t. The “intuitive” part points to the fact that when we label something counterintuitive, we are usually not talking about contradicting a formal, well-specified theory. For example, you probably wouldn’t say that the double-slit experiment was “counterintuitive;” you’d say it falsified classical mechanics.

In any science, though, you have areas of inquiry where there is not an existing theory that makes precise predictions. In social and personality psychology that is the majority of what we are studying. (But it’s true in other sciences too, probably more than we appreciate.) Beyond the reach of formal theory, scientists develop educated guesses, hunches, and speculations based on their knowledge and experience. So the “intuitive” in counterintuitive could refer to the intuitions of experts.

But in social and personality psychology we study phenomena that regular people reflect on and speculate about too. A connection to everyday lived experience is almost definitional to our field, whether you think it is something that we should actively pursue or just inevitably creeps in. So we have an extra source of intuitions – the intuitions of people who are living the phenomena that we study. Which includes ourselves, since social and personality psychologists are all human beings too.

And when you are talking about something that (a) people reflect on and wonder about and (b) is not already well settled, then chances are pretty good that people have had multiple, potentially contradictory ideas about it. Sometimes different people having different ideas; sometimes the same person having different ideas at different times. The contradictory ideas might even have made their way into cultural wisdom – like “birds of a feather flock together” versus “opposites attract.”

What I suspect that means is that “counterintuitive” is often just a rhetorical strategy for writing introduction sections and marketing our work. No matter how your results turned out, you can convince your audience that they once thought the opposite. Because chances are very good that they did. A skilled writer can exploit the same mechanisms that lead to hindsight bias to set people up, and then surprise! show them that the results went the other way.

I would not claim that this describes all instances of counterintuitive, but I think it describes a lot of them. As Simine points out, many people in psychology say that counterintuitive findings are more valuable — so clearly there is an incentive to frame things that way. (Counterintuitive framing is also a great way to sell a lot of books.)

Of course, it does not have to be that way. After all, we are the field that specializes in measuring and explaining people’s intuitions. Why don’t we ask our colleagues to back up their claims of being “counterintuitive” with data? Describe the procedure fully and neutrally to a group of people (experts or nonexperts, depending whose intuitions you want to claim to be counter to) and ask what they think will happen. Milgram famously did that with his obedience experiments.

We should also revisit why we think “counterintuitive” is valuable. Sometimes it clearly is. For example, when intuition systematically leads people to make consequentially bad decisions it can be important to document that and understand why. But being counterintuitive for counterintuitive’s sake? If intuitions vary widely — and so do results, across contexts and populations — then we run the risk that placing too much value on counterintuitive findings will do more to incentive rhetorical flash than substantive discoveries.

What did Malcolm Gladwell actually say about the 10,000 hour rule?

A new paper out in Intelligence, from a group of authors led by David Hambrick, is getting a lot of press coverage for having “debunked” the 10,000-hour rule discussed in Malcolm Gladwell’s book Outliers. The 10,000-hour rule is — well, actually, that’s the point of this post: Just what, exactly, is the 10,000-hour rule?

The debate in Intelligence is between Hambrick et al. and researcher K. Anders Ericsson, who studies deliberate practice and expert performance (and wrote a rejoinder to Hambrick et al. in the journal). But Malcolm Gladwell interpreted Ericsson’s work in a popular book and popularized the phrase “the 10,000-hour rule.” And most of the press coverage mentions Gladwell.

Moreover, Gladwell has been the subject of a lot of discussion lately about how he interprets research and presents his conclusions. The 10,000-hour rule has become a runaway meme — there’s even a Macklemore song about it. And if you google it, you’ll find a lot of people talking about it and trying to apply it to their lives. The interpretations aren’t always the same, suggesting there’s been some interpretive drift in what people think the 10,000-hour rule really is. I read Outliers shortly after it came out, but my memory of it has probably been shaped by all of that conversation that has happened since. So I decided it would be interesting to go back to the source and take another look at what Gladwell actually said.

“The 10,000-Hour Rule” is the title of a chapter in Outliers. It weaves together a bunch of stories of how people became wildly successful. The pivotal moment where Gladwell lays out his thesis, the nut graf if you will, is this:

“For almost a generation, psychologists around the world have been engaged in a spirited debate over a question that most of us would consider to have been settled years ago. The question is this: is there such a thing as innate talent? The obvious answer is yes. Not every hockey player born in January ends up playing at the professional level. Only some do—the innately talented ones. Achievement is tal­ent plus preparation. The problem with this view is that the closer psychologists look at the careers of the gifted, the smaller the role innate talent seems to play and the bigger the role preparation seems to play.” (pp. 37-38)

This is classic Gladwell style — setting up the conventional wisdom and then knock it down. You might think X, but I’m going to show you it’s really not-X. In this case, what is the X that you might think? That there is such a thing as talent and that it matters for success. And Gladwell is promising to challenge that view. Zoom in and it’s laid bare:

“Achievement is tal­ent plus preparation. The problem with this view…”

Some Gladwell defenders have claimed he was just saying that talent isn’t enough by itself and preparation matters too. But that would be a pretty weak assertion for a bestselling book. I mean, who doesn’t think that violin prodigies or hockey players need to practice? And it is clear Gladwell is going for something more extreme than that. “Achievement is talent plus preparation” is not Gladwell’s thesis. To the contrary, that is the conventional wisdom that Gladwell is promising to overturn.

Gladwell then goes on to tell a bunch of stories of successful people who practiced a lot lot lot before they became successful. But that line of argument can only get you so far. Preparation and talent are not mutually exclusive. So saying “preparation matters” over and over really tells you nothing about whether talent matters too. And the difficulty for Gladwell is that, try as he might, he cannot avoid acknowledging a place for talent too. To deny that talent exists and matters would be absurd in the face of both common sense and hard data. And Gladwell can’t go that far:

“If we put the stories of hockey players and the Beatles and Bill Joy and Bill Gates together, I think we get a more com­plete picture of the path to success. Joy and Gates and the Beatles are all undeniably talented. Lennon and McCart­ney had a musical gift of the sort that comes along once in a generation, and Bill Joy, let us not forget, had a mind so quick that he was able to make up a complicated algorithm on the fly that left his professors in awe. That much is obvious.” (p. 55)

So “a more complete picture of the path to success” says that talent exists and it matters — a lot. It is actually a big deal if you have a “gift of the sort that comes along once in a generation.” So we are actually back to the conventional wisdom again: Achievement is talent plus preparation. Sure, Gladwell emphasizes the preparation piece in his storytelling. But that difference in emphasis tells us more about what is easier to narrate (nobody is ever going to make an 80′s-style montage about ACE models) than about which is actually the stronger cause. So after all the stories, it looks an awful lot like the 10,000-hour rule is just the conventional wisdom after all.

But wait! In the very next paragraph…

“But what truly distinguishes their histories is not their extraordinary talent but their extraordinary oppor­tunities.” (p. 55)

“Opportunities” doesn’t sound like talent *or* preparation. What’s that about?

This, I think, has been missing from a lot of the popular discussion about the 10,000-hour rule. Narrowly, the 10,000-hour rule is about talent and preparation. But that overlooks the emphasis in Outliers on randomness and luck — being in the right place and the right time. So you might expand the formula: “Achievement is talent plus preparation plus luck.”

Only Gladwell wants his conclusion to be simpler than the conventional wisdom, not more complicated. So he tries to equate luck with preparation, or more precisely with the opportunity to prepare. Be born in the right era, live in the right place, and maybe you’ll get a chance to spend 10,000 hours getting good at something.

The problem with simplifying the formula rather than complicating it is that you miss important things. Gladwell’s point is that you need opportunities to prepare — you can’t become a computer whiz unless you have access to a computer to tinker with (10,000 hours worth of access, to be precise). He notes that a lot of wealthy and famous computer innovators, like Bill Gates, Paul Allen, and Steve Jobs, were born in 1954 or 1955. So when personal computing took off they were just the right age to get to mess around with computers: old enough to start businesses, young enough and unattached enough to have the time to sink into something new and uncertain. Gladwell concludes that the timing of your birth is a sort of cosmically random factor that affects whether you’ll be successful.

But not all opportunities are purely random — in many domains, opportunities are more likely to come to people who are talented or prepared or both. If you show some early potential and dedication to hockey or music, people are more likely to give you a hockey stick or a violin. Sure, you have to live in a time and place where hockey sticks or violins exist, but there’s more to it than that.

And let us not forget one of the most important ways that people end up in the right place at the right time: privilege (turns out Macklemore has a song about that too). The year that Gates, Allen, and Jobs were all born in 1954-55 may be random in some cosmic sense. But the fact that they are all white dudes from America suggests some sort of pattern, at least to me. Gladwell tells a story about how Bill Hewlett gave a young Steve Jobs spare computer parts to tinker with. The story is told like it’s a lucky opportunity for Jobs, and in a sense it is. But I wonder what would have happened if a poor kid from East Palo Alto had asked Hewlett for the same thing.

So now we are up to 4 things: talent, preparation, luck, and privilege. They all matter, they all affect each other, and I am sure we could add to the list. And you could go even deeper and start questioning the foundations of how we have carved up our list of variables (just what do we mean by “innate talent” anyway, and is it the same thing — innate in the same way — for everybody?). That would be an even more complete picture of the path to success. Not an easy story to tell, I know, but maybe a better one.

In which we admire a tiny p with complete seriousness

A while back a colleague forwarded me this quote from Stanley Schachter (yes that Stanley Schachter):

“This is a difference which is significant at considerably better than the p < .0001 level of confidence. If, in reeling off these zeroes, we manage to create the impression of stringing pearls on a necklace, we rather hope the reader will be patient and forbearing, for it has been the very number of zeros after this decimal point that has compelled us to treat these data with complete seriousness.”

The quote comes from a chapter on birth order in Schachter’s 1959 book The Psychology of Affiliation. The analysis was a chi-square test on 76 subjects. The subjects were selected from 3 different experiments for being “truly anxious” and combined for this analysis. True anxiety was determined if the subject scored at one or the other extreme endpoint of an anxiety scale (both complete denial and complete admission were taken to mean that the subject is “truly anxious”), and/or if the subject discontinued participation because the experiment made them feel too anxious.

Let’s talk about diversity in personality psychology

In the latest issue of the ARP newsletter, Kelci Harris writes about diversity in ARP. You should read the whole thing. Here’s an excerpt:

Personality psychology should be intrinsically interesting to everyone, because, well, everyone has a personality. It’s accessible and that makes our research so fun and an easy thing to talk about with non-psychologists, that is, once we’ve explained to them what we actually do. However, despite what could be a universal appeal, our field is very homogenous. And that’s too bad, because diversity makes for better science. Good research comes from observations. You notice something about the world, and you wonder why that is. It’s probably reasonable to guess that most members of our field have experienced the world in a similar way due to their similar demographic backgrounds. This similarity in experience presents a problem for research because it makes us miss things. How can assumptions be challenged when no one realizes they are being made? What kind of questions will people from different backgrounds have that current researchers could never think of because they haven’t experienced the world in that way?

 In response, Laura Naumann posted a letter to the ARP Facebook wall. Read it too. Another excerpt:

I challenge our field to begin to view those who conduct this type of research [on underrepresented groups] as contributing work that is EQUAL TO and AS IMPORTANT AS “traditional” basic research in personality and social psychology. First, this will require editors of “broad impact” journals to take a critical eye to their initial review process in evaluating what manuscripts are worthy of being sent out to reviewers. I’ve experienced enough frustration sending a solid manuscript to a journal only to have it quickly returned praising the work, but suggesting resubmission to a specialty journal (e.g., ethnic minority journal du jour). The message I receive is that my work is not interesting enough for broad dissemination. If we want a more welcoming field on the personal level, we need to model a welcoming field at the editorial level.

This is a discussion we need to be having. Big applause to Kelci and Laura for speaking out.

Now, what should we be doing? Read what Kelci and Laura wrote — they both have good ideas.

I’ll add a much smaller one, which came up in a conversation on my Facebook wall: let’s collect data. My impressions of what ARP conferences look like are very similar to Kelci’s, but not all important forms of diversity are visible, and if we had hard data we wouldn’t have to rely on impressions. How are the members and conference attendees of ARP and other personality associations distributed by racial and ethnic groups, gender, sexual orientation, national origin, socioeconomic background, and other important dimensions? How do those break down by career stage? And if we collect data over time, is better representation moving up the career ladder, or is the pipeline leaking? I hope ARP will consider collecting this data as part of the membership and conference registration processes going forward, and releasing aggregate numbers. (Maybe they already collect this, but if so, I cannot recall ever seeing any report of it.) With data we will have a better handle on what we’re doing well and what we could be doing better.

What else should we be doing — big or small? This is a conversation that is long overdue and that everybody should be involved in. Let’s have it.

An interesting study of why unstructured interviews are so alluring

A while back I wrote about whether grad school admissions interviews are effective. Following up on that, Sam Gosling recently passed along an article by Dana, Dawes, and Peterson from the latest issue of Judgment and Decision Making:

Belief in the unstructured interview: The persistence of an illusion

Unstructured interviews are a ubiquitous tool for making screening decisions despite a vast literature suggesting that they have little validity. We sought to establish reasons why people might persist in the illusion that unstructured interviews are valid and what features about them actually lead to poor predictive accuracy. In three studies, we investigated the propensity for “sensemaking” - the ability for interviewers to make sense of virtually anything the interviewee says—and “dilution” – the tendency for available but non-diagnostic information to weaken the predictive value of quality information. In Study 1, participants predicted two fellow students’ semester GPAs from valid background information like prior GPA and, for one of them, an unstructured interview. In one condition, the interview was essentially nonsense in that the interviewee was actually answering questions using a random response system. Consistent with sensemaking, participants formed interview impressions just as confidently after getting random responses as they did after real responses. Consistent with dilution, interviews actually led participants to make worse predictions. Study 2 showed that watching a random interview, rather than personally conducting it, did little to mitigate sensemaking. Study 3 showed that participants believe unstructured interviews will help accuracy, so much so that they would rather have random interviews than no interview. People form confident impressions even interviews are defined to be invalid, like our random interview, and these impressions can interfere with the use of valid information. Our simple recommendation for those making screening decisions is not to use them.

It’s an interesting study. In my experience people’s beliefs in unstructured interviews are pretty powerful — hard to shake even when you show them empirical evidence.

I did have some comments on the design and analyses:

1. In Studies 1 and 2, each subject made a prediction about absolute GPA for 1 interviewee. So estimates of how good people are at predicting GPA from interviews are based on entirely between-subjects comparisons. It is very likely that a substantial chunk of the variance in predictions will be due to perceiver variance — differences between subjects in their implicit assumptions about how GPA is distributed. (E.g., Subject 1 might assume most GPAs range from 3 to 4, whereas Subject 2 assumes most GPAs range from 2.3 to 3.3. So even if they have the same subjective impression of the same target — “this person’s going to do great this term” — their numerical predictions might differ by a lot.) That perceiver variance would go into the denominator as noise variance in this study, lowering the interviewers’ predictive validity correlations.

Whether that’s a good thing or a bad thing depends on what situation you’re trying to generalize to. Perceiver variance would contribute to errors in judgment when each judge makes an absolute decision about a single target. On the other hand, in some cases perceivers make relative judgments about several targets, such as when an employer interviews several candidates and picks the best one. In that setting, perceiver variance would not matter, and a study with this design could underestimate accuracy.

2. Study 1 had 76 interviewers spread across 3 conditions (n = 25 or 26 per condition), and only 7 interviewees (each of whom was rated by multiple interviewers). Based on 73 degrees of freedom reported for the test of the “dilution” effect, it looks like they treated interviewer as the unit of analysis but did not account for the dependency in interviewees. Study 2 looked to have similar issues (though in Study 2 the dilution effect was not significant.)

3. I also had concerns about power and precision of the estimates. Any inferences about who makes better or worse predictions will depend a lot on variance among the 7 interviewees whose GPAs were being predicted (8 interviewees in study 2). I haven’t done a formal power analysis, but my intuition is that that’s pretty small. You can see a possible sign of this in one key difference between the studies. In Study 1, the correlation between the interviewees’ prior GPA and upcoming GPA was r = .65, but in Study 2 it was r = .37. That’s a pretty big difference between estimates of a quantity that should not be changing between studies.

So it’s an interesting study but not one that can give answers I’d call definitive. If that’s well understood by readers of the study, I’m okay with that. Maybe someone will use the interesting ideas in this paper as a springboard for a larger followup. Given the ubiquity of unstructured interviews, it’s something we need to know more about.

The hotness-IQ tradeoff in academia

The other day I came across a blog post ranking academic fields by hotness. Important data for sure. But something about it was gnawing on me for a while, some connection I wasn’t quite making.

And then it hit me. The rankings looked an awful lot like another list I’d once seen of academic fields ranked by intelligence. Only, you know, upside-down.

Sure enough, when I ran the correlation among the fields that appear on both lists, it came out at r = -.45.

hotness-iq

I don’t know what this means, but it seems important. Maybe a mathematician or computer scientist can help me understand it.

 

The flawed logic of chasing large effects with small samples

“I don’t care about any effect that I need more than 20 subjects per cell to detect.”

I have heard statements to this effect a number of times over the years. Sometimes from the mouths of some pretty well-established researchers, and sometimes from people quoting the well-established researchers they trained under. The idea is that if an effect is big enough — perhaps because of its real-world importance, or because of the experimenter’s skill in isolating and amplifying the effect in the lab — then you don’t need a big sample to detect it.

When I have asked people why they think that, the reasoning behind it goes something like this. If the true effect is large, then even a small sample will have a reasonable chance of detecting it. (“Detecting” = rejecting the null in this context.) If the true effect is small, then a small sample is unlikely to reject the null. So if you only use small samples, you will limit yourself to detecting large effects. And if that’s all you care about detecting, then you’re fine with small samples.

On first consideration, that might sound reasonable, and even admirably aware of issues of statistical power. Unfortunately it is completely wrong. Some of the problems with it are statistical and logical. Others are more practical:

1. It involves a classic error in probabilistic thinking. In probability notation, you are incorrectly assuming that P(A|B) = P(B|A). In psychology jargon, you are guilty of base rate insensitivity.

Here’s why. A power analysis tells the following: for a given sample size, IF the effect is large (or small) THEN you will have a certain probability of rejecting the null. But that probability is not the same as the probability that IF you have rejected the null THEN the effect is large (or small). The latter probability depends both on power and on how common large and small effects are to begin with — what a statistician would call the prior probability, and a psychologist would call the base rate.

To put this in context, suppose an experimenter is working in an area where most effects are small, some are medium, and a few are large (which pretty well describes the field of social psychology as a whole). The experimenter does not know in advance which it is, of course. When it turns out that the experimenter has stumbled onto one of the occasional large effects, the test will probably be significant. But more often the true effect will be small. Some of those will be “missed” but there will be so many of them (relative to the number of experiments run) that they’ll end up being the majority.

Consider a simplified numerical example with just 2 possibilities. Suppose that 10% of experiments are chasing an effect that is so big there’s an 90% chance of detecting it (.90 power), and 90% of experiments are chasing smaller effects with a 40% chance. Out of 100 experiments, the experimenter will get 9 significant results from the large effects, and 36 significant results from the small effects. So most of the significant results (80% of them) will come from having gotten a little bit lucky with small effects, rather than having nailed a big effect.

(Of course, that simplified example assumes, probably too generously, that the null is never true or close-to-true. Moreover, with small samples the most common outcome — absent any p-hacking — will be that the results are not significant at all, even when there really is an effect.)

If a researcher really is only interested in identifying effects of a certain size, it might seem reasonable to calculate effect sizes. But with small samples, the researcher will greatly overestimate effect sizes in the subset of results that are significant, and the amount of bias will be the greatest when the true effects are small. That is because in order to show up as significant, those small (true) effects will need to have been helped along by sampling error, giving them a positive bias as a group. So the data won’t be much help in distinguishing truly large effects from the times when chance helped along a small one. They’ll look much the same.

2. Behind the “I only need small samples” argument is the idea that the researcher attaches some special interpretive value (theoretical meaning or practical significance) to effects that are at least some certain size. But if that is the case, then the researcher needs to adjust how he or she does hypothesis testing. In conventional NHST, the null hypothesis you are trying to rule out states that the effect is zero. But if you only care about medium-and-larger effects, then you need to go a step further and rule out the hypothesis that the effect is small. Which is entirely possible to do, and not that difficult mathematically. But in order to differentiate a medium effect from a small one, you need statistical power. n=20 per cell won’t cut it.

(This situation is in some ways the mirror opposite of when researchers say they only care about the direction of an effect, not its size. But even if nothing substantive is riding on the effect size, most effects turn out to be small in size, and the experiment is only worth doing if it is reasonably capable of detecting something.)

3. All of the above assumes optimal scientific practice — no researcher degrees of freedom, no publication bias, etc. In the real world, of course, things are rarely optimal.

When it comes to running studies with small samples, one of the biggest risks is engaging in data churning and capitalizing on chance — even inadvertently. Consider two researchers who are planning to run 2-condition experiments in which a significant result would be theoretically interesting and publishable. Both decide to dedicate the time and resources to running up to 200 subjects on the project before giving up and moving on. Researcher A runs one well-powered study with n=100 per condition. Researcher B decides to run the experiment with n=20 per condition. If it is significant, B will stop and write up the result; if it is not significant, B tweaks the procedure and tries again, up to 5 times.

Without realizing it, Researcher B is working with an effective Type I error rate that is about 23%.

Of course, Researcher B should also recognize that this is a cross-validation situation. By tweaking and repeating (or merely being willing to tweak and repeat if the result is not significant), B is engaged in exploratory research. To guard against capitalizing on chance, the variation on the experiment that finally “works” needs to be run again. That is actually true regardless of sample size or power. But the practical consequences of violating it are bigger and bigger as samples get smaller and smaller.

The upshot of all of this is pretty straightforward: you cannot talk your way out of running studies with adequate power. It does not matter if you only care about large effects, if you care about small effects too, or if you do not care about effect sizes at all. It doesn’t even matter if you focus on estimation rather than NHST (which I wholeheartedly support by the way) — you still need adequate samples. The only alternatives are (a) to live with a lot of ambiguous (nonsignificant) results until enough accrue to do a meta-analysis, or (b) or p-hack your way out of them.