The flawed logic of chasing large effects with small samples

“I don’t care about any effect that I need more than 20 subjects per cell to detect.”

I have heard statements to this effect a number of times over the years. Sometimes from the mouths of some pretty well-established researchers, and sometimes from people quoting the well-established researchers they trained under. The idea is that if an effect is big enough — perhaps because of its real-world importance, or because of the experimenter’s skill in isolating and amplifying the effect in the lab — then you don’t need a big sample to detect it.

When I have asked people why they think that, the reasoning behind it goes something like this. If the true effect is large, then even a small sample will have a reasonable chance of detecting it. (“Detecting” = rejecting the null in this context.) If the true effect is small, then a small sample is unlikely to reject the null. So if you only use small samples, you will limit yourself to detecting large effects. And if that’s all you care about detecting, then you’re fine with small samples.

On first consideration, that might sound reasonable, and even admirably aware of issues of statistical power. Unfortunately it is completely wrong. Some of the problems with it are statistical and logical. Others are more practical:

1. It involves a classic error in probabilistic thinking. In probability notation, you are incorrectly assuming that P(A|B) = P(B|A): in this case, treating the probability of a significant result given a large effect as if it were the probability of a large effect given a significant result. In psychology jargon, you are guilty of base rate insensitivity.

Here’s why. A power analysis tells you the following: for a given sample size, IF the effect is large (or small) THEN you will have a certain probability of rejecting the null. But that probability is not the same as the probability that IF you have rejected the null THEN the effect is large (or small). The latter probability depends both on power and on how common large and small effects are to begin with — what a statistician would call the prior probability, and a psychologist would call the base rate.

To put this in context, suppose an experimenter is working in an area where most effects are small, some are medium, and a few are large (which pretty well describes the field of social psychology as a whole). The experimenter does not know in advance which it is, of course. When the experimenter stumbles onto one of the occasional large effects, the test will probably be significant. But more often the true effect will be small. Many of those small effects will be “missed” (nonsignificant), yet small effects are so common relative to large ones that the ones that do reach significance will end up making up the majority of the experimenter’s significant results.

Consider a simplified numerical example with just two possibilities. Suppose that 10% of experiments are chasing an effect that is so big there’s a 90% chance of detecting it (.90 power), and 90% of experiments are chasing smaller effects with only a 40% chance (.40 power). Out of 100 experiments, the experimenter can expect 9 significant results from the large effects and 36 significant results from the small effects. So most of the significant results (80% of them) will come from having gotten a little bit lucky with small effects, rather than from having nailed a big effect.
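To make the arithmetic explicit, here is a minimal sketch of the same calculation in Python (the numbers are just the ones from the example above):

```python
# Base-rate arithmetic from the example above: 100 experiments,
# 10% chasing a large effect (.90 power), 90% chasing smaller effects (.40 power).
n_experiments = 100
n_large, power_large = 0.10 * n_experiments, 0.90
n_small, power_small = 0.90 * n_experiments, 0.40

sig_from_large = n_large * power_large   # 9 expected significant results
sig_from_small = n_small * power_small   # 36 expected significant results

share_from_small = sig_from_small / (sig_from_large + sig_from_small)
print(f"Expected significant results: {sig_from_large + sig_from_small:.0f}")
print(f"Share coming from small effects: {share_from_small:.0%}")  # 80%
```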

(Of course, that simplified example assumes, probably too generously, that the null is never true or close-to-true. Moreover, with small samples the most common outcome — absent any p-hacking — will be that the results are not significant at all, even when there really is an effect.)

If a researcher really is only interested in identifying effects of a certain size, it might seem reasonable to calculate effect sizes. But with small samples, the researcher will greatly overestimate effect sizes in the subset of results that are significant, and the bias will be greatest when the true effects are small. That is because in order to show up as significant, those small (true) effects will need to have been helped along by sampling error, giving them a positive bias as a group. So the data won’t be much help in distinguishing truly large effects from the times when chance helped along a small one. They’ll look much the same.
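A quick simulation illustrates how lopsided this gets. This is just a sketch under assumed numbers (a true standardized effect of d = 0.2, n = 20 per cell, two-tailed t-tests at .05); none of these particulars come from a real study:

```python
# Rough simulation: with small samples, the significant subset of results
# greatly overestimates a small true effect.
# Illustrative assumptions: true d = 0.2, n = 20 per cell, alpha = .05 (two-tailed).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_d, n, n_sims = 0.2, 20, 20_000

d_all, d_sig = [], []
for _ in range(n_sims):
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(true_d, 1.0, n)
    result = stats.ttest_ind(treatment, control)
    pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
    d_hat = (treatment.mean() - control.mean()) / pooled_sd  # estimated Cohen's d
    d_all.append(d_hat)
    if result.pvalue < 0.05:
        d_sig.append(d_hat)

print(f"True d: {true_d}")
print(f"Mean estimated d, all experiments:       {np.mean(d_all):.2f}")  # about 0.20
print(f"Mean estimated d, significant ones only: {np.mean(d_sig):.2f}")  # roughly 0.7
```

Under those assumptions, the significant subset typically looks like a medium-to-large effect even though the true effect is small.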

2. Behind the “I only need small samples” argument is the idea that the researcher attaches some special interpretive value (theoretical meaning or practical significance) to effects that are at least some certain size. But if that is the case, then the researcher needs to adjust how he or she does hypothesis testing. In conventional NHST, the null hypothesis you are trying to rule out states that the effect is zero. But if you only care about medium-and-larger effects, then you need to go a step further and rule out the hypothesis that the effect is small. Which is entirely possible to do, and not that difficult mathematically. But in order to differentiate a medium effect from a small one, you need statistical power. n=20 per cell won’t cut it.
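To get a feel for how much power that takes, here is a rough sketch of a minimum-effect power calculation. The specific numbers are my own illustration (a true effect of d = 0.5, a smallest-interesting effect of d = 0.2, a one-sided test at alpha = .05 in a two-group design), and the calculation uses the noncentral t distribution:

```python
# Power to reject H0: d <= d_null (a "minimum-effect" null) in a two-sample design,
# computed from the noncentral t distribution. Illustrative numbers only.
import numpy as np
from scipy import stats

def min_effect_power(n_per_cell, d_true=0.5, d_null=0.2, alpha=0.05):
    df = 2 * n_per_cell - 2
    scale = np.sqrt(n_per_cell / 2)  # converts a standardized effect d to a noncentrality parameter
    # Critical value: the usual two-sample t statistic must exceed this to reject d <= d_null
    t_crit = stats.nct.ppf(1 - alpha, df, d_null * scale)
    # Power: probability of exceeding that critical value when the true effect is d_true
    return stats.nct.sf(t_crit, df, d_true * scale)

for n in (20, 50, 100, 200, 400):
    print(f"n = {n:>3} per cell: power to rule out d <= 0.2 is {min_effect_power(n):.2f}")
```

With these assumptions, n = 20 per cell gives power somewhere around .2 even though the true effect is medium; power does not climb toward .9 until the samples are in the neighborhood of 200 per cell.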

(This situation is in some ways the mirror opposite of when researchers say they only care about the direction of an effect, not its size. But even when nothing substantive is riding on the effect size, most effects turn out to be small, and the experiment is only worth doing if it is reasonably capable of detecting something.)

3. All of the above assumes optimal scientific practice — no researcher degrees of freedom, no publication bias, etc. In the real world, of course, things are rarely optimal.

When it comes to running studies with small samples, one of the biggest risks is engaging in data churning and capitalizing on chance — even inadvertently. Consider two researchers who are planning to run 2-condition experiments in which a significant result would be theoretically interesting and publishable. Both decide to dedicate the time and resources to running up to 200 subjects on the project before giving up and moving on. Researcher A runs one well-powered study with n=100 per condition. Researcher B decides to run the experiment with n=20 per condition. If it is significant, B will stop and write up the result; if it is not significant, B tweaks the procedure and tries again, up to 5 times.

Without realizing it, Researcher B is working with an effective Type I error rate that is about 23%.
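For the curious, here is where that figure comes from, assuming the worst case in which the null is true and the five tries behave like independent tests at alpha = .05:

```python
# Effective false-positive rate for up to 5 tries at alpha = .05,
# stopping at the first significant result (true null, roughly independent tries).
alpha, max_tries = 0.05, 5
effective_alpha = 1 - (1 - alpha) ** max_tries
print(f"Effective Type I error rate: {effective_alpha:.3f}")  # about 0.226
```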

Of course, Researcher B should also recognize that this is a cross-validation situation. By tweaking and repeating (or merely being willing to tweak and repeat if the result is not significant), B is engaged in exploratory research. To guard against capitalizing on chance, the variation on the experiment that finally “works” needs to be run again on a fresh sample. That is true regardless of sample size or power, but the practical consequences of skipping that confirmation grow as samples get smaller.

The upshot of all of this is pretty straightforward: you cannot talk your way out of running studies with adequate power. It does not matter if you only care about large effects, if you care about small effects too, or if you do not care about effect sizes at all. It doesn’t even matter if you focus on estimation rather than NHST (which I wholeheartedly support, by the way) — you still need adequate samples. The only alternatives are (a) to live with a lot of ambiguous (nonsignificant) results until enough accrue to do a meta-analysis, or (b) to p-hack your way out of them.

Do people know how much power and status they have?

Do you know how much power and status you have in the important social situations in your life? Cameron Anderson and I have a chapter coming out in a few months looking at that question. The chapter is titled “Accurate When It Counts: Perceiving Power and Status in Social Groups.” (It draws in part on an earlier empirical paper we did together.) The part before the colon probably gives away a little bit of the answer. We present a case that most people, much of the time, are pretty good at perceiving their own and others’ power and status. (Better than they are at perceiving likability or personality traits.)

You can read the chapter if you want to see where the main point is coming from. I just want to briefly comment on a preliminary issue we had to work through along the way…

One of the fun things about writing this paper was working out what it means to be accurate in perceiving power and status. Accuracy has a long and challenging history in social perception research. How do you quantify how well somebody knows somebody else’s (or their own) likability, extraversion, morality, or — in our case — power or status?

We started by creating working definitions of power and status. What became clear along the way is that the accuracy question gets answered differently for power than for status because of the different definitions. For power, we adopted Susan Fiske’s definition that power is asymmetric outcome control (in a nutshell, Person A has power over Person B if A has control over B’s valued outcomes). Status we defined as respect and influence in the eyes of others.

Drawing on those definitions, here’s what we say about how to define accuracy in perceiving power:

The outcome-control framework is useful for studying perceptions. Outcome control is a structural property of relationships that does not depend on any person’s construal of a situation. Thus, one person may have power over another person even if one or both people do not realize it at a given time. (For example, a late-night TV host and the female intern he dates might both think about their relationship in purely romantic terms, but the fact that the host makes decisions about the intern’s salary and career advancement means that he has power over her). Because the outcome-control framework separates psychological processes such as the perception of power from power per se, it is conceptually coherent to ask questions about the accuracy of perceptions.

And here’s how accuracy is different for status:

Like power, status is a feature of a relationship (Fiske & Berdahl, 2007). Like power, status may vary from one situation to another. And like with power, it is possible for a single individual to misperceive her own status or the status of another person. However, because status is about respect and prestige in the eyes of others, at its core it involves collective perceptions – that is, status is a component of reputation. Thus status is socially constructed in a different and perhaps more fundamental way than power. Whereas it might make sense to say that an individual has power but nobody knows it, it would not make sense to say the same about status. This gives status a complicated but necessary relation to interpersonal perceptions, which will become important when we consider what it means to be accurate in perceiving status.

On a side note: egads, am I becoming a social constructivist?

Reference:

Srivastava, S., & Anderson, C. (in press). Accurate when it counts: Perceiving power and status in social groups. In J. L. Smith, W. Ickes, J. Hall, S. D. Hodges, & W. Gardner (Eds.), Managing interpersonal sensitivity: Knowing when—and when not—to understand others.