“I don’t care about any effect that I need more than 20 subjects per cell to detect.”

I have heard statements to this effect a number of times over the years. Sometimes from the mouths of some pretty well-established researchers, and sometimes from people quoting the well-established researchers they trained under. The idea is that if an effect is big enough — perhaps because of its real-world importance, or because of the experimenter’s skill in isolating and amplifying the effect in the lab — then you don’t need a big sample to detect it.

When I have asked people why they think that, the reasoning behind it goes something like this. If the true effect is large, then even a small sample will have a reasonable chance of detecting it. (“Detecting” = rejecting the null in this context.) If the true effect is small, then a small sample is unlikely to reject the null. So if you only use small samples, you will limit yourself to detecting large effects. And if that’s all you care about detecting, then you’re fine with small samples.

On first consideration, that might sound reasonable, and even admirably aware of issues of statistical power. Unfortunately it is completely wrong. Some of the problems with it are statistical and logical. Others are more practical:

1. It involves a classic error in probabilistic thinking. In probability notation, you are incorrectly assuming that P(A|B) = P(B|A). In psychology jargon, you are guilty of base rate insensitivity.

Here’s why. A power analysis tells you the following: for a given sample size, IF the effect is large (or small) THEN you will have a certain probability of rejecting the null. But that probability is not the same as the probability that IF you have rejected the null THEN the effect is large (or small). The latter probability depends both on power and on how common large and small effects are to begin with, which is what a statistician would call the prior probability and a psychologist would call the base rate.

To put this in context, suppose an experimenter is working in an area where most effects are small, some are medium, and a few are large (which pretty well describes the field of social psychology as a whole). The experimenter does not know in advance which kind of effect a given study is chasing, of course. When it turns out that the experimenter has stumbled onto one of the occasional large effects, the test will probably be significant. But more often the true effect will be small. Many of those will be “missed,” but there are so many small effects to begin with that the ones that do reach significance will end up being the majority of the significant results.

Consider a simplified numerical example with just 2 possibilities. Suppose that 10% of experiments are chasing an effect that is so big there’s a 90% chance of detecting it (.90 power), and 90% of experiments are chasing smaller effects with a 40% chance of detection (.40 power). Out of 100 experiments, the experimenter will get 9 significant results from the large effects, and 36 significant results from the small effects. So most of the significant results (36 of 45, or 80% of them) will come from having gotten a little bit lucky with small effects, rather than having nailed a big effect.

(Of course, that simplified example assumes, probably too generously, that the null is never true or close-to-true. Moreover, with small samples the most common outcome — absent any p-hacking — will be that the results are not significant at all, even when there really is an effect.)
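The arithmetic in the two-possibility example can be checked with a few lines of Python. This is only a sketch of the base-rate calculation; the function name `share_of_significant` is made up for illustration, and the inputs are the hypothetical 10%/90% base rates and .90/.40 power values from the example, not estimates from real data.

```python
def share_of_significant(base_rates, powers):
    """Split the significant results by which kind of effect produced them.

    base_rates: proportion of experiments chasing each kind of effect
    powers:     probability of rejecting the null for each kind of effect
    Assumes (as the example does) that the null is never true.
    """
    # Expected proportion of all experiments that are significant, per kind
    hits = [rate * power for rate, power in zip(base_rates, powers)]
    total = sum(hits)
    # P(kind of effect | significant), i.e. Bayes' rule
    return [hit / total for hit in hits]


# Illustrative numbers from the example: large effects first, small second
base_rates = [0.10, 0.90]
powers = [0.90, 0.40]
shares = share_of_significant(base_rates, powers)

print(f"Significant results from large effects: {shares[0]:.0%}")
print(f"Significant results from small effects: {shares[1]:.0%}")
```

Changing the base rates or the power values shows how quickly the mix of significant results shifts, which is the point of the P(A|B) versus P(B|A) distinction.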

If a researcher really is only interested in identifying effects of a certain size, it might seem reasonable to calculate effect sizes. But with small samples, the researcher will greatly overestimate effect sizes in the subset of results that are significant, and the amount of bias will be the greatest when the true effects are small. That is because in order to show up as significant, those small (true) effects will need to have been helped along by sampling error, giving them a positive bias as a group. So the data won’t be much help in distinguishing truly large effects from the times when chance helped along a small one. They’ll look much the same.
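A small simulation can illustrate this inflation. What follows is a rough sketch under assumptions of my own choosing: two groups of n=20 drawn from normal distributions with SD 1, a true standardized difference of either 0.2 ("small") or 0.8 ("large"), and a hard-coded critical t of 2.024 (two-tailed .05 with 38 degrees of freedom). Only results that are significant in the predicted direction are kept.

```python
import random
import statistics


def simulate_significant_d(true_d, n=20, reps=2000, t_crit=2.024, seed=1):
    """Return observed Cohen's d for each simulated study that reached
    significance in the predicted direction.

    t_crit=2.024 is the two-tailed .05 critical value for df = 38,
    so it only applies to the default n=20 per group.
    """
    rng = random.Random(seed)
    significant = []
    for _ in range(reps):
        treatment = [rng.gauss(true_d, 1.0) for _ in range(n)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Pooled SD for equal group sizes
        pooled_sd = ((statistics.variance(treatment)
                      + statistics.variance(control)) / 2) ** 0.5
        d = (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd
        # For equal n, the two-sample t statistic equals d * sqrt(n / 2)
        t = d * (n / 2) ** 0.5
        if t > t_crit:
            significant.append(d)
    return significant


sig_small = simulate_significant_d(true_d=0.2)
sig_large = simulate_significant_d(true_d=0.8)

for label, true_d, sig in [("small", 0.2, sig_small), ("large", 0.8, sig_large)]:
    print(f"true d = {true_d} ({label}): "
          f"{len(sig)} significant, mean observed d = {statistics.mean(sig):.2f}")
```

With n=20 per group, no observed d below about 0.64 can clear the significance threshold, so every significant "small" effect is necessarily reported as at least three times its true size in this setup.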

2. Behind the “I only need small samples” argument is the idea that the researcher attaches some special interpretive value (theoretical meaning or practical significance) to effects that are at least some certain size. But if that is the case, then the researcher needs to adjust how he or she does hypothesis testing. In conventional NHST, the null hypothesis you are trying to rule out states that the effect is zero. But if you only care about medium-and-larger effects, then you need to go a step further and rule out the hypothesis that the effect is small. Which is entirely possible to do, and not that difficult mathematically. But in order to differentiate a medium effect from a small one, you need statistical power. n=20 per cell won’t cut it.
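To get a feel for how much power that kind of test needs, here is a back-of-the-envelope sketch. It assumes a one-sided test of the null "d is no bigger than 0.2" against a true effect of d = 0.5, and it uses a normal approximation in which the standard error of d is sqrt(2/n). The specific values 0.2 and 0.5 are conventional labels for "small" and "medium" that I am plugging in for illustration, and the function name is hypothetical.

```python
from statistics import NormalDist


def power_vs_small_null(true_d, null_d, n_per_cell, alpha=0.05):
    """Approximate power to reject H0: d <= null_d when the true effect is true_d.

    Normal approximation for a two-group design with equal cell sizes;
    a rough guide, not an exact noncentral-t calculation.
    """
    se = (2 / n_per_cell) ** 0.5              # approximate standard error of d
    z_crit = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    return 1 - NormalDist().cdf(z_crit - (true_d - null_d) / se)


power_n20 = power_vs_small_null(true_d=0.5, null_d=0.2, n_per_cell=20)
power_n200 = power_vs_small_null(true_d=0.5, null_d=0.2, n_per_cell=200)

print(f"n = 20 per cell:  power is about {power_n20:.2f}")
print(f"n = 200 per cell: power is about {power_n200:.2f}")
```

Under these assumptions, n=20 per cell gives roughly a one-in-four chance of distinguishing a medium effect from a small one, and it takes something on the order of 200 per cell to get above .90.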

(This situation is in some ways the mirror opposite of when researchers say they only care about the direction of an effect, not its size. But even if nothing substantive is riding on the effect size, most effects turn out to be small in size, and the experiment is only worth doing if it is reasonably capable of detecting something.)

3. All of the above assumes optimal scientific practice — no researcher degrees of freedom, no publication bias, etc. In the real world, of course, things are rarely optimal.

When it comes to running studies with small samples, one of the biggest risks is engaging in data churning and capitalizing on chance, even inadvertently. Consider two researchers who are planning to run 2-condition experiments in which a significant result would be theoretically interesting and publishable. Both decide to dedicate the time and resources to running up to 200 subjects on the project before giving up and moving on. Researcher A runs one well-powered study with n=100 per condition. Researcher B decides to run the experiment with n=20 per condition. If it is significant, B will stop and write up the result; if it is not significant, B tweaks the procedure and tries again, for up to 5 attempts in total (5 attempts at 40 subjects each uses up the 200).

Without realizing it, Researcher B is working with an effective Type I error rate that is about 23%.
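That figure comes from the standard calculation for repeated independent tests. The sketch below assumes the null is true in every attempt, that the 5 attempts are independent, and that each is tested at alpha = .05; the function name is just for illustration.

```python
def familywise_error(alpha, attempts):
    """Probability of at least one false positive across independent
    attempts, each tested at the given alpha, when the null is true."""
    return 1 - (1 - alpha) ** attempts


rate_a = familywise_error(alpha=0.05, attempts=1)  # Researcher A: one study
rate_b = familywise_error(alpha=0.05, attempts=5)  # Researcher B: up to 5 tries

print(f"Researcher A: {rate_a:.1%}")
print(f"Researcher B: {rate_b:.1%}")
```

Researcher A's single study keeps the nominal 5% rate, while Researcher B's stop-when-significant strategy lands at about 23%.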

Of course, Researcher B should also recognize that this is a cross-validation situation. By tweaking and repeating (or merely being willing to tweak and repeat if the result is not significant), B is engaged in exploratory research. To guard against capitalizing on chance, the variation on the experiment that finally “works” needs to be run again. That is actually true regardless of sample size or power. But the practical consequences of violating it are bigger and bigger as samples get smaller and smaller.

The upshot of all of this is pretty straightforward: you cannot talk your way out of running studies with adequate power. It does not matter if you only care about large effects, if you care about small effects too, or if you do not care about effect sizes at all. It doesn’t even matter if you focus on estimation rather than NHST (which I wholeheartedly support, by the way); you still need adequate samples. The only alternatives are (a) to live with a lot of ambiguous (nonsignificant) results until enough accrue to do a meta-analysis, or (b) to p-hack your way out of them.