The usability of statistics; or, what happens when you think that (p=.05) != (p=.06)

The difference between significant and not significant is not itself statistically significant.

That is the title of a 2006 paper by statisticians Andrew Gelman and Hal Stern. It is also the theme of a new review article in Nature Neuroscience by Sander Nieuwenhuis, Birte U. Forstmann, and Eric-Jan Wagenmakers (via Gelman’s blog). The review examined several hundred papers in behavioral, systems, and cognitive neuroscience. Of the papers that tried to compare two effects, about half made this error instead of properly testing for an interaction.

I don’t know how often the error makes it through to published papers in social and personality psychology, but I see it pretty regularly as a reviewer. I call it out when I see it; sometimes other reviewers do too, sometimes they don’t.

I can also remember making this error as a grad student – and my advisor correcting me on it. But the funny thing is, it’s not something I was taught. I’m quite sure that nowhere along the way did any of my teachers say you can compare two effects by seeing if one is significant and the other isn’t. I just started doing it on my own. (And now I sometimes channel my old advisor and correct my own students on the same error, and I’m sure nobody’s teaching it to them either.)

If I wasn’t taught to make this error, where was I getting it from? When we talk about whether researchers have biases, usually we think of hot-button issues like political bias. But I think this reflects a more straightforward kind of bias — old habits of thinking that we carry with us into our professional work. To someone without scientific training, it seems like you should be able to ask “Does X cause Y, yes or no?” and expect a straightforward answer. Scientific training teaches us a couple of things. First, the question is too simple: it’s not a yes-or-no question; the answer is always going to come with some uncertainty; etc. Second, the logic behind the tool that most of us use – null hypothesis significance testing (NHST) – does not even approximate the form of the question. (Roughly: “In a world where X has zero effect on Y, would we see a result at least this extreme less than 5% of the time?”)

So I think what happens is that when we are taught the abstract logic of what we are doing, it doesn’t really pervade our thinking until it’s been ground into us through repetition. For a period of time – maybe in some cases forever – we carry out the mechanics of what we have been taught to do (run an ANOVA) but we map it onto our old habits of thinking (“Does X cause Y, yes or no?”). And then we elaborate and extrapolate from those habits in ways that are entirely sensible by their own internal logic (“One ANOVA was significant and the other wasn’t, so X causes Y more than it causes Z, right?”).
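To make the contrast concrete, here is a toy simulation of the two ways of asking the question. It’s a sketch in Python; the sample size, effect sizes, and noise level are all invented for illustration, and none of it comes from the Nieuwenhuis et al. review.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 30
# Two made-up effects of similar size, measured with the same noise.
effect_a, effect_b = 0.60, 0.35
a = rng.normal(effect_a, 1.0, n)   # e.g., change scores under condition A
b = rng.normal(effect_b, 1.0, n)   # change scores under condition B

t_a, p_a = stats.ttest_1samp(a, 0.0)   # is effect A different from zero?
t_b, p_b = stats.ttest_1samp(b, 0.0)   # is effect B different from zero?
t_ab, p_ab = stats.ttest_ind(a, b)     # do A and B actually differ from each other?

print(f"effect A vs. zero: p = {p_a:.3f}")
print(f"effect B vs. zero: p = {p_b:.3f}")
print(f"A vs. B directly:  p = {p_ab:.3f}")

With numbers like these it is easy to end up with one p-value just under .05, another just over, and a direct comparison that is nowhere near significant. Declaring that “A works and B doesn’t” on the basis of the first two tests is exactly the move Gelman and Stern warn about; the third test (or, in a factorial design, the interaction term) is the one that actually answers the comparative question.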

One of the arguments you sometimes hear against NHST is that it doesn’t reflect the way researchers think. It’s a sort of usability argument: NHST is the butterfly ballot of statistical methods. In principle, I don’t think that argument carries the day on its own (if we need to use methods and models that don’t track our intuitions, we should). But it should be part of the discussion. And importantly, the Nieuwenhuis et al. review shows us how using unintuitive methods can have real consequences.

Why does an IRB need an analysis plan?

My IRB has updated its forms since the last time I submitted an application, and I just saw this section, which I think is new (emphasis added by me):

Analysis: Explain how the data will be analyzed or studied (i.e. quantitatively or qualitatively and what statistical tests you plan on using). Explain how the interpretation will address the research questions. (Attach a copy of the data collection instruments).

What statistical tests I plan on using?

My first thought was “mission creep,” but I want to keep an open mind. Are there some statistical tests that are more likely to do harm to the human subjects who provided the data? Has anybody ever been given syphilis by a chi-square test? If I do a median split, am I damaging anything more than my own credibility? (“What if there are an odd number of subjects? Are you going to have to saw a subject in half?”)

Seriously though, is there something I’m missing?

Error variance and humility

I often hear researchers criticize each other for treating important phenomena as error variance. For example, situationist social psychologists criticize trait researchers for treating situations as error variance, and vice versa. (And we interactionists get peeved at both.) The implication is that if you treat something as error variance, you are dismissing it as unimportant. And that’s often how the term is used. For example, during discussions of randomized experiments, students who are learning how experiments work will often wonder whether pre-existing individual differences could have affected the outcomes. A typical response is, “Oh, that couldn’t have driven the effects because of randomization. If there are any individual differences, they go into the error variance.” And therefore they get excluded from the explanation of the phenomenon.

I think we’d all be better off if we remembered that the word “error” refers to an error of a model or theory. On the first day of my grad school regression course, Chick Judd wrote on the board: “DATA = MODEL + ERROR”. A short while later he wrote “ERROR = DATA – MODEL.” Error is data that your model cannot explain. Its existence is a sign of the incompleteness of your model. Its ubiquity should be a constant reminder to all scientists to stay humble and open-minded.
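To put the same point in code, here is a minimal sketch (Python, with simulated data and invented variable names) of a model that treats an omitted individual difference as error variance:

import numpy as np

rng = np.random.default_rng(7)
n = 200
situation = rng.normal(size=n)   # a predictor the model includes
trait = rng.normal(size=n)       # a predictor the model leaves out
y = 0.5 * situation + 0.5 * trait + rng.normal(size=n)

# Fit y on situation only: DATA = MODEL + ERROR, with a MODEL that ignores trait.
X = np.column_stack([np.ones(n), situation])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
model = X @ beta
error = y - model                # ERROR = DATA - MODEL

# The omitted variable didn't vanish; it's sitting in the "error" term.
print("r(error, trait) =", round(float(np.corrcoef(error, trait)[0, 1]), 2))

The trait was not explained away by leaving it out of the model; it just got relabeled as error. That relabeling is a statement about the limits of the model, not about the world.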

When you have an interaction, which variable moderates which?

I was talking recently with a colleague about interpreting moderator effects, and the question came up: when you have a 2-way interaction between A and B, how do you decide whether to say that A moderates B versus B moderates A?

Mathematically, of course, A*B = B*A, so the underlying math is indifferent. I was schooled in the Baron and Kenny approach to moderation and mediation. I’ve never found any hard-and-fast rules in any of Kenny’s writing on the subject (if I’ve missed any, please let me know in the comments section). B&K talk about the moderator moderating the “focal” variable, and I’ve always taken that to be an interpretive choice by the researcher. If the researcher’s primary goal is to understand how A affects Y, and in the researcher’s mind B is some other interesting variable across which the A->Y relationship might vary, then B is the moderator. And vice versa. And to me, it’s entirely legitimate to talk about the same analysis in different ways — it’s a framing issue rather than a deep substantive issue.
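As a quick sanity check on the “the math is indifferent” part, here is a small sketch (Python, simulated data, made-up coefficients) fitting the same interaction model under both framings:

import numpy as np

rng = np.random.default_rng(3)
n = 500
A = rng.normal(size=n)
B = rng.normal(size=n)
Y = 0.4 * A + 0.2 * B + 0.3 * A * B + rng.normal(size=n)

def ols(y, *predictors):
    # least-squares fit with an intercept; returns the coefficient vector
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

focal_A = ols(Y, A, B, A * B)   # framed as "B moderates the A -> Y effect"
focal_B = ols(Y, B, A, B * A)   # framed as "A moderates the B -> Y effect"

print("interaction when A is focal:", round(float(focal_A[3]), 3))
print("interaction when B is focal:", round(float(focal_B[3]), 3))

The two interaction estimates are identical; all that changes is which simple slopes you would probably plot and how you would write up the result.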

However, my colleague has been trying to apply Kraemer et al.’s “MacArthur framework” and has been running into some problems. One of the MacArthur rules is that the variable you call the moderator (M) is the one that comes first, since (in their framework) the moderator always temporally precedes the treatment (T). But in my colleague’s study the ordering is not clear. (I believe that in my colleague’s study, the variables in question meet all of Kraemer’s other criteria for moderation — e.g., they’re uncorrelated — but they were measured at the same timepoint in a longitudinal study. Theoretically it’s not clear which one “would have” come first. Does it come down to which one came first in the questionnaire packet?)

I’ll admit that I’ve looked at Kraemer et al.’s writing on mediation/moderation a few times and it’s never quite resonated with me — they’re trying to make hard-and-fast rules for choosing between what, to me, seem like two legitimate alternative interpretations. (I also don’t really grok their argument that a significant interaction can sometimes be interpreted as mediation — unless it’s “mediated moderation” in Kenny-speak — but that’s a separate issue.) I’m curious how others deal with this issue…

Thinking hard

I’ve been enjoying William Cleveland’s The Elements of Graphing Data, a book I wish I’d discovered years ago. The following sentence jumped out at me:

No complete prescription can be designed to allow us to proceed mechanically and to relieve us of thinking hard. (p. 59)

The context was — well, it doesn’t matter what the context was. It’s a great encapsulation of what statistical teaching, mentoring, and consulting should be (teaching how to think hard) and cannot be (mechanical prescriptions).