Statistics as math, statistics as tools


How do you think about statistical methods in science? Are statistics a matter of math and logic? Or are they a useful tool? Over time, I have noticed that these questions correspond to two implicit frames for thinking about statistics. Both are useful, but they tend to be more common in different research communities. And I think conversations sometimes get off track when people are using different ones.

Frame 1 is statistics as math and logic. I think many statisticians and quantitative psychologists work under this frame. Their goal is to understand statistical methods, and statistics are based on math and logic. In math and logic, things are absolute and provable. (Even in statistics, which deals with uncertainty, the uncertainty is almost always quantifiable, and thus subject to analysis.) In math and logic, exceptions and boundary cases are important. If I say “All A are B” and you disagree with me, all you need to do is show me one instance of an A that is not B and you’re done.

In the realm of statistics, that can mean either proving or demonstrating that a method breaks down under some conditions. A good example of this is E. J. Wagenmakers et al.’s recent demonstration that using intervals to do hypothesis testing is wrong. Many people (including me) have assumed that if the 95% confidence interval of a parameter excludes 0, that’s the same as falsifying the hypothesis “parameter = 0.” E. J. and colleagues show an instance where this isn’t true — that is, where the data are uninformative about a hypothesis, but the interval would lead you to believe you had evidence against it. In the example, a researcher is testing a hypothesis about a binomial probability and has a single observation. So the demonstrated breakdown occurs at N = 1, which is theoretically interesting but not a common scenario in real-world research applications.
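As a decision rule, the equivalence that many people rely on does hold in the textbook case. The sketch below assumes a normal mean with a known standard deviation, so the interval is a z-interval and the test is a z-test. It illustrates that duality only. It is not a reproduction of Wagenmakers et al.’s binomial example, and the simulation settings (n = 20, true mean 0.3) are arbitrary.

```python
import math
import random

# Two-sided 5% critical value of the standard normal distribution.
Z_CRIT = 1.959963984540054


def z_interval(xs, sigma):
    """95% confidence interval for the mean, assuming sigma is known."""
    n = len(xs)
    mean = sum(xs) / n
    half_width = Z_CRIT * sigma / math.sqrt(n)
    return mean - half_width, mean + half_width


def z_test_rejects_zero(xs, sigma):
    """Two-sided z-test of H0: mean = 0 at alpha = .05."""
    n = len(xs)
    z = (sum(xs) / n) / (sigma / math.sqrt(n))
    return abs(z) > Z_CRIT


def agreement_rate(n_datasets=10_000, n=20, sigma=1.0, true_mean=0.3, seed=1):
    """Share of simulated datasets where 'interval excludes 0' and 'test rejects' agree."""
    rng = random.Random(seed)
    agree = 0
    for _ in range(n_datasets):
        xs = [rng.gauss(true_mean, sigma) for _ in range(n)]
        lo, hi = z_interval(xs, sigma)
        interval_excludes_zero = lo > 0 or hi < 0
        agree += interval_excludes_zero == z_test_rejects_zero(xs, sigma)
    return agree / n_datasets
```

With the default settings, agreement_rate() should return 1.0. Here “the interval excludes 0” and “|z| > 1.96” are algebraically the same event. That only shows the two decision rules coincide. It says nothing about whether the data are informative about the hypothesis, and that is the gap the N = 1 example targets.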

Frame 2 is statistics as a tool. I think many scientists work under this frame. The scientist’s goal is to understand the natural world, and statistics are a tool that you use as part of the research process. Scientists are pragmatic about tools. None of our tools are perfect – lab equipment generates noisy observations and can break down, questionnaires are only good for some populations, etc. Better tools are better, of course, but since they’re never perfect, at some point we have to decide they’re good enough so we can get out and use them.

Viewing statistics as a tool means that you care whether or not something works well enough under the conditions in which you are likely to use it. A good example of a tool-frame analysis of statistics is Judd, Westfall, and Kenny’s demonstration that traditional repeated-measures ANOVA fails to account for the sampling of stimuli in many within-subjects designs, and that multilevel modeling with random effects is necessary to correctly model those effects. Judd et al. demonstrate this with data from their own published studies, showing in some cases that they themselves would have (and should have) reached different scientific conclusions.
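The toy simulation below shows why stimulus sampling matters. It does not use Judd et al.’s data or their analysis, and every parameter value in it is made up. Each condition gets its own random sample of stimuli, and there is no true condition effect. The data are then analyzed the traditional way. Responses are averaged over stimuli within each participant and condition, and a by-participant paired t-test is run on those averages. With two conditions, that test is equivalent to a repeated-measures ANOVA. Every participant sees the same two stimulus sets, so the chance difference between those sets never averages out.

```python
import math
import random
import statistics

# Two-sided 5% critical t for df = 29; valid only for the default 30 participants.
T_CRIT_DF29 = 2.045


def simulate_once(rng, n_participants=30, n_stimuli=10, stim_sd=0.5, noise_sd=1.0):
    """One null study: stimuli nested in condition, true condition effect = 0."""
    stim_a = [rng.gauss(0, stim_sd) for _ in range(n_stimuli)]  # stimulus intercepts, condition A
    stim_b = [rng.gauss(0, stim_sd) for _ in range(n_stimuli)]  # stimulus intercepts, condition B
    diffs = []
    for _ in range(n_participants):
        subj = rng.gauss(0, 1.0)  # participant intercept; cancels in the A - B difference
        mean_a = statistics.fmean(subj + s + rng.gauss(0, noise_sd) for s in stim_a)
        mean_b = statistics.fmean(subj + s + rng.gauss(0, noise_sd) for s in stim_b)
        diffs.append(mean_a - mean_b)
    # By-participant paired t-test that treats the stimuli as fixed.
    t = statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(n_participants))
    return abs(t) > T_CRIT_DF29


def false_positive_rate(n_sims=1000, seed=2, **kwargs):
    """Fraction of null studies in which the by-participant test 'finds' an effect."""
    rng = random.Random(seed)
    return sum(simulate_once(rng, **kwargs) for _ in range(n_sims)) / n_sims
```

With these settings, false_positive_rate() should come out far above .05. false_positive_rate(stim_sd=0.0) should sit near .05, because with no stimulus variability the by-participant test is exactly right. The remedy the post describes, a model with random effects for stimuli as well as participants, is not shown here.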

I suspect that this difference in frames relates to the communication gap that Donald Sharpe identified between statisticians and scientists. (Sharpe’s paper is well worth a read, whether you’re a statistician or a scientist.) Statistical discoveries and innovations often die a lonely death in Psychological Methods because quants prove something under Frame 1 but do not take the next step of demonstrating that it matters under Frame 2, so scientists don’t adopt it. (To be clear, I don’t think the stats people always stay in Frame 1 – as Sharpe points out, some of the most cited papers are in Psychological Methods too. Many of them are ones that speak to both frames.)

I also wonder if this might contribute to the prevalence of less-than-optimal research practices (LTORPs, which include the things sometimes labeled p-hacking or questionable research practices / QRPs). I’m sure plenty of scientists really have (had?) no idea that flexible stopping rules, trying analyses with and without a covariate to see which way works better, etc. are a problem. But I bet others have some general sense that LTORPs are not theoretically correct, perhaps because their first-year grad stats instructors told them so (probably in a Frame 1-ey way). But they also know, perhaps because they have been told by the very same statistics instructors, that there are plenty of statistical practices that are technically wrong but not a big deal (e.g., some departures from distributional assumptions). Tools don’t have to be perfect; they just have to work for the problem at hand. Specifically, I suspect that for a long time, many scientists’ attitude has been that p-values do not have to be theoretically correct; they just have to lead people to make enough right decisions enough of the time. Take them seriously but not that seriously. So when faced with a situation that they haven’t been taught the exact tools for, scientists weigh the problem as best they can. Sometimes they tell themselves, rightly or wrongly, that what they’re doing is good enough, and they do it.
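Here is a minimal sketch of why a flexible stopping rule is a problem. The simplifying assumptions are chosen for illustration: normally distributed data with a known SD of 1, a one-sample z-test of mean = 0, and a true mean of exactly 0, so every rejection is a false positive. The peeking researcher tests after every 10 observations, up to 100, and stops as soon as p < .05.

```python
import math
import random

# Two-sided 5% critical value of the standard normal distribution.
Z_CRIT = 1.959963984540054


def peeking_rejects(rng, looks):
    """Collect null data (mean 0, sd 1), z-test at each sample size in `looks`,
    and stop at the first p < .05. Returns True if any look 'finds' an effect."""
    total, n = 0.0, 0
    for target in looks:
        while n < target:
            total += rng.gauss(0, 1)
            n += 1
        z = total / math.sqrt(n)  # (total / n) / (1 / sqrt(n)) simplifies to this
        if abs(z) > Z_CRIT:
            return True
    return False


def type_i_error(looks, n_sims=4000, seed=3):
    """Estimated false-positive rate of the whole peek-and-stop procedure."""
    rng = random.Random(seed)
    return sum(peeking_rejects(rng, looks) for _ in range(n_sims)) / n_sims
```

type_i_error((100,)) should land near the nominal .05. type_i_error(tuple(range(10, 101, 10))) comes out several times higher; the known analytic value for 10 equally spaced looks is about .19. Each individual test is “correct.” The problem is the stopping rule wrapped around them.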

Sharpe makes excellent points about why there is a communication gap and what to do about it. I hope the two-frames notion complements that. Scientists have to make progress with limited resources, which means they are constantly making implicit (and sometimes explicit) cost-benefit calculations. If adopting a statistical innovation will require time up front to learn it and perhaps additional time each time you implement it, researchers will ask themselves if it is worth it. Of all the things I need or want to do (writing grants, writing papers, training grad students, running experiments, publishing papers), how much less of the other stuff will I be able to do if I put the time and resources into this? Will this help me do better at my goal of discovering things about the natural world (which is different from the statistician’s goal of figuring out new things about statistics), or is this just a lot of headache that’ll lead me to mostly the same place the simpler way would?

I have a couple of suggestions for better dialogue and progress on both sides. One is that we need to recognize that the two frames come from different sets of goals – statisticians want to understand statistics, scientists want to understand the natural world. Statisticians should go beyond showing that something is provably wrong or right, and address whether it is consequentially wrong or right. One person’s provable error is another person’s reasonable approximation. And scientists should consult statisticians about real-world consequences of their decisions. As much as possible, don’t assume something is good enough; verify it.
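One way to verify a familiar approximation is a quick simulation under the conditions you expect to face. This sketch estimates how often a one-sample t-test falsely rejects when the data are normal and when they are strongly skewed. The true mean is 0 in both cases. The distributions, sample size, and number of simulations are arbitrary choices.

```python
import math
import random
import statistics

N = 30               # sample size per simulated study
T_CRIT_DF29 = 2.045  # two-sided 5% critical t value for df = N - 1 = 29


def t_test_rejects(xs):
    """Two-sided one-sample t-test of H0: mean = 0 at alpha = .05 (expects len(xs) == N)."""
    t = statistics.fmean(xs) / (statistics.stdev(xs) / math.sqrt(len(xs)))
    return abs(t) > T_CRIT_DF29


def normal_draw(rng):
    return rng.gauss(0.0, 1.0)


def skewed_draw(rng):
    # Exponential(1) has mean 1, so subtracting 1 makes the null hypothesis true.
    return rng.expovariate(1.0) - 1.0


def null_rejection_rate(draw, n_sims=4000, seed=4):
    """Estimated Type I error rate of the t-test when the data come from `draw`."""
    rng = random.Random(seed)
    rejections = sum(t_test_rejects([draw(rng) for _ in range(N)]) for _ in range(n_sims))
    return rejections / n_sims
```

Compare null_rejection_rate(normal_draw) with null_rejection_rate(skewed_draw). If the skewed rate stays close to .05 at the sample sizes and distributions you actually work with, the approximation is plausibly good enough for that use. If it drifts, that is worth knowing before relying on it.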

Scientists also need usable tools to solve their problems: both conceptual tools and more concrete things like software and procedures. So scientists and statisticians need to talk more. I think data peeking is a good example of this. To a statistician, setting sample size a priori probably seems like a decent assumption. To a scientist who has just spent two years and six figures of grant money on a study and arrived at suggestive but not conclusive results (a.k.a. p = .11), it is laughable to suggest setting aside that dataset and starting from scratch with a larger sample. Suppose you think your only choices are to do that or to run a few more subjects and redo the analysis. If you also think the second option is just a minor fudge (“good enough”), you’re probably going to run the subjects. Sequential analyses solve that problem. They have been around for a while, but they languished in a small corner of the clinical trials literature, where there was a pressing ethical reason to use them. Now that scientists are realizing they exist and can solve a wider range of problems, sequential analyses are starting to get much wider attention. They probably should be integrated even more into the data-analytic frameworks (and software) for expensive research areas, like fMRI.
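The post doesn’t name a specific sequential design. This is a sketch of one standard option, a Pocock-style group-sequential test with a single interim analysis at half the planned sample; that design choice is an assumption, not something from the post. The code uses simulation to estimate the single z cutoff, used at both looks, that keeps the overall false-positive rate at .05. It then compares that cutoff with naively reusing 1.96 twice. The key fact it relies on is that the interim data are part of the final data. That makes the two z-statistics correlated, with correlation sqrt(1/2).

```python
import math
import random


def max_abs_z_two_looks(rng):
    """Null z-statistics at the interim look (half the data) and the final look.
    Z2 = sqrt(1/2) * Z1 + sqrt(1/2) * (z from the second half), so corr = sqrt(1/2)."""
    z1 = rng.gauss(0, 1)
    z2 = math.sqrt(0.5) * z1 + math.sqrt(0.5) * rng.gauss(0, 1)
    return max(abs(z1), abs(z2))


def pocock_constant(alpha=0.05, n_sims=200_000, seed=5):
    """Monte Carlo estimate of the cutoff c, used at both looks, such that
    P(|Z1| > c or |Z2| > c) = alpha when the null hypothesis is true."""
    rng = random.Random(seed)
    draws = sorted(max_abs_z_two_looks(rng) for _ in range(n_sims))
    return draws[round((1 - alpha) * n_sims)]


def overall_alpha(c, n_sims=200_000, seed=6):
    """Overall false-positive rate when cutoff c is applied at both looks."""
    rng = random.Random(seed)
    return sum(max_abs_z_two_looks(rng) > c for _ in range(n_sims)) / n_sims
```

pocock_constant() should come out close to 2.18, the tabulated Pocock value for two looks. overall_alpha(1.96) should be near .08, which is the cost of one unadjusted peek. Building the interim look into the design from the start is what makes the extra data collection legitimate. Real studies would use dedicated group-sequential software rather than this toy.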

Sharpe encourages statisticians to pick real examples. Let me add that they should be examples of research that you are motivated to help. Theorems, simulations, and toy examples are Frame 1 tools. Analyses in real data will hit home with scientists where they live in Frame 2. Picking apart a study in an area you already have distaste for (“I think evolutionary psych is garbage, let me choose this ev psych study to illustrate this statistical problem”) might feel satisfying, but it probably leads to less thorough and less persuasive critiques. Show in real data how the new method helps scientists with their goals, not just what the old one gets wrong according to yours.

I think of myself as one of what Sharpe calls the Mavens – scientists with an extra interest in statistics, who nerd out on quant stuff, often teach quantitative classes within their scientific field, and who like to adopt and spread the word about new innovations. Sometimes Mavens get into something because it just seems interesting. But often they (we) are drawn to things that seem like cool new ways to solve problems in our field. Speaking as a Maven who thinks statisticians can help us make science better, I would love it if you could help us out. We are interested, and we probably want to help too.


On cultural significance and the value of a life

With Michael Jackson and Farrah Fawcett dying on the same day, there are a lot of articles discussing them together. This one at MSNBC is a pretty representative example.

In reading the coverage, I can’t help but think that Farrah Fawcett’s cultural significance is getting pumped up. Not to say that she wasn’t a major cultural icon. But I think there’s something else going on.

As a culture we like to think that the value of a life is unmeasurable, and therefore all lives are equally sacred (economists be damned). Nobody would say that the extent to which society publicly mourns somebody’s death is a measure of their worth as a human being (most of us don’t get TV specials when we die). Media coverage is a function of fame and public impact, while private funerals are about mourning a beloved person; those are usually completely different spheres. But the fact that Farrah Fawcett and Michael Jackson died on the same day puts us in the uncomfortable position of looking at their deaths side-by-side. Fame and human worth get mixed together in the media coverage of somebody who has just died, and it’s hard to apply only one standard and not the other.

In this case, if we step back and look objectively in terms of cultural significance, I don’t think it’s hard to reach the conclusion that Farrah Fawcett and Michael Jackson were not on the same level. That isn’t to diminish the place that Fawcett held in society. But few people in history could measure up to Michael Jackson, who triggered a tectonic shift in how our culture thinks about music, dance, race, and celebrity. Rationally we can acknowledge that inequality without implying that one person’s life was more valuable than the other’s. But I suspect that on a gut level, it feels vaguely ghoulish to do so too loudly. So the end result is that Fawcett may be getting credited for even greater cultural significance than she otherwise would have.

(Related tangent: I can’t be the only one who feels uncomfortable every year during the Oscar tributes to Hollywood folks who’ve passed away, seeing the famous actors get louder applause than the obscure cinematographers. I suspect it’s the same sort of conflict between fame vs. human worth that’s driving that discomfort.)