Reflections on a foray into post-publication peer review

Recently I posted a comment on a PLOS ONE article for the first time. As someone who spent a decent chunk of his career before post-publication peer review came along — and who has an even larger chunk of his career left with it around — I found it an interesting experience.

It started when a colleague posted an article to his Facebook wall. I followed the link out of curiosity about the subject matter, but what immediately jumped out at me was that it was a 4-study sequence with pretty small samples. (See Uli Schimmack’s excellent article The ironic effect of significant results on the credibility of multiple-study articles [pdf] for why that’s noteworthy.) That got me curious about effect sizes and power, so I looked a little bit more closely and noticed some odd things. Like that different N’s were reported in the abstract and the method section. And when I calculated effect sizes from the reported means and SDs, some of them were enormous. Like Cohen’s d > 3.0 level of enormous. (If all this sounds a little hazy, it’s because my goal in this post is to talk about my experience of engaging in post-publication review — not to rehash the details. You can follow the links to the article and comments for those.)
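For anyone who wants to check this kind of thing themselves, the arithmetic is nothing fancy. Here is a minimal sketch in R of the standard formula for Cohen’s d from reported means and SDs; the numbers are made up for illustration, not taken from the article:

# Cohen's d from two group means and SDs, using the pooled SD.
# All of the numbers below are hypothetical.
cohens_d <- function(m1, m2, sd1, sd2, n1, n2) {
  sd_pooled <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))
  (m1 - m2) / sd_pooled
}
cohens_d(m1 = 5.8, m2 = 4.1, sd1 = 0.5, sd2 = 0.6, n1 = 20, n2 = 20)  # about 3.1
# With d around 3, the two group distributions barely overlap at all.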

In the old days of publishing, it wouldn’t have been clear what to do next. In principle many psych journals will publish letters and comments, but in practice they’re exceedingly rare. Another alternative would have been to contact the authors and ask them to write a correction. But that relies on the authors agreeing that there’s a mistake, which authors don’t always do. And even if authors agree and write up a correction, it might be months before it appears in print.

But this article was published in PLOS ONE, which lets readers post comments on articles as a form of post-publication peer review (PPPR). These comments aren’t just like comments on some random website or blog — they become part of the published scientific record, linked from the primary journal article. I’m all in favor of that kind of system. But it brought up a few interesting issues for how to navigate the new world of scientific publishing and commentary.

1. Professional etiquette. Here and there in my professional development I’ve caught bits and pieces of a set of gentleman’s rules about scientific discourse (and yes, I am using the gendered expression advisedly). A big one is, don’t make a fellow scientist look bad. Unless you want to go to war (and then there are rules for that too). So the old-fashioned thing to do — “the way I was raised” — would be to contact the authors quietly and petition them to make a correction themselves, so it could look like it originated with them. And if they do nothing, probably limit my comments to grumbling at the hotel bar at the next conference.

But for PPPR to work, the etiquette of “anything public is war” has to go out the window. Scientists commenting on each other’s work needs to be a routine and unremarkable part of scientific discourse. So does an understanding that even good scientists can make mistakes. And to live by the old norms is to affirm them. (Plus, the authors chose to submit to a journal that allows public comments, so caveat author.) So I elected to post a comment and then email the authors to let them know, so they would have a chance to respond quickly if they weren’t monitoring the comments. As a result, the authors posted several comments over the next couple of days correcting aspects of the article and explaining how the errors happened. And they were very responsive and cordial over email the entire time. Score one for the new etiquette.

2. A failure of pre-publication peer review? Some of the issues I raised in my comment were indisputable factual inconsistencies — like the sample sizes being reported differently in different parts of the paper. Others were more inferential — like my argument that a string of significant results across all 4 studies was quite improbable, even under a reasonable expectation of an effect size consistent with the authors’ own hypothesis. A reviewer might disagree about that (maybe they think the true effect really is gigantic). Other issues, like the too-small SDs, fell somewhere in the middle, though they turned out to be errors after all.
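To make the improbability point concrete, here is the kind of back-of-the-envelope calculation I had in mind (a sketch with hypothetical numbers, not the article’s actual ones):

# Power of a two-group t-test assuming a generous effect size (d = 0.8)
# and a small sample (n = 15 per group); both numbers are hypothetical.
p_one <- power.t.test(n = 15, delta = 0.8, sd = 1, sig.level = .05)$power
p_one    # about .56
p_one^4  # chance that all 4 independent studies reject the null: about .10

With numbers anything like these, four significant results out of four attempts would be a surprisingly lucky outcome.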

Is this a mark against pre-publication peer review? Obviously it’s hard to say from one case, but I don’t think it speaks well of PLOS ONE that these errors got through. Especially because PLOS ONE is supposed to emphasize “a high technical standard” and reporting of “sufficient detail.” (The only reason I noticed the issue with the SDs was that the article did not report effect sizes.)

But this doesn’t necessarily make PLOS ONE worse than traditional journals like Psychological Science or JPSP, where similar errors get through all the time and then become almost impossible to correct. [UPDATE: Please see my followup post about pre-publication review at PLOS ONE and other journals.]

3. The inconsistency of post-publication peer review. I don’t think post-publication peer review is a cure-all. This whole episode depended on somebody (in this case, me) noticing the anomalies and being motivated to post a comment about them. If we got rid of pre-publication peer review and the review process remained that unsystematic, it would be a recipe for a very biased system. This article’s conclusions are flattering to most scientists’ prejudices, and its press coverage has gotten a lot of mentions and “hell yeah”s on Twitter from pro-science folks. I don’t think it’s hard to imagine that that contributed to it getting a pass, and that if the opposite were true the article would have gotten a lot more scrutiny both pre- and post-publication. In my mind, the fix would be to make sure that all articles get a decent pre-publication review — not to scrap it altogether. Post-publication review is an important new development, but it should be an addition, not a replacement.

4. Where to stop? Finally, one issue I faced was how much to say in my initial comment, and how much to follow up. In particular, my original comment made a point about the low power, and thus the improbability of a string of 4 studies all rejecting the null. I based that on some hypotheticals and assumptions rather than formally calculating Schimmack’s incredibility index for the paper, in part because other errors in the initial draft made that impossible. The authors never responded to that particular point, but their corrections would have made it possible to calculate the incredibility index. So I could have come back and tried to goad them into a response. But I decided to let it go. I don’t have an axe to grind, and my initial comment is now part of the record. And one nice thing about PPPR is that readers can evaluate the arguments for themselves. (I do wish I had cited Schimmack’s paper, though, because more people should know about it.)

Changing software to nudge researchers toward better data analysis practice

The tools we have available to us affect the way we interact with and even think about the world. “If all you have is a hammer,” etc. Along these lines, I’ve been wondering what would happen if the makers of data analysis software like SPSS, SAS, etc. changed some of their defaults and options. Sort of in the spirit of Nudge — don’t necessarily change what is ultimately possible to do, but make some things easier and other things harder.

Would people think about their data differently? Here’s my list of how I might change regression procedures, and what I think these changes might do:

1. Let users write common transformations of variables directly into the syntax. Things like centering, z-scoring, log-transforming, multiplying variables into interactions, etc. This is already part of some packages (it’s easy to do in R), but not others. In particular, running interactions in SPSS is a huge royal pain. For example, to do a simple 2-way interaction with centered variables, you have to write all this crap *and* cycle back and forth between the code and the output along the way:

desc x1 x2.
* Run just the above, then look at the output and see what the means are, then edit the code below.
compute x1_c = x1 - [whatever the mean was].
compute x2_c = x2 - [whatever the mean was].
compute x1x2 = x1_c*x2_c.
regression /dependent y /enter x1_c x2_c x1x2.

Why shouldn’t we be able to do it all in one line like this?

regression /dependent y /enter center(x1) center(x2) center(x1)*center(x2).

The nudge: If it were easy to write everything into a single command, maybe more people would look at interactions more often. And maybe they’d stop doing median splits and then jamming everything into an ANOVA!
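For comparison, here is roughly what that one-liner already looks like in R, sketched with a hypothetical data frame dat containing y, x1, and x2:

# Center the predictors inline and fit the interaction in a single call.
# scale(x, scale = FALSE) subtracts the mean without standardizing.
fit <- lm(y ~ scale(x1, scale = FALSE) * scale(x2, scale = FALSE), data = dat)
summary(fit)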

2. By default, the output shows you parameter estimates and confidence intervals.

3. Either by default or with an easy-to-implement option, you can get a variety of standardized effect size estimates with their confidence intervals. And let’s not make variance-explained metrics (like R^2 or eta^2) the defaults.

The nudge: #2 and #3 are both designed to focus people on point and interval estimation rather than null hypothesis significance testing (NHST).
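In R, for example, all of this is available but it is not what you see first: summary() gives you t-tests, p-values, and R^2 by default, while confidence intervals and standardized estimates take an extra request. A sketch, continuing the hypothetical model from the sketch under #1:

# Point estimates with confidence intervals (not shown by default).
cbind(estimate = coef(fit), confint(fit))
# One rough route to standardized estimates: refit on z-scored variables.
fit_z <- lm(scale(y) ~ scale(x1) * scale(x2), data = dat)
cbind(estimate = coef(fit_z), confint(fit_z))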

This next one is a little more radical:

4. By default the output does not show you inferential t-tests and p-values — you have to ask for them through an option. And when you ask for them, you have to state what the null hypotheses are! So if you want to test the null that some parameter equals zero (as 99.9% of research in social science does), hey, go for it — but it has to be an active request, not a passive default. And if you want to test a null hypothesis that some parameter is some nonzero value, it would be easy to do that too.

The nudge: The way a lot of statistics is taught in psychology, NHST is the main event and effect estimation is an afterthought. This would turn that around. And by making users specify a null hypothesis, it might spur us to pause and think about how and why we are doing so, rather than just mining for asterisks to put in tables. Heck, I bet some nontrivial number of psychology researchers don’t even know that the null hypothesis doesn’t have to be the nil hypothesis. (I still remember the “aha” feeling the first time I learned that you could do that — well along into graduate school, in an elective statistics class.) If we want researchers to move toward point or range predictions with strong hypothesis testing, we should make it easier to do.
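And testing against a non-nil null really is only a couple of lines once you do it by hand. Here is a sketch, continuing the hypothetical model above, with an arbitrary null value of 0.30:

# Test H0: the (centered) x1 slope equals 0.30, instead of the usual nil null.
b  <- coef(summary(fit))[2, "Estimate"]      # row 2 = the x1 term
se <- coef(summary(fit))[2, "Std. Error"]
t_stat <- (b - 0.30) / se
p_val  <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)
c(t = t_stat, p = p_val)

The point of the nudge is that nobody should have to do this by hand; the software should ask, “tested against what?”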

All of these things are possible to do in most or all software packages. But as my SPSS example under #1 shows, they’re not necessarily easy to implement in a user-friendly way. Even R doesn’t do all of these things in the standard lm function. As a result, they probably don’t get done as much as they could or should.

Any other nudges you’d make?

Does your p-curve weigh as much as a duck?

Over at Psych Your Mind, Michael Kraus bravely reports the results of a p-curve analysis of his own publications.

p-curves were discussed by Uri Simonsohn at an SPSP symposium on false-positive findings (which I missed but got to read up about thanks to Kraus; many of the authors of the false-positive psychology paper were involved). Simonsohn has a paper forthcoming with details of the method. But the basic idea is that you should be able to tell if somebody is mining their data for significant findings by examining the distribution of p-values in their published work. A big spike of .049s and not enough <.01s could be the result of cherry-picking.
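My crude understanding of the intuition (Simonsohn’s forthcoming paper may formalize it differently; this is just an illustrative simulation): when a real effect is studied with adequate power, p-values pile up near zero, so a flat curve, or one with a bump just under .05, is the worrying pattern.

# Simulate the p-values you would expect from studies of a real effect
# (hypothetical: d = 0.5, n = 64 per group, roughly 80% power).
set.seed(1)
p_real <- replicate(10000, t.test(rnorm(64, mean = 0.5), rnorm(64))$p.value)
sig <- p_real[p_real < .05]
table(cut(sig, breaks = c(0, .01, .02, .03, .04, .05)))
# The counts fall off steeply from left to right: a right-skewed p-curve.
# A pile of .03s and .04s with few values below .01 would look very different.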

In a thoughtful but sometimes-heated discussion on the SPSP email list between Norbert Schwarz and the symposium participants, Schwarz argues — and I agree — that although p-curve analyses could be a useful tool, they will need to be interpreted cautiously. For example, Schwarz thinks that at this stage it would be inappropriate to base hiring decisions on candidates’ p-curves, something that Simonsohn apparently suggested in his talk.

A big part of the interpretive task is going to be that, as with any metric, users will have to accumulate data and build up some practical wisdom about how to interpret and apply it. Or to get a little jargony, we’ll have to do some construct validation. In particular, I think it will be crucial to remember that even though you could calculate a p-curve for a single researcher, the curve is not a property of the researcher. Rather, it will reflect the interaction of the researcher with history and context. Even setting aside measurement and sampling error, substantive factors will shape researchers’ p-curves: the incentives and practices set by publishers, granting agencies, and other powerful institutions; the differing standards of fields and subfields (e.g., in their use of NHST, and in what people honestly believe and teach as acceptable practice); who the researcher was trained by and has collaborated with; and so on. Individual researchers are an important part of the picture, of course, but it would be a mistake to apply an overly simplistic model of where p-curves come from. (And of course p-curves don’t have to be applied to individuals at all — they could be applied to literatures, to subfields, to journals, or really any other way of categorizing publications.)

One thing that both Schwarz and Simonsohn seem to agree on is that everybody has probably committed some or many of these errors, and that we won’t make much progress unless people are willing to subject themselves to perhaps-painful soul searching. Schwarz in particular worries that a “witch hunt” atmosphere could make people defensive and ultimately be counterproductive.

So hats off to Kraus for putting himself on the line. I’ll let you read his account and draw your own conclusions, but I think he’s impressively frank, especially for someone so early in his career. Speaking for myself, I’m waiting for Simonsohn’s paper so I can learn a little more about the method before trying it on my own vita. In the meantime I’m glad at least one of my papers has this little bit of p-curve kryptonite:

The p-values associated with the tests of the polynomial models are generally quite small, some so small as to exceed the computational limits of our data analysis software (SPSS 10.0.7, which ran out of decimal places at p < 10e–22).

Whew!

Do not use what I am about to teach you

I am gearing up to teach Structural Equation Modeling this fall term. (We are on quarters, so we start late — our first day of classes is next Monday.)

Here’s the syllabus. (pdf)

I’ve taught this course a bunch of times now, and each time I teach it I add more and more material on causal inference. In part it’s a reaction to my own ongoing education and evolving thinking about causation, and in part it’s from seeing a lot of empirical work that makes what I think are poorly supported causal inferences. (Not just articles that use SEM either.)

Last time I taught SEM, I wondered if I was heaping on so many warnings and caveats that the message started to veer into, “Don’t use SEM.” I hope that is not the case. SEM is a powerful tool when used well. I actually want the discussion of causal inference to help my students think critically about all kinds of designs and analyses. Even people who only run randomized experiments could benefit from a little more depth than the sophomore-year slogan that seems to be all some researchers (AHEM, Reviewer B) have been taught about causation.

The usability of statistics; or, what happens when you think that (p=.05) != (p=.06)

The difference between significant and not significant is not itself significant.

That is the title of a 2006 paper by statisticians Andrew Gelman and Hal Stern. It is also the theme of a new review article in Nature Neuroscience by Sander Nieuwenhuis, Birte U. Forstmann, and Eric-Jan Wagenmakers (via Gelman’s blog). The review examined several hundred papers in behavioral, systems, and cognitive neuroscience. Of the papers that tried to compare two effects, about half made this error instead of properly testing for an interaction.
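The error is easy to see with a toy example (made-up numbers): suppose one effect is estimated at 25 with a standard error of 10, and a second at 10 with a standard error of 10. The first is “significant,” the second is not, but the difference between them is nowhere near significant.

# Two estimated effects and their standard errors (made-up numbers).
b1 <- 25; se1 <- 10    # z = 2.5, p ~ .01, "significant"
b2 <- 10; se2 <- 10    # z = 1.0, p ~ .32, "not significant"
# The test that actually answers "do the two effects differ?"
z_diff <- (b1 - b2) / sqrt(se1^2 + se2^2)
2 * pnorm(abs(z_diff), lower.tail = FALSE)   # about .29, not significant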

I don’t know how often the error makes it through to published papers in social and personality psychology, but I see it pretty regularly as a reviewer. I call it out when I see it; sometimes other reviewers do too, sometimes they don’t.

I can also remember making this error as a grad student – and my advisor correcting me on it. But the funny thing is, it’s not something I was taught. I’m quite sure that nowhere along the way did any of my teachers say you can compare two effects by seeing if one is significant and the other isn’t. I just started doing it on my own. (And now I sometimes channel my old advisor and correct my own students on the same error, and I’m sure nobody’s teaching it to them either.)

If I wasn’t taught to make this error, where was I getting it from? When we talk about whether researchers have biases, usually we think of hot-button issues like political bias. But I think this reflects a more straightforward kind of bias — old habits of thinking that we carry with us into our professional work. To someone without scientific training, it seems like you should be able to ask “Does X cause Y, yes or no?” and expect a straightforward answer. Scientific training teaches us a couple of things. First, the question is too simple: it’s not a yes or no question; the answer is always going to come with some uncertainty; etc. Second, the logic behind the tool that most of us use – null hypothesis significance testing (NHST) – does not even approximate the form of the question. (Roughly: “In a world where X has zero effect on Y, would we see a result this far from the truth less than 5% of the time?”)

So I think what happens is that when we are taught the abstract logic of what we are doing, it doesn’t really pervade our thinking until it’s been ground into us through repetition. For a period of time – maybe in some cases forever – we carry out the mechanics of what we have been taught to do (run an ANOVA) but we map it onto our old habits of thinking (“Does X cause Y, yes or no?”). And then we elaborate and extrapolate from them in ways that are entirely sensible by their own internal logic (“One ANOVA was significant and the other wasn’t, so X causes Y more than it causes Z, right?”).

One of the arguments you sometimes hear against NHST is that it doesn’t reflect the way researchers think. It’s a sort of usability argument: NHST is the butterfly ballot of statistical methods. In principle, I don’t think that argument carries the day on its own (if we need to use methods and models that don’t track our intuitions, we should). But it should be part of the discussion. And importantly, the Nieuwenhuis et al. review shows us how using unintuitive methods can have real consequences.

Why does an IRB need an analysis plan?

My IRB has updated its forms since the last time I submitted an application, and I just saw this section, which I think is new (emphasis added by me):

Analysis: Explain how the data will be analyzed or studied (i.e. quantitatively or qualitatively and what statistical tests you plan on using). Explain how the interpretation will address the research questions. (Attach a copy of the data collection instruments).

What statistical tests I plan on using?

My first thought was “mission creep,” but I want to keep an open mind. Are there some statistical tests that are more likely to do harm to the human subjects who provided the data? Has anybody ever been given syphilis by a chi-square test? If I do a median split, am I damaging anything more than my own credibility? (“What if there are an odd number of subjects? Are you going to have to saw a subject in half?”)

Seriously though, is there something I’m missing?

Error variance and humility

I often hear researchers criticize each other for treating important phenomena as error variance. For example, situationist social psychologists criticize trait researchers for treating situations as error variance, and vice versa. (And us interactionists get peeved at both.) The implication is that if you treat something as error variance, you are dismissing it as unimportant. And that’s often how the term is used. For example, during discussions of randomized experiments, students who are learning how experiments work will often wonder whether pre-existing individual differences could have affected the outcomes. A typical response is, “Oh, that couldn’t have driven the effects because of randomization. If there are any individual differences, they go into the error variance.” And therefore they get excluded from the explanation of the phenomenon.

I think we’d all be better off if we remembered that the word “error” refers to an error of a model or theory. On the first day of my grad school regression course, Chick Judd wrote on the board: “DATA = MODEL + ERROR”. A short while later he wrote “ERROR = DATA – MODEL.” Error is data that your model cannot explain. Its existence is a sign of the incompleteness of your model. Its ubiquity should be a constant reminder to all scientists to stay humble and open-minded.

When you have an interaction, which variable moderates which?

I was talking recently with a colleague about interpreting moderator effects, and the question came up: when you have a 2-way interaction between A and B, how do you decide whether to say that A moderates B versus B moderates A?

Mathematically, of course, A*B = B*A, so the underlying math is indifferent. I was schooled in the Baron and Kenny approach to moderation and mediation. I’ve never found any hard-and-fast rules in any of Kenny’s writing on the subject (if I’ve missed any, please let me know in the comments section). B&K talk about the moderator moderating the effect of the “focal” variable, and I’ve always taken that to be an interpretive choice by the researcher. If the researcher’s primary goal is to understand how A affects Y, and in the researcher’s mind B is some other interesting variable across which the A->Y relationship might vary, then B is the moderator. And vice versa. To me, it’s entirely legitimate to talk about the same analysis in different ways — it’s a framing issue rather than a deep substantive issue.
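Just to underline the symmetry point: the interaction is literally the same term in the model no matter which variable you call the moderator. A quick sketch with simulated data (all names and numbers hypothetical):

# The A-by-B interaction is the same coefficient either way you write it.
set.seed(1)
d <- data.frame(a = rnorm(100), b = rnorm(100))
d$y <- 0.3 * d$a + 0.2 * d$b + 0.4 * d$a * d$b + rnorm(100)
coef(lm(y ~ a * b, data = d))["a:b"]
coef(lm(y ~ b * a, data = d))["b:a"]   # identical estimate, identical test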

However, my colleague has been trying to apply Kraemer et al.’s “MacArthur framework” and has been running into some problems. One of the MacArthur rules is that the variable you call the moderator (M) is the one that comes first, since (in their framework) the moderator always temporally precedes the treatment (T). But in my colleague’s study the ordering is not clear. (I believe that in my colleague’s study, the variables in question meet all of Kraemer’s other criteria for moderation — e.g., they’re uncorrelated — but they were measured at the same timepoint in a longitudinal study. Theoretically it’s not clear which one “would have” come first. Does it come down to which one came first in the questionnaire packet?)

I’ll admit that I’ve looked at Kraemer et al.’s writing on mediation/moderation a few times and it’s never quite resonated with me — they’re trying to make hard-and-fast rules for choosing between what, to me, seem like 2 legitimate alternative interpretations. (I also don’t really grok their argument that a significant interaction can sometimes be interpreted as mediation — unless it’s “mediated moderation” in Kenny-speak — but that’s a separate issue.) I’m curious how others deal with this issue…