The hotness-IQ tradeoff in academia

The other day I came across a blog post ranking academic fields by hotness. Important data for sure. But something about it was gnawing at me for a while, some connection I wasn’t quite making.

And then it hit me. The rankings looked an awful lot like another list I’d once seen of academic fields ranked by intelligence. Only, you know, upside-down.

Sure enough, when I ran the correlation among the fields that appear on both lists, it came out at r = -.45.
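
For the curious, the calculation itself is nothing fancy. Here is a minimal sketch of how one might run it; the field ranks below are made up for illustration, not the actual numbers from either blog post.

```python
from scipy.stats import spearmanr

# Hypothetical ranks (1 = top of the list) for fields that appear on both
# lists; these are made-up numbers for illustration only.
hotness_rank = [1, 2, 3, 4, 5, 6, 7, 8]
iq_rank      = [7, 8, 5, 6, 2, 4, 1, 3]

rho, p = spearmanr(hotness_rank, iq_rank)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```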

[Figure: hotness vs. IQ across academic fields]

I don’t know what this means, but it seems important. Maybe a mathematician or computer scientist can help me understand it.

 

The flawed logic of chasing large effects with small samples

“I don’t care about any effect that I need more than 20 subjects per cell to detect.”

I have heard statements to this effect a number of times over the years. Sometimes from the mouths of some pretty well-established researchers, and sometimes from people quoting the well-established researchers they trained under. The idea is that if an effect is big enough — perhaps because of its real-world importance, or because of the experimenter’s skill in isolating and amplifying the effect in the lab — then you don’t need a big sample to detect it.

When I have asked people why they think that, the reasoning behind it goes something like this. If the true effect is large, then even a small sample will have a reasonable chance of detecting it. (“Detecting” = rejecting the null in this context.) If the true effect is small, then a small sample is unlikely to reject the null. So if you only use small samples, you will limit yourself to detecting large effects. And if that’s all you care about detecting, then you’re fine with small samples.

On first consideration, that might sound reasonable, and even admirably aware of issues of statistical power. Unfortunately it is completely wrong. Some of the problems with it are statistical and logical. Others are more practical:

1. It involves a classic error in probabilistic thinking. In probability notation, you are incorrectly assuming that P(A|B) = P(B|A). In psychology jargon, you are guilty of base rate insensitivity.

Here’s why. A power analysis tells you the following: for a given sample size, IF the effect is large (or small) THEN you will have a certain probability of rejecting the null. But that probability is not the same as the probability that IF you have rejected the null THEN the effect is large (or small). The latter probability depends both on power and on how common large and small effects are to begin with — what a statistician would call the prior probability, and a psychologist would call the base rate.

To put this in context, suppose an experimenter is working in an area where most effects are small, some are medium, and a few are large (which pretty well describes the field of social psychology as a whole). The experimenter does not know in advance which it is, of course. When the experimenter happens to stumble onto one of the occasional large effects, the test will probably be significant. But more often the true effect will be small. Many of those small effects will be “missed,” but they are so common (relative to the number of experiments run) that the ones that do reach significance will still end up being the majority of the significant results.

Consider a simplified numerical example with just 2 possibilities. Suppose that 10% of experiments are chasing an effect so big that there is a 90% chance of detecting it (.90 power), and 90% of experiments are chasing smaller effects with only a 40% chance (.40 power). Out of 100 experiments, the experimenter will get 9 significant results from the large effects and 36 significant results from the small effects. So most of the significant results (80% of them) will come from having gotten a little bit lucky with small effects, rather than from having nailed a big effect.

(Of course, that simplified example assumes, probably too generously, that the null is never true or close-to-true. Moreover, with small samples the most common outcome — absent any p-hacking — will be that the results are not significant at all, even when there really is an effect.)
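
The arithmetic in that example is easy to check. Here is a minimal sketch; the 10/90 split and the .90/.40 power figures come straight from the example above, and nothing else is assumed.

```python
# Reproduces the simplified two-possibility example above.
n_experiments = 100
share_large, power_large = 0.10, 0.90   # 10% of experiments chase a large effect
share_small, power_small = 0.90, 0.40   # 90% chase smaller effects

sig_from_large = n_experiments * share_large * power_large   # 9
sig_from_small = n_experiments * share_small * power_small   # 36
total_sig = sig_from_large + sig_from_small

print(f"Significant results from large effects: {sig_from_large:.0f}")
print(f"Significant results from small effects: {sig_from_small:.0f}")
print(f"Share of significant results that came from small effects: "
      f"{sig_from_small / total_sig:.0%}")   # 80%
```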

If a researcher really is only interested in identifying effects of a certain size, it might seem reasonable to calculate effect sizes from the data. But with small samples, the researcher will greatly overestimate effect sizes in the subset of results that are significant, and the bias will be greatest when the true effects are small. That is because in order to show up as significant, those small (true) effects need to have been helped along by sampling error, which gives them a positive bias as a group. So the data won’t be much help in distinguishing truly large effects from the times when chance helped along a small one; they’ll look much the same.
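
A quick simulation makes the inflation concrete. This is just an illustrative sketch, assuming a two-group design with n = 20 per cell, a small true effect of d = 0.2, and a two-tailed test at alpha = .05; none of these numbers come from any particular study.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n, true_d, alpha, n_sims = 20, 0.2, 0.05, 20_000
sig_ds = []

for _ in range(n_sims):
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(true_d, 1.0, n)
    t, p = ttest_ind(treatment, control)
    if p < alpha:
        # Observed Cohen's d for the studies that happened to reach significance.
        pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
        sig_ds.append((treatment.mean() - control.mean()) / pooled_sd)

print(f"True d: {true_d}")
print(f"Share of studies significant: {len(sig_ds) / n_sims:.2f}")
print(f"Mean observed d among significant studies: {np.mean(sig_ds):.2f}")
```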

2. Behind the “I only need small samples” argument is the idea that the researcher attaches some special interpretive value (theoretical meaning or practical significance) to effects of at least a certain size. But if that is the case, then the researcher needs to adjust how he or she does hypothesis testing. In conventional NHST, the null hypothesis you are trying to rule out states that the effect is zero. But if you only care about medium-and-larger effects, then you need to go a step further and rule out the hypothesis that the effect is small. Which is entirely possible to do, and not that difficult mathematically. But in order to differentiate a medium effect from a small one, you need statistical power. n=20 per cell won’t cut it (the rough calculation below makes this concrete).

(This situation is in some ways the mirror opposite of when researchers say they only care about the direction of an effect, not its size. But even if nothing substantive is riding on the effect size, most effects turn out to be small, and the experiment is only worth doing if it is reasonably capable of detecting something.)
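
Here is that rough calculation: a normal-approximation sketch (not an exact power analysis) of how many subjects per cell it takes to reject a “small effect” null of d = 0.2 when the true effect is d = 0.5, compared with rejecting the usual null of zero. Both effect sizes are just illustrative choices.

```python
from math import ceil
from scipy.stats import norm

def n_per_cell(true_d, null_d, alpha=0.05, power=0.80):
    # Two-group design; the standard error of d is roughly sqrt(2/n) per cell.
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return ceil(2 * (z / (true_d - null_d)) ** 2)

print("Reject d = 0   when true d = 0.5:", n_per_cell(0.5, 0.0), "per cell")  # ~63
print("Reject d = 0.2 when true d = 0.5:", n_per_cell(0.5, 0.2), "per cell")  # ~175
```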

3. All of the above assumes optimal scientific practice — no researcher degrees of freedom, no publication bias, etc. In the real world, of course, things are rarely optimal.

When it comes to running studies with small samples, one of the biggest risks is engaging in data churning and capitalizing on chance — even inadvertently. Consider two researchers who are planning to run 2-condition experiments in which a significant result would be theoretically interesting and publishable. Both decide to dedicate the time and resources to running up to 200 subjects on the project before giving up and moving on. Researcher A runs one well-powered study with n=100 per condition. Researcher B decides to run the experiment with n=20 per condition. If it is significant, B will stop and write up the result; if it is not significant, B tweaks the procedure and tries again, up to 5 times.

Without realizing it, Researcher B is working with an effective Type I error rate that is about 23%.
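
That 23% figure is just the standard multiple-testing arithmetic, assuming each of the up-to-5 attempts is an independent test at alpha = .05 run under a true null:

```python
# Chance that at least one of up to 5 independent tests is significant
# when there is really no effect.
alpha, max_attempts = 0.05, 5
effective_alpha = 1 - (1 - alpha) ** max_attempts
print(f"Effective Type I error rate: {effective_alpha:.3f}")   # ~0.226
```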

Of course, Researcher B should also recognize that this is a situation that calls for cross-validation. By tweaking and repeating (or merely being willing to tweak and repeat if the result is not significant), B is engaged in exploratory research. To guard against capitalizing on chance, the variation on the experiment that finally “works” needs to be run again. That is actually true regardless of sample size or power. But the practical consequences of skipping that step get bigger and bigger as samples get smaller and smaller.

The upshot of all of this is pretty straightforward: you cannot talk your way out of running studies with adequate power. It does not matter if you only care about large effects, if you care about small effects too, or if you do not care about effect sizes at all. It doesn’t even matter if you focus on estimation rather than NHST (which I wholeheartedly support, by the way) — you still need adequate samples. The only alternatives are (a) to live with a lot of ambiguous (nonsignificant) results until enough accrue to do a meta-analysis, or (b) to p-hack your way out of them.

Where is RDoC headed? A look at the eating disorders FOA

Thomas Insel, director of NIMH, made a splash recently with the announcement that NIMH funding will be less strictly tied to the DSM. That by itself would be good news, given all the problems with DSM. But the proposed replacement, the Research Domain Criteria (RDoC), has worried some people that NIMH is pursuing biology to the exclusion of other levels of analysis, as opposed to taking a more integrated approach.

We can try to divine NIMH’s future directions from the RDoC description and the director’s blog post, but it’s hard to tell whether mentions of behavior and phenomenology reflect real priorities or just lip service. Likewise for social and cultural factors: they come up in a discussion of “environmental aspects” that might interact with neural circuits, but they do not appear as focal units of analysis in the RDoC matrix, which leaves them in a somewhat ambiguous state.

Another approach is to look at revealed preferences. Regardless of what anybody is saying, how is NIMH actually going to spend its money?

As an early indication, the NIMH RDoC overview page links to 2 funding opportunity announcements (FOAs) that are based on RDoC. Presumably these are examples of where RDoC-driven research is headed. One of the FOAs is for eating disorders. Here is the overview:

Eating disorders, including anorexia nervosa (AN), bulimia nervosa (BN), and their variants, are a major source of physical and psychological morbidity and constitute the major contribution to excess mortality from psychiatric disorders.  Clinical presentations of eating disorders are highly heterogeneous, involving broad and often overlapping symptomatology, which is often further complicated by metabolic and nutritional challenges that result from restricted food intake, excessive exercise, and repeated binge and/or purge episodes.  The recognition that relatively specific behaviors, cognitive operations, and affective processes are primarily implemented by particular neural circuits suggests that dysregulated functions and associated neural circuits should be a critical focus of study, and, ultimately, the target of assessment and treatment for eating disorders.

Here is a list of words that do not appear anywhere in the eating disorders FOA:

social
media
culture
family
peer (when not followed by “review” referring to the funding processes)
body image
self (when not followed by “-report” in a rote recital of the RDoC units of analysis)

And maybe I shouldn’t get too hung up on a choice of a definite vs. indefinite article, but what’s up with stating that neural circuits should be “ultimately, the target of assessment and treatment”?

Eating disorders isn’t my area. So I might have missed something. Perhaps NIMH is planning to issue another RDoC-based eating disorders FOA that invites research on sociocultural factors. Or maybe I’m missing some other important way that they will be incorporated into NIMH’s priorities for studying eating disorders. But if not — if NIMH thinks that basic research on media, on family environments, on peer influence, on self-concept, and on cultural norms is not terribly important for understanding and treating eating disorders — well, that’s really hard to defend. And not a good sign of where things are headed more broadly.

A null replication in press at Psych Science – anxious attachment and sensitivity to temperature cues

Etienne LeBel writes:

My colleague [Lorne Campbell] and I just got a paper accepted at Psych Science that reports on the outcome of two strict direct replications in which we worked very closely with the original author to keep all methodological design specifications as similar as possible to those in the original study (and unfortunately did not reproduce the original finding).

We believe this is an important achievement for the “replication movement” because it shows that (a) attitudes are changing at the journal level with regard to rewarding direct replication efforts (to our knowledge these are the first strictly direct replications to be published at a top journal like Psych Science [JPSP eventually published large-scale failed direct replications of Bem’s ESP findings, but this was of course a special case]) and (b) that direct replication endeavors can contribute new knowledge concerning a theoretical idea while maintaining a cordial, non-adversarial atmosphere with the original author. We really want to emphasize this point the most to encourage other researchers to engage in similar direct replication efforts. Science should first and foremost be about the ideas rather than the people behind the ideas; we’re hoping that examples like ours will sensitize people to a more functional research culture where it is OK and completely normal for ideas to be revised given new evidence.

An important achievement indeed. The original paper was published in Psychological Science too, so it is especially good to see the journal owning the replication attempt. And hats off to LeBel and Campbell for taking this on. Someday direct replications will hopefully be more normal, but in the world we currently live in it takes some gumption to go out and try one.

I also appreciated the very fact-focused and evenhanded tone of the writeup. If I can quibble, I would have ideally liked to see a statistical test contrasting their effect against the original one – that is, a test of the hypothesis that the replication result differs from the original result. I am sure it would have been significant, and it would have been preferable to comparing the original paper’s significant rejection of the null with the replications’ non-significant tests against the null. But that’s a small thing compared to what a large step forward this is.
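
For what it’s worth, the kind of contrast I have in mind is not hard to run. Here is a minimal sketch that compares two effect estimates with a simple z test on their difference; the effect sizes and standard errors below are placeholders, not the actual values from the original study or the replications.

```python
from math import sqrt
from scipy.stats import norm

d_orig, se_orig = 0.55, 0.22   # hypothetical original effect and its SE
d_rep,  se_rep  = 0.05, 0.10   # hypothetical replication effect and its SE

# z test for the difference between two independent effect estimates.
z = (d_orig - d_rep) / sqrt(se_orig**2 + se_rep**2)
p = 2 * (1 - norm.cdf(abs(z)))
print(f"z = {z:.2f}, p = {p:.4f}")
```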

Now let’s see what happens with all those other null replications of studies about relationships and physical warmth.

Pre-publication peer review can fall short anywhere

The other day I wrote about a recent experience participating in post-publication peer review. Short version: I picked up on some errors in a paper published in PLOS ONE, which led to a correction. In my post I made the following observation:

Is this a mark against pre-publication peer review? Obviously it’s hard to say from one case, but I don’t think it speaks well of PLOS ONE that these errors got through. Especially because PLOS ONE is supposed to emphasize “a high technical standard” and reporting of “sufficient detail” (the reason I noticed the issue with the SDs was because the article did not report effect sizes).

But this doesn’t necessarily make PLOS ONE worse than traditional journals like Psychological Science or JPSP, where similar errors get through all the time and then become almost impossible to correct.

My intention was to discuss pre- and post-publication peer review generally, and I went out of my way to cite evidence that mistakes can happen anywhere. But some comments I’ve seen online have characterized this as a mark against PLOS ONE (and my “I don’t think it speaks well of PLOS ONE” phrasing probably didn’t help). So I would like to note the following:

1. After my blog post went up yesterday, somebody alerted me that the first author of the PLOS ONE paper has posted corrections to 3 other papers on her personal website. The errors are similar to what happened at PLOS ONE. She names authors and years, not full citations, but through a little deduction with her CV it appears that one of the journals is Psychological Science, one of them is the Journal of Personality and Social Psychology, and the third could be either JPSP, Personality and Social Psychology Bulletin, or the Journal of Experimental Social Psychology. So all 3 of the corrected papers were in high-impact journals with a traditional publishing model.

2. Some of the errors might look obvious now. But that is probably boosted by hindsight. It’s important to keep in mind that reviewers are busy people who are almost always working pro bono. And even at its best, the review process is always going to be a probabilistic filter. I certainly don’t check the math on every paper I read or review. I was looking at the PLOS ONE paper with a particular mindset that made me especially attentive to power and effect sizes. Other reviewers with different concerns might well have focused on different things. That doesn’t mean that we should throw up our hands, but in the big picture we need to be realistic about what we can expect of any review process (and design any improvements with that realism in mind).

3. In the end, what makes PLOS ONE different is that their online commenting system makes it possible for many eyes to be involved in a continuous review process — not just 2-3 reviewers and an editor before publication and then we’re done. That seems much smarter about the probabilistic nature of peer review. And PLOS ONE makes it possible to address potential errors quickly and transparently and in a way that is directly linked from the published article. Whereas with the other 3 papers, assuming that those corrections have been formally submitted to the respective journals, it could still be quite a while before they appear in print, and the original versions could be in wide circulation by then.

 

Reflections on a foray into post-publication peer review

Recently I posted a comment on a PLOS ONE article for the first time. As someone who spent a decent chunk of his career before post-publication peer review came along — and who has an even larger chunk of his career left with it around — I found it an interesting experience.

It started when a colleague posted an article to his Facebook wall. I followed the link out of curiosity about the subject matter, but what immediately jumped out at me was that it was a 4-study sequence with pretty small samples. (See Uli Schimmack’s excellent article The ironic effect of significant results on the credibility of multiple-study articles [pdf] for why that’s noteworthy.) That got me curious about effect sizes and power, so I looked a little bit more closely and noticed some odd things. Like that different N’s were reported in the abstract and the method section. And when I calculated effect sizes from the reported means and SDs, some of them were enormous. Like Cohen’s d > 3.0 level of enormous. (If all this sounds a little hazy, it’s because my goal in this post is to talk about my experience of engaging in post-publication review — not to rehash the details. You can follow the links to the article and comments for those.)
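
(For readers who want to do this kind of check themselves, here is a minimal sketch of backing out Cohen’s d from reported means and SDs. The numbers are placeholders, not the values from the article in question.)

```python
from math import sqrt

def cohens_d(m1, sd1, n1, m2, sd2, n2):
    # Standardized mean difference using the pooled standard deviation.
    pooled_sd = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# A d above ~1 is already unusual in social psychology; d > 3 is a red flag
# that something (e.g., an SE reported as an SD) may have gone wrong.
print(round(cohens_d(m1=5.8, sd1=0.4, n1=25, m2=4.3, sd2=0.5, n2=25), 2))
```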

In the old days of publishing, it wouldn’t have been clear what to do next. In principle many psych journals will publish letters and comments, but in practice they’re exceedingly rare. Another alternative would have been to contact the authors and ask them to write a correction. But that relies on the authors agreeing that there’s a mistake, which authors don’t always do. And even if authors agree and write up a correction, it might be months before it appears in print.

But this article was published in PLOS ONE, which lets readers post comments on articles as a form of post-publication peer-review (PPPR). These comments aren’t just like comments on some random website or blog — they become part of the published scientific record, linked from the primary journal article. I’m all in favor of that kind of system. But it brought up a few interesting issues for how to navigate the new world of scientific publishing and commentary.

1. Professional etiquette. Here and there in my professional development I’ve caught bits and pieces of a set of gentleman’s rules about scientific discourse (and yes, I am using the gendered expression advisedly). A big one is, don’t make a fellow scientist look bad. Unless you want to go to war (and then there are rules for that too). So the old-fashioned thing to do — “the way I was raised” — would be to contact the authors quietly and petition them to make a correction themselves, so it could look like it originated with them. And if they do nothing, probably limit my comments to grumbling at the hotel bar at the next conference.

But for PPPR to work, the etiquette of “anything public is war” has to go out the window. Scientists commenting on each other’s work needs to be a routine and unremarkable part of scientific discourse. So does an understanding that even good scientists can make mistakes. And to live by the old norms is to affirm them. (Plus, the authors chose to submit to a journal that allows public comments, so caveat author.) So I elected to post a comment and then email the authors to let them know, so they would have a chance to respond quickly if they weren’t monitoring the comments. As a result, the authors posted several comments over the next couple of days correcting aspects of the article and explaining how the errors happened. And they were very responsive and cordial over email the entire time. Score one for the new etiquette.

2. A failure of pre-publication peer review? Some of the issues I raised in my comment were indisputable factual inconsistencies — like that the sample sizes were reported differently in different parts of the paper. Others were more inferential — like that a string of significant results in these 4 studies was quite improbable, even under a reasonable expectation of an effect size consistent with the authors’ own hypothesis. A reviewer might disagree about that (maybe they think the true effect really is gigantic). Other issues, like the too-small SDs, would have been somewhere in the middle, though they turned out to be errors after all.

Is this a mark against pre-publication peer review? Obviously it’s hard to say from one case, but I don’t think it speaks well of PLOS ONE that these errors got through. Especially because PLOS ONE is supposed to emphasize “a high technical standard” and reporting of “sufficient detail” (the reason I noticed the issue with the SDs was because the article did not report effect sizes).

But this doesn’t necessarily make PLOS ONE worse than traditional journals like Psychological Science or JPSP, where similar errors get through all the time and then become almost impossible to correct. [UPDATE: Please see my followup post about pre-publication review at PLOS ONE and other journals.]

3. The inconsistency of post-publication peer review. I don’t think post-publication peer review is a cure-all. This whole episode depended on somebody (in this case, me) noticing the anomalies and being motivated to post a comment about them. If we got rid of pre-publication peer review and if the review process remained that unsystematic, it would be a recipe for a very biased system. This article’s conclusions are flattering to most scientists’ prejudices, and press coverage of the article has gotten a lot of mentions and “hell yeah”s on Twitter from pro-science folks. I don’t think it’s hard to imagine that that contributed to it getting a pass, and that if the opposite were true the article would have gotten a lot more scrutiny both pre- and post-publication. In my mind, the fix would be to make sure that all articles get a decent pre-publication review — not to scrap it altogether. Post-publication review is an important new development, but it should be an addition, not a replacement.

4. Where to stop? Finally, one issue I faced was how much to say in my initial comment, and how much to follow up. In particular, my original comment made a point about the low power and thus the improbability of a string of 4 studies with a rejected null. I based that on some hypotheticals and assumptions rather than formally calculating Schimmack’s incredibility index for the paper, in part because other errors in the initial draft made that impossible. The authors never responded to that particular point, but their corrections would have made it possible to calculate the incredibility index. So I could have come back and tried to goad them into a response. But I decided to let it go. I don’t have an axe to grind, and my initial comment is now part of the record. And one nice thing about PPPR is that readers can evaluate the arguments for themselves. (I do wish I had cited Schimmack’s paper though, because more people should know about it.)
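
For anyone curious, here is a rough sketch of the kind of calculation that underlies Schimmack’s incredibility index: estimate each study’s power from its observed effect size and sample size, then ask how probable it is that every study in the set would have come out significant. This uses a normal approximation for a two-group design, and the observed d’s and n’s are hypothetical, not taken from the paper in question.

```python
import numpy as np
from scipy.stats import norm

def approx_power(d, n_per_cell, alpha=0.05):
    # Probability of a significant two-tailed test, normal approximation.
    z_crit = norm.ppf(1 - alpha / 2)
    return 1 - norm.cdf(z_crit - d * np.sqrt(n_per_cell / 2))

studies = [(0.45, 20), (0.50, 22), (0.40, 25), (0.55, 20)]  # (observed d, n per cell)
powers = [approx_power(d, n) for d, n in studies]

p_all_significant = np.prod(powers)
print("Estimated power per study:", [round(p, 2) for p in powers])
print(f"Probability that all {len(studies)} studies come out significant: "
      f"{p_all_significant:.3f}")
```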

The PoPS replication reports format is a good start

Big news today is that Perspectives on Psychological Science is going to start publishing pre-registered replication reports. The inaugural editors will be Daniel Simons and Alex Holcombe, who have done the serious legwork to make this happen. See the official announcement and blog posts by Ed Yong and Melanie Tannenbaum. (Note: this isn’t the same as the earlier plan I wrote about for Psychological Science to publish replications, but it appears to be related.)

The gist of the plan is that after getting pre-approval from the editors (mainly to filter for important but as-yet unreplicated studies), proposers will create a detailed protocol. The original authors (and maybe other reviewers?) will have a chance to review the protocol. Once it has been approved, the proposer and other interested labs will run the study. Publication will be contingent on carrying out the protocol but not on the results. Collections of replications from multiple labs will be published together as final reports.

I think this is great news. In my ideal world published replications would be more routine, and wouldn’t require all the hoopla of prior review by original authors, multiple independent replications packaged together, etc. etc. In other words, they shouldn’t be extraordinary, and they should be as easy or easier to publish than original research. I also think every journal should take responsibility for replications of its own original reports (the Pottery Barn rule). BUT… this new format doesn’t preclude any of that from also happening elsewhere. By including all of those extras, PoPS replication reports might function as a first-tier, gold standard of replication. And by doing a lot of things right (such as focusing on effect sizes rather than tallying “successful” and “failed” replications, which is problematic) they might set an example for more mundane replication reports in other outlets.
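
As a side note on what focusing on effect sizes can look like in practice, here is a minimal sketch of a fixed-effect, inverse-variance-weighted meta-analysis across replicating labs, the kind of pooling a multi-lab final report might use. The per-lab estimates and standard errors are hypothetical.

```python
import numpy as np

effects = np.array([0.12, 0.08, 0.20, -0.05])   # per-lab effect estimates
ses     = np.array([0.10, 0.12, 0.09, 0.11])    # per-lab standard errors

# Fixed-effect meta-analysis: weight each lab by the inverse of its variance.
weights = 1 / ses**2
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1 / np.sum(weights))

print(f"Pooled effect: {pooled:.3f} (SE = {pooled_se:.3f})")
print(f"95% CI: [{pooled - 1.96 * pooled_se:.3f}, {pooled + 1.96 * pooled_se:.3f}]")
```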

This won’t solve everything — not by a long shot. We need to change scientific culture (by which I mean institutional incentives) so that replication is a more common and more valued activity. We need funding agencies to see it that way too. In a painful coincidence, news came out today that a cognitive neuroscientist admitted to misconduct in published research. One of the many things that commonplace replications would do would be to catch or prevent fraud. But whenever I’ve asked colleagues who use fMRI whether people in their fields run direct replications, they’ve just laughed at me. There’s little incentive to run them and no money to do it even if you wanted to. All of that needs to change across many areas of science.

But you can’t solve everything at once, and the PoPS initiative is an important step forward.