In which we admire a tiny p with complete seriousness

A while back a colleague forwarded me this quote from Stanley Schachter (yes, that Stanley Schachter):

“This is a difference which is significant at considerably better than the p < .0001 level of confidence. If, in reeling off these zeroes, we manage to create the impression of stringing pearls on a necklace, we rather hope the reader will be patient and forbearing, for it has been the very number of zeros after this decimal point that has compelled us to treat these data with complete seriousness.”

The quote comes from a chapter on birth order in Schachter’s 1959 book The Psychology of Affiliation. The analysis was a chi-square test on 76 subjects, who were selected from 3 different experiments for being “truly anxious” and combined for this analysis. A subject was classified as truly anxious if they scored at one or the other extreme endpoint of an anxiety scale (both complete denial and complete admission were taken to mean that the subject was “truly anxious”), and/or if they discontinued participation because the experiment made them feel too anxious.

Let’s talk about diversity in personality psychology

In the latest issue of the ARP newsletter, Kelci Harris writes about diversity in ARP. You should read the whole thing. Here’s an excerpt:

Personality psychology should be intrinsically interesting to everyone, because, well, everyone has a personality. It’s accessible and that makes our research so fun and an easy thing to talk about with non-psychologists, that is, once we’ve explained to them what we actually do. However, despite what could be a universal appeal, our field is very homogenous. And that’s too bad, because diversity makes for better science. Good research comes from observations. You notice something about the world, and you wonder why that is. It’s probably reasonable to guess that most members of our field have experienced the world in a similar way due to their similar demographic backgrounds. This similarity in experience presents a problem for research because it makes us miss things. How can assumptions be challenged when no one realizes they are being made? What kind of questions will people from different backgrounds have that current researchers could never think of because they haven’t experienced the world in that way?

In response, Laura Naumann posted a letter to the ARP Facebook wall. Read it too. Another excerpt:

I challenge our field to begin to view those who conduct this type of research [on underrepresented groups] as contributing work that is EQUAL TO and AS IMPORTANT AS “traditional” basic research in personality and social psychology. First, this will require editors of “broad impact” journals to take a critical eye to their initial review process in evaluating what manuscripts are worthy of being sent out to reviewers. I’ve experienced enough frustration sending a solid manuscript to a journal only to have it quickly returned praising the work, but suggesting resubmission to a specialty journal (e.g., ethnic minority journal du jour). The message I receive is that my work is not interesting enough for broad dissemination. If we want a more welcoming field on the personal level, we need to model a welcoming field at the editorial level.

This is a discussion we need to be having. Big applause to Kelci and Laura for speaking out.

Now, what should we be doing? Read what Kelci and Laura wrote — they both have good ideas.

I’ll add a much smaller one, which came up in a conversation on my Facebook wall: let’s collect data. My impressions of what ARP conferences look like are very similar to Kelci’s, but not all important forms of diversity are visible, and if we had hard data we wouldn’t have to rely on impressions. How are the members and conference attendees of ARP and other personality associations distributed by racial and ethnic groups, gender, sexual orientation, national origin, socioeconomic background, and other important dimensions? How do those break down by career stage? And if we collect data over time, is better representation moving up the career ladder, or is the pipeline leaking? I hope ARP will consider collecting this data as part of the membership and conference registration processes going forward, and releasing aggregate numbers. (Maybe they already collect this, but if so, I cannot recall ever seeing any report of it.) With data we will have a better handle on what we’re doing well and what we could be doing better.

What else should we be doing — big or small? This is a conversation that is long overdue and that everybody should be involved in. Let’s have it.

An interesting study of why unstructured interviews are so alluring

A while back I wrote about whether grad school admissions interviews are effective. Following up on that, Sam Gosling recently passed along an article by Dana, Dawes, and Peterson from the latest issue of Judgment and Decision Making:

Belief in the unstructured interview: The persistence of an illusion

Unstructured interviews are a ubiquitous tool for making screening decisions despite a vast literature suggesting that they have little validity. We sought to establish reasons why people might persist in the illusion that unstructured interviews are valid and what features about them actually lead to poor predictive accuracy. In three studies, we investigated the propensity for “sensemaking”—the ability for interviewers to make sense of virtually anything the interviewee says—and “dilution”—the tendency for available but non-diagnostic information to weaken the predictive value of quality information. In Study 1, participants predicted two fellow students’ semester GPAs from valid background information like prior GPA and, for one of them, an unstructured interview. In one condition, the interview was essentially nonsense in that the interviewee was actually answering questions using a random response system. Consistent with sensemaking, participants formed interview impressions just as confidently after getting random responses as they did after real responses. Consistent with dilution, interviews actually led participants to make worse predictions. Study 2 showed that watching a random interview, rather than personally conducting it, did little to mitigate sensemaking. Study 3 showed that participants believe unstructured interviews will help accuracy, so much so that they would rather have random interviews than no interview. People form confident impressions even when interviews are defined to be invalid, like our random interview, and these impressions can interfere with the use of valid information. Our simple recommendation for those making screening decisions is not to use them.

It’s an interesting study. In my experience people’s beliefs in unstructured interviews are pretty powerful — hard to shake even when you show them empirical evidence.

I did have some comments on the design and analyses:

1. In Studies 1 and 2, each subject made a prediction about absolute GPA for one interviewee. So estimates of how good people are at predicting GPA from interviews are based entirely on between-subjects comparisons. It is very likely that a substantial chunk of the variance in predictions will be due to perceiver variance — differences between subjects in their implicit assumptions about how GPA is distributed. (E.g., Subject 1 might assume most GPAs range from 3 to 4, whereas Subject 2 assumes most GPAs range from 2.3 to 3.3. So even if they have the same subjective impression of the same target — “this person’s going to do great this term” — their numerical predictions might differ by a lot.) That perceiver variance would go into the denominator as noise variance in this study, lowering the interviewers’ predictive validity correlations.

Whether that’s a good thing or a bad thing depends on what situation you’re trying to generalize to. Perceiver variance would contribute to errors in judgment when each judge makes an absolute decision about a single target. On the other hand, in some cases perceivers make relative judgments about several targets, such as when an employer interviews several candidates and picks the best one. In that setting, perceiver variance would not matter, and a study with this design could underestimate accuracy.
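To illustrate the perceiver variance point, here is a minimal simulation, with made-up numbers of my own rather than anything from the paper. Every judge forms an equally accurate impression of their one target, but judges differ in the GPA scale they implicitly map that impression onto, and that between-judge variance alone drags the validity correlation down:

```python
# A toy simulation (numbers invented for illustration): each judge rates one
# target; impressions track true GPA, but judges differ in how they translate
# an impression into a numeric prediction.
import numpy as np

rng = np.random.default_rng(1)
n_judges = 10_000

true_gpa = rng.normal(3.0, 0.4, n_judges)              # each judge's single target
impression = true_gpa + rng.normal(0, 0.3, n_judges)   # subjective impression

# Perceiver variance: each judge's idiosyncratic offset when converting an
# impression into a numeric GPA prediction.
judge_offset = rng.normal(0, 0.4, n_judges)
prediction = impression + judge_offset

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

print("validity of impressions:            ", round(corr(impression, true_gpa), 2))  # ~.80
print("validity of predictions (between-S):", round(corr(prediction, true_gpa), 2))  # ~.62
```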

2. Study 1 had 76 interviewers spread across 3 conditions (n = 25 or 26 per condition), and only 7 interviewees (each of whom was rated by multiple interviewers). Based on the 73 degrees of freedom reported for the test of the “dilution” effect, it looks like they treated interviewer as the unit of analysis and did not account for the dependency introduced by rating the same small set of interviewees. Study 2 appears to have similar issues (though in Study 2 the dilution effect was not significant).
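For what it’s worth, here is a sketch of one way to model that dependency, treating interviewee as a random effect in a mixed model. The file name and column names below are hypothetical placeholders, not the authors’ materials:

```python
# A sketch (hypothetical file and column names) of an analysis that respects
# the nesting: interviewer-level predictions grouped by interviewee.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("interview_study.csv")  # placeholder; would need columns:
                                         # prediction_error, condition, interviewee_id

# Fixed effect of condition, random intercept for each interviewee.
model = smf.mixedlm("prediction_error ~ condition", data=df,
                    groups=df["interviewee_id"])
print(model.fit().summary())
```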

3. I also had concerns about power and the precision of the estimates. Any inferences about who makes better or worse predictions will depend a lot on the variance among the 7 interviewees whose GPAs were being predicted (8 interviewees in Study 2). I haven’t done a formal power analysis, but my intuition is that 7 or 8 targets is a pretty small basis for precise estimates. You can see a possible sign of this in one key difference between the studies. In Study 1, the correlation between the interviewees’ prior GPA and upcoming GPA was r = .65, but in Study 2 it was r = .37. That’s a pretty big difference between estimates of a quantity that should not be changing between studies.
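As a rough illustration of the precision problem, here is a back-of-the-envelope confidence interval for a correlation based on 7 or 8 targets, using the standard Fisher z approximation (my own calculation, not anything reported in the paper):

```python
# Back-of-the-envelope 95% CI for a correlation, via the Fisher z approximation.
import numpy as np

def r_ci(r, n, z_crit=1.96):
    z = np.arctanh(r)
    se = 1 / np.sqrt(n - 3)
    return np.tanh(z - z_crit * se), np.tanh(z + z_crit * se)

print(r_ci(0.65, 7))   # roughly (-0.20, 0.94)
print(r_ci(0.37, 8))   # roughly (-0.45, 0.85)
```

Intervals that wide make a swing from .65 to .37 between studies unsurprising, which is kind of the point.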

So it’s an interesting study but not one that can give answers I’d call definitive. If that’s well understood by readers of the study, I’m okay with that. Maybe someone will use the interesting ideas in this paper as a springboard for a larger followup. Given the ubiquity of unstructured interviews, it’s something we need to know more about.

The hotness-IQ tradeoff in academia

The other day I came across a blog post ranking academic fields by hotness. Important data for sure. But something about it was gnawing at me for a while, some connection I wasn’t quite making.

And then it hit me. The rankings looked an awful lot like another list I’d once seen of academic fields ranked by intelligence. Only, you know, upside-down.

Sure enough, when I ran the correlation among the fields that appear on both lists, it came out at r = -.45.

[Figure: hotness vs. IQ rankings across academic fields]
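In case anyone wants to run the same check on other lists, this is roughly all the computation involves. The fields and ranks below are made-up placeholders, not the actual data from either post:

```python
# Hypothetical example of correlating two rankings of the same fields.
import pandas as pd

hotness = pd.Series({"Philosophy": 1, "Economics": 2, "Psychology": 3, "Biology": 4})
iq = pd.Series({"Philosophy": 3, "Economics": 4, "Psychology": 2, "Biology": 1})

# Align on field name, keep only fields present on both lists, then correlate.
both = pd.DataFrame({"hotness": hotness, "iq": iq}).dropna()
print(both["hotness"].corr(both["iq"]))
```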

I don’t know what this means, but it seems important. Maybe a mathematician or computer scientist can help me understand it.


The flawed logic of chasing large effects with small samples

“I don’t care about any effect that I need more than 20 subjects per cell to detect.”

I have heard statements to this effect a number of times over the years. Sometimes from the mouths of some pretty well-established researchers, and sometimes from people quoting the well-established researchers they trained under. The idea is that if an effect is big enough — perhaps because of its real-world importance, or because of the experimenter’s skill in isolating and amplifying the effect in the lab — then you don’t need a big sample to detect it.

When I have asked people why they think that, the reasoning behind it goes something like this. If the true effect is large, then even a small sample will have a reasonable chance of detecting it. (“Detecting” = rejecting the null in this context.) If the true effect is small, then a small sample is unlikely to reject the null. So if you only use small samples, you will limit yourself to detecting large effects. And if that’s all you care about detecting, then you’re fine with small samples.

On first consideration, that might sound reasonable, and even admirably aware of issues of statistical power. Unfortunately it is completely wrong. Some of the problems with it are statistical and logical. Others are more practical:

1. It involves a classic error in probabilistic thinking. In probability notation, you are incorrectly assuming that P(A|B) = P(B|A). In psychology jargon, you are guilty of base rate insensitivity.

Here’s why. A power analysis tells you the following: for a given sample size, IF the effect is large (or small) THEN you will have a certain probability of rejecting the null. But that probability is not the same as the probability that IF you have rejected the null THEN the effect is large (or small). The latter probability depends both on power and on how common large and small effects are to begin with — what a statistician would call the prior probability, and a psychologist would call the base rate.

To put this in context, suppose an experimenter is working in an area where most effects are small, some are medium, and a few are large (which pretty well describes the field of social psychology as a whole). The experimenter does not know in advance which kind of effect a given study is chasing, of course. When the experimenter happens onto one of the occasional large effects, the test will probably be significant. But more often the true effect will be small. Many of those small effects will be “missed,” but so many small-effect experiments are being run (relative to the number chasing large effects) that the small effects that do reach significance will end up being the majority of the significant results.

Consider a simplified numerical example with just 2 possibilities. Suppose that 10% of experiments are chasing an effect that is so big there’s a 90% chance of detecting it (.90 power), and 90% of experiments are chasing smaller effects with a 40% chance of detection (.40 power). Out of 100 experiments, the experimenter will get 9 significant results from the large effects, and 36 significant results from the small effects. So most of the significant results (80% of them) will come from having gotten a little bit lucky with small effects, rather than having nailed a big effect.
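Spelling out that arithmetic (all of the numbers are the assumptions of the toy example, not data):

```python
# The toy example: 10% of experiments chase a large effect (power .90),
# 90% chase smaller effects (power .40).
n_experiments = 100
sig_from_large = 0.10 * n_experiments * 0.90   # 9 significant results
sig_from_small = 0.90 * n_experiments * 0.40   # 36 significant results

print(sig_from_large, sig_from_small)                      # 9.0 36.0
print(sig_from_small / (sig_from_large + sig_from_small))  # 0.8
```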

(Of course, that simplified example assumes, probably too generously, that the null is never true or close-to-true. Moreover, with small samples the most common outcome — absent any p-hacking — will be that the results are not significant at all, even when there really is an effect.)

If a researcher really is only interested in identifying effects of a certain size, it might seem reasonable to calculate effect sizes. But with small samples, the researcher will greatly overestimate effect sizes in the subset of results that are significant, and the amount of bias will be the greatest when the true effects are small. That is because in order to show up as significant, those small (true) effects will need to have been helped along by sampling error, giving them a positive bias as a group. So the data won’t be much help in distinguishing truly large effects from the times when chance helped along a small one. They’ll look much the same.
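Here is a quick simulation of that selection effect. It is my own sketch with an assumed true effect of d = 0.2 and n = 20 per cell, not a reanalysis of any particular study:

```python
# Simulate many two-group experiments with a small true effect (d = 0.2) and
# n = 20 per cell, then look at the effect sizes of the ones that reach p < .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_d, n_per_cell, n_sims = 0.2, 20, 20_000
observed_d_when_sig = []

for _ in range(n_sims):
    a = rng.normal(true_d, 1, n_per_cell)   # "treatment" scores
    b = rng.normal(0.0, 1, n_per_cell)      # "control" scores
    t, p = stats.ttest_ind(a, b)
    if p < .05 and t > 0:
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        observed_d_when_sig.append((a.mean() - b.mean()) / pooled_sd)

print("true d:", true_d)
print("mean d among significant results:",
      round(float(np.mean(observed_d_when_sig)), 2))
# The conditional mean typically lands in the 0.7 to 0.8 range: a small true
# effect that, when it happens to reach significance, looks like a large one.
```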

2. Behind the “I only need small samples” argument is the idea that the researcher attaches some special interpretive value (theoretical meaning or practical significance) to effects that are at least some certain size. But if that is the case, then the researcher needs to adjust how he or she does hypothesis testing. In conventional NHST, the null hypothesis you are trying to rule out states that the effect is zero. But if you only care about medium-and-larger effects, then you need to go a step further and rule out the hypothesis that the effect is small. Which is entirely possible to do, and not that difficult mathematically. But in order to differentiate a medium effect from a small one, you need statistical power. n=20 per cell won’t cut it.
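As a rough sketch of what that implies for sample size (my own numbers, using a standard power routine): to show that an effect exceeds d = 0.2 when the true effect is d = 0.5, the effect size that does the work is approximately the difference between the two, about 0.3.

```python
# Approximate sample size per cell for 80% power, one-sided test.
from statsmodels.stats.power import TTestIndPower

calc = TTestIndPower()

# Shifted null: show the effect exceeds d = 0.2 when the true effect is d = 0.5
# (effective effect size is roughly the difference, 0.3).
print(calc.solve_power(effect_size=0.3, alpha=0.05, power=0.80,
                       alternative="larger"))   # roughly 140 per cell

# Conventional null of zero, true effect d = 0.5, for comparison.
print(calc.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                       alternative="larger"))   # roughly 50 per cell
```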

(This situation is in some ways the mirror opposite of when researchers say they only care about the direction of an effect, not its size. But even if nothing substantive is riding on the effect size, most effects turn out to be small, and the experiment is only worth doing if it is reasonably capable of detecting something.)

3. All of the above assumes optimal scientific practice — no researcher degrees of freedom, no publication bias, etc. In the real world, of course, things are rarely optimal.

When it comes to running studies with small samples, one of the biggest risks is engaging in data churning and capitalizing on chance — even inadvertently. Consider two researchers who are planning to run 2-condition experiments in which a significant result would be theoretically interesting and publishable. Both decide to dedicate the time and resources to running up to 200 subjects on the project before giving up and moving on. Researcher A runs one well-powered study with n=100 per condition. Researcher B decides to run the experiment with n=20 per condition. If it is significant, B will stop and write up the result; if it is not significant, B tweaks the procedure and tries again, up to 5 times.

Without realizing it, Researcher B is working with an effective Type I error rate that is about 23%.
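The arithmetic behind that figure, assuming the null is true and the five attempts are independent:

```python
# Chance of at least one false positive across up to five independent tries
# at p < .05 when the null is true.
alpha, max_tries = 0.05, 5
print(1 - (1 - alpha) ** max_tries)   # about 0.226
```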

Of course, Researcher B should also recognize that this is a cross-validation situation. By tweaking and repeating (or merely being willing to tweak and repeat if the result is not significant), B is engaged in exploratory research. To guard against capitalizing on chance, the variation on the experiment that finally “works” needs to be run again. That is actually true regardless of sample size or power. But the practical consequences of violating it are bigger and bigger as samples get smaller and smaller.

The upshot of all of this is pretty straightforward: you cannot talk your way out of running studies with adequate power. It does not matter if you only care about large effects, if you care about small effects too, or if you do not care about effect sizes at all. It doesn’t even matter if you focus on estimation rather than NHST (which I wholeheartedly support, by the way) — you still need adequate samples. The only alternatives are (a) to live with a lot of ambiguous (nonsignificant) results until enough accrue to do a meta-analysis, or (b) to p-hack your way out of them.

Where is RDoC headed? A look at the eating disorders FOA

Thomas Insel, director of NIMH, made a splash recently with the announcement that NIMH funding will be less strictly tied to the DSM. That by itself would be good news, given all the problems with DSM. But the proposed replacement, the Research Domain Criteria (RDoC), has worried some people that NIMH is pursuing biology to the exclusion of other levels of analysis, as opposed to taking a more integrated approach.

We can try to divine NIMH’s future directions from the RDoC description and the director’s blog post, but it’s hard to tell whether mentions of behavior and phenomenology reflect real priorities or just lip service. Likewise for social and cultural factors. They come up in a discussion of “environmental aspects” that might interact with neural circuits, but they do not appear as focal units of analysis in the RDoC matrix, leaving them in a somewhat ambiguous state.

Another approach is to look at revealed preferences. Regardless of what anybody is saying, how is NIMH actually going to spend its money?

As an early indication, the NIMH RDoC overview page links to 2 funding opportunity announcements (FOAs) that are based on RDoC. Presumably these are examples of where RDoC-driven research is headed. One of the FOAs is for eating disorders. Here is the overview:

Eating disorders, including anorexia nervosa (AN), bulimia nervosa (BN), and their variants, are a major source of physical and psychological morbidity and constitute the major contribution to excess mortality from psychiatric disorders.  Clinical presentations of eating disorders are highly heterogeneous, involving broad and often overlapping symptomatology, which is often further complicated by metabolic and nutritional challenges that result from restricted food intake, excessive exercise, and repeated binge and/or purge episodes.  The recognition that relatively specific behaviors, cognitive operations, and affective processes are primarily implemented by particular neural circuits suggests that dysregulated functions and associated neural circuits should be a critical focus of study, and, ultimately, the target of assessment and treatment for eating disorders.

Here is a list of words that do not appear anywhere in the eating disorders FOA:

social
media
culture
family
peer (when not followed by “review” referring to the funding processes)
body image
self (when not followed by “-report” in a rote recital of the RDoC units of analysis)

And maybe I shouldn’t get too hung up on a choice of a definite vs. indefinite article, but what’s up with stating that neural circuits should be “ultimately, the target of assessment and treatment”?

Eating disorders isn’t my area, so I might have missed something. Perhaps NIMH is planning to issue another RDoC-based eating disorders FOA that invites research on sociocultural factors. Or maybe I’m missing some other important way that those factors will be incorporated into NIMH’s priorities for studying eating disorders. But if not — if NIMH thinks that basic research on media, on family environments, on peer influence, on self-concept, and on cultural norms is not terribly important for understanding and treating eating disorders — well, that’s really hard to defend. And not a good sign of where things are headed more broadly.

A null replication in press at Psych Science – anxious attachment and sensitivity to temperature cues

Etienne LeBel writes:

My colleague [Lorne Campbell] and I just got a paper accepted at Psych Science that reports on the outcome of two strict direct replications where we worked very closely with the original author to have all methodological design specifications as similar as those in the original study (and unfortunately did not reproduce the original finding).

We believe this is an important achievement for the “replication movement” because it shows that (a) attitudes are changing at the journal level with regard to rewarding direct replication efforts (to our knowledge this is the first strictly direct replications to be published at a top journal like Psych Science [JPSP eventually published large-scale failed direct replications of Bem's ESP findings, but this was of course a special case]) and (b) that direct replication endeavors can contribute new knowledge concerning a theoretical idea while maintaining a cordial, non-adversarial atmosphere with the original author. We really want to emphasize this point the most to encourage other researchers to engage in similar direct replication efforts. Science should first and foremost be about the ideas rather than the people behind the ideas; we’re hoping that examples like ours will sensibilize people to a more functional research culture where it is OK and completely normal for ideas to be revised given new evidence.

An important achievement indeed. The original paper was published in Psychological Science too, so it is especially good to see the journal owning the replication attempt. And hats off to LeBel and Campbell for taking this on. Someday direct replications will hopefully be more normal, but in the world we currently live in it takes some gumption to go out and try one.

I also appreciated the very fact-focused and evenhanded tone of the writeup. If I can quibble, I would have ideally liked to see a statistical test contrasting their effect against the original one, that is, a test of the hypothesis that the replication result is different from the original result. I am sure it would have been significant, and it would have been preferable to comparing the original paper’s significant rejection of the null with the replication’s non-significant test against the null. But that’s a small thing compared to what a large step forward this is.
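For concreteness, here is the kind of contrast I have in mind: a Fisher z test for the difference between two independent correlations. The values plugged in below are placeholders for illustration, not the actual statistics from either paper:

```python
# Fisher z test for the difference between two independent correlations.
# The r and n values below are placeholders, not the actual results.
import numpy as np
from scipy import stats

def compare_correlations(r1, n1, r2, n2):
    z1, z2 = np.arctanh(r1), np.arctanh(r2)
    se_diff = np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se_diff
    p = 2 * stats.norm.sf(abs(z))
    return z, p

print(compare_correlations(0.35, 60, 0.02, 240))   # hypothetical numbers
```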

Now let’s see what happens with all those other null replications of studies about relationships and physical warmth.