Melissa Ferguson and I are the program co-chairs for the upcoming SPSP conference in New Orleans, January 17-19. That means we are in charge of the scientific content of the program. (Cindy Pickett is the convention chair, meaning she’s in charge of pretty much everything else, which I have discovered is a heck of a lot more than 99% of the world knows. If you see Cindy at the conference, please buy her a drink.) The conference is going to be awesome. You should go.
One issue that I’m particularly attuned to is the representation of personality psychology on the program. During my work as program co-chair, I heard from some people who are from a more centrally personality-psych background that they’re worried that the conference is tilted too heavily toward social psych, and therefore there won’t be enough interesting stuff to go to.
I am writing here to dispel that notion. If you are a personality psychologist and you’re wavering about going, trust me: there’ll be lots of exciting stuff for you.
SPSP has a long-standing commitment to ensuring that both of its parent disciplines are well represented at the conference. That means, first of all, that the 2 program co-chairs are picked to make sure there is broad representation at the top. So among my predecessors are folks like Veronica Benet-Martinez, Sam Gosling, Will Fleeson, etc… — people who have both the expertise and motivation to make sure that outstanding personality submissions make it onto the program. Speaking for myself, I don’t see the personality/social distinction as mapping easily onto my work (it’s both!), but hopefully most people who are from a more canonical personality point of view will see me as intellectually connected to that.
One way that directly translates into program content is through selection of reviewers. Melissa and I made sure that both the symposium and poster review panels had plenty of personality psychologists, so all personality-related submissions get a fair shake. Not every good submission made it onto the program — there was just too much good stuff (and that’s true across all topic areas). But I personally assigned every symposium submission to its reviewers, and I promise you that anything that looked personality-ish got read by someone with relevant expertise.
On top of all that, SPSP’s 2013 president is David Funder. David got to handpick speakers for a Presidential Symposium, and he’ll also give a presidential address. Those sessions will appeal to everybody at SPSP, but I think personality psychologists will feel particularly happy.
For people interested in personality psychology content, here are some highlights:
Presidential Symposium, Thu 5:00 pm – 7:00 pm. Title: “The First ‘P’ in SPSP.” David will give the opening remarks, followed by talks by Colin DeYoung on personality and neuroscience, Sarah Hampson on lifespan personality development, and Bob Krueger on how personality psychology is shaping the DSM-5. (Hardcore social folks, these are 3 dynamite researchers. I bet you’ll like this one too!)
Presidential Address, Fri 2:00 pm – 3:15 pm. David Funder gets the spotlight this time, in a talk titled “Taking the Power of the Situation Seriously.”
Award lectures, Fri 5:00 pm – 6:30 pm. The recipients of SPSP’s 3 major awards will speak at this session. Dan McAdams is the winner of the Jack Block award for personality. Dan Wegner is the winner of the Campbell award in social psych (Thalia Wheatley will be speaking on his behalf). And Jamie Pennebaker is the winner of the inaugural Distinguished Scholar Award.
Symposium Room 217-219. In order to ensure that there is always something personality-oriented for people to go to, we picked 9 symposia that we thought would be especially appealing to personality psychologists and spread them out over every timeslot. So if you want personality, personality, and more personality, you can set up camp in room 217-219 and never leave.
All the other symposium rooms. Just because we highlighted personality stuff in one room doesn’t mean that’s the only place it appears on the schedule. “Personality versus social psychology” is a clearer distinction in people’s stereotypes than in reality. Spread across the schedule are presentations on gene-environment interactions, individual differences and health, subjective well-being, motivation and self-regulation, research methods and practices, and much more.
Posters, posters, posters. There is personality-related content in every poster session. Posters were grouped by keywords (self-nominated by the submitters), so an especially high concentration will be in Session E on Saturday morning.
As long as personality psychologists keep submitting their best stuff, the high-quality representation of personality at SPSP is going to remain the rule in years to come.
So what, then, is driving Simonsohn? His fraud-busting has an almost existential flavor. “I couldn’t tolerate knowing something was fake and not doing something about it,” he told me. “Everything loses meaning. What’s the point of writing a paper, fighting very hard to get it published, going to conferences?”
It reminded me of a story involving my colleague (and grand-advisor) Lew Goldberg. Lew was at a conference once when someone presented a result that he was certain could not be correct. After the talk, Lew stood up and publicly challenged the speaker to a bet that she’d made a coding error in the data. (The bet offer is officially part of the published scientific record. According to people who were there, it was for a case of whiskey.)
The research got published anyway. Several years of back-and-forth followed, with what Lew felt was a vague and insufficient admission of possible errors, and it ended with Lew and colleagues publishing a comment on an erratum, the only time I’ve ever heard of that happening in a scientific journal. When someone asked Lew recently why he’d been so motivated to follow through, he answered in part: “Science is more interesting when it’s true.”
Breathless headline-grabbing press releases based on modest findings. Investigations driven by confirmation bias. Broad generalizations based on tiny samples.
I am talking, of course, about the final report of the Diederik Stapel investigation.
Regular readers of my blog will know that I have been beating the drum for reform for quite a while. I absolutely think psychology in general, and perhaps social psychology especially, can and must work to improve its methods and practices.
But in reading the commission’s press release, which talks about “a general culture of careless, selective and uncritical handling of research and data” in social psychology, I am struck that those conclusions are based on a retrospective review of a known fraud case — a case that the commissions were specifically charged with finding an explanation for. So when they wag their fingers about a field rife with elementary statistical errors and confirmation bias, it’s a bit much for me.
I am writing this as a first reaction based on what I’ve seen in the press. At some point when I have the time and the stomach I plan to dig into the full 100-page commission report. I hope that — as is often the case when you go from a press release to an actual report — it takes a more sober and cautious tone. Because I do think that we have the potential to learn some important things by studying how Diederik Stapel did what he did. Most likely we will learn what kinds of hard questions we need to be asking of ourselves — not necessarily what the answers to those questions will be. Remember that the more we are shocked by the commission’s report, the less willing we should be to reach any sweeping generalizations from it.
So let’s all take a deep breath, face up to the Stapel case for what it is — neither exaggerating nor minimizing it — and then try to have a productive conversation about where we need to go next.
The tools we have available to us affect the way we interact with and even think about the world. “If all you have is a hammer” etc. Along these lines, I’ve been wondering what would happen if the makers of data analysis software like SPSS, SAS, etc. changed some of the defaults and options. Sort of in the spirit of Nudge — don’t necessarily change the list of what is ultimately possible to do, but make changes to make some things easier and other things harder (like via defaults and options).
Would people think about their data differently? Here’s my list of how I might change regression procedures, and what I think these changes might do:
1. Let users write common transformations of variables directly into the syntax. Things like centering, z-scoring, log-transforming, multiplying variables into interactions, etc. This is already part of some packages (it’s easy to do in R), but not others. In particular, running interactions in SPSS is a huge royal pain. For example, to do a simple 2-way interaction with centered variables, you have to write all this crap *and* cycle back and forth between the code and the output along the way:
desc x1 x2.
* Run just the above, then look at the output and see what the means are, then edit the code below.
compute x1_c = x1 - [whatever the mean was].
compute x2_c = x2 - [whatever the mean was].
compute x1x2 = x1_c*x2_c.
regression /dependent y /enter x1_c x2_c x1x2.
Why shouldn’t we be able to do it all in one line like this?
regression /dependent y /enter center(x1) center(x2) center(x1)*center(x2).
The nudge: If it were easy to write everything into a single command, maybe more people would look at interactions more often. And maybe they’d stop doing median splits and then jamming everything into an ANOVA!
2. By default, the output shows you parameter estimates and confidence intervals.
3. Either by default or with an easy-to-implement option, you can get a variety of standardized effect size estimates with their confidence intervals. And let’s not make variance-explained metrics (like R^2 or eta^2) the defaults.
The nudge: #2 and #3 are both designed to focus people on point and interval estimation, rather than NHST.
This next one is a little more radical:
4. By default the output does not show you inferential t-tests and p-values — you have to ask for them through an option. And when you ask for them, you have to state what the null hypotheses are! So if you want to test the null that some parameter equals zero (as 99.9% of research in social science does), hey, go for it — but it has to be an active request, not a passive default. And if you want to test a null hypothesis that some parameter is some nonzero value, it would be easy to do that too.
The nudge: In the way a lot of statistics is taught in psychology, NHST is the main event and effect estimation is an afterthought. This would turn it around. And by making users specify a null hypothesis, it might spur us to pause and think about how and why we are doing so, rather than just mining for asterisks to put in tables. Heck, I bet some nontrivial number of psychology researchers don’t even know that the null hypothesis doesn’t have to be the nil hypothesis. (I still remember the “aha” feeling the first time I learned that you could do that, well along into graduate school, in an elective statistics class.) If we want researchers to move toward point or range predictions with strong hypothesis testing, we should make it easier to do.
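To make #4 concrete, here is a minimal Python sketch of what an "ask for the test, and state your null" interface could look like for a simple one-predictor regression. The function name slope_vs_null and its null_value argument are hypothetical, invented for illustration; they are not part of any existing package. The p-value and interval use a normal approximation, which is an assumption I made to keep the sketch within the standard library (a t distribution would be more accurate in small samples).

```python
import math
from statistics import NormalDist, mean


def slope_vs_null(x, y, *, null_value):
    """Regress y on x and test the slope against an explicitly stated null.

    `null_value` is keyword-only and has no default, so the caller must say
    which null hypothesis is being tested (hypothetical design choice).
    """
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residual variance with n - 2 degrees of freedom.
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)

    # Normal approximation (assumption): fine for large n, rough for small n.
    z_crit = NormalDist().inv_cdf(0.975)
    statistic = (slope - null_value) / se
    p_approx = 2 * (1 - NormalDist().cdf(abs(statistic)))
    return {
        "slope": slope,
        "ci95": (slope - z_crit * se, slope + z_crit * se),
        "null_value": null_value,
        "statistic": statistic,
        "p_approx": p_approx,
    }


if __name__ == "__main__":
    # Toy data: the true slope is 2, with noise that cancels within each x.
    x = [k for k in range(10) for _ in range(2)]
    y = [2 * xi + (1 if i % 2 == 0 else -1) for i, xi in enumerate(x)]
    print(slope_vs_null(x, y, null_value=0.0))  # the usual nil hypothesis
    print(slope_vs_null(x, y, null_value=2.0))  # a non-nil point prediction
```

Calling slope_vs_null(x, y) without null_value raises a TypeError, which is the whole point: the estimate and interval come for free, but the test is an active request.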
All of these things are possible to do in most or all software packages. But as my SPSS example under #1 shows, they’re not necessarily easy to implement in a user-friendly way. Even R doesn’t do all of these things in the standard lm function. As a result, they probably don’t get done as much as they could or should.
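For comparison with the SPSS example under #1, here is a rough Python sketch of the "do it all in one call" idea, using only the standard library. The helper names (center, solve, fit_centered_interaction) are my own inventions for illustration, and the fit uses plain normal equations, which is fine for a toy example but not what a production regression routine would do.

```python
from statistics import mean


def center(values):
    """Subtract the sample mean from every value."""
    m = mean(values)
    return [v - m for v in values]


def solve(a, b):
    """Solve a linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    coef = [0.0] * n
    for r in reversed(range(n)):
        rest = sum(aug[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (aug[r][n] - rest) / aug[r][r]
    return coef


def fit_centered_interaction(y, x1, x2):
    """OLS of y on centered x1, centered x2, and their product.

    Returns [intercept, b_x1, b_x2, b_interaction]. No trip to the
    descriptives output is needed to look up the means.
    """
    x1_c, x2_c = center(x1), center(x2)
    rows = [[1.0, a, b, a * b] for a, b in zip(x1_c, x2_c)]
    k = 4
    # Normal equations: (X'X) coef = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


if __name__ == "__main__":
    # Made-up data purely for illustration.
    x1 = [1, 2, 3, 4, 5, 6, 7, 8]
    x2 = [3, 1, 4, 1, 5, 9, 2, 6]
    y = [2.0, 1.5, 4.0, 3.0, 6.5, 12.0, 5.0, 11.0]
    print(fit_centered_interaction(y, x1, x2))
```

The centering and the product term happen inside the one call, which is the nudge in #1: nobody has to copy means out of one block of output and paste them into the next.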
Any other nudges you’d make?
Imagine that you have entered a charity drawing to win a free iPad. The charity organizer draws a ticket, and it’s your number. Hooray! But wait, someone else is cheering too. After a little investigation it turns out that due to a printing error, two different tickets had the same winning number. You don’t want to be a jerk and make the charity buy another iPad, and you can’t saw it in half. So you have to decide who gets the iPad.
Suppose that someone proposes to flip a coin to decide who gets the iPad. Sounds pretty fair, right?
But suppose that the other guy with a winning ticket — let’s call him Pete — instead proposes the following procedure. First the organizer will flip a coin. If Pete wins that flip, he gets the iPad. But if you win the flip, then the organizer will toss the coin 2 more times. If Pete wins best out of 3, he gets the iPad. If you win best out of 3, then the organizer will flip yet another 2 times. If Pete wins the best out of those (now) 5 flips, he gets the iPad. If not, keep going… Eventually, if Pete gets tired and gives up before he wins the iPad, you can have it.
Doesn’t sound so fair, does it?
The procedure I just described is not all that different from the research practice of data peeking. Data peeking goes something like this: you run some subjects, then do an analysis. If it comes out significant, you stop. If not, you run some more subjects and try again. What Peeky Pete’s iPad Procedure and data-peeking have in common is that you are starting with a process that includes randomness (coin flips, or the random error in subjects’ behavior) but then using a biased rule to stop the random process when it favors somebody’s outcome. Which means that the “randomness” is no longer random at all.
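If you want to see just how lopsided Peeky Pete's procedure is, here is a small simulation sketch in Python. The cap on the number of flips (max_flips) is my own stand-in for "Pete gets tired and gives up," and the function names are made up for this illustration.

```python
import random


def pete_wins(max_flips, rng):
    """Play the procedure once; return True if Pete gets the iPad.

    The tally is checked after 1, 3, 5, ... flips. Pete wins as soon as he
    is ahead at a check. You win only if he is never ahead by the time he
    gives up at `max_flips` (hypothetical cap).
    """
    lead = 0  # flips Pete has won minus flips you have won
    for flip in range(1, max_flips + 1):
        lead += 1 if rng.random() < 0.5 else -1
        if flip % 2 == 1 and lead > 0:
            return True
    return False


def pete_win_rate(max_flips, trials=20000, seed=1):
    """Estimate Pete's chance of winning over many simulated drawings."""
    rng = random.Random(seed)
    return sum(pete_wins(max_flips, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    for cap in (1, 3, 11, 101):
        print(cap, pete_win_rate(cap))
```

With a single flip it is a fair 50/50. Allowing just one extra round (3 flips) already gives Pete an exact probability of 1/2 + 1/8 = 0.625, and his edge keeps growing the longer he is willing to keep flipping.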
Statisticians have been studying the consequences of data-peeking for a long time (e.g., Armitage et al., 1969). But the practice has received new attention recently in psychology, in large part because of the Simmons et al. false-positive psychology paper that came out last year. Given this attention, it is fair to wonder (1) how common is data-peeking, and (2) how bad is it?
How common is data-peeking?
Anecdotally, a lot of people seem to think data peeking is common. Tal Yarkoni described data peeking as “a time-honored tradition in the social sciences.” Dave Nussbaum wrote that “Most people don’t realize that looking at the data before collecting all of it is much of a problem,” and he says that until recently he was one of those people. Speaking from my own anecdotal experience, ever since Simmons et al. came out I’ve had enough casual conversations with colleagues in social psychology to bring me around to thinking that data peeking is not rare. And in fact, I have talked to more than one fMRI researcher who considers data peeking not only acceptable but beneficial (more on that below).
More formally, when Leslie John and others surveyed academic research psychologists about questionable research practices, a majority (55%) outright admitted that they have “decid[ed] whether to collect more data after looking to see whether the results were significant.” John et al. use a variety of techniques to try to correct for underreporting; they estimate the real prevalence to be much higher. On the flip side, it is at least a little ambiguous whether some respondents might have interpreted “deciding whether to collect more data” to include running a new study, rather than adding new subjects to an existing one. But the bottom line is that data-peeking does not seem to be at all rare.
How bad is it?
You might be wondering, is all the fuss about data peeking just a bunch of rigid stats-nerd orthodoxy, or does it really matter? After all, statisticians sometimes get worked up about things that don’t make much difference in practice. If we’re talking about something that turns a 5% Type I error rate into 6%, is it really a big deal?
The short answer is yes, it’s a big deal. Once you start looking into the math behind data-peeking, it quickly becomes apparent that it has the potential to seriously distort results. Exactly how much depends on a lot of factors: how many cases you run before you take your first peek, how frequently you peek after that, how you decide when to keep running subjects and when to give up, etc. But a good and I think realistic illustration comes from some simulations that Tal Yarkoni posted a couple years ago. In one instance, Tal simulated what would happen if you run 10 subjects and then start peeking every 5 subjects after that. He found that you would effectively double your Type I error rate by the time you hit 20 subjects. If you peek a little more intensively and run a few more subjects it gets a lot worse. Under what I think are pretty realistic conditions for a lot of psychology and neuroscience experiments, you could easily end up reporting p<.05 when the true false-positive rate is closer to .20.
And that’s a serious difference. Most researchers would never dream of looking at a p=.19 in their SPSS output and then blatantly writing p<.05 in a manuscript. But if you data-peek enough, that could easily end up being de facto what you are doing, even if you didn’t realize it. As Tal put it, “It’s not the kind of thing you just brush off as needless pedantry.”
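To get a feel for the numbers, here is a minimal simulation sketch in Python. It is not Tal's code or his exact setup: I am assuming a one-sample design with normally distributed scores, a true effect of zero, and a z-test with known variance (so that everything runs on the standard library). The function names are made up for illustration.

```python
import math
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided test at alpha = .05


def significant_at_any_look(looks, rng, z_crit=Z_CRIT):
    """Simulate one study under the null; True if any look is 'significant'.

    Scores come from N(0, 1), so the true effect is zero and every rejection
    is a false positive. `looks` lists the sample sizes at which we peek.
    """
    total, n = 0.0, 0
    for look_n in looks:
        while n < look_n:
            total += rng.gauss(0.0, 1.0)
            n += 1
        z = total / math.sqrt(n)  # sample mean divided by 1 / sqrt(n)
        if abs(z) > z_crit:
            return True  # stop collecting data and write it up
    return False


def false_positive_rate(looks, sims=4000, seed=1):
    """Proportion of simulated null studies that end up 'significant'."""
    rng = random.Random(seed)
    hits = sum(significant_at_any_look(looks, rng) for _ in range(sims))
    return hits / sims


if __name__ == "__main__":
    print("fixed n = 20:          ", false_positive_rate([20]))
    print("peek at 10, 15, 20:    ", false_positive_rate([10, 15, 20]))
    print("peek every 5 up to 100:", false_positive_rate(list(range(10, 101, 5))))
```

In this sketch the fixed-n design stays near the nominal 5%, while the peeking schedules land well above it, and the gap grows as you add more looks.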
So what to do?
These issues are only becoming more timely, given current concerns about replicability in psychology. So what to do about it?
The standard advice to individual researchers is: don’t data-peek. Do a power analysis, set an a priori sample size, and then don’t look at your data until you are done for good. This should totally be the norm in the vast majority of psychology studies.
And to add some transparency and accountability, one of Psychological Science’s proposed disclosure statements would require you to state clearly how you determined your sample size. If that happens, other journals might follow after that. If you believe that most researchers want to be honest and just don’t realize how bad data-peeking is, that’s a pretty good way to spread the word. People will learn fast once their papers start getting sent back with a request to run a replication (or rejected outright).
But is abstinence the only way to go? Some researchers make a cost-benefit case for data peeking. The argument goes as follows: With very expensive procedures (like fMRI), it is wasteful to design high-powered studies if that means you end up running more subjects than you need to determine if there is an effect. (As a sidenote, high-powered studies are actually quite important if you are interested in unbiased (or at least less biased) effect size estimation, but that’s a separate conversation; here I’m assuming you only care about significance.) And on the flip side, the argument goes, Type II errors are wasteful too — if you follow a strict no-data-peeking policy, you might run 20 subjects and get p=.11 and then have to set aside the study and start over from scratch.
Of course, it’s also wasteful to report effects that don’t exist. And compounding that, studies that use expensive procedures are also less likely to get directly replicated, which means that false-positive errors are harder to get found out.
So if you don’t think you can abstain, the next-best thing is to use protection. For those looking to protect their p I have two words: interim analysis. It turns out this is a big issue in the design of clinical trials. Sometimes that is for very similar expense reasons. And sometimes it is because of ethical and safety issues: often in clinical trials you need ongoing monitoring so that you can stop the trial just as soon as you can definitively say that the treatment makes things better (so you can give it to the people in the placebo condition) or worse (so you can call off the trial). So statisticians have worked out a whole bunch of ways of designing and analyzing studies so that you can run interim analyses while keeping your false-positive rate in check. (Because of their history, such designs are sometimes called sequential clinical trials, but that shouldn’t chase you off: the statistics don’t care if you’re doing anything clinical.) SAS has dedicated procedures for them: PROC SEQDESIGN for designing the study and PROC SEQTEST for analyzing it. And R users have lots of options. (I don’t know if these procedures have been worked into fMRI analysis packages, but if they haven’t, they really should be.)
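As one illustration of what "protection" can look like, here is a Python sketch of a Pocock-style design, one of the classic group-sequential approaches: you plan a fixed number of equally spaced looks and use the same, stricter critical value at each one. This is only a sketch under simplifying assumptions (a one-sample z-test with known variance, a true effect of zero), not a reproduction of what SAS or any R package does. The critical values are the standard tabled Pocock constants for an overall two-sided alpha of .05; the function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

# Tabled Pocock critical values (overall two-sided alpha = .05),
# keyed by the planned number of equally spaced looks.
POCOCK_Z = {2: 2.178, 3: 2.289, 4: 2.361, 5: 2.413}
NAIVE_Z = NormalDist().inv_cdf(0.975)  # about 1.96, ignores the peeking


def rejects_null(n_per_look, n_looks, z_crit, rng):
    """One simulated null study with interim analyses; True if it rejects."""
    total, n = 0.0, 0
    for _ in range(n_looks):
        for _ in range(n_per_look):
            total += rng.gauss(0.0, 1.0)
            n += 1
        if abs(total / math.sqrt(n)) > z_crit:
            return True  # stop early
    return False


def type_one_error(n_per_look, n_looks, z_crit, sims=6000, seed=1):
    """Overall false-positive rate across all looks, estimated by simulation."""
    rng = random.Random(seed)
    hits = sum(rejects_null(n_per_look, n_looks, z_crit, rng) for _ in range(sims))
    return hits / sims


if __name__ == "__main__":
    looks = 3  # analyze after 10, 20, and 30 subjects
    print("uncorrected peeking:", type_one_error(10, looks, NAIVE_Z))
    print("Pocock boundary:    ", type_one_error(10, looks, POCOCK_Z[looks]))
```

The trade-off is that each individual look has to clear a higher bar (|z| > 2.289 instead of 1.96 with three looks), but in exchange you get to stop early without inflating the overall false-positive rate. Other boundaries (O'Brien-Fleming, alpha-spending functions) distribute that cost differently across looks.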
Very little that I’m saying is new. These issues have been known for decades. And in fact, the 2 recommendations I listed above (determine sample size in advance, or properly account for interim testing in your analyses) are the same ones Tal made. So I could have saved some blog space and just said “please go read Armitage et al. or Todd et al. or Yarkoni & Braver.” (UPDATE via a commenter: or Strube, 2006.) But blog space is cheap, and as far as I can tell word hasn’t gotten out very far. And with (now) readily available tools and growing concerns about replicability, it is time to put uncorrected data peeking to an end.
Pretty big news. Psychological Science is seriously discussing 3 new reform initiatives. They are outlined in a letter being circulated by Eric Eich, editor of the journal, and they come from a working group that includes top people from APS and several other scientists who have been active in working for reforms.
After reading it through (which I encourage everybody to do), here are my initial takes on the 3 initiatives:
Initiative 1: Create tutorials on power, effect size, and confidence intervals. There’s plenty of stuff out there already, but if PSci creates a good new source and funnels authors to it, it could be a good thing.
Initiative 2: Disclosure statements about research process (such as how sample size was determined, unreported measures, etc.) This could end up being a good thing, but it will be complicated. Simine Vazire, one of the working group members who is quoted in the proposal, puts it well:
We are essentially asking people to “incriminate” themselves — i.e., reveal information that, in the past, editors have treated as reasons not to publish a paper. If we want authors to be honest, I think they will want some explicit acknowledgement that some degree of messiness (e.g., a null result here and there) will be tolerated and perhaps even treated as evidence that the entire set of findings is even more plausible (a la [Gregory] Francis, [Uli] Schimmack, etc.).
I bet there would be low consensus about what kinds and amounts of messiness are okay, because no one is accustomed to seeing that kind of information on a large scale in other people’s studies. It is also the case that things that are problematic in one subfield may be more reasonable in another. And reviewers and editors who lack the time or local expertise to really judge messiness against merit may fall back on simplistic heuristics rather than thinking things through in a principled way. (Any psychologist who has ever tried to say anything about causation, however tentative and appropriately bounded, in data that was not from a randomized experiment probably knows what that feels like.)
Another basic issue is whether people will be uniformly honest in the disclosure statements. I’d like to believe so, but without a plan for real accountability I’m not sure. If some people can get away with fudging the truth, the honest ones will be at a disadvantage.
Initiative 3: A special submission track for direct replications, with 2 dedicated Associate Editors and a system of pre-registration and prior review of protocols to allow publication decisions to be decoupled from outcomes. A replication section at a journal? If you’ve read my blog before you might guess that I like that idea a lot.
The section would be dedicated to studies previously published in Psychological Science, so in that sense it is in the same spirit as the Pottery Barn Rule. The pre-registration component sounds interesting — by putting a substantial amount of review in place before data are collected, it helps avoid the problem of replications getting suppressed because people don’t like the outcomes.
I feel mixed about another aspect of the proposal, limiting replications to “qualified” scientists. There does need to be some vetting, but my hope is that they will set the bar reasonably low. “This paradigm requires special technical knowledge” can too easily be cover for “only people who share our biases are allowed to study this effect.” My preference would be for a pro-data, pro-transparency philosophy. Make it easy for lots of scientists to run and publish replication studies, and make sure the replication reports include information about the replicating researchers’ expertise and experience with the techniques, methods, etc. Then meta-analysts can code for the replicating lab’s expertise as a moderator variable, and actually test how much expertise matters.
My big-picture take. Retraction Watch just reported yesterday on a study showing that retractions, especially retractions due to misconduct, cause promising scientists to move to other fields and funding agencies to direct dollars elsewhere. Between alleged fraud cases like Stapel, Smeesters, and Sanna, and all the attention going to false-positive psychology and questionable research practices, psychology (and especially social psychology) is almost certainly at risk of a loss of talent and money.
Getting one of psychology’s top journals to make real reforms, with the institutional backing of APS, would go a long way to counteract those negative effects. A replication desk in particular would leapfrog psychology past what a lot of other scientific fields do. Huge credit goes to Eric Eich and everyone else at APS and the working group for trying to make real reforms happen. It stands a real chance of making our science better and improving our credibility.
This morning felt quite ignominious indeed, and naturally it reminded me of William James. From the Principles of Psychology, chapter 26, “Will”:
We know what it is to get out of bed on a freezing morning in a room without a fire, and how the very vital principle within us protests against the ordeal. Probably most persons have lain on certain mornings for an hour at a time unable to brace themselves to the resolve. We think how late we shall be, how the duties of the day will suffer; we say, “I must get up, this is ignominious,” etc.; but still the warm couch feels too delicious, the cold outside too cruel, and resolution faints away and postpones itself again and again just as it seemed on the verge of bursting the resistance and passing over into the decisive act. Now how do we ever get up under such circumstances? If I may generalize from my own experience, we more often than not get up without any struggle or decision at all. We suddenly find that we have got up. A fortunate lapse of consciousness occurs; we forget both the warmth and the cold; we fall into some revery connected with the day’s life, in the course of which the idea flashes across us, “Hollo! I must lie here no longer” – an idea which at that lucky instant awakens no contradictory or paralyzing suggestions, and consequently produces immediately its appropriate motor effects. It was our acute consciousness of both the warmth and the cold during the period of struggle, which paralyzed our activity then and kept our idea of rising in the condition of wish and not of will. The moment these inhibitory ideas ceased, the original idea exerted its effects.
James’s visible presence in contemporary psychology seems mostly limited to 2 roles. Someone finds a quote, puts it as an epigraph at the top of their manuscript, and claims that their ideas have roots going more than a century back. Or alternatively someone finds a quote, puts it up as an epigraph, and then claims that their ideas overturn more than a century of received wisdom.
But going back and actually re-reading James seriously is usually an enlightening activity. Just from that chapter on “Will” you can draw lines to contemporary research on delay of gratification, self-regulatory depletion, goal pursuit, the relationship between attention and executive control, and automaticity. James’s ideas about all of these topics are nuanced, with a lot of connections but few easy one-to-one mappings (whether supported or falsified) to contemporary research. Every once in a while I get the urge to go back and look at something James wrote, and if no contradictory or paralyzing suggestions stop me, I’m always glad that I did.