A Pottery Barn rule for scientific journals

Proposed: Once a journal has published a study, it becomes responsible for publishing direct replications of that study. Publication is subject to editorial review of technical merit but is not dependent on outcome. Replications shall be published as brief reports in an online supplement, linked from the electronic version of the original.


I wrote about this idea a year ago when JPSP refused to publish a paper that failed to replicate one of Daryl Bem’s notorious ESP studies. I discovered, immediately after writing up the blog post, that other people were thinking along similar lines. Since then I have heard versions of the idea come up here and there. And strands of it came up again in David Funder’s post on replication (“[replication] studies should, ideally, be published in the same journal that promulgated the original, misleading conclusion”) and the comments to it. When a lot of people are coming up with similar solutions to a problem, that’s probably a sign of something.

Like a lot of people, I believe that the key to improving our science is incentives. You can finger-wag about the importance of replication all you want, but if there is nowhere to publish and no benefit for trying, you are not going to change behavior. To a large extent, the incentives for individual researchers are controlled through institutions: established journal publishers, professional societies, granting agencies, etc. So if you want to change researchers’ behavior, target those institutions.

Hence a Pottery Barn rule for journals: once you publish a study, you own its replicability (or at least a significant piece of it).

This would change the incentive structure for researchers and for journals in a few different ways. For researchers, there are currently insufficient incentives to run replications. This would give them a virtually guaranteed outlet for publishing a replication attempt. Such publications should be clearly marked on people’s CVs as brief replication reports (probably by giving the online supplement its own journal name, e.g., Journal of Personality and Social Psychology: Replication Reports). That would make it easier for the academic marketplace (like hiring and promotion committees, etc.) to reach its own valuation of such work.

I would expect that grad students would be big users of this opportunity. Others have proposed that running replications should be a standard part of graduate training (e.g., see Matt Lieberman’s idea). This would make it worth students’ while, but without the organizational overhead of Matt’s proposal. The best 1-2 combo, for grad students and PIs alike, would be to embed a direct replication in a replicate-and-extend study. Then if the “extend” part does not work out, the replication report is a fallback (hopefully with a footnote about the failed extend). And if it does, the new paper is a more cumulative contribution than the shot-in-the-dark papers we often see now.

A system like this would change the incentive structure for original studies too. Researchers would know that whatever they publish is eventually going to be linked to a list of replication attempts and their outcomes. As David pointed out, knowing that others will try to replicate your work — and in this proposal, knowing that reports of those attempts would be linked from your own paper! — would undermine the incentives to use questionable research practices far better than any heavy-handed regulatory response. (And if that list of replication attempts is empty 5 years down the road because nobody thinks it’s worth their while to replicate your stuff? That might say something too.)

What about the changed incentives for journals? One benefit would be that the increased accountability for individual researchers should lead to better quality submissions for journals that adopted this policy. That should be a big plus.

A Pottery Barn policy would also increase accountability for journals. It would become much easier to document a journal’s track record of replicability, which could become a counterweight to the relentless pursuit of impact factors. Such accountability would mean a greater emphasis on evaluating replicability during the review process — e.g., to consider statistical power, to let reviewers look at the raw data and the materials and stimuli, etc.

But sequestering replication reports into an online supplement means that the journal’s main mission can stay intact. So if a journal wants to continue to focus on groundbreaking first reports in its main section, it can continue to do so without fearing that its brand will be diluted (though I predict that it would have to accept a lower replication rate in exchange for its focus on novelty).

Replication reports would generate some editorial overhead, but not nearly as much as original reports. They could be published based directly on an editorial decision, or perhaps with a single peer reviewer. A structured reporting format like the one used at Psych File Drawer would make it easier to evaluate the replication study relative to the original. (I would add a field to describe the researchers’ technical expertise and experience with the methods, since that is a potential factor in explaining differences in results.)

Of course, journals would need an incentive to adopt the Pottery Barn rule in the first place. Competition from outlets like PLoS One (which does not consider importance/novelty in its review criteria) or Psych File Drawer (which only publishes replications) might push the traditional journals in this direction. But ultimately it is up to us scientists. If we cite replication studies, if we demand and use outlets that publish them, and if we speak loudly enough, individually or through our professional organizations, I think the publishers will listen.

Replication, period. (A guest post by David Funder)

The following is a guest post by David Funder. David shares some of his thoughts about the best way forward through social psychology’s recent controversies over fraud and corner-cutting. David is a highly accomplished researcher with a lot of experience in the trenches of psychological science. He is also President-Elect of the Society for Personality and Social Psychology (SPSP), the main organization representing academic social psychologists — but he emphasizes that he is not writing on behalf of SPSP or its officers, and the views expressed in this essay are his own.


Can we believe everything (or anything) that social psychological research tells us? Suddenly, the answer to this question seems to be in doubt. The past few months have seen a shocking series of cases of fraud (researchers literally making their data up) by prominent psychologists at prestigious universities. These revelations have catalyzed an increase in concern about a much broader issue, the replicability of results reported by social psychologists. Numerous writers are questioning common research practices such as selectively reporting only studies that “work” and ignoring relevant negative findings that arise over the course of what is euphemistically called “pre-testing,” increasing N’s or deleting subjects from data sets until the desired findings are obtained and, perhaps worst of all, being inhospitable or even hostile to replication research that could, in principle, cure all these ills.

Reaction is visible. The European Association of Personality Psychology recently held a special three-day meeting on the topic, to result in a set of published recommendations for improved research practice; a well-financed conference in Santa Barbara in October will address the “decline effect” (the mysterious tendency of research findings to fade away over time); and the President of the Society for Personality and Social Psychology was recently motivated to post a message to the membership expressing official concern. These are just three reactions that I personally happen to be familiar with; I’ve also heard that other scientific organizations and even agencies of the federal government are looking into this issue, one way or another.

This burst of concern and activity might seem to be unjustified. After all, literally making your data up is a far cry from practices such as pre-testing, selective reporting, or running multiple statistical tests. These practices are even, in many cases, useful and legitimate. So why did they suddenly come under the microscope as a result of cases of data fraud? The common thread seems to be the issue of replication. As I already mentioned, the idealistic model of healthy scientific practice is that replication is a cure for all ills. Conclusions based on fraudulent data will fail to be replicated by independent investigators, and so eventually the truth will out. And, less dramatically, conclusions based on selectively reported data or derived from other forms of quasi-cheating, such as “p-hacking,” will also fade away over time.

The problem is that, in the cases of data fraud, this model visibly and spectacularly failed. The examples that were exposed so dramatically — and led tenured professors to resign from otherwise secure and comfortable positions (note: this NEVER happens except under the most extreme circumstances) — did not come to light because of replication studies. Indeed, anecdotally — which, sadly, seems to be the only way anybody ever hears of replication studies — various researchers had noticed that they weren’t able to repeat the findings that later turned out to be fraudulent, and one of the fakers even had a reputation of generating data that were “too good to be true.” But that’s not what brought them down. Faking of data was only revealed when research collaborators with first-hand knowledge — sometimes students — reported what was going on.

This fact has to make anyone wonder: what other cases are out there? If literal faking of data is only detected when someone you work with gets upset enough to report you, then most faking will never be detected. Just about everybody I know — including the most pessimistic critics of social psychology — believes, or perhaps hopes, that such outright fraud is very rare. But grant that point and the deeper moral of the story still remains: False findings can remain unchallenged in the literature indefinitely.

Here is the bridge to the wider issue of data practices that are not outright fraudulent, but increase the risk of misleading findings making it into the literature. I will repeat: so-called “questionable” data practices are not always wrong (they just need to be questioned). For example, explorations of large, complex (and expensive) data sets deserve and even require multiple analyses to address many different questions, and interesting findings that emerge should be reported. Internal safeguards are possible, such as split-half replications or randomization analyses to assess the probability of capitalizing on chance. But the ultimate safeguard to prevent misleading findings from permanent residence in (what we think is) our corpus of psychological knowledge is independent replication. Until then, you never really know.
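As one illustration of the "randomization analyses" mentioned above, here is a minimal sketch of a two-group permutation test. It is only one of many possible internal safeguards, and the function name `permutation_p_value`, its parameters, and the example numbers are hypothetical, chosen for illustration. The idea is to shuffle the group labels many times and see how often a difference in means at least as large as the observed one arises by chance alone.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, n_perms=10_000, seed=0):
    """Approximate two-sided permutation p-value for a difference in means.

    Hypothetical helper for illustration: labels are shuffled n_perms times,
    and we count how often the shuffled difference is at least as large as
    the observed one.
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perms):
        rng.shuffle(pooled)  # reassign observations to groups at random
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # The +1 terms count the observed arrangement itself, so p is never 0.
    return (extreme + 1) / (n_perms + 1)


if __name__ == "__main__":
    # Made-up scores for two conditions, for illustration only.
    treatment = [10, 11, 12, 13, 14]
    control = [1, 2, 3, 4, 5]
    print(permutation_p_value(treatment, control))
```

A small p-value here says only that the observed difference would be unusual if the group labels were arbitrary. It does not replace independent replication, which remains the safeguard the argument above rests on.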

Many remedies are being proposed to cure the ills, or alleged ills, of modern social psychology. These include new standards for research practice (e.g., registering hypotheses in advance of data gathering), new ethical safeguards (e.g., requiring collaborators on a study to attest that they have actually seen the data), new rules for making data publicly available, and so forth. All of these proposals are well-intentioned but the specifics of their implementation are debatable, and ultimately raise the specter of over-regulation. Anybody with a grant knows about the reams of paperwork one now must mindlessly sign attesting to everything from the exact percentage of their time each graduate student has worked on your project to the status of your lab as a drug-free workplace. And that’s not even to mention the number of rules — real and imagined — enforced by the typical campus IRB to “protect” subjects from the possible harm they might suffer from filling out a few questionnaires. Are we going to add yet another layer of rules and regulations to the average over-worked, under-funded, and (pre-tenure) insecure researcher? Over-regulation always starts out well-intentioned, but can ultimately do more harm than good.

The real cure-all is replication. The best thing about replication is that it does not rely on researchers doing less (e.g., running fewer statistical tests, only examining pre-registered hypotheses, etc.), but it depends on them doing more. It is sometimes said the best remedy for false speech is more speech. In the same spirit, the best remedy for misleading research is more research.

But this research needs to be able to see the light of day. Current journal practices, especially among our most prestigious journals, discourage and sometimes even prohibit replication studies from publication. Tenure committees value novel research over solid research. Funding agencies are always looking for the next new thing — they are bored with the “same old same old” and give low priority to research that seeks to build on existing findings — much less seeks to replicate them. Even the researchers who find failures to replicate often undervalue them. I must have done something wrong, most conclude, stashing the study into the proverbial “file drawer” as an unpublishable, expensive and sad waste of time. Those researchers who do become convinced that, in fact, an accepted finding is wrong, are unlikely to attempt to publish this conclusion. Instead, the failure becomes fodder for late-night conversations, fueled by beverages at hotel bars during scientific conferences. There, and pretty much only there, can you find out which famous findings are the ones that “everybody knows” can’t be replicated.

I am not arguing that every replication study must be published. Editors have to use their judgment. Pages really are limited (though less so in the arriving age of electronic publishing) and, more importantly, editors have a responsibility to direct the limited attentional resources of the research community to articles that matter. So any replication study should be carefully evaluated for the skill with which it was conducted, the appropriate level of statistical power, and the overall importance of the conclusion. For example, a solid set of high-powered studies showing that a widely accepted and consequential conclusion was dead wrong would be important in my book. (So would a series of studies confirming that an important, surprising, and counter-intuitive finding was actually true. But most aren’t, I suspect.) And this series of studies should, ideally, be published in the same journal that promulgated the original, misleading conclusion. As your mother always said, clean up your own mess.

Other writers have recently laid out interesting, ambitious, and complex plans for reforming psychological research, and even have offered visions of a “research utopia.” I am not doing that here. I only seek to convince you of one point: psychology (and probably all of science) needs more replications. Simply not ruling replication studies inadmissible out of hand would be an encouraging start. Do I ask too much?

From Walter Stewart to Uri Simonsohn

Over on G+, Ole Rogeberg asks what ever happened to Walter Stewart? Stewart was a biologist employed by NIH in the 80s and 90s who became involved in rooting out questionable research practices.

Rogeberg posts an old Omni Magazine interview with Stewart (from 1989) in which Stewart describes how he got involved in investigating fraud and misconduct and what led him to think that it was more widespread than many scientists were willing to acknowledge. If you have been following the fraud scandals in psychology and the work of Uri Simonsohn, you should read it. It is completely riveting. And I found some of the parallels to be uncanny.

For example, in Stewart’s first investigation of questionable research, one of the clues that raised his suspicions was a pattern of too-similar means in a researcher’s observations. Similar problems (estimates closer together than would be expected by chance) led Simonsohn to finger two researchers for misconduct.
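To make the "too similar" idea concrete, here is a minimal simulation sketch. It is not the procedure that Stewart or Simonsohn actually used. It is a simplified illustration under loudly stated assumptions: normally distributed data, a known common standard deviation, equal group sizes, and (most charitably to the researcher) identical true means in every condition. The function name `prob_means_this_similar` and the example numbers are hypothetical.

```python
import random
import statistics


def prob_means_this_similar(reported_means, sd, n_per_group,
                            n_sims=10_000, seed=0):
    """Estimate how often chance alone yields group means at least as
    tightly clustered as the reported ones.

    Assumptions (illustrative only): every group has the same true mean,
    observations are normal with known standard deviation `sd`, and each
    group has `n_per_group` observations.
    """
    rng = random.Random(seed)
    observed_spread = statistics.pstdev(reported_means)
    k = len(reported_means)
    # Under these assumptions, a sample mean varies with standard error
    # sd / sqrt(n).
    standard_error = sd / n_per_group ** 0.5
    hits = 0
    for _ in range(n_sims):
        simulated_means = [rng.gauss(0.0, standard_error) for _ in range(k)]
        if statistics.pstdev(simulated_means) <= observed_spread:
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    # Made-up numbers: four condition means that barely differ, even though
    # sampling error alone should scatter them noticeably.
    print(prob_means_this_similar([5.01, 5.00, 5.02, 4.99], sd=1.0, n_per_group=20))
```

A very small value means that even if all the conditions were truly identical, sampling error would almost never produce means this close together. That is a red flag worth a closer look, not proof of misconduct.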

And anticipating contemporary calls for more data openness, including the title of Simonsohn’s working paper “Just Post It,” Stewart writes:

“With present attitudes it’s difficult for an outsider to ask for a scientist’s raw data without appearing to question that person’s integrity. But that attitude absolutely has to change… Once you publish a paper, you’re in essence giving its ideas away. In return for benefits you gain from that – fame, recognition, or whatever – you should be willing to make your lab records and data available.”

Some of the details of how Stewart’s colleagues responded are also alarming. His boss at NIH mused publicly on why he was wasting his talents chasing fraud. Others were even less kind, calling him “the terrorist of the lab.” And when he got into a dispute with his suburban neighbors about not mowing his lawn, Science — yes, that Science — ran a gossip piece on the spat. (Some of the discussions of Simonsohn’s earlier data-detecting efforts have gotten a bit heated, but I haven’t seen anything get that far yet. Let’s hope there aren’t any other social psychologists on the board of his HOA.)

The Stewart interview brought home for me just how much these issues are perennial, and perhaps structural. But the difference from 23 years ago is that we have better tools for change. Journal editors’ gatekeeping powers are weakening in the face of open-access journals and post-publication review.

Will things change for the better? I don’t know. I feel like psychology has an opportunity right now. Maybe we’ll actually step back, have a difficult conversation about what really needs to be done, and make some changes. If not, I bet it won’t be 20 years before the next Stewart/Simonsohn comes along.