Reading “The Baby Factory” in context

[Photo: a cherry orchard. Photo credit: Des Blenkinsopp.]

Yesterday I put up a post about David Peterson’s The Baby Factory, an ethnography of 3 baby labs based on his experience as a participant observer. My post was mostly excerpts, with a short introduction at the beginning and a little discussion at the end. That was mainly to encourage people to go read it. (It’s open-access!)

Today I’d like to say a little more.

How you approach the article probably depends a lot on what background and context you come to it with. It would be a mistake to look to an ethnography for a generalizable estimate of something about a population, in this case about how common various problematic practices are. That’s not what ethnography is for. But at this point in history, we are not lacking for information about the ways we need to improve psychological science. There have been surveys and theoretical analyses and statistical analyses and single-lab replications and coordinated many-lab replications and all the rest. It’s getting harder and harder to claim that the evidence is cherry-picked without seriously considering the possibility that you’re in the middle of a cherry orchard. As Simine put it so well:

even if you look at your own practices and those of everyone you know, and you don’t see much p-hacking going on, the evidence is becoming overwhelming that p-hacking is happening a lot. my guess is that the reason people can’t reconcile that with the practices they see happening in their labs and their friends’ labs is that we’re not very good at recognizing p-hacking when it’s happening, much less after the fact. we can’t rely on our intuitions about p-hacking. we have to face the facts. and, in my view, the facts are starting to look pretty damning.

You don’t even have to go as far as Simine or me. You just have to come to the ethnography with a realistic belief that problematic practices are happening at a rate that is at least high enough to be worrisome. And then the ethnography does what ethnographies do, and does it well in my judgment: it illustrates what these things look like, out there in the world, when they are happening.

In particular, I think a valuable part of Peterson’s ethnography is that it shows how problematic practices don’t just have to happen furtively by one person with the door closed. Instead, they can work their way into the fabric of how members of a lab talk and interact. When Leslie John et al. introduced the term questionable research practices, they defined it as “exploitation of the gray area of acceptable practice.” The Baby Factory gives us a view into how that can be a social process. Gray zones are by definition ambiguous; should we be shocked to find out that people working closely together will come to a socially shared understanding of them?

Another thing Peterson’s ethnography does is talk about the larger context where all this is happening, and try to interpret his observations in that context. He writes about the pressures for creating a public narrative of science that looks sharp and clean, about the need to make the most of very limited resources and opportunities, and about the very real challenges of working with babies (the “difficult research objects” of the subtitle). A commenter yesterday thought he came to the project with an axe to grind. But his interpretive framing was very sympathetic to the challenges of doing infant cognition research. And his concluding paragraphs were quite optimistic, suggesting that the practices he observed may be part of a “local culture” that has figured out how they can promote positive scientific development. I wish he’d developed that argument more. I don’t think infant cognition research has lacked for important scientific discoveries — but I would say it is in spite of the compromises researchers have sometimes had to make, not because of them.

I do think it would be a mistake to come away thinking this is something limited to infant cognition research. Peterson grounds his discussion in the specific challenges of studying babies, who have a habit of getting distracted or falling asleep or putting your stimuli in their mouths. Those particular problems may be distinctive to having babies as subjects, and I can understand why that framing might make baby researchers feel especially uncomfortable. But anybody who is asking big questions about the human mind is working with a difficult research object, and we all face the same larger pressures and challenges. There are some great efforts under way to understand the particular challenges of research practice and replicability in infant research, but whatever we learn from that is going to be about how broader problems are manifesting in a specific area. I don’t really see how you can fairly conclude otherwise.

An eye-popping ethnography of three infant cognition labs

I don’t know how else to put it. David Peterson, a sociologist, recently published an ethnographic study of 3 infant cognition labs. Titled “The Baby Factory: Difficult Research Objects, Disciplinary Standards, and the Production of Statistical Significance,” it recounts his time spent as a participant observer in those labs, attending lab meetings and running subjects.

In his own words, Peterson “shows how psychologists produce statistically significant results under challenging circumstances by using strategies that enable them to bridge the distance between an uncontrollable research object and a professional culture that prizes methodological rigor.” The account of how the labs try to “bridge the distance” reveals one problematic practice after another, in a way that sometimes makes them seem like normal practice and no big deal to the people in the labs. Here are a few examples.

Protocol violations that break blinding and independence:

…As a routine part of the experiments, parents are asked to close their eyes to prevent any unconscious influence on their children. Although this was explicitly stated in the instructions given to parents, during the actual experiment, it was often overlooked; the parents’ eyes would remain open. Moreover, on several occasions, experimenters downplayed the importance of having one’s eyes closed. One psychologist told a mother, “During the trial, we ask you to close your eyes. That’s just for the journals so we can say you weren’t directing her attention. But you can peek if you want to. It’s not a big deal. But there’s not much to see.”

Optional stopping based on data peeking:

Rather than waiting for the results from a set number of infants, experimenters began “eyeballing” the data as soon as babies were run and often began looking for statistical significance after just 5 or 10 subjects. During lab meetings and one-on-one discussions, experiments that were “in progress” and still collecting data were evaluated on the basis of these early results. When the preliminary data looked good, the test continued. When they showed ambiguous but significant results, the test usually continued. But when, after just a few subjects, no significance was found, the original protocol was abandoned and new variations were developed.
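To see why this matters statistically, here is a quick simulation of my own (nothing here comes from Peterson’s paper; the sample sizes and peeking schedule are made up for illustration). Under a true null effect, checking for significance every few subjects and stopping as soon as p < .05 pushes the false-positive rate well above the nominal 5%.

```python
# Minimal sketch of optional stopping under a true null effect.
# All numbers (max N, peek interval, number of simulations) are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_max, peek_every, alpha = 5000, 40, 5, 0.05

false_positives = 0
for _ in range(n_sims):
    data = rng.normal(0, 1, n_max)           # true effect is exactly zero
    for n in range(peek_every, n_max + 1, peek_every):
        _, p = stats.ttest_1samp(data[:n], 0)
        if p < alpha:                         # "looks good" -> stop and report
            false_positives += 1
            break

print(f"Nominal alpha: {alpha}")
print(f"False-positive rate with peeking: {false_positives / n_sims:.3f}")
```

With these settings the rate comes out well above .05, and it only grows with the number of peeks.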

Invalid comparisons of significant to nonsignificant:

Because experiments on infant subjects are very costly in terms of both time and money, throwing away data is highly undesirable. Instead, when faced with a struggling experiment using a trusted experimental paradigm, experimenters would regularly run another study that had higher odds of success. This was accomplished by varying one aspect of the experiment, such as the age of the participants. For instance, when one experiment with 14-month-olds failed, the experimenter reran the same study with 18-month-olds, which then succeeded. Once a significant result was achieved, the failures were no longer valueless. They now represented a part of a larger story: “Eighteen-month-olds can achieve behavior X, but 14-month-olds cannot.” Thus, the failed experiment becomes a boundary for the phenomenon.
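The statistical problem with that kind of boundary story is the familiar point that the difference between “significant” and “not significant” is not itself necessarily significant. A minimal sketch with invented numbers (not data from any of the labs Peterson observed):

```python
# One age group "succeeds" (p < .05), the other "fails" (p > .05), yet the
# difference between the two effects is nowhere near significant.
# All effect sizes and standard errors below are invented for illustration.
import numpy as np
from scipy import stats

effect_18mo, se_18mo = 0.50, 0.24   # z ~ 2.08, p ~ .04  ("works")
effect_14mo, se_14mo = 0.25, 0.24   # z ~ 1.04, p ~ .30  ("fails")

z_diff = (effect_18mo - effect_14mo) / np.sqrt(se_18mo**2 + se_14mo**2)

for label, z in [("18-month-olds", effect_18mo / se_18mo),
                 ("14-month-olds", effect_14mo / se_14mo),
                 ("difference",    z_diff)]:
    p = 2 * stats.norm.sf(abs(z))
    print(f"{label}: z = {z:.2f}, p = {p:.3f}")
```

Both estimates are consistent with the same underlying effect; claiming a developmental boundary requires testing the difference directly, not comparing one p value to another.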

And HARKing:

When a clear and interesting story could be told about significant findings, the original motivation was often abandoned. I attended a meeting between a graduate student and her mentor at which they were trying to decipher some results the student had just received. Their meaning was not at all clear, and the graduate student complained that she was having trouble remembering the motivation for the study in the first place. Her mentor responded, “You don’t have to reconstruct your logic. You have the results now. If you can come up with an interpretation that works, that will motivate the hypothesis.”

A blunt explanation of this strategy was given to me by an advanced graduate student: “You want to know how it works? We have a bunch of half-baked ideas. We run a bunch of experiments. Whatever data we get, we pretend that’s what we were looking for.” Rather than stay with the original, motivating hypothesis, researchers in developmental science learn to adjust to statistical significance. They then “fill out” the rest of the paper around this necessary core of psychological research.

Peterson discusses all this in light of recent debates about replicability and scientific practices in psychology. He says that researchers have basically 3 choices: limit the scope of their questions to what they can do well with available methods, relax expectations of what a rigorous study looks like, or engage in QRPs. I think that is basically right. It is why I believe that any attempt to reduce QRPs has to be accompanied by changes to incentive structures, which govern the viability of the first two options.

Peterson also suggests that QRPs are “becoming increasingly unacceptable.” That may be true in public discourse, but the inside view presented by his ethnography suggests that unless more open practices become standard, labs will continue to have lots of opportunity to engage in them and little incentive not to.

UPDATE: I discuss what all this means in a followup post: Reading “The Baby Factory” in context.

Three ways to approach the replicability discussion

There are 3 ways to approach the replicability discussion/debate in science.

#1 is as a logic problem. There are correct answers, and the challenge is to work them out. The goal is to be right.

#2 is as a culture war. There are different sides with different motives, values, or ideologies. Some are better than others. So the goal is to win out over the other side.

#3 is as a social movement. Scientific progress is a shared value. Recently accumulated knowledge and technology have given us better ways to achieve it, but institutions and practices are slow to change. So the goal is to get everybody on board to make things better.

Probably all of us have elements of all three in us (and to be clear, all of us are trying to use reasoning to solve problems — the point of #1 is that being right is the end goal, so you stop there). But you see noticeable differences in which one predominates in people’s public behavior.

My friends can probably guess which approach I feel most aligned with.

Bold changes at Psychological Science

Style manuals sound like they ought to be boring things, full of arcane points about commas and whatnot. But Wikipedia’s style manual has an interesting admonition: Be bold. The idea is that if you see something that could be improved, you should dive in and start making it better. Don’t wait until you are ready to be comprehensive, don’t fret about getting every detail perfect. That’s the path to paralysis. Wikipedia is an ongoing work in progress, your changes won’t be the last word but you can make things better.

In a new editorial at Psychological Science, interim editor Stephen Lindsay is clearly following the be bold philosophy. He lays out a clear and progressive set of principles for evaluating research. Beware the “troubling trio” of low power, surprising results, and just-barely-significant results. Look for signs of p-hacking. Care about power and precision. Don’t mistake nonsignificant for null.

To people who have been paying attention to the science reform discussion of the last few years (and its longstanding precursors), none of this is new. What is new is that an editor of a prominent journal has clearly been reading and absorbing the last few years’ wave of careful and thoughtful scholarship on research methods and meta-science. And he is boldly acting on it.

I mean, yes, there are some things I am not 100% in love with in that editorial. Personally, I’d like to see more value placed on good exploratory research.* I’d like to see him discuss whether Psychological Science will be less results-oriented, since that is a major contributor to publication bias.** And I’m sure other people have their objections too.***

But… Improving science will forever be a work in progress. Lindsay has laid out a set of principles. In the short term, they will be interpreted and implemented by humans with intelligence and judgment. In the longer term, someone will eventually look at what is and is not working and will make more changes.

Are Lindsay’s changes as good as they could possibly be? The answers are (1) “duh” because obviously no and (2) “duh” because it’s the wrong question. Instead let’s ask, are these changes better than things have been? I’m not going to give that one a “duh,” but I’ll stand behind a considered “yes.”

———-

* Part of this is because in psychology we don’t have nearly as good a foundation of implicit knowledge and accumulated wisdom for differentiating good from bad exploratory research as we do for hypothesis-testing. So exploratory research gets a bad name because somebody hacks around in a tiny dataset and calls it “exploratory research,” and nobody has the language or concepts to say why they’re doing it wrong. I hope we can fix that. For starters, we could steal more ideas from the machine learning and genomics people, though we will need to adapt them to the particular features of our scientific problems. But that’s a blog post for another day.

** There are some nice comments about this already on the ISCON Facebook page. Dan Simons brought up the exploratory issue; Victoria Savalei the issue about results-focus. My reactions on these issues are in part bouncing off of theirs.

*** When I got to the part about using confidence intervals to support the null, I immediately had a vision of steam coming out of some of the Twitter Bayesians’ ears.

Moderator interpretations of the Reproducibility Project

The Reproducibility Project: Psychology (RPP) was published in Science last week. There has been some excellent coverage and discussion since then. If you haven’t heard about it,* Ed Yong’s Atlantic coverage will catch you up. And one of my favorite commentaries so far is on Michael Frank’s blog, with several very smart and sensible ways the field can proceed next.

Rather than offering a broad commentary, in this post I’d like to discuss one possible interpretation of the results of the RPP, which is “hidden moderators.” Hidden moderators are unmeasured differences between original and replication experiments that would result in differences in the true, underlying effects and therefore in the observed results of replications. Things like differences in subject populations and experimental settings. Moderator interpretations were the subject of a lengthy discussion on the ISCON Facebook page recently, and are the focus of an op-ed by Lisa Feldman Barrett.

In the post below, I evaluate the hidden-moderator interpretation. The tl;dr version is this: Context moderators are probably common in the world at large and across independently-conceived experiments. But an explicit design goal of direct replication is to eliminate them, and there’s good reason to believe they are rare in replications.

1. Context moderators are probably not common in direct replications

Many social and personality psychologists believe that lots of important effects vary by context out in the world at large. I am one of those people — subject and setting moderators are an important part of what I study in my own work. William McGuire discussed the idea quite eloquently, and it can be captured in an almost koan-like quote from Niels Bohr: “The opposite of one profound truth may very well be another profound truth.” It is very often the case that support for a broad hypothesis will vary and even be reversed over traditional “moderators” like subjects and settings, as well as across different methods and even different theoretical interpretations.

Consider as a thought experiment** what would happen if you told 30 psychologists to go test the deceptively simple-seeming hypothesis “Happiness causes smiling” and turned them loose. You would end up with 30 different experiments that would differ in all kinds of ways that we can be sure would matter: subjects (with different cultural norms of expressiveness), settings (e.g., subjects run alone vs. in groups), manipulations and measures of the IV (film clips, IAPS pictures, frontal asymmetry, PANAS?) and DV (FACS, EMG, subjective ratings?), and even construct definitions (state or trait happiness? eudaimonic or hedonic? Duchenne or social smiles?). You could learn a lot by studying all the differences between the experiments and their results.

But that stands in stark contrast to how direct replications are carried out, including the RPP. Replicators aren’t just turned loose with a broad hypothesis. In direct replication, the goal is to test the hypothesis “If I do the same experiment, I will get the same result.” Sometimes a moderator-ish hypothesis is built in (“this study was originally done with college students, will I get the same effect on Mturk?”). But such differences from the original are planned in. The explicit goal of replication design is for any other differences to be controlled out. Well-designed replication research makes a concerted effort to faithfully repeat the original experiments in every way that documentation, expertise, and common sense say should matter (and often in consultation with original authors too). The point is to squeeze out any room for substantive differences.

Does it work? In a word, yes. We now have data telling us that the squeezing can be very effective. In Many Labs 1 and Many Labs 3 (which I reviewed here), different labs followed standardized replication protocols for a series of experiments. In principle, different experimenters, different lab settings, and different subject populations could have led to differences between lab sites. But in analyses of heterogeneity across sites, that was not the result. In ML1, some of the very large and obvious effects (like anchoring) varied a bit in just how large they were (from “kinda big” to “holy shit”). Across both projects, more modest effects were quite consistent. Nowhere was there evidence that interesting effects wink in and out of detectability for substantive reasons linked to sample or setting.
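(For anyone who wants to see what “analyses of heterogeneity across sites” means concretely, here is a minimal sketch of the standard machinery, Cochran’s Q and I-squared, run on invented site-level numbers. It is not the Many Labs code or data, just the shape of the analysis.)

```python
# Heterogeneity of an effect across lab sites: Cochran's Q and I^2.
# The per-site effects and standard errors are invented for illustration.
import numpy as np
from scipy import stats

effects = np.array([0.32, 0.28, 0.41, 0.25, 0.35, 0.30])   # per-site effect sizes
ses     = np.array([0.10, 0.12, 0.11, 0.09, 0.13, 0.10])   # per-site standard errors

w = 1 / ses**2                            # inverse-variance weights
pooled = np.sum(w * effects) / np.sum(w)  # fixed-effect pooled estimate
Q = np.sum(w * (effects - pooled)**2)     # Cochran's Q
df = len(effects) - 1
I2 = max(0.0, (Q - df) / Q) * 100         # % of variability beyond sampling error

print(f"Pooled effect: {pooled:.3f}")
print(f"Q = {Q:.2f} on df = {df}, p = {stats.chi2.sf(Q, df):.2f}, I^2 = {I2:.0f}%")
```

When Q is no bigger than its degrees of freedom, I-squared is zero: the site-to-site differences are no larger than you would expect from sampling error alone.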

We will continue to learn more as our field gets more experience with direct replication. But right now, a reasonable conclusion from the good, systematic evidence we have available is this: When some researchers write down a good protocol and other researchers follow it, the results tend to be consistent. In the bigger picture this is a good result for social psychology: it is empirical evidence that good scientific control is within our reach, neither beyond our experimental skills nor intractable for the phenomena we study.***

But it also means that when replicators try to clamp down potential moderators, it is reasonable to think that they usually do a good job. Remember, the Many Labs labs weren’t just replicating the original experiments (from which their results sometimes differed – more on that in a moment). They were very successfully and consistently replicating each other. There could be individual exceptions here and there, but on the whole our field’s experience with direct replication so far tells us that it should be unusual for unanticipated moderators to escape replicators’ diligent efforts at standardization and control.

2. A comparison of a published original and a replication is not a good way to detect moderators

Moderation means there is a substantive difference between 2 or more (true, underlying) effects as a function of the moderator variable. When you design an experiment to test a moderation hypothesis, you have to set things up so you can make a valid comparison. Your observations should ideally be unbiased, or failing that, the biases should be the same at different levels of the moderator so that they cancel out in the comparison.

With the RPP (and most replication efforts), we are trying to interpret observed differences between published original results and replications. The moderator interpretation rests on the assumption that observed differences between experiments are caused by substantive differences between them (subjects, settings, etc.). An alternative explanation is that there are different biases. And that is almost certainly the case. The original experiments are generally noisier because of modest power, and that noise is then passed through a biased filter (publication bias for sure — these studies were all published at selective journals — and perhaps selective reporting in some cases too). By contrast, the replications are mostly higher powered, the analysis plans were pre-registered, and the replicators committed from the outset to publish their findings no matter what the results.

That means that a comparison of published original studies and replication studies in the RPP is a poor way to detect moderators, because you are comparing a noisy and biased observation to one that is much less so.**** And such a comparison would be a poor way to detect moderators even if you were quite confident that moderators were out there somewhere waiting to be found.
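If that argument feels abstract, here is a minimal simulation of my own (all parameters invented) in which every study, original and replication alike, has the identical true effect. The only differences are sample size and a publication filter on the originals, and that alone produces an original-versus-replication gap with no moderators anywhere.

```python
# Originals and replications share one true effect (no moderators), but
# originals are low-powered and only "published" when p < .05, while
# replications are high-powered and reported regardless of outcome.
# All parameters are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_d, n_orig, n_rep, n_attempts = 0.20, 20, 80, 2000

published_originals, replications = [], []
for _ in range(n_attempts):
    orig = rng.normal(true_d, 1, n_orig)
    t, p = stats.ttest_1samp(orig, 0)
    if p < .05 and t > 0:                    # publication filter on originals
        published_originals.append(orig.mean())
        rep = rng.normal(true_d, 1, n_rep)   # same true effect, bigger N, no filter
        replications.append(rep.mean())

print(f"True effect:             {true_d:.2f}")
print(f"Mean published original: {np.mean(published_originals):.2f}")
print(f"Mean replication:        {np.mean(replications):.2f}")
```

The published originals overshoot the true effect because they were selected for significance at low power; the replications do not. The gap between them tells you about the selection process, not about moderators.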

3. Moderator explanations of the Reproducibility Project are (now) post hoc

The Reproducibility Project has been conducted with an unprecedented degree of openness. It was started 4 years ago. Both the coordinating plan and the protocols of individual studies were pre-registered. The list of selected studies was open. Original authors were contacted and invited to consult.

What that means is that anyone could have looked at an original study and a replication protocol, applied their expert judgment, and made a genuinely a priori prediction of how the replication results would differ from the original. Such a prediction could have been put out in the open at any time, or it could have been pre-registered and embargoed so as not to influence the replication researchers.

Until last Friday, that is.

Now the results of the RPP are widely known. And although it is tempting to now look back selectively at “failed” replications and generate substantively interesting reasons, such explanations have to be understood for what they are: untested post hoc speculation. (And if someone now says they expected a failure all along, they’re possibly HARKing too.)

Now, don’t get me wrong — untested post hoc speculation is often what inspires new experiments. So if someone thinks they see an important difference between an original result and a replication and gets an idea for a new study to test it out, more power to them. Get thee to the lab.

But as an interpretation of the data we have in front of us now, we should be clear-eyed in appraising such explanations, especially as an across-the-board factor for the RPP. From a bargain-basement Bayesian perspective, context moderators in well-controlled replications have a low prior probability (#1 above), and comparisons of original and replication studies have limited evidential value because of unequal noise and bias (#2). Put those things together and the clear message is that we should be cautious about concluding that there are hidden moderators lurking everywhere in the RPP. Here and there, there might be compelling, idiosyncratic reasons to think there could be substantive differences to motivate future research. But on the whole, as an explanation for the overall pattern of findings, hidden moderators are not a strong contender.
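To put bargain-basement numbers on it (these are pure illustrations, not estimates of anything): even if you grant that an observed original-versus-replication gap is some evidence for moderation, a low prior does not move very far.

```python
# Illustrative-only numbers for the "bargain-basement Bayesian" point above.
prior_odds = 0.10        # assumption: moderators are uncommon in well-controlled replications (#1)
likelihood_ratio = 2.0   # assumption: a noisy, biased original-vs-replication gap is weak evidence (#2)

posterior_odds = prior_odds * likelihood_ratio
posterior_prob = posterior_odds / (1 + posterior_odds)
print(f"Posterior probability of a hidden moderator: {posterior_prob:.2f}")   # about .17 here
```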

Instead, we need to face up to the very well-understood and very real differences that we know about between published original studies and replications. The noxious synergy between low power and selective publication is certainly a big part of the story. Fortunately, psychology has already started to make changes since 2008 when the RPP original studies were published. And positive changes keep happening.

Would it be nice to think that everything was fine all along? Of course. And moderator explanations are appealing because they suggest that everything is fine, we’re just discovering limits and boundary conditions like we’ve always been doing.***** But it would be counterproductive if that undermined our will to continue to make needed improvements to our methods and practices. Personally, I don’t think everything in our past is junk, even post-RPP – just that we can do better. Constant self-critique and improvement are an inherent part of science. We have diagnosed the problem and we have a good handle on the solution. All of that makes me feel pretty good.

———

* Seriously though?

** A thought meta-experiment? A gedankengedankenexperiment?

*** I think if you asked most social psychologists, divorced from the present conversation about replicability and hidden moderators, they would already have endorsed this view. But it is nice to have empirical meta-scientific evidence to support it. And to show the “psychology isn’t a science” ninnies.

**** This would be true even if you believed that the replicators were negatively biased and consciously or unconsciously sandbagged their efforts. You’d think the bias was in the other direction, but it would still be unequal and therefore make comparisons of original vs. replication a poor empirical test of moderation. (You’d also be a pretty cynical person, but anyway.)

***** For what it’s worth, I’m not so sure that the hidden moderator interpretation would actually be all that reassuring under the cold light of a rational analysis. This is not the usual assumption that moderators are ubiquitous out in the world. We are talking about moderators that pop up despite concerted efforts to prevent them. Suppose that ubiquitous occult moderators were the sole or primary explanation for the RPP results — so many effects changing when we change so little, going from one WEIRD sample to another, with maybe 5-ish years of potential for secular drift, and using a standardized protocol. That would suggest that we have a poor understanding of what is going on in our labs. It would also suggest that it is extraordinarily hard to study main effects or try to draw even tentative generalizable conclusions about them. And given how hard it is to detect interactions, that would mean that our power problem would be even worse than people think it is now.

What should SPSP do about APA and the Hoffman report?

I am a member-at-large in the Society for Personality and Social Psychology. We will be having our semiannual board meeting in a couple of weeks. On the agenda is a discussion of the Hoffman Report, which details collusion between American Psychological Association officials and the U.S. military to enable and support abusive interrogations.

I have had several discussions with people about what, if anything, SPSP should be doing about its relationship with APA. But I’d really like to open up the discussion and get feedback from more people, especially SPSP members. In this post I’d like to lay out some background about SPSP’s relationship with APA, and bring up some possibilities about what to do next.

What is SPSP’s current legal and financial relationship with APA?

It’s easy to get confused about this. Heck, I’m on the SPSP board and I still find it a bit confusing. (If I get any of this wrong I hope somebody corrects me.) Here goes.

SPSP is legally its own organization. We have our own articles of incorporation and bylaws. Membership in SPSP and membership in APA are separate processes, and you can be a member of (and pay dues to) either one without joining the other. We operate our own annual conference and our own non-APA journals.

So they’re totally independent, right? Well, not totally.

SPSP also operates APA Division 8. That means that if you are an APA Division 8 member, the SPSP Board is in charge of running your division. And conversely, 1 of the 11 voting members of the SPSP Board of Directors is the APA Division 8 council representative, who is elected by APA Division 8 members under APA’s bylaws. So our governance is at least somewhat intertwined.

And the APA website certainly blurs the distinction – if you navigate to their Division 8 page, it says that Division 8 is SPSP, and it links to SPSP’s website. (Contrast that with the SPSP website’s page about Division 8, which takes pains to emphasize the difference.)

Financially, some money changes hands. On the income side, we get a bit of a cut from [corrected] APA Division 8 membership dues and from the fact that some APA Division 8 conference programming provides continuing education credits. On the expenses side, SPSP spends money on the APA conference. We pay to send the SPSP president, SPSP-selected APA program chair, and the Division 8 council representative to the APA conference, and we pay for some APA conference programming (a social hour and travel for an invited speaker). In our 2014 budget under “Division 8,” the numbers were roughly $3k in income from APA, and $13k in expenses, for a net of $10k being spent. On top of that we reimburse some travel expenses for the Division 8 representative and program chair to attend SPSP board meetings (these travel expenses are in a different part of the budget and not broken out, but it’s probably a few thousand bucks a year). In the context of SPSP’s overall budget those are modest numbers — SPSP’s total revenues for 2014 were over $2 million. If you are an SPSP member, it would technically not be correct to say that exactly $0 of your membership dues goes toward APA-related stuff. But the percentage is small.

So like I said, it’s easy to get confused. Legally and financially, SPSP is largely separate but not 100% independent. And by operating Division 8 for APA, we certainly have a real, ongoing relationship.

What are the reasons for maintaining our relationship or for ending it?

That’s really what I’m hoping to hear feedback from people about. I’ll summarize some of the arguments I’ve heard so far, but I hope people will elaborate on them.

On the “keep it” side, people point out that APA is the only organization that represents all of psychology in the United States. APA thus connects us to other psychological disciplines, including the practitioner disciplines. And they are going to continue to have influence on policy and lobbying (including things SPSP members are likely to care about, like social science funding). Another thing people bring up is that many of the top-tier journals that SPSP members publish in (JPSP etc.) are APA journals, and we should maintain our involvement and influence there. And lastly, as a result of the Hoffman Report APA may make some fundamental reforms – so this may be an important time to stay involved to shape those reforms and make sure they are implemented.

On the “end it” side, people argue that APA is drifting further and further away from relevance to psychological scientists (aside from its journal publishing business), and that we have relatively little influence with them anyway (a single Division 8 member on the APA council). APA gets its credibility and influence in part from being associated with scientific organizations like us, and it has leveraged that credibility to do great damage. And this wasn’t just a few individuals abusing their offices – it was organized, institutional corruption. So maybe SPSP (as an organization, and its members as individuals) is getting too little out of the relationship and giving up too much – materially, reputationally, or morally.

These are just brief statements of the arguments, in part because they are not all my own. And like I said, this is what I want to hear people discuss.

Realistically what might SPSP do in the immediate future?

As I see it there are 3 avenues:

1. Decide we should stay involved and keep our relationship.

2. Decide we’re past the point of no return with APA and end our relationship. As I understand it, the SPSP Board is not empowered to end our relationship at our meeting. Doing so would require amending our bylaws and articles of incorporation, and that requires a referendum of the SPSP membership. But such a referendum could be initiated by either the Board or by a petition of 100 SPSP members. [UPDATE AND CLARIFICATION: The SPSP Board has budgetary authority to stop spending money on Division 8-related expenses, without a full referendum. Removing the Division 8 representative position from the SPSP board would require amending the bylaws via referendum, but not the articles of incorporation. Thanks to Chad Rummel for the corrections.]

3. Wait and see. Many of the individuals implicated in the Hoffman Report have resigned, but the fallout continues to fall out. Many questions remain to be addressed: How (if at all) will APA change its governance system (which, according to the Hoffman Report, was able to be manipulated and circumvented pretty easily toward horrific ends)? Change its ethics policy? Change its stance toward military and operational psychology? Will it reopen and investigate old ethics charges? Will it make restitution for the damage it has done? These questions are likely to be taken up at the APA conference and especially the council meeting in Toronto next week. And the SPSP board meeting will be immediately afterward. So if you think our continued relationship with APA should be contingent on any of those things (and if you view APA’s rehabilitation as a still-open question), it may be too soon to make a decision just yet.

In addition to the 3 items above, there may be other things I am not considering. If so, I hope people will bring those up.

Like I said, I would like to hear from SPSP members – and any others who have a stake in SPSP’s relationship with APA. You can comment below. I’ll post a link to this blog post on social media (Facebook, Twitter) in places where SPSP members are likely to see it and have a chance to comment. If you would like to reach me privately, my email is my first name AT uoregon DOT edu.

Replicability in personality psychology, and the symbiosis between cumulative science and reproducible science

There is apparently an idea going around that personality psychologists are sitting on the sidelines having a moment of schadenfreude during the whole social psychology Replicability Crisis thing.

Not true.

The Association for Research in Personality conference just wrapped up in St. Louis. It was a great conference, with lots of terrific research. (Highlight: watching three of my students give kickass presentations.) And the ongoing scientific discussion about openness and reproducibility had a definite, noticeable effect on the program.

The most obvious influence was the (packed) opening session on reproducibility. First, Rich Lucas talked about the effects of JRP’s recent policy of requiring authors to explicitly talk about power and sample size decisions. The policy has had a noticeable impact on sample sizes of published papers, without major side effects like tilting toward college samples or cheap self-report measures.

Second, Simine Vazire talked about the particular challenges of addressing openness and replicability in personality psychology. A lot of the discussion in psychology has been driven by experimental psychologists, and Simine talked about what the general issues that cut across all of science look like when applied in particular to personality psychology. One cool recommendation she had (not just for personality psychologists) was to imagine that you had to include a “Most Damning Result” section in your paper, where you had to report the one result that looked worst for your hypothesis. How would that change your thinking?*

Third, David Condon talked about particular issues for early-career researchers, though really it was for anyone who wants to keep learning – he had a charming story of how he was inspired by seeing one of his big-name intellectual heroes give a major award address at a conference, then show up the next morning for an “Introduction to R” workshop. He talked a lot about tools and technology that we can use to help us do more open, reproducible science.

And finally, Dan Mroczek talked about research he has been doing with a large consortium to try to do reproducible research with existing longitudinal datasets. They have been using an integrated data analysis framework as a way of combining longitudinal datasets to test novel questions, and to look at issues like generalizability and reproducibility across existing data. Dan’s talk was a particularly good example of why we need broad participation in the replicability conversation. We all care about the same broad issues, but the particular solutions that experimental social psychologists identify aren’t going to work for everybody.

In addition to its obvious presence in the plenary session, reproducibility and openness seemed to suffuse the conference. As Rick Robins pointed out to me, there seemed to be a lot more people presenting null findings in a more open, frank way. And talk of which findings had replicated and which hadn’t, of people tempering conclusions from initial data, and so on was common and totally well received, like it was a normal part of science. Imagine that.

One thing that stuck out to me in particular was the relationship between reproducible science and cumulative science. Usually I think of the first helping the second; you need robust, reproducible findings as a foundation before you can either dig deeper into process or expand out in various ways. But in many ways, the conference reminded me that the reverse is true as well: cumulative science helps reproducibility.

When people are working on the same or related problems, using the same or related constructs and measures, and so on, it becomes much easier to do robust, reproducible science. In many ways, structural models like the Big Five have helped personality psychology with that. For example, the integrated data analysis that Dan talked about requires you to have measures of the same constructs in every dataset. The Big Five provide a common coordinate system to map different trait measures onto, even if they weren’t originally conceptualized that way. Psychology needs more models like that in other domains – common coordinate systems of constructs and measures that help make sense of how different research programs fit together.

And Simine talked about (and has blogged about) the idea that we should collect fewer but better datasets, with more power and better but more labor-intensive methods. If we are open with our data, we can do something really well, and then combine or look across datasets better to take advantage of what other people do really well – but only if we are all working on the same things so that there is enough useful commonality across all those open datasets.

That means we need to move away from a career model of science where every researcher is supposed to have an effect, construct, or theory that is their own little domain that they’re king or queen of. Personality psychology used to be that way, but the Big Five has been a major counter to that, at least in the domain of traits. That kind of convergence isn’t problem-free — the model needs to evolve (Big Six, anyone?), which means that people need the freedom to work outside of it; and it can’t try to subsume things that are outside of its zone of relevance. Some people certainly won’t love it – there’s a certain satisfaction to being the World’s Leading Expert on X, even if X is some construct or process that only you and maybe your former students are studying. But that’s where other fields have gone, even going as far as expanding beyond the single-investigator lab model: Big Science is the norm in many parts of physics, genomics, and other fields. With the kinds of problems we are trying to solve in psychology – not just our reproducibility problems, but our substantive scientific ones — that may increasingly be a model for us as well.

 

———-

* Actually, I don’t think she was only imagining. Simine is the incoming editor at SPPS.** Give it a try, I bet she’ll desk-accept the first paper that does it, just on principle.

** And the main reason I now have footnotes in most of my blog posts.