How many (and whose) lives would you bet on your theory?

The following is a guest post by Neil Lewis, Jr.

Disclaimer: Before I begin, I want to acknowledge a major constraint on the generalizability of what I am about to say. In this post I will use the term “psychologist” periodically for the sake of brevity, instead of explicitly naming every sub-field when discussing points I think apply to multiple sub-fields of the discipline. When I say “psychologist,” however, it should be clear that I am not referring to the clinicians and other therapists that most people in the general public think of when hearing that term. The work of clinicians is essential at this moment; it is urgently needed to get us through this pandemic. Clinicians, and other essential workers, I thank you for your service. This post is about, and for, the rest of us.

I have found myself in a strange situation lately. For the past month and a half or so, I have been arguing somewhat vehemently against something that I typically encourage my fellow scientists to do: to disseminate our research broadly and expediently. I was trained as a social psychologist by intervention scientists who instilled in me the Lewinian values of conducting action research that simultaneously advances theory and produces practical knowledge. Lewin’s quotes, “there is nothing so practical as a good theory” and “research that produces nothing but books will not suffice,” guide the work that we do in my lab.

Because I’m such a Lewin stan, and by extension, a strong advocate for doing and disseminating socially-relevant research, some colleagues in the field have been surprised that I have been discouraging the rapid dissemination of social scientific – and in particular, (social) psychological – research on the COVID-19 pandemic. “Aren’t you the one who’s always getting on a soap box about the need for more applied research in the field?” Yes, I am. I want psychologists to do socially meaningful research; I want all social scientists to do that – to get rid of the stupid line we draw in the sand between basic and applied research, and advance knowledge and improve the world at the same time.

But doing that…doing that well doesn’t happen overnight. It takes a lot of time, patience, and humility to do the kind of action research that is necessary to understand a social problem deeply enough to develop an effective intervention to address it. This is one of the biggest lessons I’ve learned over the years as I’ve worked on education and health interventions in the United States.

Good Intentions With Unforeseen Consequences: An Early Lesson From My Career

When I first started doing intervention research, I had grand aspirations for building on previous findings in the field and using them to change the world—my primary focus at that time was reducing education inequities. I’ll never forget my first intervention study as a graduate student. I spent months reading the literature, pulling together relevant stimuli and measures, outlining the research design and thinking through which statistical model we would use to evaluate its effectiveness, and preparing the documents and presentation to make the pitch to the stakeholders. I was pumped! We had, I thought then, the perfect study to test the idea, and we were going to help those kids succeed in school. Then we went to the first “all-hands” meeting with the research team and school district partners to outline the plan of action. The superintendent smiled and nodded while we presented. Then when we were done she told us all of the reasons why it wasn’t going to happen—at least not in the way we envisioned. That response was a bit surprising. The lab had made other successful pitches like this, and had run other successful intervention studies in similar districts; it wasn’t apparent why “basic” ideas like the need to fully randomize were suddenly controversial. But we listened and learned some important lessons.

There were some historical pieces of information that we didn’t fully understand about that particular district until we were all sitting down in conversation that day. Some tense politics, at multiple levels, influenced what we could and couldn’t do. Long story short, we–psychologists, educators, social workers, and government officials–worked together to come up with an intervention that made sense for that context. But it’s worth noting that the study we ended up with looked nothing like the study we originally planned. And although it was frustrating at that time—because I was confident in our theory and the methods we had planned—looking back, it was the right thing to do. At that time our theory wasn’t specified enough to account for those politics.

That intervention story was actually the story of a relatively easy case of intervening– a case in which we had well-validated measures and stimuli that were effective in similar schools in a different part of that same state. In other words, the problem was well-defined, we had a solid body of research in similar situations to work with…yet we were still foiled by local context: structural, cultural, and political features that mattered greatly for the psychological experiences we were trying to produce in the intervention. If we had run the study with the randomization plan we initially envisioned (where some students would have ended up in a wait list control), for instance, it probably would have created more resentment between parents and district officials in a community in which citizens (for some good historical and contemporary reasons) did not have the greatest trust in their government.

Transferring Lessons to the COVID-19 Pandemic

What does this have to do with COVID-19? The story I just shared is the story of a case in which I thought I had strong priors for how to apply a psychological theory to a new context that, again, I thought I knew a lot about. Even with that much (perceived) knowledge going into the situation, the theory was not directly applicable without modifications—modifications that I could not have anticipated without the on-the-ground knowledge of a variety of experts in that community. COVID-19 is nothing like that situation. The pandemic takes the most complex intervention scenarios I’ve ever worked on and makes them look like child’s play.

For psychologists to intervene or otherwise apply our theories to the pandemic requires understanding the psychological experiences of a much more diverse group of people than the WEIRD students that we typically study (Henrich, Heine, & Norenzayan, 2010). It requires deep knowledge about more situations than the laboratories in which we study those students (Baumeister, Vohs, & Funder, 2007). It requires knowing whether the measures we’re using are valid indicators of our constructs of interest (Maul, 2017) and having a sense of confidence about whether the processes we’re measuring actually translate to behaviors or other policy-relevant outcomes (Webb & Sheeran, 2006)—something that at least social and personality psychologists have limited information about (Flake, Pek, & Hehman, 2017). It requires knowing the constraints on generalizability of our findings (Simons, Shoda, & Lindsay, 2017), and thus when and whether it is appropriate to try to apply them to different situations, and what the most effective strategies would be to implement them successfully (Earl & Lewis, 2019; Goroff, Lewis, Scheel, Scherer, & Tucker, 2018). Without those things, my guess is that, at best, the application of psychological theories to the COVID-19 pandemic will be like most other interventions in the field: they will not have high enough fidelity to be practically meaningful (DeAngelis, 2010). To use the language of Yeager and colleagues (2016): “although promising, self-administered psychological interventions have not often been tested in ways that are sufficiently relevant for policy and practice” (p. 375).

Caution Is Warranted: Lives Are at Stake

These are some of the concerns that have been on my mind, and they have led me to trade in my “go forth and disseminate” bullhorn for a “slow down and be cautious” traffic sign. In normal times, I’m quite forgiving of the field’s flaws and I persist in the optimism that we will progressively get better. But right now, I fear that our optimism and good intentions are not enough; they may even be dangerous. People are dying from this, my relatives among them. In times like these, when lives are hanging in the balance, I think we have to hold ourselves to a higher standard than our status quo.

Earlier this week, some colleagues who share these concerns and I released a new preprint making that very argument (IJzerman et al., 2020). The response on social media has been…eye-opening. I haven’t engaged; I’ve just been watching it unfold. We anticipated some pushback. In fact, we finalized the writing the week before, but at my request my co-authors agreed to wait until I had the time, energy, and patience to fully listen to the field’s response; there was too much going on the week before (i.e., illness and death), so I did not trust myself to have the patience and self-control required to quietly absorb (defensive) responses. This week I was able to listen, and I learned a lot. Some of the responses we received make sense to me (even if I disagree with them), while for some others I still need to figure out where I stand. But there’s a third category that I find deeply disturbing. The third category makes me think there are some in our field who (a) do not fully appreciate the gravity of the situation that we are in, (b) do not understand or appreciate that there is a vast space between abstract theory and pilot interventions, and an even larger space between pilot interventions and dissemination at scale, and (c) do not realize that there are serious and tremendous opportunity costs associated with the recommendations they so confidently make.

There is a reason there is an entire field called implementation science devoted to studying how to put research into practice (and cataloguing the fallout of well-meaning interventions that went awry). Applying findings is no small feat, and though you may think your recommendations are cheap and easy to implement, and would surely provide benefits without incurring any costs, let me adapt one of economists’ favorite sayings to the current context and make something abundantly clear: there’s no such thing as a free intervention.

Opportunity Costs You May Not Have Considered

In psychology, and I suspect at least some of the other social sciences, we typically do research, write up the results and include a paragraph in the discussion section about how it can be used to change the world, and publish it in a paywalled journal that very few people read. If someone does actually read it and wants to implement it, that’s their problem to figure out; we’ve moved on to the next grant and paper. One of the things implementation experts have to figure out is the opportunity cost associated with using your intervention vs. something else vs. nothing. Here is a concrete example that actually changed the way I think about my own research and its utility for application. A few years ago a colleague (an economist who works on education policy) got very excited about psychological interventions in education—he read some of the best-selling books from psychologists who have become famous for small interventions that ostensibly produce large effects. We met to talk about them because he was considering incorporating some in his future work. In that discussion he asked me a question that stumped me, and that all these years later I still don’t have a good answer to. It’s an obvious question from an economist’s perspective, but the kind of question that rarely gets asked in psychology: If you were a policy-maker with $100 million to reduce education disparities [our topic of mutual interest], how much would you allocate to psychological interventions (pick whichever one you think is most effective), and how much would you allocate to something else like free lunch programs?

He asked that question because in the intervention world, there is a practical reality that we must contend with. Resources. Are. Constrained. Using your theory to intervene almost always means not trying something else, or trying something else with fewer of the resources that it otherwise might have had to be effective. In the COVID-19 pandemic, we have both a health crisis and an economic crisis. There is a lot less money to go around, which means there is more competition for every dollar. We need to spend money on testing. We need to spend money on protective gear for healthcare professionals and other essential workers who are on the front lines every day. We need to spend money on more ventilators in hospitals. We need to spend money on antibody tests. We need to spend money to develop vaccines. We need to spend money to feed people who are at home struggling to make ends meet because they lost their jobs, but still have children at home to feed.

Do we really need to spend money implementing your theory? Is it really that urgent? If so, which of those buckets should we take the money from? I am willing to bet a lot of money that spending in the categories listed above (food, protective gear, vaccines) will save lives. Taking resources from them to reallocate to implementing our theories might save some lives, but probably at the cost of other lives that could be saved if the money spent on our theories was spent elsewhere. How is implementing your theory going to reduce morbidity and mortality rates? And what is the relative impact of that vs. other strategies with more pandemic-tested evidence?

If you are confident that your theory should be used in the pandemic response in the way that we’re using epidemiology and public health theories, please do me a favor. Look at your spouse. Or your child. Or your parent. Or your grandparent. Or whoever it is that you love most in this world. If they are not physically with you, take out a photo and look at them. And ask yourself:

Would you bet their life on your theory?

That is what you’re asking the public to do.

Personally, I am not willing to bet any of my loved ones’, or anyone else’s, lives on our theories during this pandemic, and I say that even as someone who does health communication intervention research in health clinics. As much as I would love to think the papers I’ve written contain the insights necessary to address this crisis, even I have to admit that my work is not ready for application in this pandemic context, and my work falls on probably the most applied end of the basic-to-applied continuum in the field. I just don’t see it as a worthwhile bet, particularly when we have much more relevant research from fields like epidemiology and public health, whose researchers have spent decades preparing for moments like this. Physical distancing and contact tracing aren’t abstract, vaguely specified ideas. They’re robust strategies that have saved lives in previous pandemics. We may come up with our own crisis-ready interventions before the next existential threat if we start working together (with other disciplines) now, but at the moment, in my humble opinion, we’re just not there.

So What Can We Do With Our Desire to Help?

If our theories aren’t ready for pandemic application, does that mean there’s nothing we can do to help? Not at all. Our value in academia is often constructed based on the fame of our theories, and metrics like how often they are cited. However, we have many other skills that are valuable outside of our ivory towers, and that can be applied to the pandemic response right now. Here are just a few ways that we might help.

One of the things that I have been spending a lot of time on recently, and that I’ve come to realize is more helpful than anticipated, is simply helping (local) organizations interpret data. I teach quantitative research methods to PhD students, and persuasion and social influence and intervention science courses to undergraduates—the types of classes that walk students through the process of how data is constructed and curated, and what that means for the inferences we can make from it. The same lessons I teach to my students turn out to transfer well to helping organizations figure out how to make sense of the firehose of new data that is coming in every day. There are some journalists, for instance, who now check in with me to make sure they have a good understanding of new data before they report on it in newscasts. Even if it’s just background research, having that extra set of eyes that knows to look for things like the pattern of missingness, and to ask whether there are systematic biases in how the data was collected, helps to minimize the likelihood that my local and regional neighbors are misinformed.
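For readers who want a concrete picture of what checking “the pattern of missingness” can look like, here is a minimal sketch in Python. The file name, column names, and grouping variable are all hypothetical; the point is only that tabulating missingness overall and by subgroup is a quick first screen for systematic bias in how the data were collected.

```python
import pandas as pd

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("county_testing_data.csv")

# Share of missing values in each column.
print(df.isna().mean().sort_values(ascending=False))

# Does missingness in a key variable differ across subgroups?
# If test results are missing far more often in some zip codes,
# any summary of "positivity rates" may be systematically biased.
missing_by_group = (
    df.assign(result_missing=df["test_result"].isna())
      .groupby("zip_code")["result_missing"]
      .mean()
      .sort_values(ascending=False)
)
print(missing_by_group.head(10))
```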

In addition to interpreting data, some local organizations now find themselves in situations where they have to figure out how to collect and manage new types of data that they have never worked with before so that they can make decisions for the future. The data collection and management lessons we teach to new graduate students for building a research workflow turn out to be valuable as well, especially for smaller organizations that do not (yet) have in-house data scientists. Walking them through the logic of why the decisions they want or need to make determine the kinds of data they need to collect (i.e., the models they need to estimate) can be quite helpful.

These are just a few things that we can do to use our skills to help during this time. We don’t necessarily need to spend limited resources trying out a new message based on our theory; we can help our local journalists, soup kitchens, school districts, health departments, etc., with more pressing needs. The key thing is to avoid the trap of a top-down mentality: I am the expert here to tell you why my theory and methods are suitable for helping address this pandemic. A more effective approach might be to do the exact opposite: to go to those working tirelessly on the front-line and ask a simple question: how can I help?

Acknowledgments

I would like to thank Sonya Dal Cin, Lisa DeBruine, Berna Devezer, Patrick Forscher, Hans IJzerman, Andrew Przybylski, John Sakaluk, Anne Scheel, and Sanjay Srivastava for their very thoughtful feedback on an earlier draft of this post.

References

Baumeister, R. F., Vohs, K. D., & Funder, D. C. (2007). Psychology as the science of self-reports and finger movements: Whatever happened to actual behavior? Perspectives on Psychological Science, 2(4), 396-403.

DeAngelis, T. (2010, November). Getting research into the real world. Monitor on Psychology, 41(10). Retrieved from: https://www.apa.org/monitor/2010/11/dissemination

Earl, A., & Lewis, N. A., Jr. (2019). Health in context: New perspectives on healthy thinking and healthy living. Journal of Experimental Social Psychology, 81(3), 1-4.

Flake, J. K., Pek, J., & Hehman, E. (2017). Construct validation in social and personality research: Current practice and recommendations. Social Psychological and Personality Science, 8(4), 370-378.

Goroff, D. L., Lewis, N. A., Jr., Scheel, A. M., Scherer, L. D., & Tucker, J. (2018, November 1). The inference engine: A grand challenge to address the context sensitivity problem in social science research. https://doi.org/10.31234/osf.io/j8b9a

Henrich, J., Heine, S. J., & Norenzayan, A. (2010). The weirdest people in the world? Behavioral and Brain Sciences, 33(2-3), 61-83.

IJzerman, H., Lewis, N. A., Jr., Weinstein, N., DeBruine, L. M., Ritchie, S. J., Vazire, S., … & Przybylski, A. K. (2020, April 27). Psychological science is not yet a crisis-ready discipline. https://doi.org/10.31234/osf.io/whds4

Maul, A. (2017). Rethinking traditional methods of survey validation. Measurement: Interdisciplinary Research and Perspectives, 15(2), 51-69.

Simons, D. J., Shoda, Y., & Lindsay, D. S. (2017). Constraints on generality (COG): A proposed addition to all empirical papers. Perspectives on Psychological Science, 12(6), 1123-1128.

Webb, T. L., & Sheeran, P. (2006). Does changing behavioral intentions engender behavior change? A meta-analysis of experimental evidence. Psychological Bulletin, 132(2), 249-268.

Yeager, D. S., Romero, C., Paunesku, D., Hulleman, C. S., Schneider, B., Hinojosa, C., … & Dweck, C. S. (2016). Using design thinking to improve psychological interventions: The case of the growth mindset during the transition to high school. Journal of Educational Psychology, 108(3), 374-391.

Will this time be different?

I had the honor to deliver the closing address to the Society for the Improvement of Psychological Science on July 9, 2019 in Rotterdam. The following are my prepared remarks. (These remarks are also archived on PsyArXiv.)

Some years ago, not long after people in psychology began talking in earnest about a replication crisis and what to do about it, I was talking with a colleague who has been around the field for longer than I have. He said to me, “Oh, this is all just cyclical. Psychology goes through a bout of self-flagellation every decade or two. It’ll be over soon and nothing will be different.”

I can’t say I blame him. Psychology has had other periods of reform that have fizzled out. One of the more recent ones was the statistical reform effort of the late 20th century – and you should read Fiona Fidler’s history of it, because it is completely fascinating. Luminaries like Jacob Cohen, Paul Meehl, Robert Rosenthal, and others were members and advisors of a blue-ribbon APA task force to change the practice of statistics in psychology. This resulted in the APA style manual adding a requirement to report effect sizes – one which is occasionally even followed, though the accompanying call to interpret effect sizes has gotten much less traction – and a few other modest improvements. But it was nothing like the sea change that many of them believed was needed.

But flash forward to the current decade. When people ask themselves, “Will this time be different?” it is fair to say there is a widespread feeling that indeed it could be. There is no single reason. Instead, as Bobbie Spellman and others have written, it is a confluence of contributing factors.

One of them is technology. The flow of scientific information is no longer limited by what we can print on sheets of pulped-up tree guts bound together into heavy volumes and sent by truck, ship, and airplane around the world. Networked computing and storage means that we can share data, materials, code, preprints, and more, at a quantity and speed that was barely imagined even a couple of decades ago when I was starting graduate school. Technology has given scientists far better ways to understand and verify the work we are building on, collaborate, and share what we have discovered.

A second difference is that more people now view the problem not just as an analytic one – the domain of logicians and statisticians – but also, complementarily, as a human one. So, for example, the statistical understanding of p-values as a function of a model and data has been married to a social-scientific understanding: p-values are also a function of the incentives and institutions that the people calculating them are working under. We see meta-scientists collecting data and developing new theories of how scientific knowledge is produced. More people see the goal of studying scientific practice not just as diagnosis – identify a problem, write a paper about it, and go to your grave knowing you were right about something – but also of designing effective interventions and embracing the challenges of scaling up to implementation.

A third difference, and perhaps the most profound, is where the ideas and energy are coming from. A popular debate on Twitter is what to call this moment in our field’s history. Is it a crisis? A renaissance? A revolution? One term that gets used a lot is “open science movement.” Once, when this came up on Twitter, I asked a friend who’s a sociologist what he thought. He stared at me for a good three seconds, like I’d just grown a second head, and said: “OF COURSE it’s a social movement.” (It turns out that people debating “are we a social movement?” is classic social movement behavior.) I think that idea has an underappreciated depth to it. Because maybe the biggest difference is that what we are seeing now is truly a grassroots social movement.

What does it mean to take seriously the idea that open science is a social movement? Unlike blue-ribbon task forces, social movements do not usually have a single agenda or a formal charge. They certainly aren’t made up of elites handpicked by august institutions. Instead, movements are coalitions – of individuals, groups, communities, and organizations that have aligned, but often not identical, values, priorities, and ideas.

We see that pluralism in the open science movement. To take just one example, many in psychology see a close connection between openness and rigor. We trace problems with replicability and cumulative scientific progress back, in part, to problems with transparency. When we cannot see details of the methods used to produce important findings, when we cannot see what the data actually look like, when we cannot verify when in the research process key decisions were made, then we cannot properly evaluate claims and evidence. But another very different argument for openness is about access and justice: expanding who gets to see and benefit from the work of scientists, join the discourse around it, and do scientific work. Issues of access would be important no matter how replicable and internally rigorous our science was. Of course, many – and I count myself among them – embrace both of these as animating concerns, even if we came to them from different starting points. That’s one of the powerful things that can happen when movements bring together people with different ideas and different experiences. But as the movement grows and matures, the differences will increase too. Different concerns and priorities will not always be so easily aligned. We need to be ready for that.

SIPS is not the open science movement – the movement is much bigger than we are. Nobody has to be a part of SIPS to do open science or be part of the movement. We should never make the mistake of believing that a SIPS membership defines open science, as my predecessor Katie Corker told us so eloquently last year. But we have the potential to be a powerful force for good within the movement. When SIPS had its first meeting just three years ago, it felt like a small, ragtag band of outsiders who had just discovered they weren’t alone. Now look at us. We have grown in size so fast that our conference organizers could barely keep up. 525 people flew from around the world to get together and do service. Signing up for service! (Don’t tell your department chair.) People are doing it because they believe in our mission and want to do something about it.

This brings me to what I see as the biggest challenge that lies ahead for SIPS. As we have grown and will continue to grow, we need to be asking: What do we do about differences? Both the differences that already exist in our organization, and the differences that could be represented here but aren’t yet. Differences in backgrounds and identities, differences in culture and geography, differences in subfields and interests and approaches. To which my answer is: Differences can be our strength. But that won’t happen automatically. It will take deliberation, intent, and work to make them an asset.

What does that mean? Within the open science movement, many have been working on improvements. But there is a natural tendency for people to craft narrow solutions that just work for themselves, and for people and situations they know. SIPS is at its best when it breaks through that, when it brings together people with different knowledge and concerns to work together. When a discussion about getting larger and more diverse samples includes people who come from different kinds of institutions who have access to different resources, different organizational and technical skills, but see common cause, we get the Psychological Science Accelerator. When people who work with secondary data are in the room talking about preregistration, then instead of another template for a simple two-by-two, we get an AMPPS paper about preregistration for existing data. When mixed-methods researchers feel welcomed one year, they come back the next year with friends and organize a whole session on open qualitative research.

Moving forward, for SIPS to continue to be a force for good, we have to take the same expectations we have of our science and apply them to our movement, our organization, and ourselves. We have to listen to criticism from both within and outside of the society and ask what we can learn from it. Each one of us has to take diversity and inclusion as our own responsibility and ask ourselves, how can I make this not some nice add-on, but integral to the way I am trying to improve psychology? We have to view self-correction and improvement – including improvement in how diverse and inclusive we are – as an ongoing task, not a project we will finish and move on from.

I say this not just as some nice paean to diversity, but because it is an existential task for SIPS and the open science movement. This is core to our values. If we remake psychological science into something that works smashingly well for the people in this room, but not for anyone else, we will have failed at our mission. The history of collective human endeavors, including social movements – the ways they can reproduce sexism and racism and other forms of inequality, and succumb to power and prestige and faction – gives us every reason to be on our guard. But the energy, passion, and ideals I’ve seen expressed these last few days by the people in this room give me cause for hope. We are, at the end of the day, a service organization. Hundreds of people turned up in Rotterdam to try to make psychology better not just for themselves, but for the world.

So when people ask, “Will this time be different?” my answer is this: Don’t ever feel certain that the answer is yes, and maybe this time it will be.

This is your Brain on Psychology – This is your Psychology on Brain (a guest post by Rob Chavez)

The following is a guest post by Rob Chavez.

If I’m ever asked ‘what was a defining moment of your career?’, I can think of a very specific instance that has stuck with me since my early days as a student in social neuroscience. I was at a journal club meeting where we were discussing an early paper using fMRI to investigate facial processing when looking at members of different racial groups. In this paper, the researchers found greater activity in the amygdala for viewing black faces than for white faces. Although the authors were careful not to say it explicitly, the implication for most readers was clear: The ‘threat center’ turned on for black faces more than white faces, therefore the participants may have implicit fear of black faces. Several students in the group brought up criticisms of that interpretation revolving around how the amygdala is involved in other processes, and we started throwing around ideas for study designs to possibly tease apart alternative explanations (e.g. lower-level visual properties, ambiguity, familiarity) that might also account for the amygdala activity.

Then it happened: The professor in the room finally chimed in. “Look, these are interesting ideas, but they don’t really tell us anything about racial bias. I don’t really care about what the amygdala does; I just care what it can tell us about social psychology.” Even in the nascent stages of my career, I was somewhat flabbergasted. Who wouldn’t want to know everything they possibly could about the thing they are using to draw inferences, especially when that thing is part of the source of the mechanism? For me, this event marked a turning point where I began to think of neuroscience less as a method for answering psychological questions and started thinking of it as a multidisciplinary endeavor to which psychology has much to contribute.

Now as a card-carrying social neuroscientist, when I attend conferences, such as the Society for Personality and Social Psychology meeting, I am frequently asked what neuroscience can contribute to our understanding of psychology that we don’t already know from behavioral studies, which are often more flexible, less noisy, and much, much cheaper to run. However, contributing to psychological theory or outperforming behavioral predictions are often not the proximal goals for researchers using neuroimaging methods. Instead, much of the interest in social neuroscience stems from the potential of applying insights from psychology (and other disciplines) to better understand how cognitive and social phenomena are represented in the function and structure of the brain for its own sake, and not simply using the brain as a tool or methodology. I believe that these efforts help us refine the link between these levels of analysis and, frankly, are interesting in their own right.

To be fair to the professor at the journal club, there may be a reason that people hold the view that the brain can simply be used as a method or a tool. Many early researchers using neuroimaging were not originally trained in neuroscience per se but instead transitioned over to it from using other kinds of psychophysiological methods. As such, there are understandable reasons why many researchers doing psychophysiological work don’t have much of a motivation to care as deeply about the underlying physiological process itself. For example, if a researcher is doing a study measuring electrodermal activity, chances are that they don’t care very deeply about sweat (and possibly don’t really care about sympathetic nervous system activity), but rather use it as an indicator of emotional arousal. Put differently, nobody assumes that sweat is the origin of arousal or believes that the fingertip is the organ responsible for the seat of the mind.*

This is not true for the brain, and things start to get even more complicated quickly. Even if you want to just use fMRI amygdala activation to be a marker of threat or fear, the path to do so is not as clear as in other physiological measures. (The amygdala is not even a single structure but rather a collection of functionally distinct nuclei, each with its own functional tuning and connectivity profile to other parts of the brain.) I believe that the way many have been taught to think about measures of brain function and structure has been conflated with some of these more peripheral measures in other parts of the body that are obviously not ‘the source’ of the mind. As such, it doesn’t make much sense to ask how psychological processes are represented in skin conductance in the same way as asking how they are represented in the brain (even if using relatively crude and indirect tools like fMRI). Thus, the common criticism of some neuroimaging work that “we already know that the mind happens in the brain” is shortsighted. Yes, but how, when, and at what level of granularity? However, this perspective is not without its challenges.

One of the ways in which brain imaging can be useful to psychologists is to know when activation of a particular brain region or network is indicative of a specific psychological phenomenon. However, in the context of neuroimaging, the term reverse inference is a bit of a dirty phrase.** When someone accuses an fMRI researcher of engaging in reverse inference – drawing conclusions about what psychological process is involved given the activation of a brain area – it is usually a criticism. However, reverse inference is indeed one of the overarching goals of how cognitive and social neuroscience inform psychology in general. We want to know when we can make sound inferences about the psychological processes involved based on neuroimaging metrics in a given task or under certain conditions. Although this is a major goal of the endeavor, it is only one of several, and it is often a distal one. What people ought to be criticizing is premature, decontextualized, or otherwise incomplete reverse inference that overreaches on the conclusions drawn from these methods – Does amygdala activation really mean ‘threat’? Are there other processes involved that might explain it? Does that depend on the particular stimuli being used? Is it a single part of the amygdala or several in concert? Even if replicable, how confident are we that the paradigm being used is representative of the possibility space of reasonable paradigms that could have been used instead? – We have acknowledged for a long time that there is almost never a one-to-one mapping of activity in a single brain region to a single psychological process. Tackling the issues of how then to meaningfully relate psychological processes to the brain is what many of us are working on right now.
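To make the worry about premature reverse inference concrete, here is a minimal sketch, with made-up numbers, of the Bayesian framing that is often used to formalize the problem: the probability that a psychological process is engaged given activation depends not only on how often that process produces activation, but also on how often other processes produce it and on the base rate of the process in that task context.

```python
# All numbers below are hypothetical, chosen only to illustrate the logic.
p_activation_given_threat = 0.80   # P(amygdala activation | threat process engaged)
p_activation_given_other  = 0.40   # P(amygdala activation | some other process engaged)
p_threat = 0.30                    # assumed base rate of the threat process in this task

# Bayes' rule: P(threat | activation)
p_activation = (p_activation_given_threat * p_threat
                + p_activation_given_other * (1 - p_threat))
p_threat_given_activation = p_activation_given_threat * p_threat / p_activation

print(round(p_threat_given_activation, 2))  # about 0.46 with these made-up numbers
```

Because the amygdala responds under many conditions besides threat, the second term in the denominator is large, and observing activation moves our belief about “threat” far less than an unqualified reverse inference would imply.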

However, it’s hard to accomplish these goals while simultaneously pushing the envelope of psychological theory. The collective expectation that cognitive and social neuroscience experiments have an obligation to contribute to our understanding of complex psychological phenomena (and not vice versa) is often premature, and work done to meet it is rarely definitive, much less revolutionary. Moreover, I feel strongly that the insistence that we frame the interpretation of our results in ways that placate this expectation has led many otherwise cautious researchers to take unwarranted liberties into the dark side of reverse inference. Ironically, this may have spurred many of the criticisms of overreach in cognitive and social neuroscience that come from others within the broader psychology community. Just as it took years for psychometricians to gain an understanding of how measurement and the scope of our inferences make up the scaffolding upon which we can build psychological theory, it is my hope that a more mature understanding of the intersection between neuroscience and psychology can offer analogous insights. However, given the overwhelming complexity of each of these domains, these efforts will take time.

I sometimes like to think of cognitive/social neuroscience as more of an applied field, like, say, psychology of law, where researchers use what we understand about cognition and behavior to inform how those processes are deployed in the legal system; except that in our case, it’s how they are deployed in the brain. If we were talking to a psychology of law scholar, we would never say to them “I don’t care about the law; I just care about what it can tell us about psychology,” because the inferential arrow is not pointed in that direction. For many questions, I think psychology has more to offer neuroscience than the other way around. I am excited to be a part of this endeavor and to use my knowledge of both domains to try to build a stronger and more fruitful bridge between them.

At the end of the day, neuroscience is going to move forward whether psychologists want to come along or not. And just as they have in many ‘big data’ domains, engineers working in neuroscience have already started asking questions about psychological phenomena without psychologists’ input. It seems to me that psychologists should not only want representation at the neuroscience table but also recognize that psychology is needed for a comprehensive understanding of the brain; the engineers will not be able to figure it out without it. I see the work of researchers in my subfield as attempting to fill that chair to some degree. I hope others join us not only in appreciating the beauty of the brain, but also in recognizing that psychologists’ understanding of the mind and behavior is essential for understanding the very organ that gives rise to them, and the challenges therein.


* To be clear, I am not saying that brain imaging is better than psychophysiology for making inferences about psychological phenomena. On the contrary, psychophysiological measures are often clearer and less expensive than their brain imaging counterparts. However, if you care about inference about the neural systems themselves, psychophysiology often can’t say as much about that (with some exceptions, like pupillometry and locus coeruleus activity).

** Because we cannot directly measure most phenomena of interest, almost all psychological measures – including things as simple as reaction times – are technically reverse inferences. Moreover, reaction times involve engaging volitional actions in motor cortex via a cascade of spatiotemporal events in the rest of the brain that are critical for understanding and making appropriate responses for the task at hand. There are many cogs in the machine, and in psychology there is no such thing as a free inference.

Data analysis is thinking, data analysis is theorizing

There is a popular adage about academic writing: “Writing is thinking.” The idea is this: There is a simple view of writing as just an expressive process – the ideas already exist in your mind and you are just putting them on a page. That may in fact be true some of the time, like for day-to-day emails or texting friends or whatever. But “writing is thinking” reminds us that for most scholarly work, the process of writing is a process of continued deep engagement with the world of ideas. Sometimes before you start writing you think you had it all worked out in your head, or maybe in a proposal or outline you wrote. But when you sit down to write the actual thing, you just cannot make it make sense without doing more intellectual heavy lifting. It’s not because you forgot, it’s not because you “just” can’t find the right words. It’s because you’ve got more thinking to do, and nothing other than sitting down and trying to write is going to show that to you.*

Something that is every bit as true, but less discussed and appreciated, is that in the quantitative sciences, the same applies to working with data. Data analysis is thinking. The ideas you had in your head, or the general strategy you wrote into your grant application or IRB protocol, are not enough. If that is all you have done so far, you almost always still have more thinking to do.

This point is exemplified really well in the recent Many Analysts, One Data Set paper. Twenty-nine teams of data analysts were given the same scientific hypothesis to test and the same dataset to test it in. But no two teams ran the same analysis, resulting in 29 different answers. This variability was neither statistical noise nor human error. Rather, the differences in results were because of different reasoned decisions by experienced data analysts. As the authors write in the introduction:

In the scientific process, creativity is mostly associated with the generation of testable hypotheses and the development of suitable research designs. Data analysis, on the other hand, is sometimes seen as the mechanical, unimaginative process of revealing results from a research study. Despite methodologists’ remonstrations…, it is easy to overlook the fact that results may depend on the chosen analytic strategy, which itself is imbued with theory, assumptions, and choice points.

The very end of the quote drives home a second, crucial point. Data analysis is thinking, but it is something else too. Data analysis is theorizing. And it is theorizing no matter how much or how little the analyst is thinking about it.

Scientific theory is not your mental state. Scientific theory consists of statements about nature. When, say, you decide on a scheme for how to exclude outliers in a response-time task, that decision implies a theory of which observations result from processes that are irrelevant to what you are studying and therefore ignorable. When you decide on how to transform variables for a regression, that decision implies a theory of the functional form of the relationships between measured variables. These theories may be longstanding ones, well-developed and deeply studied in the literature. Or they may be ad hoc, one-off implications. Moreover, the content of the analyst’s thinking may be framed in theoretical terms (“hmmm let me think through what’s generating this distribution”), or it may be shallow and rote (“this is how my old advisor said to trim response times”). But the analyst is still making decisions** that imply something about something in nature – the decisions are “imbued with theory.” That’s why scientists can invoke substantive reasons to critique each other’s analyses without probing each other’s mental states. “That exclusion threshold is too low, it excludes valid trials” is an admissible argument, and you don’t have to posit what was in the analyst’s head when you make it.
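As a small illustration of the point, here is a sketch using simulated response times (not real data, and not anyone’s actual preprocessing pipeline) in which two defensible trimming rules embody different theories of which observations are task-irrelevant, and return different answers from the same data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated response times (ms): mostly task-driven responses,
# plus a few anticipatory guesses and a few attention lapses.
rt = np.concatenate([
    rng.lognormal(mean=6.4, sigma=0.3, size=950),  # typical responses, ~600 ms
    rng.uniform(80, 180, size=25),                 # anticipatory guesses
    rng.uniform(3000, 6000, size=25),              # lapses
])

# Rule A: fixed cutoffs. Implied theory: responses faster than 200 ms or
# slower than 2500 ms come from processes irrelevant to the task.
rule_a = rt[(rt > 200) & (rt < 2500)]

# Rule B: 3 SDs from the sample mean. Implied theory: "irrelevant"
# observations are defined relative to this sample's own distribution.
rule_b = rt[np.abs(rt - rt.mean()) < 3 * rt.std()]

print(f"Rule A keeps n={rule_a.size}, mean={rule_a.mean():.0f} ms")
print(f"Rule B keeps n={rule_b.size}, mean={rule_b.mean():.0f} ms")
```

Neither rule is “the” correct one. Each is a substantive claim about which observations are ignorable, and each can be argued about in exactly those terms.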

So data analysis decisions imply statements in theory-space, and in order to think well about them we probably need to think in that space too. To test one theory of interest, the process of data analysis will unavoidably invoke other theories. This idea is not, in fact, new. It is a longstanding and well-accepted principle in philosophy of science called the Duhem-Quine thesis. We just need to recognize that data analysis is part of that web of theories.

This gives us an expanded framework to understand phenomena like p-hacking. The philosopher Imre Lakatos said that if you make a habit of blaming the auxiliaries when your results don’t support your main theory, you are in what he called a degenerative research programme. As you might guess from the name, Imre wasn’t a fan. When we p-hack – try different analysis specifications until we get one we like – we are trying and discarding different configurations of auxiliary theories until we find one that lets us draw a preferred conclusion. We are doing degenerative science, maybe without even realizing it.

On the flip side, this is why preregistration can be a deeply intellectually engaging and rewarding process.*** Because without the data whispering in your ear, “Try it this way and if you get an asterisk we can go home,” you have one less shortcut around thinking about your analysis. You can, of course, leave the thinking until later. You can do so with full awareness and transparency: “This is an exploratory study, and we plan to analyze the data interactively after it is collected.” Or you can fool yourself, and maybe others, if you write a vague or partial preregistration. But if you commit to planning your whole data analysis workflow in advance, you will have nothing but thinking and theorizing to guide you through it. Which, sooner or later, is what you’re going to have to be doing.


* Or, you can write it anyway and not make sense, which also has a parallel in data analysis.
** Or outsourcing them to the software developer who decided what defaults to put in place.
*** I initially dragged my heels on starting to preregister – I know, I know – but when I finally started doing it with my lab, we experienced this for ourselves, somewhat to my own surprise.

What if we talked about p-hacking the way we talk about experimenter effects?

Discussions about p-hacking sometimes go sideways. A hypothetical exchange might go like this:

READER: Those p-values are all hovering just below .05, I bet the authors p-hacked.

AUTHOR: I know that I did not p-hack, and I resent the accusation.

By comparison, consider how we talk about another form of potential bias: experimenter effects.

It is widely accepted that experimenters’ expectations, beliefs, or other characteristics can influence participants in behavioral experiments and medical trials. We also accept that this can happen without intent or even awareness on the part of the experimenter. Expectations about how participants receiving a treatment are supposed to differ from those receiving a placebo might show up in the experimenter’s behavior in subtle ways that could influence the participants.

We also don’t have a complete theory of experimenter effects that allows us to reliably measure every manifestation or predict with high confidence when they will and won’t occur. So instead, we consider them as an open possibility in a wide range of situations. As a result, it is also widely accepted that using procedural safeguards against experimenter effects is a best practice in most experiments where a human experimenter will interact with subjects.

Because of all these shared assumptions, discussions around experimenter effects are often much less heated. If you are presenting a study design at lab meeting, and someone says “you’ll need to keep your RAs blind to condition, here’s an idea how to do that…” that’s generally considered a helpful suggestion rather than an insinuation of planned malfeasance.

And even after a study is done, it is generally considered fair game to ask about blinding and other safeguards, and incorporate their presence or absence into an evaluation of a study. If a study lacks such safeguards, authors generally don’t say things like “I would never stoop so low as to try to influence my participants, how dare you!” Everybody, including authors, understands that experimenters don’t always know how they might be influencing subjects. And when safeguards are missing, readers typically treat it as a reason for doubt and uncertainty. We allow and even expect readers to calibrate that uncertainty judgment based on other assumptions or information, like how plausible the effect seems, how strong or weak did partial or incomplete safeguards seem, etc.

For some reason though, when it comes to potential sources of bias in data analysis, we have not (yet) reached a place where we can talk about it in a similar way. This is despite the fact that it has a lot in common with experimenter effects.

It is certainly possible for somebody to deliberately and strategically p-hack, just like it’s possible for an experimenter to wink and nudge and say “are you sure you’re not feeling better?” or whatever. But bias in data analysis does not have to happen that way. Analysts do not have to have intention or even awareness in order to do things that capitalize on chance.

Consider, first of all, that almost every data analysis involves many decisions: what data to include or exclude, whether or how to transform it, a zillion possibilities in specifying the analysis (what particular variables to look at, what analyses to run on them, whether to use one- or two-tailed tests, what covariates to include, which main, interactive, simple, or contrast effect[s] to treat as critical tests of the hypothesis, etc.), and then decisions about what to report. We psychologists of all people know that you cannot un-know something. So once the analyst has seen anything about the data – distributions, scatterplots, preliminary or interim analyses, whatever else – all the subsequent decisions will be made by a person who has that knowledge. And after that point, it is simply impossible for anybody – including the analyst – to state with any confidence how those decisions might otherwise have been made without that knowledge. Which means that we have to treat seriously the possibility that the analyst made decisions that overfit the analyses to the data.
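To see how this kind of flexibility capitalizes on chance even without any intent to deceive, here is a minimal simulation sketch (not a model of anyone’s actual workflow): the data contain no true effect, a handful of defensible specifications are tried, and the most favorable one is reported. The realized false positive rate climbs above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_group = 5000, 30
hits = 0

for _ in range(n_sims):
    # No true effect: both groups come from the same distribution.
    g1 = rng.normal(size=n_per_group)
    g2 = rng.normal(size=n_per_group)

    p_values = []
    # Specification 1: straightforward two-sample t test.
    p_values.append(stats.ttest_ind(g1, g2).pvalue)
    # Specification 2: exclude "outliers" beyond 2 SD first.
    p_values.append(stats.ttest_ind(g1[np.abs(g1) < 2], g2[np.abs(g2) < 2]).pvalue)
    # Specification 3: nonparametric test instead.
    p_values.append(stats.mannwhitneyu(g1, g2).pvalue)
    # Specification 4: transform the data first.
    p_values.append(stats.ttest_ind(np.log(g1 + 10), np.log(g2 + 10)).pvalue)

    # Report whichever specification looks best.
    hits += min(p_values) < 0.05

print(f"False positive rate with selective reporting: {hits / n_sims:.2f}")
```

The individual specifications are all things a reasonable analyst might do; the bias comes from choosing among them after seeing which one “works.”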

More subtly, as Gelman and Loken discuss in their “forking paths” paper, bias is not defined by a behavior (how many analyses did you run?), but by a set of counterfactuals (how many analyses could you have run?). So even if the objective history is that one and only one analysis was run, that is not a guarantee of no bias.

What all of this means is that when it comes to bias in data analysis, we are in very much a similar situation as with experimenter effects. It is virtually impossible to measure or observe it happening in a single instance, even by the person doing the data analysis. But what we can do is define a broad set of circumstances where we have to take it seriously as a possibility.

It would be great if we could collectively shift our conversations around this issue. I think that would involve changes from both critical readers and from authors.

Start by considering procedures, not behavior or outcomes. Were safeguards in place, and if so, how effective were they? For bias in data analysis, the most common safeguard is preregistration. The mere existence of a preregistration (as indicated by a badge or an OSF link in a manuscript) tells you very little though – many of them do not actually constrain bias. Sometimes that is even by design (for example, preregistering an exploratory study is a great way to prevent editors or reviewers from pressuring you to HARK later on). A preregistration is just a transparency step: you have to actually read it to find out what it does. In order for a preregistration to prevent analytic bias, it has to do two things. First, it has to have a decision inventory – that is, it has to identify all of the decisions about what data to collect/analyze, how to analyze it, and what to report. So ask yourself: is there a section on exclusions? Transformations? Does it say what the critical test is? Etc. (This will be easier to do in domains where you are familiar with the analytic workflow for the research area. It can also be aided by consulting templates. And if authors write and post analysis code as part of a preregistration, that can make things clear too.) Second, the preregistration has to have a plan for all of those decision points. To the extent that the inventory is complete and the plans are specific and were determined separate from the data, the preregistration can be an effective safeguard against bias.
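As a rough sketch of what an inventory plus plans can look like (the study and entries here are hypothetical, not a template from any journal or registry), a reader checking a preregistration is essentially looking for something with this structure, where each decision point is paired with a specific, data-independent plan:

```python
# Hypothetical decision inventory for an imagined two-condition study.
# Keys are decision points; values are specific plans fixed before seeing the data.
analysis_plan = {
    "sample": "Collect n = 120 per condition; no interim analyses.",
    "exclusions": "Drop trials with RT < 200 ms or > 2500 ms; drop participants "
                  "who fail both attention checks.",
    "transformations": "Log-transform response times before modeling.",
    "model": "OLS regression of log RT on condition, with age as the only covariate.",
    "critical_test": "Two-tailed test of the condition coefficient, alpha = .05.",
    "reporting": "Report the critical test regardless of outcome; label any "
                 "additional analyses as exploratory.",
}

for decision, plan in analysis_plan.items():
    print(f"{decision}: {plan}")
```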

When safeguards are missing or incomplete, everyone – authors and readers alike – should treat analytic bias as a serious possibility. If there is no preregistration or other safeguards, then bias is possible. If there is a preregistration but it was vague or incomplete, bias is also possible. In a single instance it is often impossible to know what actually happened, for the reasons I discussed above. It can be reasonable to start looking at indirect evidence, like the distribution of p-values, or whether the result is a priori implausible. Inferences about these things should be made with calibrated uncertainty: p-curves are neither perfect nor useless; improbable things really do happen, though by definition rarely; etc. So usually we should not be too sure in any direction.

Inferences about authors should be rare. We should have a low bar for talking about science and a high bar for talking about scientists. This cuts both ways. Casual talk challenging authors’ competence, intentions, unreported behaviors, etc. is often both hurtful and unjustified when we are talking about single papers.* But also, authors’ positive assertions about their character, behavior, etc. rarely shed light and can have the perverse effect of reinforcing the message that they, and not just the work, are a legitimate part of the conversation. As much as possible, make all the nouns in your discussion things like “the results,” “the procedure,” etc. and not “the authors” (or for that matter “my critics”). And whether you are an author, a critic, or even an observer, you can point out when people are talking about authors and redirect the conversation to the work.

I realize this last item draws a razor-thin line and maybe sometimes it is no line at all. After all, things like what safeguards were in place, and what happened if they weren’t, are results of the researcher’s behavior. So even valid criticism implicates what the authors did or didn’t do, and it will likely be personally uncomfortable for them. But it’s a distinction that’s worth observing as much as you can when you criticize work or respond to criticisms. And I would hope we’ve learned from the ways we talk about experimenter effects that it is possible to have less heated, and frankly more substantive, discussions about bias when we do that.

Finally, it is worth pointing out that preregistration and other safeguards are still really new to psychology and many other scientific fields. We are all still learning, collectively, how to do them well. That means that we need to be able to criticize them openly, publicly, and vigorously – if we do not talk about them, we cannot get better at doing them. But it also means that some preregistration is almost always better than none, because even a flawed or incomplete one will increase transparency and make it possible to criticize work more effectively. Even as we critique preregistrations that could have been done better, we should recognize that anybody who makes that critique and improvement possible has done something of value.


* In the bigger picture, for better or worse, science pins career advancement, resources, prestige, etc. to people’s reputations. So at some point we have to be able to talk about these things. This is a difficult topic and not something I want to get into here, other than to say that discussions about who is a good scientist are probably better left to entirely separate conversations from ones where we scientifically evaluate single papers, because the evidentiary standards and consequences are so different.

Accountable replications at Royal Society Open Science: A model for scientific publishing

Kintsugi pottery. Source: Wikimedia Commons.

Six years ago I floated an idea for scientific journals that I nicknamed the Pottery Barn Rule. The name is a reference to an apocryphal retail store policy captured in the phrase, “you break it, you bought it.” The idea is that if you pick something up in the store, you are responsible for what happens to it in your hands.* The gist of the idea in that blog post was as follows: “Once a journal has published a study, it becomes responsible for publishing direct replications of that study. Publication is subject to editorial review of technical merit but is not dependent on outcome.”

The Pottery Barn framing was somewhat lighthearted, but a more serious inspiration for it (though I don’t think I emphasized this much at the time) was newspaper correction policies. When news media take a strong stance on vetting reports of errors and correcting the ones they find, they are more credible in the long run. The good ones understand that taking a short-term hit when they mess up is part of that larger process.**

The core principle driving the Pottery Barn Rule is accountability. When a peer-reviewed journal publishes an empirical claim, its experts have judged the claim to be sound enough to tell the world about. A journal that adopts the Pottery Barn Rule is signaling that it stands behind that judgment. If a finding was important enough to publish the first time, then it is important enough to put under further scrutiny, and the journal takes responsibility to tell the world about those efforts too.

In the years since the blog post, a few journals have picked up the theme in their policies or practices. For example, the Journal of Research in Personality encourages replications and offers an abbreviated review process for studies it has published within the last 5 years. Psychological Science offers a preregistered direct replications submission track for replications of work it has published. And Scott Lilienfeld has announced that Clinical Psychological Science will follow the Pottery Barn Rule. In all three instances, these journals have led the field in signaling that they take responsibility for publishing replications.

And now comes the fullest implementation yet: thanks to hard work by Chris Chambers,*** the journal Royal Society Open Science has announced a new replication policy, which I was excited to read about this morning. Other journals should view the RSOS policy as a model for adoption. Under the new policy, RSOS commits to publishing any technically sound replication of any study it has published, regardless of the result, and provides a clear workflow for how it will handle such studies.

What makes the RSOS policy stand out? Accountability means tying your hands – you do not get to dodge it when it will sting or make you look bad. Under the RSOS policy, editors will still judge the technical faithfulness of replication studies. But they cannot avoid publishing replications on the basis of perceived importance or other subjective factors. Rather, whatever determination the journal originally made about those subjective questions at the time of the initial publication is applied to the replication. Making this a firm commitment, and having it spelled out in a transparent written policy, means that the scientific community knows where the journal stands and can easily see if the journal is sticking to its commitment. Making it a written policy (not just a practice) also means it is more likely to survive past the tenure of the current editors.

Such a policy should be a win both for the journal and for science. For RSOS – and for authors who publish there – articles will now have the additional credibility that comes from a journal saying it will stand behind that decision. For science, this will contribute to a more complete and less biased scientific record.

Other journals should now follow suit. Just as readers trust a news source more when it is transparent about corrections — and less when it leaves it to other media to fix its mistakes — readers should have more trust in journals that view replications of things they’ve published as their responsibility, rather than leaving them to other (often implicitly “lesser”) journals. Adopting the RSOS policy, or one like it, will be a way for journals to raise the credibility of the work that they publish while they make scientific publishing more rigorous and transparent.


* In reality, the actual Pottery Barn store will quietly write off the breakage as a loss and let you pretend it never happened. This is probably not a good model to emulate for science.

** One difference is that because newspapers report concrete facts, they work from a presumption that they got those things right, and they only publish corrections for errors. Whereas in science, uncertainty looms much larger in our epistemology. We draw conclusions from the accumulation of statistical evidence, so the results of all verification attempts have value regardless of outcome. But the common theme across both domains is being accountable for the things you have reported.

*** You may remember Chris from such films as registered reports and stop telling the world you can cure cancer because of seven mice.

The replication price-drop in social psychology

Why is the replication crisis centered on social psychology? In a recent post, Andrew Gelman offered a list of possible reasons. Although I don’t agree with every one of his answers (I don’t think data-sharing is common in social psych for example), it is an interesting list of ideas and an interesting question.

I want to riff on one of those answers, because it is something I’ve been musing about for a while. Gelman suggests that in social psychology, experiments are comparatively easy and cheap to replicate. Let’s stipulate that this is true of at least some parts of social psych. (Not necessarily all of them – I’ll come back to that.) What would easy and cheap replications do for a field? I’d suggest they have two, somewhat opposing effects.

On the one hand, running replications is the most straightforward way to obtain evidence about whether an effect is replicable.1 So the easier it is to run a replication, the easier it will be to discover if a result is a fluke. Broaden that out, and if a field has lots of replicability problems and replications are generally easy to run, it should be easier to diagnose the field.

But on the other hand, in a field or area where it is easy to run replications, that ease should foster a scientific ecosystem where unreplicable work gets weeded out. So over time, you might expect a field to settle into an equilibrium where, by routinely running those easy and cheap replications, it keeps unreplicable work at a comfortably low rate.2 No underlying replication problem, therefore no replication crisis.
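To make that equilibrium intuition a little more concrete, here is a toy simulation. It is entirely a back-of-the-envelope sketch with made-up parameters, not a model of any actual field; the only thing that varies across runs is how often any given published finding gets replicated in a year.

```python
# A toy simulation of the weeding-out equilibrium. All parameters are made up;
# the only input that changes across runs is how often a published finding
# gets replicated in a given year (standing in for how cheap replications are).
import random

def share_unreplicable(p_replicated_per_year, p_false=0.4,
                       years=30, new_papers_per_year=100, seed=1):
    """Fraction of the surviving literature that is unreplicable after `years`,
    assuming a failed replication removes a false finding from circulation."""
    rng = random.Random(seed)
    literature = []  # True = replicable finding, False = unreplicable one
    for _ in range(years):
        # New findings enter the literature; a fixed share are unreplicable.
        literature.extend(rng.random() > p_false for _ in range(new_papers_per_year))
        # Each finding may get replicated this year; unreplicable findings
        # that get replicated are weeded out, replicable ones survive.
        literature = [ok for ok in literature
                      if ok or rng.random() > p_replicated_per_year]
    return sum(not ok for ok in literature) / len(literature)

for p_rep in (0.01, 0.10, 0.50):
    print(f"P(replicated in a year) = {p_rep:.2f}: "
          f"{share_unreplicable(p_rep):.0%} of the literature is unreplicable")
```

The exact numbers mean nothing. The point is just that when replications are frequent (because they are cheap), unreplicable findings get weeded out quickly, and when they are rare (because they are expensive), those findings can sit in the literature more or less indefinitely.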

The idea that I have been musing on for a while is that “replications are easy and cheap” is a relatively new development in social psychology, and I think that may have some interesting implications. I tweeted about it a while back but I thought I’d flesh it out.

Consider that until around a decade ago, almost all social psychology studies were run in person. You might be able to do a self-report questionnaire study in mass testing sessions, but a lot of experimental protocols could only be run a few subjects at a time. For example, any protocol that involved interaction or conditional logic (i.e., couldn’t just be printed on paper for subjects to read) required live RAs to run the show. A fair amount of measurement was analog and required data entry. And computerized presentation or assessment was rate-limited by the number of computers and cubicles a lab owned. All of this meant that even a lot of relatively simple experiments required a nontrivial investment of labor and maybe money. And a lot of those costs were per-subject costs, so they did not scale up well.

All of this changed only fairly recently, with the explosion of internet experimentation. In the early days of the dotcom revolution you had to code websites yourself,3 but eventually companies like Qualtrics sprang up with pretty sophisticated and usable software for running interactive experiments. That meant that subjects could complete many kinds of experiments at home without any RA labor to run the study. And even for in-lab studies, a lot of data entry – which had been a labor-intensive part of running even a simple self-report study – was cut out. (Or if you were already using experiment software, you went from having to buy a site license for every subject-running computer to being able to run it on any device with a browser, even a tablet or phone.) And Mechanical Turk meant that you could recruit cheap subjects online in large numbers and they would be available virtually instantly.

Altogether, what this means is that for some kinds of experiments in some areas of psychology, replications have undergone a relatively recent and sizeable price drop. Some kinds of protocols pretty quickly went from something that might need a semester and a team of RAs to something you could set up and run in an afternoon.4 And since you weren’t booking up your finite lab space or spending a limited subject-pool allocation, the opportunity costs got lower too.

Notably, growth of all of the technologies that facilitated the price-drop accelerated right around the same time as the replication crisis was taking off. Bem, Stapel, and false-positive psychology were all in 2011. That’s the same year that Buhrmester et al. published their guide to running experiments on Mechanical Turk, and just a year later Qualtrics got a big venture capital infusion and started expanding rapidly.

So my conjecture is that the sudden price drop helped shift social psychology out of a replications-are-rare equilibrium and moved it toward a new one. In pretty short order, experiments that previously would have been costly to replicate (in time, labor, money, or opportunity) got a lot cheaper. This meant that there was a gap between the two effects of cheap replications I described earlier: All of a sudden it was easy to detect flukes, but there was a buildup of unreplicable effects in the literature from the old equilibrium. This might explain why a lot of replications in the early twenty-teens were social priming5 studies and similar paradigms that lend themselves to online experimentation pretty well.

To be sure, I don’t think this could by any means be a complete theory. It’s more of a facilitating change along with other factors. Even if replications are easy and cheap, researchers still need to be motivated to go and run them. Social psychology had a pretty strong impetus to do that in 2011, with Bem, Stapel, and False-positive psychology all breaking in short order. And as researchers in social psychology started finding cause for concern in those newly-cheap studies, they were motivated to widen their scope to replicating other studies that had been designed, analyzed, and reported in similar ways but that hadn’t had so much of a price-drop.

To date that broadening-out from the easy and cheap studies hasn’t spread nearly as much to other subfields like clinical psychology or developmental psychology. Perhaps there is a bit of an ingroup/outgroup dynamic – it is easier to say “that’s their problem over there” than to recognize commonalities. And those fields don’t have a bunch of cheap-but-influential studies of their own to get them started internally.6

An optimistic spin on all this is that social psychology could be on its way to a new equilibrium where running replications becomes more of a normal thing. But there will need to be an accompanying culture shift where researchers get used to seeing replications as part of mainstream scientific work.

Another implication is that the price-drop and resulting shift in equilibrium has created a kind of natural experiment where the weeding-out process has lagged behind the field’s ability to run cheap replications. A boom in metascience research has taken advantage of this lag to generate insights into what does7 and doesn’t8 make published findings less likely to be replicated. Rather than saying “oh that’s those people over there,” fields and areas where experiments are difficult and expensive could and should be saying, wow, we could have a problem and not even know it – but we can learn some lessons from seeing how “those people over there” discovered they had a problem and what they learned about it.


  1. Hi, my name is Captain Obvious. 
  2. Conversely, it is possible that a field where replications are hard and expensive might reach an equilibrium where unreplicable findings could sit around uncorrected. 
  3. RIP the relevance of my perl skills. 
  4. Or let’s say a week + an afternoon if you factor in getting your IRB exemption. 
  5. Yeah, I said social priming without scare quotes. Come at me. 
  6. Though admirably, some researchers in those fields are now trying anyway, costs be damned. 
  7. Selective reporting of underpowered results.
  8. Hidden moderators.

Reflections on SIPS (guest post by Neil Lewis, Jr.)

The following is a guest post by Neil Lewis, Jr. Neil is an assistant professor at Cornell University.

Last week I visited the Center for Open Science in Charlottesville, Virginia to participate in the second annual meeting of the Society for the Improvement of Psychological Science (SIPS). It was my first time going to SIPS, and I didn’t really know what to expect. The structure was unlike any other conference I’ve been to—it had very little formal structure—there were a few talks and workshops here and there, but the vast majority of the time was devoted to “hackathons” and “unconference” sessions where people got together and worked on addressing pressing issues in the field: making journals more transparent, designing syllabi for research methods courses, forming a new journal, changing departmental/university culture to reward open science practices, making open science more diverse and inclusive, and much more. Participants were free to work on whatever issues we wanted to and to set our own goals, timelines, and strategies for achieving those goals.

I spent most of the first two days at the diversity and inclusion hackathon that Sanjay and I co-organized. These sessions blew me away. Maybe we’re a little cynical, but going into the conference we thought maybe two or three people would stop by and thus it would essentially be the two of us trying to figure out what to do to make open science more diverse and inclusive. Instead, we had almost 40 people come and spend the first day identifying barriers to diversity and inclusion, and developing tools to address those barriers. We had sub-teams working on (1) improving measurement of diversity statistics (hard to know how much of a diversity problem one has if there’s poor measurement), (2) figuring out methods to assist those who study hard-to-reach populations, (3) articulating the benefits of open science and resources to get started for those who are new, (4) leveraging social media for mentorship on open science practices, and (5) developing materials to help PIs and institutions more broadly recruit and retain traditionally underrepresented students/scholars. Although we’re not finished, each team made substantial headway in each of these areas.

On the second day, those teams continued working, but in addition we had a “re-hack” that allowed teams that were working on other topics (e.g., developing research methods syllabi, developing guidelines for reviewers, starting a new academic journal) to present their ideas and get feedback on how to make their projects/products more inclusive from the very beginning (rather than having diversity and inclusion be an afterthought as is often the case). Once again, it was inspiring to see how committed people were to making sure so many dimensions of our science become more inclusive.

These sessions, and so many others at the conference, gave me a lot of hope for the field—hope that I (and I suspect others) could really use (special shout-outs to Jessica Flake’s unconference on improving measurement, Daniel Lakens and Jeremy Biesanz’s workshop on sample size and effect size, and Liz Page-Gould and Alex Danvers’s workshop on Fundamentals of R for data analysis). It’s been a tough few years to be a scientist. I was working on my PhD in social psychology at the time that the Open Science Collaboration published its report estimating the reproducibility of psychological science to be somewhere between one-third and one-half. Then a similar report came out about the state of cancer research – only twenty-five percent of papers replicated there. Now it seems like at least once a month there is some new failed replication, or some other study comes out with major methodological flaw(s). As someone just starting out, constantly seeing findings I learned were fundamental fail to replicate, and new work emerge so flawed, I often find myself wondering (a) what the hell do we actually know, and (b) if so many others can’t get it right, what chance do I have?

Many Big Challenges with No Easy Solutions

To try and minimize future fuck-ups in my own work, I started following a lot of methodologists on Twitter so that I could stay in the loop on what I need to do to get things right (or at least not horribly wrong). There are a lot of proposed solutions out there (and some argument about those solutions, e.g., p < .005) but there are some big ones that seem to have reached consensus, including vastly increasing the size of our samples to increase the reliability of findings. These solutions make sense for addressing the issues that got us to this point, but the more I’ve thought about and talked to others about them, the more it became clear that some may unintentionally create another problem along the way, which is to “crowd out” some research questions and researchers. For example, when talking with scholars who study hard-to-reach populations (e.g., racial and sexual minorities), a frequently voiced concern is that it is nearly impossible to recruit the sample sizes needed to meet new thresholds of evidence.

To provide an example from my own research, I went to graduate school intending to study racial-ethnic disparities in academic outcomes (particularly Black-White achievement gaps). In my first semester at the University of Michigan I asked my advisor to pay for a pre-screen of the department of psychology’s participant pool to see how many Black students I would have to work with if I pursued that line of research. There were 42 Black students in the pool that semester. Forty-two. Out of 1,157. If memory serves me well, that was actually one of the highest concentrations of Black students in the pool in my entire time there. Seeing that, I asked others who study racial minorities what they did. I learned that unless they had well-funded advisors that could afford to pay for their samples, many either shifted their research questions to topics that were more feasible to study, or they would spend their graduate careers collecting data for one or two studies. In my area, that latter approach was not practical for being employable—professional development courses taught us that search committees expect multiple publications in the flagship journals, and those flagship journals usually require multiple studies for publication.

Learning about those dynamics, I temporarily shifted my research away from racial disparities until I figured out how to feasibly study those topics. In the interim, I studied other topics where I could recruit enough people to do the multi-study papers that were expected. That is not to say I am uninterested in those other topics I studied (I very much am) but disparities were what interested me most. Now, some may read that and think ‘Neil, that’s so careerist of you! You should have pursued the questions you were most passionate about, regardless of how long it took!’ And on an idealistic level, I agree with those people. But on a practical level—I have to keep a roof over my head and eat. There was no safety net at home if I was unable to get a job at the end of the program. So I played it safe for a few years before going back to the central questions that brought me to academia in the first place.

That was my solution. Others left altogether. As one friend depressingly put it—“there’s no more room for people like us; unless we get lucky with the big grants that are harder and harder to get, we can’t ask our questions—not when power analyses now say we need hundreds per cell; we’ve been priced out of the market.” And they’re not entirely wrong. Some collaborators and I recently ran a survey experiment with Black American participants; it was a 20-minute survey with 500 Black Americans. That one study cost us $11,000. Oh, and it’s a study for a paper that requires multiple studies. The only reason we can do this project is because we have a senior faculty collaborator who has an endowed chair and hence deep research pockets.
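To put rough numbers on “hundreds per cell”: here is the standard power calculation for a simple two-condition comparison at the conventional 80% power and .05 alpha. The effect sizes are illustrative assumptions on my part; the only dollar figures are the ones from the study I just described.

```python
# Back-of-the-envelope version of the "hundreds per cell" claim: sample size
# needed per condition for a two-group comparison at 80% power and alpha = .05.
# The effect sizes (Cohen's d) below are assumptions chosen for illustration.
from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()
for d in (0.5, 0.3, 0.2):
    n_per_cell = power_analysis.solve_power(effect_size=d, alpha=0.05, power=0.80)
    print(f"d = {d}: ~{round(n_per_cell)} participants per cell")

# The survey described above: 500 participants for $11,000.
print(f"Cost per participant: ${11_000 / 500:.2f}")
```

At small effect sizes the per-cell numbers land in the hundreds, which is how a single 20-minute survey ends up costing five figures.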

So that is the state of affairs. The goal post keeps shifting, and it seems that those of us who already had difficulty asking our questions have to choose between pursuing the questions we’re interested in, and pursuing questions that are practical for keeping roofs over our heads (e.g., questions that can be answered for $0.50 per participant on MTurk). And for a long time this has been discouraging because it felt as though those who have been leading the charge on research reform did not care. An example that reinforces this sentiment is a quote that floated around Twitter just last week. A researcher giving a talk at a conference said “if you’re running experiments with low sample n, you’re wasting your time. Not enough money? That’s not my problem.”

That researcher is not wrong. For all the reasons methodologists have been writing about for the past few years (and really, past few decades), issues like small sample sizes do compromise the integrity of our findings. At the same time, I can’t help but wonder about what we lose when the discussion stops there, at “that’s not my problem.” He’s right—it’s not his personal problem. But it is our collective problem, I think. What questions are we missing out on when we squeeze out those who do not have the thousands or millions of dollars it takes to study some of these topics? That’s a question that sometimes keeps me up at night, particularly the nights after conversations with colleagues who have incredibly important questions that they’ll never pursue because of the constraints I just described.

A Chance to Make Things Better

Part of what was so encouraging about SIPS was that we not only began discussing these issues, but people immediately took them seriously and started working on strategies to address them—putting together resources on “small-n designs” for those who can’t recruit the big samples, to name just one example. I have never seen issues of diversity and inclusion taken so seriously anywhere, and I’ve been involved in quite a few diversity and inclusion initiatives (given the short length of my career). At SIPS, people were working tirelessly to make actionable progress on these issues. And again, it wasn’t a fringe group of women and minority scholars doing this work as is so often the case—we had one of the largest hackathons at the conference. I really wish more people were there to witness it—it was amazing, and energizing. It was the best of science—a group of committed individuals working incredibly hard to understand and address some of the most difficult questions that are still unanswered, and producing practical solutions to pressing social issues.

Now it is worth noting that I had some skepticism going into the conference. When I first learned about it I went back-and-forth on whether I should go; and even the week before the conference, I debated canceling the trip. I debated canceling because there was yet another episode of the “purely hypothetical scenario” that Will Gervais described in his recent blog post:

A purely hypothetical scenario, never happens [weekly+++]

Some of the characters from that scenario were people I knew would be attending the conference. I was so disgusted watching it unfold that I had no desire to interact with them the following week at the conference. My thought as I watched the discourse was: if it is just going to be a conference of the angry men from Twitter where people are patted on the back for their snark, using a structure from the tech industry—an industry not known for inclusion—then why bother attending? Apparently, I wasn’t alone in that thinking. At the diversity hackathon we discussed how several of us invited colleagues to come who declined because, due to their perceptions of who was going to be there and how those people often engage on social media, they did not feel it was worth their time.

I went despite my hesitation and am glad I did—it was the best conference I’ve ever attended. The attendees were not only warm and welcoming in real life, they also seemed to genuinely care about working together to improve our science, and to improve it in equitable and inclusive ways. They really wanted to hear what the issues are, and to work together to solve them.

If we regularly engage with each other (both online and face-to-face) in the ways that participants did at SIPS 2017, the sky is the limit for what we can accomplish together. The climate in that space for those few days provided the optimal conditions for scientific progress to occur. People were able to let their guards down, to acknowledge that what we’re trying to do is f*cking hard and that none of us know all the answers, to admit and embrace that we will probably mess up along the way, and that’s ok. As long as we know more and are doing better today than we knew and did yesterday, we’re doing ok – we just have to keep pushing forward.

That approach is something that I hope those who attended can take away, and figure out how to replicate in other contexts, across different mediums of communication (particularly online). I think it’s the best way to do, and to improve, our science.

I want to thank the organizers for all of the work they put into the conference. You have no idea how much being in that setting meant to me. I look forward to continuing to work together to improve our science, and hope others will join in this endeavor.

Improving Psychological Science at SIPS

Last week was the second meeting of the Society for the Improvement of Psychological Science, a.k.a. SIPS[1]. SIPS is a service organization with the mission of advancing and supporting all of psychological science. About 200 people met in Charlottesville, VA to participate in hackathons and lightning talks and unconference sessions, go to workshops, and meet other people interested in working to improve psychology.

What Is This Thing Called SIPS?

If you missed SIPS and are wondering what happened – or even if you were there but want to know more about the things you missed – here are a few resources I have found helpful:

The conference program gives you an overview and the conference OSF page has links to most of what went on, though it’s admittedly a lot to dig through. For an easier starting point, Richie Lennie posted an email he wrote to his department with highlights and links, written specifically with non-attendees in mind.

Drilling down one level from the conference OSF page, all of the workshop presenters put their materials online. I didn’t make it to any workshops so I appreciate having access to those resources. One good example is Simine Vazire and Bobbie Spellman’s workshop on writing transparent and reproducible articles. Their slideshow shows excerpts from published papers on things like how to transparently report exploratory analyses, how to report messy results, how to interpret a null result, and more. For me, writing is a lot easier when I have examples and models to work from, and I expect that I will be referring to those in the future.

The list of hackathon OSF pages is worth browsing. Hackathons are collaborative sessions for people interested in working on a defined project. Organizers varied in how much they used OSF – some used them mainly for internal organization, while others hosted finished or near-finished products on them. A standout example of the latter category is from the graduate research methods course hackathon. Their OSF wiki has a list of 31 topics, almost all of which are live links to pages with learning goals, reading lists, demonstrations, and assignments. If you teach grad research methods, or anything else with methodsy content, go raid the site for all sorts of useful materials.

The program also had space for smaller or less formal events. Unconferences were spontaneously organized sessions, some of which grew into bigger projects. Lightning talks were short presentations, often about work in progress.

As you browse through the resources, it is also worth keeping in the back of your mind that many projects get started at SIPS but not finished there, so look for more projects to come to fruition in the weeks and months ahead.

A challenge for future SIPS meetings is going to be figuring out how to reach beyond the people physically attending the meeting and get the broadest possible engagement, and how to support dissemination of projects and initiatives that people create at SIPS. We have already gotten some valuable feedback about how other hackathons and unconferences manage that. This year’s meeting happened because of a Herculean effort by a very small group of volunteers[2] operating on a thin budget (at one point it was up in the air whether there’d even be wifi in the meeting space, if you can believe it) who had to plan an event that doubled in size from last year. As we grow we will always look for more and better ways to engage – the I in SIPS would not count for anything if the society did not apply it to itself.

My Personal Highlights

It is hard to summarize but I will mention a few highlights from things that I saw or participated in firsthand.

Neil Lewis Jr. and I co-organized a hackathon on diversity and inclusion in open science. We had so many people show up that we eventually split into five smaller groups working on different projects. My group worked on helping SIPS-the-organization start to collect member data so it can track how it is doing with respect to its diversity and inclusion goals. I posted a summary on the OSF page and would love to get feedback. (Neil is working on a guest post, so look for more here about that hackathon in the near future.)

Another session I participated in was the “diversity re-hack” on day two. The idea was that diversity and inclusion are relevant to everything, not just what comes up at a hackathon with “diversity and inclusion” in the title. So people who had worked on all the other hackathons on day one could come and workshop their in-progress projects to make them serve those goals even better. It was another well-attended session and we had representatives from nearly every hackathon group come to participate.

Katie Corker was the recipient of the society’s first-ever award, the SIPS Leadership Award. Katie has been instrumental in the creation of the society and in organizing the conference, and beyond SIPS she has also been a leader in open science in the academic community. Katie is a dynamo and deserves every bit of recognition she gets.

It was also exciting to see projects that originated at the 2016 SIPS meeting continuing to grow. During the meeting, APA announced that it will designate PsyArXiv as its preferred preprint server. And the creators of StudySwap, which also came out of SIPS 2016, just announced an upcoming Nexus (a fancy term for what we called “special issue” in the print days) with the journal Collabra: Psychology on crowdsourced research.

Speaking of which, Collabra: Psychology is now the official society journal of SIPS. It is fitting that SIPS partnered with an open-access journal, given the society’s mission. SIPS will oversee editorial responsibilities and the scientific mission of the journal, while the University of California Press will operate as the publisher.

But probably the most gratifying thing for me about SIPS was meeting early-career researchers who are excited about making psychological science more open and transparent, more rigorous and self-correcting, and more accessible and inclusive of everyone who wants to do science or could benefit from science. The challenges can sometimes feel huge, and I found it inspiring and energizing to spend time with people just starting out in the field who are dedicated to facing them.

*****

1. Or maybe it was the first meeting, since we ended last year’s meeting with a vote on whether to become a society, even though we were already calling ourselves that? I don’t know, bootstrapping is weird.

2. Not including me. I am on the SIPS Executive Committee so I got to see up close the absurd amount of work that went into making the conference. Credit for the actual heavy lifting goes to Katie Corker and Jack Arnal, the conference planning committee who made everything happen with the meeting space, hotel, meals, and all the other logistics; and the program committee of Brian Nosek, Michèle Nuijten, John Sakaluk, and Alexa Tullett, who were responsible for putting together the scientific (and, uh, I guess meta-scientific?) content of the conference.

Learning exactly the wrong lesson

For several years now I have heard fellow scientists worry that the dialogue around open and reproducible science could be used against science – to discredit results that people find inconvenient and even to de-fund science. And this has not just been fretting around the periphery. I have heard these concerns raised by scientists who hold policymaking positions in societies and journals.

A recent article by Ed Yong talks about this concern in the present political climate.

In this environment, many are concerned that attempts to improve science could be judo-flipped into ways of decrying or defunding it. “It’s been on our minds since the first week of November,” says Stuart Buck, Vice President of Research Integrity at the Laura and John Arnold Foundation, which funds attempts to improve reproducibility.

The worry is that policy-makers might ask why so much money should be poured into science if so many studies are weak or wrong? Or why should studies be allowed into the policy-making process if they’re inaccessible to public scrutiny? At a recent conference on reproducibility run by the National Academies of Sciences, clinical epidemiologist Hilda Bastian says that she and other speakers were told to consider these dangers when preparing their talks.

One possible conclusion is that this means we should slow down science’s movement toward greater openness and reproducibility. As Yong writes, “Everyone I spoke to felt that this is the wrong approach.” But as I said, those voices are out there and many could take Yong’s article as reinforcing their position. So I think it bears elaboration why that would be the wrong approach.

Probably the least principled reason, but an entirely unavoidable practical one, is just that it would be impossible. The discussion cannot be contained. Notwithstanding some defenses of gatekeeping and critiques of science discourse on social media (where much of this discussion is happening), there is just no way to keep scientists from talking about these issues in the open.

And imagine for a moment that we nevertheless tried to contain the conversation. Would that be a good idea? Consider the “climategate” faux-scandal. Opponents of climate science cooked up an anti-transparency conspiracy out of a few emails that showed nothing of the sort. Now imagine if we actually did that – if we kept scientists from discussing science’s problems in the open. And imagine that getting out. That would be a PR disaster to dwarf any misinterpretation of open science (because the worst PR disasters are the ones based in reality).

But to me, the even more compelling consideration is that if we put science’s public image first, we are inverting our core values. The conversation around open and reproducible science cuts to fundamental questions about what science is – such as that scientific knowledge is verifiable, and that it belongs to everyone – and why science offers unique value to society. We should fully and fearlessly engage in those questions and in making our institutions and practices better. We can solve the PR problem after that. In the long run, the way to make the best possible case for science is to make science the best possible.

Rather than shying away from talking about openness and reproducibility, I believe it is more critical than ever that we all pull together to move science forward. Because if we don’t, others will make changes in our name that serve other agendas.

For example, Yong’s article describes a bill pending in Congress that would set impossibly high standards of evidence for the Environmental Protection Agency to base policy on. Those standards are wrapped in the rhetoric of open science. But as Michael Eisen says in the article, “It won’t produce regulations based on more open science. It’ll just produce fewer regulations.” This is almost certainly the intended effect.

As long as scientists – individually and collectively in our societies and journals – drag our heels on making needed reforms, there will be a vacuum that others will try to fill. Turn that around, and the better the scientific community does its job of addressing openness and transparency in the service of actually making science do what science is supposed to do – making it more open, more verifiable, more accessible to everyone – the better positioned we will be to rebut those kinds of efforts by saying, “Nope, we got this.”