Reflections on SIPS (guest post by Neil Lewis, Jr.)

The following is a guest post by Neil Lewis, Jr. Neil is an assistant professor at Cornell University.

Last week I visited the Center for Open Science in Charlottesville, Virginia to participate in the second annual meeting of the Society for the Improvement of Psychological Science (SIPS). It was my first time going to SIPS, and I didn’t really know what to expect. The structure was unlike any other conference I’ve been to—it had very little formal structure—there were a few talks and workshops here and there, but the vast majority of the time was devoted to “hackathons” and “unconference” sessions where people got together and worked on addressing pressing issues in the field: making journals more transparent, designing syllabi for research methods courses, forming a new journal, changing departmental/university culture to reward open science practices, making open science more diverse and inclusive, and much more. Participants were free to work on whatever issues we wanted to and to set our own goals, timelines, and strategies for achieving those goals.

I spent most of the first two days at the diversity and inclusion hackathon that Sanjay and I co-organized. These sessions blew me away. Maybe we’re a little cynical, but going into the conference we thought two or three people might stop by and that it would essentially be the two of us trying to figure out what to do to make open science more diverse and inclusive. Instead, we had almost 40 people come and spend the first day identifying barriers to diversity and inclusion, and developing tools to address those barriers. We had sub-teams working on (1) improving measurement of diversity statistics (hard to know how much of a diversity problem one has if there’s poor measurement), (2) figuring out methods to assist those who study hard-to-reach populations, (3) articulating the benefits of open science and resources to get started for those who are new, (4) leveraging social media for mentorship on open science practices, and (5) developing materials to help PIs and institutions more broadly recruit and retain traditionally underrepresented students/scholars. Although we’re not finished, each team made substantial headway in its area.

On the second day, those teams continued working, but in addition we had a “re-hack” that allowed teams that were working on other topics (e.g., developing research methods syllabi, developing guidelines for reviewers, starting a new academic journal) to present their ideas and get feedback on how to make their projects/products more inclusive from the very beginning (rather than having diversity and inclusion be an afterthought as is often the case). Once again, it was inspiring to see how committed people were to making sure so many dimensions of our science become more inclusive.

These sessions, and so many others at the conference, gave me a lot of hope for the field—hope that I (and I suspect others) could really use (special shout-outs to Jessica Flake’s unconference on improving measurement, Daniel Lakens and Jeremy Biesanz’s workshop on sample size and effect size, and Liz Page-Gould and Alex Danvers’s workshop on Fundamentals of R for data analysis). It’s been a tough few years to be a scientist. I was working on my PhD in social psychology at the time that the Open Science Collaboration published its report estimating the reproducibility of psychological science to be somewhere between one-third and one-half. Then a similar report came out about the state of cancer research – only twenty-five percent of papers replicated there. Now it seems like at least once a month a new failed replication appears, or another study turns out to have major methodological flaws. As someone just starting out, constantly seeing findings I learned were fundamental fail to replicate, and new work emerge so flawed, I often find myself wondering (a) what the hell do we actually know, and (b) if so many others can’t get it right, what chance do I have?

Many Big Challenges with No Easy Solutions

To try and minimize future fuck-ups in my own work, I started following a lot of methodologists on Twitter so that I could stay in the loop on what I need to do to get things right (or at least not horribly wrong). There are a lot of proposed solutions out there (and some argument about those solutions, e.g., p < .005) but there are some big ones that seem to have reached consensus, including vastly increasing the size of our samples to increase the reliability of findings. These solutions make sense for addressing the issues that got us to this point, but the more I’ve thought about and talked to others about them, the more it became clear that some may unintentionally create another problem along the way, which is to “crowd out” some research questions and researchers. For example, when talking with scholars who study hard-to-reach populations (e.g., racial and sexual minorities), a frequently voiced concern is that it is nearly impossible to recruit the sample sizes needed to meet new thresholds of evidence.

To provide an example from my own research, I went to graduate school intending to study racial-ethnic disparities in academic outcomes (particularly Black-White achievement gaps). In my first semester at the University of Michigan I asked my advisor to pay for a pre-screen of the Department of Psychology’s participant pool to see how many Black students I would have to work with if I pursued that line of research. There were 42 Black students in the pool that semester. Forty-two. Out of 1,157. If memory serves me well, that was actually one of the highest concentrations of Black students in the pool in my entire time there. Seeing that, I asked others who study racial minorities what they did. I learned that unless they had well-funded advisors who could afford to pay for their samples, many either shifted their research questions to topics that were more feasible to study, or spent their graduate careers collecting data for one or two studies. In my area, the latter approach was not a practical path to employment—professional development courses taught us that search committees expect multiple publications in the flagship journals, and those flagship journals usually require multiple studies for publication.

Learning about those dynamics, I temporarily shifted my research away from racial disparities until I figured out how to feasibly study those topics. In the interim, I studied other topics where I could recruit enough people to do the multi-study papers that were expected. That is not to say I am uninterested in those other topics (I am very much interested in them), but disparities were what interested me most. Now, some may read that and think ‘Neil, that’s so careerist of you! You should have pursued the questions you were most passionate about, regardless of how long it took!’ And on an idealistic level, I agree with those people. But on a practical level—I have to keep a roof over my head and eat. There was no safety net at home if I was unable to get a job at the end of the program. So I played it safe for a few years before going back to the central questions that brought me to academia in the first place.

That was my solution. Others left altogether. As one friend depressingly put it—“there’s no more room for people like us; unless we get lucky with the big grants that are harder and harder to get, we can’t ask our questions—not when power analyses now say we need hundreds per cell; we’ve been priced out of the market.” And they’re not entirely wrong. Some collaborators and I recently ran a survey experiment with Black American participants: a 20-minute survey with 500 respondents. That one study cost us $11,000. Oh, and it’s a study for a paper that requires multiple studies. The only reason we can do this project is that we have a senior faculty collaborator who has an endowed chair and hence deep research pockets.
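To put rough numbers on “hundreds per cell,” here is a back-of-the-envelope sketch using the standard normal approximation for a two-group comparison. The assumed effect size of d = 0.3 is purely illustrative, and the per-participant cost is just my $11,000 divided by 500.

```python
# Back-of-the-envelope sketch: approximate n per cell for a two-sided,
# two-group comparison (normal approximation), and what recruiting that many
# people would cost at the ~$22 per participant my survey worked out to
# ($11,000 / 500). The assumed effect size d = 0.3 is illustrative only.
from scipy.stats import norm

def n_per_cell(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided two-sample test of effect size d."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return 2 * ((z_alpha + z_power) / d) ** 2

n = n_per_cell(0.3)                    # roughly 175 per cell
cost_per_participant = 11000 / 500     # about $22, from the study above
total_cost = 2 * n * cost_per_participant
print(f"~{n:.0f} participants per cell; one two-cell study: ~${total_cost:,.0f}")
```

And that is for a single study; the multi-study norm in the flagship journals multiplies that cost several times over.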

So that is the state of affairs. The goal post keeps shifting, and it seems that those of us who already had difficulty asking our questions have to choose between pursuing the questions we’re interested in, and pursuing questions that are practical for keeping roofs over our heads (e.g., questions that can be answered for $0.50 per participant on MTurk). And for a long time this has been discouraging because it felt as though those who have been leading the charge on research reform did not care. An example that reinforces this sentiment is a quote that floated around Twitter just last week. A researcher giving a talk at a conference said “if you’re running experiments with low sample n, you’re wasting your time. Not enough money? That’s not my problem.”

That researcher is not wrong. For all the reasons methodologists have been writing about for the past few years (and really, past few decades), issues like small sample sizes do compromise the integrity of our findings. At the same time, I can’t help but wonder about what we lose when the discussion stops there, at “that’s not my problem.” He’s right—it’s not his personal problem. But it is our collective problem, I think. What questions are we missing out on when we squeeze out those who do not have the thousands or millions of dollars it takes to study some of these topics? That’s a question that sometimes keeps me up at night, particularly the nights after conversations with colleagues who have incredibly important questions that they’ll never pursue because of the constraints I just described.

A Chance to Make Things Better

Part of what was so encouraging about SIPS was that not only did we begin discussing these issues, but people immediately took them seriously and started working on strategies to address them—putting together resources on “small-n designs” for those who can’t recruit the big samples, to name just one example. I have never seen issues of diversity and inclusion taken so seriously anywhere, and I’ve been involved in quite a few diversity and inclusion initiatives (given the short length of my career). At SIPS, people were working tirelessly to make actionable progress on these issues. And again, it wasn’t a fringe group of women and minority scholars doing this work as is so often the case—we had one of the largest hackathons at the conference. I really wish more people had been there to witness it—it was amazing, and energizing. It was the best of science—a group of committed individuals working incredibly hard to understand and address some of the most difficult questions that are still unanswered, and producing practical solutions to pressing social issues.

Now it is worth noting that I had some skepticism going into the conference. When I first learned about it I went back-and-forth on whether I should go; and even the week before the conference, I debated canceling the trip. I debated canceling because there was yet another episode of the “purely hypothetical scenario” that Will Gervais described in his recent blog post:

A purely hypothetical scenario, never happens [weekly+++]

Some of the characters from that scenario were people I knew would be attending the conference. I was so disgusted watching it unfold that I had no desire to interact with them the following week at the conference. My thought as I watched the discourse was: if this is just going to be a conference of the angry men from Twitter, where people are patted on the back for their snark, built around a structure borrowed from the tech industry (an industry not known for inclusion), then why bother attending? Apparently, I wasn’t alone in that thinking. At the diversity hackathon we discussed how several of us had invited colleagues who declined because, given their perceptions of who was going to be there and how those people often engage on social media, they did not feel it was worth their time.

I went despite my hesitation and am glad I did—it was the best conference I’ve ever attended. The attendees were not only warm and welcoming in real life, they also seemed to genuinely care about working together to improve our science, and to improve it in equitable and inclusive ways. They really wanted to hear what the issues are, and to work together to solve them.

If we regularly engage with each other (both online and face-to-face) in the ways that participants did at SIPS 2017, the sky is the limit for what we can accomplish together. The climate in that space for those few days provided the optimal conditions for scientific progress to occur. People were able to let their guards down, to acknowledge that what we’re trying to do is f*cking hard and that none of us know all the answers, to admit and embrace that we will probably mess up along the way, and that’s ok. As long as we know more and are doing better today than we knew and did yesterday, we’re doing ok – we just have to keep pushing forward.

That approach is something that I hope those who attended can take away, and figure out how to replicate in other contexts, across different mediums of communication (particularly online). I think it’s the best way to do, and to improve, our science.

I want to thank the organizers for all of the work they put into the conference. You have no idea how much being in that setting meant to me. I look forward to continuing to work together to improve our science, and hope others will join in this endeavor.

Improving Psychological Science at SIPS

Last week was the second meeting of the Society for the Improvement of Psychological Science, a.k.a. SIPS[1]. SIPS is a service organization with the mission of advancing and supporting all of psychological science. About 200 people met in Charlottesville, VA to participate in hackathons and lightning talks and unconference sessions, go to workshops, and meet other people interested in working to improve psychology.

What Is This Thing Called SIPS?

If you missed SIPS and are wondering what happened – or even if you were there but want to know more about the things you missed – here are a few resources I have found helpful:

The conference program gives you an overview and the conference OSF page has links to most of what went on, though it’s admittedly a lot to dig through. For an easier starting point, Richie Lennie posted an email he wrote to his department with highlights and links, written specifically with non-attendees in mind.

Drilling down one level from the conference OSF page, all of the workshop presenters put their materials online. I didn’t make it to any workshops so I appreciate having access to those resources. One good example is Simine Vazire and Bobbie Spellman’s workshop on writing transparent and reproducible articles. Their slideshow shows excerpts from published papers on things like how to transparently report exploratory analyses, how to report messy results, how to interpret a null result, and more. For me, writing is a lot easier when I have examples and models to work from, and I expect that I will be referring to those in the future.

The list of hackathon OSF pages is worth browsing. Hackathons are collaborative sessions for people interested in working on a defined project. Organizers varied in how much they used OSF – some used their pages mainly for internal organization, while others hosted finished or near-finished products on them. A standout example of the latter category is from the graduate research methods course hackathon. Their OSF wiki has a list of 31 topics, almost all of which are live links to pages with learning goals, reading lists, demonstrations, and assignments. If you teach grad research methods, or anything else with methodsy content, go raid the site for all sorts of useful materials.

The program also had space for smaller or less formal events. Unconferences were spontaneously organized sessions, some of which grew into bigger projects. Lightning talks were short presentations, often about work in progress.

As you browse through the resources, it is also worth keeping in the back of your mind that many projects get started at SIPS but not finished there, so look for more projects to come to fruition in the weeks and months ahead.

A challenge for future SIPS meetings is going to be figuring out how to reach beyond the people physically attending the meeting and get the broadest possible engagement, as well as how to support dissemination of projects and initiatives that people create at SIPS. We have already gotten some valuable feedback about how other hackathons and unconferences manage that. This year’s meeting happened because of a Herculean effort by a very small group of volunteers[2] operating on a thin budget (at one point it was up in the air whether there’d even be wifi in the meeting space, if you can believe it) who had to plan an event that doubled in size from last year. As we grow we will always look for more and better ways to engage – the I in SIPS would not count for anything if the society did not apply it to itself.

My Personal Highlights

It is hard to summarize but I will mention a few highlights from things that I saw or participated in firsthand.

Neil Lewis Jr. and I co-organized a hackathon on diversity and inclusion in open science. We had so many people show up that we eventually split into five smaller groups working on different projects. My group worked on helping SIPS-the-organization start to collect member data so it can track how it is doing with respect to its diversity and inclusion goals. I posted a summary on the OSF page and would love to get feedback. (Neil is working on a guest post, so look for more here about that hackathon in the near future.)

Another session I participated in was the “diversity re-hack” on day two. The idea was that diversity and inclusion are relevant to everything, not just what comes up at a hackathon with “diversity and inclusion” in the title. So people who had worked on all the other hackathons on day one could come and workshop their in-progress projects to make them serve those goals even better. It was another well-attended session and we had representatives from nearly every hackathon group come to participate.

Katie Corker received the society’s first award, the SIPS Leadership Award. Katie has been instrumental in the creation of the society and in organizing the conference, and beyond SIPS she has also been a leader in open science in the academic community. Katie is a dynamo and deserves every bit of recognition she gets.

It was also exciting to see projects that originated at the 2016 SIPS meeting continuing to grow. During the meeting, APA announced that it will designate PsyArXiv as its preferred preprint server. And the creators of StudySwap, which also came out of SIPS 2016, just announced an upcoming Nexus (a fancy term for what we called “special issue” in the print days) with the journal Collabra: Psychology on crowdsourced research.

Speaking of which, Collabra: Psychology is now the official society journal of SIPS. It is fitting that SIPS partnered with an open-access journal, given the society’s mission. SIPS will oversee editorial responsibilities and the scientific mission of the journal, while the University of California Press will operate as the publisher.

But probably the most gratifying thing for me about SIPS was meeting early-career researchers who are excited about making psychological science more open and transparent, more rigorous and self-correcting, and more accessible and inclusive of everyone who wants to do science or could benefit from science. The challenges can sometimes feel huge, and I found it inspiring and energizing to spend time with people just starting out in the field who are dedicated to facing them.

*****

1. Or maybe it was the first meeting, since we ended last year’s meeting with a vote on whether to become a society, even though we were already calling ourselves that? I don’t know, bootstrapping is weird.

2. Not including me. I am on the SIPS Executive Committee so I got to see up close the absurd amount of work that went into making the conference. Credit for the actual heavy lifting goes to Katie Corker and Jack Arnal, the conference planning committee who made everything happen with the meeting space, hotel, meals, and all the other logistics; and the program committee of Brian Nosek, Michèle Nuijten, John Sakaluk, and Alexa Tullett, who were responsible for putting together the scientific (and, uh, I guess meta-scientific?) content of the conference.

Learning exactly the wrong lesson

For several years now I have heard fellow scientists worry that the dialogue around open and reproducible science could be used against science – to discredit results that people find inconvenient and even to de-fund science. And this has not just been fretting around the periphery. I have heard these concerns raised by scientists who hold policymaking positions in societies and journals.

A recent article by Ed Yong talks about this concern in the present political climate.

In this environment, many are concerned that attempts to improve science could be judo-flipped into ways of decrying or defunding it. “It’s been on our minds since the first week of November,” says Stuart Buck, Vice President of Research Integrity at the Laura and John Arnold Foundation, which funds attempts to improve reproducibility.

The worry is that policy-makers might ask why so much money should be poured into science if so many studies are weak or wrong? Or why should studies be allowed into the policy-making process if they’re inaccessible to public scrutiny? At a recent conference on reproducibility run by the National Academies of Sciences, clinical epidemiologist Hilda Bastian says that she and other speakers were told to consider these dangers when preparing their talks.

One possible conclusion is that this means we should slow down science’s movement toward greater openness and reproducibility. As Yong writes, “Everyone I spoke to felt that this is the wrong approach.” But as I said, those voices are out there and many could take Yong’s article as reinforcing their position. So I think it bears spelling out why that would be the wrong approach.

Probably the least principled reason, but an entirely unavoidable practical one, is just that it would be impossible. The discussion cannot be contained. Notwithstanding some defenses of gatekeeping and critiques of science discourse on social media (where much of this discussion is happening), there is just no way to keep scientists from talking about these issues in the open.

And imagine for a moment that we nevertheless tried to contain the conversation. Would that be a good idea? Consider the “climategate” faux-scandal. Opponents of climate science cooked up an anti-transparency conspiracy out of a few emails that showed nothing of the sort. Now imagine if we actually did that – if we kept scientists from discussing science’s problems in the open. And imagine that getting out. That would be a PR disaster to dwarf any misinterpretation of open science (because the worst PR disasters are the ones based in reality).

But to me, the even more compelling consideration is that if we put science’s public image first, we are inverting our core values. The conversation around open and reproducible science cuts to fundamental questions about what science is – such as that scientific knowledge is verifiable, and that it belongs to everyone – and why science offers unique value to society. We should fully and fearlessly engage in those questions and in making our institutions and practices better. We can solve the PR problem after that. In the long run, the way to make the best possible case for science is to make science the best possible.

Rather than shying away from talking about openness and reproducibility, I believe it is more critical than ever that we all pull together to move science forward. Because if we don’t, others will make changes in our name that serve other agendas.

For example, Yong’s article describes a bill pending in Congress that would set impossibly high standards of evidence for the Environmental Protection Agency to base policy on. Those standards are wrapped in the rhetoric of open science. But as Michael Eisen says in the article, “It won’t produce regulations based on more open science. It’ll just produce fewer regulations.” This is almost certainly the intended effect.

As long as we scientists – individually and collectively, in our societies and journals – drag our heels on making needed reforms, there will be a vacuum that others will try to fill. Turn that around, and the better the scientific community does its job of addressing openness and transparency in the service of actually making science do what science is supposed to do – making it more open, more verifiable, more accessible to everyone – the better positioned we will be to rebut those kinds of efforts by saying, “Nope, we got this.”

False-positive psychology five years later

Joe Simmons, Leif Nelson, and Uri Simonsohn have written a 5-years-later[1] retrospective on their “false-positive psychology” paper. It is for an upcoming issue of Perspectives on Psychological Science dedicated to the most-cited articles from APS publications. A preprint is now available.

It’s a short and snappy read with some surprises and gems. For example, footnote 2 notes that the Journal of Consumer Research declined to adopt their disclosure recommendations because they might “dull … some of the joy scholars may find in their craft.” No, really.

For the youngsters out there, they do a good job of capturing in a sentence a common view of what we now call p-hacking: “Everyone knew it was wrong, but they thought it was wrong the way it’s wrong to jaywalk. We decided to write ‘False-Positive Psychology’ when simulations revealed it was wrong the way it’s wrong to rob a bank.”[2]
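For readers who have not seen that kind of simulation, here is a minimal sketch of the genre (not the authors’ actual code): generate data with no true effect, allow a single researcher degree of freedom (optional stopping), and watch the nominal 5% false-positive rate inflate.

```python
# Minimal sketch of a false-positive inflation simulation (not the authors'
# code): both groups are drawn from the same distribution, so every
# "significant" result is a false positive. We peek with a t-test every 10
# subjects per cell and stop as soon as p < .05.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_sims, n_max, step, alpha = 5000, 100, 10, 0.05

false_positives = 0
for _ in range(n_sims):
    a = rng.normal(size=n_max)
    b = rng.normal(size=n_max)
    for n in range(step, n_max + 1, step):
        if ttest_ind(a[:n], b[:n]).pvalue < alpha:
            false_positives += 1
            break

print(f"False-positive rate with optional stopping: {false_positives / n_sims:.1%}")
```

In a sketch like this, one degree of freedom alone pushes the error rate to several times the nominal 5%; combining multiple degrees of freedom, as the original paper’s simulations did, pushes it far higher still.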

The retrospective also contains a review of how the paper has been cited in three top psychology journals. About half of the citations are from researchers following the original paper’s recommendations, but typically only a subset of them. The most common citation practice is to justify having barely more than 20 subjects per cell, which they now call a “comically low threshold” and treat with considerably more nuance.
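For context on why 20 per cell now reads as comically low, here is a rough power calculation (again a sketch using the normal approximation; the assumed effect size of d = 0.4 is illustrative, not a number from their paper):

```python
# Rough power of a two-sided, two-sample test with n = 20 per cell, using the
# normal approximation. The assumed effect size d = 0.4 is illustrative only.
from scipy.stats import norm

def approx_power(d, n_per_cell, alpha=0.05):
    z_alpha = norm.ppf(1 - alpha / 2)
    noncentrality = d * (n_per_cell / 2) ** 0.5
    return 1 - norm.cdf(z_alpha - noncentrality)

print(f"Power with n = 20 per cell and d = 0.4: {approx_power(0.4, 20):.0%}")
```

Under those assumptions power comes out to roughly a quarter, meaning most studies of that size would miss a true effect of that magnitude.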

But to me, the most noteworthy passage was this one because it speaks to institutional pushback on the most straightforward of their recommendations:

Our paper has had some impact. Many psychologists have read it, and it is required reading in at least a few methods courses. And a few journals – most notably, Psychological Science and Social Psychological and Personality Science – have implemented disclosure requirements of the sort that we proposed (Eich, 2014; Vazire, 2015). At the same time, it is worth pointing out that none of the top American Psychological Association journals have implemented disclosure requirements, and that some powerful psychologists (and journal interests) remain hostile to costless, common sense proposals to improve the integrity of our field.

Certainly there are some small refinements you could make to some of the original paper’s disclosure recommendations. For example, Psychological Science requires you to disclose all variables “that were analyzed for this article’s target research question,” not all variables period. Which is probably an okay accommodation for big multivariate studies with lots of measures.[3]

But it is odd to be broadly opposed to disclosing information in scientific publications that other scientists would consider relevant to evaluating the conclusions. And yet I have heard these kinds of objections raised many times. What is lost by saying that researchers have to report all the experimental conditions they ran, or whether data points were excluded and why? Yet here we are in 2017 and you can still get around doing that.

1. Well, five-ish. The paper came out in late 2011.

2. Though I did not have the sense at the time that everyone knew about everything. Rather, knowledge varied: a given person might think that fiddling with covariates was like jaywalking (technically wrong but mostly harmless), that undisclosed dropping of experimental conditions was a serious violation, but be completely oblivious to the perils of optional stopping. And a different person might have had a different constellation of views on the same 3 issues.

3. A counterpoint is that if you make your materials open, then without clogging up the article proper, you allow interested readers to go and see for themselves.

The University of Oregon should re-name Deady and Dunn Halls

Background: My university is currently weighing whether to re-name two buildings on campus. It was prompted to do so by the UO Black Student Task Force, which demanded last year that the buildings be renamed. Deady Hall is an academic building named after Matthew Deady, a politician and later judge who advocated successfully at the founding of our state to exclude Black citizens from residing here. Dunn Hall is a residence hall named after a former professor who was also an Exalted Cyclops of the KKK. Our president, Michael Schill, created a process for making the decision and appointed a commission of historians to study the two figures and their legacy. The commission has issued its report, and now Schill has invited comment from the university community. Below is the comment that I submitted:

Both buildings should be re-named. I find myself very much in agreement with the reasoning that Matthew Dennis stated in his August 21 Register-Guard editorial. Building names are not neutral markers; they are a way to put a name in a place of prominence and honor the namesake. The idea that we need to keep their names on the buildings as “reminders” simply does not stand up to scrutiny. We have a history department to teach us about history. We name buildings for other reasons.

Dunn Hall seems like the obvious case, being named after a former Exalted Cyclops of the KKK. I fear that in doing one obviously right thing, the university will feel morally licensed to “split the difference” and keep Deady. That would be a mistake.

At the founding of our state, Deady actively promoted the exclusion of Black citizens from Oregon. The defense of Deady seems to rest primarily on his later stance toward Chinese immigrants and descendants. The implication is that this somehow erases his lifelong anti-Black racism, as if racism against one group and atonement toward another are fungible. This strikes me as a distinctly White perspective, viewing non-White groups as interchangeable. Will we look in the face of a community that has been harmed, proclaim “But he was decent to those other people!” and expect them to accept that as amends?

The fact is that Deady never made amends for his anti-Black racism, he never disavowed it, and his actions are still reverberating today in a state whose population includes about 2% African Americans. My department (Psychology) has never had an African American tenure-track professor, and I have been told that the same is true across the entire Division of Natural Sciences. While the reasons are surely complex, I will note that when my department has tried to recruit African American faculty, the underrepresentation of African Americans in our community has come up as a challenge in enticing people to move here and make Oregon their home. That underrepresentation is the direct legacy of Matthew Deady’s political activism. This is not a man that the university should be honoring.

Postscript: You can read more about the history of racism in Oregon in Matt Novak’s well-researched article at Gizmodo.

Everything is fucked: The syllabus

PSY 607: Everything is Fucked
Prof. Sanjay Srivastava
Class meetings: Mondays 9:00 – 10:50 in 257 Straub
Office hours: Held on Twitter at your convenience (@hardsci)

In a much-discussed article at Slate, social psychologist Michael Inzlicht told a reporter, “Meta-analyses are fucked” (Engber, 2016). What does it mean, in science, for something to be fucked? Fucked needs to mean more than that something is complicated or must be undertaken with thought and care, as that would be trivially true of everything in science. In this class we will go a step further and say that something is fucked if it presents hard conceptual challenges to which implementable, real-world solutions for working scientists are either not available or routinely ignored in practice.

The format of this seminar is as follows: Each week we will read and discuss 1-2 papers that raise the question of whether something is fucked. Our focus will be on things that may be fucked in research methods, scientific practice, and philosophy of science. The potential fuckedness of specific theories, research topics, etc. will not be the focus of this class per se, but rather will be used to illustrate these important topics. To that end, each week a different student will be assigned to find a paper that illustrates the fuckedness (or lack thereof) of that week’s topic, and give a 15-minute presentation about whether it is indeed fucked.

Grading:

20% Attendance and participation
30% In-class presentation
50% Final exam

Week 1: Psychology is fucked

Meehl, P. E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66, 195-244.

Week 2: Significance testing is fucked

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304-1312.

Rouder, J. N., Morey, R. D., Verhagen, J., Province, J. M., & Wagenmakers, E. J. (2016). Is there a free lunch in inference? Topics in Cognitive Science, 8, 520-547.

Week 3: Causal inference from experiments is fucked

Chapter 3 from: Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.

Week 4: Mediation is fucked

Bullock, J. G., Green, D. P., & Ha, S. E. (2010). Yes, but what’s the mechanism? (don’t expect an easy answer). Journal of Personality and Social Psychology, 98, 550-558.

Week 5: Covariates are fucked

Culpepper, S. A., & Aguinis, H. (2011). Using analysis of covariance (ANCOVA) with fallible covariates. Psychological Methods, 16, 166-178.

Westfall, J., & Yarkoni, T. (2016). Statistically controlling for confounding constructs is harder than you think. PLoS ONE, 11, e0152719.

Week 6: Replicability is fucked

Pashler, H., & Harris, C. R. (2012). Is the replicability crisis overblown? Three arguments examined. Perspectives on Psychological Science, 7, 531-536.

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.

Week 7: Interlude: Everything is fine, calm the fuck down

Gilbert, D. T., King, G., Pettigrew, S., & Wilson, T. D. (2016). Comment on “Estimating the reproducibility of psychological science.” Science, 351, 1037a.

Maxwell, S. E., Lau, M. Y., & Howard, G. S. (2015). Is psychology suffering from a replication crisis? What does “failure to replicate” really mean? American Psychologist, 70, 487-498.

Week 8: Scientific publishing is fucked

Fanelli, D. (2011). Negative results are disappearing from most disciplines and countries. Scientometrics, 90, 891-904.

Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Med, 2, e124.

Week 9: Meta-analysis is fucked

Inzlicht, M., Gervais, W., & Berkman, E. (2015). Bias-Correction Techniques Alone Cannot Determine Whether Ego Depletion is Different from Zero: Commentary on Carter, Kofler, Forster, & McCullough, 2015. Available at SSRN: http://ssrn.com/abstract=2659409 or http://dx.doi.org/10.2139/ssrn.2659409

Van Elk, M., Matzke, D., Gronau, Q. F., Guan, M., Vandekerckhove, J., & Wagenmakers, E. J. (2015). Meta-analyses are no substitute for registered replications: A skeptical perspective on religious priming. Frontiers in Psychology, 6.

Week 10: The scientific profession is fucked

Bakker, M., van Dijk, A., & Wicherts, J. M. (2012). The rules of the game called psychological science. Perspectives on Psychological Science, 7, 543-554.

Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7, 615-631.

Finals week

Wear black and bring a #2 pencil.

Don’t change your family-friendly tenure extension policy just yet


If you are an academic and on social media, then over the last weekend your feed was probably full of mentions of an article by economist Justin Wolfers in the New York Times titled “A Family-Friendly Policy That’s Friendliest to Male Professors.”

It describes a study by three economists of the effects of parental tenure extension policies, which give an extra year on the tenure clock when people become new parents. The conclusion is that tenure extension policies do make it easier for men to get tenure, but they unexpectedly make it harder for women. The finding has a counterintuitive flavor – a policy couched in gender-neutral terms and designed to help families actually widens a gender gap.

Except there are a bunch of odd things that start to stick out when you look more closely at the details, and especially at the original study.

Let’s start with the numbers in the NYT writeup:

The policies led to a 19 percentage-point rise in the probability that a male economist would earn tenure at his first job. In contrast, women’s chances of gaining tenure fell by 22 percentage points. Before the arrival of tenure extension, a little less than 30 percent of both women and men at these institutions gained tenure at their first jobs.

Two things caught my attention when I read this. First, that a 30% tenure rate sounded awfully low to me (this is at the top-50 PhD-granting economics departments). Second, that tenure extension policies took the field from parity (“30 percent of both men and women”) to a 6-to-1 lopsided rate favoring men (the effects are percentage points, so it goes to a 49% tenure rate for men vs. 8% for women). That would be a humongous effect size.
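To be explicit about where that 6-to-1 figure comes from, it is just the reported percentage-point effects applied to the roughly 30% baseline:

```python
# The arithmetic behind the "6-to-1" characterization: percentage-point
# effects from the study applied to the ~30% baseline tenure-at-first-job rate.
baseline = 0.30
men = baseline + 0.19      # 49% implied tenure rate at first job
women = baseline - 0.22    # 8% implied tenure rate at first job
print(f"men: {men:.0%}, women: {women:.0%}, ratio: {men / women:.1f} to 1")
```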

Regarding the 30% tenure rate, it turns out the key words are “at their first jobs.” This analysis compared people who got tenure at their first job to everybody else — which means that leaving for a better outside offer is treated the same in this analysis as being denied tenure. So the tenure-at-first-job variable is not a clear indicator of whether the policy is helping or hurting a career. What if you look at the effect of the policy on getting tenure anywhere? The authors did that, and they summarize the analysis succinctly: “We find no evidence that gender-neutral tenure clock stopping policies reduce the fraction of women who ultimately get tenure somewhere” (p. 4). That seems pretty important.

What about that swing from gender-neutral to a 6-to-1 disparity in the at-first-job analysis? Consider this: “There are relatively few women hired at each university during the sample period. On average, only four female assistant professors were hired at each university between 1985 and 2004, compared to 17 male assistant professors” (p. 17). That was a stop-right-there moment for me: if you are an economics department worried about gender equality, maybe instead of rethinking tenure extensions you should be looking at your damn hiring practices. But as far as the present study goes, there are n = 62 women at institutions that never adopted gender-neutral tenure extension policies, and n = 129 at institutions that did. (It’s even worse than that because only a fraction of them are relevant for estimating the policy effect; more on that below). With a small sample there is going to be a lot of uncertainty in the estimates under the best of conditions. And it’s not the best of conditions: Within the comparison group (the departments that never adopted a tenure extension policy), there are big, differential changes in men’s and women’s tenure rates over the study period (1985 to 2004): Over time, men’s tenure rate drops by about 25%, and women’s tenure rate doubles from 12% to 25%. Any observed effect of a department adopting a tenure-extension policy is going to have to be estimated in comparison to that noisy, moving target.
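To get a feel for how much uncertainty samples that size carry, here is a rough sketch of 95% confidence intervals around a 30% rate estimated from 62 versus 129 people. It uses a simple normal approximation and the baseline rate from the writeup; the study’s actual estimates involve many more moving parts, so treat this only as an order-of-magnitude illustration.

```python
# Rough 95% confidence intervals for a proportion estimated from small samples
# (simple normal approximation). The 30% rate is the baseline from the NYT
# writeup; the point is the width of the intervals, not the exact numbers.
from math import sqrt

def approx_ci(p, n, z=1.96):
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

for n in (62, 129):
    lo, hi = approx_ci(0.30, n)
    print(f"n = {n}: estimate 30%, 95% CI roughly {lo:.0%} to {hi:.0%}")
```

That is give-or-take around 8 to 11 percentage points of pure sampling noise, before even getting to the noisy, moving-target comparison group.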

Critically, the statistical comparison of tenure-extension policy is averaged over every assistant professor in the sample, regardless of whether the individual professor used the policy. (The authors don’t have data on who took a tenure extension, or even on who had kids.) But causation is only defined for those individuals in whom we could observe a potential outcome at either level of the treatment. In plain English: “How does this policy affect people” only makes sense for people who could have been affected by the policy — meaning people who had kids as assistant professors, and therefore could have taken an extension if one were available. So if the policy did have an effect in this dataset, we should expect it to be a very small one because we are averaging it with a bunch of cases that by definition could not possibly show the effect. In light of that, a larger effect should make us more skeptical, not more persuaded.
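Here is a toy illustration of that dilution. The numbers are made up (as noted above, the study has no data on who had kids or took an extension), but they show how averaging over cases the policy could not affect shrinks whatever effect is there:

```python
# Toy dilution example: if the policy can only affect faculty who had kids on
# the tenure clock, averaging over everyone shrinks the estimable effect.
# Both numbers below are hypothetical; the study has no data on who had kids
# or who took an extension.
share_who_had_kids = 1 / 3          # hypothetical share of assistant profs
effect_among_parents = 0.15         # hypothetical 15-point effect for them
full_sample_effect = share_who_had_kids * effect_among_parents
print(f"Effect visible in the full-sample average: "
      f"{full_sample_effect * 100:.0f} percentage points")
```

Run in reverse, full-sample effects of 19 and 22 percentage points would, under that made-up one-third share, imply effects of well over 50 points among the parents themselves, which is the sense in which a bigger estimate should raise more suspicion, not less.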

There is also the odd finding that in departments that offered tenure extension policies, men took less time to get to tenure (about 1 year less on average). This is the opposite of what you’d expect if “men who took parental leave used the extra year to publish their research” as the NYT writeup claims. The original study authors offer a complicated, speculative story about why time-to-tenure would not be expected to change in the obvious way. If you accept the story, it requires invoking a bunch of mechanisms that are not measured in the paper and likely would add more noise and interpretive ambiguity to the estimates of interest.

There were still other analytic decisions that I had trouble understanding. For example, the authors excluded people who had 0 or 1 publication in their first 2 years. Shouldn’t that variance go into the didn’t-get-tenure side of the analyses? And the analyses include a whole bunch of covariates without a lot of discussion (and no pre-registration to limit researcher degrees of freedom). One of the covariates had a strange effect: holding a degree from a top-10 PhD-granting institution makes it less likely that you will get tenure in your first job. This does make sense if you think that top-10 graduates are likely to get killer outside offers – but then that just reinforces the lack of clarity about what the tenure-in-first-job variable is really an indicator of.

But when all is said and done, probably the most important part of the paper is two sentences right on the title page:

IZA Discussion Papers often represent preliminary work and are circulated to encourage discussion. Citation of such a paper should account for its provisional character.

The NYT writeup does no such thing; in fact it goes the opposite direction, trying to draw broad generalizations and make policy recommendations. This is no slight against the study’s original authors – it is typical in economics to circulate working papers for discussion and critique. Maybe they’d have compelling responses to everything I said, who knows? But at this stage, I have a hard time seeing how this working paper is ready for a popular media writeup for general consumption.

The biggest worry I have is that university administrators might take these results and run with them. I do agree with the fundamental motivation for doing this study, which is that policies need to be evaluated by their effects. Sometimes superficially gender-neutral policies have disparate impacts when they run into the realities of biological and social roles (“primary caregiver” leave policies being a case in point). It’s fairly obvious that in many ways the academic workplace is not structured to support involved parenthood, especially motherhood. But are tenure extension policies making the problem worse, making it better, or making no difference? For all the reasons outlined above, I don’t think this research gives us an actionable answer. Policy should come from cumulative knowledge, not noisy and ambiguous preliminary findings.

In the meantime, what administrator would not love to be able to put on the appearance of We Are Doing Something by rolling back a benefit? It would be a lot cheaper and easier than fixing disparities in the hiring process, providing subsidized child care, or offering true paid leave. I hope this piece does not license them to do that.


Thanks to Ryan Light and Rich Lucas, who dug into the paper and first raised some of the issues I discuss here in response to my initial perplexed tweet.