Don’t change your family-friendly tenure extension policy just yet


If you are an academic and on social media, then over the last weekend your feed was probably full of mentions of an article by economist Justin Wolfers in the New York Times titled “A Family-Friendly Policy That’s Friendliest to Male Professors.”

It describes a study by three economists of the effects of parental tenure extension policies, which give an extra year on the tenure clock when people become new parents. The conclusion is that tenure extension policies do make it easier for men to get tenure, but they unexpectedly make it harder for women. The finding has a counterintuitive flavor – a policy couched in gender-neutral terms and designed to help families actually widens a gender gap.

Except there are a bunch of odd things that start to stick out when you look more closely at the details, and especially at the original study.

Let’s start with the numbers in the NYT writeup:

The policies led to a 19 percentage-point rise in the probability that a male economist would earn tenure at his first job. In contrast, women’s chances of gaining tenure fell by 22 percentage points. Before the arrival of tenure extension, a little less than 30 percent of both women and men at these institutions gained tenure at their first jobs.

Two things caught my attention when I read this. First, that a 30% tenure rate sounded awfully low to me (this is at the top-50 PhD-granting economics departments). Second, that tenure extension policies took the field from parity (“30 percent of both men and women”) to a 6-to-1 lopsided rate favoring men (the effects are percentage points, so it goes to a 49% tenure rate for men vs. 8% for women). That would be a humongous effect size.
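A quick back-of-the-envelope check of that arithmetic (the baseline rate and the effects are just the numbers quoted above; the key point is that percentage points add to the base rate directly, unlike relative percent changes):

```python
# Percentage points add to the baseline rate directly.
baseline = 0.30          # ~30% tenure-at-first-job rate before the policies
men_effect_pp = 0.19     # +19 percentage points for men
women_effect_pp = -0.22  # -22 percentage points for women

men_rate = baseline + men_effect_pp      # 49%
women_rate = baseline + women_effect_pp  # 8%
ratio = men_rate / women_rate            # the roughly 6-to-1 disparity

print(f"men: {men_rate:.0%}, women: {women_rate:.0%}, ratio: {ratio:.1f}:1")
```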

Regarding the 30% tenure rate, it turns out the key words are “at their first jobs.” This analysis compared people who got tenure at their first job to everybody else — which means that leaving for a better outside offer is treated the same in this analysis as being denied tenure. So the tenure-at-first-job variable is not a clear indicator of whether the policy is helping or hurting a career. What if you look at the effect of the policy on getting tenure anywhere? The authors did that, and they summarize the analysis succinctly: “We find no evidence that gender-neutral tenure clock stopping policies reduce the fraction of women who ultimately get tenure somewhere” (p. 4). That seems pretty important.

What about that swing from gender-neutral to a 6-to-1 disparity in the at-first-job analysis? Consider this: “There are relatively few women hired at each university during the sample period. On average, only four female assistant professors were hired at each university between 1985 and 2004, compared to 17 male assistant professors” (p. 17). That was a stop-right-there moment for me: if you are an economics department worried about gender equality, maybe instead of rethinking tenure extensions you should be looking at your damn hiring practices. But as far as the present study goes, there are n = 62 women at institutions that never adopted gender-neutral tenure extension policies, and n = 129 at institutions that did. (It’s even worse than that because only a fraction of them are relevant for estimating the policy effect; more on that below). With a small sample there is going to be a lot of uncertainty in the estimates under the best of conditions. And it’s not the best of conditions: Within the comparison group (the departments that never adopted a tenure extension policy), there are big, differential changes in men’s and women’s tenure rates over the study period (1985 to 2004): Over time, men’s tenure rate drops by about 25%, and women’s tenure rate doubles from 12% to 25%. Any observed effect of a department adopting a tenure-extension policy is going to have to be estimated in comparison to that noisy, moving target.

Critically, the statistical comparison of tenure-extension policy is averaged over every assistant professor in the sample, regardless of whether the individual professor used the policy. (The authors don’t have data on who took a tenure extension, or even on who had kids.) But causation is only defined for those individuals in whom we could observe a potential outcome at either level of the treatment. In plain English: “How does this policy affect people” only makes sense for people who could have been affected by the policy — meaning people who had kids as assistant professors, and therefore could have taken an extension if one were available. So if the policy did have an effect in this dataset, we should expect it to be a very small one because we are averaging it with a bunch of cases that by definition could not possibly show the effect. In light of that, a larger effect should make us more skeptical, not more persuaded.
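To see how strong that dilution is, here is a minimal sketch. The eligibility fraction below is purely hypothetical, since the study does not report who had children:

```python
# Intent-to-treat (ITT) dilution: the estimated effect averages eligible
# and ineligible people, so it equals the effect among the eligible scaled
# down by the eligibility fraction (assuming zero effect on the ineligible).
def itt_effect(effect_among_eligible, eligible_fraction):
    return effect_among_eligible * eligible_fraction

# Hypothetical: suppose 40% of assistant professors had a child pre-tenure.
eligible_fraction = 0.40

# Working backwards: an observed -22pp ITT effect for women would then
# imply a -55pp effect among those who actually had kids.
implied = -0.22 / eligible_fraction
print(f"implied effect among the eligible: {implied:+.0%}")
```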

There is also the odd finding that in departments that offered tenure extension policies, men took less time to get to tenure (about 1 year less on average). This is the opposite of what you’d expect if “men who took parental leave used the extra year to publish their research” as the NYT writeup claims. The original study authors offer a complicated, speculative story about why time-to-tenure would not be expected to change in the obvious way. If you accept the story, it requires invoking a bunch of mechanisms that are not measured in the paper and likely would add more noise and interpretive ambiguity to the estimates of interest.

There were still other analytic decisions that I had trouble understanding. For example, the authors excluded people who had 0 or 1 publications in their first 2 years. Shouldn't that variance go into the didn't-get-tenure side of the analyses? And the analyses include a whole bunch of covariates without much discussion (and no pre-registration to limit researcher degrees of freedom). One of the covariates had a strange effect: holding a degree from a top-10 PhD-granting institution makes it less likely that you will get tenure in your first job. That does make sense if you think that top-10 graduates are likely to get killer outside offers – but then that just reinforces the lack of clarity about what the tenure-in-first-job variable is really an indicator of.

But when all is said and done, probably the most important part of the paper is two sentences right on the title page:

IZA Discussion Papers often represent preliminary work and are circulated to encourage discussion. Citation of such a paper should account for its provisional character.

The NYT writeup does no such thing; in fact it goes the opposite direction, trying to draw broad generalizations and make policy recommendations. This is no slight against the study’s original authors – it is typical in economics to circulate working papers for discussion and critique. Maybe they’d have compelling responses to everything I said, who knows? But at this stage, I have a hard time seeing how this working paper is ready for a popular media writeup for general consumption.

The biggest worry I have is that university administrators might take these results and run with them. I do agree with the fundamental motivation for doing this study, which is that policies need to be evaluated by their effects. Sometimes superficially gender-neutral policies have disparate impacts when they run into the realities of biological and social roles (“primary caregiver” leave policies being a case in point). It’s fairly obvious that in many ways the academic workplace is not structured to support involved parenthood, especially motherhood. But are tenure extension policies making the problem worse, making it better, or doing nothing at all? For all the reasons outlined above, I don’t think this research gives us an actionable answer. Policy should come from cumulative knowledge, not noisy and ambiguous preliminary findings.

In the meantime, what administrator would not love to be able to put on the appearance of We Are Doing Something by rolling back a benefit? It would be a lot cheaper and easier than fixing disparities in the hiring process, providing subsidized child care, or offering true paid leave. I hope this piece does not license them to do that.

Thanks to Ryan Light and Rich Lucas, who dug into the paper and first raised some of the issues I discuss here in response to my initial perplexed tweet.

6 thoughts on “Don’t change your family-friendly tenure extension policy just yet”

  1. A 19% increase in a 30% tenure rate would bring it to 35.7%, not 49%. Likewise, women’s rates would decrease to 23.4%. Still a large effect size, but not 6:1.

  2. In top economics departments like Stanford, NO ONE gets tenured; they all leave for lesser institutions… so the assumption that they leave for better offers is not very valid.

    1. The analysis covered the top 50 economics departments. I know psychology much better than economics so I don’t know if it is a valid comparison, but in our field there is a small handful of top departments like that too. But there’s still plenty of room among the top 50 for people to get and take attractive outside offers.

  3. There are many other things that are problematic in that paper. Not only is the outcome variable tenure at the first job, it’s the first job within 2 years of getting a Ph.D. The authors themselves acknowledge a number of limitations of their data (yet, remarkably, they proceed with the analysis):
    “There are several limitations to note about this data. First, we do not observe actual tenure decisions,
    only outcomes. Some people who were granted tenure may have left anyway, while others may have left
    before they were up for tenure. We observe only actual promotions to tenured positions, which usually
    occur when individuals are promoted from assistant to associate professor. This means that we are
    identifying effects on actual tenure promotions, rather than on tenure decisions. Second, we do not observe
    births or the number of years taken to reach the tenure decision. Because we do not know who is eligible for
    a tenure clock extension or who utilizes the policy, the effects shown below should therefore be interpreted
    as intent-to-treat (ITT) effects. Finally, there are relatively few women hired at each university during the
    sample period. On average, only four female assistant professors were hired at each university between
    1985 and 2004, compared to 17 male assistant professors. Having few women makes it difficult to allow
    for gender-specific differences across universities.”

    So, they don’t know who stopped the clock and for what reason, or who had a child, all in a sample where men outnumber women nearly 5:1 (maybe try bootstrapping?). In terms of the outcome variable, they only know who got tenure at those institutions and stayed at those institutions, and not who a) left before tenure, for whatever reason, b) was denied tenure, or c) left even though they got tenure. Sadly, as is the case with a lot of economics research, the data were collected from online sources as opposed to, say, surveys and questionnaires that can get at some of those crucial variables (like who actually used the policy and who had a baby!), and other potential confounding variables (for example, among economists who actually had a baby pre-tenure, what were their spouse/partner and childcare arrangements like? How many kids did they [already] have? etc.). Not to mention that limiting the sample to people in their first TT job is arbitrary, because most women ended up getting tenure at other institutions. It’s likely those are lower-ranked institutions, but it’s also possible they moved “laterally” (among top-50 institutions) pre-tenure and got tenure at a top-50 department as their second job, but they wouldn’t be included in the sample (and neither would men who switched jobs).
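    To make the “maybe try bootstrapping” aside concrete: with samples of 62 and 129 women, even a simple percentile bootstrap of a difference in tenure rates gives a very wide interval. The individual outcomes below are simulated stand-ins (the 26% and 17% rates echo the Table 2 figures quoted later, but the data are made up for illustration):

```python
import random

random.seed(0)

# Hypothetical binary tenure outcomes for women at never-adopter (n=62)
# and ever-adopter (n=129) departments; simulated, not the real data.
never = [1] * 16 + [0] * 46    # ~26% tenured, n = 62
ever = [1] * 22 + [0] * 107    # ~17% tenured, n = 129

def rate(xs):
    return sum(xs) / len(xs)

# Nonparametric bootstrap of the difference in tenure rates.
diffs = []
for _ in range(10_000):
    b_never = [random.choice(never) for _ in never]
    b_ever = [random.choice(ever) for _ in ever]
    diffs.append(rate(b_ever) - rate(b_never))

diffs.sort()
lo, hi = diffs[249], diffs[9749]   # rough 95% percentile interval
print(f"difference in rates: {rate(ever) - rate(never):+.2f}, "
      f"95% CI [{lo:+.2f}, {hi:+.2f}]")
```

    With numbers like these, the interval spans a couple dozen percentage points, which is the whole point: estimates from samples this small are very noisy.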

    Onto the methods… I’m puzzled by the whole of section 4 in the paper. It’s purely hypothetical modeling of potential decisions about where to submit a paper for review (a top-5 journal or a “regular” journal) given child status and clock-stopping policy status. They use a parameter “alpha” to denote “different childbearing costs”, but where those values come from, I have no idea. Furthermore, that whole section is largely irrelevant to the rest of the paper (which would still be highly problematic even with that section removed), because the paper is not about decisions about where to send papers, but about tenure outcomes. Yes, it’s important to know what influences decisions to send papers to top journals or not, but if you look at their Figure 5, the “average number of top-5 publications within 7 years” for women varies between (roughly) 0.75 and 1, and for men between 1.15 and 1.45. Are those significantly different? On the other hand, the “average number of non-top-5 publications within 7 years” is between (again, roughly) 6.2 and 7.5 for men, and 4.3 and 5.2 for women. Though the authors claim that the differences in both the former and the latter “begin wide, are eliminated in the middle of the sample period, and then re-emerge at the end of the sample period,” that’s just not true. Yes, the number of top-5 pubs in the mid-90’s is around 1 for both men and women, but the number of non-top-5 pubs in the mid-90’s is about 5.2 and 6.1 for women and men, respectively, which is still about one more publication, on average, among male economists, even during the best of times. All in all, the publication rate is consistently higher among male than female economists, regardless of the existence of clock stopping policies, and that’s what’s really interesting/frustrating/problematic.
    Unfortunately, the paper also fails to distinguish between single-authored and multi-authored publications (and the order of authorship on the latter), and existing research has shown that women’s work on multi-authored publications is devalued compared to men’s.

    Furthermore, when you look at the summary data on male and female productivity in Table 2, the values for men are consistently higher than for women, regardless of the implementation of the gender-neutral clock stopping policy. Annoyingly, in that table the authors divide the pre- and post-implementation data arbitrarily into 1985-1994 and 1995-2004 time periods, even though universities implemented GNCS policies in different years. That basically means that the pre- and post- means for men and women can’t be compared within each category as the table (mis)leads you to do, but just across, sort of. It also means we can’t see the summary data that would tell us what the tenure outcomes or publication counts (for economists in top 49 departments hired for the first time within 2 years of getting a Ph.D.) were at a) institutions with gender-neutral clock stopping policies before and after the implementation of said policies and b) institutions without GNCS policies. Again, if you look at the data in Figure 5, the unadjusted tenure outcomes (I refuse to call them “rates”, because that’s not what they measured) at all those institutions were always higher for men than for women, and they actually went up (well, up and down a bunch of times, but overall the trend is positive) for both men and women between 1985 and 2004 (though for men there is a big initial dip in the mid-80’s). The authors state (referring to Table 2) that:
    “As this table is just meant to be descriptive and give some context, we have
    simply broken the data into two time periods with no regard for when the policy is actually adopted. At
    these institutions, the male tenure rate rose from 29 to 36 percent across the two time periods. Columns
    5-8 repeat this exercise for female assistant professors. The tenure rate for women at universities that are
    never-adopters rose from 12 to 25 percent across the two time periods, while the tenure rate for women
    at universities that ever adopt a policy fell from 26 to 17 percent.”
    Breaking down the data into two arbitrary time periods “with no regard for when the policy is actually adopted” creates the appearance that clock-stopping led to a 9-percentage-point drop in tenure outcomes among female economists at universities that adopted GNCS policies, but this is not necessarily true, because the 1985-1994 and 1995-2004 periods at adopting universities include a mix of pre- and post-adoption values. Again, why not have a table with tenure outcomes “pre-adoption” and “post-adoption” of GNCS policies that really is pre- and post-?

    In terms of the model variables, I have some questions about them too. The authors have a binary “GN” variable in the model that indicates whether the gender-neutral policy was in effect at the time of hire, and another variable that indicates whether such a policy was implemented during the first two years on the job. But why not have a single “year on the job from which the GN policy was in effect” variable instead of those two? Being able to stop the clock in years 3-5 on the tenure track still seems like a valuable option (or at least something to consider). The mixed-effects model includes a bunch of other control variables: “number of undergraduate students, the number of graduate students, faculty size, average salary of full professors, average salary of assistant professors, annual revenue, the fraction of the faculty who are female, and the fraction of the faculty who are full professors.” It’s unclear to me which of these variables, if any, are significant, and significant enough to be included in the models at the expense of having a simpler model (AIC? Partial F-test? Something?). I can’t make enough sense of their Tables 3 and 4 (mixed-effects modeling is not my forte) to discuss their coefficients and effect sizes, other than that most of them don’t appear to be significant at p < 0.1.

    Lastly, and this is my pet peeve with much of economics research, there is not an r-square in sight. I’d like to know how much of the variance in tenure outcomes is actually explained by the model! I understand it’s more complicated to estimate in mixed models than in LM or GLM, but it can be done (e.g., Nakagawa & Schielzeth, 2013, “A general and simple method for obtaining R2 from generalized linear mixed-effects models”). I once saw an econ talk where the presenter made a huge deal about a significant coefficient with a tiny effect size and an r2 (buried at the bottom of a very large table) of 0.07 (that is not a typo!), without ever acknowledging that their model really didn’t explain all that much and that there’s a lot of work ahead to explain the remaining 93% of the variance. How much variance are we talking about in this case?
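    For what it’s worth, the Nakagawa & Schielzeth R² is straightforward to compute once you have the model’s variance components (the numbers below are illustrative only, not from this paper):

```python
# Nakagawa & Schielzeth R^2 for mixed models:
#   marginal R^2    = variance explained by fixed effects alone
#   conditional R^2 = variance explained by fixed + random effects
def r2_nakagawa(var_fixed, var_random, var_residual):
    total = var_fixed + var_random + var_residual
    marginal = var_fixed / total
    conditional = (var_fixed + var_random) / total
    return marginal, conditional

# Hypothetical variance components, for illustration only:
m, c = r2_nakagawa(var_fixed=0.4, var_random=0.6, var_residual=3.0)
print(f"marginal R2 = {m:.2f}, conditional R2 = {c:.2f}")
```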

    As for the NYT piece, it conflates clock stopping with having a leave, which are two different things! This paper, as problematic as it is, only examines the effects of clock-stopping (except not really, because they don’t know who stopped the clock), and NOT the effects of parental leave. Casually mixing up these two benefits is careless and irresponsible, because most people (and possibly some university administrators) will not look at the paper (and may not have the training to understand it) before reaching the conclusion that both gender-neutral clock stopping and parental leave are bad for women. They may be, but this paper does not provide convincing evidence that this is the case for GNCS.

  4. Thanks for this critique, which I thought was very informative and helpful. One point to add, though: having a PhD from a top 10 department might genuinely correlate with tenure denial because hiring is a collider for PhD departmental prestige and productivity prior to the hire. In other words, in order to get hired coming out of a less-prestigious department, you probably had to have a more impressive publication record already than someone coming from Berkeley or whatever. It may be that the people who meet this hurdle (hired despite less-prestigious PhD) are more prepared for life on the tenure track and less likely to buckle in the face of the new teaching and service demands of an assistant professorship.
