An eye-popping ethnography of three infant cognition labs

I don’t know how else to put it. David Peterson, a sociologist, recently published an ethnographic study of three infant cognition labs. Titled “The Baby Factory: Difficult Research Objects, Disciplinary Standards, and the Production of Statistical Significance,” it recounts his time spent as a participant observer in those labs, attending lab meetings and running subjects.

In his own words, Peterson “shows how psychologists produce statistically significant results under challenging circumstances by using strategies that enable them to bridge the distance between an uncontrollable research object and a professional culture that prizes methodological rigor.” The account of how the labs try to “bridge the distance” reveals one problematic practice after another, and to the people in the labs these practices often seem routine and no big deal. Here are a few examples.

Protocol violations that break blinding and independence:

…As a routine part of the experiments, parents are asked to close their eyes to prevent any unconscious influence on their children. Although this was explicitly stated in the instructions given to parents, during the actual experiment, it was often overlooked; the parents’ eyes would remain open. Moreover, on several occasions, experimenters downplayed the importance of having one’s eyes closed. One psychologist told a mother, “During the trial, we ask you to close your eyes. That’s just for the journals so we can say you weren’t directing her attention. But you can peek if you want to. It’s not a big deal. But there’s not much to see.”

Optional stopping based on data peeking:

Rather than waiting for the results from a set number of infants, experimenters began “eyeballing” the data as soon as babies were run and often began looking for statistical significance after just 5 or 10 subjects. During lab meetings and one-on-one discussions, experiments that were “in progress” and still collecting data were evaluated on the basis of these early results. When the preliminary data looked good, the test continued. When they showed ambiguous but significant results, the test usually continued. But when, after just a few subjects, no significance was found, the original protocol was abandoned and new variations were developed.
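To see what this kind of peeking does to error rates, here is a minimal simulation sketch of my own (not from Peterson’s paper): it assumes there is no true effect at all, then compares testing once at a planned sample size against testing after every few subjects and stopping at the first p < .05. The sample size, alpha level, peek schedule, and test are all arbitrary illustrative assumptions.

```python
# Minimal simulation of optional stopping under the null hypothesis.
# Sample sizes, alpha, peek schedule, and simulation count are arbitrary
# illustrative choices, not values taken from the labs in the paper.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
alpha = 0.05
n_max = 40          # planned maximum number of infants
peek_every = 5      # "eyeball" the data after every 5 subjects
n_sims = 5000

false_pos_fixed = 0   # test once, after all n_max subjects are run
false_pos_peek = 0    # test at every peek, stop at the first p < alpha

for _ in range(n_sims):
    data = rng.normal(0.0, 1.0, n_max)   # the true effect is exactly zero
    if ttest_1samp(data, 0.0).pvalue < alpha:
        false_pos_fixed += 1
    for n in range(peek_every, n_max + 1, peek_every):
        if ttest_1samp(data[:n], 0.0).pvalue < alpha:
            false_pos_peek += 1
            break

print(f"Fixed-N false positive rate: {false_pos_fixed / n_sims:.3f}")  # close to alpha
print(f"Peeking false positive rate: {false_pos_peek / n_sims:.3f}")   # well above alpha
```

With these settings the fixed-N rate stays near the nominal 5%, while the peeking strategy declares significance in a substantially larger share of purely null datasets.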

Invalid comparisons of significant to nonsignificant:

Because experiments on infant subjects are very costly in terms of both time and money, throwing away data is highly undesirable. Instead, when faced with a struggling experiment using a trusted experimental paradigm, experimenters would regularly run another study that had higher odds of success. This was accomplished by varying one aspect of the experiment, such as the age of the participants. For instance, when one experiment with 14-month-olds failed, the experimenter reran the same study with 18-month-olds, which then succeeded. Once a significant result was achieved, the failures were no longer valueless. They now represented a part of a larger story: “Eighteen-month-olds can achieve behavior X, but 14-month-olds cannot.” Thus, the failed experiment becomes a boundary for the phenomenon.
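The statistical problem is that “significant in one age group, not significant in the other” is not itself a test of an age difference. The sketch below, my own illustration rather than anything from the paper, gives two simulated age groups exactly the same true effect and counts how often the one-significant-one-not pattern appears anyway, and how rarely a direct comparison of the groups would back up the “boundary” story. The effect size, group size, alpha, and tests are arbitrary assumptions.

```python
# Two simulated age groups with the SAME true effect: how often does one
# come out significant and the other not, and how often would a direct
# comparison of the groups support a "boundary" claim? All numbers are
# arbitrary illustrative assumptions.
import numpy as np
from scipy.stats import ttest_1samp, ttest_ind

rng = np.random.default_rng(1)
alpha, n_per_group, true_effect, n_sims = 0.05, 16, 0.5, 5000

pattern = 0    # one group significant, the other not
backed_up = 0  # ...and the direct between-group test is also significant

for _ in range(n_sims):
    older = rng.normal(true_effect, 1.0, n_per_group)    # e.g., "18-month-olds"
    younger = rng.normal(true_effect, 1.0, n_per_group)  # e.g., "14-month-olds"
    sig_old = ttest_1samp(older, 0.0).pvalue < alpha
    sig_young = ttest_1samp(younger, 0.0).pvalue < alpha
    if sig_old != sig_young:
        pattern += 1
        if ttest_ind(older, younger).pvalue < alpha:
            backed_up += 1

print(f"'One significant, one not' pattern: {pattern / n_sims:.2f} of runs")
print(f"Direct group comparison significant in {backed_up / max(pattern, 1):.2f} of those")
```

Under these assumptions the pattern shows up in a sizable fraction of runs even though the two groups do not differ at all.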

And HARKing:

When a clear and interesting story could be told about significant findings, the original motivation was often abandoned. I attended a meeting between a graduate student and her mentor at which they were trying to decipher some results the student had just received. Their meaning was not at all clear, and the graduate student complained that she was having trouble remembering the motivation for the study in the first place. Her mentor responded, “You don’t have to reconstruct your logic. You have the results now. If you can come up with an interpretation that works, that will motivate the hypothesis.”

A blunt explanation of this strategy was given to me by an advanced graduate student: “You want to know how it works? We have a bunch of half-baked ideas. We run a bunch of experiments. Whatever data we get, we pretend that’s what we were looking for.” Rather than stay with the original, motivating hypothesis, researchers in developmental science learn to adjust to statistical significance. They then “fill out” the rest of the paper around this necessary core of psychological research.

Peterson discusses all this in light of the recent debates about replicability and scientific practices in psychology. He says that researchers have essentially three choices: limit the scope of their questions to what available methods can answer well, relax our expectations of what a rigorous study looks like, or engage in questionable research practices (QRPs). I think that is basically right. It is why I believe that any attempt to reduce QRPs has to be accompanied by changes to the incentive structures that govern the first two.

Peterson also suggests that QRPs are “becoming increasingly unacceptable.” That may be true in public discourse, but the inside view presented by his ethnography suggests that unless more open practices become standard, labs will continue to have lots of opportunity to engage in them and little incentive not to.

UPDATE: I discuss what all this means in a follow-up post: Reading “The Baby Factory” in context.

The importance of trust and accountability in effective human subjects protection

I had a good discussion with a friend about the “excused from prior review” category that would replace “exempt” in the proposed human subjects rule changes.

Under the current system, a limited number of activities qualify as exempt from review, but researchers are not supposed to make that determination themselves. The argument is that incentives and biases might distort the researcher’s own judgment. Instead, an administrator is supposed to determine whether a research plan qualifies as exempt. This leads to a Catch-22 where protocols must be reviewed to determine that they are exempt from review. (Administrators have some leeway in creating the exemption application, but my human subjects office requires almost as much paperwork for an exempt protocol as for a fully reviewed protocol.)

The new system would allow investigators to self-declare as “excused” (the new name for exempt), subject to random auditing. The details need to be worked out, but the hope is that this would greatly facilitate the work of people doing straightforward behavioral research. My friend raised a very legitimate concern about whether investigators can make that decision impartially. We’re both psychologists who are well aware of the motivated cognition literature. Additionally, she cited her experience serving on an IRB where people have tried to slip clearly non-exempt protocols into the exempt category.

I don’t doubt that people submit crazy stuff for exempt review, but I think you have to look at the context. A lot of it may be strategic navigation of bureaucracy. Or, if you will, a rational response to distorted incentives. Right now, investigators are not held responsible for making a consequential decision at the submission and review stage. Instead, all of the incentives push investigators to lobby for the lowest possible level of review. It means that your study can get started a lot faster, and depending on your institution it may mean less paperwork. (Not for me though.) If an application gets bumped up to expedited or full review, there is often no downside for the investigator — it just gets passed on to the IRB, often on the same timeline as if it had been initially submitted for expedited or full review anyway.

In short, at the submission stage the current system asks investigators to describe their protocol honestly — and I would infer that they must be disclosing enough relevant information if their non-exempt submissions are getting bumped up to expedited or full review. But the system neither trusts them nor holds them accountable for making ethics-based decisions about the information that they have honestly disclosed.

Under the proposed new system, if an investigator says that a study is excused, they will just file a brief form describing the study’s procedures and then go. Nobody is looking over an investigator’s shoulder before they start running subjects. Yes, that does open up room for rationalizations. (“Vaginal photoplethysmography is kind of like educational testing, right?”) But it also tells investigators something they have not been told before: “We expect you to make a real decision, and the buck stops with you.” Random retrospective auditing would add accountability, especially if repeated or egregious violations come with serious sanctions. (“You listed this as educational testing and you’ve been doing what? Um, please step outside your lab while we change the locks.”)

So if you believe that investigators are subject to the effects of incentives and motivated cognition — and I do — your choice is to either change the incentive structure, or take control out of their hands and put it in a regulatory system that has its own distorted incentives and biases. I do see both sides, and my friend might still disagree with me — but my money is on changing the motivational landscape for investigators. Trust but verify.

Finally, the new system would generate something we don’t have right now: data. Currently, there is a paucity of data showing which aspects of the IRB system, if any, actually achieve its goals of protecting human subjects. It’s merely an article of faith — backed by a couple of exceedingly rare and misrepresented examples — that the current system needs to work the way it does. How many person-hours have been spent by investigators writing up exempt proposals (and like I said, at my institution it’s practically a full protocol)? How many hours have been spent by administrators reading “I’d like to administer the BFI” for the zillionth time, instead of monitoring safety on higher-risk studies? And all with no data showing that any of it is necessary? The ANPRM explicitly states that institutions can and should use the data generated by random audits to evaluate the effectiveness of the new policy and make adjustments accordingly. So if the policy gets instituted, we’ll actually know how it’s working and be able to make corrections if necessary.

From the department of “things that could be said about almost anything”

If you want to change the behavior, you have to change the incentives. Moralistic huffing and puffing won’t cut it.

That sentence jumped out at me as being true of just about every domain of public policy.

(In this case it’s from Dean Dad’s blog post about public higher ed outsourcing growth to private higher ed. My own institution has essentially done this internally. Our finances are more like a public institution with regard to in-state students, and more like a private one with regard to out-of-state students. Since our state’s contribution to higher ed is dismal and dropping, the higher-ups have decided to balance the budget through growth — but that’s almost entirely by admitting more out-of-state students.)

The perverse incentive structure of IRBs

As a researcher at a university, all of my human subjects research has to go through my university’s IRB. I believe that IRBs have an important role in research. However, in practice I sometimes find dealing with an IRB to be frustrating.

Pretty much all of the research that I do is very low risk. Yet I have to go through a review system that was invented as a response to Nazi medical experiments and other horrific incidents half a century ago. You might think that should make my behavioral research easier to get approved — I could just say, “hey, guess what, I’m not secretly giving people syphilis or anything” and get the thumbs-up. Sadly, though, it doesn’t work like that. Even when I have a study that is eligible for expedited review, there is a heck of a lot of paperwork to fill out, and time to wait, and often pointless revisions to make — all in order to do something as simple as asking people a few questions about what kind of day they had yesterday.

So why are university IRBs so inefficient? There are a number of reasons, but I believe that one of the core problems is that the system is built on a foundation of perverse incentives for the IRB.

The IRB’s task can be thought of as a signal detection problem. Simplifying a little bit, you can think of the protocols that researchers submit as being either worthy or unworthy. For any given protocol, the IRB has to decide to approve or reject. So there are two kinds of correct decisions (approve a worthy protocol or reject an unworthy one) and two kinds of mistaken decisions (reject a worthy protocol or approve an unworthy one). And the big problem is that the IRB’s potential costs associated with the two different kinds of mistakes are severely imbalanced.

If the IRB mistakenly rejects a worthy protocol, what is the worst thing that could happen? The investigator might make a phone call and resubmit the application, taking up some extra staff time, but the IRB will not get into any serious trouble. And the costs of this mistake are chiefly borne by the researcher, not the IRB. Furthermore, within a university, there is no appeals process or oversight authority empowered to act on a rejected protocol.

By contrast, if the IRB mistakenly approves an unworthy protocol, all kinds of bad things could happen. Even if no subjects are harmed, an audit could turn up the mistake and the IRB could get in trouble. And in more serious cases — if subjects do get exposed to inappropriate risks, or actually get harmed — things can get much, much worse. The IRB could get shut down (halting all research at the university), the professional IRB staff could get fired, and the university could get sued by the harmed subjects.

This asymmetry gives IRBs a very strong incentive to err on the side of rejecting too much research. So it’s no wonder that the process is so slow and clunky, and even simple low-risk protocols are routinely sent back for revisions. The staff at my IRB are good people who want to help researchers when they can. But the actual review board members are often people with no personal stake in seeing that research gets done efficiently, and some have no formal science training at all (which can lead them to imagine harmful effects of research that have no basis in reality). And for both the paid staff and the board members, even those with the best intentions work within an incentive structure that is completely out of whack.
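To make the asymmetry concrete, here is a toy expected-cost calculation in that signal detection framing. The cost numbers are invented purely to illustrate the imbalance described above; they are not estimates of anyone’s actual costs.

```python
# Toy expected-cost calculation for a university IRB's decision, framed as
# signal detection. The cost numbers are invented purely to illustrate the
# asymmetry described above.
COST_REJECT_WORTHY = 1       # investigator grumbles and resubmits; minor cost to the IRB
COST_APPROVE_UNWORTHY = 500  # audit findings, shutdowns, lawsuits

def lower_cost_decision(p_unworthy):
    """Return the decision with the lower expected cost to the IRB itself."""
    expected_cost_of_approving = p_unworthy * COST_APPROVE_UNWORTHY
    expected_cost_of_rejecting = (1 - p_unworthy) * COST_REJECT_WORTHY
    return "reject" if expected_cost_of_rejecting < expected_cost_of_approving else "approve"

for p in (0.001, 0.005, 0.05, 0.20):
    print(f"P(protocol is unworthy) = {p:.3f} -> {lower_cost_decision(p)}")
# With these numbers the break-even point is about p = 1/501, roughly 0.002:
# the IRB minimizes its own expected cost by rejecting any protocol it thinks
# has even a fraction of a percent chance of being unworthy.
```

The point of the toy numbers is only this: when one mistake is hundreds of times more costly to the IRB than the other, the rational criterion slides almost all the way toward rejection.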

So a big part of me was outraged (and a tiny, naughty part of me jealous) to learn that in commercial medical settings, the IRB incentives are out of whack too — but in the opposite direction. If you are a researcher at a private, for-profit research company, you get approval for your research by paying a commercial IRB to review it. It doesn’t take a genius to look at this setup and figure out that a commercial IRB that approves lots of research is going to be popular with its customer base. So it was probably just a matter of time before a scandal erupted. And now one has.

In a test of the commercial IRB system, the Government Accountability Office submitted a fake protocol to three different commercial IRBs. The protocol was rigged to be full of unsafe, high-risk elements. And apparently one of the companies, Coast IRB, fell for the sting, deeming the protocol safe and low-risk and giving it approval. Upon further investigation by the GAO, it turns out that Coast has not rejected a single proposal in the last 5 years, and it made over $9 million last year. Hmmm…

In the aftermath of this incident, it is very likely that attention is again going to get focused where it always gets focused: on the possibility that IRBs might be approving bad, unsafe research. But such a focus may be misguided. The case of Coast IRB shows that even commercial IRBs face very serious costs when they get caught approving bad research. The company has just seen its entire $9-mil-a-year business evaporate while it undergoes an audit. Employees may lose their jobs. Owners may lose profits and see their shares lose value. The entire company could go out of business.

Instead, the problem with both university and commercial IRBs lies on the approval side: the system does not give IRBs the right incentives to approve research. In the university case, the incentive to approve is too weak; in the commercial case, it is too strong. Hypothetically speaking, even if somebody at a Coast IRB kind of place knew the potential costs of getting caught approving bad research, in a rational cost-benefit analysis those potential costs would have been weighed against a multimillion-dollar revenue stream that depended on approving lots of protocols, good and bad.
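Continuing the toy calculation from the university case above, adding a per-approval revenue term (again with made-up numbers, including the assumed chance that a bad approval is ever caught) shows how the commercial setup pushes the criterion the other way.

```python
# Toy extension of the earlier expected-cost sketch: a commercial IRB earns
# review fees only if clients keep coming back, which they do when protocols
# get approved. All numbers are invented to illustrate the flipped incentive.
REVENUE_PER_APPROVAL = 50        # fees and repeat business from happy clients
COST_APPROVE_UNWORTHY = 500      # expected fallout if a bad approval is caught
P_CAUGHT = 0.02                  # assumed chance a bad approval is ever caught

def commercial_best_decision(p_unworthy):
    """Return the decision with the higher expected payoff to the company."""
    payoff_approve = REVENUE_PER_APPROVAL - p_unworthy * P_CAUGHT * COST_APPROVE_UNWORTHY
    payoff_reject = 0.0          # no fee from this protocol, and the client may go elsewhere
    return "approve" if payoff_approve > payoff_reject else "reject"

for p in (0.05, 0.5, 1.0):
    print(f"P(protocol is unworthy) = {p:.2f} -> {commercial_best_decision(p)}")
# With these numbers, approving pays even for a protocol that is certainly
# unworthy (50 > 1.0 * 0.02 * 500 = 10), unless getting caught becomes far
# more likely or far more costly.
```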

So what will happen next? If you are a member of Congress and you want to fix commercial IRBs, you could alter the cost-benefit balance on either side. That is, you could either diminish the profit motive associated with approving research, or you could make it even more costly for a company to mistakenly approve bad research. The problem is that any new regulatory policy designed to fix commercial IRBs could very well affect university IRBs as well, since both kinds of IRBs fall under many of the same regulations. And if you raise the costs and punishments associated with approving bad research (or institute even more intrusive regulations and oversight to try to prevent such approvals from happening), you will make the perverse incentives at universities even more perverse.

Personally, I think it’s at least a little bit weird that IRBs — institutions designed to safeguard the interests of research subjects — can be run as for-profit businesses whose very financial existence depends upon those they are supposed to watch. If Congress wants to fix the system in the commercial medical industry, they need to look at the fundamental question of whether that is a sustainable model, and narrowly tailor any changes to apply to commercial IRBs. The answer is most definitely not to create more intrusive oversight or threaten punishments across the board. Let’s hope that is not the direction they choose to go.