“Rely on the Rigorous Evidence” is bad advice (updated)

Here is what seems like a pretty simple question. You have done an RCT in country X and produced a consistent estimate of the treatment effect of intervention I on outcome O. I am in country Y and have a simple OLS estimate of the partial correlation between I and O. How much should I move my priors from being centered on the OLS estimate from my own country toward the “rigorous” treatment effect estimate from country X? (One could extend this to N RCTs done in N countries, none of which is country Y.)

Not only does this seem like a simple question, it also seems like a pretty important one: without an answer, one cannot make any claims about the benefit-cost calculations of doing RCT research. If the scope of applicability of a finding isn’t known then, at best, one can apply the rigorous treatment effect estimate only to exactly the conditions in which it was generated, and hence the benefit/cost ratio is likely to be very low (unless the country or program is of massive scale).

Unfortunately, the answer to this question is not simple, the “intuitive” answer has very bad properties, and, empirically, just using OLS from Y and ignoring the rigorous estimate from country X can produce better predictions.

  1. The answer is not simple because an RCT, in principle (and often in practice), produces both an OLS estimate and an experimental estimate, and hence an estimate of the bias in OLS relative to the true treatment effect in X. This means there are two distinct pieces of “rigorous evidence”: the treatment effect estimate and the OLS bias estimate. Given an existing distribution of OLS estimates across countries, it is likely that, for some countries Y, these two pieces will suggest moving the estimate in opposite directions: the treatment effect estimate will suggest the country Y treatment effect should be bigger than its OLS estimate, while the OLS bias estimate will suggest it should be smaller.
  2. If one gives the seemingly intuitive answer, “move the treatment effect estimate for Y to the estimate for X (or to the average of the estimates for the N countries),” this (likely) implies (i) that some countries revise their TE estimate upward from OLS and others downward, and (ii) that the variance across countries in the true treatment effect goes to zero, even when the OLS estimates have large variance. Neither of these makes any sense.
  3. The Root Mean Square Error (RMSE) of the “collapse onto the rigorous estimates” prediction is, in what are now three empirical examples, larger than that of just using OLS country by country, because the “internal validity” problem solved by RCTs is so much smaller than the “external validity” problem created by the heterogeneity across countries in the “true” treatment effects.

(That was a new introduction to the following blog)

The attached paper (submitted to a special issue of Review of Development Economics) is a case in point. You would think that, after all the intellectual and financial resources that have gone into RCTs and into the creation of “systematic reviews” that aggregate the “rigorous evidence,” there would be a sensible and empirically validated answer to the question: “How should beliefs about the impact of action X on outcome Y in my country context, call it C, (LATE(X,Y,C)) change in response to a rigorous study (or systematic review of rigorous studies) from other countries/contexts?” But there just isn’t.

And it is easy to point out that things one might think sensible to say, like “Beliefs about LATE(X,Y,C) should move from the existing non-rigorous estimates in context C towards the findings from rigorous studies,” don’t pass muster as being even logically coherent. As Justin Sandefur and I pointed out some years ago, the true LATE(X,Y,C) can be decomposed into the non-rigorous estimate in C and the bias in that estimate. If there is heterogeneity (variance) in the non-rigorous estimates across contexts (and there is), then this generically implies that, in response to a systematic review, some countries should shift their beliefs towards a larger LATE and some towards a smaller LATE, which implies that the bias in those cases has a different sign. So any generic advice about adopting the “rigorous evidence” essentially demands that people adopt beliefs about bias that are wildly implausible, a set of measure zero. That is bad science.
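The decomposition argument can be written out in a few lines. This is only a sketch in assumed notation, not the notation of the underlying paper: c indexes contexts, b_c is the bias of the non-rigorous estimate in context c (with the sign convention that bias is OLS minus LATE), and the bar denotes the systematic review average.

```latex
\begin{align*}
\mathrm{LATE}_c &= \mathrm{OLS}_c - b_c
  && \text{(decomposition; } b_c \text{ is the bias in context } c\text{)} \\
\widehat{\mathrm{LATE}}_c &= \overline{\mathrm{LATE}} \quad \text{for all } c
  && \text{(the ``adopt the rigorous evidence'' rule)} \\
\Rightarrow\ \hat{b}_c &= \mathrm{OLS}_c - \overline{\mathrm{LATE}}
  && \text{(implied belief about the bias)} \\
\operatorname{Var}_c(\hat{b}_c) &= \operatorname{Var}_c(\mathrm{OLS}_c),
  \quad \operatorname{Var}_c(\widehat{\mathrm{LATE}}_c) = 0
  && \text{(all variation is attributed to bias)}
\end{align*}
```

Under this rule the implied bias is positive in every context whose OLS estimate lies above the review average and negative in every context whose OLS estimate lies below it, and all of the observed cross-context variation in OLS is assigned to bias rather than to differences in the true effect.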

Slightly harder is to point out that adopting the systematic review point estimate as the prediction of LATE(X,Y,C) does empirically worse than just using OLS(X,Y,C) as the prediction of LATE(X,Y,C). This point is stunning. If the heterogeneity in the true LATE across contexts is large relative to the bias in non-rigorous estimation methods, then the most naïve possible thing is actually better than the supposedly new, better, cooler, more “sciency” approach of doing some RCTs in some countries and then aggregating those in a systematic review.

That this is so is harder to point out for two reasons. One, because there are so few RCTs that are even moderately comparable, it is hard to have enough estimates of the “true” context-specific LATE to create a variance in predictions. Two, because systematic reviews (and the underlying papers) tend to ignore the existing non-rigorous estimates altogether, the question of “how much better?” cannot be answered.

The attached paper solves these problems by using two sources that have comparable estimates of a “raw”, an “OLS” and a “LATE” for the same quantity for a large(ish) number of countries. The LATE is the Oster estimate using the standard assumptions.

For 42 developing countries there is a Raw, OLS, and LATE(Oster) estimate of the wage gain for a typical low-education level worker moving from their home country to the USA from Clemens, Montenegro and Pritchett (2019).

For 29 developing countries there is a Raw, OLS, and LATE(Oster) estimate of the private sector learning premium from Patel and Sandefur (2020).

Once one has data like this, the rest is easy, just simple arithmetic (hence the temptation of paper arbitrage): compare the Root Mean Square Error (RMSE) of (i) using the average of the LATE estimates (the “systematic review”) to predict the LATE in each country and (ii) using the OLS from each country to predict its LATE.
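The arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function names are hypothetical and the five "countries" below are made-up numbers, not estimates from either data source. It also uses the simple in-sample average of the LATE estimates; a leave-one-out average would be a natural variation.

```python
import math

def rmse(predictions, targets):
    """Root mean square error of predictions against targets."""
    n = len(targets)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n)

def compare_predictors(ols, late):
    """Return (RMSE of pooled-average prediction, RMSE of country-own OLS prediction)."""
    pooled = sum(late) / len(late)          # the "systematic review" point estimate
    rmse_pooled = rmse([pooled] * len(late), late)
    rmse_ols = rmse(ols, late)              # each country predicts with its own OLS
    return rmse_pooled, rmse_ols

# Made-up numbers for illustration only: LATE varies a lot across countries,
# and OLS is biased upward by 0.5 everywhere.
late = [1.0, 2.0, 3.0, 4.0, 5.0]
ols = [1.5, 2.5, 3.5, 4.5, 5.5]

rmse_pooled, rmse_ols = compare_predictors(ols, late)
print(f"RMSE using pooled LATE average: {rmse_pooled:.3f}")
print(f"RMSE using own-country OLS:     {rmse_ols:.3f}")
```

In this toy case the own-country OLS prediction has the smaller RMSE, simply because the assumed bias (0.5) is small relative to the assumed spread of the true effects.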

As expected, the answer depends on the ratio of the variance of the LATE to the typical bias in OLS. For wage gains the variance is huge and the bias modest so using context specific OLS is twice as good as using the “rigorous evidence.” For the private sector learning premium the variance is modest and the bias substantial (selectivity into private schools is large) and hence OLS and the “rigorous evidence” do about the same in RMSE.
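A small simulation illustrates why the ratio matters. It is a sketch under assumptions that are not from the paper: each country's OLS equals its true LATE plus a normally distributed bias, and the function name, parameter names, and parameter values are all hypothetical. Under that assumed decomposition, the mean squared error of the OLS prediction is the squared mean bias plus the variance of the bias, while the mean squared error of the pooled prediction is the cross-country variance of the LATE.

```python
import math
import random

def simulate(sd_late, mean_bias, sd_bias, n_countries=5000, seed=0):
    """Simulate countries and return (RMSE of pooled prediction, RMSE of OLS prediction)."""
    rng = random.Random(seed)
    # True country-specific effects, centered (arbitrarily) on 1.0.
    late = [rng.gauss(1.0, sd_late) for _ in range(n_countries)]
    # Assumed data-generating process: OLS = LATE + bias.
    ols = [l + rng.gauss(mean_bias, sd_bias) for l in late]
    pooled = sum(late) / n_countries
    rmse_pooled = math.sqrt(sum((pooled - l) ** 2 for l in late) / n_countries)
    rmse_ols = math.sqrt(sum((o - l) ** 2 for o, l in zip(ols, late)) / n_countries)
    return rmse_pooled, rmse_ols

# Large heterogeneity, modest bias: own-country OLS should predict better.
high_variance = simulate(sd_late=2.0, mean_bias=0.5, sd_bias=0.2)
# Modest heterogeneity, large bias: the pooled estimate should predict better.
high_bias = simulate(sd_late=0.3, mean_bias=1.0, sd_bias=0.2)

print("high variance, modest bias (pooled, OLS):", high_variance)
print("modest variance, large bias (pooled, OLS):", high_bias)
```

Which predictor wins flips as the assumed spread of the true effects shrinks relative to the assumed bias, which is the pattern described for the two empirical cases.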

So, for about 25 years now there has been a major advocacy movement selling people on the notion that doing RCTs about specific interventions in specific contexts and then aggregating these was going to lead to “evidence based” policies based on “rigorous evidence” and that this would lead to better development outcomes. But there has never been any evidence these claims were true. Moreover, they always seemed pretty improbable and inconsistent with our “best available” models of development phenomena, which suggested that contextual variance in policy outcomes was likely to be very large.

These claims about external validity are not a picayune detail: without some clear idea about the scope of reliability of the results across contexts, it is impossible to claim that any piece of research, and especially expensive research like RCTs, is a cost-effective investment.