Development work versus charity work

A really excellent academic paper often clarifies issues far beyond the particular point the paper makes. A recent paper in Nature magazine evaluates the impact of different components of anti-poverty programs. The paper shows that adding a “psycho-social” component to an anti-poverty program in Niger is enormously cost-effective, as it had similar impacts as adding a cash grant but was much less expensive. The paper, with 11 co-authors, is an exceptionally solid contribution.

And, what the paper indirectly illustrates so well is the difference between doing research about development and research about charity work. The development question is: “How can the people living in Niger come to have broad based prosperity and high levels of wellbeing?” The charity question is: “If some agency (perhaps of a government) is going to devote a modest amount of resources to targeted programs that attempt to mitigate the worst consequences of a country’s low level of development, what is the most cost-effective design of such programs?”

The paper is titled: “Tackling psychosocial and capital constraints to alleviate poverty” but does not start with the question: “What could lead to prosperity in Niger?” or even “What could lead to substantial reductions in poverty (at various poverty lines) in Niger?” Rather it starts with questions of program design: “How could a particular, already existing, targeted cash grant anti-poverty program in Niger be more (cost) effective at producing gains for its beneficiaries?” That is a great question but also a question that is a vanishingly small component of the first two questions.

Let me start with a graph I made from findings from the paper of the 18 month impact of different program designs (in addition to the cash grant received by the control group) on consumption per equivalent adult (Extended Data Table 1). This illustrates the basic finding of the paper which is that adding the psycho-social elements to the program alone (without a cash grant) raised household consumption from $1.70 a day per equivalent adult to $1.88 a day per equivalent adult (they also show impacts on a wide array of other indicators). Since, as they show in Extended Data Table 9, the psychosocial components were inexpensive this meant they had an astronomical benefit/cost ratio.

I am now going to show those exact results framed with three additional pieces of information.

(And these consumption numbers are not exactly comparable because, while all are PPP, they are not deflated to the same year and there are differences between “per equivalent adult” and “per person.” But this is a blog, not a paper, and for the magnitude of what I am illustrating these differences are almost certainly de minimis.)

First, this program clearly raised the consumption of poor households in a poor country, but how much did the program reduce headcount poverty? Well, that completely depends on the poverty line that one uses. The World Bank, for instance, now reports three different poverty lines, which one could call “extreme poverty” ($1.90 per day per person), “poverty” ($3.20 per day per person) and, at a higher line, something called, perhaps, “global poverty” ($5.5 per day per person).

The program, by having a better design, raised the consumption of the beneficiaries by 25 cents per day, pushing the average from below the extreme poverty line for the control group to just above the extreme poverty line ($1.90/day) for the full treatment group. However, the extreme poverty line is more or less a completely arbitrary number and many other poverty lines are just as defensible (including poverty lines much higher than $5.50/day). At the $5.50 line the control group is $3.80 below the poverty line and an increase of 25 cents a day eliminates only 6.6 percent of the gap to be out of poverty.
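The gap arithmetic can be checked in a few lines. This is a minimal sketch that uses only the figures quoted in the text (control consumption of $1.70/day, a 25 cent gain, and the three World Bank lines); the function name `share_of_gap_closed` is hypothetical.

```python
# Figures quoted in the text, in PPP dollars per day.
control_consumption = 1.70   # control group average
program_gain = 0.25          # gain from the full program
poverty_lines = {"extreme": 1.90, "poverty": 3.20, "global": 5.50}

def share_of_gap_closed(consumption, gain, line):
    """Fraction of the shortfall below `line` that `gain` eliminates (capped at 1)."""
    gap = line - consumption
    if gap <= 0:
        return 1.0  # already at or above the line
    return min(gain / gap, 1.0)

for name, line in poverty_lines.items():
    share = share_of_gap_closed(control_consumption, program_gain, line)
    print(f"{name:8s} line ${line:.2f}/day: {share:.1%} of the gap closed")
```

The same 25 cents closes the whole gap at the lowest line and only a sliver of it at the highest, which is the sense in which the headcount result depends entirely on the line chosen.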

Of course the obvious response is that, at Niger’s level of per capita income and hence available resources, it is impossible to fund any program that could raise consumption by that much. But that is exactly my point. My point is that there is the question: “What could Niger do to reduce poverty over the medium to long run?” and then there is the question: “What can pragmatically be done with the available resources with the best designed targeted anti-poverty programs?” The former question is far and away the most interesting and the answer to the latter question, with respect to any reasonable and humane poverty level, is: “not much.”

A different variant is to compare the gains in per person consumption of the “full program” treatment group to the gains for the median household from extended episodes of rapid growth. Again we see the terrific and cost-effective gain from $1.70/day to $1.95/day. In the 25 years from 1993 to 2018 the consumption of the median household in Vietnam went from about the level of those in the study in Niger, $1.87 in 1993, to $8.87 in 2018. This increase of $7/day is 28 times larger than the gain from the full program. Economic growth in Vietnam essentially eliminated extreme poverty, drastically reduced poverty, and brought even global poverty ($5.5/day) to well under half the population in 2018.
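The Vietnam comparison is simple arithmetic, sketched here using only the figures in the text. The implied annual growth rate is a derived number, not one reported in the source, and the variable names are illustrative.

```python
# Figures from the text, in PPP dollars per day.
program_gain = 0.25                       # full-program gain in the Niger study
vietnam_1993, vietnam_2018 = 1.87, 8.87   # median household consumption
years = 2018 - 1993

growth_gain = vietnam_2018 - vietnam_1993
ratio = growth_gain / program_gain

# Derived, not from the source: the constant annual growth rate that
# would take the median from the 1993 level to the 2018 level.
implied_annual_growth = (vietnam_2018 / vietnam_1993) ** (1 / years) - 1

print(f"Growth gain: ${growth_gain:.2f}/day, {ratio:.0f} times the program gain")
print(f"Implied annual growth of the median: {implied_annual_growth:.1%}")
```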

The rural population of China went from having median consumption only about half that of the Niger study ($.90/day) in 1981 to $6.52 in 2016. Gains in rural Indonesia between 1984 and 2019 were similar, though not as dramatic, and still dwarf the programmatic impacts.

Again, I anticipate the response of “why not do both?” Of course, I am all for that. I am all for the funding of cost-effective targeted anti-poverty programs. But while it is optimal to do both, we development economists should keep in mind that sustained economic growth is empirically necessary and empirically sufficient for reducing poverty (at any poverty line) whereas targeted anti-poverty programs, while desirable, are neither necessary nor sufficient. Advocates of poverty programs say things like “growth is not enough” or that poverty programs are “equally important” as economic growth but these claims are just obviously false.

(This is well illustrated by the graph from Pritchett 2020 showing the very, very strong association of poverty headcount, at three poverty lines, and the median income/consumption of the country/year of the poverty estimate. At the “full program” consumption of $1.95/day the annual consumption would be $711 and countries with that level of median consumption all have $5.5/day poverty above 85 percent of the population.)

The third comparison is to cross-national differences in consumption of the median household. Development was always about the “developing” countries reaching the same levels as the “developed” and so one might ask, “how far along the spectrum of development does the program move households?” The additional 25 cents a day takes the targeted household about 17 percent of the way to the median level in a poor country like Bangladesh, only about 2 percent of the way to the median for a country like Brazil and an inconsequentially small fraction of the way to the level of a typical (median) person in a rich country like Denmark. Even the best targeted programs are just a tiny slice of development.

The paper shows that adding “psychosocial” elements–“life skills training” or “community sensitization on aspirations and social norms”–to anti-poverty programs might be a cost-effective addition with a high rate of return. But that is not how, in fact, poverty has been substantially reduced in any country, ever. The massive reductions in poverty that we have seen both historically in the rich countries and more recently in the developing world have come through raising the productivity of the place so that individuals can use their resources to generate higher levels of income.

Moreover, it makes me very, very nervous that the paper can and will be used (wrongly) to claim “since psychosocial interventions reduce poverty, it must be that poverty was caused by a lack of psychosocial skills.” People in Niger are poor primarily because of the very limited choices they have, not primarily because of the choices they make.

I am praising this paper. It is an excellent RCT impact evaluation that reveals interesting things about the design of targeted anti-poverty programs. But since it is such an excellent paper it reveals the deep and inherent limitations of this line of research. Better design of targeted anti-poverty programs is not an anti-poverty strategy, it is, at best, just one useful tactic.

Among the national governments–and populations–of the countries I have lived and worked in, the question about poverty and anti-poverty programs is just one question–and frankly, not their most important question. Their question to me is usually more of the type “How can my country come to have the levels of productivity, material prosperity, and wellbeing that countries like yours enjoy?” That, to my mind, is the development question.

Economic Growth in Five Figures (one with five variants)

In this piece (which is more than a blog and less than a paper) I support the claim that more rapid and sustained economic growth should be acknowledged as a (perhaps even “the”) key objective of “development.”  All development actors should acknowledge this—governments, international agencies, bilateral agencies, development banks, development academics, (development) NGOs, philanthropists.  Even if an organization or individual decides that promoting more rapid sustained economic growth isn’t their organization’s comparative advantage and/or priority and/or cup of tea, they still should acknowledge growth as an important and legitimate objective of development efforts.

An important element of my argument is separating whether economic growth should be a priority for “developed” or “rich” or “rich industrial” countries from the question of whether it should be a priority for poorer countries. The Organization for Economic Cooperation and Development (OECD) has a large project researching other policy objectives to supplant the (supposed) dominant position of economic growth. More strikingly, the Prime Minister of New Zealand, Jacinda Ardern, has explicitly rejected GDP and economic growth as objectives for her government. As I make clear below, I take no issue with the stances developed countries take towards economic growth for themselves. But preferences are not priorities, and developed countries can recognize that further economic growth of income from its already high levels is not a priority for their country and yet, at the same time, acknowledge that economic growth is a (perhaps “the”) key objective for poorer countries and hence support development activities that promote more rapid and sustained economic growth.

Graph One:  The Hockey Stick with a Box Plot

Figure 1 uses recently updated Angus Maddison-style estimates of GDP per capita from Bolt and van Zanden (2020) to compare the historical evolution of the GDPPC of the “leading” economies versus the GDPPC of the developing countries in 2019.  GDPPC is in 2011 Purchasing Power Parity units which I call “M$” (for “Maddison-style PPP dollars”).

The orange line in the figure shows the historical trajectory of GDPPC of the highest three industrial countries (hence excluding the oil rich countries and a few small outliers, e.g. Luxembourg). The level of the three leading economies in 1 CE (common era), which were Italy, Greece, and Egypt, is shown at M$1266. The average of the three leading economies in 1850 (which were Great Britain, Netherlands, USA) is shown at M$3914. This shows that the increase in the productivity frontier, the GDPPC of the leading countries, progressed very slowly throughout history: the level in 1850 was only about threefold higher than the level when Caesar Augustus ruled the Roman Empire, a compound growth rate of only .06 ppa (percent per annum).
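The compound growth rate quoted above follows from the two endpoint levels. Here is a minimal sketch of the calculation using the M$1266 and M$3914 figures from the text; the helper names are hypothetical and the doubling times are derived for illustration, not taken from the source.

```python
import math

def compound_growth_rate(start_level, end_level, years):
    """Constant annual growth rate that takes start_level to end_level."""
    return (end_level / start_level) ** (1 / years) - 1

def doubling_time(annual_rate):
    """Years needed to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_rate)

# Leading economies: M$1266 in 1 CE, M$3914 in 1850 (figures from the text).
pre_modern = compound_growth_rate(1266, 3914, 1850 - 1)
print(f"1 CE to 1850: {pre_modern * 100:.2f} percent per annum")  # about 0.06

# Compare how long a doubling takes at the pre-modern rate and at the
# "modern" rate of around 1.7 percent per annum.
print(f"Doubling time before 1850: {doubling_time(pre_modern):.0f} years")
print(f"Doubling time at 1.7 ppa:  {doubling_time(0.017):.0f} years")
```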

Since sometime around 1850 “modern” sustained exponential growth of around 1.7 percent per annum (ppa) kicked in in the leading countries of the world and the graph shows the evolution of the three leading GDPPC countries in the world from 1870 to 2018 (which countries those were changed over time).   

At the right edge of the graph I show the box-plot of the distribution of 2018 GDPPC for 112 developing countries—defined for this graph as those with GDPPC less than M$21,000.  The box plot shows the 90th, 75th, 50th (median), 25th, and 10th percentiles of GDPPC. 

I label the 2018 GDPPC only for the 20 largest population countries (because labeling all countries gets visually messy). 

This somewhat unusual combination of a “time series” diagram and a “cross-section” diagram allows the visual comparison of (roughly) current GDPPC of developing countries to the level of GDPPC of the historically leading countries.

Three observations.

First, the typical poor country has a level of GDPPC that the leading countries had more than a century ago.  The median developing country GDPPC is M$6814 (roughly the 2018 level of Vietnam and India) which is substantially below that of the leading countries in 1900. 

Second, the poorest countries of the world (e.g. Ethiopia, Democratic Republic of Congo (COD), Niger) are at levels of GDPPC comparable to, or lower than, those of Egypt during the Roman Empire 2000 years ago.

Third, three quarters of developing countries have a level of GDPPC lower than the leading countries had in 1950: the 75th percentile (e.g. around China, Indonesia, Egypt) is M$12310. The nature of exponential growth is that the same growth rate produces larger and larger absolute increases, so the leaders in the year 2000 were at M$47,000, which is M$33,000 ahead of where they were in 1950. Progress in absolute gains from 1 CE to 1900 was only about M$6200, so there was about five times more gain from 1950 to 2000 than from the time of Caesar to the Victorian era.
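The point about exponential growth producing ever larger absolute increments can be illustrated directly. This is a sketch under stated assumptions: the starting levels are GDPPC figures quoted in the text, but applying a constant 1.7 percent rate over 50 years to each is purely illustrative, and the helper name is hypothetical.

```python
def absolute_gain(level, annual_rate, years):
    """Absolute increase in level after compounding at annual_rate for years."""
    return level * ((1 + annual_rate) ** years - 1)

# Illustrative assumption: the same 1.7 percent rate for 50 years,
# applied to three starting levels quoted in the text (in M$).
rate, years = 0.017, 50
starting_levels = (1266, 3914, 12310)
gains = [absolute_gain(level, rate, years) for level in starting_levels]
for level, gain in zip(starting_levels, gains):
    print(f"Starting at M${level}: absolute gain of about M${gain:,.0f}")

# Ratio of the two absolute gains quoted in the text.
gain_1950_to_2000 = 33_000
gain_1ce_to_1900 = 6_200
ratio = gain_1950_to_2000 / gain_1ce_to_1900
print(f"1950-2000 gain is {ratio:.1f} times the 1 CE-1900 gain")
```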

The main point of this graph is that it would be expected and natural for the currently high income countries to give a very different priority to further economic growth than the currently poorer countries, as their current level of GDPPC is so high relative to both their own history and other countries today. Four quick observations:

  • In 1900 there was little to no discussion in the now leading countries suggesting that economic growth was not desirable and needed; rather, the view that economic growth was not a high priority seems to have emerged gradually. The American politician Robert F Kennedy gave a very famous speech in 1968 in which he attacked GDP as a goal. In 1968 GDPPC was already M$23,691, well above any currently developing country and 3.5 times higher than the median developing country in 2018.
  • Even at the currently very high levels of GDPPC of M$40,000 the debate is whether growth of GDPPC should receive less weight or be “not a priority,” but no leading politician has proposed adopting a policy of zero economic growth.
  • No leading politician in any advanced country is suggesting that it would be desirable if GDPPC fell, even a tiny bit. In the massive financial crisis of 2007-2009 personal consumption expenditures fell from US$33,001 in 2008 to US$32,194 in 2009 and this was considered a political catastrophe and every available effort was made to reverse that decline.
  • Practically no one (or no one practical, which is often the same thing) is proposing reducing GDPPC to the current levels of any of the developing poor countries. While New Zealand might not be enshrining GDPPC growth in its current priorities, neither is it suggesting a return to its 1950 level, where the now poor countries are.

Graph Two:  Median (typical) household income explains essentially all poverty differences across countries (in levels and over long episodes)

There is a relatively large literature about the relationship between “economic growth” (taken as the growth of GDP per capita) and standard Foster, Greer, Thorbecke (1984) measures of poverty. This includes a set of papers by David Dollar and Aart Kraay (with others) (Dollar et al., 2016; Dollar et al., 2015; Dollar & Kraay, 2002) and a very recent paper by Bromberg (2022). What the World Bank reports as the poverty rate (or number of people in poverty) is “headcount” poverty (FGT(α=0)) for a given poverty line, using data on household income or consumption (depending on what is available; most very poor countries do not (cannot?) measure income reliably).

The relationship between the headcount poverty rate and GDPPC combines (at least) three empirical issues:

  • How much of GDPPC is consumption expenditures
  • Whether the consumption expenditures measured in national accounts accord well and/or track with measured consumption expenditures in household surveys. For instance, in India there has been a very large and persistent difference between the growth rate of personal consumption expenditures (PCE) in the national accounts and the growth of average consumption expenditures in the household surveys used to measure poverty rates, leading to very different views on poverty rates (e.g. Bhalla, Bhasin, and Virmani 2022).
  • National accounts PCE per person is a mean and hence changes in inequality at the upper end of the distribution can affect growth of the mean without changes in other measures of the central tendency of the distribution, like the median.

The graphs here use a different concept of “economic growth,” which is the growth of the median of the distribution of consumption/income that the World Bank uses to compute poverty rates. So, rather than ask “How much of the variation in the poverty rate across countries and time is associated with variation in GDPPC across countries and time?” I ask, “How much of the variation in the poverty rate across countries and over time is associated with variation in the level or growth over time of the consumption of the typical household (which is an alternative measure of the central tendency of the distribution)?”

I use the raw data from the World Bank’s PovCalNet to create this graph.

  • I do the graph for the three poverty lines the World Bank routinely reports (P$1.9/day, P$3.2/day and P$5.5/day) where all are in purchasing power parity units (the adjustment of which can lead to changes in measured poverty).
  • In order to create relatively complete and timely reporting of poverty rates even though the raw household data collection is sporadic the World Bank “fills in” poverty rates.  In my graph I only use the country/year poverty and median data that are near to the date of an actual survey. 
  • I use only those country/year data based on consumption data (not income data, which may or may not fully reflect the post tax and transfer distribution).

The first figure shows the connection between the level of the headcount poverty rate and the median for 189 country/year observations. In order to allow the association to be as flexible as possible I use a functional form for median consumption with powers from -2 to +5, since the connection between poverty rates and the median is analytically non-linear (Bromberg 2022). I truncated the figure at P$8,000 as above that level poverty is essentially zero for all poverty lines.

The second figure examines changes over time.  With a highly non-linear functional form one cannot just run a “changes on changes” regression.  Rather the graph calculates the predicted poverty rate at the beginning and end of the data for each country using the level estimates and then shows the association between the actual change in poverty rate and the change in the predicted poverty rate on the assumption of a stable non-linear relationship over time.
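The two steps just described (a flexible levels fit, then predicted changes from that fit) can be sketched as follows. This is not the estimation code behind the figures: the data are synthetic, generated from an assumed lognormal consumption distribution, and the function names, the rescaling by 1,000, and the dispersion parameter are all assumptions made for illustration.

```python
import math
import numpy as np

POWERS = range(-2, 6)  # exponents -2, -1, 0, ..., 5 (power 0 is the intercept)

def design_matrix(median):
    """One column per power of median consumption (rescaled for numerical stability)."""
    m = np.asarray(median, dtype=float) / 1000.0
    return np.column_stack([m ** p for p in POWERS])

def fit_levels(median, poverty):
    """Least-squares fit of the poverty rate on the powers of the median."""
    coef, *_ = np.linalg.lstsq(design_matrix(median), poverty, rcond=None)
    return coef

def predict(coef, median):
    return design_matrix(median) @ coef

def r_squared(actual, fitted):
    ss_res = np.sum((actual - fitted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1 - ss_res / ss_tot

def predicted_change(coef, median_start, median_end):
    """Change in predicted poverty implied by a stable levels relationship."""
    return predict(coef, median_end) - predict(coef, median_start)

# Synthetic country/year data (NOT the World Bank data): annual median
# consumption in P$, with headcount poverty at P$1.9/day generated from an
# assumed lognormal distribution plus noise.
rng = np.random.default_rng(0)
median = rng.uniform(300, 8000, size=189)
line = 1.90 * 365
sigma = 0.6  # assumed dispersion of log consumption
z = (np.log(line) - np.log(median)) / sigma
poverty = 0.5 * (1 + np.vectorize(math.erf)(z / math.sqrt(2)))
poverty = np.clip(poverty + rng.normal(0, 0.02, size=median.size), 0, 1)

coef = fit_levels(median, poverty)
r2 = r_squared(poverty, predict(coef, median))
print(f"R-squared of the levels fit on synthetic data: {r2:.3f}")
print(predicted_change(coef, [500.0], [3000.0]))  # predicted fall in the poverty rate
```

The "changes" figure then plots actual changes in poverty against what `predicted_change` returns for each country's first and last observation, rather than regressing changes on changes.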

Source:  Pritchett 2020, Figures 3 and 4.

The finding that emerges clearly is that essentially all of the variation across countries/over time in the reported World Bank headcount poverty rates is due to variations in the income of the typical (median) household in the country/year. An R-squared of .988 is practically unheard of, as measurement error in either the left hand or right hand side variables lowers the R-squared. (For instance, in Filmer and Pritchett 1999 we show that if one uses different years of the same data source, the Demographic and Health Survey (DHS), to measure child mortality of the same cohort of women, recall error produces an upper bound on the R-squared of the cohort child mortality rates of only .95.)

Some might object that this result is “baked in” as the estimated poverty rate is just the partial integral up to the poverty line of the same distribution used to compute the median so the difference cannot be that big. But since I am using only consumption data the data is, in principle, a post tax and transfer measure so that any programs that raised the consumption of those below the poverty line relative to the median would in fact cause a deviation of poverty from the pre-tax and transfer distribution of income.

The impact of targeted anti-poverty programs therefore should be reflected in this graph and hence the graph of levels suggests that at the very most 1.2 percent of the cross national variation in poverty rates could be due to the effects of programs that affected “the poor” without affecting the median.

This raises two points that help produce this very strong result.

First, in many cases the poverty line is above the median HH income and hence the poverty rate is higher than 50 percent. (One can see in the graph that the association gets very tight when the headcount poverty rate is near .5 for each poverty line.)

Of course, poverty measures that reflect the “intensity” of poverty, such as the FGT(α=1) “poverty gap” or FGT(α=2) “squared gap” measures, could be affected by anti-poverty programs. But these are rarely reported or used (more on that below).
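To make the distinction concrete, here is a minimal sketch of the standard Foster, Greer, Thorbecke family, in which the measure is the average over all households of the normalized shortfall raised to the power α. The household consumption figures and the transfer amount are hypothetical, chosen only to show a transfer that leaves the headcount unchanged while lowering the gap measures.

```python
def fgt(consumption, line, alpha):
    """FGT(alpha): mean over all households of ((line - y) / line) ** alpha,
    counting only households with consumption y below the line."""
    total = 0.0
    for y in consumption:
        if y < line:
            total += ((line - y) / line) ** alpha
    return total / len(consumption)

line = 1.90  # dollars per day
# Hypothetical household consumption, dollars per day.
before = [1.00, 1.50, 1.70, 2.50, 4.00]
# Hypothetical transfer of 15 cents a day to every household below the line.
after = [y + 0.15 if y < line else y for y in before]

for alpha, label in [(0, "headcount"), (1, "poverty gap"), (2, "squared gap")]:
    print(f"{label:12s} before {fgt(before, line, alpha):.3f}  "
          f"after {fgt(after, line, alpha):.3f}")
```

In this example nobody crosses the line, so the headcount measure registers nothing even though every poor household is better off.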

Second, the results suggest that the magnitude, efficiency, and efficacy of anti-poverty programs are a very, very small part of the level of, or change in, poverty over long episodes. My argument is that, on reflection, this should strike all of us engaged in development as quite “intuitive” on each aspect.

  • The magnitude of targeted anti-poverty programs in poor countries is going to be limited by the ability to mobilize tax revenues (and poor countries consistently have lower tax/GDP ratios (Burgess and Stern 1993, Besley and Persson 2013, Jensen 2019)) and by the many competing demands for those scarce fiscal resources (for external security and law and order, for infrastructure (roads, power, water, sanitation), for education, for health, for regulation, for administration, etc.).
  • The fiscal efficiency of a targeted anti-poverty program can be measured as the share of the dollars allocated in the budget that reach the program activities and benefit the targeted individuals. One well known fact is that generally measured “state capability” is lower in poorer countries, so one can easily doubt that either of two tasks is going to be a strong suit of a poor country government: (a) identifying the poor, which is administratively demanding even in a static sense (and static targeting is not very effective at reaching the poor as the poverty status of households changes over time (Jalan and Ravallion 1998, Sumarto, Suryahadi and Pritchett 2000), while dynamic targeting is very demanding); and (b) ensuring against leakage, of at least three types: (i) corruption benefiting rent seeking government (political or administrative) officials, (ii) excess costs of administration, and (iii) mis-targeting of benefits.
  • The efficacy can be measured as the magnitude of the (sustained?) gain to the household from the activities. It has been a major research agenda to investigate whether common anti-poverty activities like micro-credit or business training are actually effective. While some activities have been demonstrated to be effective in some (but not all) contexts when implemented by NGOs (e.g. “graduation” style programs, Banerjee et al 2015), they are quite complex in design (and that complexity appears to be essential to success) and implementation by governments is by no means assured.

The figures just show that, in the data available so far, “economic growth,” taken to mean the growth of the central tendency (median) of the distribution of consumption, has been very strongly empirically necessary and empirically sufficient for headcount poverty reduction.

This is not to say that governments (or NGOs or philanthropists) cannot reduce poverty through greater fiscal allocations, greater efficiency or greater efficacy, nor that research (including using rigorous methods) might not contribute to that. But this is likely to be a very, very small part of the overall dynamics of poverty.

One additional figure, which I will not count as one of the “five,” is, in some sense, the micro, qualitative counterpart of the macro figure. In the major study Moving Out of Poverty (vol 2): Success from the Bottom Up, a “ladder of life” community focus group was used to identify those in the village who had, in the last 10 years, moved out of poverty. These households, identified by their neighbors as having moved out of poverty, were then interviewed about their own narrative of how they moved out of poverty. A figure from this study shows the distribution of the responses among the almost 4,000 interviewees. In their own narratives, undertaking a new initiative (either outside of agriculture (60.1%) or in agriculture (17.4%)), hard work (5.5%), asset accumulation (4.7%), or increased community prosperity (1.6%) accounted for nearly all the moves out of poverty.

Source:  Narayan, Pritchett and Kapoor (2009)

Graph 3:  National Development Delivers

One of the elements of the push back (mostly in currently rich countries) against economic growth as a key policy objective is that it is not tightly connected to human wellbeing. For instance, an organization called the Social Progress Imperative has proposed that development efforts should be guided by non-economic measures of progress and has created for that purpose a Social Progress Index (SPI).

We dream of a world in which people come first. A world where families are safe, healthy and free. Economic development is important, but strong economies alone do not guarantee strong societies. If people lack the most basic human necessities, the building blocks to improve their quality of life, a healthy environment and the opportunity to reach their full potential, a society is failing no matter what the economic numbers say. The Social Progress Index is a new way to define the success of our societies. It is a comprehensive measure of real quality of life, independent of economic indicators. 

The SPI is an aggregate of three components, each of which itself has four elements:

Basic Human Needs is the average of (i) Nutrition and Basic Medical Care, (ii) Water and Sanitation, (iii) Shelter, and (iv) Personal Safety

Foundations of Wellbeing is the average of (i) Access to Knowledge, (ii) Access to Information and Communications, (iii) Health and Wellness, and (iv) Environmental Quality

Opportunity is the average of (i) Personal Rights, (ii) Personal Freedom of Choice, (iii) Inclusiveness, and (iv) Access to Advanced Education.

My argument is that development efforts have routinely been predicated on the idea that “development” is a four-fold transformation of economy, administrative capability, polity, and society at the country level and that, if successful, higher levels of national development will lead to higher levels of human well-being.

This leads to an empirical question. Suppose I regress the SPI across countries on three measures of national development: GDPPC, state capability, and democracy (as a proxy for “polity”). How much of the variation in SPI will be captured by just these three measures of national development? Pritchett (2022) shows that the R-squared of regressing SPI on these three elements of national development is .9 (since the R-squared is the square of the correlation coefficient this implies the correlation of actual SPI and a national development index constructed using the regression weights is .95).
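The link between the R-squared and the correlation with a regression-weight index can be verified numerically. The sketch below uses synthetic data, not the actual SPI or development measures; the three regressors, their weights, and the noise level are hypothetical. It builds the index as the fitted values of an OLS regression and confirms that the squared correlation equals the R-squared.

```python
import numpy as np

# Synthetic stand-ins (NOT the real data) for three standardized measures
# of national development and an SPI-like outcome.
rng = np.random.default_rng(1)
n = 150
X = rng.normal(size=(n, 3))
spi = 60 + X @ np.array([8.0, 5.0, 3.0]) + rng.normal(0, 4, size=n)

# OLS with an intercept; the fitted values serve as the regression-weight index.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, spi, rcond=None)
ndi = A @ beta

ss_res = np.sum((spi - ndi) ** 2)
ss_tot = np.sum((spi - spi.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
corr = np.corrcoef(spi, ndi)[0, 1]
print(f"R-squared {r2:.3f}, squared correlation {corr ** 2:.3f}")

# Applied to the figure quoted in the text: an R-squared of .9
# implies a correlation of about .95.
implied_corr = 0.9 ** 0.5
print(f"Implied correlation: {implied_corr:.2f}")
```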

I invented a graph to illustrate the connection between the SPI and the NDI (national development index, which is the regression weight index of the three elements of national development), which is an “envelope” graph because the envelope shape completely encloses all of the country experiences.

The lower bound of the envelope is the worst SPI for any country with that level of NDI or higher. For instance, India (IND) has the lowest SPI of any country with its level of NDI or higher: there are countries with higher SPI at its level of NDI, but only countries with lower NDI have lower SPI.

The upper bound of the envelope is the highest SPI for any country with that level of NDI or lower. For instance, Nepal (NPL) has a high SPI with a low NDI. There are countries with higher SPI, but only those with higher NDI.
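The two bounds are a running minimum and a running maximum over countries sorted by NDI. The following is a minimal sketch of that construction, not the code behind the figure; the function name and the (NDI, SPI) pairs are hypothetical, and it assumes no two countries share exactly the same NDI.

```python
def envelope(points):
    """Given (ndi, spi) pairs, return (lower, upper) bounds at each country's NDI.

    upper: highest SPI among countries with that NDI or lower.
    lower: lowest SPI among countries with that NDI or higher.
    """
    pts = sorted(points)

    upper, best = [], float("-inf")
    for ndi, spi in pts:              # running maximum, moving up in NDI
        best = max(best, spi)
        upper.append((ndi, best))

    lower, worst = [], float("inf")
    for ndi, spi in reversed(pts):    # running minimum, moving down in NDI
        worst = min(worst, spi)
        lower.append((ndi, worst))
    lower.reverse()

    return lower, upper

# Hypothetical (NDI, SPI) pairs, for illustration only.
countries = [(20, 35), (30, 50), (40, 42), (60, 70), (70, 80), (85, 78), (95, 90)]
lower, upper = envelope(countries)
print("lower:", lower)
print("upper:", upper)
```

A country sits on the lower edge when its own SPI equals the lower bound at its NDI, and on the upper edge when it equals the upper bound; everything outside the two step functions is the empty space in the graph.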

The attractive feature of an enveloping graph is that the white space is meaningful as it illustrates the combinations of SPI and NDI that have not happened.  This illustrates how empirically necessary and empirically sufficient NDI is for achieving human wellbeing (as proxied by SPI). 

NDI is strongly empirically necessary for high levels of human wellbeing. The empty “northwest” of the graph shows that no country has high SPI with low NDI. For instance, Argentina (ARG) has a level of SPI around 80 and is on the upper range of the envelope. Argentina’s NDI is around 70 and no country has achieved an SPI above 80 with NDI below 70.

NDI is also strongly empirically sufficient for high levels of human wellbeing.  The empty “southeast” of the graph shows that countries do not have high GDPPC, strong state capability and democracy and still have low levels of human wellbeing.   Malaysia (MYS) for instance, does have SPI much lower than other countries at similar NDI (such as Spain or Korea) but its SPI is still higher than the SPI for any country with NDI of 60 or less. 

Even with the measures of human wellbeing proposed by advocates who are working against the supposed current “over emphasis” on growth, national development delivers.  It is just not the case that countries get to high levels of overall, omnibus, wellbeing without national development nor do countries achieve high levels of national development and not have high levels of human wellbeing.

Graph 4:  Basics and GDPPC

The Social Progress Index is one possible aggregate index of human wellbeing, but one might legitimately be concerned about a narrower indicator of “the basics”—elements of human wellbeing that are prioritized by people with low incomes.

In Pritchett and Lewis (2022) we examine a wide variety of ways to construct a country level measure of the basics of human wellbeing (BHWB) that covers a variety of dimensions (health, education, water and sanitation, infrastructure and housing conditions, poverty, natural environment). We show that no matter how one builds an index of basics the relationship between BHWB and GDPPC is roughly like this figure, for which the indicator of “basics” is built from a set of 82 potential wellbeing indicators from the Legatum Prosperity Index, each of which is scaled so that the worst country is 1 and the best country outcome is 100 (so this puts all indicators on a common scale but preserves the cardinality of each indicator).
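The scaling step can be sketched as a linear rescaling of each indicator. This is an illustration under stated assumptions, not the procedure from Pritchett and Lewis (2022): the function names, the `higher_is_better` flag, the simple averaging across indicators, and the example values are all hypothetical.

```python
def scale_1_to_100(values, higher_is_better=True):
    """Linearly rescale so the worst outcome maps to 1 and the best to 100."""
    worst = min(values) if higher_is_better else max(values)
    best = max(values) if higher_is_better else min(values)
    return [1 + 99 * (v - worst) / (best - worst) for v in values]

def basics_index(indicator_columns):
    """Assumed aggregation: simple average of the scaled indicators per country."""
    n_countries = len(indicator_columns[0])
    return [
        sum(col[i] for col in indicator_columns) / len(indicator_columns)
        for i in range(n_countries)
    ]

# Hypothetical indicators for three countries.
school_completion = [40.0, 70.0, 100.0]   # percent, higher is better
child_mortality = [90.0, 50.0, 10.0]      # per 1,000, lower is better

scaled = [
    scale_1_to_100(school_completion, higher_is_better=True),
    scale_1_to_100(child_mortality, higher_is_better=False),
]
print(basics_index(scaled))
```

Because the rescaling is linear, distances between countries on each indicator are preserved, which is what "preserves cardinality" means here; a rank transformation would not do that.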

There are two important features of this graph.

First, the relationship is strongly non-linear: BHWB rises very steeply with increases in GDPPC, then tapers off, and above a high level (say, the 80th percentile of countries, at GDPPC of (roughly) P$40,000) it is essentially flat. (This non-linear relationship is shown both with an OLS regression with a quartic in GDPPC and with a non-parametric, robust statistic, the rolling median.)
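A rolling median is easy to sketch: sort countries by GDPPC and take the median outcome in a window of neighbors around each one. The window width, the handling of the edges, and the function name below are assumptions for illustration and not the specification used for the figure.

```python
from statistics import median

def rolling_median(xs, ys, window=5):
    """Median of y among the `window` observations nearest in x-rank
    (the window is truncated at the two ends of the sorted data)."""
    pairs = sorted(zip(xs, ys))
    half = window // 2
    result = []
    for i in range(len(pairs)):
        lo = max(0, i - half)
        hi = min(len(pairs), i + half + 1)
        window_ys = [y for _, y in pairs[lo:hi]]
        result.append((pairs[i][0], median(window_ys)))
    return result

# Hypothetical GDPPC ranks and outcome values, with one outlier (50).
smoothed = rolling_median([1, 2, 3, 4, 5], [10, 50, 20, 40, 30], window=3)
print(smoothed)
```

Unlike a fitted quartic, the rolling median imposes no functional form and is not pulled around by a single outlying country, which is why it is a useful robustness check on the shape of the curve.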

Second, the relationship is “tight” in the sense that the association is very strong.

The same “envelopment” curve approach as in the previous graph shows that not just national development but GDPPC alone is strongly sufficient for BHWB.

The one country not included in the envelopment is Equatorial Guinea (GNQ), which is the exception that proves the rule: GNQ has high GDPPC because of oil production, but given that GNQ has had a horrific government (since its independence from Spain in 1968 it has had two dictators, uncle and nephew) this high level of GDPPC has not translated into benefits for the population.

GDPPC is also “empirically necessary” for high levels of BHWB.  Every country in the bottom 20 percent of GDPPC has very low levels of basics.  At middling levels of GDPPC there is more variation, but it is still the case that no country with low GDPPC achieves the high level of BHWB characteristic of all of the OECD countries.  For instance, Cuba is often cited as a country that achieves high levels of wellbeing at low levels of income, and indeed it does, but it is still substantially below the level of every OECD country.

The very important implication of this graph is that “preferences don’t determine priorities.”  It would be perfectly natural for a country at the median level of GDPPC to be highly focused on rapid and sustained economic growth in order to provide the material basis for achieving high levels of the basics of human wellbeing.

And, by the same token, countries at very high levels of GDPPC, like, say, New Zealand, might decide that there are other, higher priorities for wellbeing than economic growth and that they already have the high levels of economic productivity and material conditions to solve their problems.

But, what would be a massive mistake would be for people in New Zealand (or any other high income country) to conclude that since their priority was not on economic growth that other countries, in radically different material conditions and radically different levels of the basics, should not prioritize economic growth. 

Graph 5:  Growth Incidence (with five variants)

So far I have not used any of the very popular adjectival modifiers often appended to growth that characterize both the level of growth and its distribution–like “inclusive” or “pro-poor.”   The last graph (which will need five variants just to convey the richness of what it can convey) addresses simultaneously:  (i) the pace of growth, (ii) the level of income from which growth begins, (iii) the “growth incidence” which is the pace of growth for each of the deciles of income/consumption and (iv) the normative valuation of incremental growth across levels of income.

For this figure I again use the World Bank PovCalNet data and I use the longest possible data span (with survey based estimates, not extrapolated) for each country.  I am not showing all possible countries, just countries selected to illustrate analytic features.

The horizontal axis is the level of income/consumption at the beginning of the episode for each country. 

Then, for each country I show (left vertical axis) a standard “growth incidence” curve which is the percentage rate of growth of income/consumption (again, nearly always consumption for poor countries, nearly always income for middle income and rich countries) for each decile.

On the right axis I show various indicators that are relevant to the normative valuation of the growth of income at any given level of income.  In the “basic” figure I show (i) the Engel curve that shows the predicted level of share of food in total consumption at each level of income and (ii) from the regressions of basics on GDPPC shown in the graph above (Pritchett and Lewis 2022) I show the elasticity of BHWB wrt GDPPC at each level. 

Fifth Figure, Variant 1:  The rich of the poor are poorer than the poor of the rich

One common justification for reducing the priority to economic growth is that it just benefits “the rich” and “the rich get richer and the poor get poorer.”

This can be misleading and confusing as often people saying things like this are not clear on whether they are using the word “the rich” in a consistent way.  Often people characterize “the rich” in relative terms to their own country and not in a consistent way comparing people across countries.

This confusion leads people who are “progressive” to easily end up in a situation in which they are very much in favor of raising the income/consumption of one group (“the poor” of rich countries) but seem reluctant to support increases in income/consumption of people who are, in absolute terms, much poorer.

For instance, the average income (and this is income, not consumption) of the lowest decile in Denmark in 1987 was P$5753.  Among the global policy crowd European countries like Denmark are often roundly praised for the variety of programs that raise the consumption of “the poor” in Denmark so that their post-tax and transfer consumption can be much higher than their pre-tax and transfer income. 

But the average income of the richest decile in Bangladesh in 1984 was P$1908.  And since these are “purchasing power parity” comparisons they are adjusted for the fact it is cheaper to live in poor countries, so this is at least meant to compare people’s purchasing power directly in absolute terms.  So “the rich” of a poor country like Bangladesh are a factor of 3 poorer than the poor of a rich country.

Many people are skeptical about this fact and doubt (often with no good reason) that PPP figures really compare standards of living.  But Pritchett and Spivack (2013) use data on the food share in consumption, which involves no use of exchange rate comparisons of any kind, and the Engel curve to show that the difference in food share between “the rich of the poor” and the “poor of the rich” is consistent with the factor multiple differences in real consumption that PPP data show.

This “rich of the poor are poorer than the poor of the rich” is (of course) not true of upper-middle income countries with high inequality, like Brazil.  The average income (and it is income, not consumption) of the top decile in Brazil is P$20,515, which is well above the median of the USA or Denmark.

And one can also distinguish between the rich and the global hyper-rich like billionaires, which is an entirely different issue (as we discuss below).

Fifth Figure, Variant 2:  Low bar poverty lines are economically indefensible (and morally obscene)

A second major objection to economic growth is to argue that the marginal normative valuation of additional consumption declines very rapidly to a very low level and hence growth may not be that important for a normative social objective function.

This normative under-valuation of income gains reaches truly surreal levels with “low bar” global poverty lines.  The main feature that distinguishes the mainstream FGT(α) poverty measures from any other social welfare function (like, say, an Atkinson-inequality adjusted income measure) is not that “poverty puts more weight on the wellbeing of poorer HHs”—as all inequality averse measures do that—but that poverty measures put exactly zero weight on income above the chosen poverty line.  For example, with the “dollar a day” poverty line (now P$1.9/day with inflation) if someone’s income increases from P$1.95 to P$2.00 this would have exactly zero impact on any standard (FGT) poverty measure using the P$1.90 poverty line.

This means that a standard normative welfare function and poverty as an objective only (roughly) coincide if one is willing to accept that the marginal valuation of consumption gains above the poverty line being used is (reasonably) well approximated by zero. 
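The zero-weight property is easy to check directly. Below is a minimal Python sketch of the standard FGT formula, P_α = (1/N) Σ ((z − y_i)/z)^α summed over people with y_i below the line z. The function name `fgt` and the four incomes are hypothetical illustrations, not data from any survey.

```python
def fgt(incomes, line, alpha):
    """Foster-Greer-Thorbecke poverty measure P_alpha for a list of incomes."""
    total = 0.0
    for y in incomes:
        if y < line:  # only people below the line contribute anything
            total += ((line - y) / line) ** alpha
    return total / len(incomes)


LINE = 1.90  # the P$1.90/day line discussed in the text

# Hypothetical daily incomes; the third person gains from 1.95 to 2.00.
before = [1.00, 1.50, 1.95, 3.00]
after = [1.00, 1.50, 2.00, 3.00]

for alpha in (0, 1, 2):  # headcount, poverty gap, squared poverty gap
    print(alpha, fgt(before, LINE, alpha), fgt(after, LINE, alpha))
```

For every α the measure is identical before and after, because the person who gained was already above the line; the same five-cent gain to the person at 1.50 would move the gap measures.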

That view for the standard poverty lines used by the World Bank is, I would argue, complete madness, for four reasons. 

First, there is no line.  The opening of Alfred Marshall’s Principles of Economics was Natura non facit saltus (“nature does not jump” in Latin).  The classification of something “above” or “below” some more or less arbitrarily drawn line through the income/consumption should not create the issue that things are qualitatively different below and above that line.  If one examines wellbeing outcomes—health, education, malnutrition, access to water, etc.—there is often a reasonably strong connection between those “goods” and HH income.  But I have never, ever, seen any empirically demonstrated discrete jump or “phase transition” around a specified line (or even a close approximation to it).

Second, even if there were a line above which it would be a reasonable approximation to treat consumption gains as having ‘near’ zero value it is nowhere near the World Bank poverty lines.

While there is no way to say what “marginal utility of income” is and how exactly it evolves with income the figure shows three pieces of empirical evidence.

One, I estimate an Engel curve relating food share in consumption to PPP consumption expenditures (and, as Pritchett and Spivack (2013) show the actual Engel curve parameters are remarkably similar over time and samples and method so the details don’t really matter).  At the P$1.9 per day poverty line the predicted food share is around 80 percent.  The marginal propensity to spend is less than the average (as it is declining) but there is no way one can argue that additional income has “about” zero impact on wellbeing when HHs are still spending 50 percent or more of the incremental income on food.

Two, the graph above showed that the slope of the relationship between basics and GDPPC was highly non-linear and that it was very steep at low levels of income.  However, the more common measure of responsiveness is the elasticity (which is the percentage change of basics over the percentage change in GDPPC) and when one computes the elasticity of BHWB wrt GDPPC it emerges that the elasticity actually starts out low, increases as GDPPC increases up to a point, and then declines to much lower levels at high levels of GDPPC.
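The difference between slope and elasticity can be seen in a toy example. The sketch below uses a purely hypothetical saturating curve for a 1 to 100 index (the functional form and the constant `K` are my assumptions for illustration, not the estimated relationship in Pritchett and Lewis 2022) and computes both measures numerically.

```python
import math

K = 10_000.0  # hypothetical scale parameter, chosen only for illustration


def basics(gdppc):
    """Toy 1-to-100 index that rises steeply at first and then flattens."""
    return 1 + 99 * gdppc / (gdppc + K)


def slope(f, x, rel_step=1e-6):
    """Numerical derivative df/dx at x (central difference)."""
    h = x * rel_step
    return (f(x + h) - f(x - h)) / (2 * h)


def elasticity(f, x, rel_step=1e-6):
    """Numerical elasticity d ln f / d ln x at x (central difference in logs)."""
    up, down = x * (1 + rel_step), x * (1 - rel_step)
    return (math.log(f(up)) - math.log(f(down))) / (math.log(up) - math.log(down))


for y in (200, 1_000, 40_000):
    print(y, slope(basics, y), elasticity(basics, y))
```

In this toy curve the slope falls steadily as income rises, yet the elasticity is low at the bottom, peaks at an intermediate income and is low again at the top, which is the qualitative pattern described above.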

The striking thing is that at the highest of the WB reported poverty lines, of P$5.5/day, which is consumption per year of only P$5.5/day*365days/year=P$2007/year (and then adjusting the elasticity curve so that it is in terms of predicted consumption, not GDPPC) the elasticity or responsiveness of basics to increases is not only not “near zero”, and not only not decreasing, but is still rapidly increasing.  So countries with growth at the levels of the poverty lines are seeing basics of wellbeing (health, education, water and sanitation, etc.) improve at an increasing rate.  To assert that the benefits of income above that level are well approximated by zero (as poverty as an objective demands) is surreal.

Three, I ran a regression of the standard World Bank Under Five Child Mortality data on a flexible functional form in GDPPC in part to illustrate that the shape of the elasticity is not an artefact of scaling or building an index.  The same feature about the elasticity emerges as the responsiveness (elasticity) of child mortality to increases in GDPPC first increases (up to a quite high level) and then decreases (but is still very high even at the US 80th percentile).  Again, I just don’t see how one can adopt “poverty” as the “objective” of development at such low poverty lines and then hence discount how much improvements in GDPPC contribute to improving wellbeing.

I am not making any of: (i) the “materialist” case that money/income/consumption is the only goal (all of my arguments bring in other dimensions of human wellbeing, like child mortality or the natural environment), (ii) the case against declining marginal utility of income, (iii) the case that, at some level of GDPPC, the attention should shift from prioritizing growth to other objectives.

But I am making the case that for a global poverty line something much more like Denmark’s poverty line of around P$25/day makes much more sense than a poverty line at less than 1/10th that value.

I am increasingly of the view that the implicit acceptance (or at least complicit toleration) of the “low bar” poverty lines that imply zero valuation of income gains above a ridiculously low threshold is the root of all evil in development.  And I don’t mean “evil” in some vague metaphorical sense: I think the “golden rule” implies we should only accept conditions for others that we would accept for ourselves, and no person advocating a “dollar a day” poverty line would ever accept that their own personal valuation of income gains above that threshold was zero.  Low-bar lines do not pass a simple ethical test (and the philosopher Derek Parfit (2011) argues that something very much like the Golden Rule emerges from a variety of different ethical approaches).

Fifth Figure, Variant 3: Would you rather have a purple unicorn or a brown horse?

The third variant of the graph examines the growth incidence curves themselves, considering both their slope/shape and their location.  A key question is this: if our concern is mainly for the growth of the lowest deciles, is that primarily determined by whether growth is “inclusive” or not (the shape/slope of the growth incidence curve) or by the average pace of growth for the economy at large (say, the average growth or growth of the median)?

For pretty much any overall normative evaluation one (you, me) would rather have more rather than less “inclusive” growth (any sort of declining marginal utility gives that result).

But there is also the question: would one (where “one” is you or me), even if one’s only objective was growth for the poor, rather have “inclusive slow growth” or “pro-rich fast growth”?

For instance, this data suggests that over the period 1984-2016 the consumption of the poorest decile in Bangladesh grew at 1.1 ppa and the consumption of the rich grew at 2.1 ppa, a difference between rich and poor of only 1 ppa so, while growth was not “pro-poor”, the growth incidence curve was not very steep. 

In contrast, growth of the poorest in urban China 1981 to 2018 was 4.2 ppa but the growth of the richest decile was 7.5 percent, higher by 3.2 ppa.  This was “pro-rich” growth but, in absolute terms, the consumption of the lowest decile grew massively.

The consumption of the lowest decile went from P$360 to P$525 in Bangladesh 1984-2016, while over a similar period of 1981 to 2016 the consumption of the lowest decile in urban China went from P$353 to P$1489, more than quadrupling.
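As a rough cross-check, the endpoint levels just quoted can be turned into compound annual growth rates. This is a minimal sketch with a hypothetical helper `annual_growth`; it uses only the start and end values, so it need not exactly reproduce rates computed from the full survey series.

```python
def annual_growth(start, end, years):
    """Compound annual growth rate implied by a start level, end level and span."""
    return (end / start) ** (1 / years) - 1


# Lowest-decile consumption levels quoted in the text (P$).
bangladesh = annual_growth(360, 525, 2016 - 1984)
urban_china = annual_growth(353, 1489, 2016 - 1981)

print(f"Bangladesh, poorest decile:  {100 * bangladesh:.1f} percent per annum")
print(f"Urban China, poorest decile: {100 * urban_china:.1f} percent per annum")
print(f"Level multiple, urban China: {1489 / 353:.1f}x")
```

The implied rates are a little over 1 percent a year for Bangladesh and a little over 4 percent a year for urban China, in line with the growth rates cited above.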

This graph doesn’t “prove” the point or even fully illustrate it, but empirical analysis shows that the correlation between the growth of the poorest (1st decile) and the average growth in the country is very high (Dollar and Kraay 2002) and hence even though poverty rates are responsive to changes in inequality it is still the case that most of the variation in poverty reduction is due to differences in country average growth rates (Bromberg 2022).  This itself is just a consequence of the fact that (a) the cross-national and time series variation in growth rates is very high and (b) the cross-national and time series differences in measures of inequality (like the Gini coefficient) are very stable.

This point is important as a great deal of discussion in development circles is about the adjectives that modify “growth”—the goal is never stated as “rapid, sustained, growth” but as “inclusive sustainable growth.” 

One might prefer a purple unicorn to a green unicorn but their ontological status is the same and neither can pull your cart as well as a brown horse.

Fifth Figure, Variant 4:  There are episodes like “Equatorial Guinea”—but they are very rare—income growth and wealth growth are not the same dynamic

There have been two phenomena, mainly in the USA, that have sucked up the air in the economic growth room.

One, is that the growth incidence curve in the USA has been highly tilted with the top percentiles capturing a large fraction of the benefits of growth.

Two, Thomas Piketty (and a few others) have been remarkably successful in shifting attention from the distribution of flows (income, consumption) to stocks (wealth) and, while doing so, ignoring the single most important asset for most people, human capital, in favor of financial wealth.  This has been associated with attention to the rise in the number of the globally hyper-rich (e.g. numbers of billionaires), which is again about a stock of financial wealth, not the flow of income/consumption.

This has supported a push-back against economic growth (without adjectives) as a goal with the notion that the “hyper-rich” are capturing disproportionately large parts of the economic gains.  Of course starting from any given level of income inequality even if inequality is unchanged the rich already control a larger share of income and hence they will, even at “neutral” growth incidence in percent rate of growth capture the same share of the gains as of the levels.  For instance, in the data in the figures the top decile in Bangladesh had 22 percent of the consumption in 1984 and hence even if the income distribution did not get more unequal they would have had 22 percent of the gains (to keep their share constant). 

We cannot isolate the upper deciles precisely with the existing data by deciles, but we can do two calculations to show that the “hyper-rich” are not capturing all (or nearly all or even a large share) of the gains.

We take advantage of the fact that if the far right tail is included at all, it is in the average of the top decile, and does not affect at all the average of the deciles below. I show the rate of growth of the average income of the 9th decile, which is truncated both below and above.  This growth is robustly quite high.  Two points.

First, just to illustrate why this calculation is relevant.  Imagine there was an economy of 100 people and their initial income was log-normally distributed and then income stayed the same for 99 of them and only the income of the top person doubled.  Then there would be substantial growth in the average income even though only one person’s income changed.  But if we computed the percent change in income by decile, growth would be zero for all deciles but the top, so the growth incidence curve would be a flat line at zero and then a rotated “L” shape with only the top decile (where the top individuals are) having positive growth.
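This thought experiment of 100 people, where only the top person’s income doubles, is simple to simulate. The sketch below is illustrative only: the random seed and the log-normal parameters (mean 0, standard deviation 1 in logs) are arbitrary assumptions of mine.

```python
import random

random.seed(0)  # arbitrary seed so the illustration is reproducible

# 100 people with log-normally distributed income, sorted from poorest to richest.
incomes = sorted(random.lognormvariate(0.0, 1.0) for _ in range(100))

# Only the single richest person's income doubles; everyone else is unchanged.
after = incomes[:-1] + [2 * incomes[-1]]


def decile_means(sorted_incomes):
    """Average income within each tenth of an already-sorted income list."""
    n = len(sorted_incomes) // 10
    return [sum(sorted_incomes[i * n:(i + 1) * n]) / n for i in range(10)]


growth_by_decile = [
    a / b - 1 for a, b in zip(decile_means(after), decile_means(incomes))
]
mean_growth = sum(after) / sum(incomes) - 1

print("growth of mean income:", round(mean_growth, 3))
print("growth by decile:", [round(g, 3) for g in growth_by_decile])
```

Growth is exactly zero for deciles one through nine and positive only for the top decile, even though mean income rises, which is why the ninth decile is a useful check on how much the far right tail is driving average growth.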

Second, that said, I am not denying that there might be concentration of wealth and power in very few hands, but only that this wealth dynamic is not what is driving the overall growth results. 

References

Bolt, J., & van Zanden, J. L. (2020). Maddison style estimates of the evolution of the world economy. A new 2020 update. Maddison-Project Working Paper, WP-15. https://www.rug.nl/ggdc/historicaldevelopment/maddison/publications/wp15.pdf

Dollar, D., Kleineberg, T., & Kraay, A. (2016). Growth still is good for the poor. European Economic Review, 81(C), 68-85.

Dollar, D., Kleineberg, T., Kraay, A., & Guriev, S. (2015). Growth, inequality and social welfare. Economic Policy, 30(82), 335-377.

Dollar, D., & Kraay, A. (2002). Growth Is Good for the Poor. Journal of Economic Growth, 7(3), 195-225. http://www.jstor.org/stable/40216063

Filmer, D., & Pritchett, L. (1999). The impact of public spending on health: does money matter? Social Science & Medicine, 49(10), 1309-1323. https://doi.org/10.1016/S0277-9536(99)00150-1

Foster, J., Greer, J., & Thorbecke, E. (1984). A Class of Decomposable Poverty Measures. Econometrica, 52(3), 761-766. https://www.jstor.org/stable/1913475

Pritchett, L. (2022). National development delivers: And how! And how? Economic Modelling, 107, 105717. https://doi.org/10.1016/j.econmod.2021.105717

Pritchett, L., & Spivack, M. (2013). Estimating Income/Expenditure Differences Across Populations: New Fun with Old Engel’s Law. Center for Global Development Working Paper, 339. https://doi.org/10.2139/ssrn.2364649

Keeping the Gold in the Golden Rule: Economic growth and the basics of human material wellbeing

There is a steady stream of opinions that economic growth, and in particular GDP per capita, is overrated and that it should be downplayed as a policy objective, mostly from people in the rich countries of the world (the OECD). My main point in the attached new (draft) paper is that priorities depend not just on what you want, but on how much of what you want you already have. What makes the rich countries rich is that they currently have very high levels of GDPPC, and, given that they have a lot, they might, at the margin, want to value other things they also want that they have less of.

But it is a huge, huge mistake to think that priorities are just preferences: countries with low GDPPC, even if they want exactly what countries and people in the OECD want as preferences, might have a very strong priority for more economic growth as they have so little now.

There are two key graphs. One, from a previous paper of mine, is just a graph built on the (updated) Maddison-style estimates of long run GDPPC (Bolt and Van Zanden 2020). This shows the evolution of the GDPPC of the leading three countries in GDPPC (whichever they were) from 1700 to 2018. The point of this graph (which is a variant on the “hockey stick” graphs that illustrate the consequences of the acceleration of economic growth that began sometime in the 19th century and has been sustained in the OECD countries at roughly 2 ppa) is that many countries in the world today are still at levels of GDPPC that the now advanced countries had a century or more ago. Twenty nine countries had GDPPC in 2018 lower than the most advanced countries had in 1700. Another nineteen countries have GDPPC lower than the leading countries had in 1870.

And the poorer countries in the world today are not that much higher than the GDPPC that the leading countries had 2000 or 1000 years ago. Niger is estimated to have lower GDPPC in 2018 than Egypt had in 1 CE. Ethiopia’s GDPPC, at M$1838 (M$ for the Maddison PPP units) is only about 50 percent above where China was in 1000 CE.

This is just important factual context, as it is obviously one thing to debate how much of a priority economic growth is when GDPPC is M$40,000 or more (Japan, Germany (DEU), USA) and quite another if one is at the levels of India or Vietnam (M$6800) or less. Even China or Indonesia, countries that have obviously had extended periods of rapid growth, are still only at the level the leading countries had in 1950. Clearly how much more economic growth is a priority depends on how much one has had.

The second key graph, and the key point of the paper, is that if one constructs any index of the basics of human material wellbeing not from economic output measures but from physical indicators like health, education, water and sanitation, nutrition, environment it is very tightly associated with GDPPC in a non-linear way. As countries move from low levels of GDPPC to levels up to about P$ 25,000 (about where Argentina and Chile are) the extent to which basic human needs are met increases very steeply. After that, and in particular for the top quartile of countries (above P$38,637, where P$ is the Penn World Tables PPP measure) there is little gain to basics–because the level of basics is very high.

And what the paper demonstrates at some length is that this strong, non-linear association of GDPPC and basics is robust to any plausible definition of “basics.” That is, if one constructs a multi-dimensional measure of basics of human wellbeing it doesn’t really matter which exact variables one includes or the weights one uses to add the indicators up, you get roughly the same results. So the claim is not that GDPPC and “a” measure of basics are related strongly and non-linearly, the claim is that this is true of “any” measure of basics.

The reason I wrote this paper is not that I want to weigh in about what priority should be placed on economic growth in the USA or Germany or New Zealand or the UK; they have a lot of it. But my worry is that the Golden Rule, that one should “do unto others as you would have them do unto you,” can be misapplied by confusing “preferences” with “priorities.” The debate about what development agencies like the World Bank or IADB or the bilateral development agencies of OECD countries should do might be influenced by thoughts of the type “Since we want in our own domestic affairs a lower priority on economic growth, development agencies should put less weight on economic growth.” But the Golden Rule has to be applied by asking “what would I want if I had my preferences and a given set of circumstances, like how much I already have of the various things I want.”

As we know from Engel’s law, when people have low levels of income their share of spending on food is very high (70 percent or more) and when people have very high levels of income their share of spending on food is more like 7 percent. This is not because preferences change but because priorities change and when one gets more food it becomes, at the margin, less of a priority. But obviously the Golden Rule conclusion for a rich person is not “since food is not a priority for me it is not a priority for others so I won’t worry about food consumption as a priority.”

This is part of a broader argument of mine that we need to keep economic growth (“gold”) in the Golden Rule of what development agencies see as their role to help poorer countries (and societies and households) achieve the levels of economic productivity that are essential to provide their populations with the basics–even if that, thankfully, is no longer a priority for prosperous places.

Here is the draft paper (revised May 22, 2022).

A new paper about development and aid–and monkeys

I have a new draft paper written for a handbook volume on “Development and Aid” being edited by Shanta Devarajan, Jennifer Tobin, and Raj Desai at Georgetown. The paper kind of builds on an earlier blog of mine, “The Perils of Partial Attribution,” in which I worry that people sometimes get confused about the facts and the counter-factual. The title of the paper is: “Development Happened. Did Aid Help?” (This is the version as of February 2024).

An organizing analogy is the Belichick-Brady debate. The fact is that the team they were both affiliated with, the New England Patriots, had one of the most impressive and successful periods of any major sports franchise, ever, winning 6 Super Bowls. One then might want to parse out some causal attribution of how much of this success was due to Bill Belichick as coach and how much was due to Tom Brady as quarterback. And so one imagines an alternative world in which Tom Brady had played for another team with an average (sequence of) coaches or in which Bill Belichick had coached with a sequence of quarterbacks.

The difficulty with the development-aid debate outside of the realm of development experts is that, as Hans Rosling so gently puts it, the “person on the street” in a rich country knows less about conditions in developing countries than a monkey, because at least a monkey doesn’t know stuff that is wrong.

Something like 85 percent of Americans thought global poverty over recent decades had either “stayed about the same” or “risen.” This is like trying to engage in the Brady-Belichick debate with someone who says “Well, we know neither can be so terrific because their team never won a Super Bowl.” Well, no, that is actually not up for debate as that is a fact, a given, something that everyone agrees actually happened. The only question is what causally explains that fact.

If you believe, wrongly, that “development didn’t happen” then it is easy to think “well, since ‘aid’ was supposed to promote development and development didn’t happen, aid (probably) didn’t work.” The conclusion is not necessarily wrong, but the argument is just stupid because it starts from demonstrably incorrect assumptions about the facts.

Let’s not start the discussion of development success with economic indicators (like GDP per capita growth) as they are controversial and let’s not start with poverty, as that has so many definitions. Let’s start with a couple of things that pretty much everyone agrees with: that every child should go to school and that it is bad when children die. Turns out in the “development era” progress on those was amazingly fast.

The figure shows the average years of schooling of the adult population in the developing world in 1870, 1950 and 2010 (from Lee and Lee 2016 data). From all of human history to 1950 (however many thousands of years you want to call that) the cumulated gain in years of schooling in formal education was 1.58, with most people having had no schooling at all. By 2010 the average was 7.5 years, so that there was more gain in years of schooling in the sixty years from 1950 to 2010 than in all previous human history, by a factor of 3.7.

I am not saying that everything about this expansion of education was a success, and I am a big advocate of acknowledging and addressing the “learning crisis” that lots of kids are getting schooling with little or no learning, but still, this is a massive, historically transformative success to have moved from little or no schooling to pretty much universal access to primary schools in pretty much every country. Big win. Huge win.

And, as an interesting dimension of that expansion of schooling, one can run a simple bivariate regression of years of schooling on GDPPC in 1950 and then “predict” the level of schooling that would have resulted from economic growth alone, and, contrary to the idea that “development” focused on growth but not human development indicators, here the progress in years of schooling was substantially more than growth would have “predicted,” not less.

The same is true of child mortality. Under 5 child mortality (deaths per 1000 births) fell from 293 in 1950 to 32.6 in 2018 (using data from Gapminder). Again, more reduction in child deaths from 1950 to 2018 than in all of previous human history, by a factor of about 1.7. Big win. Huge win.

So the question has to be: “What accounts for this enormous progress in a wide array of indicators of human wellbeing in the developing world in the development era?” It may be that aid played no role (or that it is just impossible to know with any confidence what role aid played as the right counter-factual just does not exist–as one suspects is the inevitable outcome of the Brady-Belichick debate) but at least let’s be smarter than monkeys about the actual facts that we are trying to understand and parse the causes of.

A new paper on labor mobility: Levers to open the Overton window

As a development economist it is easy to show that “in situ” interventions that have very large returns in raising the income or wellbeing of people in a given place are relatively rare, and that most have modest impacts (e.g. the “gold standard” RCT evaluation of the “graduation” approach to poverty across five countries claims it is an effective program worth funding based on an ROI of around 7 percent). It is also easy to show that since the productivity of people with the exact same characteristics varies by factor multiples (the “place premium”) the income gains from labor mobility from poorer countries to richer countries are massive (here), orders of magnitude larger than most anti-poverty programs (here). And, not surprisingly given the wage differentials, according to Gallup surveys around the world there are around a billion people willing to move if they were allowed to (or, if asked about permanent movement, about 750 million).

But, all that said, and roughly undisputed, there is little or no attention to international labor mobility (except as refugees from crisis or conflict, as we see recently with the war on Ukraine). I think that is because most people see tight restrictions on labor mobility in rich industrial countries as a “condition” not a “problem” and just take it for granted that since greater labor mobility is politically impossible it is not worth talking about.

My new paper “The political acceptability of time-limited labor mobility: Five Bricks through the Overton window” (which was presented at a recent symposium at NYU and has been submitted to Public Affairs Quarterly) does not dispute that substantially greater labor mobility into rich countries has been politically impossible (“immigration” has been going up in most OECD countries, but slowly, from a low base, and as much from mobility within rich countries as from allowing in people from poorer countries). It argues instead that when the facts change, people can and do change their minds, and that the facts about rich countries are changing in ways that will put greater labor mobility, of multiple modalities, including more widespread time-limited mobility to meet specific labor market needs, squarely into the Overton window.

A principal driving cause of the shift in political acceptability is the combination of two things. One is demographic shifts, where an ageing population implies many, many more people who need to be supported and many fewer labor force aged people to do the work and pay the taxes. The other is shifts in the labor market, which are creating many jobs that require core skills (but not high levels of formal schooling) such that these jobs just cannot be filled as there aren’t enough native born youth who want these jobs (nor, in an economically efficient world, should they).

That much greater labor mobility from poorer to richer countries will become politically feasible in the near to medium run horizon (within a decade) because it will be in the best interest of voters in rich countries to allow it (again, in various controlled modalities, not “open borders”) is another view of mine that is in a decided minority, but right. You’ll see.

This is the paper revised as part of the submission process to Public Affairs Quarterly on May 20, 2022.

JPAL: Seize this teachable moment

[I wrote this blog back in May 2021 but then put it into a waiting period to make sure I wasn’t saying something rash and intemperate (which I have been known to do and which is why I am not on Twitter, ever). I finally (five months later) have decided to post it (with minor revisions) since the point continues to be topical.]

A May 10, 2021 blog post (article in Project Syndicate) from JPAL titled “Growth is not enough” has these striking lines:

But for millions of people living in poverty, growth is not enough. Specific, targeted social programs based on rigorous empirical evidence are equally important to prevent people from being left behind.

This claim of “equally important” is striking in four ways.

First, even without any knowledge on the topic, to anyone that takes empirical claims seriously this is strikingly implausible. “Equally important” is a set of measure zero. That is, if I told you that electron mass and proton mass were equal your first thought would be: “Really? Of all the masses that particles could have they just happen to be equal? There must be some really deep and important feature of the universe that makes that fact be so as, without that justification, it is just a strikingly implausible coincidence.” (And you would be right, the proton to electron rest mass ratio is 1836.15267343 (truncated at some digits) which is an appropriately arbitrary number). So even without knowing any empirical facts one already suspects this is not a factual claim at all, but emotive rhetoric.

Second, anyone with any knowledge about global poverty knows that, as a general claim, it is false, not by a little but by a lot (at least factor of 10, maybe a factor of 100) and known to be false. I have research (summarized here) that shows that growth is, in fact, enough: that a higher level of median consumption is in fact empirically sufficient for reducing a country’s absolute headcount poverty. If by “growth” one means differences across countries in the level of median income/consumption (which are necessarily the result of differences in long-run growth) then growth alone (with a flexible functional form) accounts for around 98 percent of differences in absolute headcount poverty.

David McKenzie (who has himself done some great RCTs) has a 2020 commentary titled “If it needs a power calculation does it matter for poverty reduction?” that starts from the premise that everyone accepts that the typical (median) productivity in the place (country/region) where a person works is far and away the most important determinant of their income and hence likelihood of being in poverty. Therefore growth and migration are obviously massively important for poverty, the only question is does anything else “matter” at all and by how much? (with zero possibility it is “equally important”). The point is that this is not a dispute between my research and their research. That growth and programs are not, in general, “equally important” is just common knowledge.

Third, this claim is specific in a way that makes it both even more obviously false and also obviously self-interested. The claim isn’t just that “social programs” are equally important. The claim is that the subset of social programs that are (i) “specific” and (ii) “targeted” and (iii) “based on rigorous empirical evidence” are “equally important.” The claim “public transit accounts for an equal share of commuting to work in the USA” would be implausible (again, why “equal”?) and empirically false but the JPAL claim is like asserting that “ridership on public transit in blue buses whose license plate ends in an odd number accounts for an equal share of commuting to work in the USA.” In the USA, for instance, one can estimate (and debate) how important Social Security was for the reduction in poverty among the elderly but since its design and adoption wasn’t “based on rigorous empirical analysis” it wouldn’t be in the set of programs this statement claims are “equally important.”

These qualifications on the type of social programs being promoted also reveal that this claim is completely and totally self-interested. JPAL takes money in order to generate “rigorous knowledge” for the design of “specific” and “targeted” social programs so this claim is just advertising for their product.

Fourth, the double standard they want readers to accept is striking. That is, JPAL wants you to (a) use rigorous empirical knowledge in making decisions about how to fight poverty (and improve wellbeing more generally) but also (b) accept their claim of a small subset of social programs being “equally important” to poverty as economic growth completely and totally without evidence.

That is, the implicit proposed double standard is (a) in the design and adoption of specific, targeted, social programs we, JPAL, think that “rigorous” is the standard to use for empirical evidence but (b) about the self-interested claims that we JPAL make about the benefits of using the standard of “rigorous” for empirical evidence (and hence in decisions in how much funding, we, JPAL, should receive) one should adopt a completely different standard of evidence. The standard we want for our claims is: just accept our rhetoric without any evidence at all. The sentence is not “Rigorous empirical analysis shows that specific, targeted social programs based on rigorous empirical evidence are equally important to prevent people from being left behind” rather the statement was just made ex cathedra, to be accepted just because it was said.

I think the article creates a very teachable moment for JPAL, with three clear options.

One, retract the article/blog and make it clear that JPAL really stands for the use of rigorous empirical evidence in development decision making–including for itself.

Two, teach us all what “rigorous empirical evidence” means to JPAL by showing how this claim (and others in this article) are not just true, and not just backed by some evidence, or even backed by “persuasive” evidence, but are backed by “rigorous” evidence. Or, alternatively, teach us what kinds of empirical claims about development impact need to be backed by rigorous empirical evidence and which do not.

Three, do nothing, which will use this teachable moment to teach us something important about JPAL. I suspect a lot of people and organizations are rooting for option 3. If JPAL, an organization founded on its commitment to the generation and use of rigorous evidence, can live with a yawning double standard on evidence between their own rhetoric and their actual practice in their public-facing advocacy, then so, of course, can they, and hence so can everyone else.

Afghanistan 2021: A quickly made long tragedy

The tragedy for the Afghan people of the Taliban re-taking control of the country in August 2021 is the denouement of a process 20 years in the making. The sudden collapse of the Afghan government and the national security forces over the course of a few days is not a “surprise” to anyone, but was a widely expected outcome by many observers (including the CIA).

There are many, many political and humanitarian aspects of the present crisis, but I want to just present my conjecture about the longer run question. How is it that, in 20 years of effort, backed by massive levels of resources, the “international community” (led by the USA obviously but there has been participation in Afghanistan by other governments (e.g. the UK), aid agencies, multilateral organizations (e.g. World Bank, IMF, ADB), and NATO) has failed so badly in their efforts to create (or even allow to emerge) a capable and legitimate state in Afghanistan? Part and parcel with this question is not just how does one fail after 20 years of effort but also, how does one sustain 20 years of effort while failing?

The Duke of Albany’s last, plaintive, lines of Shakespeare’s King Lear are:

The weight of this sad time we must obey,
Speak what we feel, not what we ought to say.
The oldest have borne most; we that are young
Shall never see so much, nor live so long.

All of the machinery of the tragedy was set in motion by just the 115th line of the play in which Lear makes the rash decision to cut off Cordelia for having spoken the plain truth rather than flowery lies. That the lives were not spared by last ditch attempts to save them is not the central feature of the tragedy. The events are the long tragic sequelae of the original hubris.

I am a very visual person so I propose this diagram as an aid to understanding the tragedy, for both the USA but much more so the people of Afghanistan, of the US engagement. After the US and its allies threw out the Taliban there were some critical choices. One choice was the extent to which the USA was going to engage in “nation building” and attempt to create a capable and legitimate state before leaving. The USA could have said “We are not in the business of nation-building, we are militarily out of here when our narrow 9/11 related objectives are met, full stop, plan on it.” Or, they could have said “Given the consequences of our regime change we are here with an open-ended commitment until Afghanistan has a capable and legitimate state (on some clear(ish) criteria).” But it was politically expedient, and the height of unconstrained hubris, to say both. The USA said that they were both going to leave only when Afghanistan had a capable and legitimate state and that, don’t worry, that won’t take us very long, we are not making an open ended commitment.

Well, if you announce a distance and a time, you have announced a speed. What the USA announced was the equivalent of saying they were going to run a Marathon distance (26.2 miles) in an hour. And when they were told, hey, people have been running Marathons since, well, Marathon, and no one has run one in anything like that time (and it is probably physiologically impossible as no one has run even one single mile at the pace 26.2 would have to be run) the response was some mix of i) hubris that the US military can achieve anything; ii) real or feigned inability to understand that the speed was wildly unrealistic; iii) resignation to political interests setting the goal and timeframe.

Once the hubris of “we are going to build a capable and legitimate state and it is not going to take us that long” was set in motion the tragedy was underway, even if not immediately obvious, as it set in motion three practices that are inimical to building either a capable or a legitimate state.

First, if one is told to do nation building on an unrealistic time deadline one is driven towards tactics and strategies that can at least appear to produce rapid success. This leads inexorably towards what we call “looking like a state” or, after the sociological concept of isomorphism, “isomorphic mimicry.” It is super easy to do things on paper, make constitutions, pass legislation. What is hard to create is capability to implement, a shared sense of nation-hood, a commitment to rule of law.

Filmmakers cannot build space ships or cities but they can create the effective illusion of having done the impossible. Giving people resources and putting pressure on people to do the impossible will not lead to the impossible, it will lead them to create illusions.

A second major flaw that undermines development success is what we call “premature load bearing.” As I type this in August of 2021 I just had surgical repair of my ruptured Achilles tendon. My leg is in a hard, non-weightbearing cast for two weeks. If I took that cast off on the second day after surgery and tried to run around I would immediately undo whatever benefit the surgery had been.

Asking political and governance mechanisms to do too much, too soon, with too little merely creates repeated failures.

A third common flaw in development efforts is to “cocoon” projects from the normal channels of implementation. If one feels very strongly that something needs to be done and one knows that the existing national mechanisms are too weak to do it, there is a temptation to bring in foreign contractors and import the capability. Given the resources and capabilities of American government and contracting firms, of course many things can be done quickly. But this usually not only fails to build capability, it undermines the building of national capability and does not improve a government’s legitimacy. Moreover, this gets done at costs that are astronomical relative to what the national government could ever hope to afford. At one point great claims were being made about the improvements in the health sector and health outcomes in Afghanistan. Even if we grant those were major and important gains, since it was being done by American contractors it meant an Afghan doctor could make many-fold more income working as a driver for the health project than he could as a doctor in a regular government clinic. Back of the envelope calculations were that the cost per person of the health system exceeded not just the potential total government expenditure per person but total post-withdrawal GDP per capita.

Figure 2 illustrates the dynamic in which the rash, overambitious commitments eventually confront a reality of little or no progress. Then, the political logic repeats itself. The USA either needs to leave, acknowledging they are doing so in spite of the fact there isn’t a capable and legitimate Afghan state in place, or, they need to push down the ambition and push out the time and try again. But the new attempts now face both the same politically set overambitious targets and the legacy of the past failed strategy and tactics. Increasingly the USA found itself integrated into, and part of, a corrupt policy: buying cooperation by turning an officially blind eye to corruption at the expense of democracy, rule of law, and legitimacy. This is how the tragedy gets long (and bipartisan).

The endgame, which many people both inside and outside of Afghanistan predicted, again and again from 2001 onwards was that eventually the USA would admit failure and announce they were getting out no matter what and try and put the best face on that fact.

I know personally, and have read about, many extraordinarily capable and well meaning people who sincerely worked at improving conditions in Afghanistan. But ultimately they all were powerless against the forces of tragedy set in motion. They became like the Earl of Kent, often speaking courageously against the madness:

Be Kent unmannerly
When Lear is mad. What wouldst thou do, old man?
Think’st thou that duty shall have dread to speak
When power to flattery bows? To plainness honour’s bound
When majesty falls to folly. Reverse thy doom;
And in thy best consideration check
This hideous rashness.

Only to be ignored, be themselves banished, or, as Kent, find a way to continue to struggle against the unfolding tragedy.

Afghanistan has deep and important lessons for nation-building, fragile states, conflict: issues which are an integral part of the practice of development. But I fear they are hard lessons to learn and even harder to convince politicians to swallow. I was working in South Sudan in 2011 and saw the exact pressures to announce as a “plan” a wildly overambitious pace of progress, often coming from “conflict” experts whose expertise rested on their experience in Afghanistan.

How do I do research that is both reliable and new?

There were two recent entries in the ongoing saga of the “replication crisis.”

One was a recent (12/30/2020) blog post suggesting that the evidence behind Daniel Kahneman’s wildly popular Thinking, Fast and Slow was not very reliable as many of the studies were underpowered. (This was a follow up of a 2017 critique of Chapter 4 of the book about implicit priming, after which Kahneman acknowledged he had relied on under powered studies–and he himself pointed out this was borderline ironic as one of his earliest papers in the 1970s was about the dangers of relying excessively on under powered studies). The blog has a cool graph that estimates the replication rate of the studies cited in the book, adjusting for publication bias, at 46 percent. The obvious issue is that so many studies are cited with very modest absolute value z-statistics (where 1.96 is the conventional “5 percent two sided statistical significance”).

A second was an interesting blog reporting on three different studies about replication where various research teams were given exactly the same research question and the same data and asked to produce their best estimates and confidence intervals. The point is that the process of data cleaning, sample composition, variable definition, etc. involves many decisions that might seem common sense and innocuous but might affect results. Here is a graph from a paper that had 73 different teams. As one can see the teams produced a wide range of results and while the modal result was “not statistically significant” there were lots of positive and significant and lots of negative and significant (far more than “5 percent” would suggest).

This leads me to reflect how, in nearly 40 years of producing empirical results, I have dealt with these issues (and not always well).

I remember one thing I learned in econometrics class from Jerry Hausman when we were discussing the (then) new “robust” estimates of covariance matrices of the Newey-West and White type. His argument was that one should generally choose robustness over efficiency and start with a robust estimator. Then, you should ask yourself whether an efficient estimate of the covariance matrix is needed, in a practical sense. He said something like three things. (i) “If your t-statistic with a robust covariance matrix is 5 then why bother with reducing your standard errors with an efficient estimate anyway as all it is going to do is drive your t-statistic up and certainly you have better things to do.” (ii) “Would there be any practical value in a decision making sense?” That is, oftentimes in practical decision making one is going to do something if the estimate is greater than a threshold value. If your point estimate is already 5 standard errors from the threshold value then, move on. (iii) “If moving from a robust to an efficient standard error is going to make the difference in ‘statistical significance’, you are being dumb and/or a fraud.” That is, if the t-statistic on your “favorite” variable (the one the paper/study is about) is 1.85 with a robust estimator but with an efficient (non-robust) estimator is 2.02 and you are going to test and then “fail to reject” the null of heteroskedasticity in order to use the efficient standard error estimate so that you can put a star (literally) on your favorite variable and claim it is “statistically significant” this is almost certainly BS.
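To make point (iii) concrete, here is a minimal sketch on simulated data (not from any actual study). It fits a one-regressor OLS line and reports the t-statistic on the slope twice: once with the conventional standard error that is efficient under homoskedasticity, and once with a White-type (HC0) robust standard error. The data-generating process, the sample size, and the function name are all assumptions made for illustration.

```python
import math
import random


def slope_with_standard_errors(x, y):
    """OLS slope of y on x (with intercept), plus classical and White (HC0) SEs."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    # Classical SE: assumes every error term has the same variance.
    sigma2 = sum(e * e for e in resid) / (n - 2)
    se_classical = math.sqrt(sigma2 / sxx)
    # White (HC0) SE: allows the error variance to differ across observations.
    se_robust = math.sqrt(sum((d * e) ** 2 for d, e in zip(dx, resid))) / sxx
    return slope, se_classical, se_robust


# Simulated data (an illustrative assumption): error spread grows with |x|.
rng = random.Random(0)
x = [rng.uniform(-2.0, 2.0) for _ in range(5000)]
y = [1.0 + 0.5 * xi + rng.gauss(0.0, abs(xi)) for xi in x]

slope, se_classical, se_robust = slope_with_standard_errors(x, y)
print(f"slope            = {slope:.3f}")
print(f"t (classical SE) = {slope / se_classical:.2f}")
print(f"t (robust SE)    = {slope / se_robust:.2f}")
```

With errors built this way the robust standard error comes out larger than the classical one, so the t-statistic shrinks. When the t-statistic is far from any threshold the choice does not matter; when it sits near 2 and the choice of estimator decides whether a star appears, that is the case Hausman was warning about.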

One way of avoiding “replication” problems with your own research is to adopt something like a “five sigma” standard. That is, if your t-test is near 2 or 2.5 or even 3 (and I am using “t-test” just as shorthand, I really mean if the p-value on your H0 test is .01 or even .001) then the evidence is not really overwhelming, whereas a p-level of one in a million or one in a billion is much more reassuring that some modest change in method is not going to change results. In physics the rough convention is that 3 sigma is “evidence for” but a “discovery” requires 5 sigma (about one in 3.5 million) evidence.
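For readers who want to translate between “sigma” and p-values, the following short sketch uses the standard normal distribution (an approximation; a t-distribution with few degrees of freedom would give somewhat larger tail probabilities). The function name is just illustrative.

```python
import math


def one_sided_tail(z):
    """Probability that a standard normal draw exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


for z in (1.96, 2.5, 3.0, 5.0):
    p_one = one_sided_tail(z)
    print(f"{z:>4} sigma: one-sided p = {p_one:.3g}, "
          f"two-sided p = {2 * p_one:.3g}, "
          f"about one in {1 / p_one:,.0f} (one-sided)")
```

The “one in 3.5 million” figure for 5 sigma corresponds to the one-sided tail; the two-sided probability is twice as large.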

But then the question for younger academics and researchers is: “But isn’t everything that can be shown at 5 sigma levels already known?” Sure I could estimate an Engel curve from HH data and get a 5 sigma coefficient–but that is not new or interesting. The pressure to be new and interesting in order to get attention to one’s results is often what leads to the bias towards unreliable results as the “unexpectedly big” finding gets attention–and then precisely these fail to replicate.

Of course one way to deal with this is to “feign ignorance” and create a false version of what is “known” or “believed” (what actual beliefs are) so that your 5 sigma result seems new. Although this has worked well for the RCT crowd (e.g. publishing an RCT finding that kids are more likely to go to school if there is a school near them as if that were new) I don’t recommend it as real experts see it as the pathetic ploy that it is.

Here are some examples of empirical work of mine that has been 5 sigma and reliable but nevertheless got attention as examples of situations in which this is possible.

Digging into the data to address a big conceptual debate. In 1994 I published a paper showing that, across countries, actual fertility rates and “desired” fertility rates (however measured) were highly correlated and, although there is excess fertility of actual over desired, this excess fertility was roughly constant across countries and hence did not explain the variation in fertility rates across countries. I used the available Demographic and Health Surveys (DHS) in the empirical work. Since my paper several authors have revisited the findings using only data from DHS surveys carried out since my paper and the results replicate nearly exactly (and this is stronger than “replication”; it is more “reliability” or “reproducibility” in other samples, and out of sample stability is, in and of itself, a kind of omnibus specification test of a relationship).

But then the question is, how was this 5 sigma result new and interesting? Well, there were other 5 sigma results that showed a strong cross-national correlation in the DHS data between TFR and contraceptive prevalence. So the question was whether that relationship was driven by supply (the more contraception is available the higher the use and the lower the TFR) or whether that relationship was driven by demand (when women wanted fewer children they were more likely to use contraception). There were a fair number of people arguing (often implicitly) that the relationship was driven by supply and hence greater supply would causally lead to (much) lower TFR.

It was reasonably well known that the DHS data had a survey response from women about their “ideal” number of children but the obvious and persuasive criticism to that was that women would be reluctant to admit that a child they had was not wanted or past their “ideal” and hence a tight correlation of expressed ideal number of children and TFR might not reflect “demand” but “ex post rationalization.”

What made the paper therefore a paper was to dig into the DHS reports and see that the DHS reported women’s future fertility desires by parity. So one could see the fraction of women who reported wanting to have another child (either now or in the future) who had, say, 2, 4 or 6 existing births. This was a measure of demand that was arguably free of ex post rationalization and arguably a reliable indicator of flow (not stock) demand for fertility.

With this data one could show that nearly all cross-national variation in actual TFR was associated with variation in women’s expressed demand for children and that, conditional on expressed demand, the “supply” of contraception relationship was quite weak. And this finding has proved stable over time–Gunther and Harttgen 2016 replicate the main findings almost exactly using only data that has been produced since the paper (with the exception that the relationship appears to have weakened somewhat in Africa).

Use some (compelling) outside logic to put debates based on existing data in a new light. In 1997 I published a paper Divergence, Big Time arguing that, over the long sweep of history (or since, say, “modern” growth in the developed world in 1870) there has been a massive increase in the dispersion of GDP per capita (in PPP). This paper was written as a counter-weight to the massive attention “convergence” was getting. In the debate between “neoclassical” and “endogenous” growth models the question of “convergence” or “conditional convergence” was seen as critical, as it was argued that standard Solow-Swan growth models implied conditional convergence whereas with endogenous growth models one could get differences in steady state growth rates and hence long term divergence (which, among others, Robert Solow regarded as a bug, not a feature, as it implied levels of output could go to (essentially) infinity in finite time).

Anyway, at the time there was PPP data for most countries only since about 1960 and hence the analysis could only look at the 1960-1990 (or updated) period or one had historical data but nearly all the countries with reliable GDP data going back to 1870 were “developed” and hence the historical data was endogenous to being rich and hence could not answer the question. So, although everyone kind of intuitively knew the “hockey stick” take-off of growth implied divergence there was no accepted way to document the magnitude of divergence because we did not have GDP per capita data for, say, Ghana or Indonesia in 1870 on a comparable basis.

The key trick that made a paper possible was bringing some logic to bear and making the argument that GDP per capita has a lower bound as a demographically sustainable population requires at least some minimum level of output. So, for any given lower bound the highest the dispersion could have been historically was if each country with data was where the data said it was and each country without data was at the lower bound. Therefore one could compare an upper bound on historical dispersion with actual observed dispersion and show current dispersion was, in absolute numbers, an order of magnitude larger. Hence not just “divergence” but “divergence, big time” (and 5 sigma differences).

The main point here is that sometimes one can make progress by bringing some common sense into numbers made comparable to existing data. So everyone knows that people have to eat to stay alive, I just said “what would be the GDP per capita of a country that produced just enough food to produce caloric adequacy sufficient for demographic stability (e.g. not famine situation)?” to create a lower bound from common sense comparable to GDP data (and then used overlapping methods of triangulation to increase confidence).
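The bounding logic can be sketched in a few lines. Every number below is made up purely to show the mechanics; none of them are the figures from the paper, and the floor value in particular is a placeholder rather than the estimate I actually used.

```python
def upper_bound_spread(known_incomes, lower_bound):
    """Widest richest/poorest ratio and absolute gap consistent with a floor.

    Countries without data cannot have been below the floor, so placing all
    of them exactly at the floor gives the largest spread that is possible.
    """
    richest = max(known_incomes)
    poorest = min(min(known_incomes), lower_bound)
    return richest / poorest, richest - poorest


# Hypothetical GDP per capita (PPP) for countries WITH historical data.
historical_known = [2000, 1800, 1500, 1200, 1000]
floor = 250  # hypothetical subsistence lower bound

# Hypothetical present-day GDP per capita for all countries.
current = [20000, 18000, 15000, 12000, 10000, 3000, 1500, 800, 500, 400]

hist_ratio, hist_gap = upper_bound_spread(historical_known, floor)
curr_ratio = max(current) / min(current)
curr_gap = max(current) - min(current)

print(f"historical upper bound: ratio {hist_ratio:.0f}x, gap {hist_gap}")
print(f"current observed:       ratio {curr_ratio:.0f}x, gap {curr_gap}")
```

The point of the design is that the historical figure is a ceiling, not an estimate: if even the ceiling is far below what is observed today, divergence follows without needing 1870 data for the countries that lack it.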

Combine data not normally combined. In the “Place Premium” paper with Michael Clemens and Claudio Montenegro we estimate the wage gain of moving an equal intrinsic productivity worker from their country to the USA. Here everyone knew that wages were very different across countries but the question was how much of that was because of a movement “along” a wage relationship (say, along a Mincer curve, where average wages differ because the populations had different levels of education) and how much was a place specific difference in the wage relationships. So while there were literally thousands of Mincer-style wage regressions and probably hundreds of papers estimating the differences in wages between natives and migrants in the same country there were not any estimates of the gap in wages between observationally equivalent workers in two different places. So the main insight of this paper is that Claudio, as part of his research at the World Bank, had assembled a collection of labor force surveys from many countries and that the US data had information on people’s income and their birth country and at what age they moved to the USA. So we could, for any given country, say, Guatemala, compare a wage regression for people born in Guatemala, educated in Guatemala, but now working in the USA to a wage regression for people born in Guatemala, educated in Guatemala and working in Guatemala and therefore compute the wage gap for observationally equivalent workers between the two places. And we could do this for 40 countries. Of course we then had to worry a lot about how well “observationally equivalent” implied “equal intrinsic (person specific) productivity” given that those who moved were self-selected, but at least we had a wage gap to start from.
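A stripped-down sketch of the comparison, on simulated data: fit the same wage regression in each place, then compare predicted wages for a worker with identical observables. This is only an illustration of the mechanics. The real regressions have more covariates than years of schooling, the coefficients and sample sizes below are invented, and nothing here addresses the self-selection of movers.

```python
import math
import random


def fit_line(x, y):
    """Return (intercept, slope) of an OLS fit of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - slope * x_bar, slope


def simulate_log_wages(rng, intercept, return_to_schooling, n=2000):
    """Simulated (schooling, log wage) pairs; all parameters are hypothetical."""
    schooling = [rng.randint(0, 16) for _ in range(n)]
    log_wage = [intercept + return_to_schooling * s + rng.gauss(0.0, 0.3)
                for s in schooling]
    return schooling, log_wage


rng = random.Random(1)
# Same-origin workers observed in two places (simulated, not survey data).
home_schooling, home_log_wage = simulate_log_wages(rng, 0.5, 0.08)
usa_schooling, usa_log_wage = simulate_log_wages(rng, 1.8, 0.06)

a_home, b_home = fit_line(home_schooling, home_log_wage)
a_usa, b_usa = fit_line(usa_schooling, usa_log_wage)

years = 9  # compare workers with the same schooling in both places
log_gap = (a_usa + b_usa * years) - (a_home + b_home * years)
wage_ratio = math.exp(log_gap)
print(f"predicted wage ratio at {years} years of schooling: {wage_ratio:.2f}")
```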

The key insight here was to take the bold decision to combine data sets whereas all of the existing labor market studies did analysis on each of these data sets separately.

Shift the hypothesis being tested to a theoretically meaningful hypothesis. My paper “Where has all the education gone?” showed that standard constructions of a measure of the growth of “schooling capital” were not robustly associated with GDP per capita growth. One thing about the robustness of this paper is that I used multiple, independently constructed measures of schooling and of “physical” capital and of GDP per capita to be sure the results were not a fluke of a particular data set or measurement error.

The more important thing from my point of view was that I pointed out that the main reason to use macro-economic data to estimate returns to schooling was to test whether or not the aggregate return was higher than the private return. That is, there are thousands of “Mincer” regressions showing that people with more schooling have higher wages. But that fact, in and of itself has no “policy” implications (any more so than the return to the stock market does). A commonly cited justification for government spending on schooling was that there were positive spillovers and hence the public/aggregate return to schooling was higher than the private return. Therefore the (or at least “a”) relevant hypothesis test was not whether the coefficient in a growth regression was zero but whether the coefficient was higher than the microeconomic/“Mincer” regressions would suggest it should be. This meant that since the coefficient should be about .3 (the human capital share in a production function) this turned a “failure to reject zero” into a rejection of .3 at very high significance level (or, if one wanted to be cheeky, high significance level rejection that human capital did not have a negative effect on standard measures of TFP).
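The shift in the null hypothesis is easy to see in a small calculation. The estimate and standard error below are hypothetical, chosen only to show how the same coefficient can “fail to reject zero” while strongly rejecting .3; they are not the numbers from the paper, and the p-values use a normal approximation.

```python
import math


def z_and_two_sided_p(estimate, std_error, null_value):
    """z-statistic and two-sided p-value (normal approximation) against a null."""
    z = (estimate - null_value) / std_error
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Hypothetical growth-regression coefficient on schooling capital and its SE.
estimate, std_error = 0.02, 0.06

z_zero, p_zero = z_and_two_sided_p(estimate, std_error, 0.0)
z_share, p_share = z_and_two_sided_p(estimate, std_error, 0.3)

print(f"H0: coefficient = 0   -> z = {z_zero:5.2f}, p = {p_zero:.3g}")
print(f"H0: coefficient = 0.3 -> z = {z_share:5.2f}, p = {p_share:.3g}")
```

Nothing about the data changes between the two lines of output; only the question being asked of it does.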

(As an addendum to the “Where has all the education gone?” paper, I wrote a review/Handbook chapter that could “encompass” all existing results within a single functional form, with the variation governed by a parameter that could be estimated from observables. Hence the differences in results were not random: I could show how to get from my result to other results that appeared different from mine just by varying a single parameter.)

Do the robustness by estimating the same thing for many countries. In some cases there are data sets that collect the same data for many countries. A case in point is the Demographic and Health Surveys (DHS), which have repeated nearly exactly the same survey instrument in many countries, often many times. This allows one to estimate exactly the same regression for each country/survey separately. This has several advantages. One, you cannot really “data mine,” as in the end you have to commit to the same specification for each country. When working with a single data set there are just too many ways in which one can fit the data to one’s hypothesis. (This is a temptation that RCTs of course do not solve, as there are so many questions with no definitively “right” answer that can affect results; see, for instance, a detailed exploration of why the findings of an RCT about the impact of micro-credit in Morocco depended on particular–and peculiar–assumptions in variable construction and data cleaning (the link includes a back and forth with the authors).) If instead one estimates the same regression for 50 countries, the results are reported for each with the same specification. Two, one already has the variance of results likely to be expected across replications. If I estimate the same regression for 50 countries I have not just an average but an entire distribution, so that if someone does the same regression for one additional country one can see where that new country stands in the distribution of the 50 previous estimates. Three, the aggregated results effectively use tens of thousands or millions of observations, so the estimate of the “typical” value will often have six sigma precision.
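The three advantages above can be illustrated with a small simulation. Everything in it is assumed for illustration: the data are simulated rather than drawn from any real survey, the one-regressor specification is a stand-in for whatever common specification one commits to, and the parameter values (50 countries, 2,000 observations each, a true slope centered on 0.10) are hypothetical.

```python
import random
import statistics

def ols_slope(xs, ys):
    """Slope from a one-regressor OLS of y on x (with an intercept)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def simulate_country(rng, true_slope, n_obs):
    """Hypothetical survey data: outcome = 1 + slope * x + noise."""
    xs = [rng.uniform(0, 12) for _ in range(n_obs)]
    ys = [1.0 + true_slope * x + rng.gauss(0, 1) for x in xs]
    return xs, ys

def percentile_rank(value, reference):
    """Percent of reference estimates at or below the given value."""
    return 100.0 * sum(r <= value for r in reference) / len(reference)

rng = random.Random(0)
n_countries, n_obs = 50, 2000

# One: the SAME specification is estimated for every country.
slopes = []
for _ in range(n_countries):
    true_slope = rng.gauss(0.10, 0.03)   # assumed cross-country heterogeneity
    xs, ys = simulate_country(rng, true_slope, n_obs)
    slopes.append(ols_slope(xs, ys))

# Two: the result is a whole distribution, not just an average.
mean_slope = statistics.fmean(slopes)
sd_slope = statistics.stdev(slopes)

# Three: the "typical" value is estimated very precisely.
se_of_mean = sd_slope / n_countries ** 0.5

new_country_slope = 0.16   # hypothetical estimate from one additional country

print(f"mean slope {mean_slope:.3f}, sd across countries {sd_slope:.3f}")
print(f"mean / standard error of mean = {mean_slope / se_of_mean:.1f}")
print(f"new country sits at percentile {percentile_rank(new_country_slope, slopes):.0f}")
```

The ratio of the mean to its standard error is what the “six sigma” claim refers to; the percentile rank shows how a single new estimate can be read against the existing distribution.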

With this approach it is, of course, somewhat harder to generate new and interesting findings, as existing, comparable data are often well explored. I have a recently published paper about the impact of not just “schooling” but schooling and learning separately with the DHS data that is an example of generating a distribution of results (Kaffenberger and Pritchett 2021) and a recent paper with Martina Viarengo (2021) doing analysis of seven countries with the new PISA-D data, but only time will tell what the citations will be. But, for instance, Le Nestour, Muscovitz, and Sandefur (2020) have a paper that estimates the evolution over time within countries of the likelihood that a woman completing grade 5 (but no higher) can read, which I think is going to make a huge splash.

Wait for prominent people to say things that are wrong to first order. For a wide variety of reasons people will come to want things to be true that just aren’t. For instance, JPAL had an op-ed that claimed that targeted programs were “equally important” with economic growth in reducing poverty (to be fair, this was the Executive Director and PR person for JPAL; “Poor Economics” just hints at that, as the authors are too crafty to say it). That claim is easy to show is wrong, by at least an order of magnitude (Pritchett 2020). Or, many people in development have begun to claim that GDP per capita is not reliably associated with improvements in well-being (like health and education and access to safe water), which is easy to refute (even with their own data, strangely) at six sigma levels (Pritchett 2021).

Why I won’t sell “best buys”–and why you shouldn’t buy them

Let me start with an analogy.

The famous theater Carnegie Hall (note 1) is located in Manhattan on the east side of Seventh Avenue between 58th and 57th, hence between Central Park at 59th and Times Square at 42nd. Suppose I observe someone walk down the east side of Seventh Avenue from Central Park (59th) to Times Square (42nd) and then I stop them and say: “If you are headed for Carnegie Hall you should turn around and walk back up the east side of Seventh Avenue to 57th Street.”

Can that statement of correct directions to Carnegie Hall be considered “advice” or a “recommendation” to the person I stopped? I think not. I think the prima facie and best interpretation of the person’s behavior is that they were not going to Carnegie Hall at the time. If I made this “recommendation” 10,000 times I would be very surprised if even once the response was: “Gee thanks mister, I was headed to Carnegie Hall but didn’t know how to get there and I am a little chagrined I walked right past it.”

I don’t think it properly counts as “advice” or a “recommendation” to give people conditional information: “if you want to achieve X, do Y” if there is no evidence they want to do X and, even more so, if the best available evidence from their observed behavior is that they don’t (currently) want to do X.

Now a personal story about “best buys” in education (or policy advice based on empirical estimates of cost effectiveness).

An early attempt to do “best buys” (or “smart buys”) was the Copenhagen Consensus, which was an attempt to give expert and evidence informed recommendations as to how best to spend some given amount of money, like $25 billion, to promote human wellbeing. The process was, step 1, to choose a variety of potential domains in which there might be cost effective spending opportunities (e.g. education, health, corruption, water and sanitation) and hire an expert in each of those domains to review the available evidence, rank, with specific estimates, the most cost-effective “interventions” or “actions,” and produce a “challenge” paper addressing the various challenges. Then (step 2) the expert chapters would be read by two other experts who would provide comments, and (step 3) the chapter authors and the discussants in each domain would present their findings and evidence to an expert panel. Step 4, the panel would then produce a “consensus” of the most cost effective ways to improve human wellbeing (note 2).

I was hired to write the education challenge paper. I wrote a long paper that had an explication of the simple producer theory of maximizing subject to constraints and a review of the literature of empirical estimates of cost effectiveness. I then pointed out that if we were assuming “normative as positive”–that is, that our positive, descriptive theory of the behavior of the producers of education was that they actually behaved as the normative theory said they should–then this had (at least) four empirical implications, and that all of those were, at least in many countries and instances, rejected by very strong evidence.

In particular, drawing on my previous work with Deon Filmer, “What education production functions really show,” my paper pointed out that an empirical implication of normative producer theory with a multi-input production function was that the marginal gain in the producer’s objective function per dollar of spending on each input should be equalized. This implied that, if the evidence pointed to one particular input having a very, very high cost-effectiveness in producing a given output (say, some measure of learning gain per year of schooling), then this was prima facie evidence that the producer choosing the input mix was not maximizing that output. Therefore this evidence was evidence against “normative as positive”–that producers were actually maximizing an education output with their choice of inputs–and therefore one could not, as it was not internally coherent, use that evidence to make “recommendations” on the assumption that the producer was maximizing. (The connection to the analogy is obvious: I cannot stop people who have walked right by Carnegie Hall, give them “recommendations” about how to get to Carnegie Hall, and expect that to change their behavior, as the best interpretation of their behavior is that they were not trying to get to Carnegie Hall at the time.)
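The equalization condition can be made concrete with a toy example. This is a minimal sketch under stated assumptions: the two-input square-root production function, the productivity parameters, the budget, and the “observed” allocation are all hypothetical and chosen only to show the logic, not taken from the challenge paper or from Filmer and Pritchett.

```python
import math

def learning(x1, x2, a1=1.0, a2=2.0):
    """Hypothetical concave production function of spending on two inputs."""
    return a1 * math.sqrt(x1) + a2 * math.sqrt(x2)

def marginal_gain_per_dollar(x, a):
    """Derivative of a * sqrt(x) with respect to spending x."""
    return a / (2.0 * math.sqrt(x))

def optimal_split(budget, a1=1.0, a2=2.0):
    """Closed-form maximizer of learning subject to x1 + x2 == budget."""
    x1 = budget * a1 ** 2 / (a1 ** 2 + a2 ** 2)
    return x1, budget - x1

budget = 100.0

# A producer who really maximizes learning equalizes marginal gains per dollar.
x1_opt, x2_opt = optimal_split(budget)
ratio_opt = (marginal_gain_per_dollar(x1_opt, 1.0)
             / marginal_gain_per_dollar(x2_opt, 2.0))

# A hypothetical observed allocation that starves input 1.
x1_obs, x2_obs = 1.0, 99.0
ratio_obs = (marginal_gain_per_dollar(x1_obs, 1.0)
             / marginal_gain_per_dollar(x2_obs, 2.0))

print(f"maximizing producer: ratio of marginal gains per dollar = {ratio_opt:.2f}")
print(f"observed allocation: ratio of marginal gains per dollar = {ratio_obs:.2f}")
print(f"learning forgone: {learning(x1_opt, x2_opt) - learning(x1_obs, x2_obs):.2f}")
```

A ratio far from one is what a “best buy” estimate amounts to: input 1 looks extremely cost effective precisely because the producer is not allocating as a learning maximizer would, which is why the finding undercuts the model on which a “recommendation” would rest.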

In my challenge paper I gave reasons why “recommendations” about how to improve education had to be based on a correct positive model of what education producers were actually doing and why and I made some suggestions of what such a model might look like. And in doing so, I explicitly explained why I was therefore not going to provide a list of “cost effective” (“best buy” or “smart buy”) actions or interventions in education, in spite of having presented empirical evidence that often showed there existed highly cost effective actions.

I submitted my paper. The organizers got back to me and pointed out I did not provide them with a list of “best buys” to be compared by the panel to other domains. I said yes, I was aware of that and that I thought my paper was an excellent guide of what might be done to improve education but that imagining there were discrete, compartmentalizable, actions that were “cost effective” and “recommending” those as ways for some outsider to spend money was not a way to improve education, one needed to think about education systems as systems and understand them.

The organizers then pointed out that the Terms of Reference of the output they were paying me X thousand dollars for (where I honestly don’t remember X, but it was on the order of $10,000) included that I provide them such a list and that I had already taken half of the payment up front. I acknowledged that, apologized for not having read and interpreted the TOR correctly, and offered not only to forgo the second payment on the contract but also to give back the first half paid in advance. I pointed out that it wasn’t just that I thought the evidence was too weak (not “rigorous” enough): I thought the idea of making recommendations based on evidence and a positive model of the agents/actors to whom you were giving “recommendations,” when the evidence was inconsistent with the positive model, was intellectually incoherent, contradictory, and hence untenable. I would rather give up payments after I had done a massive amount of work than have my name associated with things that were so intellectually indefensible. I would not sell them “best buys.”

The final “challenge” paper I think remains a great short introduction into the economics of education.

In the end they relented, as they were faced with the prospect of not having “education” as one of their considered domains. But since I had not provided a list, the expert panel’s list did not, I think (I did not pay that much attention to the overall process), have any education “interventions” in its top 10. The Copenhagen Consensus was repeated and in the next round, not surprisingly, they chose a different expert but, to their credit, I was asked to be a discussant and hence could articulate my objections again (although I went light on the “normative as positive” point).

None of my 2004 objections to the “normative as positive” contradictions in using evidence from studies of cost-effectiveness of individual interventions (no matter how “rigorous” these estimates are) to make “recommendations” have been addressed.

Rather, what has happened often illustrates exactly my points. Three examples, one from Kenya, one from India and one from Nigeria.

Duflo, Dupas and Kremer (2015) did an RCT estimating the impact of reducing class sizes in early grades in Kenya from very high levels. There was a control group and four treatment arms from two options (2 by 2): (a) the teacher was either hired on a regular civil service appointment or hired on a contract, and (b) the additional classroom was either divided on student scores (tracked) or not. The results were that the “business as usual” reduction in class size (civil service appointment, non-tracked classrooms) had a very small impact (not statistically different from zero), whereas class size reductions using contract teachers had impacts on learning in both the tracked and untracked treatment arms.

In a JPAL table showing the “rigorous” evidence about cost effectiveness (on which things like “best buys” or “smart buys” are based) this appears as “contract teachers” being an infinitely cost effective intervention.

Of course in any normative producer theory the existence of an infinitely cost effective input should set off loud, klaxon volume, warning bells: “Oooga! Oooga!” This finding is, in and of itself, a rejection of the model that the producer is efficient (as it cannot be the case that the cost effectiveness of all inputs is being equalized if one of them is infinite). So I cannot maintain as even semi-plausible that my positive theory of this producer is that they are maximizing the measured outcome subject to budget constraints. But if that isn’t my positive model what is? And in a viable positive model of producer behavior what would be the reaction to the “recommendation” of contract teachers and what would be the outcome?

The reason I used the Kenyan example is that the Kenyan government decided to scale up the reduction in class size using contract teachers. A group of researchers did an RCT of the impact of this scaling. The Kenyan government did not have capability to scale the program nationwide so they had an NGO do parts of the country and the government do parts of the country. The researchers (Bold, Kimenyi, Mwabu, Ng’ang’a, and Sandefur 2018) found that in the government implemented scaling up there was zero impact on learning. So an infinitely cost-effective intervention when done by an NGO–a “best buy”–had zero impact when actually scaled by government and so was not at all a “best buy.”

Another example comes from the state of Madhya Pradesh, India, where the state adopted at scale a “school improvement plan” project that was based on the experience of a similar approach in the UK. A recent paper by Muralidharan and Singh (2020) reports that the project was implemented, in the narrow sense of compliance: schools did in fact prepare school improvement plans. But overall there was zero impact on learning (and not just “fail to reject zero”: in early results the estimated impact on learning was zero to three digits), and the zero impact was consistent with estimates that, other than doing the school improvement plan, nothing else changed in the behavior of anyone: teachers, principals, supervisors. So whether “school improvement plans” were or were not a “best buy” in some other context, they had zero impact at scale in Madhya Pradesh.

A third example is from a (forthcoming–will update) paper by Masooda Bano (2021) looking at the implementation of School Based Management Committees (SBMC) in Nigeria. In a qualitative examination of why SBMC seem to have little or no impact in the Nigerian context, she finds that those responsible for implementation really don’t believe in SBMC or want them to succeed, but see going through the motions of doing SBMC as a convenient form of isomorphism: the donors like it, and therefore the pretense of SBMC keeps the donors complacent. So whatever evidence there might be that SBMC can be cost effective when well designed and well implemented is irrelevant to the cost effectiveness of SBMC in practice in Nigeria.

My point is not just another illustration of the lack of “external validity” of empirical estimates of cost-effectiveness; it is deeper than that. It is the intellectual incoherence of making “recommendations” based on a positive model of producer behavior (that producers are attempting to maximize an outcome subject to constraints) when the empirical estimates themselves are part of the evidence that rejects this positive model.

Let me end with a different analogy of “best buys.”

Suppose I have just read that spinach and broccoli are “cost effective” foods in providing high nutritional content at low prices. I am in the grocery store and see a fellow shopper whose cart is loaded with food that is both bad for you and expensive (e.g. sugared breakfast cereals) and nothing really nutritious. I could then go up to her/him and make a “recommendation” and give him/her my empirical evidence grounded “smart buy” advice: “Hello stranger, you should buy some broccoli because it is a cost effective source of vitamins.” One can imagine many outcomes from this but perhaps the least plausible response is: “Gee thanks, fellow stranger, I will now buy some broccoli and integrate this cost effective source of vitamins into my regular food consumption habits.”

Take the analogy a step further and suppose I have an altruistic interest in the health of my fellow shopper and so I just buy broccoli and spinach and put it into his/her shopping bags for free. Again, one can imagine many outcomes from this action of mine, but I would think the most probable is that some broccoli and spinach gets thrown away.

“Smart buys” is just dumb (worse than dumb, as believing things that are false is very very common and easy to do–most of us do it most of the time about most topics–but believing things that are internally contradictory (“I believe both A and not A”) takes some additional mental effort to stick to an attractive but dumb idea). As my story illustrates, I personally would give up substantial sums of money rather than have my name associated with this approach. I will not sell “best buys.” Given the poor track record of slogging “best buy” evidence that then does not deliver in implementation in context, you should be wary of buying it.

Note 1) The reason I use directions to Carnegie Hall is because there is the old joke about it. One person stops another on the street and asks: “Do you know how I can get to Carnegie Hall?” The answer: “Practice, practice, practice.”

Note 2) This Copenhagen Consensus process was called such because it was instigated and led by Bjorn Lomborg (who was based out of some organization in Copenhagen), and the not so hidden agenda was to just inform people that, on the available evidence about the likely distribution of possible consequences of climate change and the likely costs of avoiding those consequences, one need not be a “climate change denialist” to acknowledge that the world had lots and lots of current and future problems and that action on climate change should be compared/contrasted to other possible uses of scarce resources. Some might discredit the exercise for this reason, but (a) none of the domain experts in their sector papers had, or were asked to form, any view about climate change and (b) one can bracket the climate change estimates from the expert panel and the ranking within and across domains is unaffected. So whether or not you think climate change was unfairly treated in this process vis a vis education or health or nutrition, each of those was treated equally and, as best as I could tell, there wasn’t any bias across the other domains.

Current claims about the benefits of using “rigorous” evidence are deeply wrong

I also have a new paper that argues that the current conventional wisdom about “rigorous evidence” in policy making is just empirically wrong. That is, a current conventional wisdom is that, because of the dangers of a lack of internal validity of estimates of causal impact (LATE) one needs to do RCTs. Then, after doing some number of those, one should do a “systematic review” that aggregates those “rigorous” estimates and policy making should be “evidence based” and rely on these systematic reviews. The paper shows, with real world data across countries, that this approach actually produces larger prediction error in causal impact than if each country just relied on its own internally biased estimates.


A simple analogy is helpful. Suppose all men lie about their height and claim to be 1 inch taller than they actually are. Then self-reported height is internally biased. One could do a study to produce “rigorous” estimates of the true height of men and obtain the distribution of true heights, which has a mean (say, 69 inches (5′ 9”)) and a standard deviation (say, 3 inches). Then suppose I want to predict Bob’s height. If I don’t know anything about Bob then 69 inches is my best guess. But suppose I do have Bob’s self-reported height and he says he is 6′ 3” (75 inches tall). The conventional wisdom of the “RCTs plus systematic review” approach would tell us to guess 69 inches and ignore Bob’s self-report altogether, because it is not a “rigorous” estimate of Bob’s height: it isn’t internally valid and is biased. But in this case that approach is obvious madness. We should guess not that Bob is 69 inches but that he is 6’2” (74 inches) tall, and if Fred says he is 5’5” (65 inches) we should guess not 69 inches but 64 inches.

The obvious point is that the prediction error across a number of cases depends on the relative magnitude of the true heterogeneity in the LATE across contexts versus the magnitude of internal bias in a given context. There is no scientifically defensible case for using the mean of the set of “rigorous” estimates as the context specific estimate of the LATE (the proposed “conventional wisdom”) in the absence of specific and defensible claims about (a) the heterogeneity of the true LATE across contexts (and the available evidence suggests heterogeneity of the LATE is large) and (b) the typical magnitude and heterogeneity of the internal bias from various context specific estimates (about which we know little).
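The height analogy can be checked with a short simulation. This is a sketch under the assumptions of the analogy itself (true heights with mean 69 and standard deviation 3 inches, a constant 1 inch self-reporting bias and no other reporting noise); the second scenario, where heterogeneity is small relative to the bias, is a hypothetical added only to show when the ranking flips.

```python
import math
import random

def rmse(predictions, truths):
    """Root mean square error of predictions against true values."""
    n = len(truths)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, truths)) / n)

def compare(true_sd, bias, n=10_000, seed=0):
    """Prediction error of two rules for guessing each man's true height."""
    rng = random.Random(seed)
    truths = [rng.gauss(69.0, true_sd) for _ in range(n)]
    self_reports = [t + bias for t in truths]   # biased, but person specific
    pooled_mean = sum(truths) / n               # the "rigorous" average
    return {
        "pooled_mean": rmse([pooled_mean] * n, truths),
        "own_biased": rmse(self_reports, truths),
    }

# The analogy in the text: heterogeneity (3 inches) exceeds bias (1 inch).
as_in_text = compare(true_sd=3.0, bias=1.0)

# Hypothetical reverse case: heterogeneity (0.5 inch) is below bias (1 inch).
reverse_case = compare(true_sd=0.5, bias=1.0)

for label, result in [("sd 3.0, bias 1.0", as_in_text),
                      ("sd 0.5, bias 1.0", reverse_case)]:
    print(f"{label}: RMSE pooled mean = {result['pooled_mean']:.2f}, "
          f"RMSE own biased report = {result['own_biased']:.2f}")
```

Which rule predicts better is an empirical matter of heterogeneity versus bias, so neither can be declared superior without evidence on both magnitudes.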

The paper (which is a homage to Ed Leamer’s classic “Let’s take the con out of econometrics” paper, and which is still a draft) illustrates this point with data about estimates of the private sector learning premium across countries. I show that both the heterogeneity across countries in the estimates and the internal bias are large, and that the net result is that the “rigorous estimates plus systematic review” approach produces a larger RMSE (root mean square error) of prediction than just using the OLS estimate (adjusting estimates for student HH SES) for each country.