Rigorous Evidence Isn’t

written by Lant Pritchett

Currently, there are many statements floating around in development about the use of “rigorous evidence” in formulating policies and programs. Nearly all of these claims are fatuous. The problem is, rigorous evidence isn’t.

That is, suppose one generates some evidence about the impact of some programmatic or policy intervention in one particular context, and all agree that it is “rigorous” because it meets methodological criteria for internal validity of its causal claims. The instant this evidence is used in formulating policy, it isn’t rigorous evidence any more. Evidence would be “rigorous” about predicting the future impact of adopting a policy only if the conditions under which the policy was to be implemented were exactly the same, in every relevant dimension, as those under which the “rigorous” evidence was generated. But that can never be so, because neither economics nor any other social science has theoretically sound and empirically validated invariance laws that specify what “exactly the same” conditions would be.

So most uses of rigorous evidence aren’t. Take, for instance, the justly famous 2007 JPE paper by Ben Olken on the impact of certain types of monitoring on certain types of corruption. According to Google Scholar as of today, this paper has been cited 637 times. The question is, for how many of the uses of this “rigorous evidence” is it really “rigorous evidence”? We (well, my assistant) sampled 50 of the citing papers, which contained 57 unique mentions of Olken (2007). Only 8 of those papers were about Indonesia. (Of course, even those 8 are only arguably “rigorous” applications, as they might be about different programs, different mechanisms, or different contexts.) 47 of the 57 mentions (82%) are neither about Indonesia nor even about an East Asian or Pacific country: they might be a review of the literature about corruption in general, about another country, or methodological. We also tracked whether the words “context” or “external validity” appeared within two paragraphs before or after the mention. In 34 of the 57 mentions (60%), the evidence was not about Indonesia and the citing text did not mention that the results, while “rigorous” for the time, place, and programmatic/policy context, have no claim to be rigorous about any other time, place, or programmatic/policy context.

Another justly famous paper, Angrist and Lavy (1999) in the QJE, uses regression discontinuity to identify the impact of class size on student achievement in Israel. This paper has been cited 1244 times. I looked through the first 150 citations to this paper (which Google Scholar sorts by the number of times the citing paper has itself been cited). Other than other papers by the authors, not one mentioned Israel in the title or abstract (not that surprising, as Israel is a tiny country), while China, India, Bangladesh, Cambodia, Bolivia, the UK, Wales, the USA (various states and cities), Kenya, and South Africa all figured. Angrist and Lavy do not, and do not claim to, provide “rigorous” evidence about any of those contexts.

If one is formulating policies or programs for attacking corruption in highway procurement in Peru or reducing class size in secondary school in Thailand, it is impossible to base those policies on “rigorous evidence,” because evidence that is rigorous for Indonesia or Israel isn’t rigorous for these other countries.

Now, some might make the argument that formulation of policies or programs in context X should rely exclusively/primarily/preferentially on evidence that is “rigorous” in context Z because at least we know that in context Z in which it was generated the evidence is internally valid.  This is both fatuous and false as a general proposition.

Fatuous in that no one understands the phrase “policy based on rigorous evidence” to mean “policy based on evidence that isn’t rigorous with respect to the actual policy context to which it is being applied (because there are no rigorous claims to external validity) but rather based on evidence that is rigorous in some other context.”  No one understands it that way because that isn’t rigorous evidence.

It is also false as a general proposition. It is easy to construct plausible empirical examples in which the evidence suggests that the bias from a lack of internal validity is much (much) smaller than the bias from a lack of external validity, because the variation in “true” impact across contexts is much larger than the methodological bias that simple methods incur by lacking “clean” causal identification. In these instances, better policy is made using “bad” (e.g. not internally valid) evidence from the same context than “rigorous” evidence from another context (e.g. Pritchett and Sandefur 2013).
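To make that comparison concrete, here is a minimal simulation sketch. Every number in it is an illustrative assumption of mine, not an estimate from Pritchett and Sandefur (2013) or any other study: true impacts vary across contexts with standard deviation `tau`, the non-rigorous local estimate carries a fixed identification bias `bias`, and both estimates share the same sampling noise `se`. The function name `simulate_rmse` is likewise hypothetical.

```python
import random
import statistics


def simulate_rmse(n_draws=20000, tau=0.30, bias=0.10, se=0.05, seed=0):
    """Compare two ways of predicting the true impact in target context X.

    All parameter values are illustrative assumptions:
      tau  - std. dev. of true impact across contexts (external validity problem)
      bias - fixed bias of the non-rigorous local estimate (internal validity problem)
      se   - sampling noise, assumed equal for both estimates
    Returns (rmse_local, rmse_foreign).
    """
    rng = random.Random(seed)
    mean_effect = 0.20  # assumed average true impact across contexts
    sq_err_local, sq_err_foreign = [], []
    for _ in range(n_draws):
        beta_x = rng.gauss(mean_effect, tau)  # true impact where policy is applied
        beta_z = rng.gauss(mean_effect, tau)  # true impact where the study was run
        # "Bad" evidence: right context, but biased identification.
        local_estimate = beta_x + bias + rng.gauss(0, se)
        # "Rigorous" evidence: unbiased for context Z, applied to context X.
        foreign_estimate = beta_z + rng.gauss(0, se)
        sq_err_local.append((local_estimate - beta_x) ** 2)
        sq_err_foreign.append((foreign_estimate - beta_x) ** 2)
    rmse_local = statistics.mean(sq_err_local) ** 0.5
    rmse_foreign = statistics.mean(sq_err_foreign) ** 0.5
    return rmse_local, rmse_foreign


rmse_local, rmse_foreign = simulate_rmse()
print(f"RMSE using biased local evidence:     {rmse_local:.3f}")
print(f"RMSE using rigorous foreign evidence: {rmse_foreign:.3f}")
```

Under these assumptions the expected squared errors are `bias**2 + se**2` for the local estimate and `2 * tau**2 + se**2` for the foreign one, so the “bad” local evidence predicts better whenever `bias**2 < 2 * tau**2`. Reverse the magnitudes (say `tau=0.02, bias=0.30`) and the ranking flips, which is the point: neither kind of evidence wins as a general proposition.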

Sadly perhaps, there is no shortcut around using judgment and wisdom in assessing all of the available evidence in formulating policies and programs.  Slogans like “rigorous evidence” are an abuse, not a use, of social science.

What’s in a counterfactual?

written by Salimah Samji

I am amazed by people’s obsession with the counterfactual, and by the belief that evidence cannot exist without it. Why are people so enamored of the idea of ‘the solution’ even though we have learned time and time again that there is no one-size-fits-all?

Is the existence of a counterfactual a sufficient condition? Why don’t people ask questions about the design and implementation of the evaluation? Specifically:

  • What are you measuring and what is the nature of your context: Where in the design space are you? Is your fitness landscape smooth or rugged? Eppstein et al., in “Searching the Clinical Fitness Landscape,” test two approaches to identify which leads to healthcare improvements: multicenter randomized controlled trials versus quality improvement collaboratives, in which you work with others, learn from collective experience, and customize based on local context. They find that the quality improvement collaboratives are most effective in the complex socio-technical environments of healthcare institutions. Basically, the moment you introduce any complexity (increased interactions between variables), experiential methods trump experimental ones.
  • Who is collecting your data and how: Collecting data is a tedious task and the incentive to fill out surveys without having to go to the village is high, especially if no one is watching. Then there are questions of what you ask, where you ask, how you ask, what time period it is, how long the questionnaire is, etc.
  • How is the data entered and verified: Do you do random checks? Double data entry?
  • Is the data publicly available for scrutiny?

And then there is the external validity problem. Counterfactual or not, it is crucial to adapt development interventions to local contextual realities, where high quality implementation is paramount to success. Bold et al., in “Scaling Up What Works: Experimental Evidence on External Validity in Kenyan Education,” find that while NGO implementation of contract teachers in Kenya produced a positive effect on test scores, government implementation of the same program yielded zero effect. They cite implementation constraints and the political economy forces in play as reasons for the stark difference. In a paper entitled “Using Case Studies to Explore the External Validity of ‘Complex’ Development Interventions,” Michael Woolcock argues for deploying case studies to better identify the conditions under which diverse outcomes are observed, with a focus on contextual idiosyncrasies, implementation capabilities, and trajectories of change.

To top it off, today’s graduate students in economics don’t read Hirschman (some have never heard of him!) … should we be worried?