Rigorous evaluation, but to what end?

written by Salimah Samji

Many development projects fail because of poor design: they have no clear roadmap of how they will get from A to B, and therefore no way of knowing whether they are on the right track. However, design alone is not enough for success. In fact, many well-designed development projects fail because of last-mile implementation problems. These include (but are not limited to) existing capacity constraints, difficulty finding skilled staff, lack of funds, politics, and misaligned incentives.

Spending time, effort, and resources to rigorously evaluate projects like these can be a wasteful exercise. It would make more sense to begin by testing the logic of your intervention with low-cost tools, such as process audits or the collection and analysis of meaningful data on outputs and intermediate outcomes, and to feed the results back into design and implementation, thus increasing the possibility of success. Does the project make sense in your context?

Several years ago, when I worked for Google.org, I suggested that one of our grantees, who conducted large-scale assessments, consider a process audit to identify gaps and improve implementation. They were very open to the idea. I wanted to hire an external consultant to ensure impartiality, but also to ensure that the consultant was seen as a team player who had been hired to help them. The grantee was thrilled with the findings and told me that this simple process audit had been much more useful and cost-effective than the other Randomized Control Trials (RCTs) they had been part of. They invited the consultant to share the findings with their entire team and together brainstormed ways to improve implementation. It marked the beginning of a long collaboration between the two – but that is another story …

Recently, I was pleasantly surprised to read a paper by Diana Epstein and Jacob Alex Klerman entitled When is a Program Ready for Rigorous Impact Evaluation? The Role of a Falsifiable Logic Model. They argue that a program that cannot achieve the intermediate goals specified by its own logic model will not have the desired impacts and should therefore not proceed to Rigorous Impact Evaluation (RIE). Whether a project can achieve its own intermediate goals can be determined at low cost using conventional process evaluation methods, that is, careful observation of program operation.

They highlight five common forms of failure of programs to satisfy their own logic models:

  • Failure to secure required inputs: inability to establish inter-organization partnerships or to recruit and retain certain types of staff.
  • Low program enrollment: the program does not attract the target number of clients/participants because there is no demand.
  • Low program completion rates: clients initially enroll in the program but do not complete the expected treatment.
  • Low fidelity: the program as implemented falls short of what was envisioned in the logic model.
  • Lack of pre/post improvement: clients show minimal or no progress on pre/post measures of the intermediate outcome, which could be due to history and maturation effects.

Then, for each form of failure, they discuss how it could have been detected by a process evaluation: through observation of a program's operation, an experienced outsider can suggest direct and immediate ways to improve how the program runs.

They also note, “it seems plausible that lessons learned in the first year or two of operation might lead to program refinements, improved program implementation, and improved client outcomes.” This fits nicely with our paper, It’s All About MeE: Using Structured Experiential Learning (‘e’) to Crawl the Design Space, which helps develop a learning strategy that achieves more than monitoring, costs less than impact evaluation, builds timely, dynamic feedback loops into the project, and extends the insights gained into the design and management of development projects.

Rigorous Evidence Isn’t

written by Lant Pritchett

Currently, there are many statements floating around in development about the use of “rigorous evidence” in formulating policies and programs. Nearly all of these claims are fatuous. The problem is, rigorous evidence isn’t.

That is, suppose one generates some evidence about the impact of a programmatic or policy intervention in one particular context that is agreed by all to be “rigorous” because it meets methodological criteria for the internal validity of its causal claims. But the instant this evidence is used in formulating policy, it isn’t rigorous evidence any more. Evidence would be “rigorous” about predicting the future impact of adopting a policy only if the conditions under which the policy was to be implemented were exactly the same in every relevant dimension as those under which the “rigorous” evidence was generated. But that can never be so, because neither economics nor any other social science has theoretically sound and empirically validated invariance laws that specify what “exactly the same” conditions would be.

So most uses of rigorous evidence aren’t. Take, for instance, the justly famous 2007 JPE paper by Ben Olken on the impact of certain types of monitoring on certain types of corruption. According to Google Scholar as of today, this paper has been cited 637 times. The question is: for how many of the uses of this “rigorous evidence” is it really “rigorous evidence”? We (well, my assistant) sampled 50 of the citing papers, containing 57 unique mentions of Olken (2007). Only 8 of those papers were about Indonesia (and of course even those 8 are only arguably “rigorous” applications, as they might be about different programs, different mechanisms, or different contexts). 47 of the 57 mentions (82%) are neither about Indonesia nor even about an East Asia or Pacific country; they might be a review of the literature about corruption in general, about another country, or methodological. We also tracked whether the words “context” or “external validity” appeared within +/- two paragraphs of the mention. In 34 of the 57 mentions (60%), the evidence was not about Indonesia and there was no acknowledgment that the results, while “rigorous” for their time, place, and programmatic/policy context, have no claim to be rigorous about any other time, place, or programmatic/policy context.

Another justly famous paper, Angrist and Lavy (1999) in the QJE, uses regression discontinuity to identify the impact of class size on student achievement in Israel. This paper has been cited 1,244 times. I looked through the first 150 citations to this paper (which Google Scholar sorts by the number of times the citing paper has itself been cited) and, other than other papers by the authors, not one mentioned Israel in the title or abstract (not that surprisingly, as Israel is a tiny country), while China, India, Bangladesh, Cambodia, Bolivia, the UK, Wales, the USA (various states and cities), Kenya, and South Africa all figured. Angrist and Lavy do not, and do not claim to, provide “rigorous” evidence about any of those contexts.

If one is formulating policies or programs for attacking corruption in highway procurement in Peru, or for reducing class size in secondary school in Thailand, it is impossible to base those policies on “rigorous evidence,” as evidence that is rigorous for Indonesia or Israel isn’t rigorous for these other countries.

Now, some might argue that the formulation of policies or programs in context X should rely exclusively/primarily/preferentially on evidence that is “rigorous” in context Z because at least we know that, in context Z where it was generated, the evidence is internally valid. This is both fatuous and false as a general proposition.

Fatuous in that no one understands the phrase “policy based on rigorous evidence” to mean “policy based on evidence that isn’t rigorous with respect to the actual policy context to which it is being applied (because there are no rigorous claims to external validity) but rather based on evidence that is rigorous in some other context.”  No one understands it that way because that isn’t rigorous evidence.

It is also false as a general proposition. It is easy to construct plausible empirical examples in which the bias from weak internal validity is much (much) smaller than the bias from weak external validity, because the contextual variation in “true” impact is much larger than the methodological bias introduced by simple methods’ lack of “clean” causal identification. In these instances, better policy is made using “bad” (i.e., not internally valid) evidence from the same context than “rigorous” evidence from another context (e.g. Pritchett and Sandefur 2013).
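The trade-off can be sketched with a toy simulation (the numbers are hypothetical, chosen for illustration, not drawn from Pritchett and Sandefur): suppose true impacts vary widely across contexts, while a naive non-experimental estimate in one's own context carries only a modest selection bias. Comparing root-mean-square error against the impact in our own context, the biased local estimate can easily beat the internally valid foreign one.

```python
import random

random.seed(0)

# Illustrative parameters (hypothetical, not from any study):
CONTEXT_SD = 0.50   # spread of "true" impacts across contexts
LOCAL_BIAS = 0.10   # selection bias of the naive local estimate
NOISE_SD = 0.05     # sampling noise in either estimate
N = 100_000         # simulated policy decisions

sq_err_local, sq_err_foreign = 0.0, 0.0
for _ in range(N):
    true_local = random.gauss(0.0, CONTEXT_SD)    # impact in our context
    true_foreign = random.gauss(0.0, CONTEXT_SD)  # impact where the RCT ran
    # Biased but local: observational estimate from our own context.
    est_local = true_local + LOCAL_BIAS + random.gauss(0.0, NOISE_SD)
    # Internally valid but foreign: clean RCT estimate of another context.
    est_foreign = true_foreign + random.gauss(0.0, NOISE_SD)
    # Both are judged by how well they predict OUR context's true impact.
    sq_err_local += (est_local - true_local) ** 2
    sq_err_foreign += (est_foreign - true_local) ** 2

rmse_local = (sq_err_local / N) ** 0.5
rmse_foreign = (sq_err_foreign / N) ** 0.5
print(f"RMSE of biased local estimate:     {rmse_local:.2f}")
print(f"RMSE of rigorous foreign estimate: {rmse_foreign:.2f}")
```

With these stylized numbers the local estimate's error is dominated by its fixed bias (about 0.1), while the foreign estimate's error is dominated by cross-context variation in the true impact (on the order of 0.7), so the "non-rigorous" local evidence wins by a wide margin. The point is not these particular values but that the ranking flips whenever contextual variation dwarfs identification bias.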

Sadly perhaps, there is no shortcut around using judgment and wisdom in assessing all of the available evidence in formulating policies and programs.  Slogans like “rigorous evidence” are an abuse, not a use, of social science.