Tuesday, October 05, 2010

Do we need a Required Level of Failure (RLF)?

(PS: This post was previously titled "Do we need a Minimum Level of Failure (MLF)?", which could be misinterpreted as suggesting that we need to minimise the level of failure, which I definitely did not mean to suggest.)

This week I am attending the 2010 European Evaluation Society conference in Prague. Today I have been involved in a number of interesting discussions, including how to commission better evaluations and the potential and perils of randomised control trials (RCTs). This has prompted me to resurrect an idea I have previously raised partly in jest, but which I now think deserves more serious consideration.

Background first: RCTs have been promoted as an important means of improving the effectiveness of development aid projects. But there are also concerns that RCTs will become a dominating orthodoxy, driving out the use of other approaches to impact assessment, and in the worst case, discouraging investment in development projects which are not evaluable through the use of RCTs.

In my PhD thesis many years ago I looked at organisational learning through the lens of an evolutionary epistemology. That school of thought sees evolution (through the re-iteration of variation, selection and retention) as a kind of learning process, and human learning as a sub-set of that process. As I explain below, that view of the process of learning has some relevance to the current debate on how to improve aid effectiveness. It is also worth acknowledging the results of that process: evolution has been very effective in developing some extremely complex and sophisticated lifeforms, against which intentionally designed aid projects pale in comparison.

The point to be made: A common misconception is that evolution is about the “survival of the fittest”. In fact this phrase, coined by Herbert Spencer, is significantly misleading. Biological evolution is NOT about the survival of the fittest, but the non-survival of the least fit. This process leaves room for some diversity amongst those that survive, and it is this diversity that enables further evolution. The lesson here is that the process of evolution is not about picking winners according to some global standard of fitness, but about culling failures based on their lack of fitness to local circumstances.

This leads me to my own “modest proposal” for another route to improved aid effectiveness, which is an alternative to the widespread use of RCTs and the replication of the kinds of projects found to be effective via that means. This would be to build a widening consensus about the need for a defined “Minimum Level of Failure” (MLF) within the portfolio of activities funded or implemented by aid agencies. An MLF could be something like 10% of projects by value. Participating agencies would commit to publicly declaring this proportion of their projects as failed. Each of these agencies would also need to show: (a) how in their particular operating context they have defined these as failures, and (b) what steps they will take to avoid the replication of these failures in the future. There would be no need for a global consensus on evaluation methods, or a hegemony of methods arising through less democratic processes. PS: Using the current M&E terminology, the consensus would need to be on the desired outcomes, not on the activities needed to achieve them.

I can of course anticipate, if not already hear, some protests about how unrealistic this proposal is. Let us hear these protests, especially in public. Any agency that did so would probably be implying, if not explicitly arguing, that such a failure rate would be unacceptable, because public monies and poor people’s lives are at stake. However, making such a de facto claim of a 90%+ rate of success would be a seriously high-risk activity, because it would be very vulnerable to disproof, probably through journalistic inquiry alone. For anyone involved with development aid programmes, a brief moment’s reflection would suggest that the reality of aid effectiveness is very different, that a 10% failure rate is probably far too optimistic, and that in real life failures are much more common.

Perhaps the protesting agencies might be better advised to consider the upside of achieving a minimum level of failure. If taken seriously, establishing a norm of a minimum level of failure could help get the public at large, along with journalists and politicians, past the shock-horror of failure itself and into the more interesting territory of why some projects fail. It could also help raise the level of risk tolerance, and enable the exploration of more innovative approaches to the uses of aid. Both of these developments would be in addition to a progressive improvement in the average performance of development projects resulting from a periodic culling of the worst performers.

It is possible that advocates of specific methods like RCTs (as the route to improved aid effectiveness) might also have some criticisms of the MLF proposal. They could argue that these methods will generate plenty of evidence of what does not work, and perhaps that evidence should be privileged. But the problem with this method-led solution is that there is already a body of evidence from a number of fields of scientific research that negative findings are widely under-reported. People like to publish positive findings. This may not be a big risk while RCTs are funded by one or two major actors, but it will become a systemic risk as the number of actors involved increases.  There needs to be an explicit and public focus on failure.

Actual data on failure rates

PS: 15th October 2010: Four days ago I posted below some information on the success and failure rates of  DFID projects. I have re-stated and re-edited that information here with additional comments:

There is some interesting data on failure within the DFID system, most notably the most recent review of Project Completion Reports (PCRs), undertaken in 2005. See the “An Analysis of Projects and programmes in Prism 2000-2005” report available on the DFID website. The percentage (68%) of projects “defined as ‘completely’ or ‘largely’ achieving their Goals (Rated 1 or 2)” was given at the beginning of the Executive Summary, but information about failures was less prominent. Under section “8. Lessons from Project Failures” on page 61 it is stated “There are only 23 projects [out of 453] within the sample that are rated as failing to meet their objectives (i.e. 4 or 5) and which have significant lessons” (italics added). This is equivalent to about 5% of the sampled projects.

More important are the 20% or so rated 3 = Likely to be partly achieved (see page 64). It could be argued that those with a rating of 3 should also be included as failures, since their objectives are only likely to be partly achieved, versus largely achieved in the case of rating 2. In other words a successful project should be defined as one likely to achieve more than 50% of its Output and Purpose objectives. Others are failures. This interpretation seems to be supported by a comment sent to me (whose author will remain anonymous): "My understanding is that projects with scores of less than 2 are under real pressure and maybe quickly closed down unless they improve rapidly. I have certainly "felt the pressure" from projects to score them 2 rather than 3. That said I have not buckled to the pressure!"
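
As a rough check on the arithmetic above, the following Python sketch compares the failure rate under the two definitions discussed: the narrow one (ratings 4 or 5 only) and the broader one (ratings 3, 4 or 5). The inputs are the figures quoted from the 2005 review. The 20% share for rating 3 is only approximate ("20% or so"), and the 23 projects are those described as having "significant lessons", so the results are indicative at best. The function and variable names are my own, used for illustration only.

```python
# Figures as quoted in the post from the 2005 DFID review.
SAMPLE_SIZE = 453       # projects in the sample
RATED_4_OR_5 = 23       # rated as failing (4 or 5) "and which have significant lessons"
SHARE_RATED_3 = 0.20    # "20% or so" rated 3 = likely to be partly achieved (approximate)


def narrow_failure_rate(failed=RATED_4_OR_5, total=SAMPLE_SIZE):
    """Failure rate if only ratings 4 and 5 count as failures."""
    return failed / total


def broad_failure_rate(failed=RATED_4_OR_5, total=SAMPLE_SIZE,
                       share_rated_3=SHARE_RATED_3):
    """Failure rate if rating 3 is also counted as failure (an assumption)."""
    return narrow_failure_rate(failed, total) + share_rated_3


print(f"Narrow definition (ratings 4-5): {narrow_failure_rate():.1%}")
print(f"Broad definition (ratings 3-5):  {broad_failure_rate():.1%}")
```

On these figures, moving rating 3 into the failure category takes the failure rate from about 5% to about 25%, which is well above the 10% level suggested earlier in this post.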

I think the fact that DFID at least has a performance scoring system (for all its faults), that it has done this analysis of the project scores, and that it has made the results public, probably puts it well ahead of many other aid agencies. I would like to hear about any other agencies who have done anything like this, along with comments on the strengths and weaknesses of what they have done. I would also like to see DFID repeat the 2005 exercise at the end of this year, this time with more discussion on the projects rated 3 = Likely to be partly achieved, and what subsequently happened to these projects.

PS 2nd November 2010: See Lawrence Haddad's reference to the same set of DFID statistics, recently quoted/misused on the One Show.

PS 3rd November 2010: Thanks to Yu-Lan van Alphen, Programme Manager, Stichting DOEN, Amsterdam, for this book reference: Kathryn Schulz, "Being Wrong", reviewed in the NYT. It sounds like a good read.

PS 14th February 2011: Computer programs are intolerant of programming errors, so computer programmers have tried to avoid them at all costs, not always successfully. Doing so becomes a much bigger challenge as software grows in size and complexity. Now some programmers are trying a different approach, one that involves recognising that there will always be programming errors. For more, see "'Let It Crash' Programming" by Craig Stuntz at http://blogs.teamb.com/craigstuntz/2008/05/19/37819/

PS 15th February 2011: "Why negative studies are good for health journalism, and where to find them":
"This is a guest column by Ivan Oransky, MD, who is executive editor of Reuters Health and blogs at Embargo Watch and Retraction Watch. One of the things that makes evaluating medical evidence difficult is knowing whether what's being published actually reflects reality. Are the studies we read a good representation of scientific truth, or are they full of cherry-picked data that help sell drugs or skew policy decisions?..."

PS: 21 February 2011: See also the Admitting Failure website

PS: 23 April 2011. See today's Bad Science column in the Guardian by Ben Goldacre, titled "I foresee that nobody will do anything about this problem", on the difficulty of getting negative findings published

PS: 23 May 2011: The above analysis of DFID project ratings focuses on the recognition of failures that have already occurred. It is also possible, and important, to take steps to ensure that failures can be recognised in the first place. A project that has no clear theory of change (ToC) will be difficult to evaluate and thus difficult to classify as a success or failure. The most common means of describing a development project’s theory of change is probably via a LogFrame representation. Within a reasonably well constructed LogFrame representation there is a sequence of “if…and…then…” statements, spelling out what is expected to happen as the project is implemented and takes effect. While there may be positive developments in a project’s Goal level indicators, there also needs to be associated evidence that the chain of causation leading to that Goal has taken place as expected. It is not uncommon, in my experience, to find that while the expected outcomes have occurred, the outputs that were meant to contribute to those outcomes were not successfully delivered. In this situation the project cannot claim to be successful. There is however a more generic point to be made here. The more detailed a project’s ToC is, the more vulnerable it will be to disproof. Any one of the many expected causal links could be found to have not worked as expected. However, if these linkages have not been disproved, then the project’s claims to have contributed to any expected and observed changes will be all the stronger. Willingness to allow failure to be identified strengthens the claim of any success that is observed. This seems an important observation in the case of projects where there is no possibility of making comparisons with a control group where there was no intervention. In those circumstances a ToC should be as detailed and articulate as possible.

PS: 23 May 2011: Articulating more disprovable theories of change may sound like a good idea, but it could be argued that this requirement risks locking aid agencies into a static view of the world they are working in, and one which is developed quite early in their intervention. In many settings, for example in humanitarian emergencies and highly politicised environments, aid agencies often have to revisit, revise and adapt their views of what is happening and how they should best respond. The best that might be expected in these circumstances is that those agencies are able to construct a detailed (and disprovable) history of what happened. This could actually produce better (i.e. more disprovable) results. There is some research evidence which shows that people find it easier to imagine events in some detail when they are situated in the past than to imagine the same kind of events taking place in the future[i].

[i] Bavelas, J.B. (1973) Effects of the temporal context of information, Psychological Reports, 35, 695-698, cited in Dance with Chance, by Makridakis, Hogarth and Gaba, 2009, page 189.


  1. Interesting. I think one of the problems would arise in getting any agency to admit to outright project failure (I'm ex-DFID). The ones that work less well always get 'marked up' to avoid loss of face. And where the public purse is involved government agencies are very nervous about PQs and adverse publicity - not least when there is pressure on overall spend, as now. Further to this there are probably very few projects which are a total failure. Most projects have elements or components which were successful, and the danger might be that these elements would be lost, sunk with the ship.

  2. I would wholeheartedly welcome this as a shift in the cognitive frameworks with which we talk about international development. The business sector seems to have a healthier relationship with risk in their for-profit endeavours. Yet in the development sector, I've been observing an increasing desperation to “know” what can be inherently beyond logic and induction.

    The RCT "gold standard" is especially troubling when one is talking about grassroots-up initiatives. Imposing expectations to "try to evaluate every single intervention" on people who are in the process of organizing at the local level is most certainly a drain on their time and scarce resources. And what so many people on the ground have told me again and again is that abstract metrics or research constructs don’t help them understand their relationship to improving the well-being of the people they serve. As members of the community, they read trends through what’s happening on the ground, rather than using any theory. It's time for us to recognize that one can monitor not only through data, but also through dialogue.

    Let's always consider what is the appropriate cost and complexity needed for evaluation (especially given the size and scope of the program) and aim for proportional expectations so we ensure it remains a tool for learning, not risk-reduction.

  3. Responding to Pete:

    There is some interesting data on failure within the DFID system, most notably the most recent review of PCR Synthesis Reports in 2005. See “An Analysis of Projects and programmes in Prism 2000-2005” by Steve Martin and Kerstin Hinds, Corporate Strategy Group, DFID. Some of your skepticism is well founded. It was easy to identify the percentage (68%) of projects “defined as ‘completely’ or ‘largely’ achieving their Goals (Rated 1 or 2)”, but the same degree of specificity about failures was more difficult to come by. Under section “8. Lessons from Project Failures” on page 61 it is stated “There are only 23 projects [out of 453] within the sample that are rated as failing to meet their objectives (i.e. 4 or 5) and which have significant lessons” (italics added). My eyeballing of the bar charts suggests that the percentage with ratings of 4 or 5 was closer to 5%, but perhaps these include those that did not “have significant lessons”.

    More important are the 20% or so rated 3 = Likely to be partly achieved (see page 64). It could be argued that those with a rating of 3 should also be included as failures, since their objectives are only likely to be partly achieved, versus largely achieved in the case of rating 2. In other words a successful project should be defined as one likely to achieve more than 50% of its Output and Purpose objectives. Others are failures.

    Re your concern that “Most projects have elements or components which were successful, and the danger might be that these elements would be lost, sunk with the ship”. This strikes me as a luxury concern. Many of the ostensible lessons from what works in the projects largely achieving their objectives are also being lost sight of. I have recent and direct experience of this.

  4. Mike Powell writes:
    date Mon, Oct 11, 2010 at 6:05 PM
    subject Re: [MandENEWS] Do we need a Minimum Level of Failure (MLF)?

    An interesting blog Rick.

    It takes a different route to a similar conclusion than Pieter Van Lieshout, lead author of the recent Dutch Research Council report on Dutch development aid. He argues that NGOs should be funded on the basis of a 15% failure rate to counter what he perceives as a lack of innovation in the sector (and also, I think, a lack of critical reflection).

    For whatever reason some failure may be acceptable, the current refusal to admit or learn from its existence is very damaging. I would argue that RCTs - and other planning, control and evaluative activities which aim to 'prove' success - are even more irrational, desperately seeking to make realities fit bureaucratic and political comfort zones, rather than accepting their complex nature and developing rational and appropriate, if never complete, methodologies to work with them.

    Best wishes

    Mike Powell

    IKM Emergent

  5. I really like this proposal! If we have to have targets let them be for failures (with the caveat that donors tend to construct both failure and success as a moveable policy feast – see Mosse’s Cultivating Development). However, if we understand ‘failure’ as ‘no change’ then the MSC methodology could easily be adapted to determine that.

  6. Thanks Rick and others who have commented so far. As co-author of the project synthesis report Rick mentioned above, and an adviser in DFID's evaluation department, I have a couple of thoughts.

    First, DFID published a more recent independent review of project completion reports (covering 2005-2008) which may be of interest. This showed the quality of DFID's projects had improved despite taking on riskier projects. It also showed that there was room for improvement (20% of scores were poorly justified), but that there was no evidence of systematic 'marking up' of projects. This review did not focus on looking at the lessons of projects that were found to have been unsuccessful, but I think it would of course make sense for DFID to do this in future studies.

    Second, I personally think the ideas of a) a minimal level of failure and b) revisiting what counts as project failure are useful ones. We're currently doing some work on improving the way we score projects - and will be able to look into these issues as part of that.

    On a related note, people may be interested in the growing amount of project level information available on DFID's website through the project information database and other commitments to improve aid transparency which will (I believe within the next 12 months) permit others to explore DFID's projects and lessons without having to wait for DFID to undertake such studies. Project reviews are not yet available but will be in 2011.

    Kerstin Hinds, DFID

  7. "If you don't make mistakes, you're not working on hard enough problems. And that's a big mistake." - Frank Wilczek, 2004 Nobel Prize winner in physics