Thursday, February 16, 2012

Evaluation questions: Managing agency, bias and scale

It is common to see a list of evaluation questions in the Terms of Reference (ToRs) of an evaluation, or at least a requirement that the evaluator develop such a list as part of the evaluation plan. Such questions are typically fairly open-ended “how” and “whether” type questions. On the surface this approach makes sense. It gives some focus but leaves room for the unexpected and unknown.

But perhaps there is an argument for a much more focused and pre-specified approach. 


There are two grounds on which such an argument could be made. One is that aid organisations implementing development programs have “agency”, i.e. they are expected to be able to assess the situation they are in and act on the basis of informed judgements. They are not just mechanical instruments for implementing a program, like a computer. Given this fact, one could argue that evaluations should not simply focus on the behaviour of an organisation and its consequences, but on the organisation’s knowledge of its behaviour and its consequences. If that knowledge is misinformed then the sustainability of any achievements may be seriously in doubt. Likewise, it may be less likely that unintended negative consequences of a program will be identified and responded to appropriately.

One way to assess an organisation’s knowledge is to solicit their judgements about program outcomes in a form that can be tested by independent observation. For example, an organisation’s view on the percentage of households who have been lifted above the poverty line as a result of a livelihood intervention. An external evaluation could then gather independent data to test this judgement, or more realistically, audit the quality of the data and analysis that the organisation used to come to their judgement. In this latter case the role of the external evaluator is to undertake a meta-evaluation, evaluating an organisation’s capacity by examining their judgements relating to key areas of expected program performance. This would require focused evaluation questions rather than open-ended evaluation questions.


The second argument arises from a body of research and argument about the prevalence of what appears to be endemic bias in many fields of research: the under-reporting of negative findings (i.e. non-relationships) and the related tendency of positive findings to disappear over time. The evidence here makes salutary reading, especially the evidence from the field of medical research, where research protocols are perhaps the most demanding of all (for good reason, given the stakes involved). Lehrer’s 2010 article in the New Yorker, “The Truth Wears Off: Is there something wrong with the scientific method?”, is a good introduction, and Ioannidis’ work (cited by Lehrer) provides the more in-depth analysis and evidence.

One solution that has been proposed to the problem of under-reporting of negative findings is the establishment of trial registries, whereby plans for experiments would be lodged in advance, before their results are known. This is now established practice in some fields of research and has recently been proposed for the use of randomised control trials by development agencies[1]. Trial registries can provide two lines of defence against bias. The first is to make visible all trials, regardless of whether they are deemed “successful” and get published, or not. The other defence is against inappropriate “data mining”[2] within individual trials. The risk is that researchers can examine so many possible correlations between independent and dependent variables that some positive correlations will appear by chance alone. This risk is greater where a study looks at more than one outcome measure and at several different sub-groups. Multiple outcome measures are likely to be used when examining the impact on complex phenomena such as poverty levels or governance, for example. When there are many relationships being examined there is also the well-known risk of publication bias, of the evaluator only reporting the significant results.
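
To give a rough sense of how quickly this risk grows, here is a minimal sketch of the arithmetic. It assumes that the tests are fully independent, that no real relationship exists in any of them, and that a conventional 0.05 significance level is used. The function name and the numbers of tests are illustrative choices of mine, not taken from any of the studies cited.

```python
def chance_of_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `num_tests` independent tests comes out
    'significant' at level `alpha` when there is no real relationship at all."""
    return 1.0 - (1.0 - alpha) ** num_tests


# Illustrative numbers of relationships examined in a single study.
for num_tests in (1, 5, 10, 20):
    print(num_tests, round(chance_of_false_positive(num_tests), 2))
```

Under these assumptions, a study that examines 20 relationships has roughly a 64% chance of finding at least one “significant” result by chance alone.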

These risks can be managed partly by the researchers themselves. Rasmussen et al suggest that if the outcomes are assumed to be fully independent, the significance threshold (for example, 0.05) should be divided by the number of tests. Other approaches involve constructing mean standardised outcomes across a family of outcome measures. However, these do not deal with the problem of selective reporting of results. Rasmussen et al argue that this risk would be best dealt with through the use of trial registries, where relationships to be examined are recorded in advance. In other words, researchers would spell out the hypothesis or claim to be tested, rather than simply state an open-ended question. Open-ended questions invite cherry picking of results according to the researcher’s interests, especially when there are a lot of them.
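
The first adjustment described above, dividing the significance threshold by the number of tests, is commonly known as a Bonferroni correction. The sketch below shows how it might be applied. The p-values are hypothetical, invented purely for illustration, and the function name is my own.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which p-values remain significant once the threshold `alpha`
    is divided by the number of tests (a Bonferroni correction)."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Hypothetical p-values from five outcome measures in one study.
p_values = [0.004, 0.03, 0.20, 0.012, 0.049]
flags = bonferroni_significant(p_values)
print(flags)
```

With five tests the adjusted threshold is 0.01, so only the first result would still count as significant, whereas four of the five would pass an unadjusted 0.05 threshold. Note that the adjustment says nothing about which results end up being reported, which is why the registry argument still matters.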

As I have noted elsewhere, there are risks with this approach. One concern is that it might prevent evaluators from looking at the data and identifying new hypotheses that genuinely emerge as being of interest and worth testing. However, registering hypotheses to be tested would not preclude this possibility. It should, however, make it evident when this is happening, and therefore encourage the evaluator to provide an explicit rationale for why additional hypotheses are being tested.

Same again, on a larger scale

The problems of biased reporting re-appear when individual studies are aggregated. Ben Goldacre explains:  

“But individual experiments are not the end of the story. There is a second, crucial process in science, which is synthesising that evidence together to create a coherent picture.
In the very recent past, this was done badly. In the 1980s, researchers such as Cynthia Mulrow produced damning research showing that review articles in academic journals and textbooks, which everyone had trusted, actually presented a distorted and unrepresentative view, when compared with a systematic search of the academic literature. After struggling to exclude bias from every individual study, doctors and academics would then synthesise that evidence together with frightening arbitrariness.

The science of "systematic reviews" that grew from this research is exactly that: a science. It's a series of reproducible methods for searching information, to ensure that your evidence synthesis is as free from bias as your individual experiments. You describe not just what you found, but how you looked, which research databases you used, what search terms you typed, and so on. This apparently obvious manoeuvre has revolutionised the science of medicine.”

Reviews face the same risks as individual experiments and evaluations. They may be selectively published, and their individual methodologies may not adequately deal with the problem of selective reporting of the more interesting results – sometimes described as cherry picking.  The development of review protocols and the registering of those prior to a review are an important means of reducing biased reporting, as they are with individual experiments. Systematic reviews are already a well established practice in the health sphere under the Cochrane Collaboration and in social policy under the Campbell Collaboration. Recently a new health sector journal, Systematic Reviews, has been established with the aim of ensuring that the results of all well-conducted systematic reviews are published, regardless of their outcome. The journal also aims to promote discussion of review methodologies, with the current issue including a paper on “Evidence summaries”, a rapid review approach.

It is commonplace for large aid organisations to request synthesis studies of achievements across a range of programs, defined by geography (e.g. a country program) or subject matter (e.g. livelihood interventions). A synthesis study requires some meta-evaluation of what evidence is of sufficient quality and what is not. These judgements inform both the sampling of sources and the weighing of evidence found within the selected sources. Despite the prevalence of synthesis studies, I am not aware of much literature on appropriate methodologies for such reviews, at least within the sphere of development evaluation. [I would welcome corrections to this view]

However, there are signs that experiences elsewhere with systematic reviews are being attended to. In the development field, the International Development Coordinating Group has been established, under the auspices of the Campbell Collaboration, with the aim of encouraging registration of review plans and protocols and then disseminating “systematic reviews of high policy-relevance with a dedicated focus on social and economic development interventions in low and middle income countries”. DFID and AusAID have funded 3ie to commission a body of systematic reviews of what it identifies as rigorous impact evaluations, in a range of development fields. More recently an ODI Discussion Paper has reviewed some experiences with the implementation of systematic reviews. Associated with the publication of this paper was a useful online discussion.

Three problems that were identified are of interest here. One is the difficulty of accessing source materials, especially evaluation reports, many of which are not in the public domain but should be. This problem is faced by all review methods, systematic and otherwise. It is now being addressed on multiple fronts, by individual organisations’ initiatives (e.g. 3ie and IDS evaluation databases) and by collective efforts such as the International Aid Transparency Initiative. The authors of the ODI paper note that “there are no guarantees that systematic reviews, or rather the individuals conducting them, will successfully identify every relevant study, meaning that subsequent conclusions may only partially reflect the true evidence base.” While this is true (for any type of review process), it is the transparency of the sample selection, via protocols, and the visibility of the review itself, via registries, which help make this problem manageable.

The second problem, as seen by the authors, is that “Systematic reviews tend to privilege one kind of method over another, with full-blown randomised controlled trials (RCTs) often representing the ‘gold standard’ of methodology and in-depth qualitative evidence not really given the credit it deserves.” This does not have to be the case. A systematic review has been usefully defined as “an overview of primary studies which contains an explicit statement of objectives, materials, and methods and has been conducted according to explicit and reproducible methodology”. Replicability is key, and this requires a systematic and transparent process for sampling and analysis. This should be evident in protocols.

A third problem was identified by 3ie, in their commentary on the Discussion Paper. This relates directly to the initial focus of this blog, the argument for more focused evaluation questions. They comment that:

“Even with plenty of data available, making systematic reviews work for international development requires applying the methodology to clearly defined research questions on issues where a review seems sensible. This is one of the key lessons to emerge from recent applications of the methodology. A review in medicine will often ask a narrow question such as the Cochrane Collaboration’s recent review on the inefficacy of oseltamivir (tamiflu) for preventing and treating influenza. Many of the review questions development researchers have attempted to answer in recent systematic reviews seem too broad, which inevitably leads to challenges. There is a trade-off between depth and breadth, but if our goal is to build a sustainable community of practice around credible, high quality reviews we should be favouring depth of analysis where a trade-off needs to be made.”

[1] By the head of DFID EvD in 2011 and by Rasmussen et al, see below.
[2] See Ole Dahl Rasmussen, Nikolaj Malchow-Møller, Thomas Barnebeck Andersen, Walking the talk: the need for a trial registry for development interventions,  available via