It is common to see in the Terms of Reference (ToRs) of an evaluation a list of evaluation questions, or at least a requirement that the evaluator develop such a list of questions as part of the evaluation plan. Such questions are typically fairly open-ended “how” and “whether” type questions. On the surface this approach makes sense: it gives some focus but leaves room for the unexpected and unknown.
But perhaps there is an argument for a much
more focused and pre-specified approach.
Agency
There are two grounds on which such an
argument could be made. One is that aid organisations implementing development programs
have “agency”,
i.e. they are expected to be able to assess the situation they are in and act on the basis of informed judgements. They are not just
mechanical instruments for implementing a program, like a computer. Given this
fact, one could argue that evaluations should focus not simply on the behaviour of an organisation and its consequences, but also on the organisation’s knowledge of that behaviour and its consequences.
If that knowledge is misinformed then the sustainability of any achievements
may be seriously in doubt. Likewise, it may be less likely that unintended
negative consequences of a program will be identified and responded to appropriately.
One way to assess an organisation’s
knowledge is to solicit their judgements about program outcomes in a form that
can be tested by independent observation. For example, an organisation’s view
on the percentage of households who have been lifted above the poverty line as
a result of a livelihood intervention. An external evaluation could then gather
independent data to test this judgement, or more realistically, audit the
quality of the data and analysis that the organisation used to come to their
judgement. In this latter case the role of the external evaluator is to undertake a meta-evaluation, evaluating an organisation’s capacity by examining their judgements relating to key areas of expected program performance. This would require focused evaluation questions rather than open-ended ones.
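To make this concrete, here is a minimal sketch in Python, using entirely hypothetical numbers, of how such a focused judgement could be stated in a testable form and checked against independent observation. In practice the evaluator might instead audit the organisation’s own data and analysis, as suggested above, but the logic of stating the claim precisely enough to be testable is the same.

```python
# A minimal sketch (hypothetical numbers throughout) of checking an
# organisation's stated judgement against independent data: here, a claim
# that 30% of households were lifted above the poverty line, compared with
# an independent follow-up survey of 250 households.
import math

claimed_proportion = 0.30        # the organisation's stated judgement
surveyed_households = 250        # independent sample drawn by the evaluator
households_above_line = 58       # of these, how many crossed the poverty line

observed = households_above_line / surveyed_households
std_error = math.sqrt(observed * (1 - observed) / surveyed_households)
ci_low, ci_high = observed - 1.96 * std_error, observed + 1.96 * std_error

print(f"Observed proportion: {observed:.1%} "
      f"(95% CI {ci_low:.1%} to {ci_high:.1%})")
if ci_low <= claimed_proportion <= ci_high:
    print("The claim is consistent with the independent survey data.")
else:
    print("The claim is not supported by the independent survey data.")
```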
Bias
The second argument arises from a body of research and argument about the prevalence of what appears to be endemic bias in many fields of research: the under-reporting of negative findings (i.e. non-relationships) and the related tendency of positive findings to disappear over time. The evidence here makes salutary reading, especially the evidence from the field of medical research, where research protocols are perhaps the most demanding of all (for good reason, given the stakes involved). Lehrer’s 2010 article in the New Yorker, “The Truth Wears Off: Is there something wrong with the scientific method?”, is a good introduction, and Ioannidis’ work (cited by Lehrer) provides more in-depth analysis and evidence.
One solution that has been proposed to the
problem of under-reporting of negative findings is the establishment of trial
registries, whereby plans for experiments would be lodged in advance, before
their results are known. This is now established practice in some fields of research and has recently been proposed for randomised controlled trials conducted by development agencies.[1]
Trial registries can provide two lines of defence against bias. The first is to
make visible all trials, regardless of whether they are deemed “successful” and
get published, or not. The other defence is against inappropriate “data mining”[2]
within individual trials. The risk is that researchers can examine so many
possible correlations between independent and dependent variables that some
positive correlations will appear by chance alone. This risk is greater where a
study looks at more than one outcome measure and at several different sub-groups.
Multiple outcome measures are likely to be used when examining the impact on complex phenomena such as poverty levels or governance, for example. When many relationships are being examined there is also the well-known risk of publication bias, with the evaluator reporting only the significant results.
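The scale of this risk is easy to demonstrate. The sketch below uses simulated data only, not results from any real trial: it repeatedly runs a “trial” in which there is, by construction, no true effect at all, tests twenty outcome measures each time, and counts how often at least one of them nonetheless comes out “significant” at the conventional 5% level.

```python
# A minimal sketch (simulated data, no real trial) of how testing many
# outcomes inflates the chance of spurious "positive" findings.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_households = 400        # per arm (treatment / control), hypothetical
n_outcomes = 20           # e.g. income, assets, school attendance, ...
alpha = 0.05
n_simulations = 1000

trials_with_false_positive = 0
for _ in range(n_simulations):
    false_positive = False
    for _ in range(n_outcomes):
        # No true effect: both arms are drawn from the same distribution.
        treatment = rng.normal(0, 1, n_households)
        control = rng.normal(0, 1, n_households)
        if stats.ttest_ind(treatment, control).pvalue < alpha:
            false_positive = True
    trials_with_false_positive += false_positive

print("Share of null trials reporting at least one 'significant' outcome: "
      f"{trials_with_false_positive / n_simulations:.0%}")
# With 20 independent outcomes, roughly 1 - 0.95**20, i.e. about 64%, of
# such trials will show at least one spurious result at the 5% level.
```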
These risks can be managed partly by the
researchers themselves. Rasmussen
et al suggest that, if the outcomes are assumed to be fully independent, the significance level should be divided by the number of tests (a Bonferroni correction). Other
approaches involve constructing mean standardised outcomes across a family of
outcome measures. However, these do not deal with the problem of selective
reporting of results. Rasmussen et al argue that this risk would be best dealt
with through the use of trial registries, where
relationships to be examined are recorded in advance. In other words, researchers
would spell out the hypothesis or claim to be tested, rather than simply state
an open-ended question. Open-ended questions invite cherry picking of results according to the researcher’s interests, especially when there are a lot of them.
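For illustration, here is a minimal sketch, again with hypothetical numbers, of the two adjustments referred to above: dividing the significance level by the number of tests, and constructing a mean standardised outcome across a family of measures. It is a simplified illustration only; the exact procedures proposed by Rasmussen et al may differ in detail.

```python
# A minimal sketch (hypothetical p-values and simulated outcomes) of two
# ways of handling multiple outcome measures within a single study.
import numpy as np

# 1. Bonferroni-style correction: divide the significance level by the
#    number of tests in the family.
alpha = 0.05
p_values = [0.012, 0.048, 0.230, 0.004, 0.310]   # hypothetical, one per outcome
adjusted_alpha = alpha / len(p_values)            # 0.05 / 5 = 0.01

for i, p in enumerate(p_values, start=1):
    verdict = "significant" if p < adjusted_alpha else "not significant"
    print(f"Outcome {i}: p = {p:.3f} -> {verdict} at adjusted alpha {adjusted_alpha:.3f}")

# 2. Mean standardised outcome: standardise each outcome measure using the
#    control group's mean and standard deviation, then average across the
#    family of measures to give one summary index per household.
rng = np.random.default_rng(0)
control = rng.normal(loc=[100, 2.0, 0.6], scale=[20, 0.5, 0.2], size=(200, 3))
treatment = rng.normal(loc=[105, 2.1, 0.65], scale=[20, 0.5, 0.2], size=(200, 3))

z_treatment = (treatment - control.mean(axis=0)) / control.std(axis=0)
summary_index = z_treatment.mean(axis=1)

print(f"Mean summary index in the treatment group: {summary_index.mean():.2f} "
      "(in control-group standard deviations)")
```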
As I have noted elsewhere,
there are risks with this more pre-specified approach. One concern is that it might prevent evaluators from looking at the data and identifying new hypotheses that genuinely emerge as being of interest and worth testing. However, registering hypotheses to be tested
would not preclude this possibility. It should, however, make it evident when
this is happening, and therefore encourage the evaluator to provide an explicit
rationale for why additional hypotheses are being tested.
Same again, on a larger scale
The problems of biased reporting re-appear
when individual studies are aggregated. Ben Goldacre explains:
“But individual experiments are not the end
of the story. There is a second, crucial process in science, which is
synthesising that evidence together to create a coherent picture.
In the very recent past, this was done badly. In the
1980s, researchers such as Cynthia Mulrow produced
damning research showing that review articles in academic journals and
textbooks, which everyone had trusted, actually presented a distorted and
unrepresentative view, when compared with a systematic search of the academic
literature. After struggling to exclude bias from every individual study,
doctors and academics would then synthesise that evidence together with
frightening arbitrariness.
The science of "systematic reviews" that
grew from this research is exactly that: a science. It's a series of
reproducible methods for searching information, to ensure that your evidence
synthesis is as free from bias as your individual experiments. You describe not
just what you found, but how you looked, which research databases you used,
what search terms you typed, and so on. This apparently obvious manoeuvre has
revolutionised the science of medicine.”
Reviews face the same risks as individual
experiments and evaluations. They may be selectively published, and their
individual methodologies may not adequately deal with the problem of selective
reporting of the more interesting results – sometimes described as cherry
picking. The development of review protocols, and their registration prior to a review, are an important means of reducing biased reporting, as they are with individual experiments. Systematic reviews are already a well-established practice in the health sphere under the Cochrane Collaboration and in social policy under the Campbell Collaboration. Recently
a new health sector journal, Systematic
Reviews, has been established with the aim of ensuring that the results of
all well-conducted systematic reviews are published, regardless of their
outcome. The journal also aims to promote discussion of review methodologies,
with the current issue including a paper on “Evidence summaries”, a rapid review approach.
It is commonplace for large aid organisations
to request synthesis studies of achievements across a range of programs,
defined by geography (e.g. a country program) or subject matter (e.g.
livelihood interventions). A synthesis study requires some meta-evaluation of which evidence is of sufficient quality and which is not. These judgements inform
both the sampling of sources and the weighing of evidence found within the
selected sources. Despite the prevalence of synthesis studies, I am not aware of much existing literature on appropriate methodologies for such reviews, at least within the sphere of development evaluation. [I would welcome corrections to this view.]
However, there are signs that experiences
elsewhere with systematic reviews are being attended to. In the development field, the International Development Coordinating Group has been established, under
the auspices of the Campbell Collaboration, with the aim of encouraging
registration of review plans and protocols and then disseminating “systematic reviews of high policy-relevance
with a dedicated focus on social and economic development interventions in low
and middle income countries”. DFID and AusAID have funded 3ie to commission a body
of systematic reviews of what it identifies as rigorous impact evaluations, in
a range of development fields. More recently, an ODI Discussion Paper
has reviewed some experiences with the implementation of systematic reviews. Associated
with the publication of this paper was a useful online
discussion.
Three problems that were identified are of
interest here. One is the difficulty of accessing source materials, especially evaluation reports, many of which are not in the public domain but should be. This problem is faced by all review methods, systematic and otherwise. It is now being addressed on multiple fronts, by individual organisation initiatives (e.g. the 3ie and IDS evaluation databases) and by collective efforts
such as the International Aid Transparency Initiative. The authors of the ODI paper note that “there are no guarantees
that systematic reviews, or rather the individuals conducting them, will
successfully identify every relevant study, meaning that subsequent conclusions
may only partially reflect the true evidence base.” While this is true of any type of review process, it is the transparency of the sample selection (via protocols) and the visibility of the review itself (via registries) that help make this problem manageable.
The second problem, as
seen by the authors, is that “Systematic reviews tend to privilege one kind of
method over another, with full-blown randomised controlled trials (RCTs) often
representing the ‘gold standard’ of methodology and in-depth qualitative
evidence not really given the credit it deserves.” This does not have to be the
case. A
systematic review has
been usefully defined as “an overview
of primary studies which contains an explicit statement of objectives,
materials, and methods and has been conducted according to explicit and
reproducible methodology”. Replicability is key, and this requires systematic and transparent processes for sampling and analysis. These should be evident in protocols.
A third problem was identified by 3ie, in their
commentary on the Discussion Paper. This relates directly to the initial
focus of this blog, the argument for more focused evaluation questions. They
comment that:
“Even with plenty of data available, making
systematic reviews work for international development requires applying the
methodology to clearly defined research questions on issues where a review
seems sensible. This is one of the key lessons to emerge from recent
applications of the methodology. A review in medicine will often ask a narrow
question such as the Cochrane Collaboration’s recent review on the inefficacy
of oseltamivir (tamiflu) for preventing and treating influenza. Many of the
review questions development researchers have attempted to answer in recent
systematic reviews seem too broad, which inevitably leads to challenges. There
is a trade-off between depth and breadth, but if our goal is to build a
sustainable community of practice around credible, high quality reviews we
should be favouring depth of analysis where a trade-off needs to be made.”
[1] By the head of DFID’s Evaluation Department (EvD) in 2011, and by Rasmussen et al (see below).
[2] See Ole Dahl Rasmussen, Nikolaj Malchow-Møller and Thomas Barnebeck Andersen, “Walking the talk: the need for a trial registry for development interventions”, available via http://mande.co.uk/2011/uncategorized/walking-the-talk-the-need-for-a-trial-registry-for-development-interventions/