Since the beginning of this year I have
been part of a DFID-funded exercise which has the aim of “Developing a broader
range of rigorous designs and methods for impact evaluations”. Part of the brief
has been to develop draft quality standards, to help identify “the difference
between appropriate, high quality use of the approach and inappropriate/poor
quality use”.
A quick search of what already exists suggests
that there is no shortage of quality standards. Those relevant to
development projects have been listed online
here. They include:
- Standards agreed by multiple organisations, e.g. OECD-DAC and various national evaluation societies. The former are of interest to aid organisations whereas the latter are of more interest to evaluators.
- Standards developed for use within individual organisations, e.g. DFID and EuropeAID
- Methodology specific standards, e.g. those relating to randomised and other kinds of experimental methods, and qualitative research
In addition, there is a much larger body of academic
literature on the use and misuse of various more specific methods.
A scan of the criteria I have listed shows that
several different types of evaluation criteria are in use, including:
- Process criteria, where the focus is on how evaluations are done, e.g. relevance, timeliness, accessibility, inclusiveness
- Normative criteria, where the focus is on principles of behaviour, e.g. independence, impartiality, ethicality
- Technical criteria, where the focus is on attributes of the methods used, e.g. reliability and validity
Somewhat surprisingly, technical criteria
like reliability and validity are in the minority, being two of at least 20
OECD-DAC criteria. The more encompassing topic of Evaluation Design is only one of
the 17 main topics in the DFID Quality Assurance template for revising draft
evaluations. There are three possible reasons why this is so: (a) process
attributes may be more important, in terms of their effects on what happens to
an evaluation during and after its production; (b) it is hard to identify
generic quality criteria for a diversity of evaluation methodologies; (c) lists have no
size limits, so criteria of all kinds accumulate: the DFID QA template, for example, has 85 subsidiary questions under its 17 main topics.
Given these circumstances, what is the best
way forward for addressing the need for quality standards for “a broader range
of rigorous designs and methods for impact evaluations”? The first step might
be to develop specific guidance that can be packaged in separate notes on
particular evaluation designs and methods. The primary problem may be a simple
lack of knowledge about the methods available; knowing how to choose between
them may in fact be “a problem we would like to have”, one which needs to be
addressed after people at least know something about the alternative methods.
The Asian Development Bank has addressed this issue through its “Knowledge Solutions”
series of publications.
The second step that could be taken would
be to develop more generic guidance that can be incorporated into the existing
quality standards. Our initial proposal focused on developing some additional
design-focused quality standards that could be used with some reliability
across different users. But perhaps this is a side issue. Finding out which
quality criteria really matter may be more important. However, there seems to
be very little evidence on which quality attributes matter. In 2008 Forss et al.
carried out a study, “Are
Sida Evaluations Good Enough? An Assessment of 34 Evaluation Reports”, which
gathered and analysed empirical data on 40 different quality attributes of evaluation
reports published between 2003 and 2005. Despite suggestions that it should, the study
was not required to examine the relationship between these attributes and the
subsequent use of the evaluations. Yet the insufficient use of evaluations has
been a long-standing concern to evaluators and to those funding evaluations.
There are at least four different hypotheses that would be worth testing in future
versions of the Sida study, ones that did look at both evaluation quality and usage:
- Quality is largely irrelevant; what matters is how the evaluation results are communicated.
- Quality matters, especially the use of a rigorous methodology, which is able to address attribution issues
- Quality matters, especially the use of participatory processes that engage stakeholders
- Quality matters, but it is a multi-dimensional issue. The more dimensions are addressed, the more likely that the evaluation results will be used.
The first is in effect the null hypothesis,
and one which needs to be taken seriously. The second seems to be
the position taken by 3ie and other advocates of RCTs and their next-best substitutes;
it could be described as the killer assumption made by RCT advocates that is yet to be tested. The
third could be the position of some of the supporters of the “Big
Push Back” against inappropriate demands for performance reporting. The
fourth is the view present in the OECD-DAC evaluation standards, which can be
read as a narrative theory of change about how a complex of evaluation quality
features will lead to evaluation use and strengthened accountability, and contribute
to learning and improved development outcomes. I have taken the liberty of identifying
the various possible causal connections in that theory of change in the network diagram below. As noted
above, one interesting feature is that the attributes of reliability and
validity are only one part of a much bigger picture.
[Click on the image to view a larger version of the diagram]
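To make the contrast between these hypotheses more concrete, here is a minimal sketch, assuming Python with pandas, of how a future Sida-style review might relate reviewer-coded quality attributes to the subsequent use of evaluations. Every attribute name and score below is invented for illustration; none of it comes from the Forss et al. data.

```python
# Purely illustrative sketch: all attribute names and scores are invented,
# not taken from the Forss et al. study or any real review.
import pandas as pd

# One row per evaluation report: quality attributes scored 0-1 by reviewers,
# plus a binary indicator of whether the evaluation was subsequently used.
reports = pd.DataFrame({
    "methodological_rigour":     [0.9, 0.4, 0.7, 0.2, 0.8, 0.5, 0.6, 0.3],
    "stakeholder_participation": [0.3, 0.8, 0.6, 0.9, 0.2, 0.7, 0.5, 0.4],
    "clarity_of_communication":  [0.7, 0.6, 0.9, 0.4, 0.5, 0.8, 0.3, 0.6],
    "used":                      [1,   1,   1,   0,   0,   1,   0,   0],
})

# Each hypothesis predicts a different pattern of association with "used":
# (1) only communication matters, (2) rigour dominates, (3) participation
# dominates, (4) use rises with the number of quality dimensions addressed.
print(reports.drop(columns="used").corrwith(reports["used"]))

# A crude check of hypothesis 4: does use increase with a simple count of
# dimensions scoring above 0.5?
reports["dimensions_addressed"] = (reports.drop(columns="used") > 0.5).sum(axis=1)
print(reports.groupby("dimensions_addressed")["used"].mean())
```

Simple associations like these would of course need many more reports to mean anything, and a technique such as QCA (as used by Raab and Stuppert, quoted at the end of this post) or regression would handle combinations of attributes more convincingly; the point is only that the four hypotheses make different, testable predictions.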
While
we wait for the evidence…
We should consider
transparency as a pre-eminent quality criterion, which would be applicable
across all types of evaluation designs. It is a meta-quality, enabling
judgments about other qualities. It also addresses the issue of robustness,
which was of concern to DFID. The more explicit and articulated an evaluation
design is, the more vulnerable it will be to criticism and identification of
error. Robust designs will be those that
can survive this process. This view connects to wider ideas in the
philosophy of science about the importance of falsifiability as a
quality of scientific theories (promoted by Popper and others).
Transparency might be expected at both a
macro and micro level. At the macro level, we might ask these types of quality
assurance questions:
- Before the evaluation: Has an evaluation plan been lodged which includes the hypotheses to be tested? Doing so will help reduce selective reporting and opportunistic data mining.
- After the evaluation: Is the evaluation report available? Is the raw data available for re-analysis using the same or different methods?
Substantial progress
is now being made with the availability of evaluation reports. Some bilateral
agencies are considering the use of evaluation/trial registries, which are increasingly
commonplace in some fields of research. However, availability of raw data seems
likely to remain the most challenging requirement for many evaluators.
At the micro-level,
more transparency could be expected in the particular contents of evaluation
plans and reports. The DFID Quality Assurance templates seem to be the most
operationalised set of evaluation quality standards available at present. The
following types of questions could be considered for inclusion in those
templates:
- Is it clear how specific features of the project/program influenced the evaluation design?
- Have rejected evaluation design choices been explained?
- Have terms like impact been clearly defined?
- What kinds of impact were examined?
- Where attribution is claimed, is there also a plausible explanation of the causal processes at work?
- Have distinctions been made between causes which are necessary, sufficient or neither (but still contributory)?
- Are there assessments of what would have happened without the intervention?
This approach seems to
have some support in other spheres of evaluation work, not associated with
development aid: “The transparency, or
clarity, in the reporting of individual studies is key” (TREND statement, 2004).
In summary, three main
recommendations have been made above:
- Develop technical guidance notes, separate from additional quality criteria
- Identify specific areas where transparency of evaluation designs and methods is essential, for possible inclusion in DFID QA templates, and the like
- Seek and use opportunities to test out the relevance of different evaluation criteria, in terms of their effects on evaluation use
PS: This text was the
basis of one of the presentations to DFID staff (and others) in a workshop on 7th
October 2011 on the subject of “Developing a broader
range of rigorous designs and methods for impact evaluations”. The views expressed above are my own and should not be taken to reflect the views of either DFID or others involved in the exercise.
I was interested (and pleased) to read this comment in Raab and Stuppert's blog (http://vawreview.blogspot.co.uk/) on their use of QCA to assess the value of a set of evaluation studies:
"One thing that is special about our approach is that we do not only apply established quality standards to the evaluations we review. Instead, we will look into evaluation effects as well. Whether or not an evaluation has to fulfil established quality standards to produce positive effects is an open research question. To answer it, we have to include in our review evaluations that vary in the degree to which they fulfil certain methodological standards. We hope that our research will shed light on the factors that contribute to negative and positive evaluation effects."