Friday, March 16, 2012

Can we evolve explanations of observed outcomes?


In mathematics and computer science, an optimization problem is the problem of finding the best solution from all feasible solutions. There are various techniques for doing so.

Science as a whole can be seen as an optimisation process, involving a search for explanations that have the best fit with observed reality.

In evaluation we often have a similar task, of identifying what aspects of one or more project interventions best explain the observed outcomes of interest. For example, the effects of various kinds of improvements in health systems on rates of infant mortality. This can be done in two ways. One is by looking internally at the design of a project, at its expected workings, and then trying to find evidence of whether it worked that way in practice. This is the territory of theory-led evaluation. The other way is to look externally, at alternative explanations involving other influences, and to seek to test those. This is ostensibly good practice but not very common in reality, because it can be time consuming and to some extent inconclusive, in that there may always be other explanations not yet identified and thus untested. This is where randomised control trials (RCTs) come in. Randomised allocation of subjects between control and intervention groups is designed to neutralise the possible influence of other external causes. Qualitative Comparative Analysis (QCA) takes a slightly different approach, searching for multiple possible configurations of conditions which are both necessary and sufficient to explain all observed outcomes (both positive and negative instances).

The value of theory-led approaches, including QCA, is that the evaluator’s theories help the search for relevant data, amongst the myriad of possibly relevant design characteristics, and combinations thereof. The absence of a clear theory of change is often one reason why baseline surveys are so expansive in content, yet so rarely used. Without a halfway decent theory we can easily get lost. It is true that "There is nothing as practical as a good theory" (Kurt Lewin).

The alternative to theory-led approaches

There is however an alternative search process which does not require a prior theory, known as the evolutionary algorithm, the kernel of the process of evolution. The evolutionary processes of variation, selection and retention, iterated many times over, have been able to solve many complex optimisation problems, such as the design of a bird that can both fly long distances and dive deep in the sea for fish to eat. Genetic algorithms (GAs) are embodiments of the same kind of process in software programs, used to solve problems of interest to scientists and businesses. These are useful in two respects. One is the ability to search very large combinatorial spaces very quickly. The other is that they can come up with solutions involving particular combinations of attributes that might not have been so obvious to a human observer.

Development projects have attributes that vary. These include both the context in which they operate and the mechanisms by which they seek to work. There are many possible combinations of these attributes, but only some of these are likely to be associated with achieving a positive impact on people’s lives. If they were relatively common then implementing development aid projects would not be so difficult. The challenge is how to find the right combination of attributes. Trial and error by varying project designs and their implementation on the ground is a good idea in principle, but in practice it is slow. There is also a huge amount of systemic memory loss, for various reasons including poor or non-existent communications between various iterations of a project design taking place in different locations.

Can we instead develop models of projects, which combine real data about the distribution of project attributes with variable views of their relative importance in order to generate an aggregate predicted result? This expected result can then be compared to an observed result (ideally from independent sources).  By varying the influence of the different attributes a range of predicted results can be generated, some of which may be more accurate than others. The best way to search this large space of possibilities is by using a GA. Fortunately Excel now includes a simple GA add-in, known as Solver.

The following spreadsheet shows a very basic example of what such a model could look like, using a totally fictitious data set. The projects and their observed scores on four attributes (A-D) are shown on the left. Below them is a set of weights, reflecting the possible importance of each attribute for the aggregate performance of the projects. The Expected Outcome score for each project is the sum of its score on each attribute multiplied by the weight for that attribute. In other words, the more a project has an important attribute (or combination of them), the higher its Expected Outcome score will be. That score is important only as a relative measure, relative to that of the other projects in the model.

The Expected Outcome score for each project is then compared to an Observed Outcome measure (ideally converted to a comparable scale), and the difference is shown as the Prediction Error. At the bottom left is an aggregate measure of prediction error, the Standard Deviation (SD). The original data can be found in this Excel file.
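For anyone who wants to experiment outside Excel, the same weight-search can be sketched in a few lines of Python. This is a minimal illustration rather than the Solver model itself: the attribute scores, observed outcomes and variable names below are invented placeholders, the rescaling of expected outcomes to a comparable scale is just one simple choice, and SciPy’s differential evolution routine is used here as a stand-in for Solver’s evolutionary engine.

```python
import numpy as np
from scipy.optimize import differential_evolution

# Invented placeholder data: 8 projects scored 1-4 on four attributes (A-D),
# plus an Observed Outcome score for each project on a comparable scale.
scores = np.array([
    [3, 2, 4, 1],
    [1, 4, 2, 3],
    [4, 1, 3, 2],
    [2, 3, 1, 4],
    [3, 3, 2, 2],
    [1, 2, 4, 4],
    [4, 4, 1, 1],
    [2, 1, 3, 3],
], dtype=float)
observed = np.array([2.5, 3.0, 2.0, 3.5, 2.5, 3.5, 1.5, 2.5])

def prediction_error_sd(weights):
    """Expected Outcome = weighted average of each project's attribute scores;
    fitness = SD of (Expected - Observed) across all projects."""
    if weights.sum() == 0:
        return np.inf  # degenerate weighting: treat as the worst possible fit
    expected = scores @ weights / weights.sum()
    return np.std(expected - observed)

# Each weight can range from 0 to 100, mirroring the spreadsheet layout.
result = differential_evolution(prediction_error_sd, bounds=[(0, 100)] * 4, seed=1)
print("Best weights found:", np.round(result.x, 1))
print("SD of prediction errors:", round(result.fun, 3))
```

The same sketch can be pointed at the "worst case" search discussed further below, simply by returning the negative of the SD so that the optimiser maximises the error instead.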

 

The initial weights were set at 25 for each attribute, in effect reflecting the absence of any view about which might be more important. With those weights, the SD of the Prediction Errors was 1.25. After 60,000+ iterations in the space of one minute, the SD had been reduced to 0.97. This was achieved with this new combination of weights: Attribute A: 19, Attribute B: 0, Attribute C: 19, Attribute D: 61. The substantial error that remains can be considered as due to causal factors outside of the model (i.e. not captured by the list of attributes)[1].

It seems that it is also possible to find the least appropriate solutions, i.e. those which make the least accurate Outcome Predictions. Using the GA set to find the maximum error, it was found that in the above example a 100% weighting given to Attribute A generated an SD of 1.87. This is the nearest that such an evolutionary approach comes to disproving a theory.

GAs deliver functional rather than logical proofs that certain explanations are better than others. Unlike logical proofs, they are not immortal. With more projects included in the model it is possible that there may be a fitter solution which applies to this wider set. However, the original solution for the smaller set would still stand.

Models of complex processes can sometimes be sensitive to starting conditions. Different results can be generated from initial settings that are very similar. This was not the case in this exercise: widely different initial weightings evolved and converged on almost identical sets of final weightings (e.g. 19, 0, 19, 62 versus 19, 0, 19, 61), producing the same final error rate. This robustness is probably due to the absence of feedback loops in the model, which could be created where the weighted score of one attribute affected those of another. That would be a much more complex model, possibly worth exploring at another time.

Small changes in Attribute scores made a more noticeable difference to the Prediction Error. In the above model, varying Project 8’s score on attribute A from 3 to 4 increases the average error by 0.02. Changes in other cells varied in the direction of their effects. In more realistic models, with more kinds of attributes and more project cases, the results are likely to be less sensitive to such small differences in attribute scores.

The heading of this post asks “Can we evolve explanations of observed outcomes?” My argument above suggests that in principle it should be possible. However, there is a caveat. A set of weighted attributes that is associated with success might better be described as the ingredients of an explanation. Further investigative work would be needed to find out how those attributes actually interact in real life. Before then, it would be interesting to do some testing of this use of GAs on real project datasets.

Your comments please...

PS 6 April 2012: I have just come across the Kaggle website. This site hosts competitions to solve various kinds of prediction problems (re both past and future events) using a data set available to all entrants, and gives prizes to the winner - who must provide not only their prediction but the algorithm that generated the prediction. Have a look. Perhaps we should outsource the prediction and testing of results of development projects via this website? :-) Though..., even to do this the project managers would still have a major task on hand: to gather and provide reliable data about implementation characteristics, as well as measures of observed outcomes... Though...this might be easier with some projects that generate lots of data, say micro-finance or education system projects.

View this Australian TV video, explaining how the site works and some of its achievements so far. And see the Fast Company interview with the CEO.

PS 9 April 2012: I have just discovered that there is a whole literature on the use of genetic algorithms for rule discovery: "In a nutshell, the motivation for applying evolutionary algorithms to data mining is that evolutionary algorithms are robust search methods which perform a global search in the space of candidate solutions (rules or another form of knowledge representation)" (Freitas, 2002). The rules referred to are typically "IF...THEN..." type statements.
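As a toy illustration of the idea (not Freitas’s method, and using invented data and thresholds), an "IF...THEN..." rule can be encoded as a set of attribute thresholds and evolved against a dataset by keeping whichever mutated rule classifies at least as many cases correctly. A full GA would add a population and crossover; this is the bare minimum.

```python
import random

# Toy data, invented for illustration: (attribute A score, attribute B score,
# whether the outcome was achieved).
data = [
    (3, 1, True), (4, 2, True), (1, 4, False), (2, 3, False),
    (4, 1, True), (1, 1, False), (3, 4, True), (2, 2, False),
]

def accuracy(rule):
    """Rule = (min_A, min_B): IF A >= min_A AND B >= min_B THEN success."""
    a_min, b_min = rule
    hits = sum((a >= a_min and b >= b_min) == outcome for a, b, outcome in data)
    return hits / len(data)

# A bare-bones evolutionary loop: mutate the thresholds and keep the candidate
# rule whenever it does at least as well as the current one.
random.seed(1)
rule = (random.randint(1, 4), random.randint(1, 4))
for _ in range(200):
    candidate = tuple(max(1, min(4, t + random.choice([-1, 0, 1]))) for t in rule)
    if accuracy(candidate) >= accuracy(rule):
        rule = candidate

print("Evolved rule: IF A >=", rule[0], "AND B >=", rule[1], "THEN success")
print("Accuracy on the toy data:", accuracy(rule))
```

On real data the fitness function would usually also reward simpler rules, to reduce the risk of overfitting.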






[1] Bear in mind that this example set of attribute scores and observed outcome measures is totally fictitious, so the inability to find a really good set of fitting attributes should not be surprising. In reality some sets of attributes will not be found co-existing, because of their incompatibility, e.g. corrupt project management plus highly committed staff.



Tuesday, March 13, 2012

Modular Theories of Change: A means of coping with diversity and change?


Two weeks ago I attended a DFID workshop at which Price Waterhouse Coopers (PwC) consultants presented the results of their work, commissioned by DFID, on “Monitoring Results from Low Carbon Development”. LCD is one of three areas of investment by the International Climate Fund (ICF). The ICF is “a £2.9bn financial contribution … provided by the UK Government to support action on climate change and development. Having started to disperse funds, a comprehensive results framework is now required to measure the impact of this investment, to enable learning to inform future programming, and to show value for money on every pound”

The PwC consultants’ tasks included (a) consultation with HMG staff on the required functions of the LCD results framework; (b) a detailed analysis of potentially useful indicators through extensive consultations and research into the available data; and (c) exploration of opportunities to harmonise results and/or share methodologies and data collection with others. Their report documents the large amount of work that has been done, but also acknowledges that more work is still needed.

Following the workshop I sent in some comments on the PwC report, some of which I will focus on here because I think they might be of wider interest. There were three aspects of the PwC proposals that particularly interested me. One was the fact that they had managed to focus down on 28 indicators, and were proposing that the set be limited even further, down to 20. Secondly, they had organised the indicators into a LogFrame-type structure, but one covering two levels of performance in parallel (within countries and across countries), rather than in a sequence. Thirdly, they had advocated the use of Multi-Criteria Analysis (MCA) for the measurement of some of the more complex forms of change referred to in the Logframe. MCA is similar in structure to the design of weighted checklists, which I have previously discussed here and elsewhere.

Monitorable versus evaluable frameworks

As it stands the current LCD LogFrame is a potential means of monitoring important changes relating to low carbon development. But it is not yet sufficiently developed to enable an evaluation of the impact of efforts aimed at promoting low carbon development. This is because there is not yet sufficient clarity about the expected causal linkages between the various events described in the Logframe. It is the case that, as is required by DFID Logframes, weightings have been given to each of the four Outputs describing their expected impact on Outcome level changes. But the differences in weightings are modest (+/- 10%) and each of the Outputs describes a bundle of up to 5 more indicator-specific changes.

Clarity about the expected causal linkages is an essential “evaluability” requirement. Impact evaluations in their current form seek to establish not only what changes occurred, but also their causes. Accounts of causation in turn need to include not only attribution (whether A can be said to have caused B) but also explanation (how A caused B). In order for the LCD results framework to be evaluable, someone needs to “connect the dots” in some detail. That is, identify plausible explanations for how particular indicator-specific changes are expected to influence each other. Once that is done, the LCD program could be said to have not just a set of indicators of change, but a Theory of Change about how the changes interact and function as a whole.

Indicator level changes as shared building blocks

There are two further challenges to developing an evaluable Theory of Change for LCD. One is the multiplicity of possible causal linkages. The second is the diversity of perspectives on which of these possible causal linkages really matter. With 28 different indicator-specific changes there are, at least hypothetically, many thousands of different possible combinations that could make up a given Theory of Change (i^i, where i = the number of indicator-specific changes). But it can be well argued that “this is a feature, not a bug”. As the title of this blog suggests, the 28 indicators can be considered as equivalent to Lego building blocks. The same set (or parts thereof) can be combined in a multiplicity of ways to construct very different ToCs. The positive side to this picture is the flexibility and low cost. Different ToCs can be constructed for different countries, but each one does not involve a whole new set of data collection requirements. In fact it is reasonable to expect that in each country the causal linkages between different changes may be quite different, because of differences in the physical, demographic, cultural and economic context.

Documenting expected causal linkages (how the blocks are put together)

There are other more practical challenges, relating to how to exploit this flexibility. How do you seek stakeholder views of the expected causal connections, without getting lost in a sea of possibilities? One approach I have used in Indonesia and in Vietnam involves the use of simple network matrices, in workshops involving donor agencies and/or the national partners associated with a given project. Two examples are shown below. These don’t need to be understood in detail (one is still in Vietnamese), it is their overall structure that matters. 

A network matrix simply shows the entities that could be connected in the left column and the top row. The convention is that each cell in the matrix provides data on whether that row entity is connected to that column entity (and it may also describe the nature of the connection).

The Indonesian example below shows expected relationships between 16 Output indicators (left column) and 11 Purpose level indicators (top row) in a maternal health project. Workshop participants were asked to consider one Purpose level indicator at a time, and allocate 100 percentage points across the 16 Output indicators, with more percentage points meaning that an Output was expected to have more impact on that Purpose indicator, relative to the other Output indicators. Debate was encouraged between participants as figures were proposed for each cell down a column. Looking within the matrix we can see that for Purpose 3 it was agreed that Output indicator 1.1 would have the most impact. For some other Purpose level changes, impact was expected from a wider range of Outputs. The column on the right-hand side sums up the relative expected impact of each Output, providing useful guidance on where monitoring attention might be most usefully focused.


This exercise was completed in little over an hour. The matrix of results shows one set of expected relationships amongst many thousands of other possible sets that could exist within the same list of indicators. The same kind of data can be collected on a larger scale via online surveys, where the options down each column are represented within a single multiple choice question. Matrices like these, obtained either from different individuals or different stakeholder groups, can be compared with each other to identify relationships (i.e. specific cells) where there is the most/least agreement, as well as which relationships are seen as most important when all stakeholder views are added up. This information should then inform the focus of evaluations, allowing scarce attention and resources to be directed to the most critical relationships.
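For those who capture such workshop scores in a spreadsheet, the row and column arithmetic is easy to automate. The sketch below is a minimal illustration only: the indicator names and numbers are invented placeholders, not the Indonesian matrix itself.

```python
import pandas as pd

# Placeholder matrix: rows = Output indicators, columns = Purpose indicators.
# Each column records how 100 percentage points were allocated across Outputs.
matrix = pd.DataFrame(
    {
        "Purpose 1": [40, 30, 20, 10],
        "Purpose 2": [10, 50, 30, 10],
        "Purpose 3": [70, 10, 10, 10],
    },
    index=["Output 1.1", "Output 1.2", "Output 2.1", "Output 2.2"],
)

# Check that each column adds up to 100, as agreed in the workshop exercise.
assert (matrix.sum(axis=0) == 100).all()

# Row totals: the relative expected impact of each Output across all Purposes,
# a rough guide to where monitoring attention might be focused.
print(matrix.assign(Total=matrix.sum(axis=1)))
```

Matrices filled in by different stakeholder groups can then be subtracted from one another, cell by cell, to locate the links where views diverge most.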

The second example of a network matrix used to explicate a tacit ToC comes from Vietnam, and is shown below. In this example, a Ministry’s programmes are shown (unconventionally) across the top row and the country’s 5-year plan objectives are shown down the left column. Cell entries, discussed and proposed by workshop participants, show the relative expected causal contribution of each programme to each 5-year plan objective. The summary row at the bottom shows the aggregate expected contribution of each programme, and the summary column on the right shows the aggregate extent to which each 5-year plan objective was expected to be affected.


 Modularity

The modules referred to in the title of this post can be seen as referring to two types of entities that can be used to construct many different kinds of ToC. One is the indicator-specific changes in the LCD Logframe, for example. By treating them as a standard set available for use by different stakeholders in different settings, we may gain flexibility at a low cost. The other is the grouping of indicator-specific changes into categories (e.g. Outputs 1-2-3-4) and larger sets of categories (Outputs, Outcomes, Purpose). The existence of one or more nested types of entities is sometimes described as modularity. In evolutionary theory it has been argued that modularity in design improves evolvability. This can happen: (a) by allowing specific features to undergo changes without substantially altering the functionality of the entire system; (b) by allowing larger, more structural changes to occur by recombining existing functional units.

In the conceptual world of Logframes, and the like, this suggests that we may need to think of ToCs being constructed at multiple levels of detail, from different-sized modules. In the LCD Logframe impact weightings had already been assigned to each Output, indicating its relative expected contribution to the Outcomes as a whole. But the flexibility of ToC design at this level was seriously constrained by the structure of the representational device being used. In a Logframe, Outputs are expected to influence Outcomes, but not the other way around. Nor are they expected to influence each other, contra other, more graphic-based logic models. Similarly, both of the above network matrix exercises made use of existing modules and accepted the kinds of relationships that were expected between them (Outputs should influence Purpose level changes; Ministry programmes should influence 5-year plan objectives).

The value of multiple causal pathways within a ToC

More recently I have seen the ToC for a major area of DFID policy that will be coming under review. This is represented in diagrammatic form, showing various kinds of events (including some nested categories of events), and also shows the expected causal relationships between these events. It was quite a complex diagram, perhaps too much so for those who like traffic-light level simplicities. However, what interested me most is that subsequent versions have been used to show how two specific in-country programs fit within this generic ToC. This has been done by highlighting the particular events that make up one of a number of causal chains that can be found within the generic ToC. In doing so it appears to be successfully addressing a common problem with generic ToCs: the inability to reflect the diversity of the programs that make up the policy area described by a generic ToC.

Shared causal pathways justify more evaluation attention

This innovation points to an alternate and additional use of the matrices above. The cell numbers could refer to the number of constituent programs in a policy area (and/or funded by a single funding mechanism) that involve a particular causal link (i.e. between the row event and the column event). The higher this number, the more important it would be for evaluations to focus on that causal link, because the findings would have relevance across a number of programs in the policy area.


Thursday, February 16, 2012

Evaluation questions: Managing agency, bias and scale



It is common to see in the Terms of Reference (ToRs) of an evaluation a list of evaluation questions. Or, at least a requirement that the evaluator develops such a list of questions as part of the evaluation plan. Such questions are typically fairly open-ended “how” and “whether” type questions. On the surface this approach makes sense. It gives some focus but leaves room for the unexpected and unknown.

But perhaps there is an argument for a much more focused and pre-specified approach. 

Agency

There are two grounds on which such an argument could be made. One is that aid organisations implementing development programs have “agency”, i.e. they are expected to be able to assess the situation they are in and act on the basis of informed judgements. They are not just mechanical instruments for implementing a program, like a computer. Given this fact, one could argue that evaluations should not simply focus on the behaviour of an organisation and its consequences, but on the organisation’s knowledge of its behaviour and its consequences. If that knowledge is misinformed then the sustainability of any achievements may be seriously in doubt. Likewise, it may be less likely that unintended negative consequences of a program will be identified and responded to appropriately.

One way to assess an organisation’s knowledge is to solicit their judgements about program outcomes in a form that can be tested by independent observation. For example, an organisation’s view on the percentage of households who have been lifted above the poverty line as a result of a livelihood intervention. An external evaluation could then gather independent data to test this judgement, or more realistically, audit the quality of the data and analysis that the organisation used to come to their judgement. In this latter case the role of the external evaluator is to undertake a meta-evaluation, evaluating an organisation’s capacity by examining its judgements relating to key areas of expected program performance. This would require focused evaluation questions rather than open-ended evaluation questions.

Bias

The second argument arises from a body of research and argument about the prevalence of what appears to be endemic bias in many fields of research: the under-reporting of negative findings (i.e. non-relationships) and the related tendency of positive findings to disappear over time. The evidence here makes salutary reading, especially the evidence from the field of medical research, where research protocols are perhaps the most demanding of all (for good reason, given the stakes involved). Lehrer’s 2010 article in The New Yorker, “The Truth Wears Off: Is there something wrong with the scientific method?”, is a good introduction, and Ioannidis’ work (cited by Lehrer) provides the more in-depth analysis and evidence.

One solution that has been proposed to the problem of under-reporting of negative findings is the establishment of trial registries, whereby plans for experiments would be lodged in advance, before their results are known. This is now established practice in some fields of research and has recently been proposed for use with randomised control trials by development agencies[1]. Trial registries can provide two lines of defence against bias. The first is to make visible all trials, regardless of whether they are deemed “successful” and get published, or not. The other defence is against inappropriate “data mining”[2] within individual trials. The risk is that researchers can examine so many possible correlations between independent and dependent variables that some positive correlations will appear by chance alone. This risk is greater where a study looks at more than one outcome measure and at several different sub-groups. Multiple outcome measures are likely to be used when examining the impact on complex phenomena such as poverty levels or governance, for example. When there are many relationships being examined there is also the well-known risk of publication bias, of the evaluator only reporting the significant results.

These risks can be managed partly by the researchers themselves. Rasmussen et al suggest that if the outcomes are assumed to be fully independent, statistical significance values should be divided by the number of tests. Other approaches involve constructing mean standardised outcomes across a family of outcome measures. However these do not deal with the problem of selective reporting of results. Rasmussen et al argue that this risk would be best dealt with through the use of trial registries, where the relationships to be examined are recorded in advance. In other words, researchers would spell out the hypothesis or claim to be tested, rather than simply state an open-ended question. Open-ended questions invite cherry picking of results according to the researcher’s interests, especially when there are a lot of them.
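The adjustment Rasmussen et al describe amounts to a Bonferroni-style correction for multiple tests. A minimal sketch, using invented p-values, looks like this:

```python
# Bonferroni-style adjustment: with several independent outcome tests, the
# significance threshold is divided by the number of tests carried out.
p_values = [0.04, 0.01, 0.20, 0.03]   # invented results from four outcome tests
alpha = 0.05
adjusted_alpha = alpha / len(p_values)

for i, p in enumerate(p_values, start=1):
    verdict = "significant" if p < adjusted_alpha else "not significant"
    print(f"Test {i}: p = {p:.2f} -> {verdict} at adjusted alpha {adjusted_alpha:.4f}")
```

Corrections like this address the multiplicity problem, but not selective reporting, which is where pre-registration of the hypotheses comes in.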

As I have noted elsewhere, there are risks with this approach. One concern is that it might prevent evaluators from looking at the data and identifying new hypotheses that genuinely emerge as being of interest and worth testing. However, registering the hypotheses to be tested would not preclude this possibility. It should, however, make it evident when this is happening, and therefore encourage the evaluator to provide an explicit rationale for why additional hypotheses are being tested.

Same again, on a larger scale

The problems of biased reporting re-appear when individual studies are aggregated. Ben Goldacre explains:  

But individual experiments are not the end of the story. There is a second, crucial process in science, which is synthesising that evidence together to create a coherent picture.
In the very recent past, this was done badly. In the 1980s, researchers such as Celia Mulrow produced damning research showing that review articles in academic journals and textbooks, which everyone had trusted, actually presented a distorted and unrepresentative view, when compared with a systematic search of the academic literature. After struggling to exclude bias from every individual study, doctors and academics would then synthesise that evidence together with frightening arbitrariness.

The science of "systematic reviews" that grew from this research is exactly that: a science. It's a series of reproducible methods for searching information, to ensure that your evidence synthesis is as free from bias as your individual experiments. You describe not just what you found, but how you looked, which research databases you used, what search terms you typed, and so on. This apparently obvious manoeuvre has revolutionised the science of medicine”

Reviews face the same risks as individual experiments and evaluations. They may be selectively published, and their individual methodologies may not adequately deal with the problem of selective reporting of the more interesting results – sometimes described as cherry picking.  The development of review protocols and the registering of those prior to a review are an important means of reducing biased reporting, as they are with individual experiments. Systematic reviews are already a well established practice in the health sphere under the Cochrane Collaboration and in social policy under the Campbell Collaboration. Recently a new health sector journal, Systematic Reviews, has been established with the aim of ensuring that the results of all well-conducted systematic reviews are published, regardless of their outcome. The journal also aims to promote discussion of review methodologies, with the current issue including a paper on “Evidence summaries”, a rapid review approach.

It is commonplace for large aid organisations to request synthesis studies of achievements across a range of programs, defined by geography (e.g. a country program) or subject matter (e.g. livelihood interventions). A synthesis study requires some meta-evaluation, of what evidence is of sufficient quality and what is not. These judgements inform both the sampling of sources and the weighing of evidence found within the selected sources. Despite the prevalence of synthesis studies, I am not aware of much literature on appropriate methodologies for such reviews, at least within the sphere of development evaluation. [I would welcome corrections to this view]

However, there are signs that experiences elsewhere with systematic reviews are being attended to. In the development field, the International Development Coordinating Group has been established under the auspices of the Campbell Collaboration, with the aim of encouraging registration of review plans and protocols and then disseminating “systematic reviews of high policy-relevance with a dedicated focus on social and economic development interventions in low and middle income countries”. DFID and AusAID have funded 3ie to commission a body of systematic reviews of what it identifies as rigorous impact evaluations, in a range of development fields. More recently an ODI Discussion Paper has reviewed some experiences with the implementation of systematic reviews. Associated with the publication of this paper was a useful online discussion.

Three problems that were identified are of interest here. One is the difficulty of accessing source materials, especially evaluation reports, many of which are not in the public domain but should be. This problem is faced by all review methods, systematic and otherwise. It is now being addressed on multiple fronts, by individual organisation initiatives (e.g. the 3ie and IDS evaluation databases) and by collective efforts such as the International Aid Transparency Initiative. The authors of the ODI paper note that “there are no guarantees that systematic reviews, or rather the individuals conducting them, will successfully identify every relevant study, meaning that subsequent conclusions may only partially reflect the true evidence base.” While this is true of any type of review process, it is the transparency of the sample selection (via protocols) and the visibility of the review itself (via registries) which help make this problem manageable.

The second problem, as seen by the authors, is that “Systematic reviews tend to privilege one kind of method over another, with full-blown randomised controlled trials (RCTs) often representing the ‘gold standard’ of methodology and in-depth qualitative evidence not really given the credit it deserves.” This does not have to be the case. A systematic review has been usefully defined as “an overview of primary studies which contains an explicit statement of objectives, materials, and methods and has been conducted according to explicit and reproducible methodology”. Replicability is key, and this requires systematic and transparent processes relating to sampling and analysis. These should be evident in the protocols.

A third problem was identified by 3ie, in their commentary on the Discussion Paper. This relates directly to the initial focus of this blog, the argument for more focused evaluation questions. They comment that:

“Even with plenty of data available, making systematic reviews work for international development requires applying the methodology to clearly defined research questions on issues where a review seems sensible. This is one of the key lessons to emerge from recent applications of the methodology. A review in medicine will often ask a narrow question such as the Cochrane Collaboration’s recent review on the inefficacy of oseltamivir (tamiflu) for preventing and treating influenza. Many of the review questions development researchers have attempted to answer in recent systematic reviews seem too broad, which inevitably leads to challenges. There is a trade-off between depth and breadth, but if our goal is to build a sustainable community of practice around credible, high quality reviews we should be favouring depth of analysis where a trade-off needs to be made.”





[1] By the head of DFID EvD in 2011 and by Rasmussen et al, see below.
[2] See Ole Dahl Rasmussen, Nikolaj Malchow-Møller, Thomas Barnebeck Andersen, Walking the talk: the need for a trial registry for development interventions,  available via http://mande.co.uk/2011/uncategorized/walking-the-talk-the-need-for-a-trial-registry-for-development-interventions/

Monday, October 24, 2011

Evaluation quality standards: Theories in need of testing?


Since the beginning of this year I have been part of a DFID-funded exercise which has the aim of “Developing a broader range of rigorous designs and methods for impact evaluations”. Part of the brief has been to develop draft quality standards, to help identify “the difference between appropriate, high quality use of the approach and inappropriate/poor quality use”.

A quick search of what already exists suggests that there is no shortage of quality standards. Those relevant to development projects have been listed online here. They include:
  • Standards agreed by multiple organisations, e.g. OECD-DAC and various national evaluation societies. The former are of interest to aid organisations whereas the latter are of more interest to evaluators.
  • Standards developed for use within individual organisations, e.g. DFID and EuropeAID
  • Methodology specific standards, e.g. those relating to randomised and other kinds of experimental methods, and qualitative research
In addition there is a much larger body of academic literature on the use and mis-use of various more specific methods.

A scan of the criteria I have listed shows that a variety of types of evaluation criteria are used, including:
  • Process criteria, where the focus is on how evaluations are done. e.g. relevance, timeliness, accessibility, inclusiveness
  • Normative criteria, where the focus is on principles of behaviour e.g. independence, impartiality, ethicality
  • Technical criteria, where the focus is on attributes of the methods used e.g. reliability and validity
Somewhat surprisingly, technical criteria like reliability and validity are in the minority, being two of at least 20 OECD-DAC criteria. The more encompassing topic of Evaluation Design is only one of the 17 main topics in the DFID Quality Assurance template for revising draft evaluations. There are three possible reasons why this is so: (a) process attributes may be more important, in terms of their effects on what happens to an evaluation during and after its production; (b) it is hard to identify generic quality criteria for a diversity of evaluation methodologies; (c) lists have no size limits. For example, the DFID QA template has 85 subsidiary questions under 17 main topics.

Given these circumstances, what is the best way forward to address the need for quality standards for “a broader range of rigorous designs and methods for impact evaluations”? The first step might be to develop specific guidance which can be packaged in separate notes on particular evaluation designs and methods. The primary problem may be a simple lack of knowledge about the methods available; knowing how to choose between them may in fact be “a problem we would like to have”, which needs to be addressed after people at least know something about the alternative methods. The Asian Development Bank has addressed this issue through its “Knowledge Solutions” series of publications.

The second step that could be taken would be to develop more generic guidance that can be incorporated into the existing quality standards. Our initial proposal focused on developing some additional design-focused quality standards that could be used with some reliability across different users. But perhaps this is a side issue. Finding out which quality criteria really matter may be more important. However, there seems to be very little evidence on which quality attributes matter. In 2008 Forss et al carried out a study: “Are Sida Evaluations Good Enough? An Assessment of 34 Evaluation Reports”. The authors gathered and analysed empirical data on 40 different quality attributes of evaluation reports published between 2003 and 2005. Despite suggestions made, the report was not required to examine the relationship between these attributes and the subsequent use of the evaluations. Yet the insufficient use of evaluations has been a long-standing concern to evaluators and to those funding evaluations.

There are at least 4 different hypotheses that would be worth testing in future versions of the SIDA study that did look at evaluation quality and usage:
  1. Quality is largely irrelevant, what matters is how the evaluation results are communicated.
  2. Quality matters, especially the use of a rigorous methodology, which is able to address attribution issues
  3. Quality matters, especially the use of participatory processes that engage stakeholders
  4. Quality matters, but it is a multi-dimensional issue. The more dimensions are addressed, the more likely that the evaluation results will be used.
The first is in effect the null hypothesis, and one which needs to be taken seriously. The second hypothesis seems to be the position taken by 3ie and other advocates of RCTs and their next-best substitutes. It could be described as the killer assumption made by RCT advocates that is yet to be tested. The third could be the position of some of the supporters of the “Big Push Back” against inappropriate demands for performance reporting. The fourth is the view present in the OECD-DAC evaluation standards, which can be read as a narrative theory of change about how a complex of evaluation quality features will lead to evaluation use, strengthened accountability, contributions to learning and improved development outcomes. I have taken the liberty of identifying the various possible causal connections in that theory of change in the network diagram below. As noted above, one interesting feature is that the attributes of reliability and validity are only one part of a much bigger picture.


[Click on image to view a larger version of the diagram]
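For anyone who wants to experiment with this kind of representation, a directed graph library makes the structure explicit and easy to query. The sketch below is illustrative only: the node names and links are my own hypothetical examples of quality-to-use connections, not a transcription of the diagram above.

```python
import networkx as nx

# Hypothetical fragment of a quality-to-use theory of change as a directed graph.
toc = nx.DiGraph()
toc.add_edges_from([
    ("Reliability & validity", "Credibility of findings"),
    ("Stakeholder participation", "Credibility of findings"),
    ("Timeliness", "Evaluation use"),
    ("Credibility of findings", "Evaluation use"),
    ("Evaluation use", "Learning and accountability"),
])

# Nodes with many incoming links are points where several quality attributes
# are expected to converge, and so may deserve the most evaluation attention.
for node in toc.nodes:
    print(node, "<-", toc.in_degree(node), "incoming link(s)")
```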

While we wait for the evidence…

We should consider transparency as a pre-eminent quality criterion, one which would be applicable across all types of evaluation designs. It is a meta-quality, enabling judgements about other qualities. It also addresses the issue of robustness, which was of concern to DFID. The more explicit and articulated an evaluation design is, the more vulnerable it will be to criticism and the identification of error. Robust designs will be those that can survive this process. This view connects to wider ideas in the philosophy of science about the importance of falsifiability as a quality of scientific theories (promoted by Popper and others).

Transparency might be expected at both a macro and micro level. At the macro level, we might ask these types of quality assurance questions:
  • Before the evaluation: Has an evaluation plan been lodged, which includes the hypotheses to be tested? Doing so will help reduce selective reporting and opportunistic data mining
  • After the evaluation: Is the evaluation report available? Is the raw data available for re-analysis using the same or different methods?
Substantial progress is now being made with the availability of evaluation reports. Some bilateral agencies are considering the use of evaluation/trial registries, which are increasingly commonplace in some fields of research. However, the availability of raw data seems likely to remain the most challenging requirement for many evaluators.

At the micro-level, more transparency could be expected in the particular contents of evaluation plans and reports. The DFID Quality Assurance templates seem to be the most operationalised set of evaluation quality standards available at present. The following types of questions could be considered for inclusion in those templates:
  • Is it clear how specific features of the project/program influenced the evaluation design?
  • Have rejected evaluation design choices been explained?
  • Have terms like impact been clearly defined?
  • What kinds of impact were examined?
  • Where attribution is claimed, is there also a plausible explanation of the causal processes at work?
  • Have distinctions been made between causes which are necessary, sufficient or neither (but still contributory)?
  •  Are there assessments of what would have happened without the intervention?
This approach seems to have some support in other spheres of evaluation work not associated with development aid: “The transparency, or clarity, in the reporting of individual studies is key” (TREND statement, 2004).

In summary, three main recommendations have been made above:
  • Develop technical guidance notes, separate from additional quality criteria
  • Identify specific areas where transparency of evaluation designs and methods is essential, for possible inclusion in DFID QA templates, and the like
  • Seek and use opportunities to test out the relevance of different evaluation criteria, in terms of  their effects on evaluation use
PS: This text was the basis of one of the presentations to DFID staff (and others) in a workshop on 7th October 2011 on the subject of “Developing a broader range of rigorous designs and methods for impact evaluations” The views expressed above are my own and should not be taken to reflect the views of either DFID or others involved in the exercise.


Sunday, September 04, 2011

Relative rather than absolute counterfactuals: A more useful alternative?


Background

The basic design of a randomised control trial (RCT) involves comparisons of two groups: an intervention (or “treatment”) group and a control group, at two points of time, before an intervention begins and after the intervention ends. The expectation (hypothesis) is that there will be a bigger change on an agreed impact measure in the intervention group than in the control group. This hypothesis can be tested by comparing the average change in the impact status of members of the two groups, and applying a statistical test to establish that this difference was unlikely to be a chance finding (e.g. less than 5% probability of being a chance difference). The two groups are made comparable by randomly assigning participants to both groups. The types of comparisons involved are shown in this fictional example below.


                        A. Intervention group                          B. Control group
Before intervention     Average income per household = $1000/year      Average income per household = $1000/year
                        (N = 500)                                      (N = 500)
After intervention      Average income per household = $1500/year      Average income per household = $1200/year
                        (N = 500)                                      (N = 500)
Difference over time    $500                                           $200

Difference between the changes in A and B = $300

[PS: See Comment 3 below re this table]
This method allows a comparison with what could be called an absolute counterfactual: what would have happened if there had been no intervention.
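The comparison in the table is essentially what is often called a difference-in-differences calculation. A minimal sketch, using the fictional figures above:

```python
# Difference-in-differences on the fictional income figures in the table.
intervention_before, intervention_after = 1000, 1500
control_before, control_after = 1000, 1200

intervention_change = intervention_after - intervention_before   # $500
control_change = control_after - control_before                  # $200
estimated_effect = intervention_change - control_change          # $300

print(f"Estimated effect of the intervention: ${estimated_effect} per household per year")
```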

Note that only the impact indicator is measured; there is no measurement of the intervention. This is because the intervention is assumed to be the same across all participants in the intervention group. This assumption is reasonable with some development interventions, such as those involving financial or medical activities (e.g. cash transfers or de-worming). Some information-based interventions, using radio programs or the distribution of booklets, can also be assumed to be available to all participants in a standardised form. Where delivery is standardised it makes sense to measure the average impacts on the intervention and control groups, because significant variations in impact are not expected to arise from the intervention.

Alternate views

There are however many development interventions where delivery is not expected to be standardised, and where the opposite is the case: delivery is expected to be customised. Here the agent delivering the intervention is expected to have some autonomy and to use that autonomy to the benefit of the participants. Examples of such agents would include community development workers, agricultural extension workers, teachers, nurses, midwives, doctors, plus all their supervisors. On a more collective level would be providers of training to such groups working in different locations. Also included would be almost all forms of technical assistance provided by development agencies.

In these settings measurement of the intervention, as well as the actual impact, will be essential before any conclusions can be drawn about attribution – the extent to which the intervention caused the observed impacts. Let us temporarily assume that it will be possible to come up with a measurement of the degree to which an intervention has been successfully implemented, a quality measure of some kind. It might be very crude, such as number of days an extension worker has spent in villages they are responsible for, or it might be a more sophisticated index combining multiple attributes of quality (e.g. weighted checklists).

Data on implementation quality and observed impact (i.e. an After minus a Before measure) can now be brought together in a two-dimensional scatter plot. In this exercise there is no longer a control group, just an intervention group where implementation has been variable but measured. This provides an opportunity to explore the relative counterfactual: what would have happened if implementation had been less successful, and less successful still, etc. In this situation we could hypothesise that if the intervention did cause the observed impacts then there would be a statistically significant correlation between the quality of implementation and the observed impact. In place of an absolute counterfactual obtained via the use of a control group where there was no intervention, we have relative counterfactuals, in the form of participants exposed to interventions of different qualities. In place of an average, we have a correlation.

There are a number of advantages to this approach. Firstly, with the same amount of evaluation funds available, the number of intervention cases that can be measured can be doubled, because a control group is no longer being used. In addition to obtaining (or not) a statistically significant correlation, we can also identify the strength of the relationship between the intervention and the impact. This will be visible in the slope of the regression line. A steep slope[1] would imply that small improvements in implementation can make big improvements in observed impacts, and vice versa. If a non-linear relationship is found then the shape of a best-fitting regression line might also be informative about where improvements will generate more versus less improvement.

Another typical feature of scatter plots is outliers. There may be some participants (individuals or groups) who have received a high quality intervention but where the impact has been modest, i.e. a negative outlier. Conversely, there may be some participants who have received a poor quality intervention but where the impact has been impressive, i.e. a positive outlier. These are both important learning opportunities, which could be explored via the use of in-depth case studies. But ideally these case studies would be informed by some theory, directing us where to look.
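A minimal sketch of this kind of analysis, using invented implementation-quality and impact scores (the data, sample size and variable names are all placeholders): it reports the correlation and regression slope discussed above, and flags the cases furthest from the regression line as candidate positive and negative outliers for follow-up case studies.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Invented data: implementation quality score and observed change in the impact
# measure (After minus Before) for 40 communities in the intervention group.
quality = rng.uniform(0, 10, size=40)
impact = 2.0 * quality + rng.normal(0, 4, size=40)   # built-in noise

# Correlation and regression slope take the place of the control-group average.
result = stats.linregress(quality, impact)
print(f"Correlation r = {result.rvalue:.2f}, p = {result.pvalue:.4f}")
print(f"Slope = {result.slope:.2f} impact units per unit of implementation quality")

# Residuals: communities far above or below the regression line are the positive
# and negative outliers worth following up with case studies.
residuals = impact - (result.intercept + result.slope * quality)
for i in np.argsort(np.abs(residuals))[-3:]:
    kind = "positive" if residuals[i] > 0 else "negative"
    print(f"Community {i}: {kind} outlier (residual {residuals[i]:.1f})")
```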

Evaluators sometimes talk about implementation failure versus theory failure. In her Genuine Evaluation blog, Patricia Rogers gives an interesting example from Ghana, involving the distribution of Vitamin A tablets to women in order to reduce pregnancy-related mortality rates. Contrary to previous findings, there was no significant impact. But as Patricia noted, the researchers appeared to have failed to measure compliance, i.e. whether all the women actually took the tablets given to them! This appears to be a serious case of implementation failure, in that the implementers could have designed a delivery mechanism that ensured compliance. Theory failure would be where our understanding of how Vitamin A affects women’s health appears to be faulty, because expected impacts do not materialise after women have taken the prescribed medication.

In the argument developed so far, we have already proposed measuring the quality of implementation, rather than making any assumptions about how it is happening. However, it is still possible that we might face “implementation measurement failure”. In other words, there may be some aspect of the implementation process that was not captured by the measure used, and which was causally connected to the conspicuous impact, or lack thereof. A case study looking at the implementation process in the outlier cases might help us identify the missing dimension. Re-measurement of implementation success incorporating this dimension might produce a higher correlation. If it did not, then we might by default have a good reason to believe we are now dealing with theory failure, i.e. a lack of understanding of how an intervention has its impact. Again, case studies of the outliers could help generate hypotheses about this. Testing these out is likely to be more expensive than testing alternate views of implementation processes, because data will be less readily at hand. For reasons of economy and practicality, implementation failure should be our first suspect.

In addition to having informative outliers to explore, the use of a scatter plot enables us to identify another potential outcome not readily available via the use of control groups, where the focus is on averages. In some programmes poor implementation may not simply lead to no impact (i.e. no difference between the average impact of control and intervention groups). Poor implementation may lead to negative impacts. For example, a poorly managed savings and credit programme may lead to increased indebtedness in some communities. In a standard comparison between intervention and control groups this type of failure would usually need to be present in a large number of cases before it became visible in a net negative average impact. In a scatter plot any negative cases would be immediately visible, including their relationship to implementation quality.

To summarise so far, the assumption about standardised delivery of an intervention does not fit the reality of many development programmes. Replacing assumptions by measurement will provide a much richer picture of the relationship between an intervention and the expected impacts. Overall impact can still be measured, by using a correlation coefficient. In addition we can see the potential for greater impact present in existing implementation practice (the slope of the regression line). We can also find outliers that can help improve our understanding of implementation and impact process. We can also quickly identify negative impacts, as well as the absence of any impact.

Perhaps more important still, the exploration of internal differences in implementation means that the autonomy of development agents can be valued and encouraged. Local experimentation might then generate more useful outliers, and not be seen simply as statistical noise. This is experimentation with a small e, of the kind advocated by Chris Blattman in his presentation to DFID on 1st September 2011, and of a kind long advocated by most competent NGOs.

Given this discussion is about counterfactuals, it might be worth considering what would happen if this implementation-measurement-based approach was not used where an intervention is being delivered in a non-standard way. One example is a quasi-experimental evaluation of an agricultural project in Tanzania, described in Oxfam GB‘s paper on its Global Performance Framework[2]. “Oxfam is working with local partners in four districts of Shinyanga Region, Tanzania, to support over 4,000 smallholder farmers (54% of whom are women) to enhance their production and marketing of local chicken and rice. To promote group cohesion and solidarity, the producers are encouraged to form themselves into savings and internal lending communities. They are also provided with specialised training and marketing supporting, including forming linkages with buyers through the establishment of collection centres.” This is a classic case where the staff of the partner organisations would need to exercise considerable judgement about how best to help each community. It is unlikely that each community was given a standard package of assistance, without any deliberate customisation or any unintentional quality variations along the way. Nevertheless, the evaluation chose to measure the impact of the partners’ activities on changes in household incomes and women’s decision making power by comparing the intervention group with a control group. Results of the two groups were described in terms of “% of targeted households living on more than £1.00 per day per capita” and “% of supported women are meaningfully involved in household decision making”. In using these measures to make comparisons, Oxfam GB has effectively treated quality differences in the extension work as noise to be ignored, rather than as valuable information to be analysed. In the process they have unintentionally devalued the work of their partners.

A similar problem can be found elsewhere in the same document, where Oxfam GB describes its new set of global outcome indicators. The Livelihood Support indicator is: % of targeted households living on more than £1.00 per day per capita (as used in the Tanzania example). In four of the six global indicators the unit of analysis is people, the ultimate intended beneficiaries of Oxfam GB’s work. However, the problem is that in most cases Oxfam GB does not work directly with such people. Instead Oxfam GB typically works with local NGOs who in turn work with such groups. In claiming to have increased the % of targeted households living on more than £1.00 per day per capita, Oxfam GB is again obscuring through simplification the fact that it is those partners who are responsible for these achievements. Instead, I would argue that the unit of analysis in many of Oxfam GB’s global outcome indicators should be the behaviour and performance of its partners. Its global indicator for Livelihood Support should read something like this: “x % of Oxfam GB partners working on rural livelihoods have managed to double the proportion of targeted households living on more than £1.00 per day per capita”. Credit should be given where credit is due. However, these kinds of claims will only be possible if and where Oxfam GB encourages partners to measure their implementation performance as well as the changes taking place in the communities they are working with, and then to analyse the relationship between both measures.

Ironically, the suggestion to measure implementation sounds rather unfashionable and regressive, because we are often reading how in the past aid organisations used to focus too much on outputs and that now they need to focus more on impacts. But in practice it is not an either/or question. What we need is both, and both done well. Not something quickly produced by the Department of Rough Measures.

PS 4th September 2011: I forgot to discuss the issue of whether any form of randomisation would be useful where relative counterfactuals are being explored. In an absolute counterfactual experiment the recipients’ membership of control versus intervention groups is randomised. In a relative counterfactual “experiment” all participants will receive an intervention, so there is no need to randomly assign participants to control versus intervention groups. But randomisation could be used to decide which staff work with which participants (and vice versa). For example, where a single extension worker is assigned to a given community. But this would be less easy where a whole group of staff, e.g. in a local health centre or local school, is responsible for the surrounding community.

Even where randomisation of staff was possible, this would not prevent external factors from influencing the impact of the intervention. It could be argued that the groups experiencing the least impact and the poorest quality implementation were doing so because of the influence of an independent cause (e.g. geographical isolation) that is not present amongst the groups experiencing bigger impacts and better quality implementation. Geographical isolation is a common external influence in many rural development projects, one which is likely to make implementation of a livelihood initiative more difficult, as well as making it more difficult for the participants to realise any benefits, e.g. through sales of new produce at a regional market. Other external influences may affect the impact but not the intervention, e.g. subsequent changes in market prices for produce. However, identifying the significance of external influences should be relatively easy, by making statistical tests of the difference in their prevalence in the high and low impact groups. This does of course require being able to identify potential external influences, whereas with randomised control trials (RCTs) no knowledge of other possible causes is needed (their influence is assumed to be equally distributed between control and intervention groups). However, this requirement could be considered a "feature" rather than a "bug", because exploration of the role of other causal factors could inform and help improve implementation. On the other hand, the randomisation of control and intervention groups could encourage management's neglect of the role of other causal factors. There are clearly trade-offs here between the competing evaluation quality criteria of rigour and utility.
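A minimal sketch of the kind of test suggested here, with invented counts: comparing the prevalence of one possible external influence (geographical isolation) in the high-impact versus low-impact groups. The group sizes and counts are placeholders, and an exact test is used because the table is small.

```python
from scipy.stats import fisher_exact

# Invented 2x2 table: rows = high / low impact groups,
# columns = communities that are / are not geographically isolated.
table = [[4, 16],   # high impact: 4 isolated, 16 not isolated
         [12, 8]]   # low impact: 12 isolated, 8 not isolated

odds_ratio, p_value = fisher_exact(table)
print(f"Odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")
# A small p-value would suggest that isolation is unevenly distributed between
# the groups, and so worth examining as a competing or contributing explanation.
```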


[1] i.e. with observed impact on the y-axis and intervention quality on the x-axis.