Monday, October 24, 2011

Evaluation quality standards: Theories in need of testing?


Since the beginning of this year I have been part of a DFID funded exercise which has the aim of “Developing a broader range of rigorous designs and methods for impact evaluations”. Part of the brief has been to develop draft quality standards, to help identify “the difference between appropriate, high quality use of the approach and inappropriate/poor quality use”.

A quick search of what already exists suggests that there is no shortage of quality standards. Those relevant to development projects have been listed online here. They include:
  • Standards agreed by multiple organisations, e.g. OECD-DAC and various national evaluation societies. The former are of interest to aid organisations whereas the latter are of more interest to evaluators.
  • Standards developed for use within individual organisations, e.g. DFID and EuropeAID
  • Methodology specific standards, e.g. those relating to randomised and other kinds of experimental methods, and qualitative research
In addition there is a much larger body of academic literature on the use and mis-use of various more specific methods.

A scan of the criteria I have listed shows that a variety of types of evaluation criteria are used, including:
  • Process criteria, where the focus is on how evaluations are done. e.g. relevance, timeliness, accessibility, inclusiveness
  • Normative criteria, where the focus is on principles of behaviour e.g. independence, impartiality, ethicality
  • Technical criteria, where the focus is on attributes of the methods used e.g. reliability and validity
Somewhat surprisingly, technical criteria like reliability and validity are in the minority, being two of at least 20 OECD-DAC criteria. The more encompassing topic of Evaluation Design is only one of the 17 main topics in the DFID Quality Assurance template for revising draft evaluations. There are three possible reasons why this is so: (a) Process attributes may be more important, in terms of their effects on what happens to an evaluation, during and after its production, (b) It is hard to identify generic quality criteria for a diversity of evaluation methodologies, (c) Lists have no size limits. For example, the DFID QA template has 85 subsidiary questions under 17 main topics.

Given these circumstances, what is the best way of addressing the need for quality standards for “a broader range of rigorous designs and methods for impact evaluations”? The first step might be to develop specific guidance which can be packaged in separate notes on particular evaluation designs and methods. The primary problem may be simple lack of knowledge about the methods available; knowing how to choose between them may be in fact “a problem we would like to have”, which needs to be addressed after people at least know something about the alternative methods. The Asian Development Bank has addressed this issue through its “Knowledge Solutions” series of publications.

The second step that could be taken would be to develop more generic guidance that can be incorporated into the existing quality standards. Our initial proposal focused on developing some additional design focused quality standards that could be used with some reliability across different users. But perhaps this is a side issue. Finding out what quality criteria really matter may be more important. However, there seems to be very little evidence on what quality attributes matter. In 2008 Forss et al carried out a study: “Are Sida Evaluations Good Enough? An Assessment of 34 Evaluation Reports”. The authors gathered and analysed empirical data on 40 different quality attributes of evaluation reports published between 2003 and 2005. Despite suggestions made, the report was not required to examine the relationship between these attributes and the subsequent use of the evaluations. Yet the insufficient use of evaluations has been a long standing concern to evaluators and to those funding evaluations.

There are at least four different hypotheses that would be worth testing in future versions of the Sida study that did look at evaluation quality and usage:
  1. Quality is largely irrelevant, what matters is how the evaluation results are communicated.
  2. Quality matters, especially the use of a rigorous methodology, which is able to address attribution issues
  3. Quality matters, especially the use of participatory processes that engage stakeholders
  4. Quality matters, but it is a multi-dimensional issue. The more dimensions are addressed, the more likely that the evaluation results will be used.
The first is in effect the null hypothesis, and one which needs to be taken seriously. The second hypothesis seems to be the position taken by 3ie and other advocates of RCTs and their next-best substitutes. It could be described as the killer assumption being made by RCT advocates that is yet to be tested. The third could be the position of some of the supporters of the “Big Push Back” against inappropriate demands for performance reporting. The fourth is the view present in the OECD-DAC evaluation standards, which can be read as a narrative theory of change about how a complex of evaluation quality features will lead to evaluation use, strengthened accountability, learning and improved development outcomes. I have taken the liberty of identifying the various possible causal connections in that theory of change in this network diagram below. As noted above, one interesting feature is that the attributes of reliability and validity are only one part of a much bigger picture.


[Diagram: network of possible causal connections in the OECD-DAC evaluation standards]

While we wait for the evidence…

We should consider transparency as a pre-eminent quality criterion, which would be applicable across all types of evaluation designs. It is a meta-quality, enabling judgments about other qualities. It also addresses the issue of robustness, which was of concern to DFID. The more explicit and articulated an evaluation design is, the more vulnerable it will be to criticism and identification of error. Robust designs will be those that can survive this process. This view connects to wider ideas in the philosophy of science about the importance of falsifiability as a quality of scientific theories (promoted by Popper and others).

Transparency might be expected at both a macro and micro level. At the macro level, we might ask these types of quality assurance questions:
  • Before the evaluation: Has an evaluation plan been lodged, which includes the hypotheses to be tested? Doing so will help reduce selective reporting and opportunistic data mining
  • After the evaluation: Is the evaluation report available? Is the raw data available for re-analysis using the same or different methods?
Substantial progress is now being made with the availability of evaluation reports. Some bilateral agencies are considering the use of evaluation/trial registries, which are increasingly commonplace in some fields of research. However, availability of raw data seems likely to remain the most challenging requirement for many evaluators.

At the micro-level, more transparency could be expected in the particular contents of evaluation plans and reports. The DFID Quality Assurance templates seem to be the most operationalised set of evaluation quality standards available at present. The following types of questions could be considered for inclusion in those templates:
  • Is it clear how specific features of the project/program influenced the evaluation design?
  • Have rejected evaluation design choices been explained?
  • Have terms like impact been clearly defined?
  • What kinds of impact were examined?
  • Where attribution is claimed, is there also a plausible explanation of the causal processes at work?
  • Have distinctions been made between causes which are necessary, sufficient or neither (but still contributory)?
  • Are there assessments of what would have happened without the intervention?
This approach seems to have some support in other spheres of evaluation work, not associated with development aid: “The transparency, or clarity, in the reporting of individual studies is key” (TREND statement, 2004).

In summary, three main recommendations have been made above:
  • Develop technical guidance notes, separate from additional quality criteria
  • Identify specific areas where transparency of evaluation designs and methods is essential, for possible inclusion in DFID QA templates, and the like
  • Seek and use opportunities to test out the relevance of different evaluation criteria, in terms of their effects on evaluation use
PS: This text was the basis of one of the presentations to DFID staff (and others) in a workshop on 7th October 2011 on the subject of “Developing a broader range of rigorous designs and methods for impact evaluations”. The views expressed above are my own and should not be taken to reflect the views of either DFID or others involved in the exercise.


Sunday, September 04, 2011

Relative rather than absolute counterfactuals: A more useful alternative?


Background

The basic design of a randomised control trial (RCT) involves comparisons of two groups: an intervention (or “treatment”) group and a control group, at two points of time, before an intervention begins and after the intervention ends. The expectation (hypothesis) is that there will be a bigger change on an agreed impact measure in the intervention group than in the control group. This hypothesis can be tested by comparing the average change in the impact status of members of the two groups, and applying a statistical test to establish that this difference was unlikely to be a chance finding (e.g. less than 5% probability of being a chance difference). The two groups are made comparable by randomly assigning participants to both groups. The types of comparisons involved are shown in this fictional example below.


                        A. Intervention group           B. Control group
Before intervention     Average income per household    Average income per household
                        = $1000/year (N = 500)          = $1000/year (N = 500)
After intervention      Average income per household    Average income per household
                        = $1500/year (N = 500)          = $1200/year (N = 500)
Difference over time    $500                            $200

Difference between changes in A and B = $300

[PS: See Comment 3 below re this table]

This method allows a comparison with what could be called an absolute counterfactual: what would have happened if there was no intervention.
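
The arithmetic behind the table can be written out as a short Python sketch. The figures are the fictional ones used above; the variable names are only illustrative, and no statistical test of the difference is included.

```python
# Fictional figures from the table: average household income per year, in dollars.
before = {"intervention": 1000, "control": 1000}
after = {"intervention": 1500, "control": 1200}

# Change over time within each group.
change = {group: after[group] - before[group] for group in before}

# The estimated effect is the difference between the two changes.
difference_in_changes = change["intervention"] - change["control"]

print(change)                 # {'intervention': 500, 'control': 200}
print(difference_in_changes)  # 300
```

The control group's change of $200 stands in for what would have happened without the intervention, so only the remaining $300 is attributed to it.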

Note that only the impact indicator is measured, there is no measurement of the intervention. This is because the intervention is assumed to be the same across all participants in the intervention group. This assumption is reasonable with some development interventions, such as those involving financial or medical activities (e.g. cash transfers or de-worming). Some information based interventions, using radio programs or the distribution of booklets, can also be assumed to be available to all participants in a standardised form. Where delivery is standardised it makes sense to measure the average impacts on the intervention and control group, because significant variations in impact are not expected to arise from the intervention.

Alternate views

There are however many development interventions where delivery is not expected to be standardised and where the opposite is the case, that delivery is expected to be customised. Here the agent delivering the intervention is expected to have some autonomy and to use that autonomy to the benefit of the participants. Examples of such agents would include community development workers, agricultural extension workers, teachers, nurses, midwives, doctors, plus all their supervisors. On a more collective level would be providers of training to such groups working in different locations. Also included would be almost all forms of technical assistance provided by development agencies.

In these settings measurement of the intervention, as well as the actual impact, will be essential before any conclusions can be drawn about attribution – the extent to which the intervention caused the observed impacts. Let us temporarily assume that it will be possible to come up with a measurement of the degree to which an intervention has been successfully implemented, a quality measure of some kind. It might be very crude, such as number of days an extension worker has spent in villages they are responsible for, or it might be a more sophisticated index combining multiple attributes of quality (e.g. weighted checklists).

Data on implementation quality and observed impact (i.e. an After minus a Before measure) can now be brought together in a two dimensional scatter plot. In this exercise there is no longer a control group, just an intervention group where implementation has been variable but measured. This provides an opportunity to explore the relative counterfactual, what would have happened if implementation was less successful, and less successful still, etc. In this situation we could hypothesise that if the intervention did cause the observed impacts then there would be a statistically significant correlation between the quality of implementation and observed impact. In place of an absolute counterfactual obtained via the use of control group, where there was no intervention we have relative counterfactuals, in the form of participants exposed to interventions of different qualities. In place of an average, we have a correlation.

There are a number of advantages to this approach. Firstly, with the same amount of evaluation funds available, the number of intervention cases that can be measured can be doubled, because a control group is no longer being used. In addition to obtaining (or not) a statistically significant correlation, we can also identify the strength of the relationship between the intervention and the impact. This will be visible in the slope of the regression line. A steep slope[1] would imply that small improvements in implementation can make big improvements in observed impacts and vice versa. If a non-linear relationship is found then the shape of a best fitting regression line might also be informative, about where improvements will generate more versus less improvement.
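
As a minimal sketch of this idea, the following Python computes a correlation coefficient and the slope of a simple least-squares regression line. The data are invented for illustration (days per month an extension worker spent in each village, and the change in household income), and the function name is hypothetical. A real analysis would also need a significance test and a check for non-linearity.

```python
from math import sqrt

def correlation_and_slope(x, y):
    """Return (Pearson r, slope, intercept) for a least-squares line y = slope*x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    r = sxy / sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return r, slope, intercept

# Hypothetical data: implementation quality (x axis) and observed impact (y axis).
quality = [2, 4, 5, 7, 8, 10]
impact = [100, 250, 200, 400, 450, 600]

r, slope, intercept = correlation_and_slope(quality, impact)
print(f"correlation = {r:.2f}, slope = {slope:.1f} per unit of quality")
```

The correlation indicates how consistently impact rises with implementation quality, while the slope indicates how much extra impact each unit of quality appears to buy.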

Another typical feature of scatter plots is outliers. There may be some participants (individuals or groups of) who have received a high quality intervention, but where the impact has been modest, i.e. a negative outlier. Conversely, there may be some participants who have received a poor quality intervention, but where the impact has been impressive, i.e. a positive outlier. These are both important learning opportunities, which could be explored via the use of in-depth case studies. But ideally these case studies would be informed by some theory, directing us where to look.
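
One simple way of picking out such cases, sketched below in Python, is to fit a regression line and flag the cases that sit furthest from it. The data are invented, the function names are hypothetical, and the cut-off of 1.5 standard deviations of the residuals is an arbitrary assumption rather than an established rule.

```python
from statistics import mean, pstdev

def fit_line(x, y):
    """Least-squares slope and intercept for y = slope*x + intercept."""
    mean_x, mean_y = mean(x), mean(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def find_outliers(x, y, threshold=1.5):
    """Flag cases whose distance from the line exceeds threshold * residual spread."""
    slope, intercept = fit_line(x, y)
    residuals = [b - (slope * a + intercept) for a, b in zip(x, y)]
    spread = pstdev(residuals)
    flagged = []
    for index, residual in enumerate(residuals):
        if abs(residual) > threshold * spread:
            # Above the line: more impact than quality predicts (positive outlier).
            flagged.append((index, "positive" if residual > 0 else "negative"))
    return flagged

# Hypothetical cases: implementation quality and observed impact.
quality = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
impact = [10, 60, 30, 40, 50, 60, 70, 80, 40, 100]

print(find_outliers(quality, impact))  # [(1, 'positive'), (8, 'negative')]
```

Here case 1 did much better than its low implementation quality would predict, and case 8 did much worse than its high quality would predict. These are the cases where a case study might be most informative.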

Evaluators sometimes talk about implementation failure versus theory failure. In her Genuine Evaluation blog Patricia Rogers gives an interesting example from Ghana, involving the distribution of Vitamin A tablets to women in order to reduce pregnancy related mortality rates. Contrary to previous findings, there was no significant impact. But as Patricia noted, the researchers appeared to have failed to measure compliance, i.e. whether all the women actually took the tablets given to them! This appears to be a serious case of implementation failure, in that the implementers could have designed a delivery mechanism that ensured compliance. Theory failure would be where our understanding of how Vitamin A affects women’s health appears to be faulty, because expected impacts do not materialise, after women have taken the prescribed medication.

In the argument developed so far, we have already proposed measuring quality of implementation, rather than making any assumptions about how it is happening. However, it is still possible that we might face “implementation measurement failure”. In other words, there may be some aspect of the implementation process that was not captured by the measure used, and which was causally connected to the conspicuous impact, or lack thereof.  A case study, looking at the implementation process in the outlier cases might help us identify the missing dimension. Re-measurement of implementation success incorporating this dimension might produce a higher correlation result. If it did not, then we might by default then have a good reason to believe we are now dealing with theory failure, i.e. a lack of understanding of how an intervention has its impact. Again, case studies of the outliers could help generate hypotheses about these. Testing these out is likely to be more expensive than testing alternate views on implementation processes because data will be less readily at hand. For reasons of economy and practicality implementation failure should be our first suspect.

In addition to having informative outliers to explore, the use of a scatter plot enables us to identify another potential outcome not readily available via the use of control groups, where the focus is on averages. In some programmes poor implementation may not simply lead to no impact (i.e. no difference between the average impact of control and intervention groups). Poor implementation may lead to negative impacts. For example, a poorly managed savings and credit programme may lead to increased indebtedness in some communities. In a standard comparison between intervention and control groups this type of failure would usually need to be present in a large number of cases before it became visible in a net negative average impact. In a scatter plot any negative cases would be immediately visible, including their relationship to implementation quality.

To summarise so far, the assumption about standardised delivery of an intervention does not fit the reality of many development programmes. Replacing assumptions by measurement will provide a much richer picture of the relationship between an intervention and the expected impacts. Overall impact can still be measured, by using a correlation coefficient. In addition we can see the potential for greater impact present in existing implementation practice (the slope of the regression line). We can also find outliers that can help improve our understanding of implementation and impact process. We can also quickly identify negative impacts, as well as the absence of any impact.

Perhaps more important still, the exploration of internal differences in implementation means that the autonomy of development agents can be valued and encouraged. Local experimentation might then generate more useful outliers, and not be seen simply as statistical noise. This is experimentation with a small e, of the kind advocated by Chris Blattman in his presentation to DFID on 1st September 2011, and of a kind long advocated by most competent NGOs.

Given this discussion is about counterfactuals, it might be worth considering what would happen if this implementation measurement based approach was not used, where an intervention is being delivered in a non-standard way. One example is a quasi-experimental evaluation of an agricultural project in Tanzania, described in Oxfam GB’s paper on its Global Performance Framework[2]. “Oxfam is working with local partners in four districts of Shinyanga Region, Tanzania, to support over 4,000 smallholder farmers (54% of whom are women) to enhance their production and marketing of local chicken and rice. To promote group cohesion and solidarity, the producers are encouraged to form themselves into savings and internal lending communities. They are also provided with specialised training and marketing supporting, including forming linkages with buyers through the establishment of collection centres.” This is a classic case where the staff of the partner organisations would need to exercise considerable judgement about how to best help each community. It is unlikely that each community was given a standard package of assistance, without any deliberate customisations or any unintentional quality variations along the way. Nevertheless, the evaluation chose to measure the impact of the partner’s activities on changes in household incomes and women’s decision making power, by comparing the intervention group with a control group. Results of the two groups were described in terms of “% of targeted households living on more than £1.00 per day per capita”, and “% of supported women are meaningfully involved in household decision making”. In using these measures to make comparisons Oxfam GB has effectively treated quality differences in the extension work as noise to be ignored, rather than as valuable information to be analysed. In the process they have unintentionally devalued the work of their partners.

A similar problem can be found elsewhere in the same document where Oxfam GB describes their new set of global outcome indicators. The Livelihood Support indicator is: % of targeted households living on more than £1.00 per day per capita (as used in the Tanzania example). In four of the six global indicators the unit of analysis is people, the ultimate intended beneficiaries of Oxfam GB’s work. However, the problem is that in most cases Oxfam GB does not work directly with such people. Instead Oxfam GB typically works with local NGOs who in turn work with such groups. In claiming to have increased the % of targeted households living on more than £1.00 per day per capita Oxfam GB is again obscuring through simplification the fact that it is those partners who are responsible for these achievements. Instead, I would argue that the unit of analysis for many of Oxfam GB’s global outcome indicators should be the behaviour and performance of its partners. Its global indicator for Livelihood Support should read something like this: “x % of Oxfam GB partners working on rural livelihoods have managed to double the proportion of targeted households living on more than £1.00 per day per capita”. Credit should be given where credit is due. However, these kinds of claims will only be possible if and where Oxfam GB encourages partners to measure their implementation performance as well as changes taking place in the communities they are working with, and then to analyse the relationship between both measures.

Ironically, the suggestion to measure implementation sounds rather unfashionable and regressive, because we are often reading how in the past aid organisations used to focus too much on outputs and that now they need to focus more on impacts. But in practice it is not an either/or question. What we need is both, and both done well. Not something quickly produced by the Department of Rough Measures.

PS 4th September 2011: I forgot to discuss the issue of whether any form of randomisation would be useful where relative counterfactuals are being explored. In an absolute counterfactual experiment the recipients’ membership of control versus intervention groups is randomised. In a relative counterfactual “experiment” all participants will receive an intervention so there is no need to randomly assign participants to control versus intervention groups. But randomisation could be used to decide which staff worked with which participants (or vice versa), for example where a single extension worker is assigned to a given community. This would be less easy where a whole group of staff, e.g. in a local health centre or local school, is responsible for the surrounding community.

Even where randomisation of staff was possible this would not prevent external factors influencing the impact of the intervention. It could be argued that the groups experiencing least impact and the poorest quality implementation were doing so because of the influence of an independent cause (e.g. geographical isolation) that is not present amongst the groups experiencing bigger impacts and better quality implementation. Geographical isolation is a common external influence in many rural development projects, one which is likely to make implementation of a livelihood initiative more difficult as well as making it more difficult for the participants to realise any benefits, e.g. through sales of new produce at a regional market. Other external influences may affect the impact but not the intervention, e.g. subsequent changes in market prices for produce. However, identifying the significance of external influences should be relatively easy, by making statistical tests of the difference in their prevalence in the high and low impact groups. This does of course require being able to identify potential external influences, whereas with randomised control trials (RCTs) no knowledge of other possible causes is needed (their influence is assumed to be equally distributed between control and intervention groups). However, this requirement could be considered as a "feature" rather than a "bug", because exploration of the role of other causal factors could inform and help improve implementation. On the other hand, the randomisation of control and intervention groups could encourage management's neglect of the role of other causal factors. There are clearly trade-offs here between competing evaluation quality criteria of rigour and utility.
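
A test of this kind could look something like the Python sketch below, which compares how often an external influence (here geographical isolation) is found in the low impact and high impact groups. Fisher's exact test is used only as one plausible choice of test, the counts are invented, and the function name is hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows: low impact group, high impact group.
    Columns: external influence present, external influence absent.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def probability(k):
        # Hypergeometric probability of k "present" cases in the first row.
        return comb(row1, k) * comb(row2, col1 - k) / comb(total, col1)

    observed = probability(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    return sum(
        probability(k)
        for k in range(lowest, highest + 1)
        if probability(k) <= observed * (1 + 1e-9)
    )

# Hypothetical counts: 8 of 10 low impact communities are isolated,
# compared with 1 of 6 high impact communities.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
print(f"p = {p_value:.3f}")  # p = 0.035
```

A small p value would suggest that isolation is unevenly distributed between the two groups, and so is worth investigating as a competing or contributing explanation.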


[1] i.e. with observed impact on the y axis and intervention quality on the x axis

Tuesday, August 16, 2011

Evaluation methods looking for projects or projects seeking appropriate evaluation methods?


A few months ago I carried out a brief desk review of 3ie's approach to funding impact evaluations, for AusAID's Office of Development Effectiveness. One question that I did not address was "Broadly, are there other organisations providing complementary approaches to 3ie for promoting quality evaluation to fill the evidence gap in international development?"

While my report on 3ie examined the pros and cons of 3ie's use of experimental methods as a preferred evaluation design, it did not look at the question of appropriate institutional structures for supporting better evaluations. Yet, you could argue that choices made about institutional structures could have more consequences than those involving the specifics of particular evaluation methods. The question quoted above seems to contain a tacit assumption about institutional arrangements, i.e. that improvements in evaluation can best be promoted by funding externally located specialist centres of expertise, like 3ie. This kind of assumption seems questionable, for two sets of reasons that I explain below. One is to do with the results they generate, the other concerns the neglected potential of an alternative.

In the "3ie" (Open Window) model anyone can submit a proposal for an evaluation of a project implemented by any organisation. This approach is conducive to 'cherry picking' of evaluable (by experimental methods) projects and the collection of evaluations representing a miscellany of types of projects, about which it will be hard to generate useful generalisations. It may also leave an unknown number of other projects unevaluated, because they don't fit the prevailing method preferences.

In the alternative scenario the funding of evaluations would not be outsourced to any specialist centre(s). Instead, an agency like DFID would identify a portfolio of projects needing evaluation. For example, those initiatives focusing on climate change adaptation. DFID would call for proposals for their evaluation and then screen those proposals, largely as it does now, but perhaps seeking a wider range of bidders.

Unlike the present process, it would then offer funding to the bidders who had provided the best proposals (say, the best 50%) to develop those proposals in more detail. At present there is no financial incentive to do so, and any time and money already spent on developing proposals is unlikely to be recompensed, because only one bidder will get the contract.

The expected result of this "proposal development" funding would be revised and expanded proposals that outlined the bidder's proposed methodology in considerable detail, in something like an inception report. All the bidders involved at this stage would need access to a given set of project documents and at least one collective meeting with the project holders.

The revised proposals would then be assessed by DFID, but with a much greater weighting towards the technical content of the proposal than exists at present. This second level assessment would benefit from the involvement of external specialists, as in the 3ie model. DFID Evaluation Department already does this in the case of some evaluations through the use of a quality assurance panel. The best proposal would then be funded as normal, and the evaluation then carried out.

Both the winning and losing technical proposals would then be put in the public domain via the DFID website in order to encourage cross fertilisation of ideas, external critiquing and public accountability. This is not the case at present. All bidders operate in isolation. There are no opportunities to learn from each other. The same appears to be the case with 3ie: the full text of technical proposals is not publicly available (even for those that were successful). Making the proposals public would mean that the proposal development funding had not been wasted, even where the proposals were not successful.

In summary, with the "external centre of expertise" model there is a risk that methodological preferences are the driving force behind what gets evaluated. The alternative is a portfolio-of-projects led approach, where interim funding support is used to generate a diversity of improved evaluation proposals, which are later made accessible to all and which can then inform future proposals.

A meta-evaluation might be useful to test the efficacy of this project-led approach. Other matched kinds of projects also needing evaluation could continue to be funded by the pre-existing mechanisms (e.g. in-country DFID offices). Pair comparisons could later be made of the quality of the evaluations that were subsequently produced by the two different mechanisms. Although it is likely there would be multiple points of difference, it should be possible for DFID, and any other stakeholders, to prioritise their relative importance, and come to an overall judgement of which has been most useful.

PS: 3ie seems to be heading in this direction, to some extent. 3ie now have a Policy Window where they have recently sought applications for the evaluation of projects belonging to a specific portfolio ("Poverty Reduction Interventions in Fiji" implemented by the Government of Fiji). Funding is available to cover costs of the successful bidder (only) to visit Fiji "to develop a scope of work to be included in a subsequent Request for Proposal (RFP) to conduct the impact evaluation". Subject to 3ie's approval of the developed proposal 3ie will then fund the implementation of the evaluation by the bidder. The success of this approach will be worth watching, especially its ability to ensure the evaluation of the whole portfolio of projects (which is likely to depend on 3ie having some flexibility about the methodologies used). However, I am perhaps making a risky assumption here, that the projects within the portfolio to be evaluated have not already been pre-selected on the grounds of their suitability to 3ie's preferred approach.




PS: I have been reading the [Malawi] CIVIL SOCIETY GOVERNANCE FUND - TECHNICAL SPECIFICATION REFERENCE DOCUMENT FOR POTENTIAL SERVICE PROVIDERS. In the section on the role of the Independent Evaluation Agent, it is stated that the agent will be responsible for "The commissioning and coordination of randomised control trials for two large projects funded during the first or second year of granting." This specification appears to have been made prior to the funding of any projects. So, will the fund managers feel obliged to find and fund two large projects that will be evaluable by RCTs? Fascinating, in a bizarre kind of way.

Friday, April 08, 2011

Models and reality: Dialogue through simulation


I have been finalising preparations for a training workshop on network visualisation as an evaluation tool. In the process I came across this "Causality Map for the Enhanced Evaluation Framework", for Budget Support activities.
On the surface this diagram seems realistic: budget support is a complex process. However, I will probably use this diagram to highlight what is often missing in models of development interventions. Like many others, it lacks any feedback loops, and as such it is a long way from the reality it is trying to represent in summary form. Using a distinction that is being used more widely these days (and despite my reservations about it), I think this model qualifies as complicated but not complex. If you were to assign a numerical value to each node and to each connecting relationship, the value generated at the end of the process (on the right) would always be the same.

The picture changes radically as soon as you include feedback loops, which is much easier to do when you use network rather than chain models (and where you give up using one dimension in the above type of diagram to represent the passage of time). Here below is my very simple example. This model represents five actors. They all start with a self-esteem rating of 1, but their subsequent self-esteem depends on the influence of the others they are connected to (represented by positive or negative link values, [randomly allocated]) and the self-esteem of those others.
You can see what happens when self-esteem values are recalculated to take into account those each actor is connected to, in this Excel file (best viewed with the NodeXL plugin). After ten iterations, Actor 0 has the highest self-esteem and Actor 2 has the lowest. After 20 iterations, Actor 2 has the highest self-esteem and Actor 1 has the lowest. After 30 iterations, Actor 1 has the highest self-esteem and Actor 0 has the lowest. With more and more iterations the self-esteem of the actors involved might stabilise at a particular set of values, or it might repeat patterns already seen, or maybe neither.
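For readers without Excel, the same kind of recalculation can be sketched in a few lines of Python. This is only an illustration, not a reproduction of the spreadsheet: the update rule (each actor's new self-esteem is the sum of the others' current self-esteem multiplied by the link values), the assumption that every actor is linked to every other, the range of the random link values, and the function names are all assumptions made for the example. The rankings it prints will therefore not match those reported above.

```python
import random


def make_links(n_actors, seed):
    """Randomly allocate a link value in [-1, 1] for each ordered pair of actors.

    Assumption: every actor is linked to every other actor.
    """
    rng = random.Random(seed)
    return {
        (src, dst): rng.uniform(-1.0, 1.0)
        for src in range(n_actors)
        for dst in range(n_actors)
        if src != dst
    }


def step(esteem, links):
    """Recalculate every actor's self-esteem from the others' current values.

    Assumed rule: new value = sum of (link value * other actor's self-esteem).
    """
    n = len(esteem)
    return [
        sum(links[(src, dst)] * esteem[src] for src in range(n) if src != dst)
        for dst in range(n)
    ]


def simulate(n_actors=5, iterations=30, seed=0):
    """Return the list of self-esteem values at iteration 0, 1, ..., iterations."""
    links = make_links(n_actors, seed)
    esteem = [1.0] * n_actors  # all actors start with a self-esteem rating of 1
    history = [esteem]
    for _ in range(iterations):
        esteem = step(esteem, links)
        history.append(esteem)
    return history


def ranking(esteem):
    """Actors ordered from highest to lowest self-esteem."""
    return sorted(range(len(esteem)), key=lambda i: esteem[i], reverse=True)


if __name__ == "__main__":
    history = simulate()
    for t in (10, 20, 30):
        print("iteration", t, "ranking (highest first):", ranking(history[t]))
```

Changing the `seed` argument gives a different random allocation of link values, which is a quick way to see how much the ordering of actors at any given iteration depends on the particular network.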

There are two important points to be made here. The first is the dramatic effect of introducing feedback loops, in even the simplest of models. The aggregate results are not easily predictable, but they can be simulated. The second is that the nature of the impact seen, even in this very small complex system, depends on the time period under examination. Impact seen at iteration 10 is different from that at iteration 20, and different again at iteration 30. In the words of the proponents of systems perspectives on evaluation, what is seen depends on the "perspective" that is chosen (Williams and Hummelbrunner, 2009).
PS 1: Michael Woolcock has written about the need to pay more attention to a related issue, captured by the term "impact trajectory". He argues that: "...in virtually all sectors, the development community has a weak (or at best implicit or assumed) understanding of the shape of the impact trajectories associated with its projects, and even less understanding of how these trajectories vary for different kinds of project operating in different contexts, at different scales and with varying degrees of implementation effectiveness; more forcefully, I argue that the weakness of this knowledge greatly compromises our capacity to make accurate statements about project impacts, irrespective of whether they are inspired by ‘demand’ or ‘supply’ side imperatives, and even if they have been subject to the most deftly implemented randomised trial"
PS 1.1: Some examples: I recall that it has been argued that there is a big impact on households when they first join savings and credit groups, but the continuing impact drops down to a much more modest level thereafter. On the other hand, the impact of girls completing primary school may be greatest when it reaches through to the next generation, to the number of their children and their survival rates.
There is one downside to my actors' self-esteem model, which is its almost excessive sensitivity. Small changes to any of the node or link values can significantly change the longer-term impacts. This is because this simple model of a social system has no buffers or "slack". Buffers could take the form of accumulated attributes of the actors (like an individual's self-confidence arising from their lifetime experience, or a firm's accumulated inventory) and could also be provided via the wider context (like individuals having access to a wider network of friends, or firms having alternative sources of supply). This model could clearly be improved upon.
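One hedged way to picture such a buffer is to let each actor carry forward a share of their previous self-esteem, so that the influence of others only moves them part of the way. The sketch below is an assumption about how a buffer might be modelled, not something taken from the Excel file: the `buffer` parameter, the blending rule and the function names are all hypothetical.

```python
def buffered_step(esteem, links, buffer=0.5):
    """Blend each actor's previous self-esteem with the influence of the others.

    buffer=0.0 means no slack (actors are driven entirely by the others);
    buffer=1.0 means actors never change. Missing links count as zero influence.
    """
    n = len(esteem)
    updated = []
    for dst in range(n):
        influence = sum(
            links.get((src, dst), 0.0) * esteem[src]
            for src in range(n)
            if src != dst
        )
        updated.append(buffer * esteem[dst] + (1.0 - buffer) * influence)
    return updated


def run(links, n_actors, iterations, buffer):
    """Start every actor at 1.0 and apply the buffered update repeatedly."""
    esteem = [1.0] * n_actors
    for _ in range(iterations):
        esteem = buffered_step(esteem, links, buffer)
    return esteem


if __name__ == "__main__":
    # Two actors with hypothetical link values: 0 encourages 1, 1 discourages 0.
    links = {(0, 1): 0.5, (1, 0): -0.5}
    for b in (0.0, 0.5, 0.9):
        print("buffer", b, "->", run(links, n_actors=2, iterations=10, buffer=b))
```

Comparing runs with different `buffer` values, or with one link value nudged slightly, gives a rough feel for how much slack is needed before small changes stop reshaping the longer-term results.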
PS 2: I came across this quote by Duncan Watts, in a commentary on his latest book "Everything is obvious": "when people base their decisions in part on what other people are deciding, collective outcomes become highly unpredictable". That is exactly what is happening in the self-esteem model above. Duncan Watts has written extensively on networks.
Here below is another unidirectional causal model, available on the Donor Committee for Enterprise Development website

What I like about this example is that visitors to the website can click on the links (but not in the copy I have made above) and be taken to other pages where they will be given a detailed account of the nature of the causal processes represented by those links. This is exactly what the web was designed for. Visitors can also click on any of the boxes at the bottom and find out more about the activities that input into the whole process.

The inclusion of a feedback loop in this diagram would not be too difficult to imagine. For example, one could run from the top box back to one of the earlier boxes, e.g. New firms start / register. This positive feedback loop would quickly produce escalating results further up the diagram. Ideally, we would recognise that this type of outcome (simple continuous escalation) does not fit very well with our perception of what happens in reality. That awareness would then lead to further improvements to the model, generating more realistic behaviour.
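The contrast between a purely unidirectional chain and one with a single positive feedback loop can be shown with a toy calculation. The link values, the `feedback` parameter and the function name below are invented for illustration and do not come from the DCED diagram.

```python
def run_chain(link_values, start=1.0, feedback=0.0, rounds=5):
    """Pass a value along a causal chain, once per round.

    With feedback=0.0 the chain is unidirectional and every round gives the
    same end result. With feedback > 0, a share of the end result is added
    back to the first box before the next round.
    """
    results = []
    first_box = start
    for _ in range(rounds):
        value = first_box
        for link in link_values:
            value *= link  # each link scales the value passed up the chain
        results.append(value)
        first_box = start + feedback * value
    return results


if __name__ == "__main__":
    links = [0.5, 2.0, 1.5]  # hypothetical link strengths
    print("no feedback:  ", run_chain(links))
    print("with feedback:", run_chain(links, feedback=1.0))
```

Without feedback the end value is identical in every round; with the loop switched on it grows each round, which is the simple continuous escalation described above and the cue that the model needs some constraint or negative feedback to look more like reality.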
PS 24 May 2011:  In their feminist perspective on monitoring and evaluation Batliwala and Pittman have suggested that we need "to develop a “theory of constraints” to accompany our “theory of change” in any given context..." They noted that "… most tools do not allow for tracking negative change, reversals, backlash, unexpected change, and other processes that push back or shift the direction of a positive change trajectory. How do we create tools that can capture this “two steps forward, one step back” phenomenon that many activists and organizations acknowledge as a reality and in which large amounts of learning lay hidden? In women’s rights work, this is vital because as soon as advances seriously challenge patriarchal or other social power structures, there are often significant reactions and setbacks. These are not, ironically, always indicative of failure or lack of effectiveness, but exactly the opposite— this is evidence that the process was working and was creating resistance from the status quo as a result .”
But it is early days. Many development programs do not yet even have a decent unidirectional causal model of what they are trying to do. In this context, the inclusion of any sort of feedback loop would be a major improvement. As shown above, the next step that can be taken is to run simulations of those improved models by inserting numerical values in the links and functions/equations in nodes of those models. In the examples above we can see how simulations can help improve models by showing how their results do not fit with our own observations of reality. Perhaps in the future they will be seen as a useful form of pre-test, worth carrying out at the earliest stages of an evaluation.
PS 3: This blog was prompted by comments to me by Elliot Stern on the relevance of modeling and simulation to evaluation, on which I hope he has more to say.

PS 4: I am struggling through Manuel DeLanda's Philosophy and Simulation: The emergence of synthetic reason (2011), which, being about simulations, relates to the contents of this post.
PS 5: I have just scanned Funnell and Rogers' very useful new book, "Purposeful Program Theory", and found 67 unidirectional models but only 15 models that have one or more feedback loops (that is, 23%). This is quite disappointing. So is the advice on the use of feedback loops: "We advise against using so many feedback loops that the logic becomes meaningless. When feedback loops are incorporated, a balance needs to be struck between including all of them (because everything is related to everything else) and capturing some important ones. Showing that everything leads to everything else can make an outcome chain very difficult to understand - what we call a spaghetti junction model. Nevertheless some feedback loops can be critical to the success of a program and should be included ..." (p187)
Given the scarcity of models with feedback links even in this book, the risk of having too many feedback loops sounds like "a problem we would like to have". And I am not sure why an excess of feedback links should be any more probable than an excess of forward links. The concern about the understandability of models with feedback loops is, however, more reasonable, for reasons I have outlined above. When you introduce feedback loops, what were either simple or complicated models start to exhibit complex behaviour. Welcome to something that is a bit closer to the real world.
PS6:  "As change makers, we should not try to design a better world. We should make better feedback loops", the text of the last slide in Owen Barder's presentation "Development Complexity and Evolution"
PS7: In Funnell and Rogers' book (page 486) they describe how the Intergovernmental Panel on Climate Change (IPCC) "recognised that controlled experimentation with the climate system in which the hypothesised agents of change are systematically varied in order to determine the climate's sensitivity to these agents...[is] clearly not possible". Different models of climate change were developed with different assumptions about possible contributing factors. "The observed patterns of warming, including greater warming over land than over the ocean, and their changes over time, are simulated only by the models that include anthropogenic forcing. No coupled global climate change model that has used natural forcing only has reproduced the continental warming trends in individual continents (except Antarctica) over the second half of the 20th century" (IPCC, 2001, p39)




Tuesday, March 22, 2011

A submission to the UK Independent Commission for Aid Impact (ICAI)

Background:
"ICAI is holding a public consultation to understand which areas of UK overseas aid stakeholder groups and the public believe the Commission should report on in its first three years. The consultation will run for 12 weeks from the 14th January until the 7th April 2011.
Click here to respond to the consultation
If you would like to read further information on ICAI, the consultation and the aid budget, please click on Consultation document.
If you would like to consider the questions in detail before responding, please click on Downloadable questions. You will need to access the online response form to respond".
Two initial comments:

1. An online survey is a fairly narrow approach to a public consultation. There are many other options, even if the ICAI is limited to those that can take place online rather than face to face.

2. The focus of the consultation is also narrow, i.e. "which areas of UK overseas aid stakeholder groups and the public believe the Commission should report on in its first three years". Equally important is how those areas of aid should be reported on.

On widening the process of consultation

1. All pages on the ICAI website should have associated Comment facilities, which visitors can make use of. More and more websites are being constructed with blog-type features such as these, because website managers these days expect to be interacting with their audience, not simply broadcasting. Built into such a facility would be an assumption that there will be an ongoing process of consultation, not a one-off event.

2. The raw data from the current online survey should be made publicly available, not just summaries. This is already possible, with minimal extra work required, because the survey provider (SurveyMonkey.com) is able to provide a public link to the survey results, with multiple options for reading, filtering and downloading the data. The more people who can access the data, the more value that might be obtained from it. However, the most important reason for doing so is that the ICAI should be seen to be maximally transparent in its operations. Transparency will help build trust and confidence in the work and judgements of the commission.

3. Although late in the day, the ICAI should edit the Consultation page to include an invitation to people to submit their own submissions using their own words and structures.

4. The ICAI website should include an option for visitors to sign up for email notification of any changes to the website, including the main content pages and any comments made on those pages by visitors.

5. The ICAI should be open about what it will be open about. It should develop a policy on transparency and place that policy on its website. Disclosure policies are now commonplace for many large international organisations like the World Bank and IMF, and transparency in regard to international aid is high on the agenda of many governments, including the UK. Having such a policy does not mean everything about the workings of the ICAI must be made public, but it would typically require a default assumption of openness, along with specified procedures and conditions relating to when and where information will not be disclosed.

On widening the content of consultation

1. The ICAI should be aware, if not already, that there continues to be intense debate about the best ways of assessing the value of international aid. This debate exists because of the multiplicity of purposes behind aid programs, the many and varied types of aid, the enormous diversity of contexts where it is provided, the wide range of people and organisations involved in its delivery, as well as some genuinely difficult issues of measurement and analysis. There are no simple and universally applicable solutions. Value for Money provides only a partial view of aid impact, and is only partially measurable. Randomised Controlled Trials (RCTs) can be useful for simple replicable interventions in comparable conditions, but many aid interventions are complex. The best immediate response in these circumstances is for the ICAI to be maximally transparent about the methods being used to assess aid interventions, and to be open to the wider debate.

A good starting point would be for the ICAI to make public: (a) the Terms of Reference for the "Contracted Out Service Provider" who will do the assessment work for the ICAI, and (b) the tendered proposal put forward by the winning bidder. Both of these documents refer to ways and means of doing the required work. It is also expected that there will be periodic reviews of the work of the winning bidder. The ToRs and reports of those reviews should also be publicly disclosed on the ICAI website. Finally, all the "evaluations, reviews and investigations" to be carried out by the winning bidder on behalf of the ICAI should be publicly disclosed, as hopefully has already been agreed.

2. It would be useful, if only to help ensure that the ICAI itself delivers "Value for Money", if the ICAI could clarify not only how its role will differ from that of the DFID Evaluation Department and multi-agency initiatives like 3ie, but also how it will cooperate with them to exploit any complementarities and possible synergies in their work. Complete and utter independence could lead to wasteful duplication. In the worst case various wheels could be reinvented.

For example, the OECD's Development Assistance Committee (DAC) has over the years developed a widely agreed set of evaluation criteria; however, these are nowhere to be seen in the ICAI's ToR for the "Contracted Out Service Provider". Instead, Value for Money receives repeated attention, and its definition is sourced to the National Audit Office (NAO). However, the NAO is a member of the Improvement Network, which has provided a wider perspective on Value for Money. Their website notes that effectiveness is part of Value for Money, as well as efficiency and economy. Commenting on effectiveness they note:
"Effectiveness is a measure of the impact that has been achieved, which can be either quantitative or qualitative....Outcomes should be equitable across communities, so effectiveness measures should include aspects of equity, as well as quality. Sustainability is also an increasingly important aspect of effectiveness."
Sustainability is a DAC evaluation criterion which has been around for a decade or more.
------------------------------------------------------------------------------------------------
A link to this blog posting has been emailed to c-robathan@icai.independent.gov.uk.
Other readers of this blog might like to do the same, with their own views.

PS: See Alex Jacobs' March 22nd submission to the ICAI: Advice for the new aid watchdog