Sunday, September 04, 2011

Relative rather than absolute counterfactuals: A more useful alternative?

Background

The basic design of a randomised controlled trial (RCT) involves comparing two groups: an intervention (or “treatment”) group and a control group, at two points in time, before an intervention begins and after the intervention ends. The expectation (hypothesis) is that there will be a bigger change on an agreed impact measure in the intervention group than in the control group. This hypothesis can be tested by comparing the average change in the impact status of members of the two groups, and applying a statistical test to establish that this difference was unlikely to be a chance finding (e.g. less than 5% probability of being a chance difference). The two groups are made comparable by randomly assigning participants to them. The types of comparisons involved are shown in the fictional example below.

                       A. Intervention group             B. Control group
Before intervention    Average income per household      Average income per household
                       = $1000/year (N = 500)            = $1000/year (N = 500)
After intervention     Average income per household      Average income per household
                       = $1500/year (N = 500)            = $1200/year (N = 500)
Difference over time   $500                              $200

Difference between changes in A and B = $300
[PS: See Comment 3 below re this table]
This method allows a comparison with what could be called an absolute counterfactual: what would have happened if there was no intervention.
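
The arithmetic behind this comparison can be sketched in a few lines of Python, using the fictional figures from the table above:

```python
# Difference-in-differences arithmetic for the fictional table above.
# All figures are average household incomes in $ per year.
before = {"intervention": 1000, "control": 1000}
after = {"intervention": 1500, "control": 1200}

change_intervention = after["intervention"] - before["intervention"]
change_control = after["control"] - before["control"]

# The control group's change estimates what would have happened anyway;
# subtracting it isolates the intervention's estimated effect.
effect = change_intervention - change_control
print(change_intervention, change_control, effect)  # 500 200 300
```
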

Note that only the impact indicator is measured; there is no measurement of the intervention. This is because the intervention is assumed to be the same across all participants in the intervention group. This assumption is reasonable with some development interventions, such as those involving financial or medical activities (e.g. cash transfers or de-worming). Some information-based interventions, using radio programs or the distribution of booklets, can also be assumed to be available to all participants in a standardised form. Where delivery is standardised, it makes sense to measure the average impacts on the intervention and control groups, because significant variations in impact are not expected to arise from the intervention.

Alternate views

There are, however, many development interventions where delivery is not expected to be standardised and where the opposite is the case: delivery is expected to be customised. Here the agent delivering the intervention is expected to have some autonomy and to use that autonomy to the benefit of the participants. Examples of such agents would include community development workers, agricultural extension workers, teachers, nurses, midwives, doctors, plus all their supervisors. On a more collective level would be providers of training to such groups working in different locations. Also included would be almost all forms of technical assistance provided by development agencies.

In these settings measurement of the intervention, as well as the actual impact, will be essential before any conclusions can be drawn about attribution – the extent to which the intervention caused the observed impacts. Let us temporarily assume that it will be possible to come up with a measurement of the degree to which an intervention has been successfully implemented, a quality measure of some kind. It might be very crude, such as number of days an extension worker has spent in villages they are responsible for, or it might be a more sophisticated index combining multiple attributes of quality (e.g. weighted checklists).

Data on implementation quality and observed impact (i.e. an After minus a Before measure) can now be brought together in a two-dimensional scatter plot. In this exercise there is no longer a control group, just an intervention group where implementation has been variable but measured. This provides an opportunity to explore the relative counterfactual: what would have happened if implementation was less successful, and less successful still, etc. In this situation we could hypothesise that if the intervention did cause the observed impacts then there would be a statistically significant correlation between the quality of implementation and the observed impact. In place of an absolute counterfactual, obtained via the use of a control group where there was no intervention, we have relative counterfactuals, in the form of participants exposed to interventions of different qualities. In place of an average, we have a correlation.

There are a number of advantages to this approach. Firstly, with the same amount of evaluation funds available, the number of intervention cases that can be measured can be doubled, because a control group is no longer being used. In addition to obtaining (or not) a statistically significant correlation, we can also identify the strength of the relationship between the intervention and the impact. This will be visible in the slope of the regression line. A steep slope[1] would imply that small improvements in implementation can make big improvements in observed impacts, and vice versa. If a non-linear relationship is found then the shape of a best-fitting regression line might also be informative about where improvements in implementation will generate more versus less improvement in impact.

Another typical feature of scatter plots is outliers. There may be some participants (individuals or groups) who have received a high-quality intervention, but where the impact has been modest, i.e. a negative outlier. Conversely, there may be some participants who have received a poor-quality intervention, but where the impact has been impressive, i.e. a positive outlier. These are both important learning opportunities, which could be explored via the use of in-depth case studies. But ideally these case studies would be informed by some theory, directing us where to look.

Evaluators sometimes talk about implementation failure versus theory failure. On her Genuine Evaluation blog, Patricia Rogers gives an interesting example from Ghana, involving the distribution of Vitamin A tablets to women in order to reduce pregnancy-related mortality rates. Contrary to previous findings, there was no significant impact. But as Patricia noted, the researchers appeared to have failed to measure compliance, i.e. whether all the women actually took the tablets given to them! This appears to be a serious case of implementation failure, in that the implementers could have designed a delivery mechanism that ensured compliance. Theory failure would be where our understanding of how Vitamin A affects women’s health appears to be faulty, because expected impacts do not materialise after women have taken the prescribed medication.

In the argument developed so far, we have already proposed measuring quality of implementation, rather than making any assumptions about how it is happening. However, it is still possible that we might face “implementation measurement failure”. In other words, there may be some aspect of the implementation process that was not captured by the measure used, and which was causally connected to the conspicuous impact, or lack thereof. A case study looking at the implementation process in the outlier cases might help us identify the missing dimension. Re-measurement of implementation success incorporating this dimension might produce a higher correlation. If it did not, then we might, by default, have good reason to believe we are now dealing with theory failure, i.e. a lack of understanding of how an intervention has its impact. Again, case studies of the outliers could help generate hypotheses about these. Testing these out is likely to be more expensive than testing alternative views of implementation processes, because data will be less readily at hand. For reasons of economy and practicality, implementation failure should be our first suspect.

In addition to having informative outliers to explore, the use of a scatter plot enables us to identify another potential outcome not readily visible via the use of control groups, where the focus is on averages. In some programmes poor implementation may not simply lead to no impact (i.e. no difference between the average impact of control and intervention groups). Poor implementation may lead to negative impacts. For example, a poorly managed savings and credit programme may lead to increased indebtedness in some communities. In a standard comparison between intervention and control groups this type of failure would usually need to be present in a large number of cases before it became visible in a net negative average impact. In a scatter plot any negative cases would be immediately visible, including their relationship to implementation quality.

To summarise so far, the assumption about standardised delivery of an intervention does not fit the reality of many development programmes. Replacing assumptions by measurement will provide a much richer picture of the relationship between an intervention and the expected impacts. Overall impact can still be measured, by using a correlation coefficient. In addition we can see the potential for greater impact present in existing implementation practice (the slope of the regression line). We can also find outliers that can help improve our understanding of implementation and impact process. We can also quickly identify negative impacts, as well as the absence of any impact.

Perhaps more important still, the exploration of internal differences in implementation means that the autonomy of development agents can be valued and encouraged. Local experimentation might then generate more useful outliers, and not be seen simply as statistical noise. This is experimentation with a small e, of the kind advocated by Chris Blattman in his presentation to DFID on 1st September 2011, and of a kind long advocated by most competent NGOs.

Given this discussion is about counterfactuals, it might be worth considering what would happen if this implementation measurement based approach was not used, where an intervention is being delivered in a non-standard way. One example is a quasi-experimental evaluation of an agricultural project in Tanzania, described in Oxfam GB’s paper on its Global Performance Framework[2]. “Oxfam is working with local partners in four districts of Shinyanga Region, Tanzania, to support over 4,000 smallholder farmers (54% of whom are women) to enhance their production and marketing of local chicken and rice. To promote group cohesion and solidarity, the producers are encouraged to form themselves into savings and internal lending communities. They are also provided with specialised training and marketing supporting, including forming linkages with buyers through the establishment of collection centres.” This is a classic case where the staff of the partner organisations would need to exercise considerable judgement about how best to help each community. It is unlikely that each community was given a standard package of assistance, without any deliberate customisations or any unintentional quality variations along the way. Nevertheless, the evaluation chose to measure the impact of the partners’ activities on changes in household incomes and women’s decision making power, by comparing the intervention group with a control group. Results for the two groups were described in terms of “% of targeted households living on more than £1.00 per day per capita” and “% of supported women are meaningfully involved in household decision making”. In using these measures to make comparisons Oxfam GB has effectively treated quality differences in the extension work as noise to be ignored, rather than as valuable information to be analysed. In the process they have unintentionally devalued the work of their partners.

A similar problem can be found elsewhere in the same document, where Oxfam GB describes its new set of global outcome indicators. The Livelihood Support indicator is: % of targeted households living on more than £1.00 per day per capita (as used in the Tanzania example). In four of the six global indicators the unit of analysis is people, the ultimate intended beneficiaries of Oxfam GB’s work. However, the problem is that in most cases Oxfam GB does not work directly with such people. Instead Oxfam GB typically works with local NGOs, who in turn work with such groups. In claiming to have increased the % of targeted households living on more than £1.00 per day per capita, Oxfam GB is again obscuring through simplification the fact that it is those partners who are responsible for these achievements. Instead, I would argue that the unit of analysis for many of Oxfam GB’s global outcome indicators should be the behaviour and performance of its partners. Its global indicator for Livelihood Support should read something like this: “x% of Oxfam GB partners working on rural livelihoods have managed to double the proportion of targeted households living on more than £1.00 per day per capita.” Credit should be given where credit is due. However, these kinds of claims will only be possible if and where Oxfam GB encourages partners to measure their implementation performance as well as the changes taking place in the communities they are working with, and then to analyse the relationship between both measures.

Ironically, the suggestion to measure implementation sounds rather unfashionable and regressive, because we are often reading how in the past aid organisations used to focus too much on outputs and that now they need to focus more on impacts. But in practice it is not an either/or question. What we need is both, and both done well. Not something quickly produced by the Department of Rough Measures.

PS 4th September 2011: I forgot to discuss the issue of whether any form of randomisation would be useful where relative counterfactuals are being explored. In an absolute counterfactual experiment the recipients’ membership of control versus intervention groups is randomised. In a relative counterfactual “experiment” all participants will receive an intervention, so there is no need to randomly assign participants to control versus intervention groups. But randomisation could be used to decide which staff worked with which participants (or vice versa), for example, where a single extension worker is assigned to a given community. But this would be less easy where a whole group of staff, e.g. in a local health centre or local school, is responsible for the surrounding community.

Even where randomisation of staff was possible, this would not prevent external factors influencing the impact of the intervention. It could be argued that the groups experiencing the least impact and the poorest quality implementation were doing so because of the influence of an independent cause (e.g. geographical isolation) that is not present amongst the groups experiencing bigger impacts and better quality implementation. Geographical isolation is a common external influence in many rural development projects, one which is likely to make implementation of a livelihood initiative more difficult as well as making it more difficult for the participants to realise any benefits, e.g. through sales of new produce at a regional market. Other external influences may affect the impact but not the intervention, e.g. subsequent changes in market prices for produce. However, identifying the significance of external influences should be relatively easy, by making statistical tests of the difference in their prevalence in the high and low impact groups. This does of course require being able to identify potential external influences, whereas with randomised control trials (RCTs) no knowledge of other possible causes is needed (their influence is assumed to be equally distributed between control and intervention groups). However, this requirement could be considered a "feature" rather than a "bug", because exploration of the role of other causal factors could inform and help improve implementation. On the other hand, the randomisation of control and intervention groups could encourage management's neglect of the role of other causal factors. There are clearly trade-offs here between the competing evaluation quality criteria of rigour and utility.
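
One such prevalence test can be sketched as a two-proportion z-test, using invented counts for how common geographical isolation is among the lowest- and highest-impact communities:

```python
import math

# Invented counts: prevalence of geographical isolation among the 50
# lowest-impact and the 50 highest-impact communities.
isolated_low, n_low = 30, 50    # 60% of low-impact communities isolated
isolated_high, n_high = 12, 50  # 24% of high-impact communities isolated

# Two-proportion z-test for the difference in prevalence.
p1, p2 = isolated_low / n_low, isolated_high / n_high
p_pool = (isolated_low + isolated_high) / (n_low + n_high)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_low + 1 / n_high))
z = (p1 - p2) / se
print(round(z, 2))  # well above 1.96, so significant at the 5% level
```
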

[1] i.e. with observed impact on the y axis and intervention quality on the x axis.


2. This is interesting Rick; your points about assessing implementation quality/intervention exposure are very well taken. Just for the record, we are attempting to assess intervention exposure in the various effectiveness audits we are carrying out. This is mentioned explicitly in the paper: "However, obtaining basic intervention exposure data is also valuable to avoid Type III error, i.e. falsely concluding that a poorly implemented intervention – that would have been effective if properly implemented – is intrinsically ineffective" (Dobson and Cook 1980).

As such, at the risk of sounding defensive, we are definitely not avoiding the implementation question. Taking into account time and resource limitations, we are attempting to look at treatment fidelity within the evaluations and - where possible - undertake dose-response/treatment effect heterogeneity analysis.

Moreover, our aim is also to attempt to identify particular intervention variations/components that seem to work in specific contexts - trying to use the approach to leverage organisational learning.

Regarding the point on partners: As mentioned in the paper, Oxfam GB's indicators themselves should not be the core focus of attention; it is more about the evaluation processes that underlie them. In particular, the evaluations are carried out together with partners based on contextually grounded "theories of change," and the findings are used for context-specific programme strengthening purposes. Having similar ways of measuring key constructs such as household income, however, enables aggregation of the datasets to assess how well the randomly sampled interventions are doing overall in relation to the indicator in question.

A possible misconception is that our approach requires the implementation of standardised interventions. Oxfam, for example, implements many livelihood projects in various guises that are either explicitly or implicitly attempting to bolster household income. As such, all we are doing is evaluating such projects with appropriate ways of measuring this variable, one of which is household consumption and expenditure (the basis of the £ per capita per day indicator). If a core aim is to raise household income, but we find that we have not done so for the majority of targeted households (or even our entire random sample of projects), is it not a good thing that we understand this so we can go back to the drawing board? On the other hand, if we find that income has been raised by some projects and/or some populations but not others, is it not worth exploring why this is the case?

Cheers!

3. Karl Hughes has also commented (in an email to me) on the evaluation design I have described at the beginning of this post, as follows: "Technically speaking, the absolute counterfactual illustration you outline is the difference-in-differences design, which is usually used in non-experimental impact evaluations. At least theoretically, in an RCT, the outcome averages of the two groups are compared directly ex-post. Baseline data are typically collected to see if randomisation was successful and to obtain increased estimation precision. However, Howard White argues that the dif-in-dif estimator should be used in RCTs as well. So your illustration, at least to my understanding, is a special case." Which I found helpful.

4. Projecting past trends into the future is another approach to identifying a counterfactual.

See page 46 of "Micro-Methods in Evaluating Governance Interventions" by Garcia, M., 2011. Evaluation Working Papers, Bundesministerium für wirtschaftliche Zusammenarbeit und Entwicklung, Bonn.

"If the programme is to be rolled out all at once and the treatment is to be administered uniformly, a time-series analysis consisting of repeated measures taken periodically before and after the intervention could be used (Rossi and Freeman 1993). For example, a new reform agenda has been implemented nationwide. Time-series analysis entails predicting the post-intervention outcome using the pre-intervention trend. The trend of the predicted outcome becomes the counterfactual, i.e. what would have happened without the reform. The impact of the reform is the comparison between the predicted outcome and the actual post-intervention outcome. The drawback of this approach is that pre-intervention observations must be large enough to produce a robust prediction. Secondly, any treatment effects may not be fully credible owing to the bias resulting from various factors not taken into account. Apart from external factors, another problem is the presence of implementation lags, “announcement effects” and uncertainty as to when the programme actually took effect. This makes it difficult to pin down the exact timing of the programme. Fortunately, such structural breaks in the outcome can be formally tested (see Piehl et al. 1999)."

5. From "The Other Hand: High Bandwidth Development Policy" by Ricardo Hausmann, John F. Kennedy School of Government, Harvard University, October 2008, pages 27-28:

10. Randomized trials and benchmarking clubs in a high dimensional world

Another method that loses its appeal in a world of high dimensionality is the randomized trial approach. A typical program, whether a conditional cash transfer, a micro-finance program or a health intervention, can easily have 15 relevant dimensions. Assume that each dimension can only take 2 values. Then the possible combinations are 2^15, or 32,768. But randomized trials can only distinguish between a control group and 1 to 3 treatment groups. So, many of the design or contextual features are kept constant while just 1 to 3 are being varied. This means that the search over the design space is quite limited, while the external validity of these experiments is reduced by the fact that many of the design or contextual elements are bound to change from place to place. So, for the majority of the design elements, policy makers must make choices on many of the design criteria in the absence of the support from randomized trials, which will necessarily play a secondary role in practice.

High dimensionality is more amenable to an evolutionary approach. Since the search space is so large, finding the optimum is just too difficult. So the point is to organize many searches and have a selection mechanism. In biology, the searches occur mostly at random, but if the selection mechanism is effective, the system will be constantly picking those variations that improve performance. Humans should be able to search more efficiently, but they still need an effective selection mechanism.

One approach that facilitates this process and is used effectively in the private sector is benchmarking, a practice that was started in the auto industry but has spread to many other areas. Units are given operational flexibility, but their performance is meticulously measured and compared. The feedback loop created by repeated comparative measures is meant to facilitate the decentralized open-ended search for improvements. A repeated game of standardized tests and school autonomy is a rather different approach to educational improvement compared to randomized trials that try to find the impact of class size, teaching materials, de-worming, micro-nutrients, toilets or incentives for teacher attendance on school performance. Clearly, the impact of any of these interventions is bound to be highly context-specific: class size is bound to matter little if the teacher does not attend school and micro-nutrients are likely to be ineffective where nutrition is adequate. Decentralized experimentation and benchmarking of outcomes is likely to be a more effective and dynamic way of making progress.
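
Hausmann's combinatorial point from the excerpt above can be checked directly:

```python
# With 15 binary design dimensions, the design space has 2**15 combinations,
# of which a trial with a control group and three treatment arms probes
# only a handful.
dimensions = 15
design_space = 2 ** dimensions
print(design_space)  # 32768

arms = 4  # control plus three treatment groups
print(arms / design_space)  # roughly one hundredth of one percent
```
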