Friday, March 28, 2014

The challenges of using QCA



This blog posting is a response to my reading of the Inception Report written by the team who are undertaking a review of evaluations of interventions relating to violence against women and girls. The process of the review is well documented in a dedicated blog – EVAW Review.

The Inception Report is well worth reading, which is not something I say about many evaluation reports! One reason is to benefit from the amount of careful attention the authors have given to the nuts and bolts of the process. Another is to see the kind of intensive questioning the process has been subjected to by the external quality assurance agents and the considered responses by the evaluation team. I found that many of the questions that came to my mind while reading the main text of the report were dealt with when I read the annex containing the issues raised by SEQUAS and the team’s responses to them.

I will focus on one issue that is a challenge for both QCA and data mining methods like Decision Trees (which I have discussed elsewhere on this blog): the ratio of conditions to cases. In QCA, conditions are attributes of the cases under examination that are provisionally considered as possible parts of causal configurations explaining at least some of the outcomes. After an exhaustive search and selection process the team has ended up with a set of 39 evaluations they will use as cases in a QCA analysis. After a close reading of these and other sources they have come up with a list of 20 conditions that might contribute to 5 different outcomes. With 20 different conditions there are 2^20 (i.e. 1,048,576) possible configurations that could explain some or all of the outcomes. But there are only 39 evaluations, which at best will represent only 0.004% of the possible configurations. In QCA the remaining 1,048,537 are known as "logical remainders". Some of these can usually be put to work in a QCA analysis by making explicit assumptions, e.g. ruling out particular combinations of configurations and outcomes that could not plausibly occur in real life. However, from what I understand of QCA practice, the logical remainders used in this way would not usually exceed 50% of all possible configurations.
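The arithmetic here is simple enough to check; a minimal sketch in Python, using the figures above:

```python
# Number of possible configurations for k binary conditions, and how many of
# those are "logical remainders" when only n cases are observed (assuming,
# at best, that every case shows a different configuration).
k_conditions = 20
n_cases = 39

configurations = 2 ** k_conditions            # 1,048,576
remainders = configurations - n_cases         # 1,048,537 at a minimum
coverage = n_cases / configurations * 100     # roughly 0.004%

print(f"Possible configurations: {configurations:,}")
print(f"Logical remainders (at best): {remainders:,}")
print(f"Share of configurations observable: {coverage:.3f}%")
```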

The review team has dealt with this problem by summarising the 20 conditions and 5 outcomes into 5 conditions and one outcome. This means there are 2^5 (i.e. 32) possible causal configurations, which is more reasonable considering there are 39 cases available to analyse. However, there is a price to be paid for this solution: an increased level of abstraction/generality in the terms used to describe the conditions. This makes the task of coding the known cases more challenging, and it will make the task of interpreting the results and then generalising from them more challenging as well. You can see the two versions of their model in the diagram below, taken from their report.
 
What fascinated me was the role of evaluation method in this model (see "Convincing methodology"). It is only one of five conditions that could explain some or all of the outcomes. It is quite possible, therefore, that all or some of the case outcomes could be explained without reference to this condition. This is quite radical, considering the centrality of evaluation methodology in much of the literature on evaluations. It may also be worrying to DFID, in that one of their expectations of this review was that it would "generate a robust understanding of the strengths, weaknesses and appropriateness of evaluation approaches and methods". The other potential problem is that even if methodology is shown to be an important condition, its singular description does not provide any means of discriminating between forms which are more or less helpful.

The team seems to have responded to this problem by proposing additional QCA analyses, with an additional condition that differentiates cases according to whether they used qualitative or quantitative methods. However, reviewers have still questioned whether this is sufficient. The team in return have commented that they will "add to the model further conditions that represent methodological choice after we have fully assessed the range of methodologies present in the set, to be able to differentiate between common methodological choices". It will be interesting to see how they go about doing this, while avoiding the problem of "insufficient diversity" of cases already mentioned above.

One possible way forward is illustrated in a recent CIFOR Working Paper (Sehring et al, 2013) and is also covered in Schneider and Wagemann (2012). They show how it is possible to do a "two-step QCA", which differentiates between remote and proximate conditions. In the VAWG review this could take the form of an analysis of conditions other than methodology first, followed by a second analysis focusing on a number of methodology conditions. The first step essentially reduces a larger number of remote conditions down to a smaller number of configurations that do make a difference to outcomes, which are then carried into the second step of the analysis alongside the more proximate conditions. It has the effect of reducing the number of logical remainders. It will be interesting to see if this is the direction the VAWG review team are heading in.

PS 2014 03 30: I have found some further references to two-level QCA:
 And for people wanting a good introduction to QCA, see

Monday, January 13, 2014

Thinking about set relationships within monitoring data


I have just re-read Howard White's informative blog posting on "Using the causal chain to make sense of the numbers", which refers to what he calls "the funnel of attrition". I have reproduced a copy of his diagram here, one which represents the situation in an imaginary project. He uses the diagram to emphasise the need to do basic analyses of implementation data (informed by a theory of change) before launching into sophisticated analyses of relationships between outputs and impacts.
The same set of data can be represented using a Venn diagram, to show the relationship between these 8 sets of people, as shown in this truncated version below:


Venn diagrams like these can also be read as describing relationships of necessity and sufficiency. According to the above diagram, knowing about the intervention is a necessary condition of taking part in it. There are no cases (in the above sets) where people have taken part without already knowing about the intervention.

However, it is conceivable that some people could be assigned to an intervention without knowing about it in advance and making their own choice. In that case the set relationships could look more like the diagram below (yellow being participants who were assigned without any prior knowledge). Here the key change is the overlap in their memberships; the actual numbers of people could well be the same.


It's possible to imagine other complexities to this model. For example, some people may change their behaviour without necessarily changing their attitudes beforehand, because of compulsion or pressure of some kind. So the revised model might look more like this... (brown being participants changing their behaviour due to compulsion)


In both these examples, what was a necessary condition has become a sufficient condition. Knowing about an intervention is sufficient to enable a person to participate in the intervention, but it is not the only way; people can also be assigned to the intervention. Similarly, changing their attitudes is one means whereby a person will change their behaviour, but behaviour may also be changed through other means, e.g. compulsion.

The point of these two examples is that when monitoring implementation it is not good enough to simply record and compare the relative numbers who belong to each consecutive group in the "funnel of attrition". Doing so implies that the Theory of Change (or Theory of Action, as some people might prefer to call it) is the only (i.e. necessary) means by which a desired outcome can occur, which seems highly unlikely. Instead, what is needed is a comparison of the membership relationships between one set and the next, to identify whether other conditions might also be sufficient for the expected change to happen. This can be done using nothing more complicated than cross-tabulations in Excel.
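As an illustration of the kind of cross-tabulation I have in mind, here is a minimal sketch in Python (pandas); the person-level membership data is entirely made up:

```python
import pandas as pd

# Hypothetical person-level monitoring records: did each person know about
# the intervention beforehand, and did they take part in it?
df = pd.DataFrame({
    "knew_about_intervention": [True, True, True, False, False, True, False, True],
    "took_part":               [True, True, False, False, True, False, False, True],
})

# The cross-tabulation shows the membership overlap between the two sets.
# An empty (knew=False, took_part=True) cell would suggest that knowing about
# the intervention is necessary for taking part; a non-zero count there points
# to other routes into participation (e.g. assignment), i.e. knowing about the
# intervention is at most one sufficient route, not a necessary condition.
print(pd.crosstab(df["knew_about_intervention"], df["took_part"], margins=True))
```

The same table could be produced in Excel with a pivot table; the point is simply that the comparison needs individual membership data, not just headline counts.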

But this view does have significant implications for how we monitor project interventions. It means it is not good enough to simply track numbers of people participating in various activities. In order to identify possible relationships of necessity and sufficiency between these events we need to know who participated in each activity, so we can identify the extent to which membership in one set overlapped with another. In my experience this level of implementation monitoring is not very common.

PS: For more reading on set relationships and concepts of sufficient and necessary causal conditions, I highly recommend: 

Goertz, Gary, and James Mahoney. 2012. A Tale of Two Cultures: Qualitative and Quantitative Research in the Social Sciences. Princeton University Press. http://www.amazon.co.uk/Tale-Two-Cultures-Qualitative-Quantitative/dp/0691149712


Saturday, October 26, 2013

Complex Theories of Change: Recipes for failure or for learning?


The diagram below is a summary of a Theory of Change for interventions in the education sector in Country X. It did not stand on its own; it was supplemented by an extensive text description.



It's complex in the sense that there are many different parts to it and many interconnections between them, including some feedback loops. It seems realistic in the sense of capturing some of the complexity of social change. But it may be unrealistic if it is read as a prescription for achieving change. Whether it is the latter depends on how we interpret the diagram, which I discuss below.

One way of viewing the Theory of Change is in terms of conditions (the elements in the diagram) that may or may not be necessary and/or sufficient for the final outcome to occur. The ideas of necessary and/or sufficient causal conditions are central to the notion of “configurational” models of causation, described by Mahoney and Goertz (2012) and others. A configuration is a set of conditions that may be either sufficient or necessary for an outcome e.g. Condition X + Condition T + Condition D + Condition P -> Outcome. This is in contrast to simpler notions of an outcome having a single cause e.g. Condition T -> Outcome.

The philosopher John Mackie (1974) argued that most of the "causes" we talk about in everyday life are what he called INUS causes: a condition that is an Insufficient but Necessary part of a configuration of conditions which is itself Unnecessary but Sufficient for an outcome to occur. For example, smoking is a contributory cause of lung cancer, but it is neither necessary nor sufficient for getting cancer: there are other ways of getting cancer, and not all smokers get cancer.


The interesting question for me is whether the above Theory of Change represents one or more than one causal configuration. I look at both possibilities and their implications.

If the Theory of Change represents a single configuration then each element, such as "More efficient management of teacher recruitment and deployment", would be insufficient by itself, but a necessary part of the whole configuration. In other words, every element in the Theory of Change has to work or else the outcome won't occur. This is quite a demanding expectation. The more complex this "single configuration" model becomes (i.e. by having more conditions), the more vulnerable it will become to implementation failure, because if even one part does not work, the whole process will fail. One saving grace is that it would be relatively easy to test this kind of theory. In any locations where the outcome did occur we would expect all elements to be present. If some were not, then the missing elements would not qualify as insufficient but necessary conditions.

The alternative perspective is to see the above Theory of Change as representing multiple causal configurations, i.e. multiple possible combinations of conditions, each of which can lead to the desired outcome. So any condition, again such as "More efficient management of teacher recruitment and deployment", may not be necessary under all circumstances. Instead it may be an insufficient but necessary part of one of the configurations, but not the others. Viewed from this perspective, the Theory of Change seems less doomed to implementation failure, because there is more than one route to success.

However, if there are multiple routes the challenge is then to identify the different configurations that may be associated with successful outcomes. As it stands, the current Theory of Change gives little guidance. Like many Theories of Change drawn at this macro-level / sector perspective, it tends towards showing "everything connected to everything". In fact this limitation seems unavoidable, because with increasing scale there is often a corresponding increase in the diversity of actors, interventions and contexts. In such circumstances there are likely to be many more causal pathways at work. This suggests that at such a macro level it might be more appropriate for a Theory of Change to have relatively modest initial ambitions, limiting itself to identifying the conditions that are likely to be involved in the various causal configurations.

The focus would then move to what can be done through subsequent monitoring and evaluation efforts. This could involve three tasks: (a) identifying where the outcomes have and have not occurred; (b) identifying how the cases differed in terms of the configurations of conditions that were associated with the outcomes (and absent where the outcomes did not occur), which would involve across-case comparisons; and (c) establishing plausible causal linkages between the observed conditions within each configuration, which would involve within-case analyses. Ideally, the overall findings about the configurations involved would help ensure the sustainability and replicability of the expected outcomes.
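To make task (b) a little more concrete, here is a minimal sketch of an across-case comparison in Python; the condition names, case data and outcome values are all hypothetical:

```python
import pandas as pd

# Hypothetical case records: presence/absence (1/0) of three conditions and the outcome.
cases = pd.DataFrame({
    "teacher_mgmt":     [1, 1, 0, 1, 0, 1, 0, 1],
    "curriculum":       [1, 0, 1, 1, 0, 1, 1, 0],
    "community_demand": [1, 1, 1, 0, 0, 1, 0, 0],
    "outcome":          [1, 1, 0, 1, 0, 1, 0, 0],
})

conditions = ["teacher_mgmt", "curriculum", "community_demand"]

# Group cases by their configuration of conditions and see how many cases show
# each configuration, and how consistently it co-occurs with the outcome.
truth_table = (cases.groupby(conditions)["outcome"]
                    .agg(n_cases="size", outcome_rate="mean")
                    .reset_index())
print(truth_table.sort_values("outcome_rate", ascending=False))
```

This is only the tabulation step; the within-case analysis in task (c) is still needed to argue that the associations are causal rather than coincidental.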

The Theory of Change will still be useful in as much as it successfully anticipates the various conditions making up the configurations associated with outcomes, and their absence. It will be less useful if it has omitted many elements, or included many that are irrelevant. Its usefulness could actually be measured! Going back to the recipe metaphor in the title, a good Theory of Change will have at least an appropriate list of ingredients but it will be really up to subsequent monitoring and evaluation efforts to identify what combinations of these produce the best results and how they do so (e.g. by looking at the causal mechanisms connecting these elements).

Some useful references to follow up:
Causality for Beginners, Ray Pawson, 2008
Qualitative Comparative Analysis, at Better Evaluation
Process Tracing, at Better Evaluation
Generalisation, at Better Evaluation

Postscript:

I have just read Owen Barder's review of Ben Ramalingam's new book "Aid on the Edge of Chaos". In that review he makes two comments that are relevant to the argument presented above:
"As Tim Harford showed in his book Adapt, all successful complex systems are the result of adaptation and evolution.  Many in the world of development policy accepted intellectually the story in Adapt but were left wondering how they could, practically and ethically, manage aid projects adaptively when they were dealing with human lives"
"Managing development programmes in a complex world does not mean abandoning the drive to improve value for money. Iteration and adaptation will often require the collection of more data and more rigorous analysis - indeed, it often calls for a focus on results and 'learning by measuring' which many people in development may find uncomfortable."
The point made in the last paragraph about requiring the collection of more data needs to be clearly recognised, as early as possible. Where there are likely to be many possible causal relationships at work, and few if any of these can be confidently hypothesised in advance, the coverage of data collection will need to be wider. Data collection (and then analysis) in this situation is like casting a net onto the waters, albeit still with some idea of where the fish may be. The net needs to be big enough to cover the possibilities.




Wednesday, August 14, 2013

Measuring the impact of ideas: Some testable propositions



Evaluating the impact of research on policy and practice can be quite a challenge, for at least three reasons: (a) our ideas of the likely impact pathways may be poorly developed; (b) actors within those pathways may not provide very reliable information about exposure to, and use of, the research we are interested in - some may be over-obliging, others may be very reluctant to acknowledge its influence, and others may not even be conscious of the influence that did occur; (c) it is quite likely that there are many more pathways through which the research results travel than we can yet imagine, let alone measure. Even more so when we are looking at impact over a longer span of time. When I look back to the first paper I wrote about MSC, which I put on the web in 1996, I could never have imagined the diversity of users and usages of MSC that have happened since then.

I am wondering if there is a proxy measure of impact that might be useful, and whose predictive value might even be testable, before it is put to work as a proxy. A proxy is conventionally defined as "a person authorized to act on behalf of another". In this case it is a measure that can be justifiably used in place of another, because that other measure is not readily available.

What would that proxy measure look like? Let's start with an assumption that the more widely dispersed an idea is, the more likely someone will encounter it, if only by chance, and then make some use of it. Let's make a second assumption: that impact is greater when the idea is not only widely dispersed, say amongst 1,000 people rather than 100, but also dispersed amongst a wide variety of people, not just one kind of people. Combined together, the proxy measure could be described as Availability.

While one can imagine some circumstances where impact will be bigger when the idea is widely dispersed within a single type of people, I would argue that the success of these more "theory led" predictions will often be outnumbered by serendipitous encounters and impacts, especially where there has been large scale dissemination, as will often be the case when research is disseminated via the web. This is a view that could be tested, see below.

How would the proxy measure be measured? As suggested by the assumptions above, Availability could be tracked using two measures. One is the number of references to the research that can be found (e.g. on the web), which we could call Abundance. The other is the Diversity of sources that make these references. The first measure seems relatively simple. The second, the measurement of diversity, is an interesting subject in its own right, and one which has been widely explored by ecologists and other disciplines for some decades now (for a summary of ideas, see Scott Page, Diversity and Complexity, 2011, chapter 2). One simple measure is Simpson's Reciprocal Index (1/D), which combines Richness (the number of species [/ number of types of reference sources]) and Evenness (the relative abundance of species [/ number of references] across those types). High diversity is a combination of high Richness and high Evenness (i.e. all species are similarly abundant). A calculation of the index is shown below:
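For readers who prefer code to a worked example, a minimal sketch of the same index in Python; the source types and counts are made up:

```python
# Simpson's Reciprocal Index (1/D), where D = sum of squared proportions.
# Richness = number of source types; greater Evenness pushes 1/D up towards Richness.
ref_counts = {          # hypothetical counts of references, by type of source
    "academic papers": 40,
    "NGO reports": 25,
    "blogs": 20,
    "news media": 15,
}

total = sum(ref_counts.values())
D = sum((n / total) ** 2 for n in ref_counts.values())
reciprocal_index = 1 / D

print(f"Richness: {len(ref_counts)} source types")
print(f"Simpson's Reciprocal Index (1/D): {reciprocal_index:.2f}")
# 1/D ranges from 1 (all references in a single type of source) up to the
# Richness (references spread perfectly evenly across all types).
```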
How could the proxy measure be tested, before it can be widely used? We would need a number of test cases where not only can we measure the abundance and diversity of references to a given piece of research, but we can also access some known evidence of the impact(s) of that research. With the latter we may be able to generate a rank ordering of impact, through a pair comparison process - a process that can acknowledge the differences in the kinds of impact. We could then use data from these cases to identify which of the following distributions existed:



We could also compare cases with different combinations of abundance and diversity. It is possible that abundance is all that matters and diversity is irrelevant.

Now, does anyone have a set of cases we could look at, to test the propositions outlined above?

Postscript: There are echoes of evolutionary theory in this proposal. Species that are large in number and widely dispersed, across many different habitats, tend to have better long term survival prospects in the face of changing climates and the co-evolution of competitors.


Friday, July 26, 2013

A reverse QCA?



I have been talking to a project manager who needed some help clarifying their Theory of Change (and maybe the project design itself). The project aims to improve the working relationships between a particular organisation (A) and a number of organisations they work with (B). There is already a provisional scale that could be used to measure the baseline state of relationships, and changes in those relationships thereafter. Project activities designed to help improve the relationships have already been identified and should be reasonably easy to monitor. But the expected impacts of the improved relationships on what B's do elsewhere, via their other relationships, have not been clarified or agreed, and in all likelihood they could be many and varied. They will probably be easier to identify and categorise after the activities have been carried out than at any planning stage.

I have been considering the possible usefulness of QCA as a means of analysing the effectiveness of the project. The cases will be the various relationships between A and the B's that are assisted in different ways. The conditions will be the different forms of assistance provided, as well as differences in the context of these relationships (e.g. the people, organisations and communities involved). The outcome of interest will be the types of changes in the relationships between A and the B's. Not especially problematic, I hope.

Then I thought... perhaps one could do a reverse QCA analysis, to identify associations between specific types of relationship changes and the many different kinds of impacts that were subsequently observed on other relationships. The conditions in this analysis would be the various categories of observed change (with data on their presence and absence). The configurations of conditions identified by the QCA analysis would in effect be a succinct typology of impact configurations associated with each kind of relationship change, as distinct from the causal configurations sought via a conventional QCA.

This reversal of the usual QCA analysis should be possible and legitimate because relations between conditions and outcomes are set-theoretic relations, not temporal relationships. My next step will be to find out if someone has already tried to do this elsewhere (so that I could learn from them). These days this is highly likely.

Postscript 1: The same sort of reverse analyses could be done with Decision Tree algorithms, whose potential for use in evaluations has been discussed in earlier postings on this blog and elsewhere.

Postscript 2: I am slowly working my way through this comprehensive account of QCA, published last year:
Schneider, Carsten Q., and Claudius Wagemann. 2012. Set-Theoretic Methods for the Social Sciences: A Guide to Qualitative Comparative Analysis. Cambridge University Press.

Tuesday, April 16, 2013

Another perspective on the uses of control groups


I have been reading Eric Siegel's book on Predictive Analytics. Though it is a "pop science" account, with the usual "this will change the world" subtitle, it is definitely a worthwhile read.

In chapter 7 he talks the reader through what are called "uplift models": Decision Tree models that not only differentiate groups who respond differently to an intervention, but also show how much more (or less) they respond when compared to a control group receiving no intervention. All this is in the context of companies marketing their products to the population at large, not the world of development aid organisations.

(Temporarily putting aside the idea of uplift models...) In this chapter he happens to use the matrix below to illustrate the different possible sets of consumers that exist, given the two scenarios that can be compared when both a control and an intervention group are being used.
But what happens if we re-label the matrix, using more development project type language? Here is my revised version below:
 

Looking at this new matrix, it struck me that evaluators of development projects may have a relatively impoverished view of the potential uses of control groups. Normally the focus is on the net difference in improvement between households in the control and intervention groups: how big is it, and is it statistically significant? In other words, how many of those in the intervention group were really "self-help'ers", who would have improved anyway, versus "need-help'ers", who would not have improved without the intervention.

But this leaves aside two other sets of households who also surely deserve at least equal attention. One is the "hard cases", who did not improve in either setting - possibly the poorest of the poor. How often are their numbers identified with the same attention to statistical detail? The other is the "confused", who improved in the control group but not in the intervention group. Perhaps these are the ones we should really worry about, or at least be able to enumerate. Evaluators are often asked, in their ToRs, to give attention to negative project impacts, but how often do we systematically look for such evidence?

Okay, but how will we recognise these groups? One way is to look at the distributions of cases that are possible. Each group can be characterised by how cases are distributed across the control and intervention groups, as shown below. The first group (in green) are probably "self-help'ers", because the same proportion also improved in the control group. The second group are more likely to be "need-help'ers", because fewer people improved in the control group. The third group are likely to be the "confused", because more of them did not improve in the intervention group than in the control group. The fourth group are likely to be the "hard cases", if the same high proportion did not improve in the control group either.
At an aggregate level only one of the four outcome combinations shown above can be observed at any one time. This is the kind of distribution I found in the data set collected during a 2012 impact assessment of a rural livelihoods project in India. Here the overall distribution suggests that the “need-helpers” have benefited. 

How do we find if and where the other groups are? One way of doing this is to split the total population into sub-groups, using one household attribute at a time, to see what difference it makes to the distribution of results. For example, I thought that a household's wealth ranking might be associated with differences in outcomes. So I examined the distribution of outcomes for the poorest and least poor of the four wealth-ranked groups. In the poorest group, those who benefited were the "need-help'ers", but in the "Well-Off" group those who benefited were the "self-help'ers", perhaps as expected.






There are still two other kinds of outcomes that might exist in some sub-groups - the "hard cases" and the "confused". How can we find where they are? At this point my theory-directed search fails me. I have no idea where to look for them. There are too many household attributes in the data set to consider manually examining how different their particular distributions of outcomes are from the aggregate distribution.

This is the territory where an automated algorithm would be useful. Using one attribute at a time, it would split the main population into two sub-groups, and search for the attribute that made the biggest difference. The difference to look for would be extremity of range, as measured by the Standard Deviation of the cell percentages. The reason for this approach is that the most extreme range would be where one cell in the control group was 100% and the other 0%, and similarly in the intervention group. These would be pure examples of the four types of outcome distributions shown above. [Note that in the two wealth-ranked sub-groups above, the Standard Deviation of the distributions was 14% and 15%, versus 7% in the whole group.]
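A minimal sketch of such a search in Python, using randomly generated household data and hypothetical attribute names, just to show the mechanics:

```python
import numpy as np
import pandas as pd

# Hypothetical household records: group (control/intervention), whether the
# household improved, plus some binary attributes that could define sub-groups.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "group": rng.choice(["control", "intervention"], n),
    "improved": rng.choice([0, 1], n),
    "female_headed": rng.choice([0, 1], n),
    "landless": rng.choice([0, 1], n),
    "poorest_quartile": rng.choice([0, 1], n),
})

def spread(sub):
    """SD of the four cell percentages (improved / not improved x control / intervention)."""
    pct = pd.crosstab(sub["improved"], sub["group"], normalize="columns") * 100
    return pct.to_numpy().ravel().std()

# For each attribute, split the population into its two sub-groups and record
# how extreme the outcome distribution becomes; the attribute with the largest
# spread is the most promising place to look for "hard cases" or "confused" groups.
results = {}
for attr in ["female_headed", "landless", "poorest_quartile"]:
    results[attr] = max(spread(df[df[attr] == v]) for v in (0, 1))

print(f"Whole population spread: {spread(df):.1f}")
for attr, s in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{attr}: max sub-group spread {s:.1f}")
```

With purely random data the spreads will all be small; with real survey data the interesting attributes are the ones whose sub-group spread clearly exceeds that of the whole population.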
This is the same sort of work that a Decision Tree algorithm does, except that Decision Trees usually search for binary outcomes and use different "splitting" criteria. I am not sure if they can use the Standard Deviation, or another measure that would deliver the same results (i.e. identify the four possible types of outcomes).


Wednesday, April 10, 2013

Predicting evaluability: An example application of Decision Tree models


The project: In 2000 ITAD did an Evaluability Assessment of Sida-funded democracy and human rights projects in Latin America and South Africa. The results are available here: Vol.1 and Vol.2. It's a thorough and detailed report.

The data: Of interest to me were two tables of data, showing how each of the 28 projects was rated on 13 different evaluability assessment criteria. The use of each of these criteria is explained in detail in the project-specific assessments in the second volume of the report.

Here are the two tables. The rows list the evaluability criteria and the columns list the projects that were assessed. The cell values show the scores on each criterion: 1 = best possible, 4 = worst possible. The bottom row summarises the scores for each project, and assumes an equal weighting for each criterion, except for the top three, which were not included in the summary score.



CLICK ON THE TABLE TO VIEW AT FULL SIZE

The question of interest: Is it possible to find a small sub-set of these 13 criteria which could act as good predictors of likely evaluability? If so, this could provide a quicker means of assessing where evaluability issues need attention.

The problem: With 13 different criteria there are conceivably 2 to the power of 13 possible combinations of criteria that might be good predictors, i.e. 8,192 possibilities.

The response: I amalgamated both tables into one, in an Excel file, and re-calculated the total scores, including the scores for the first three criteria (recoded as Y=1, N=2). I then recoded the aggregate score into a binary outcome measure, where 1 = an above-average evaluability score and 2 = a below-average score.

I imported this data into RapidMiner, an open source data mining package, and used its Decision Tree module to generate the following Decision Tree model, which I will explain below.



 CLICK ON THE DIAGRAM TO VIEW AT FULL SIZE

The results: Decision Tree models are read from the root (at the top) to the leaf, following each branch in turn.

This model tells us, in respect of the 28 projects examined, that IF a project scores less than 2.5 (which is good) on "Identifiable outputs" AND it scores less than 3.5 on "project benefits can be attributed to the project intervention alone" THEN there is a 93% probability that the project is reasonably evaluable (i.e. has an above-average aggregate score for evaluability in the original data set). It also tells us that 50% of all the cases (projects) meet these two criteria.

Looking down the right side of the tree we see that IF the project scores more than 2.5 (which is not good) on "Identifiable outputs", AND even though it scores less than 2.5 on "broad ownership of project purpose amongst stakeholders", THEN there is a 100% probability that the project will have low evaluability. It also tells us that 32% of all cases meet these two criteria.
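Read this way, the two branches amount to a pair of simple IF...THEN rules. A sketch of them as code, purely to make the reading order concrete (the variable names are mine, not RapidMiner's):

```python
def predicted_evaluability(identifiable_outputs, attribution_alone, broad_ownership):
    """Scores run from 1 (best) to 4 (worst), as in the original tables."""
    if identifiable_outputs < 2.5:
        if attribution_alone < 3.5:
            return "93% probability of above-average evaluability"
    else:
        if broad_ownership < 2.5:
            return "100% probability of low evaluability"
    return "branch not shown in the extract above"
```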

Improvements: This model could be improved in two ways. Firstly, the outcome measure, which is an above/below-average aggregate score for each project, could be made more demanding, so that only the top quartile of projects were rated as having good evaluability. We may want to set a higher standard.

Secondly, the assumption that all criteria are of equal importance, and thus that their scores can simply be added up, could be questioned. Different weights could be given to each criterion, according to its perceived causal importance (i.e. the effects it will have). This will not necessarily bias the Decision Tree model towards using the highly weighted criteria in a predictive model: if all projects were rated highly on a highly weighted criterion, that criterion would have no particular value as a means of discriminating between them, so it would be unlikely to feature in the Decision Tree at all.

Weighting, and perhaps subsequent re-weighting, of criteria may also help reconcile any conflict between what makes for accurate prediction rules and what makes sense as a combination of criteria that will cause high or low evaluability. For example, in the above model it seems odd that a criterion of merit (broad ownership of project purpose) should help us identify projects that have poor evaluability.
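For anyone wanting to experiment with these variations, a roughly equivalent model can be fitted with open source Python tools. A minimal sketch, assuming the amalgamated scores were exported to a CSV file with one row per project; the file and column names are hypothetical:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical export of the amalgamated table: one row per project,
# 13 criterion scores (1 = best, 4 = worst) and a binary outcome column.
data = pd.read_csv("evaluability_scores.csv")
X = data.drop(columns=["evaluable"])
y = data["evaluable"]   # 1 = above-average aggregate score, 0 = below

# A shallow tree keeps the rules readable, much like the two-level model above.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Print the learned rules in IF ... THEN form.
print(export_text(tree, feature_names=list(X.columns)))
```

Changing the outcome threshold (e.g. top quartile only) or re-weighting the criteria before recoding the outcome would simply mean re-running the same few lines.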

Your comments are welcome

PS: For a pop science account of predictive modelling, see Eric Siegel's book on Predictive Analytics.

Wednesday, February 13, 2013

My two particular problems with RCTs


Up till now I have tried not to take sides in the debate when it is crudely cast as between those "for" and those "against" RCTs (Randomised Control Trials). I have always thought that there are "horses for courses" and that there is a time and place for RCTs, along with other methods, including non-experimental methods, for evaluating the impact of an intervention. I should also disclose that my first degree included a major and sub-major in psychology, much of which was experimental psychology. Psychologists have spent a lot of time thinking about rigorous experimental methods. Some of you may be familiar with one of the better known contributors to the wider debates about methodology in the social sciences - Donald T Campbell - a psychologist whose influence has spread far beyond psychology. Twenty years after my first degree, his writings on epistemology influenced the direction of my PhD, which was not about experimental methods. In fact it was almost the opposite in orientation - the Most Significant Change (MSC) technique was one of its products.

This post has been prompted by my recent reading of two examples of RCT applications, one which has been completed and one which has been considered but not yet implemented. They are probably not exemplars of good practice, but in that respect they may still be useful, because they point to where RCTs should not be used. The completed RCT was of a rural development project in India. The contemplated RCT was of a primary education project in a Pacific nation. Significantly, both were large scale projects, covering many districts in India and many schools in the Pacific nation.

Average effects

The first problem I had is with the use of the concept of Average Treatment Effect (ATE) in these two contexts. The India RCT found a statistically significant difference in the reduction in poverty of households involved in a rural development project, when compared to those who had not been involved. I have not queried this conclusion: the sample looked decent in size and the randomisation looked fine. The problem I have is with what was chosen as the "treatment". The treatment was the whole package of interventions provided by the project. This included various modalities of aid (credit, grants, training) in various sectors (agriculture, health, education, local governance and more). It was a classic "integrated rural development" project, where a little bit of everything seemed to be on offer, delivered partly according to the designs of the project managers, and partly according to beneficiary plans and preferences. So, in this context, how sensible is it to seek the average effects on households of such a mixed-up salad of activities? At best it tells us that if you replicate this particular mix (and God knows how you will do that...) you will be able to deliver the same significant impact on poverty. Assuming that can be done, this must still be about the most inefficient replication strategy available. Much preferable would be to find which particular project activities (or combinations thereof) were more effective in reducing poverty, and then to replicate those.

Even the accountability value of the RCT finding was questionable. Where direct assistance is being provided to households, a plausible argument could be made that process tracing (by a decent auditor) would provide good enough assurance that assistance was reaching those intended. In other words, pay more attention to the causal "mechanism".

The proposed RCT of the primary education project had similar problems, in terms of its conception of a testable treatment. It proposed comparing the impact of two project "components", by themselves and in combination. However, as in India, each of these project components contained a range of different activities which would be variably made available and variably taken up locally across the project location.

Such projects are commonplace in development aid. Projects focusing on a single intervention, such as immunization or cash transfers are the exception, not the rule. The complex design of most development projects, tacitly if not explicitly, reflects a widespread view that promoting development involves multiple activities, whose specific composition often needs to be localised.

To summarise: it is possible to calculate average treatment effects, but it is questionable how useful that is in the project settings I have described - where there is a substantial diversity of project activities and combinations thereof.


Context

It's commonplace amongst social scientists, especially the more qualitatively oriented, to emphasise the importance of context. Context is also important in the use of experimental methods, because it is a potential source of confounding factors, confusing the impact of the independent variable under investigation.

There are two ways of dealing with context. One is by ruling it out e.g. by randomising access to treatment so that historical and contextual influences are the same for intervention and control groups. This was done in both the India and Pacific RCT examples. In India there were significant caste and class variations that could have influenced project outcomes. In the Pacific there were significant ethnic and religious differences. Such diversity often seems to be inherent in large scale development projects.

The result of using this ruling-out strategy is hopefully a rigorous conclusion about the effectiveness of an intervention, one that stands on its own, independent of the context. But how useful will that be? Replication of the same or a similar project will have to take place in a real location, where context will have its effects. How sensible is it to remain intentionally ignorant of those likely effects?

The alternative strategy is to include potentially relevant contextual factors into an analysis. Doing so takes us down the road of a configurational view of causation, embodied in the theory-led approaches of Realist Evaluation and QCA, and also in the use of data mining procedures that are less familiar to evaluators (Davies, 2012).

Evaluation as the default response

In the Pacific project it was even questionable whether an evaluation spanning a period of years was the right approach (RCT-based or otherwise). Outcomes data, in terms of student participation and performance, will be available on a yearly basis through various institutional monitoring mechanisms. Education is an area where data abounds, relative to many other development sectors, notwithstanding the inevitable quality issues. It could be cheaper, quicker and more useful to develop and test (annually) predictive models of the outcomes of concern. One can even imagine using crowdsourcing services like Kaggle to do so. As I have argued elsewhere, we could benefit by paying more attention to monitoring, relative to evaluation.
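To make the suggestion more concrete, here is a minimal sketch of what "develop and test (annually)" could look like, assuming routinely collected student records with hypothetical file and column names:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Hypothetical education monitoring data: one row per student per year,
# with a binary outcome of concern (e.g. completed the school year).
records = pd.read_csv("student_records.csv")
features = ["attendance_rate", "distance_to_school_km",
            "household_size", "teacher_pupil_ratio"]

# Train on one year's records, test on the following year's: the model is
# re-fitted and re-scored as each new year of monitoring data arrives.
train = records[records["year"] == 2012]
test = records[records["year"] == 2013]

model = GradientBoostingClassifier(random_state=0)
model.fit(train[features], train["completed_year"])

predicted = model.predict_proba(test[features])[:, 1]
print("Out-of-year AUC:", round(roc_auc_score(test["completed_year"], predicted), 3))
```

The annual re-testing is the point: the predictive model is a standing hypothesis about what matters, checked against each new round of monitoring data.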

In summary, be wary of using RCTs where development interventions are complex and variable, where there are big differences in the context in which they take place, and where an evaluation may not even be the most sensible default option.
 


Tuesday, September 11, 2012

Evolutionary strategies for complex environments


 [This post is in response to Owen Barder’s blog posting “If development is complex, is the results agenda bunk?”]

Variation, selection and retention are at the core of the evolutionary algorithm. This algorithm has enabled the development of incredibly sophisticated organisms able to survive in a diversity of complex environments over vast spans of time. Following the advent of computerisation, the same algorithm has been employed by homo sapiens to solve complex design and optimisation problems in many fields of science and technology. It has also informed thinking about the history and philosophy of science (Toulmin; Hull, 1988; Dennett, 1996) and even cosmology (Lee Smolin, 1997). Even advocates of experimental approaches to building knowledge, now much debated by development agencies and workers, have been keen proponents of evolutionary views on the nature of learning (Donald Campbell, 1969).

So, it is good to see these ideas being publicised by the likes of Owen Barder. I would like to support his efforts by pointing out that the application of an evolutionary approach to learning and knowledge may in fact be easier than it seems on first reading of Owen’s blog. I have two propositions for consideration.

1. Re Variation: New types of development projects may not be needed. From 2006 to 2010 I led annual reviews of four different maternal and infant health projects in Indonesia. All of these projects were being implemented in multiple districts. In Indonesia, district authorities have considerable autonomy. Not surprisingly, the ways each project was being implemented varied from district to district, both intentionally and unintentionally. So did the results. But this diversity of contexts, interventions and outcomes was not exploited by the LogFrame-based monitoring systems associated with each project. The LogFrames presented a singular view of "the project", one where aggregated judgements were needed about the whole set of districts involved. Diversity existed but was not being recognised and fully exploited. In my experience this phenomenon is widespread. Development projects are frequently implemented in multiple locations in parallel. In practice, implementation often varies across locations, by accident and by intention. There is often no shortage of variation. There is, however, a shortage of attention to such variations. The problem is not so much in project design as in M&E approaches that fail to demand attention to variation - to ranges and exceptions as well as central tendencies and aggregate numbers.

2. Re Selection: Fitness tests are not that difficult to set up, once you recognise and make use of internal diversity. Locations within a project can be rank-ordered by expected success, then rank-ordered by observed success, using participatory and/or other methods. The rank order correlation of these two measures is a measure of fitness - of design to context. Outliers are the important learning opportunities (high expected and low actual success, low expected and high actual success) that warrant detailed case studies. The other extremes (most expected and actual success, least expected and actual success) also need investigation, to make sure the internal causal mechanisms are as per the prior Theory of Change that informed the ranking.
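A minimal sketch of this fitness test in Python, using made-up location names and rankings:

```python
from scipy.stats import spearmanr

# Hypothetical project locations, rank-ordered by expected success (from the
# Theory of Change) and by observed success (from monitoring data).
locations     = ["A", "B", "C", "D", "E", "F", "G", "H"]
expected_rank = [1,   2,   3,   4,   5,   6,   7,   8]
observed_rank = [2,   1,   7,   3,   6,   5,   8,   4]

# The rank correlation is the "fitness" measure: how well the design's
# expectations fit what actually happened across locations.
rho, p_value = spearmanr(expected_rank, observed_rank)
print(f"Fitness (Spearman rho): {rho:.2f} (p = {p_value:.2f})")

# Outliers, i.e. big gaps between expected and observed rank, are the cases
# that warrant detailed follow-up case studies.
gaps = sorted(zip(locations, expected_rank, observed_rank),
              key=lambda t: -abs(t[1] - t[2]))
print("Largest expectation/observation gaps:", gaps[:2])
```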

It is possible to incorporate evolutionary ideas into the design of M&E systems. Some readers may know some of the background to the Most Significant Change (MSC) impact monitoring technique. Its design was informed by evolutionary epistemology. The MSC process deliberately includes an iterated process of exploiting diversity (of perceptions of change), subjecting these to selection processes (via structured choice exercises by stakeholders) and retention (of selected change accounts for further use by the organisation involved). MSC was tried out by a Bangladeshi NGO in 1993, retained, and then expanded in use over the next ten years. In parallel, it was also tried out by development agencies outside Bangladesh in the following years, and is now widely used by development NGOs. As a technique it has survived and proliferated. Although it is based on evolutionary ideas, I suspect that no more than 1 in 20 users might recognise this. No matter; nor are finches likely to be aware of Darwin's evolutionary theory. Ideally the same might apply to good applications of complexity theory.

Our current thinking about Theories of Change (ToC) is ripe for some revolutionary thinking, aided by an evolutionary perspective on the importance of variation and diversity. Singular theories abound, both in textbooks (Funnell and Rogers, 2011) and in examples developed in practice by organisations I have been in contact with. All simple results-chain models are by definition singular theories of change. More complex network-like models, with multiple pathways to a given set of expected outcomes, are a step in the right direction. I have seen these in some DFID policy area ToCs. But what is really needed are models that consider a diversity of outcomes as well as of means of getting there. One possible means of representing these models, which I am currently exploring, is the use of Decision Trees. Another, which I explored many years ago, and which I think deserves more attention, is a scenario planning type tool called Evolving Storylines. Both make use of divergent tree structures, as did Darwin when illustrating his conception of the evolutionary process in The Origin of Species.

Friday, August 03, 2012

AusAID’s 'Revitalising Indonesia's Knowledge Sector for Development Policy' program


Enrique Mendizabal has suggested I might like to comment on the M&E aspects of AusAID's 'Revitalising Indonesia's Knowledge Sector for Development Policy' program, discussed on AusAID's Engage blog and Enrique's On Think Tanks blog.

Along with Enrique, I think the Engage blog posting on the new Indonesian program is a good development. It would be good to see this happening at the early stages of other AusAID programs. Perhaps it already has.

Enrique notes that "A weak point and a huge challenge for the programme is its monitoring and evaluation. I am afraid I cannot offer much advice on this except that it should not be too focused on impact while not paying sufficient attention to the inputs."

I can only agree. It seems ironic that so much attention is spent these days on assessing impact, while at the same time most LogFrames no longer seem to bother to detail the activities level. Yet in practice the intervening agencies can reasonably be held most responsible for activities and outputs, and least responsible for impact. It seems like the pendulum of development agency attention has swung too far, from an undue focus on activities in the (often mythologised) past to an undue emphasis on impact in the present.

Enrique suggests that “A good alternative is to keep the evaluation of the programme independent from the delivery of the programme and to look for expert impact evaluators based in universities (and therefore with a focus on the discipline) to explore options and develop an appropriate approach for the programme. While the contractor may manage it it should be under a sub-contract that clearly states the independence of the evaluators. Having one of the partners in the bid in charge of the evaluation is only likely to create disincentives towards objectivity.”

This is a complex issue and there is unlikely to be a single simple solution. In a sense, evaluation has to be part of the work of all the parties involved, but there need to be checks and balances to make sure this is being done well. Transparency, in the form of public access to plans and progress reporting, is an important part. Evaluability assessments of what is proposed are another part. Meta-evaluation and syntheses of what has been done are another. There will be a need for M&E roles both within and outside the management structures. I agree with Guggenheim and Davis' (AusAID) comment that "the contractor needs to be involved in setting the program outcomes alongside AusAID and Bappenas because M&E begins with clarity on what a program's objectives are".

Looking at the two pages on evaluation in the AusAID design document (see pages 48-9), there are some things new and some things old. The focus on evaluation as hypothesis testing seems new, and is something I have argued for in the past, in place of numerous and often vague evaluation questions. On the other hand, the counting of products produced and viewed seems stuck in the past: necessary, but far from sufficient. What is needed is a more macro-perspective on change, which might be evident in: (a) changes in the structure of relationships between the institutional actors involved, and (b) changes in the content of the individual relationships. Producing and using products is only one part of those relationships. The categories of "supply", "demand", "intermediaries" and "enabling environment" are a crude first approximation of what is going on at this more macro level, which hopefully will soon be articulated in more detail.

The discussion of the Theory of Change in the annex to the design document is also interesting. On the one hand, the authors rightly argue that this project and setting are complex and that "For complex settings and problems, the 'causal chain' model often applied in service delivery programs is too linear and simplistic for understanding policy influence". On the other hand, some pages later there is the inevitable, and perhaps mandatory, linear Program Logic Model, with nary a single feedback loop.

One of the purposes of the ToC (presumably including the Program Logic Model) is to "guide the implementation team to develop a robust monitoring and evaluation system". If so, it seems to me that this would be much easier if the events described in the Program Logic Model were being undertaken by identifiable actors (or categories thereof). However, reading the Program Logic Model we see references only to the broadest categories (government agencies, government policy makers, research organisations and networks of civil society), with one exception – Bappenas.

Both these problems of how to describe complex change processes are by no means unique to AusAID; they are endemic in aid organisations. Yet at the same time, all the participants in the discussion I am now part of are enmeshed in an increasingly socially and technologically networked world. We are surrounded by social networks, yet seemingly incapable of planning in these terms. As they say, "Go figure".


 
PS: I also meant to say that I strongly support Enrique's suggestion that the ToRs for development projects, and the bids received in response to those ToRs, should be publicly available online, and used as a means of generating discussion about what should be done. I think that in the case of DFID at least, the ToRs are already available online to companies registering an interest in bidding for aid work. However, open debate is not facilitated, and is unlikely to happen if the only parties present are the companies competing with each other for the work.


Tuesday, June 05, 2012

Open source evaluation - the way forward?


DFID has set up a special website at projects.dfid.gov.uk where anyone can search for and find details of the development projects it has funded.

As of June this year you can find details of 1,512 operational projects, 1,767 completed projects and 102 planned projects. The database is updated monthly as a result of an automated trawl through DFID's internal databases. It has been estimated that the database covers 98% of all current projects, with the remaining 2% being omitted for security and other reasons.

There are two kinds of search facilities: (a) by key words, and (b) by choices from drop-down menus. These searches can be combined to narrow a search (in effect, using an AND). But more complex searches using OR or NOT are not yet possible.

The search results are in two forms: (a) A list of projects shown on the webpage, which can also be downloaded as an Excel file. The Excel file has about 30 fields of data, many more than are visible in the webpage listing of the search results; (b) Documents produced by each project, a list of which is viewable after clicking on any project name in a search result. There are 10 different kinds of project documents, ranging from planning documents to progress reports and project completion reports. Evaluation reports are not yet available on this website.

In practice the coverage of project documents is still far from comprehensive. This was my count of what was available in early May, when I was searching for documents relating to DFID's 27 focus countries (out of 103 countries it has worked in up to 2010).

Documents available (as a percentage of all operational projects, 27 focus countries only):

Documents available                        Up to and including 2010    Post-2010 projects
Business Case and Intervention Summary     8% (27)                     40% (108)
Logical Frameworks                         22% (74)                    39% (104)
Annual Reviews                             17% (55)                    9% (24)
Number of projects                         100% (330)                  100% (270)

Subject to its continued development, this online database has great potential for enabling what could be called "open source" evaluation, i.e. investigation of DFID projects by anyone who can access the website.

With this in mind, I would encourage you to post comments here about:
  •  possible improvements to the database
  •  possible uses of the database.
Re improvements: after having publicised the database on the MandE NEWS email list, one respondent made the useful suggestion that the database should include weblinks to existing project websites. Even if a project has closed and its website is no longer maintained, access may still be possible through web archive services such as Alexa.com's Wayback Machine. [PS: It contains copies of www.mande.co.uk dating back to 1998!]

Re uses of the database: earlier today I did a quick analysis of two downloaded Excel files, one of completed projects and the other of currently operational projects, looking at the proportion of high-risk projects in these two groups. The results:



                  Completed projects    Operational projects
High risk         331 (12%)             677 (19%)
Medium risk       1328 (47%)            1159 (33%)
Low risk          1167 (41%)            1648 (47%)





DFID appears to be funding more high-risk projects than in the past. Unfortunately, we don't know what time period the "completed projects" category comes from, or the percentage of projects from that period that are listed in the database. Perhaps this is another possible improvement to the database: make the source(s) and limitations of the data more visible.
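For anyone wanting to repeat this kind of quick check, a sketch of how the breakdown could be computed from the downloaded Excel files; the file and column names here are hypothetical and would need adjusting to the actual download format:

```python
import pandas as pd

# Hypothetical downloads from the project database: one file of completed
# projects and one of operational projects, each with a risk-rating column.
completed = pd.read_excel("completed_projects.xlsx")
operational = pd.read_excel("operational_projects.xlsx")

def risk_breakdown(df, label):
    """Counts and percentage shares of each risk rating, formatted as 'n (x%)'."""
    counts = df["risk_rating"].value_counts()
    shares = (counts / counts.sum() * 100).round(0).astype(int)
    return pd.DataFrame({label: counts.astype(str) + " (" + shares.astype(str) + "%)"})

summary = risk_breakdown(completed, "Completed projects").join(
    risk_breakdown(operational, "Operational projects"), how="outer")
print(summary)
```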

PS 16 June: A salutary blog posting on "What data can and cannot do" includes reminders that:
  •  Data is not a force unto itself
  •  Data is not a perfect reflection of the world
  •  Data does not speak for itself
  •  Data is not power
  •  Interpreting data is not easy