Wednesday, April 10, 2013

Predicting evaluability: An example application of Decision Tree models


The project: In 2000 ITAD did an Evaluability Assessment of Sida-funded democracy and human rights projects in Latin America and South Africa. The results are available here: Vol.1 and Vol.2. It's a thorough and detailed report.

The data: Of interest to me were two tables of data, showing how each of the 28 projects was rated on 13 different evaluability assessment criteria. The use of each of these criteria is explained in detail in the project-specific assessments in the second volume of the report.

Here are the two tables. The rows list the evaluability criteria and the columns list the projects that were assessed. The cell values show the scores on each criterion: 1 = best possible, 4 = worst possible. The bottom row summarises the scores for each project, assuming an equal weighting for each criterion, except for the top three criteria, which were not included in the summary score.



[Tables of evaluability scores: click to view at full size]

The question of interest: Is it possible to find a small subset of these 13 criteria which could act as good predictors of likely evaluability? If so, this could provide a quicker means of assessing where evaluability issues need attention.

The problem: With 13 different criteria there are 2 to the power of 13, i.e. 8,192, possible combinations of criteria that might be good predictors.

The response: I amalgamated both tables into one, in an Excel file, and re-calculated the total scores, this time including scores for the first three criteria (recoded as Y=1, N=2). I then recoded the aggregate score into a binary outcome measure, where 1 = above-average evaluability score and 2 = below-average score.

I then imported this data into RapidMiner, an open source data mining package, and used its Decision Tree module to generate the following Decision Tree model, which I explain below.



[Decision Tree diagram: click to view at full size]

The results: Decision Tree models are read from the root (at the top) down to the leaves, following each branch in turn.

This model tells us, with respect to the 28 projects examined, that IF a project scores less than 2.5 (which is good) on "Identifiable outputs" AND it scores less than 3.5 on "project benefits can be attributed to the project intervention alone" THEN there is a 93% probability that the project is reasonably evaluable (i.e. it has an above-average aggregate score for evaluability in the original data set). It also tells us that 50% of all the cases (projects) meet these two criteria.

Looking down the right side of the tree, we see that IF the project scores more than 2.5 (which is not good) on "Identifiable outputs" THEN, even if it scores less than 2.5 on "broad ownership of project purpose amongst stakeholders", there is a 100% probability that the project will have low evaluability. It also tells us that 32% of all cases meet these two criteria.
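For readers who prefer working in code rather than a GUI, the sketch below shows how a comparable model could be fitted with Python and scikit-learn instead of RapidMiner. The file name and column names are hypothetical stand-ins for the recoded data described above; this is not the analysis behind the diagram.

```python
# Minimal sketch (not the original RapidMiner analysis): fit a small decision
# tree to the recoded evaluability scores and print its rules.
# Assumes a hypothetical CSV with one row per project, the 13 criteria scores
# (1 = best, 4 = worst) and a binary "evaluability" column (1 = above average).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("evaluability_scores.csv")               # hypothetical file name
criteria = [c for c in df.columns if c != "evaluability"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)  # keep the tree small
tree.fit(df[criteria], df["evaluability"])

# Prints the tree as nested IF ... THEN rules, e.g. "identifiable_outputs <= 2.5"
print(export_text(tree, feature_names=criteria))
```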

Improvements: This model could be improved in two ways. Firstly, the outcome measure, which is an above/below-average aggregate score for each project, could be made more demanding, so that only projects in the top quartile were rated as having good evaluability. We may want to set a higher standard.
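As a rough illustration of this first refinement, assuming the same hypothetical file and column names as in the earlier sketch, the recoding might look like this (remembering that 1 = best and 4 = worst, so the best quartile is the lowest 25% of aggregate scores):

```python
# Minimal sketch: recode the outcome so that only the best quartile of
# aggregate scores counts as "good evaluability". File and column names are
# hypothetical; lower scores are better (1 = best, 4 = worst).
import pandas as pd

df = pd.read_csv("evaluability_scores.csv")
criteria = [c for c in df.columns if c != "evaluability"]

agg_score = df[criteria].sum(axis=1)          # unweighted aggregate, as in the original
threshold = agg_score.quantile(0.25)          # 25th percentile of aggregate scores
df["evaluability"] = (agg_score <= threshold).astype(int)   # 1 = good, 0 = not good
```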

Secondly, the assumption that all criteria are of equal importance, and thus that their scores can simply be added up, could be questioned. Different weights could be given to each criterion, according to its perceived causal importance (i.e. the effects it will have). This will not necessarily bias the Decision Tree model towards using those criteria in a predictive model. If all projects were rated highly on a highly weighted criterion, that criterion would have no particular value as a means of discriminating between them, so it would be unlikely to feature in the Decision Tree at all.
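The second refinement could be sketched as a weighted aggregate score. The weights and column names below are purely illustrative assumptions about causal importance, not values taken from the ITAD report:

```python
# Minimal sketch: replace the equal-weight sum with a weighted aggregate.
# The weights and column names below are illustrative assumptions only.
import pandas as pd

df = pd.read_csv("evaluability_scores.csv")   # hypothetical file name
weights = {
    "identifiable_outputs": 2.0,              # judged more causally important
    "attribution_to_intervention": 1.5,
    "broad_ownership_of_purpose": 1.0,
    # ... and so on, one weight per criterion
}
weighted_score = sum(w * df[c] for c, w in weights.items())
```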

Weighting, and perhaps subsequently re-weighting, criteria may also help reconcile any conflict between what are accurate prediction rules and what seems to make sense as a combination of criteria that will cause high or low evaluability. For example, in the above model it seems odd that a criterion of merit (broad ownership of project purpose) should help us identify projects that have poor evaluability.

Your comments are welcome

PS: For a pop-science account of predictive modelling, see Eric Siegel's book Predictive Analytics.

Wednesday, February 13, 2013

My two particular problems with RCTs


Up till now I have tried not to take sides in the debate when it is crudely cast as between those "for" and those "against" RCTs (Randomised Controlled Trials). I have always thought that there are "horses for courses" and that there is a time and place for RCTs, along with other methods, including non-experimental methods, for evaluating the impact of an intervention. I should also disclose that my first degree included a major and sub-major in psychology, much of which was experimental psychology. Psychologists have spent a lot of time thinking about rigorous experimental methods. Some of you may be familiar with one of the better known contributors to the wider debates about methodology in the social sciences - Donald T. Campbell - a psychologist whose influence has spread far beyond psychology. Twenty years after my first degree, his writings on epistemology influenced the direction of my PhD, which was not about experimental methods. In fact it was almost the opposite in orientation - the Most Significant Change (MSC) technique was one of its products.

This post has been prompted by my recent reading of two examples of RCT applications, one which has been completed and one which has been considered but not yet implemented. They are probably not exemplars of good practice, but in that respect they may still be useful, because they point to where RCTs should not be used. The completed RCT was of a rural development project in India. The contemplated RCT was on a primary education project in a Pacific nation. Significantly, both were large-scale projects covering many districts in India and many schools in the Pacific nation.

Average effects

The first problem I had is with the use of the concept of the Average Treatment Effect (ATE) in these two contexts. The India RCT found a statistically significant difference in the reduction in poverty of households involved in a rural development project, when compared to those who had not been involved. I have not queried this conclusion. The sample looked decent in size and the randomisation looked fine. The problem I have is with what was chosen as the "treatment". The treatment was the whole package of interventions provided by the project. This included various modalities of aid (credit, grants, training) in various sectors (agriculture, health, education, local governance and more). It was a classic "integrated rural development" project, where a little bit of everything seemed to be on offer, delivered partly according to the designs of the project managers, and partly according to beneficiary plans and preferences. So, in this context, how sensible is it to seek the average effects on households of such a mixed-up salad of activities? At best it tells us that if you replicate this particular mix (and God knows how you will do that...) you will be able to deliver the same significant impact on poverty. Assuming that can be done, this must still be about the most inefficient replication strategy available. Much more preferable would be to find which particular project activities (or combinations thereof) were most effective in reducing poverty, and then to replicate those.
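For reference, the quantity such an RCT estimates is simply the difference in mean outcomes between the randomised treatment and comparison groups. A minimal sketch, with made-up numbers rather than data from the India study:

```python
# Minimal sketch of an average treatment effect (ATE) estimate: the difference
# in mean outcomes between randomised treatment and control groups.
# "households" is made-up illustrative data, not data from the India RCT.
import pandas as pd

households = pd.DataFrame({
    "treated":       [1, 1, 1, 0, 0, 0],
    "poverty_score": [2.1, 1.8, 2.4, 3.0, 2.8, 3.2],
})

ate = (households.loc[households.treated == 1, "poverty_score"].mean()
       - households.loc[households.treated == 0, "poverty_score"].mean())
print(f"Estimated ATE: {ate:.2f}")
# Note: this single number averages over whatever mix of activities each
# household actually received - which is exactly the problem discussed above.
```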

Even the accountability value of the RCT finding was questionable. Where direct assistance is being provided to households, a plausible argument could be made that process tracing (by a decent auditor) would provide good enough assurance that assistance was reaching those intended. In other words, pay more attention to the causal "mechanism".

The proposed RCT of the primary education project had similar problems, in terms of its conception of a testable treatment. It proposed comparing the impact of two project "components", by themselves and in combination. However, as in India, each of these project components contained a range of different activities which would be variably made available and variably taken up across the project's locations.

Such projects are commonplace in development aid. Projects focusing on a single intervention, such as immunisation or cash transfers, are the exception, not the rule. The complex design of most development projects, tacitly if not explicitly, reflects a widespread view that promoting development involves multiple activities, whose specific composition often needs to be localised.

To summarise: it is possible to calculate average treatment effects, but it is questionable how useful that is in the project settings I have described - where there is a substantial diversity of project activities and combinations thereof.


Context

It's commonplace amongst social scientists, especially the more qualitatively oriented, to emphasise the importance of context. Context is also important in the use of experimental methods, because it is a potential source of confounding factors, confusing the impact of an independent variable under investigation.

There are two ways of dealing with context. One is to rule it out, e.g. by randomising access to treatment so that historical and contextual influences are, on average, the same for intervention and control groups. This was done in both the India and Pacific RCT examples. In India there were significant caste and class variations that could have influenced project outcomes. In the Pacific there were significant ethnic and religious differences. Such diversity often seems to be inherent in large-scale development projects.

The result of using this ruling-out strategy is hopefully a rigorous conclusion about the effectiveness of an intervention, that stands on its own, independent of the context. But how useful will that be? Replication of the same or similar project will have to take place in a real location where context will have its effects. How sensible is it to remain intentionally ignorant of those likely effects?

The alternative strategy is to include potentially relevant contextual factors into an analysis. Doing so takes us down the road of a configurational view of causation, embodied in the theory-led approaches of Realist Evaluation and QCA, and also in the use of data mining procedures that are less familiar to evaluators (Davies, 2012).

Evaluation as the default response

In the Pacific project it was even questionable whether an evaluation spanning a period of years was the right approach (RCT-based or otherwise). Outcomes data, in terms of student participation and performance, will be available on a yearly basis through various institutional monitoring mechanisms. Education is an area where data abounds, relative to many other development sectors, notwithstanding the inevitable quality issues. It could be cheaper, quicker and more useful to develop and test (annually) predictive models of the outcomes of concern. One can even imagine using crowdsourcing services like Kaggle to do so. As I have argued elsewhere, we could benefit by paying more attention to monitoring, relative to evaluation.
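As a sketch of what "develop and test (annually) predictive models" could mean in practice, here is one way it might be done, assuming a hypothetical table of school-level monitoring data with a year column, some predictor fields and a binary outcome of concern:

```python
# Minimal sketch (hypothetical data set, file and column names): train a
# predictive model on one year's monitoring data and test it on the next
# year's data, instead of waiting for a multi-year evaluation.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

schools = pd.read_csv("school_monitoring.csv")
predictors = ["attendance_rate", "pupil_teacher_ratio", "textbooks_per_pupil"]
outcome = "met_performance_target"            # hypothetical binary outcome

train = schools[schools.year == 2012]
test = schools[schools.year == 2013]

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(train[predictors], train[outcome])

print("Accuracy on the following year's data:",
      accuracy_score(test[outcome], model.predict(test[predictors])))
```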

In summary, be wary of using RCTs where development interventions are complex and variable, where there are big differences in the context in which they take place, and where an evaluation may not even be the most sensible default option.
 


Tuesday, September 11, 2012

Evolutionary strategies for complex environments


 [This post is in response to Owen Barder’s blog posting “If development is complex, is the results agenda bunk?”]

Variation, selection and retention are at the core of the evolutionary algorithm. This algorithm has enabled the development of incredibly sophisticated organisms able to survive in a diversity of complex environments over vast spans of time. Following the advent of computerisation, the same algorithm has been employed by Homo sapiens to solve complex design and optimisation problems in many fields of science and technology. It has also informed thinking about the history and philosophy of science (Toulmin; Hull, 1988; Dennett, 1996) and even cosmology (Lee Smolin, 1997). Even advocates of experimental approaches to building knowledge, now much debated by development agencies and workers, have been keen proponents of evolutionary views on the nature of learning (Donald Campbell, 1969).

So, it is good to see these ideas being publicised by the likes of Owen Barder. I would like to support his efforts by pointing out that the application of an evolutionary approach to learning and knowledge may in fact be easier than it seems on first reading of Owen’s blog. I have two propositions for consideration.

1. Re Variation: New types of development projects may not be needed. From 2006 to 2010 I led annual reviews of four different maternal and infant health projects in Indonesia. All of these projects were being implemented in multiple districts. In Indonesia, district authorities have considerable autonomy. Not surprisingly, the ways the projects were being implemented in each district varied, both intentionally and unintentionally. So did the results. But this diversity of contexts, interventions and outcomes was not exploited by the LogFrame-based monitoring systems associated with each project. The LogFrames presented a singular view of "the project", one where aggregated judgements were needed about the whole set of districts involved. Diversity existed but was not being recognised and fully exploited. In my experience this phenomenon is widespread. Development projects are frequently implemented in multiple locations in parallel. In practice, implementation often varies across locations, by accident and by intention. There is often no shortage of variation. There is, however, a shortage of attention to such variation. The problem is not so much in project design as in M&E approaches that fail to demand attention to variation - to ranges and exceptions as well as central tendencies and aggregate numbers.

2. Re Selection: Fitness tests are not that difficult to set up, once you recognise and make use of internal diversity. Locations within a project can be rank-ordered by expected success, then rank-ordered by observed success, using participatory and/or other methods. The rank order correlation of these two measures is a measure of fitness - of design to context. Outliers are the important learning opportunities (high expected & low actual success, low expected & high actual success) that warrant detailed case studies. The other extremes (most expected & actual success, least expected & actual success) also need investigation, to make sure the internal causal mechanisms are as per the prior Theory of Change that informed the ranking.
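A minimal sketch of such a fitness test, using made-up rankings: the Spearman rank correlation between expected and observed success across locations gives the fitness measure, and the largest rank gaps flag the outlier locations that deserve case studies.

```python
# Minimal sketch (made-up data): measure "fitness of design to context" as the
# rank correlation between expected and observed success across locations,
# and flag the outliers that deserve detailed case studies.
import pandas as pd
from scipy.stats import spearmanr

locations = pd.DataFrame({
    "location":      ["A", "B", "C", "D", "E", "F"],
    "expected_rank": [1, 2, 3, 4, 5, 6],     # 1 = most expected success
    "observed_rank": [2, 1, 6, 3, 5, 4],     # 1 = most observed success
})

rho, p_value = spearmanr(locations.expected_rank, locations.observed_rank)
print(f"Fitness (Spearman rho): {rho:.2f}")

# The locations whose observed rank differs most from their expected rank
locations["rank_gap"] = (locations.expected_rank - locations.observed_rank).abs()
print(locations.sort_values("rank_gap", ascending=False).head(2))
```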

It is possible to incorporate evolutionary ideas into the design of M&E systems. Some readers may know some of the background to the Most Significant Change (MSC) impact monitoring technique. Its design was informed by evolutionary epistemology. The MSC process deliberately includes an iterated process of exploiting diversity (of perceptions of change), subjecting these to selection processes (via structured choice exercises by stakeholders) and retention (of selected change accounts for further use by the organisation involved). MSC was tried out by a Bangladeshi NGO in 1993, retained, and then expanded in use over the next ten years. In parallel, it was also tried out by development agencies outside Bangladesh in the following years, and is now widely used by development NGOs. As a technique it has survived and proliferated. Although it is based on evolutionary ideas, I suspect that no more than 1 in 20 users might recognise this. No matter - nor are finches likely to be aware of Darwin's evolutionary theory. Ideally the same might apply to good applications of complexity theory.

Our current thinking about Theories of Change (ToC) is ripe for some revolutionary thinking, aided by an evolutionary perspective on the importance of variation and diversity. Singular theories abound, both in textbooks (Funnell and Rogers, 2011) and in examples developed in practice by organisations I have been in contact with. All simple results-chain models are by definition singular theories of change. More complex network-like models, with multiple pathways to a given set of expected outcomes, are a step in the right direction. I have seen these in some DFID policy area ToCs. But what is really needed are models that consider a diversity of outcomes as well as of means of getting there. One possible means of representing these models, which I am currently exploring, is the use of Decision Trees. Another, which I explored many years ago and which I think deserves more attention, is a scenario planning type tool called Evolving Storylines. Both make use of divergent tree structures, as did Darwin when illustrating his conception of the evolutionary process in On the Origin of Species.

Friday, August 03, 2012

AusAID’s 'Revitalising Indonesia's Knowledge Sector for Development Policy' program


Enrique Mendizabal has suggested I might like to comment on the M&E aspects of AusAID’s 'Revitalising Indonesia's Knowledge Sector for Development Policy' program, discussed on AusAID’s Engage blog and Enrique’s On Think Tanks blog.

Along with Enrique, I think the Engage blog posting on the new Indonesian program is a good development. It would be good to see this happening at the early stages of other AusAID programs. Perhaps it already has.

Enrique notes that “A weak point and a huge challenge for the programme is its monitoring and evaluation. I am afraid I cannot offer much advice on this except that it should not be too focused on impact while not paying sufficient attention to the inputs.”

I can only agree. It seems ironic that so much attention is paid these days to assessing impact, while at the same time most LogFrames no longer seem to bother to detail the activities level. Yet in practice the intervening agencies can reasonably be held most responsible for activities and outputs, and least responsible for impact. It seems like the pendulum of development agency attention has swung too far, from an undue focus on activities in the (often mythologised) past to an undue emphasis on impact in the present.

Enrique suggests that “A good alternative is to keep the evaluation of the programme independent from the delivery of the programme and to look for expert impact evaluators based in universities (and therefore with a focus on the discipline) to explore options and develop an appropriate approach for the programme. While the contractor may manage it, it should be under a sub-contract that clearly states the independence of the evaluators. Having one of the partners in the bid in charge of the evaluation is only likely to create disincentives towards objectivity.”

This is a complex issue and there is unlikely to be a single simple solution. In a sense, evaluation has to be part of the work of all the parties involved, but there need to be checks and balances to make sure this is being done well. Transparency, in the form of public access to plans and progress reporting, is an important part. Evaluability assessment of what is proposed is another. Meta-evaluation and synthesis of what has been done is another. There will be a need for M&E roles both within and outside the management structures. I agree with Guggenheim and Davis’ (AusAID) comment that “the contractor needs to be involved in setting the program outcomes alongside AusAID and Bappenas because M&E begins with clarity on what a program’s objectives are”.

Looking at the two pages on evaluation in the AusAID design document (see pages 48-9), there are some things new and some things old. The focus on evaluation as hypothesis testing seems new, and is something I have argued for in the past, in place of numerous and often vague evaluation questions. On the other hand, the counting of products produced and viewed seems stuck in the past: necessary, but far from sufficient. What is needed is a more macro perspective on change, which might be evident in: (a) changes in the structure of relationships between the institutional actors involved, and (b) changes in the content of the individual relationships. Producing and using products is only one part of those relationships. The categories of “supply”, “demand”, “intermediaries” and “enabling environment” are a crude first approximation of what is going on at this more macro level, which hopefully will soon be articulated in more detail.

The discussion of the Theory of Change in the annex to the design document is also interesting. On the one hand, the authors rightly argue that this project and its setting are complex and that “For complex settings and problems, the ‘causal chain’ model often applied in service delivery programs is too linear and simplistic for understanding policy influence”. On the other hand, some pages later there is the inevitable, and perhaps mandatory, linear Program Logic Model, with nary a single feedback loop.

One of the purposes of the ToC (presumably including the Program Logic Model) is to “guide the implementation team to develop a robust monitoring and evaluation system”. If so, it seems to me that this would be much easier if the events described in the Program Logic Model were being undertaken by identifiable actors (or categories thereof). However, reading the Program Logic Model we see references only to the broadest categories (government agencies, government policy makers, research organisations and networks of civil society), with one exception – Bappenas.

Both these problems of how to describe complex change processes are by no means unique to AusAID; they are endemic in aid organisations. Yet at the same time, all the participants in the discussion I am now part of are enmeshed in an increasingly socially and technologically networked world. We are surrounded by social networks, yet seemingly incapable of planning in these terms. As they say, “Go figure”.


 
PS: I also meant to say that I strongly support Enrique’s suggestion that the ToRs for development projects, and the bids received in response to those ToRs, should be publicly available online and used as a means of generating discussion about what should be done. I think that, in the case of DFID at least, the ToRs are already available online to companies registering an interest in bidding for aid work. However, open debate is not facilitated and is unlikely to happen if the only parties present are the companies competing with each other for the work.


Tuesday, June 05, 2012

Open source evaluation - the way forward?


DFID has set up a special website at projects.dfid.gov.uk where anyone can search for and find details of the development projects it has funded.

As of June this year you can find details of 1512 operational projects, 1767 completed projects and 102 planned projects. The database is updated monthly as a result of an automated trawl through DFID's internal databases. It has been estimated that the database covers 98% of all current projects, with the remaining 2% being omitted for security and other reasons.

There are two kinds of search facilities: (a) by keywords, and (b) by choices from drop-down menus. These searches can be combined to narrow a search (in effect, using an AND). But more complex searches using OR or NOT are not yet possible.

The search results are in two forms: (a) A list of projects shown on the webpage, which can also be downloaded as an Excel file. The Excel file has about 30 fields of data, many more than are visible in the webpage listing of the search results; (b) Documents produced by each project, a list of which is viewable after clicking on any project name in a search result. There are 10 different kinds of project documents, ranging from planning documents to progress reports and project completion reports. Evaluation reports are not yet available on this website.

In practice the coverage of project documents is still far from comprehensive. This was my count of what was available in early May, when I was searching for documents relating to DFID's 27 focus countries (out of 103 countries it has worked in up to 2010).
| Documents available | Operational projects up to and including 2010 (27 countries only) | Post-2010 operational projects (27 countries only) |
|---|---|---|
| Business Case and Intervention Summary | 8% (27) | 40% (108) |
| Logical Frameworks | 22% (74) | 39% (104) |
| Annual Reviews | 17% (55) | 9% (24) |
| Number of projects | 100% (330) | 100% (270) |

Subject to its continued development, this online database has great potential for enabling what could be called "open source" evaluation, i.e. the investigation of DFID projects by anyone who can access the website.

With this in mind, I would encourage you to post comments here about:
  •  possible improvements to the database
  •  possible uses of the database.
Re improvements: after I publicised the database on the MandE NEWS email list, one respondent made the useful suggestion that the database should include weblinks to existing project websites. Even if a project has closed and its website is no longer maintained, access to the website may still be possible through web archive services such as the Internet Archive's Wayback Machine [PS: it contains copies of www.mande.co.uk dating back to 1998!]

Re uses of the database: earlier today I did a quick analysis of two downloaded Excel files, one of completed projects and the other of currently operational projects, looking at the proportion of high-risk projects in these two groups. The results:



| Risk rating | Completed projects | Operational projects |
|---|---|---|
| High risk | 331 (12%) | 677 (19%) |
| Medium risk | 1328 (47%) | 1159 (33%) |
| Low risk | 1167 (41%) | 1648 (47%) |





DFID appears to be funding more high-risk projects than in the past. Unfortunately, we don't know what time period the "completed projects" category comes from, or what percentage of projects from that period are listed in the database. Perhaps this is another possible improvement to the database: make the source(s) and limitations of the data more visible.
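For anyone wanting to repeat this kind of quick check, a minimal sketch using pandas is given below. The file names and the risk-rating column name are guesses, not DFID's actual field names:

```python
# Minimal sketch: compare the distribution of risk ratings between the
# completed-projects and operational-projects downloads from projects.dfid.gov.uk.
# File and column names are assumptions, not DFID's actual field names.
import pandas as pd

completed = pd.read_excel("completed_projects.xlsx")
operational = pd.read_excel("operational_projects.xlsx")

for label, df in [("Completed", completed), ("Operational", operational)]:
    counts = df["Risk rating"].value_counts()
    print(label)
    print((counts / counts.sum() * 100).round(1))   # % of projects in each risk category
```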

PS 16 June: A salutary blog posting on "What data can and cannot do", including reminders that...
  •  Data is not a force unto itself
  •  Data is not a perfect reflection of the world
  •  Data does not speak for itself
  •  Data is not power
  •  Interpreting data is not easy