Tuesday, September 11, 2012

Evolutionary strategies for complex environments


 [This post is in response to Owen Barder’s blog posting “If development is complex, is the results agenda bunk?”]

Variation, selection and retention are at the core of the evolutionary algorithm. This algorithm has enabled the development of incredibly sophisticated organisms able to survive in a diversity of complex environments over vast spans of time. Following the advent of computerisation, the same algorithm has been employed by Homo sapiens to solve complex design and optimisation problems in many fields of science and technology. It has also informed thinking about the history and philosophy of science (Toulmin; Hull, 1988; Dennett, 1996) and even cosmology (Smolin, 1997). Even advocates of experimental approaches to building knowledge, now much debated by development agencies and workers, have been keen proponents of evolutionary views on the nature of learning (Campbell, 1969).

So, it is good to see these ideas being publicised by the likes of Owen Barder. I would like to support his efforts by pointing out that the application of an evolutionary approach to learning and knowledge may in fact be easier than it seems on first reading of Owen’s blog. I have two propositions for consideration.

1. Re Variation: New types of development projects may not be needed. From 2006 to 2010 I led annual reviews of four different maternal and infant health projects in Indonesia, all of which were being implemented in multiple districts. In Indonesia district authorities have considerable autonomy. Not surprisingly, the way each project was implemented varied from district to district, both intentionally and unintentionally. So did the results. But this diversity of contexts, interventions and outcomes was not exploited by the LogFrame based monitoring systems associated with each project. The LogFrames presented a singular view of "the project", one where aggregated judgements were needed about the whole set of districts involved. Diversity existed but was not being recognised and fully exploited. In my experience this phenomenon is widespread. Development projects are frequently implemented in multiple locations in parallel. In practice implementation often varies across locations, by accident and by intention. There is often no shortage of variation. There is, however, a shortage of attention to such variations. The problem is not so much in project design as in M&E approaches that fail to demand attention to variation, to ranges and exceptions as well as central tendencies and aggregate numbers.

2. Re Selection: Fitness tests are not that difficult to set up, once you recognise and make use of internal diversity. Locations within a project can be rank ordered by expected success, then rank ordered by observed success, using participatory and/or other methods. The rank order correlation of these two measures is a measure of fitness, of how well the design fits its context. Outliers (high expected and low actual success, low expected and high actual success) are the important learning opportunities that warrant detailed case studies. The other extremes (most expected and actual success, least expected and actual success) also need investigation, to make sure the internal causal mechanisms are as per the prior Theory of Change that informed the ranking.
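For illustration, here is a minimal sketch of this fitness test in Python. The district names, rankings and outlier threshold are all invented for the example; only scipy's standard spearmanr function is assumed.

```python
# Fitness test: rank-order correlation between expected and observed success.
# Districts, rankings and the outlier threshold are hypothetical.
from scipy.stats import spearmanr

districts = ["A", "B", "C", "D", "E", "F"]
expected = [1, 2, 3, 4, 5, 6]   # rank by expected success (1 = most expected)
observed = [2, 1, 6, 4, 5, 3]   # rank by observed success

rho, p_value = spearmanr(expected, observed)
print(f"Fitness (Spearman's rho): {rho:.2f} (p = {p_value:.2f})")

# Outliers - large gaps between expected and observed rank - are the
# learning opportunities that warrant detailed case studies.
for name, e, o in zip(districts, expected, observed):
    if abs(e - o) >= 3:
        print(f"District {name}: expected rank {e}, observed rank {o} -> case study")
```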

It is possible to incorporate evolutionary ideas into the design of M&E systems. Some readers may know some of the background to the Most Significant Change (MSC) impact monitoring technique. Its design was informed by evolutionary epistemology. The MSC process deliberately includes an iterated process of exploiting diversity (of perceptions of change), subjecting these to selection processes (via structured choice exercises by stakeholders) and retention (of selected change accounts for further use by the organisation involved). MSC was tried out by a Bangladeshi NGO in 1993, retained, and then expanded in use over the next ten years. In parallel, it was also tried out by development agencies outside Bangladesh in the following years, and is now widely used by development NGOs. As a technique it has survived and proliferated. Although it is based on evolutionary ideas, I suspect that no more than 1 in 20 users might recognise this. No matter: nor are finches likely to be aware of Darwin's evolutionary theory. Ideally the same might apply to good applications of complexity theory.

Our current thinking about Theories of Change (ToC) is ripe for some revolutionary thinking, aided by an evolutionary perspective on the importance of variation and diversity. Singular theories abound, both in textbooks (Funnell and Rogers, 2011) and in examples developed in practice by organisations I have been in contact with. All simple results chain models are by definition singular theories of change. More complex network-like models with multiple pathways to a given set of expected outcomes are a step in the right direction. I have seen these in some DFID policy area ToCs. But what is really needed are models that consider a diversity of outcomes as well as means of getting there. One possible means of representing these models, which I am currently exploring, is the use of Decision Trees. Another, which I explored many years ago and which I think deserves more attention, is a scenario planning type tool called Evolving Storylines. Both make use of divergent tree structures, as did Darwin when illustrating his conception of the evolutionary process in On the Origin of Species.

Friday, August 03, 2012

AusAID’s 'Revitalising Indonesia's Knowledge Sector for Development Policy' program


Enrique Mendizabal has suggested I might like to comment on the M&E aspects of AusAID's 'Revitalising Indonesia's Knowledge Sector for Development Policy' program, discussed on AusAID's Engage blog and Enrique's On Think Tanks blog.

Along with Enrique, I think the Engage blog posting on the new Indonesian program is a good development. It would be good to see this happening at the early stages of other AusAID programs. Perhaps it already has.

Enrique notes that "A weak point and a huge challenge for the programme is its monitoring and evaluation. I am afraid I cannot offer much advice on this except that it should not be too focused on impact while not paying sufficient attention to the inputs."

I can only agree. It seems ironic that so much attention is paid these days to assessing impact, while at the same time most LogFrames no longer seem to bother to detail the activities level. Yet, in practice, the intervening agencies can reasonably be held most responsible for activities and outputs and least responsible for impact. It seems the pendulum of development agency attention has swung too far, from an undue focus on activities in the (often mythologised) past to an undue emphasis on impact in the present.

Enrique suggests that "A good alternative is to keep the evaluation of the programme independent from the delivery of the programme and to look for expert impact evaluators based in universities (and therefore with a focus on the discipline) to explore options and develop an appropriate approach for the programme. While the contractor may manage it, it should be under a sub-contract that clearly states the independence of the evaluators. Having one of the partners in the bid in charge of the evaluation is only likely to create disincentives towards objectivity."

This is a complex issue and there is unlikely to be a single simple solution. In a sense, evaluation has to be part of the work of all the parties involved, but there need to be checks and balances to make sure this is being done well. Transparency, in the form of public access to plans and progress reporting, is an important part. Evaluability assessments of what is proposed are another part. Meta-evaluations and syntheses of what has been done are another. There will be a need for M&E roles within and outside the management structures. I agree with Guggenheim and Davis' (AusAID) comment that "the contractor needs to be involved in setting the program outcomes alongside AusAID and Bappenas because M&E begins with clarity on what a program's objectives are".

Looking at the two pages on evaluation in the AusAID design document (see pages 48-49), there are some things new and some things old. The focus on evaluation as hypothesis testing seems new, and is something I have argued for in the past, in place of numerous and often vague evaluation questions. On the other hand, the counting of products produced and viewed seems stuck in the past: necessary, but far from sufficient. What is needed is a more macro-perspective on change, which might be evident in: (a) changes in the structure of relationships between the institutional actors involved, and (b) changes in the content of the individual relationships. Producing and using products is only one part of those relationships. The categories of "supply", "demand", "intermediaries" and "enabling environment" are a crude first approximation of what is going on at this more macro level, which hopefully will soon be articulated in more detail.

The discussion of the Theory of Change in the annex to the design document is also interesting. On the one hand, the authors rightly argue that this project and setting is complex and that "For complex settings and problems, the 'causal chain' model often applied in service delivery programs is too linear and simplistic for understanding policy influence". On the other hand, some pages later there is the inevitable, and perhaps mandatory, linear Program Logic Model, with nary a single feedback loop.

One of the purposes of the ToC (presumably including the Program Logic Model) is to "guide the implementation team to develop a robust monitoring and evaluation system". If so, it seems to me that this would be much easier if the events described in the Program Logic Model were being undertaken by identifiable actors (or categories thereof). However, reading the Program Logic Model we see references to only the broadest categories (government agencies, government policy makers, research organisations and networks of civil society), with one exception: Bappenas.

Both of these problems of how to describe complex change processes are by no means unique to AusAID; they are endemic in aid organisations. Yet at the same time, all the participants in this discussion are enmeshed in an increasingly socially and technologically networked world. We are surrounded by social networks, yet seemingly incapable of planning in these terms. As they say, "Go figure".

PS: I also meant to say that I strongly support Enrique’s suggestion that the ToRs for development projects, and the bids received in response to those ToRs, should be publicly available online, and used as a means of generating discussion about what should be done. I think that in the case of DFID at least the ToRs are already available online, to companies registering as interested in bidding for aid work. However, open debate is not facilitated and is unlikely to happen if the only parties present are the companies competing with each other for the work.


Tuesday, June 05, 2012

Open source evaluation - the way forward?


DFID has set up a special website at projects.dfid.gov.uk where anyone can search for and find details of the development projects it has funded.

As of June this year you can find details of 1512 operational projects, 1767 completed projects and 102 planned projects. The database is updated monthly as a result of an automated trawl through DFID's internal databases. It has been estimated that the database covers 98% of all current projects, with the remaining 2% being omitted for security and other reasons.

There are two kinds of search facilities: (a) by key words, and (b) by choices from drop-down menus. These searches can be combined to narrow a search (in effect, using an AND), but more complex searches using OR or NOT are not yet possible.

The search results are in two forms: (a) A list of projects shown on the webpage, which can also be downloaded as an Excel file. The Excel file has about 30 fields of data, many more than are visible in the webpage listing of the search results; (b) Documents produced by each project, a list of which is viewable after clicking on any project name in a search result. There are 10 different kinds of project documents, ranging from planning documents to progress reports and project completion reports. Evaluation reports are not yet available on this website.

In practice the coverage of project documents is still far from comprehensive. This was my count of what was available in early May, when I was searching for documents relating to DFID's 27 focus countries (out of 103 countries it has worked in up to 2010).
Documents available                        Up to and including 2010    Post-2010 projects
(% of all operational projects)            (27 countries only)         (27 countries only)
Business Case and Intervention Summary     8% (27)                     40% (108)
Logical Frameworks                         22% (74)                    39% (104)
Annual Reviews                             17% (55)                    9% (24)
Number of projects                         100% (330)                  100% (270)

Subject to its continued development, this online database has great potential for enabling what could be called "open source" evaluation, i.e. investigation of DFID projects by anyone who can access the website.

With this in mind, I would encourage you to post comments here about:
  •  possible improvements to the database
  •  possible uses of the database.
Re improvements: after publicising the database on the MandE NEWS email list, one respondent made the useful suggestion that the database should include weblinks to existing project websites. Even if a project has closed and its website is no longer maintained, access may still be possible through web archive services such as the Internet Archive's Wayback Machine. [PS: It contains copies of www.mande.co.uk dating back to 1998!]

Re uses of the database: earlier today I did a quick analysis of two downloaded Excel files, one of completed projects and the other of currently operational projects, looking at the proportion of high risk projects in each group. The results:



                 Completed projects    Operational projects
High risk        331 (12%)             677 (19%)
Medium risk      1328 (47%)            1159 (33%)
Low risk         1167 (41%)            1648 (47%)

DFID appears to be funding more high risk projects than in the past. Unfortunately, we don't know what time period the "completed projects" category comes from, or what percentage of projects from that period are listed in the database. Perhaps this is another possible improvement to the database: make the source(s) and limitations of the data more visible.
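For what it is worth, the quick analysis above is easy to reproduce with pandas on the two downloaded Excel files. The file names and the "Risk Rating" column name below are assumptions; check them against the actual field names in DFID's export.

```python
# Compare the risk profiles of completed vs operational projects, using
# the two Excel files downloaded from projects.dfid.gov.uk.
# "Risk Rating" is an assumed column name - check the actual export.
import pandas as pd

completed = pd.read_excel("completed_projects.xlsx")
operational = pd.read_excel("operational_projects.xlsx")

for label, df in [("Completed", completed), ("Operational", operational)]:
    counts = df["Risk Rating"].value_counts()
    shares = df["Risk Rating"].value_counts(normalize=True)
    print(f"\n{label} projects (n = {len(df)})")
    for level in ["High", "Medium", "Low"]:
        if level in counts:
            print(f"  {level} risk: {counts[level]} ({shares[level]:.0%})")
```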

PS 16 June: A salutary blog posting on "What data can and cannot do", including reminders that:
  •  Data is not a force unto itself
  •  Data is not a perfect reflection of the world
  •  Data does not speak for itself
  •  Data is not power
  •  Interpreting data is not easy

Friday, June 01, 2012

Representing different combinations of causal conditions


This week I attended a workshop on QCA (Qualitative Comparative Analysis). QCA is a useful approach to analysing possible causality in small-n situations, i.e. where there are not many cases to examine (e.g. villages or districts), and where perhaps only categorical data is available. Equally importantly, QCA enables the identification of different configurations of conditions associated with observed outcomes in a set of cases. In that respect it shares the ambitions of the school of Realist Evaluation (Pawson and Tilley). The downside is that QCA findings are expressed in Boolean logic, which is not exactly user friendly. For example, here is the result of one analysis:

[Image: the Boolean solution expression produced by the QCA analysis]
Clue: in Boolean notation the symbol "+" means OR and the symbol "*" means AND. The letters in upper case refer to conditions present and the letters in lower case refer to conditions absent.
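To make the notation concrete: a hypothetical solution "A*b + C" (not the one above) says the outcome occurs where A is present and B absent, or where C is present. In code:

```python
# Reading Boolean solution notation: "+" = OR, "*" = AND,
# lower case = condition absent. "A*b + C" is a made-up example.
def solution(A, B, C):
    return (A and not B) or C

print(solution(A=True, B=False, C=False))   # True: first configuration holds
print(solution(A=False, B=True, C=True))    # True: second configuration holds
print(solution(A=False, B=False, C=False))  # False: neither configuration holds
```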

Decision trees


In parallel I have been reading about and testing some data mining methods, especially classification trees (see recent blog post). These are also able to identify multiple configurations of causal conditions. In addition, they produce user-friendly results in the form of tree diagrams, which are easy to read and understand. The same kind of decision trees can be used to represent the results of QCA analyses. In fact, they can be used in a wide range of ways, including more participatory and ethnographic forms of inquiry (see Ethnographic Decision Models). From an evaluation perspective I think Decision Trees could be a very valuable tool, one which could help us answer the frequently asked question of "what works well in what circumstances", because they can provide summary statements of the various combinations of conditions that lead to desired outcomes in complex circumstances.
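As a rough illustration of how such trees are grown from case data, here is a minimal sketch using scikit-learn rather than the Rapid Miner software mentioned elsewhere in this post. The conditions, cases and outcomes are all fictional.

```python
# Fit a classification tree on binary case data and print the learned
# rules as a readable tree. Conditions and outcomes are fictional.
from sklearn.tree import DecisionTreeClassifier, export_text

conditions = ["funding", "local_champion", "prior_training"]
# Each row is one case (e.g. a village); 1 = condition present, 0 = absent.
X = [[1, 1, 0],
     [1, 0, 0],
     [0, 1, 1],
     [1, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
y = [1, 0, 1, 1, 0, 0]  # 1 = desired outcome observed

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(export_text(tree, feature_names=conditions))
```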

In the first set of graphics below I have shown how Decision Trees can represent four important types of causal combination. These relate to whether an observed condition can be seen as a Necessary or Sufficient cause. The graphic is followed by four fictional data sets, each of which contains one of the causal combinations shown in the graphic (highlighted in yellow). Double click on the graphic to make it easier to read.
[Graphics: four types of causal combination, each illustrated with a fictional data set]
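In the same spirit, necessity and sufficiency can be checked directly against a set of cases. A small sketch, with invented case data:

```python
# Check whether a condition is necessary and/or sufficient for an outcome
# across a set of cases. The cases are fictional.
def necessary(cases, condition, outcome):
    # Necessary: the outcome never occurs without the condition.
    return all(c[condition] for c in cases if c[outcome])

def sufficient(cases, condition, outcome):
    # Sufficient: wherever the condition is present, the outcome occurs.
    return all(c[outcome] for c in cases if c[condition])

cases = [
    {"intervention": 1, "outcome": 1},
    {"intervention": 1, "outcome": 1},
    {"intervention": 0, "outcome": 0},
    {"intervention": 0, "outcome": 1},  # outcome occurred without the intervention
]

print("Necessary:", necessary(cases, "intervention", "outcome"))    # False
print("Sufficient:", sufficient(cases, "intervention", "outcome"))  # True
```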

Implications for evaluation work


There has been a lot of discussion amongst evaluators of development projects about whether it is appropriate to talk about causal attribution versus causal contribution, and in the latter case, how causal contribution can best be described. Descriptions of particular conditions in terms of whether they are necessary and/or sufficient are one way of doing so, especially when made visible in particular Decision Tree structures.

When necessary or sufficient conditions (types 1, 2 and 3) are believed to be present, this should provide some focus for evaluation efforts, enabling the direction of scarce evaluation attention towards the most vulnerable part of an explanatory model.

It has been argued that the most common causal configuration is type 4, where an intervention is a necessary part of a package but that package is not essential, and other packages can also generate the same results. If so, this suggests the need for some modesty by development agencies in their claims about making an impact, and some generosity of views about the importance of other influences.

How do decision trees relate to Theories of Change?

 

The comparator here is the kind of diagrammatic Theories of Change seen in Funnell and Rogers (2011) Purposeful Program Theory. A common feature of most of their examples is that they show a sequence of events over time, leading to an expected outcome. We could call them causal pathway ToC. In my view these would include LogFrames, although some people don't consider these as embodying a ToC.

I would argue that Decision Trees can also describe a ToC, but there are significant differences:

1. Decision Trees tend to describe multiple configurations that as a set can explain all observed outcomes. ToC, especially LogFrames, tend to describe a single configuration that will lead to one desired outcome. In doing so each part of the configuration appears to be necessary but not sufficient for the expected outcome.

2. Decision Trees describe configurations but not sequences. It is important to note that in Decision Trees there is no causal direction implied by relative positions in the branch structure, unlike in a ToC.  The sequence of conditions associated together along a branch could in theory be in any order. What matters is what conditions are associated with what.

3. Decision Tree models are testable. Unlike most causal pathway ToC (at least those that I know of), Decision Trees can be generated directly from one data set (i.e. a training set), and they can then be tested against another data set (i.e. test data) containing new cases with the same kinds of attributes and outcomes. These tests examine not only whether the predicted outcome happened when the expected attributes were present, but also whether the predicted outcome did not happen when the expected attributes were absent.
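A minimal sketch of this train-then-test routine, using scikit-learn and invented binary case data:

```python
# Grow a tree on training cases, then test its predictions on held-out
# cases. Accuracy counts both correctly predicted presence and correctly
# predicted absence of the outcome. Data is fictional.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = [[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1],
     [0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 0]] * 5  # 40 cases
y = [1, 0, 1, 1, 0, 1, 0, 0] * 5

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("Accuracy on unseen cases:", accuracy_score(y_test, model.predict(X_test)))
```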

Causal pathway ToC are testable, by examining whether their implementation leads to the achievement of target values on performance indicators. The opposite possibility is also testable in principle, by observing whether expected outcomes were absent when events in the causal pathway did not take place, via the use of a control group. However, compared to Decision Tree models, this kind of testing is much more laborious and requires considerable upfront preparation.

Despite the differences there is also some potential inter-operability between Decision Tree models and causal pathway ToC:

1. An expected causal sequence of events in a ToC (e.g. in a LogFrame) could be represented in a Decision Tree as a collection of attributes all located in one branch. Looking in the reverse direction, different branches of a Decision Tree can be seen as constituents of separate causal pathways in ToCs that have more of a network than a chain structure.

2. While LogFrames may be suitable for individual projects, Decision Tree models may be suitable for portfolios of projects, capturing the differences in contexts and interventions that are involved in different projects.

3. Decision trees have some compatibility with Realist Evaluators' ways of thinking about change. The Realist Evaluation formulation of "Context + Mechanism = Outcome" type configurations can easily be represented in the above tables by grouping conditions into broad categories: Context, Mechanism and Outcome conditions respectively.

Decision tree analysis of QCA data set

 

 Decision Tree algorithms can be used as a means of triangulating the results generated by other methods such as QCA. 

The following table of data can be found in a paper on "Women's Representation in Parliament: A Qualitative Comparative Analysis" by Krook (2010).

The values in this table were then converted to binary values, using various cut-off values explained in the paper, resulting in the table below.

In Krook's paper this data was analysed using QCA. I have taken the same data set and used Rapid Miner to generate the following Decision Tree, which enables you to find all cases where women's participation in national parliament was high (defined as above 17%).

The same result was found via the QCA analysis:

Translated this means:

IF Quotas AND Post-conflict situation THEN high % women in Parliament  [= far right branch]
OR
IF Women's status is high AND Post-conflict situation THEN high % women in Parliament [=3rd from left branch]
OR
IF Quotas AND NO post-conflict situation AND women's status is high THEN high % women in Parliament [=3rd from right branch]

Assessing the performance of decision trees

 

Relative to causal pathway ToC, there are many systematic ways to assess the performance of Decision Trees.

 1. When used for description purposes

There are two useful measures of the performance of decision trees when they have been developed as summary representations of known cases:

1. Purity: Are the cases found at the end of a branch all of one kind of outcome (i.e. pure), or a mixture of kinds?

2. Coverage: What proportion of all positive cases that exist are found at the end of any particular branch? In data mining exercises, branches that have low coverage are often "pruned", i.e. removed from the model, to reduce the complexity of the model (and thus help increase its generalisability).

QCA uses similar measures of consistency and coverage; see page 84 of the fsQCA manual.
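Both measures are straightforward to compute once a tree has been fitted. A sketch with scikit-learn, again on invented binary data:

```python
# Purity and coverage of each branch end (leaf) of a fitted tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 1]])
y = np.array([1, 0, 1, 0, 1, 1])  # fictional outcomes

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
leaf_ids = tree.apply(X)            # which leaf each case ends up in
total_positives = (y == 1).sum()

for leaf in np.unique(leaf_ids):
    in_leaf = leaf_ids == leaf
    positives = (y[in_leaf] == 1).sum()
    purity = positives / in_leaf.sum()       # share of the leaf's cases that are positive
    coverage = positives / total_positives   # share of all positive cases captured here
    print(f"Leaf {leaf}: purity = {purity:.0%}, coverage = {coverage:.0%}")
```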

Decision Trees can also be compared in terms of their simplicity, with simpler being better. The simplest measure is the number of branches in the tree, relative to the total number of cases (fewer = better). Another is the number of attributes used in the tree, relative to all available (fewer = better).

2. When used for prediction purposes

After having been developed as good descriptive models, decision trees are often then used as predictive models. At that stage different performance criteria come into play.

The most important metric is prediction accuracy: the ability of the Decision Tree to accurately classify new cases. From what I have read, a minimum level of accuracy is 80%, but the rationale for this cut-off point is unclear. Both predictive and descriptive accuracy can be measured using a Confusion Matrix.
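A confusion matrix simply tabulates predicted against observed outcomes, and accuracy falls out of it directly. For example, with fictional predictions:

```python
# Confusion matrix for a tree's predictions on new cases (fictional data).
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # observed outcomes
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]   # the tree's predictions

print(confusion_matrix(y_true, y_pred))  # rows = actual, columns = predicted
print("Accuracy:", accuracy_score(y_true, y_pred))  # 6 of 8 correct = 0.75
```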

"I wanted to add that a typical trade-off analysis is done with learners in general (and decision trees are no exception) that compares model accuracy within a data set to model accuracy at classifying new data. A more generalizable model would be more favorable for predictive analysis. A more accurate, specialized model would be good for understanding a particular data set. Limiting the tree-depth is (in my opinion) probably the fastest way to explore these trade-offs."[from rakirk on RapidMiner blog]

Greater descriptive accuracy risks what data mining specialists call "over-fitting": after a certain point is reached, the descriptive model's ability to accurately predict outcomes in a new set of cases will start to diminish (a classic trade-off between internal and external validity).
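The trade-off is easy to see by limiting tree depth on a synthetic data set, along the lines rakirk suggests. A sketch:

```python
# Vary the depth limit: deeper trees fit the training cases better but
# may classify new cases worse (over-fitting). Synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in [1, 2, 4, 8, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    print(f"max_depth={depth}: train accuracy={model.score(X_train, y_train):.2f}, "
          f"test accuracy={model.score(X_test, y_test):.2f}")
```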

Moore et al. (2001) provide criteria that mix both descriptive and predictive purposes. In their view "... the most desirable trees are:
1.  Accurate (low generalization error rates)
2.  Parsimonious (representing and generalizing the relationships succinctly)
3.  Non-trivial (producing interesting results)
4.  Feasible (time and resources)
5. Transparent and interpretable (providing high level representations of and insights into the data relationships, regularities, or trends)"

More information on decision trees, which is not maths intensive!

New software


PS July 2012: I have just found out about BigML, an online service where you can upload data, create Decision Tree models, test them and use them to generate predictions. So far it looks very good, although still under development. I have been offered extra invitation codes, which I can share with interested readers. If you are interested, email rick at gmail.com

I have been experimenting with two data sets on BigML: one is the results of a 2006 household poverty survey in Vietnam (596 cases, 23 attributes), and the other is a randomly generated data set (102 cases, 11 attributes).

A Decision Tree model of the household poverty data has the potential to enable people to do two things:

  • Find classification rules that identify households with poverty scores in a given range, e.g. above a known poverty level. Useful if you want to target assistance to specific groups.
  • Find the poverty score of a specific household with a given set of attributes. Useful if you want to see whether it is eligible for a targeted package of assistance.
Here is a graphic of the BigML Decision Tree model. It's unorthodox in that it does not display branches with negative cases, but this approach does simplify the layout of complex trees. On the right of the tree is the decision rule associated with the highlighted branch. The outcome it predicts (the leaf at the end of the branch) is the Basic Necessity Survey (BNS) poverty score for the households in that group (32 in the right side branch).

This tree has been minimally pruned, and shows branch ends containing 1% or more of all cases (i.e. 5 or more in this case). The highlighted branch shows one classification rule that accounts for about 8% of all households above the poverty line. All the green nodes in the tree account for around 92% of all households above the poverty line. The remainder will be found when the other coloured "leaf" nodes are clicked on.

My main finding from this exercise is that there is no classification rule that accounts for a large proportion of cases. The largest is one rule (Bathroom + Motorbike + Pesticide pump + Stone built house) that accounts for 31% of households above the poverty line. My interpretation is that this finding reflects the diversity of causal influences at work, the most probable being the agency of the households themselves.

PS 15 July 2012: Although at the start of this blog I made a clear distinction between four types of situations, where a condition or attribute is necessary and/or sufficient, it could be argued that there are degrees of necessity. If a complex decision tree has 25 branches (or explanatory rules), as in the above example, a certain condition may be present in many of the branches (as a necessary but not sufficient part of a package that is sufficient but not necessary, i.e. INUS). For example, having a watch is a condition present in 4 of the 25 branches. One way of looking for conditions that are relatively necessary is to look at the upper levels of the tree. Having a bathroom is relatively necessary: it is a necessary part of 14 of the 25 branches. This is still a fairly crude measure; we also need to take into account what proportion of all the cases are covered by these 14 branches. In this example, the 14 branches cover 70% of all the cases (households). Having a stone built house is not a necessary condition for being judged not-poor, but it is a fairly necessary condition!
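This notion of relative necessity can be computed directly from a tree's rules. A sketch, with rules and case counts invented to loosely mirror the example above:

```python
# "Degrees of necessity": for each condition, count the branches (rules)
# it appears in, and the share of cases those branches cover.
# The rules and case counts below are invented.
rules = [
    # (set of conditions in the branch, number of not-poor cases covered)
    ({"bathroom", "motorbike", "pesticide_pump", "stone_house"}, 93),
    ({"bathroom", "motorbike", "watch"}, 41),
    ({"bathroom", "television"}, 30),
    ({"watch", "radio"}, 12),
]
total_cases = sum(n for _, n in rules)

all_conditions = set().union(*(conds for conds, _ in rules))
for cond in sorted(all_conditions):
    covered = [n for conds, n in rules if cond in conds]
    print(f"{cond}: in {len(covered)} of {len(rules)} branches, "
          f"covering {sum(covered) / total_cases:.0%} of cases")
```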

PS 18 July 2012: One dimension of the structure of a Decision Tree is its "diversity". After Stirling (2007), diversity can be seen as a mix of variety (number of branches), balance (spread of cases across those branches) and disparity (distance between the ends of each pair of branches, measured by the number of intervening linkages). A rougher measure is simply the number of branches multiplied by the number of kinds of attributes making up all those branches. Diversity suggests, to me, a larger number of causes at work. How does this diversity connect to notions of complexity? Diversity and complexity are not simply one and the same thing. My reading is that complexity = diversity + structure (relationships between diverse entities). I need to go back and read, or finish reading, Page, S. (2011) "Diversity and Complexity" and "Diversity versus Complexity" by Shahid Naeem (2001).
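For what it is worth, variety and balance are straightforward to compute for a fitted tree (disparity takes more work). A sketch on randomly generated data:

```python
# Rough diversity measures for a fitted tree, after Stirling (2007):
# variety = number of branches (leaves); balance = evenness of the
# spread of cases across those branches (normalised entropy).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(100, 6))
y = (X[:, 0] & X[:, 1]) | X[:, 2]   # fictional outcome rule

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
leaves, counts = np.unique(tree.apply(X), return_counts=True)

variety = len(leaves)
p = counts / counts.sum()
balance = -(p * np.log(p)).sum() / np.log(variety)  # 1 = perfectly even spread

print(f"Variety (branches): {variety}, Balance: {balance:.2f}")
```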