Friday, March 16, 2012

Can we evolve explanations of observed outcomes?

In mathematics and computer science, an optimization problem is the problem of finding the best solution from all feasible solutions. There are various techniques for doing so.

Science as a whole can be seen as an optimisation process, involving a search for explanations that have the best fit with observed reality.

In evaluation we often have a similar task: identifying which aspects of one or more project interventions best explain the observed outcomes of interest. For example, the effects of various kinds of improvements in health systems on rates of infant mortality. This can be done in two ways. One is by looking internally at the design of a project, at its expected workings, and then trying to find evidence of whether it worked that way in practice. This is the territory of theory-led evaluation. The other way is to look externally, at alternative explanations involving other influences, and to seek to test those. This is ostensibly good practice but not very common in reality, because it can be time consuming and to some extent inconclusive, in that there may always be other explanations not yet identified and thus untested. This is where randomised control trials (RCTs) come in. Randomised allocation of subjects between control and intervention groups nullifies the possible influence of other external causes. Qualitative Comparative Analysis (QCA) takes a slightly different approach, searching for multiple possible configurations of conditions which are both necessary and sufficient to explain all observed outcomes (both positive and negative instances).

The value of theory-led approaches, including QCA, is that the evaluator’s theories help the search for relevant data, amongst the myriad of possibly relevant design characteristics, and combinations thereof. The absence of a clear theory of change is often one reason why baseline surveys are so expansive in content, yet rarely used. Without a halfway decent theory we can easily get lost. It is true that "There is nothing as practical as a good theory" (Kurt Lewin).

The alternative to theory led approaches

There is however an alternative search process which does not require a prior theory, known as the evolutionary algorithm, the kernel of the process of evolution. The evolutionary processes of variation, selection and retention, iterated many times over, have been able to solve many complex optimisation problems such as the design of a bird that can both fly long distances and dive deep in the sea for fish to eat. Genetic algorithms (GAs) are embodiments of the same kinds of process in software programs, in order to solve problems of interest to scientists and businesses. These are useful in two respects. One is the ability to search very large combinatorial spaces very quickly. The other is that they can come up with solutions involving particular combinations of attributes that might not have been so obvious to a human observer.

Development projects have attributes that vary. These include both the context in which they operate and the mechanisms by which they seek to work. There are many possible combinations of these attributes, but only some of these are likely to be associated with achieving a positive impact on people’s lives. If they were relatively common then implementing development aid projects would not be so difficult. The challenge is how to find the right combination of attributes. Trial and error by varying project designs and their implementation on the ground is a good idea in principle, but in practice it is slow. There is also a huge amount of systemic memory loss, for various reasons including poor or non-existent communications between various iterations of a project design taking place in different locations.

Can we instead develop models of projects, which combine real data about the distribution of project attributes with variable views of their relative importance in order to generate an aggregate predicted result? This expected result can then be compared to an observed result (ideally from independent sources). By varying the influence of the different attributes a range of predicted results can be generated, some of which may be more accurate than others. The best way to search this large space of possibilities is by using a GA. Fortunately Excel's Solver add-in now includes a simple GA (its Evolutionary solving method).

The following spreadsheet shows a very basic example of what such a model could look like, using a totally fictitious data set. The projects and their observed scores on four attributes (A-D) are shown on the left. Below them is a set of weights, reflecting the possible importance of each attribute for the aggregate performance of the projects. The Expected Outcome score for each project is the sum of the score on each attribute x the weight for that score.  In other words the more a project has an important attribute (or combination of these) the higher will be its Expected Outcome score. That score is important only as a relative measure, relative to that of the other projects in the model.
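The Expected Outcome calculation can be sketched in a few lines of Python. This is only an illustration: the project names, attribute scores and the `expected_outcome` helper are made up here and are not taken from the spreadsheet.

```python
# Hypothetical attribute scores (A-D) for three fictitious projects.
projects = {
    "Project 1": [3, 1, 4, 2],
    "Project 2": [1, 5, 2, 4],
    "Project 3": [2, 2, 3, 5],
}

# Equal weights: no prior view about which attribute matters most.
weights = [25, 25, 25, 25]

def expected_outcome(scores, weights):
    """Sum of each attribute score multiplied by the weight for that attribute."""
    return sum(s * w for s, w in zip(scores, weights))

expected = {name: expected_outcome(scores, weights)
            for name, scores in projects.items()}

for name, value in expected.items():
    print(name, value)
```

As in the spreadsheet, the resulting scores only mean something relative to each other, so they would normally be rescaled before being compared with an observed outcome.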

The Expected Outcome score for each project is then compared to an Observed Outcome measure (ideally converted to a comparable scale), and the difference is shown as the Prediction Error. On the bottom left is an aggregate measure of prediction error, the Standard Deviation. The original data can be found in this Excel file.


The initial weights were set at 25 for each attribute, in effect reflecting the absence of any view about which might be more important. With those weights, the SD of the Prediction Errors was 1.25. After 60,000+ iterations in the space of 1 minute the SD had been reduced to 0.97. This was achieved with this new combination of weights: Attribute A: 19, Attribute B: 0, Attribute C: 19, Attribute D: 61. The substantial error that remains can be considered as due to causal factors outside of the model (i.e. outside the list of attributes it describes)[1].
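For readers without Excel, the same kind of search can be sketched in plain Python. This is a minimal, mutation-only evolutionary search written for illustration. It is not the algorithm Solver uses, and the data, the function names (`normalise`, `error_sd`, `evolve`) and the parameter values are all assumptions, so it will not reproduce the figures above.

```python
import random
import statistics

# Fictitious data: attribute scores (A-D) and observed outcomes for six projects.
SCORES = [
    [3, 1, 4, 2],
    [1, 5, 2, 4],
    [2, 2, 3, 5],
    [5, 3, 1, 1],
    [4, 4, 2, 3],
    [1, 2, 5, 4],
]
OBSERVED = [2.5, 3.5, 4.0, 1.5, 3.0, 4.0]

def normalise(weights):
    """Clip negative weights to zero and rescale so the weights sum to 100."""
    clipped = [max(0.0, w) for w in weights]
    total = sum(clipped)
    if total == 0:
        return [100.0 / len(clipped)] * len(clipped)
    return [100.0 * w / total for w in clipped]

def error_sd(weights):
    """Standard deviation of the prediction errors (expected minus observed)."""
    errors = [
        sum(s * w for s, w in zip(row, weights)) / 100.0 - obs
        for row, obs in zip(SCORES, OBSERVED)
    ]
    return statistics.pstdev(errors)

def evolve(start, generations=200, pop_size=30, seed=0):
    """Variation, selection and retention, repeated for a fixed number of generations."""
    rng = random.Random(seed)
    population = [normalise(start)]
    while len(population) < pop_size:
        population.append(normalise([w + rng.gauss(0, 10) for w in start]))
    for _ in range(generations):
        population.sort(key=error_sd)            # selection: rank by fitness
        survivors = population[: pop_size // 2]  # retention: keep the fitter half
        children = [
            normalise([w + rng.gauss(0, 5) for w in rng.choice(survivors)])  # variation
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + children
    return min(population, key=error_sd)

start = [25, 25, 25, 25]
best = evolve(start)
print("initial SD:", round(error_sd(start), 3))
print("evolved weights:", [round(w) for w in best], "SD:", round(error_sd(best), 3))
```

Because the fitter half of each generation is always retained, the best solution found can never be worse than the equal-weights starting point. A fuller GA would also recombine pairs of weight sets (crossover), which is left out here for brevity.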

It seems that it is also possible to find the least appropriate solutions, i.e. those which make the least accurate Outcome Predictions. Using the GA set to find the maximum error, it was found that in the above example a 100% weighting given to Attribute A generated an SD of 1.87. This is the nearest that such an evolutionary approach comes to disproving a theory.

GAs deliver functional rather than logical proofs that certain explanations are better than others. Unlike logical proofs, they are not immortal. With more projects included in the model it is possible that there may be a fitter solution, which applies to this wider set. However, the original solution to the smaller set would still stand.

Models of complex processes can sometimes be sensitive to starting conditions. Different results can be generated from initial settings that are very similar. This was not the case in this exercise, with widely different initial weightings evolving and converging on almost identical sets of final weightings (e.g. 19, 0, 19, 62 versus 61), producing the same final error rate. This robustness is probably due to the absence of feedback loops in the model, which could be created where the weighted score of one attribute affected those of another. That would be a much more complex model, possibly worth exploring at another time.

Small changes in Attribute scores made a more noticeable difference to the Prediction Error. In the above model varying Project 8’s score on attribute A from 3 to 4 increases the average error by 0.02. Changes in other cells varied in direction of their effects. In more realistic models with more kinds of attributes and more project cases the results are likely to be less sensitive to such small differences in attribute scores.

The heading of this post asks “Can we evolve explanations of observed outcomes?” My argument above suggests that in principle it should be possible. However there is a caveat. A set of weighted attributes that are associated with success might better be described as the ingredients of an explanation. Further investigative work would be needed to find out how those attributes actually interact together in real life.  Before then, it would be interesting to do some testing of this use of GAs on real project datasets.

Your comments please...

PS 6 April 2012: I have just come across the Kaggle website. This site hosts competitions to solve various kinds of prediction problems (re both past and future events) using a data set available to all entrants, and gives prizes to the winner - who must provide not only their prediction but the algorithm that generated the prediction. Have a look. Perhaps we should outsource the prediction and testing of results of development projects via this website? :-) Though..., even to do this the project managers would still have a major task on hand: to gather and provide reliable data about implementation characteristics, as well as measures of observed outcomes... Though...this might be easier with some projects that generate lots of data, say micro-finance or education system projects.

View this Australian TV video, explaining how the site works and some of its achievements so far, and the Fast Company interview of the CEO.

PS 9 April 2012: I have just discovered that there is a whole literature on the use of genetic algorithms for rule discovery: "In a nutshell, the motivation for applying evolutionary algorithms to data mining is that evolutionary algorithms are robust search methods which perform a global search in the space of candidate solutions (rules or another form of knowledge representation)" (Freitas, 2002). The rules referred to are typically "IF...THEN..." type statements.

[1] Bear in mind that this example set of attribute scores and observed outcome measures is totally fictitious, so the inability to find a really good set of fitting attributes should not be surprising. In reality some sets of attributes will not be found co-existing because of their incompatibility, e.g. corrupt project management plus highly committed staff.

Tuesday, March 13, 2012

Modular Theories of Change: A means of coping with diversity and change?

Two weeks ago I attended a DFID workshop at which Price Waterhouse Coopers (PwC) consultants presented the results of their work, commissioned by DFID, on “Monitoring Results from Low Carbon Development”. LCD is one of three areas of investment by the International Climate Fund (ICF). The ICF is “a £2.9bn financial contribution … provided by the UK Government to support action on climate change and development. Having started to disperse funds, a comprehensive results framework is now required to measure the impact of this investment, to enable learning to inform future programming, and to show value for money on every pound”.

The PwC consultants’ tasks included (a) consultation with HMG staff on the required functions of the LCD results framework; (b) a detailed analysis of potentially useful indicators through extensive consultations and research into the available data; and (c) exploration of opportunities to harmonise results and/or share methodologies and data collection with others. Their report documents the large amount of work that has been done, but also acknowledges that more work is still needed.

Following the workshop I sent in some comments on the PwC report, some of which I will focus on here because I think they might be of wider interest. There were three aspects of the PwC proposals that particularly interested me. One was the fact that they had managed to focus down on 28 indicators, and were proposing that set be limited even further, down to 20. Secondly, they had organised the indicators into a LogFrame type structure, but one which is covering two levels of performance in parallel (within countries and across countries), rather than in a sequence. Thirdly, they had advocated the use of Multi-Criteria Analysis (MCA) for the measurement of some of the more complex forms of change referred to in the Logframe.  MCA is similar in structure to the design of weighted checklists, which I have previously discussed here and elsewhere.

Monitorable versus evaluable frameworks

As it stands the current LCD LogFrame is a potential means of monitoring important changes relating to low carbon development. But it is not yet sufficiently developed to enable an evaluation of the impact of efforts aimed at promoting low carbon development. This is because there is not yet sufficient clarity about the expected causal linkages between the various events described in the Logframe. It is the case that, as is required by DFID Logframes, weightings have been given to each of the four Outputs describing their expected impact on Outcome level changes. But the differences in weightings are modest (+/- 10%) and each of the Outputs describes a bundle of up to 5 more indicator-specific changes.

Clarity about the expected causal linkages is an essential “evaluability” requirement. Impact evaluations in their current form seek to establish not only what changes occurred, but also their causes. Accounts of causation in turn need to include not only attribution (whether A can be said to have caused B) but also explanation (how A caused B). In order for the LCD results framework to be evaluable, someone needs to “connect the dots” in some detail. That is, identify plausible explanations for how particular indicator-specific changes are expected to influence each other. Once that is done, the LCD program could be said to have not just a set of indicators of change, but a Theory of Change about how the changes interact and function as a whole.

Indicator level changes as shared building blocks

There are two subsequent challenges here to developing an evaluable Theory of Change for LCD. One is the multiplicity of possible causal linkages. The second is the diversity of perspectives on which of these possible causal linkages really matter. With 28 different indicator-specific changes there are, at least hypothetically, many thousands of different possible combinations that could make up a given Theory of Change (i^i, where i = number of indicator-specific changes). But, it can be well argued that “this is a feature, not a bug”. As the title of this blog suggests, the 28 indicators can be considered as equivalent to Lego building blocks. The same set (or parts thereof) can be combined in a multiplicity of ways, to construct very different ToC. The positive side to this picture is the flexibility and low cost. Different ToC can be constructed for different countries, but each one does not involve a whole new set of data collection requirements. In fact it is reasonable to expect that in each country the causal linkages between different changes may be quite different, because of the differences in the physical, demographic, cultural and economic context.

Documenting expected causal linkages (how the blocks are put together)

There are other more practical challenges, relating to how to exploit this flexibility. How do you seek stakeholder views of the expected causal connections, without getting lost in a sea of possibilities? One approach I have used in Indonesia and in Vietnam involves the use of simple network matrices, in workshops involving donor agencies and/or the national partners associated with a given project. Two examples are shown below. These don’t need to be understood in detail (one is still in Vietnamese), it is their overall structure that matters. 

A network matrix simply shows the entities that could be connected in the left column and top row. The convention is that each cell in the matrix provides data on whether that row entity is connected to that column entity (and it may also describe the nature of the connection).

The Indonesian example shown below shows expected relationships between 16 Output indicators (left column) and 11 Purpose level indicators (top row) in a maternal health project. Workshop participants were asked to consider one Purpose level indicator at a time, and allocate 100 percentage points across the 16 Output indicators, with more percentage points = an Output having more expected impact on the Purpose indicator, relative to other Output indicators. Debate was encouraged between participants as figures were proposed for each cell down a column. Looking within the matrix we can see that for Purpose 3 it was agreed that Output indicator 1.1 would have the most impact. For some other Purpose level changes, impact was expected from a wider range of Outputs. The column on the right side sums up the relative expected impact of each Output, providing useful guidance on where monitoring attention might be most usefully focused.
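The structure of such a matrix is simple enough to sketch in Python. The indicator names and figures here are invented for illustration and are much smaller than the 16 x 11 matrix from the workshop.

```python
# Hypothetical allocations: rows are Output indicators, columns are Purpose indicators.
outputs = ["Output 1.1", "Output 1.2", "Output 2.1"]
purposes = ["Purpose 1", "Purpose 2"]
matrix = [
    [60, 20],  # Output 1.1
    [30, 30],  # Output 1.2
    [10, 50],  # Output 2.1
]

# Each column must use exactly 100 percentage points.
for j, purpose in enumerate(purposes):
    column_total = sum(row[j] for row in matrix)
    assert column_total == 100, f"{purpose} allocates {column_total}, not 100"

# Row totals give the relative expected impact of each Output across all Purposes.
row_totals = {name: sum(row) for name, row in zip(outputs, matrix)}
ranked = sorted(row_totals.items(), key=lambda item: item[1], reverse=True)

for name, total in ranked:
    print(name, total)
```

The column check mirrors the workshop rule that participants distribute 100 points per Purpose indicator, and the ranked row totals correspond to the summary column on the right side of the matrix.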

This exercise was completed in little over an hour. The matrix of results shows one set of expected relationships amongst many thousands of other possible sets that could exist within the same list of indicators. The same kind of data can be collected on a larger scale via online surveys, where the options down each column are represented within a single multiple choice question. Matrices like these, obtained either from different individuals or different stakeholder groups, can be compared with each other to identify relationships (i.e. specific cells) where there is the most/least agreement, as well as which relationships are seen as most important, when all stakeholder views are added up. This information should then inform the focus of evaluations, allowing scarce attention and resources to be directed to the most critical relationships.
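One possible way of making that comparison is sketched below in Python. The three stakeholder groups and their figures are hypothetical, and using the standard deviation of a cell's values as the measure of disagreement is my assumption rather than an established convention.

```python
import statistics

# Hypothetical matrices from three stakeholder groups (same rows and columns).
views = {
    "donor":    [[60, 20], [30, 30], [10, 50]],
    "ministry": [[50, 25], [40, 5], [10, 70]],
    "ngo":      [[70, 15], [20, 45], [10, 40]],
}

n_rows, n_cols = 3, 2
cells = {}
for i in range(n_rows):
    for j in range(n_cols):
        values = [m[i][j] for m in views.values()]
        cells[(i, j)] = {
            "total": sum(values),                 # aggregate importance
            "spread": statistics.pstdev(values),  # disagreement between groups
        }

most_contested = max(cells, key=lambda c: cells[c]["spread"])
most_agreed = min(cells, key=lambda c: cells[c]["spread"])
most_important = max(cells, key=lambda c: cells[c]["total"])

print("most contested cell:", most_contested)
print("most agreed cell:", most_agreed)
print("most important cell:", most_important)
```

Cells are identified by (row, column) position, counting from zero. A cell that is both important and contested would be an obvious candidate for evaluation attention.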

The second example of a network matrix used to explicate a tacit ToC comes from Vietnam, and is shown below. In this example, a Ministry’s programmes are shown (unconventionally) across the top row and the country’s 5 year plan objectives are shown down the left column. Cell entries, discussed and proposed by workshop participants, show the relative expected causal contribution of each programme to each 5 year plan objective. The summary row on the bottom shows the aggregate expected contribution of each programme and the summary column on the right shows the aggregate extent to which each 5 year plan objective was expected to be affected.


The modules referred to in the title of this blog can be seen as referring to two types of entities that can be used to construct many different kinds of ToC. One is the indicator-specific changes in the LCD Logframe, for example. By treating them as a standard set available for use by different stakeholders in different settings, we may gain flexibility at a low cost. The other is the grouping of indicator-specific changes into categories (e.g. Outputs 1-2-3-4) and larger sets of categories (Outputs, Outcomes, Purpose). The existence of one or more nested types of entities is sometimes described as modularity. In evolutionary theory it has been argued that modularity in design improves evolvability. This can happen: (a) by allowing specific features to undergo changes without substantially altering the functionality of the entire system, (b) by allowing larger more structural changes to occur by recombining existing functional units.

In the conceptual world of Logframes, and the like, this suggests that we may need to think of ToC being constructed at multiple levels of detail, by different sized modules. In the LCD Logframe impact weightings had already been assigned to each Output, indicating its relative expected contribution to the Outcomes as a whole. But the flexibility of ToC design at this level was seriously constrained by the structure of the representational device being used. In a Logframe Outputs are expected to influence Outcomes, but not the other way. Nor are they expected to influence each other, contra other more graphic based logic models. Similarly, both of the above network matrix exercises made use of existing modules and accepted the kinds of relationship that were expected between them (Outputs should influence Purpose level changes; Ministry Programmes should influence the achievement of 5 Year Plan objectives).

The value of multiple causal pathways within a ToC

More recently I have seen the ToC for a major area of DFID policy that will be coming under review. This is represented in diagrammatic form, showing various kinds of events (including some nested categories of events), and also shows the expected causal relationships between these events. It was quite a complex diagram, perhaps too much so for those who like traffic-light level simplicities. However, what interested me the most is that subsequent versions have been used to show how two specific in-country programs fit within this generic ToC. This has been done by highlighting the particular events that make up one of the number of causal chains that can be found within the generic ToC. In doing so it appears to be successfully addressing a common problem with generic ToC - the inability to reflect the diversity of the programs that make up the policy area described by a generic ToC.

Shared causal pathways justify more evaluation attention

This innovation points to an alternate and additional use of the matrices above. The cell numbers could refer to the numbers of constituent programs in a policy area (and/or which are funded by a single funding mechanism) that involve this particular causal link (i.e. between the row event and the column event). The higher this number, the more important it would be for evaluations to focus on that causal link - because the findings would have relevance across a number of programs in the policy area.
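Counting shared links is straightforward once each program's causal chain has been listed. The Python sketch below uses invented programs and events. It assumes each program follows a single chain through the generic ToC, which real programs may not.

```python
from collections import Counter

# Hypothetical causal chains: each program highlights one path through the generic ToC.
program_paths = {
    "Program A": ["Training", "Better services", "Improved health"],
    "Program B": ["Funding", "Better services", "Improved health"],
    "Program C": ["Training", "Better services", "Higher incomes"],
}

link_counts = Counter()
for path in program_paths.values():
    # Each consecutive pair of events is one causal link (row event -> column event).
    for link in zip(path, path[1:]):
        link_counts[link] += 1

# Links shared by more programs come first.
for (cause, effect), count in link_counts.most_common():
    print(f"{cause} -> {effect}: {count} program(s)")
```

Each count is the number that would go in the matrix cell for that row event and column event, so the most widely shared links stand out as priorities for evaluation.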