Saturday, July 16, 2016

EvalC3 - an Excel-based package of tools for exploring and evaluating complex causal configurations


Over the last few years I have been exposed to two different approaches to identifying and evaluating complex causal configurations within sets of data describing the attributes of projects and their outcomes. One is Qualitative Comparative Analysis (QCA) and the other is Predictive Analytics (and particularly Decision Tree algorithms). Both can work with binary data, which is easier to access than numerical data, but both require specialist software - which requires time and effort to learn how to use

In the last year I have spent some time and money, in association with a software company called Aptivate (Mark Skipper in particular) developing an Excel based package which will do many of the things that both of the above software packages can do, as well as provide some additional capacities that neither have.

This is called EvalC3, and is now available [free] to people who are interested to test it out, either using their own data and/or some example data sets that are available. The "manual" on how to use EvalC3 is a supporting website of the same name, found here: https://evalc3.net/  There is also a  short introductory video here.

Its purpose is to enable users: (a) to identify sets of project & context attributes which are  good predictors of the achievement of an outcome of interest,  (b) to compare and evaluate the performance of these predictive models, and (c) to identify relevant cases for follow-up within-case investigations to uncover any causal mechanisms at work.

The overall approach is based on the view that “association is a necessary but insufficient basis for a strong claim about causation, which is a more useful perspective than simply saying “correlation does not equal causation”.While the process involves systematic quantitative cross-case comparisons, its use should be informed by  within-case knowledge at both the pre-analysis planning and post-analysis interpretation stages.

The EvalC3 tools are organised in a work flow as shown below:



The selling points:




  • EvalC3 is free, and distributed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
  • It uses Excel, which many people already have and know how to use
  • It uses binary data. Numerical data can be converted to binary but not the other way
  • It combines manual hypothesis testing with  algorithm based (i.e. automated) searches for good performing predictive models
  • There are four different algorithms that can be used
  • Prediction models can be saved and compared
  • There are case-selection strategies for follow-up case-comparisons to identify any casual mechanisms at work "underneath" the prediction models

If you would like to try using EvalC3 email rick.davies at gmail.com

Skype video support can be provided in some instances. i.e. if your application is of interest to me :-)

Monday, March 07, 2016

Why I am sick of (some) Evaluation Questions!


[Beginning of rant] Evaluation questions are are a cop out, and not only that, they are an expensive cop out. Donors commissioning evaluations should not be posing lists of sundry open ended questions about how their funded activities are working and or having an impact.

They should have at least some idea of what is working (or not) and they should be able to  articulate these ideas. Not only that, they should be willing, and even obliged, to use evaluations to test those claims. These guys are spending public monies, and the public hopefully expects that they have some idea about what they are doing i.e. what works. [voice of inner skeptic: they are constantly rotated through different jobs, so probably don't have much idea about what is working, at all]

If open ended evaluation questions were replaced by specific claims or hypotheses then evaluation efforts could be much more focused and in-depth, rather than broad ranging and shallow. And then we might have some progress in the accumulation of knowledge about what works.

The use of swathes of open ended evaluation questions also relates to the subject of institutional memory about what has worked in the past. The use of open ended questions suggests little has been retained from the past, OR is now deemed to be of any value. Alas and alack, all is lost, either way [end of rant]

Background: I am reviewing yet another inception report, which includes a lot of discussion about how evaluation questions will be developed. Some example questions being considered:
How can we value ecosystem goods and services and biodiversity?  

How does capacity building for better climate risk management at the institutional level
translate into positive changes in resilience

What are the links between protected/improved livelihoods and the resilience of people and communities, and what are the limits to livelihood-based approaches to improving resilience?

Friday, March 04, 2016

Why we should also pay attention to "what does not work"


There is no shortage of research on poverty and how people become poor and often remain poor.

Back in the 1990s (ancient times indeed, at least in the aid world :-) a couple of researchers in Vietnam were looking at the nutrition status of children in poor households. In the process they came across a small number of households where the child was well nourished, despite the household being poor. The family's feeding practices were investigated and the lessons learned were then disseminated throughout the community. The existence of such positive outliers from a dominant trend was later called "positive deviance" and this subsequently became the basis of large field of research and development practice. You can read more on the Positive Deviance Initiative website

From my recent reading of the work done by those associated with this movement the main means that has been used to find positive deviance cases has been participatory investigations by the communities themselves. I have no problem with this.

But because I have been somewhat obsessed with the potential applications of predictive modeling over the last few years I have wondered if the search for positive deviance could be carried out on a much larger scale, using relatively non-participatory methods. More specifically, using data mining methods aimed at developing predictive models. Predictive models are association rules that perform well in predicting an outcome of interest. For example, that projects with x,y,z attributes in contexts with a,b, and c attributes will lead to project outcomes that are above average in achieving their objectives.

The core idea is relatively simple. As well as developing predictive models of what does work (the most common practice) we should also develop predictive models of what does not work. It is quite likely that many of these models will be imperfect, in the sense that there are likely to be some False Positives. In this type of analysis FPs will be cases where the development outcome did take place, despite all the conditions being favorable to it not taking place. These are the candidate "Positive Deviants" which would then be worth investigating in detail via case studies, and it is at this stage that participatory methods of inquiry would then be appropriate.

Here is a simple example, using some data collated and analysed by Krook in 2010, on factors affecting levels of women's participation in parliaments in Africa. Elsewhere in this blog I have shown how this data can be analysed using Decision Tree algorithms, to develop predictors of when womens' participation will be high versus low. I have re-presented the Decision Tree model below
In this predictive model the absence of quotas for women in parliament is a good predictor of low levels of their participation in parliaments. 13 of the 14 countries with no quotas have low levels of women's participation. The one exception, the False Positive of this prediction rule and an example of "positive deviance", is the case of Lesotho, where despite the absence of quotas there is a (relatively) high level of women's participation in parliament. The next question is why so, and then whether the causes are transferable to other countries with no quotas for women. This avenue was not explored in the Krook paper, but it could be a practically useful next step.

Postscript: I was pleased to see that the Positive Deviance Initiative website now has a section on the potential uses of predictive analytics (aka predictive modelling) and they are seeking to establish some piloting of methods in this area with other interested parties



Monday, December 28, 2015

Aiming for the stars versus "the adjacent possible"


Background: I have been exploring the uses of a new Excel application I have been developing with the help of Aptivate, provisionally called EvalC3. You can find out more about it here: http://evalc3.net/

If you have a data set that describes a range of attributes of a set of projects, plus an outcome measure for these projects which is of interest, you may be able to identify a set of attributes (aka a model) which best predicts the presence of the outcome.

In one small data experiment I used a randomly generated data set, with 100 cases and 10 attributes. Using EvalC3 I found that the presence of attributes "A" and "I" best predicted the presence of the outcome with an accuracy of 65%. In other words, of all the cases with these attributes 65% also had the outcome present.

Imagine I am running a project with the attributes D and J but not A or I. In the data set this set of attributes was associated with the presence of the outcome in 49% of the cases. Not very good really, I probably need to make some changes to the project design. But if I want to do the best possible, according the data analysis so far, I will need to ditch the core features of my current project (D and A) and replace them with the new features (A and I). This sounds like a big risk to me.

Alternately, I could explore what has been called by Stuart Kauffmann "the adjacent possible". In other words, make small changes to my project design that might improve its likelihood of success, even though the improvements might fall well short of the optimum level shown by the analysis above (i.e. 65%).

If data was available on a wide range of projects I could do this exploration virtually, in the sense of finding other projects with similar but different attributes to mine, and see how well they performed. In my data based experiment my existing project had attributes D and J. Using EvalC3 I then carried out a systematic search for a better set of attributes that kept these two original attributes but introduced one extra attribute. This is what could be called a conservative innovation strategy. The search process found that including a particular extra attribute in the design improved the accuracy of my project model from 49% to 54%. Then introducing another particular attribute improved it to 59%.

So what? Well, if you are an existing project and there is a real life data set of reasonably comparable (but not identical) projects you would be able to explore explore relatively low risk ways of improving your performance. The findings from the same data set on the model which produced the best possible performance (65% in the example above) might be more relevant to those designing new projects from scratch. Secondly,  your subsequent experience with these cautious experiments could be used to update and extend the project data base with extra data on what is effectively a new case i.e a project with a new set of attributes slightly different from its previous status.

The connection with evolutionary theory: On a more theoretical level you may be interested in the correspondence of this approach with evolutionary strategies for innovation. As I have explained elsewhere "Evolution may change speed (e.g. as in punctuated equilibrium), but it does not make big jumps. It progresses through numerous small moves, exploring adjacent spaces of what else might be possible. Some of those spaces lead to better fitness, some to less. This is low cost exploration, big mutational jumps involve much more risk that the changes will be dysfunctional, or even terminal" A good read on how innovation arises from such re-iterated local searches is Andreas Wagner's recent book "Arrival of the Fittest"

Fitness ladscapes: There is another concept from evolutionary theory that is relevant here. This is the metaphor of a "fitness landscape" Any given position on the landscape represents, in simplified form, one of many possible designs in what is in reality a multidimensional space of possible designs. The height of any position on the landscape represents the relative fitness of that design, higher being more fit. Fitness in the example above is the performance of the model in accurately predicting whether an outcome is present of not.

An important distinction that can be made between fitness landscapes, or parts thereof, is whether they are smooth or rugged. A smooth landscape means the transition in the fitness of one design (point in the landscape) to that of another very similar design located next door is not sudden but gradual, like a gentle slope on a real landscape. A rugged landscape is the opposite. The fitness of one design may be very different from the fitness of a design immediately next door (i.e. very similar). Metaphorically speaking, immediately next door there maybe a sinkhole or a mountain. A conservative innovation strategy as described above will work better on a smooth landscape, where there are no sudden surprises.

With data sets of the kind described above it may be possible to measure how smooth or rough a fitness landscape is, and thus make informed choices  about the best innovation strategy to use. As mentioned elsewhere in this website, the similarity of the attributes of two cases can be measured using Hamming distance, which is simply the proportion of all their attributes which are different from each other. If each case in a data set is compared to all other cases in the same data set then each case can be described in terms of its average similarity with all other cases. In a smooth landscape very similar cases should have a similar fitness level i.e  be of similar "height", but the more dissimilar cases should have more disparate fitness levels. In a rugged landscape the differences in fitness will have no relationship to similarity measures.

Postscript:  In my 2015 analysis of Civil Society Challenge Fund data it seemed that there were often adjacent designs that did almost as well as the best performing designs that could be found. This finding suggests that we should be cautious about research or evaluation based claims about "what works" that are too dogmatic and exclusive of other possibly relevant versions.