Monday, December 28, 2015

Aiming for the stars versus "the adjacent possible"


Background: I have been exploring the uses of a new Excel application I have been developing with the help of Aptivate, provisionally called EvalC3. You can find out more about it here: http://evalc3.net/

If you have a data set that describes a range of attributes of a set of projects, plus an outcome measure for these projects which is of interest, you may be able to identify a set of attributes (aka a model) which best predicts the presence of the outcome.

In one small data experiment I used a randomly generated data set, with 100 cases and 10 attributes. Using EvalC3 I found that the presence of attributes "A" and "I" best predicted the presence of the outcome with an accuracy of 65%. In other words, of all the cases with these attributes 65% also had the outcome present.
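The kind of search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of how EvalC3 works internally: the function names `share_with_outcome` and `best_model` are made up, the data is freshly generated random data rather than my original data set, and the search is a plain exhaustive scan of all attribute pairs.

```python
import itertools
import random

def share_with_outcome(cases, attrs):
    """Share of the cases having every attribute in `attrs` that also have the outcome."""
    matched = [c for c in cases if all(c[a] for a in attrs)]
    if not matched:
        return None  # no case has this combination, so it cannot be scored
    return sum(c["outcome"] for c in matched) / len(matched)

def best_model(cases, names, size):
    """Score every combination of `size` attributes and return (score, combination)."""
    scored = []
    for combo in itertools.combinations(names, size):
        score = share_with_outcome(cases, combo)
        if score is not None:
            scored.append((score, combo))
    return max(scored)

# Randomly generated stand-in data: 100 cases, 10 binary attributes, 1 binary outcome.
rng = random.Random(0)
names = list("ABCDEFGHIJ")
cases = []
for _ in range(100):
    case = {name: rng.random() < 0.5 for name in names}
    case["outcome"] = rng.random() < 0.5
    cases.append(case)

score, combo = best_model(cases, names, 2)
print(combo, round(score, 2))
```

Because the data is random, the winning pair and its score will differ from the "A" and "I" result reported above.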

Imagine I am running a project with the attributes D and J but not A or I. In the data set this set of attributes was associated with the presence of the outcome in 49% of the cases. Not very good really; I probably need to make some changes to the project design. But if I want to do the best possible, according to the data analysis so far, I will need to ditch the core features of my current project (D and J) and replace them with the new features (A and I). This sounds like a big risk to me.

Alternatively, I could explore what Stuart Kauffman has called "the adjacent possible". In other words, make small changes to my project design that might improve its likelihood of success, even though the improvements might fall well short of the optimum level shown by the analysis above (i.e. 65%).

If data was available on a wide range of projects I could do this exploration virtually, in the sense of finding other projects with similar but different attributes to mine, and seeing how well they performed. In my data-based experiment my existing project had attributes D and J. Using EvalC3 I then carried out a systematic search for a better set of attributes that kept these two original attributes but introduced one extra attribute. This is what could be called a conservative innovation strategy. The search process found that including a particular extra attribute in the design improved the accuracy of my project model from 49% to 54%. Then introducing another particular attribute improved it to 59%.
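The conservative search just described can be sketched as a greedy loop: keep the existing attributes, try each single extra attribute, keep the best one, and repeat. This is a minimal sketch under assumptions, not EvalC3's own procedure. The function names and the seven toy cases are invented for illustration, and the scores it produces are not the 49%, 54% and 59% figures from my experiment.

```python
def share_with_outcome(cases, attrs):
    """Share of the cases having every attribute in `attrs` that also have the outcome."""
    matched = [c for c in cases if all(c[a] for a in attrs)]
    if not matched:
        return None
    return sum(c["outcome"] for c in matched) / len(matched)

def adjacent_designs(cases, design, candidates):
    """Score each design made by adding exactly one candidate attribute to `design`."""
    options = {}
    for extra in candidates:
        if extra in design:
            continue
        score = share_with_outcome(cases, design + (extra,))
        if score is not None:
            options[extra] = score
    return options

def conservative_search(cases, base, candidates, steps):
    """Greedily add one attribute per step, stopping when no addition improves the score."""
    design = tuple(base)
    path = [(design, share_with_outcome(cases, design))]
    for _ in range(steps):
        options = adjacent_designs(cases, design, candidates)
        if not options:
            break
        best = max(options, key=options.get)
        current = path[-1][1]
        if current is not None and options[best] <= current:
            break
        design = design + (best,)
        path.append((design, options[best]))
    return path

def make_case(b, c, outcome):
    # Every toy case has D and J, like the existing project; B and C vary.
    return {"D": True, "J": True, "B": b, "C": c, "outcome": outcome}

# Invented toy data, for illustration only.
toy_cases = [
    make_case(True, True, True),
    make_case(True, True, True),
    make_case(True, False, False),
    make_case(False, True, False),
    make_case(False, True, False),
    make_case(False, False, False),
    make_case(False, False, True),
]

path = conservative_search(toy_cases, ("D", "J"), ["B", "C"], steps=2)
for design, score in path:
    print(design, round(score, 2))
```

Each entry in `path` is one step away from the previous design, which is the sense in which the search stays within the adjacent possible.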

So what? Well, if you are an existing project and there is a real life data set of reasonably comparable (but not identical) projects you would be able to explore relatively low risk ways of improving your performance. The findings from the same data set on the model which produced the best possible performance (65% in the example above) might be more relevant to those designing new projects from scratch. Secondly, your subsequent experience with these cautious experiments could be used to update and extend the project data base with extra data on what is effectively a new case, i.e. a project with a new set of attributes slightly different from its previous status.

The connection with evolutionary theory: On a more theoretical level you may be interested in the correspondence of this approach with evolutionary strategies for innovation. As I have explained elsewhere, "Evolution may change speed (e.g. as in punctuated equilibrium), but it does not make big jumps. It progresses through numerous small moves, exploring adjacent spaces of what else might be possible. Some of those spaces lead to better fitness, some to less. This is low cost exploration, big mutational jumps involve much more risk that the changes will be dysfunctional, or even terminal." A good read on how innovation arises from such re-iterated local searches is Andreas Wagner's recent book "Arrival of the Fittest".

Fitness landscapes: There is another concept from evolutionary theory that is relevant here. This is the metaphor of a "fitness landscape". Any given position on the landscape represents, in simplified form, one of many possible designs in what is in reality a multidimensional space of possible designs. The height of any position on the landscape represents the relative fitness of that design, higher being more fit. Fitness in the example above is the performance of the model in accurately predicting whether an outcome is present or not.

An important distinction that can be made between fitness landscapes, or parts thereof, is whether they are smooth or rugged. A smooth landscape means the transition in the fitness of one design (point in the landscape) to that of another very similar design located next door is not sudden but gradual, like a gentle slope on a real landscape. A rugged landscape is the opposite. The fitness of one design may be very different from the fitness of a design immediately next door (i.e. very similar). Metaphorically speaking, immediately next door there may be a sinkhole or a mountain. A conservative innovation strategy as described above will work better on a smooth landscape, where there are no sudden surprises.

With data sets of the kind described above it may be possible to measure how smooth or rough a fitness landscape is, and thus make informed choices about the best innovation strategy to use. As mentioned elsewhere in this website, the similarity of the attributes of two cases can be measured using Hamming distance, which is simply the proportion of all their attributes which are different from each other. If each case in a data set is compared to all other cases in the same data set then each case can be described in terms of its average similarity with all other cases. In a smooth landscape very similar cases should have a similar fitness level, i.e. be of similar "height", but the more dissimilar cases should have more disparate fitness levels. In a rugged landscape the differences in fitness will have no relationship to similarity measures.
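One way to turn this into a number is to compare every pair of cases, record their Hamming distance and the gap between their fitness values, and then correlate the two. The sketch below does this under assumptions of my own: each case is taken to have a single numeric fitness score, a Pearson correlation is used as the summary, and the two toy landscapes (one additive, one based on parity) are invented purely for illustration.

```python
import itertools
import math

def hamming(a, b):
    """Proportion of attribute positions at which two cases differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def smoothness(cases):
    """Correlate pairwise Hamming distance with pairwise difference in fitness.

    `cases` is a list of (attributes, fitness) pairs.
    """
    distances, gaps = [], []
    for (a, fit_a), (b, fit_b) in itertools.combinations(cases, 2):
        distances.append(hamming(a, b))
        gaps.append(abs(fit_a - fit_b))
    return pearson(distances, gaps)

# All eight designs that can be built from three binary attributes.
designs = list(itertools.product([0, 1], repeat=3))

# Toy "smooth" landscape: fitness rises by one with each attribute present.
smooth = [(d, sum(d)) for d in designs]
# Toy "rugged" landscape: fitness flips with every single attribute change.
rugged = [(d, sum(d) % 2) for d in designs]

print(round(smoothness(smooth), 2), round(smoothness(rugged), 2))
```

On the additive landscape the correlation is positive, meaning more dissimilar designs tend to differ more in fitness. On the parity landscape it is not, because even immediate neighbours always differ in fitness.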

Postscript: In my 2015 analysis of Civil Society Challenge Fund data it seemed that there were often adjacent designs that did almost as well as the best performing designs that could be found. This finding suggests that we should be cautious about research- or evaluation-based claims about "what works" that are too dogmatic and exclusive of other possibly relevant versions.


Saturday, December 26, 2015

False Positives - why we should pay more attention to them


In the last year I have been involved in two pieces of work that have sought to find patterns in data that are good predictors of project outcomes that were of interest. In one case I was the researcher; in the other I was in a quality assurance role, looking over someone else's analysis.

In both situations two types of prediction rules were found: (a) some confirming stakeholders' existing understandings, (b) others contradicting that understanding and/or proposing a novel perspective. The value of further investigating the latter was evident, but the value of investigating findings that seemed to confirm existing views seemed less evident to the clients in both cases. "We know that...let's move on.../show us something new" seemed to be the attitude. Albeit after some time, it occurred to me that a different next step was needed for each of these kinds of findings:

  • Where findings are novel, it is the True Positive cases that need further investigation. These are the cases where the outcome was predicted by a rule, and confirmed as being present by the data.
  • Where findings are familiar, it is the False Positives that need further investigation. These are the cases where the rule predicted the outcome but the data indicated the outcome was not present. In my experience so far, most of the confirmatory prediction rules had at least some False Positives. These are important to investigate because doing so could help identify important boundaries to our confidence about where and when a given rule works.
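
The two bullet points above amount to a simple selection rule, which can be sketched as follows. This is an illustration only: the five cases, the `has_training` attribute and the rule itself are hypothetical, and the function names are my own.

```python
from collections import Counter

def classify(predicted, actual):
    """Label one case as a True/False Positive/Negative."""
    if predicted and actual:
        return "TP"
    if predicted and not actual:
        return "FP"
    if not predicted and actual:
        return "FN"
    return "TN"

def cases_to_investigate(cases, rule, finding_is_novel):
    """Novel findings: follow up the True Positives. Familiar findings: the False Positives."""
    wanted = "TP" if finding_is_novel else "FP"
    return [c["id"] for c in cases if classify(rule(c), c["outcome"]) == wanted]

def training_rule(case):
    # Hypothetical prediction rule: projects with training achieve the outcome.
    return case["has_training"]

# Invented cases, for illustration only.
cases = [
    {"id": "P1", "has_training": True, "outcome": True},
    {"id": "P2", "has_training": True, "outcome": False},
    {"id": "P3", "has_training": False, "outcome": True},
    {"id": "P4", "has_training": False, "outcome": False},
    {"id": "P5", "has_training": True, "outcome": True},
]

counts = Counter(classify(training_rule(c), c["outcome"]) for c in cases)
print(dict(counts))
print(cases_to_investigate(cases, training_rule, finding_is_novel=False))
```
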
Thinking more widely, it occurred to me how much more attention we should pay to False Positives in the way that public policy supposedly works. In war time, civilian casualties are often False Positives, in the calculations about the efficacy of airstrikes for example. We hear about the number of enemy combatants killed, but much less often about the civilians killed by the same "successful" strikes. There are many areas of public policy, especially in law I suspect, where there are the equivalent of these civilian deaths, metaphorically if not literally. The "War on Drugs" and the current "War on Terrorism" are two that come to mind. Those implementing these policies are preoccupied with the numbers of True Positives they have achieved and with the False Negatives, i.e. the cases known but not yet detected and hit. But counting False Positives is much less in their immediate interest, which raises the question: if not by them, then by whom?

Some Christmas/New Year thoughts from a dry, warm, safe and secure house in the northern hemisphere...

PS: see http://arstechnica.co.uk/security/2016/02/the-nsas-skynet-program-may-be-killing-thousands-of-innocent-people/

Meta versus macro theories of change


A macro-ToC is a single ToC that seeks to aggregate into one view the contents of many micro-ToCs. For example, the aggregation of many project-specific ToCs into a single country-level ToC. There are two risks with this approach:

  1. The loss of detail involved in this aggregation will lead to a loss of measurability, which presents problems for the evaluability of a macro-ToC.
  2. Even where the macro-ToC can be tested the relevance of the results to a specific project could be contested, because individual projects could challenge the macro-ToC as not being an adequate representation of their project intentions. 
The alternative to a macro-ToC is something that could be called a meta-ToC. A meta-theory is a theory about theories. A meta-ToC would be a structured set of ideas about the significant differences between various ToCs. These differences might be of various kinds, e.g. about the context, the intervention, the intended beneficiaries, or any mediating causal mechanisms. Consider the following (imagined) structure. This is in effect a nested classification of projects. Each branch represents what might be seen by a respondent as significant differences between projects, ideally as apparent in the contents of their ToCs and associated documents. This kind of structure can be developed by participatory or expert judgement methods (see the PS 2 link below for how). The former is preferable because it could increase buy-in to the final representation by the constituent projects and their associated ToCs.
The virtue of this approach is that, if well done, each difference in the tree structure represents the seed of a hypothesis that could be the focus of attention in a macro evaluation. That is, the "IF.." part of an "IF..THEN.." statement. If each difference represents the most significant difference, the respondents could then be asked a follow-up question: "What difference has this difference made, or will it make?" Combined with the original difference, the answers to this second question generate what are essentially hypotheses (IF...THEN...statements), ones that should be testable by comparing the projects fitting into the two categories described.

Some of these differences will be more worthwhile testing than others, if they cover more projects. For example, in the tree structure above, the difference in "Number of funders" applies to all five projects, whereas the difference in "Geographic scale of project" only applies to two projects. More important differences, that apply to more projects, will also, by definition, have more cases that can be compared to each other.
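Counting how many projects each difference covers is easy to do once the nested classification is written down as data. The sketch below is only an illustration: the tree is a hypothetical reconstruction that matches the two counts mentioned above (five and two), and the branch labels and project names are invented.

```python
def count_projects(node):
    """Number of projects sitting anywhere below a node of the nested classification."""
    if isinstance(node, list):
        return len(node)
    return sum(count_projects(child) for child in node["branches"].values())

def coverage(node, found=None):
    """Map each difference in the tree to the number of projects it applies to."""
    found = {} if found is None else found
    if isinstance(node, dict):
        found[node["difference"]] = count_projects(node)
        for child in node["branches"].values():
            coverage(child, found)
    return found

# Hypothetical tree: branch labels and project names are made up.
tree = {
    "difference": "Number of funders",
    "branches": {
        "single funder": ["Project 1", "Project 2", "Project 3"],
        "multiple funders": {
            "difference": "Geographic scale of project",
            "branches": {
                "local": ["Project 4"],
                "national": ["Project 5"],
            },
        },
    },
}

# Differences listed from most to least projects covered.
ranked = sorted(coverage(tree).items(), key=lambda item: item[1], reverse=True)
print(ranked)
```
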

It is also possible to identify compound hypotheses worth testing. That is, "IF...AND...THEN..." type statements. Participants could be asked to walk down each branch in turn and indicate at each branch point "Which of these types of projects do you think has been, or will be, the most successful?" The combination of project attributes described by a given branch is the configuration of conditions hypothesised to lead to the result predicted. Knowledge about which of these are more effective could be practically useful.

In summary: This meta-theory approach maximises the use of the diversity that can be present in a large portfolio of activities, rather than aggregating it out of existence. Or more accurately, out of visibility.

PS 1: These thoughts have been prompted by my experience of being involved in a number of macro-evaluations of projects in recent years.

PS 2: For more on creating such nested classifications see https://mande.co.uk/special-issues/hierarchical-card-sorting-hcs/

Friday, August 21, 2015

Clustering projects according to similarities in outcomes they achieve

Among some users of LogFrames it is verboten to have more than one Purpose level (i.e. outcome) statement. They are likely to argue that where there are multiple intended outcomes a project's efforts will be dissipated and will ultimately be ineffective. However, a reasonable counter-argument would be that in some cases multiple outcome measures may simply be a more nuanced description of an outcome that others might want to insist is expressed in a singular form.

The "problem" of multiple outcome measures becomes more common when we look at portfolios of projects where there may be one or two over-arching objectives but it is recognised that there are multiple pathways to their achievement. Or, that it is recognised that individual projects may want to try different mixes of strategies, rather than just one alone.

How can an evaluator deal with multiple outcomes, and data on these? Some years ago one strategy that I used was to gather the project staff together to identify, for each output, its expected relative causal contribution to each of the project outcomes. These judgements were expressed as individual values that added up to 100 percentage points per outcome, plotted in an (Excel) Outputs x Outcomes matrix, projected onto a screen for all to see, argue over and edit. The results enabled us to prioritise which Output to Outcome linkages to give further attention to, and to identify, in aggregate, which Outputs would need more attention than others.
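The matrix exercise can be sketched outside Excel as well. The example below assumes a small made-up matrix (three outputs, two outcomes) and invented function names. It checks that each outcome's column adds up to 100 points, ranks outputs by their aggregate contribution, and lists the strongest individual Output to Outcome linkages.

```python
def check_columns(matrix, outcomes):
    """Each outcome's contributions must add up to 100 percentage points."""
    for j, outcome in enumerate(outcomes):
        total = sum(row[j] for row in matrix.values())
        if total != 100:
            raise ValueError(f"{outcome} adds up to {total}, not 100")

def rank_outputs(matrix):
    """Outputs ordered by their contribution summed across all outcomes."""
    totals = {output: sum(row) for output, row in matrix.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

def strongest_links(matrix, outcomes, top):
    """The `top` largest single Output to Outcome contributions."""
    links = [
        (value, output, outcomes[j])
        for output, row in matrix.items()
        for j, value in enumerate(row)
    ]
    return sorted(links, reverse=True)[:top]

# Invented judgements: rows are outputs, columns follow the order of `outcomes`.
outcomes = ["Outcome 1", "Outcome 2"]
matrix = {
    "Output A": [55, 10],
    "Output B": [30, 30],
    "Output C": [15, 60],
}

check_columns(matrix, outcomes)
print(rank_outputs(matrix))
print(strongest_links(matrix, outcomes, top=2))
```
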

There is also another possible approach. More recently I have been exploring the potential uses of clustering modules within the RapidMiner data mining package. I have a data set of 34 projects with data on their achievements on 11 different outcome measures. A month ago I was contracted to develop some predictive models for each of these outcomes, which I did. But it now occurs to me that doing so may be somewhat redundant, in that there may not really be 11 different types of project performance. Rather, it is possible that there are a smaller number of clusters of projects, and within each of these there are projects having similar patterns of achievement across the various possible outcomes.

With this in mind I have been exploring the use of two different clustering algorithms: k-Means clustering and DBSCAN clustering. Both are described in practically useful detail in Kotu and Deshpande's book "Predictive Analytics and Data Mining".

With k-Means you have to specify the number of clusters you are looking for (k), which may be useful in some circumstances, but I would prefer to find an "ideal" number. This could be the number of clusters where there is the highest level of similarity of cases within a cluster compared to other alternative numbers of clusterings of the same cases. The performance metrics of k-Means clustering allow this kind of assessment to be made. The best performing clustering result I found identified four clusters. With DBSCAN you don't nominate any preferred number of clusters, but it turns out there are other parameters you do need to set, which also affect the result, including the number of clusters found. But again, you can compare and assess these using a performance measure, which I did. However, in this case the best performing result was two clusters rather than four!
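The idea of trying several values of k and keeping the best scoring one can be sketched without RapidMiner. What follows is a minimal pure-Python illustration under several assumptions of my own: the silhouette width stands in for whatever performance metric RapidMiner reports, the k-Means routine uses a deterministic farthest-point start rather than random starts, and the twelve two-dimensional points are invented rather than drawn from the 34-project data set.

```python
import math

def kmeans(points, k, iterations=50):
    """Basic k-Means; returns one cluster label per point."""
    # Assumption: farthest-point initialisation, chosen to keep the sketch deterministic.
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(math.dist(p, c) for c in centres)))
    labels = []
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda i: math.dist(p, centres[i])) for p in points]
        if new_labels == labels:
            break  # assignments have stopped changing
        labels = new_labels
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centres[i] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette width: higher means tighter, better separated clusters."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # convention for a point alone in its cluster
            continue
        a = sum(own) / len(own)  # mean distance to its own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Invented data: three tight groups of four points each.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 0), (10, 1), (11, 0), (11, 1),
    (0, 10), (0, 11), (1, 10), (1, 11),
]

scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k, {k: round(s, 2) for k, s in scores.items()})
```

The `silhouette` function only needs the points and their labels, so in principle the same comparison could be applied to labels produced by another clustering method under different parameter settings.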

What to do? Talk to the owners of the data, who know the details of the cases involved, and show them the alternative clusterings, including information on which projects belong to which clusters. Then ask them which clustering makes the most sense, i.e. is most interpretable, given their knowledge of these projects.

And then what? Having identified the preferred clustering model it would make sense then to go back to the full data set and develop predictive models for these clusters: i.e. to find what package of project attributes will best predict the particular cluster of outcome achievements that are of interest.