Monday, December 28, 2015

Aiming for the stars versus "the adjacent possible"

Background: I have been exploring the uses of a new Excel application I have been developing with the help of Aptivate, provisionally called EvalC3. You can find out more about it here:

If you have a data set that describes a range of attributes of a set of projects, plus an outcome measure for these projects which is of interest, you may be able to identify a set of attributes (aka a model) which best predicts the presence of the outcome.

In one small data experiment I used a randomly generated data set, with 100 cases and 10 attributes. Using EvalC3 I found that the presence of attributes "A" and "I" best predicted the presence of the outcome, with an accuracy of 65%. In other words, of all the cases with these attributes, 65% also had the outcome present.
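The kind of search involved can be illustrated with a short sketch. EvalC3 itself is an Excel application, so this Python stand-in, with its own invented random data, only illustrates the idea of exhaustively scoring every two-attribute model:

```python
import random
from itertools import combinations

random.seed(1)

# Hypothetical stand-in for the data set described above:
# 100 cases, 10 binary attributes labelled A..J, plus a binary outcome.
attrs = [chr(ord("A") + i) for i in range(10)]
cases = [{a: random.randint(0, 1) for a in attrs} for _ in range(100)]
for case in cases:
    case["outcome"] = random.randint(0, 1)

def model_accuracy(model, cases):
    """Of the cases that have all the model's attributes present,
    what proportion also have the outcome present?"""
    matching = [c for c in cases if all(c[a] == 1 for a in model)]
    if not matching:
        return 0.0
    return sum(c["outcome"] for c in matching) / len(matching)

# Exhaustive search over all two-attribute models.
best = max(combinations(attrs, 2), key=lambda m: model_accuracy(m, cases))
print(best, round(model_accuracy(best, cases), 2))
```

With real data the best model and its score would of course differ; the point is only that the space of small models can be searched exhaustively.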

Imagine I am running a project with the attributes D and J but not A or I. In the data set this set of attributes was associated with the presence of the outcome in 49% of the cases. Not very good really; I probably need to make some changes to the project design. But if I want to do the best possible, according to the data analysis so far, I will need to ditch the core features of my current project (D and J) and replace them with the new features (A and I). This sounds like a big risk to me.

Alternatively, I could explore what Stuart Kauffman has called "the adjacent possible". In other words, make small changes to my project design that might improve its likelihood of success, even though the improvements might fall well short of the optimum level shown by the analysis above (i.e. 65%).

If data were available on a wide range of projects I could do this exploration virtually, in the sense of finding other projects with attributes similar to, but different from, mine, and seeing how well they performed. In my data-based experiment my existing project had attributes D and J. Using EvalC3 I then carried out a systematic search for a better set of attributes that kept these two original attributes but introduced one extra attribute. This is what could be called a conservative innovation strategy. The search process found that including a particular extra attribute in the design improved the accuracy of my project model from 49% to 54%. Introducing another particular attribute then improved it to 59%.
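That conservative strategy can be sketched as a greedy search: keep the existing attributes and, at each step, add whichever single extra attribute most improves the score. Again the data and scoring function here are invented stand-ins, not EvalC3's own internals:

```python
import random

random.seed(1)

# Hypothetical stand-in data: binary attributes A..J plus a binary outcome.
attrs = [chr(ord("A") + i) for i in range(10)]
cases = [{a: random.randint(0, 1) for a in attrs} for _ in range(100)]
for case in cases:
    case["outcome"] = random.randint(0, 1)

def model_accuracy(model, cases):
    matching = [c for c in cases if all(c[a] == 1 for a in model)]
    return sum(c["outcome"] for c in matching) / len(matching) if matching else 0.0

def conservative_search(base, cases, max_steps=2):
    """An 'adjacent possible' search: keep the base attributes and add one
    extra attribute at a time, keeping each best addition, rather than
    jumping straight to the globally best model."""
    model = list(base)
    for _ in range(max_steps):
        candidates = [a for a in attrs if a not in model]
        best = max(candidates, key=lambda a: model_accuracy(model + [a], cases))
        if model_accuracy(model + [best], cases) <= model_accuracy(model, cases):
            break  # no adjacent design improves on the current one
        model.append(best)
    return model

improved = conservative_search(["D", "J"], cases)
print(improved, round(model_accuracy(improved, cases), 2))
```

Each step moves only one attribute away from the current design, which is what keeps the strategy low risk.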

So what? Well, if you are running an existing project and there is a real-life data set of reasonably comparable (but not identical) projects, you would be able to explore relatively low-risk ways of improving your performance. The findings from the same data set on the model which produced the best possible performance (65% in the example above) might be more relevant to those designing new projects from scratch. Secondly, your subsequent experience with these cautious experiments could be used to update and extend the project database with extra data on what is effectively a new case, i.e. a project with a new set of attributes slightly different from its previous status.

The connection with evolutionary theory: On a more theoretical level you may be interested in the correspondence of this approach with evolutionary strategies for innovation. As I have explained elsewhere: "Evolution may change speed (e.g. as in punctuated equilibrium), but it does not make big jumps. It progresses through numerous small moves, exploring adjacent spaces of what else might be possible. Some of those spaces lead to better fitness, some to less. This is low-cost exploration; big mutational jumps involve much more risk that the changes will be dysfunctional, or even terminal." A good read on how innovation arises from such reiterated local searches is Andreas Wagner's recent book "Arrival of the Fittest".

Fitness landscapes: There is another concept from evolutionary theory that is relevant here: the metaphor of a "fitness landscape". Any given position on the landscape represents, in simplified form, one of many possible designs in what is in reality a multidimensional space of possible designs. The height of any position on the landscape represents the relative fitness of that design, higher being more fit. Fitness in the example above is the performance of the model in accurately predicting whether an outcome is present or not.

An important distinction can be made between fitness landscapes, or parts thereof: whether they are smooth or rugged. A smooth landscape means the transition from the fitness of one design (a point in the landscape) to that of a very similar design located next door is not sudden but gradual, like a gentle slope on a real landscape. A rugged landscape is the opposite: the fitness of one design may be very different from the fitness of a design immediately next door (i.e. one that is very similar). Metaphorically speaking, immediately next door there may be a sinkhole or a mountain. A conservative innovation strategy as described above will work better on a smooth landscape, where there are no sudden surprises.

With data sets of the kind described above it may be possible to measure how smooth or rugged a fitness landscape is, and thus make informed choices about the best innovation strategy to use. As mentioned elsewhere on this website, the similarity of the attributes of two cases can be measured using Hamming distance, which is simply the proportion of all their attributes which differ from each other. If each case in a data set is compared to all other cases in the same data set, then each case can be described in terms of its average similarity with all other cases. In a smooth landscape very similar cases should have a similar fitness level, i.e. be of similar "height", while more dissimilar cases should have more disparate fitness levels. In a rugged landscape the differences in fitness will have no relationship to similarity measures.
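One way this check might be sketched: compute the Hamming distance for every pair of cases, then correlate distance with the difference in fitness. This is a toy illustration, not part of EvalC3; the fitness scores below are invented, and a simple distance/difference correlation is only one possible smoothness measure.

```python
from itertools import combinations

def hamming(case_a, case_b, attrs):
    """Proportion of attributes on which two cases differ."""
    return sum(case_a[k] != case_b[k] for k in attrs) / len(attrs)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def smoothness(cases, fitness, attrs):
    """Correlate pairwise attribute distance with pairwise fitness
    difference. A clear positive correlation suggests a smooth landscape
    (similar designs have similar fitness); a correlation near zero
    suggests a rugged one."""
    dists, diffs = [], []
    for i, j in combinations(range(len(cases)), 2):
        dists.append(hamming(cases[i], cases[j], attrs))
        diffs.append(abs(fitness[i] - fitness[j]))
    return pearson(dists, diffs)

# Invented example: three cases, two attributes, made-up fitness scores.
attrs = ["A", "B"]
cases = [{"A": 1, "B": 1}, {"A": 1, "B": 0}, {"A": 0, "B": 0}]
fitness = [0.65, 0.54, 0.49]
print(round(smoothness(cases, fitness, attrs), 2))
```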

Postscript: In my 2015 analysis of Civil Society Challenge Fund data it seemed that there were often adjacent designs that did almost as well as the best-performing designs that could be found. This finding suggests that we should be cautious about research- or evaluation-based claims about "what works" that are too dogmatic and exclusive of other possibly relevant versions.

Saturday, December 26, 2015

False Positives - why we should pay more attention to them

In the last year I have been involved in two pieces of work that have sought to find patterns in data that are good predictors of project outcomes of interest: in one case as the researcher, in the other in a quality assurance role, looking over someone else's analysis.

In both situations two types of prediction rules were found: (a) some confirming stakeholders' existing understandings, (b) others contradicting that understanding and/or proposing a novel perspective. The value of further investigating the latter was evident, but the value of investigating findings that seemed to confirm existing views seemed less evident to the clients in both cases. "We know that... let's move on... show us something new" seemed to be the attitude. After some time, it occurred to me that a different next step was needed for each of these kinds of findings:

  • Where findings are novel, it is the True Positive cases that need further investigation. These are the cases where the outcome was predicted by a rule, and confirmed as being present by the data.
  • Where findings are familiar, it is the False Positives that need further investigation. These are the cases where the rule predicted the outcome but the data indicated the outcome was not present. In my experience so far, most of the confirmatory prediction rules had at least some False Positives. These are important to investigate because doing so could help identify important boundaries to our confidence about where and when a given rule works.
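The four categories involved can be made concrete with a small sketch. The data below is invented; the labels follow standard confusion-matrix usage:

```python
def confusion_counts(predictions, outcomes):
    """Tally the four categories discussed above: True Positives (rule
    predicted the outcome and it was present), False Positives (predicted
    but absent), False Negatives (not predicted but present), and True
    Negatives (neither predicted nor present)."""
    tp = fp = fn = tn = 0
    for pred, actual in zip(predictions, outcomes):
        if pred and actual:
            tp += 1
        elif pred and not actual:
            fp += 1
        elif not pred and actual:
            fn += 1
        else:
            tn += 1
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

# Invented example: a rule that fires on 6 of 10 cases,
# 4 of which actually had the outcome present.
counts = confusion_counts(
    [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 1, 0, 0, 0],
)
print(counts)  # {'TP': 4, 'FP': 2, 'FN': 1, 'TN': 3}
```

The two False Positives here are the cases that would repay closer investigation when a rule merely confirms what stakeholders already believe.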
Thinking more widely, it occurred to me how much more attention we should pay to False Positives in the way public policy supposedly works. In wartime, civilian casualties are often False Positives in calculations about the efficacy of airstrikes, for example. We hear about the number of enemy combatants killed, but much less often about the civilians killed by the same "successful" strikes. There are many areas of public policy, especially in law I suspect, where there are the equivalent of these civilian deaths, metaphorically if not literally. The "War on Drugs" and the current "War on Terrorism" are two that come to mind. Those implementing these policies are preoccupied with the numbers of True Positives they have achieved and with the False Negatives, i.e. the cases known but not yet detected and hit. But counting False Positives is much less in their immediate interest, which raises the question: if not by them, then by whom?

Some Christmas/New Year thoughts from a dry, warm, safe and secure house in the northern hemisphere...

PS : see

Meta versus macro theories of change

A macro-ToC is a single ToC that seeks to aggregate into one view the contents of many micro-ToCs. For example, the aggregation of many project-specific ToCs into a single portfolio ToC. There are two risks with this approach:

  1. The loss of detail involved in this aggregation will lead to a loss of measurability, which presents problems for the evaluability of a macro-ToC.
  2. Even where the macro-ToC can be tested, the relevance of the results to a specific project could be contested, because individual projects could challenge the macro-ToC as a representation of their project intentions.
The alternative to a macro-ToC is something that could be called a meta-ToC. A meta-theory is a theory about theories. A meta-ToC would be a structured set of ideas about the significant differences between various ToCs. Consider the following (imagined) structure. This is a nested classification of projects. Each branch represents what might be seen as significant differences between projects, ideally as apparent in the contents of their ToCs and associated documents. This kind of structure can be developed by participatory or expert judgement methods. The former is preferable because it could increase buy-in to the final representation by the constituent projects.
The virtue of this approach is that, if well done, each difference in the tree structure represents the seed of a hypothesis that could be the focus of attention in a macro evaluation. If each difference represents the most significant difference the participants could identify, then a follow-up question that should be asked is: "What difference has this difference made, or will it make?" The answers to these questions are essentially hypotheses, ones that should be testable by comparing the projects fitting into the two categories described.

Some of these differences will be more worthwhile testing than others, if they cover more projects. For example, in the tree structure above, the difference in "Number of funders" applies to all five projects, whereas the difference in "Geographic scale of project" only applies to two projects. More important differences, those that apply to more projects, will also, by definition, have more cases that can be compared to each other.
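The idea can be sketched with a nested data structure. The tree below is entirely imagined: it borrows only the two branch names mentioned above ("Number of funders", "Geographic scale of project"), and the branch values and project assignments are hypothetical; a real structure would come from the participatory sorting process described.

```python
# An imagined nested classification of five projects. Leaves are lists of
# projects; internal nodes are branch points (significant differences).
tree = {
    "Number of funders": {
        "single funder": {
            "Geographic scale of project": {
                "national": ["Project 1"],
                "local": ["Project 2"],
            }
        },
        "multiple funders": ["Project 3", "Project 4", "Project 5"],
    }
}

def coverage(node):
    """Count the projects sitting under a node. Differences higher in the
    tree cover more projects, so they yield more comparable cases."""
    if isinstance(node, list):
        return len(node)
    return sum(coverage(child) for child in node.values())

# The top-level difference covers all five projects...
print(coverage(tree))  # 5
# ...while the lower "Geographic scale" branch covers only two.
print(coverage(tree["Number of funders"]["single funder"]))  # 2
```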

It is also possible to identify compound hypotheses worth testing. Participants could be asked to walk down each branch in turn and indicate at each branch point: "Which of these types of projects do you think has been, or will be, the most successful?" The combination of project attributes described by a given branch is the configuration of conditions hypothesised to lead to the predicted result.

PS: These thoughts have been prompted by my experience of being involved in a number of macro-evaluations of projects in recent years.