Friday, August 21, 2015

Clustering projects according to similarities in outcomes they achieve

Among some users of LogFrames it is verboten to have more than one Purpose level (i.e. outcome) statement. They are likely to argue that where there are multiple intended outcomes a project's efforts will be dissipated and will ultimately be ineffective. However, a reasonable counter-argument is that in some cases multiple outcome measures may simply be a more nuanced description of an outcome that others might insist be expressed in a singular form.

The "problem" of multiple outcome measures becomes more common when we look at portfolios of projects, where there may be one or two over-arching objectives but it is recognised that there are multiple pathways to their achievement. Or it is recognised that individual projects may want to try different mixes of strategies, rather than just one alone.

How can an evaluator deal with multiple outcomes, and data on these? Some years ago one strategy that I used was to gather the project staff together to identify, for each Output, its expected relative causal contribution to each of the project outcomes. These judgements were expressed as individual values that added up to 100 percentage points per outcome, plotted in an (Excel) Outputs x Outcomes matrix, projected onto a screen for all to see, argue over and edit. The results enabled us to prioritise which Output-to-Outcome linkages to give further attention to, and to identify, in aggregate, which Outputs would need more attention than others.
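A minimal sketch of this kind of matrix, with entirely hypothetical Output and Outcome names and numbers, might look like this:

```python
# Sketch of the Outputs x Outcomes contribution matrix described above.
# All names and numbers are hypothetical: each column holds staff judgements
# of the relative causal contribution of each Output to one Outcome,
# and each column must sum to 100 percentage points.

outputs = ["Output A", "Output B", "Output C"]
outcomes = ["Outcome 1", "Outcome 2"]

# rows = outputs, columns = outcomes
contributions = [
    [60, 10],   # Output A
    [30, 30],   # Output B
    [10, 60],   # Output C
]

# sanity check: each outcome's column adds up to 100 percentage points
for j, outcome in enumerate(outcomes):
    assert sum(row[j] for row in contributions) == 100, outcome

# aggregate "attention" score per Output: its total contribution across outcomes
totals = {name: sum(row) for name, row in zip(outputs, contributions)}
for name, total in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {total} points across all outcomes")
```

In practice the same layout in an Excel sheet, projected on a screen, does the job; the point is simply that column totals constrain the judgements and row totals reveal which Outputs carry the most causal weight.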

There is also another possible approach. More recently I have been exploring the potential uses of clustering modules within the RapidMiner data mining package. I have a data set of 34 projects with data on their achievements on 11 different outcome measures. A month ago I was contracted to develop some predictive models for each of these outcomes, which I did. But it now occurs to me that doing so may be somewhat redundant, in that there may not really be 11 different types of project performance. Rather, it is possible that there are a smaller number of clusters of projects, and within each of these there are projects having similar patterns of achievement across the various possible outcomes.

With this in mind I have been exploring the use of two different clustering algorithms: k-Means clustering and DBSCAN clustering. Both are described in practically useful detail in Kotu and Deshpande's book "Predictive Analytics and Data Mining".

With k-Means you have to specify the number of clusters you are looking for (k), which may be useful in some circumstances, but I would prefer to find an "ideal" number. This could be the number of clusters where there is the highest level of similarity of cases within a cluster, compared to alternative clusterings of the same cases. The performance metrics of k-Means clustering allow this kind of assessment to be made. The best performing clustering result I found identified four clusters. With DBSCAN you don't nominate any preferred number of clusters, but it turns out there are other parameters you do need to set, which also affect the result, including the number of clusters found. But again, you can compare and assess these using a performance measure, which I did. However, in this case the best performing result was two clusters rather than four!
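To illustrate the kind of comparison involved, here is a toy sketch in Python rather than RapidMiner: a plain k-means on one-dimensional made-up scores, scored with an average within-cluster distance (the sort of performance metric RapidMiner reports). The data and the specifics are mine, not from the 34-project data set.

```python
import random

def kmeans(points, k, iters=100, seed=1):
    """Plain k-means on 1-D data; a toy stand-in for RapidMiner's module."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # move each centroid to the mean of its cluster (keep it if empty)
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

def avg_within_cluster_distance(centroids, clusters):
    """Performance metric: lower = tighter, more similar clusters."""
    dists = [abs(p - centroids[i]) for i, c in enumerate(clusters) for p in c]
    return sum(dists) / len(dists)

# hypothetical outcome scores for a handful of projects (two obvious groups)
scores = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

for k in (1, 2, 3):
    cents, clus = kmeans(scores, k)
    print(k, round(avg_within_cluster_distance(cents, clus), 2))
```

One caveat worth noting: this metric keeps falling as k grows, so in practice you compare it across several values of k and look for the point of diminishing returns, or use a metric (such as Davies-Bouldin) that penalises adding extra clusters.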

What to do? Talk to the owners of the data, who know the details of the cases involved, and show them the alternative clusterings, including information on which projects belong to which clusters. Then ask them which clustering makes the most sense, i.e. is most interpretable, given their knowledge of these projects.

And then what? Having identified the preferred clustering model it would make sense then to go back to the full data set and develop predictive models for these clusters: i.e. to find what package of project attributes will best predict the particular cluster of outcome achievements that are of interest.


Friday, June 26, 2015

Evolving better performing hypotheses, using Excel



You may not know this, but there is an Excel add-in called Solver. This can be used to evolve better solutions to computational problems.

It can also be used to identify hypotheses that have a better fit with available data. Here is how I have been using Solver in Excel....

I start by entering a data set into Excel that is made up of cases (e.g. 25 projects), listed row by row. The columns then describe attributes of those projects, e.g. as captured in project completion reports. These attributes can include aspects of the project context, the way the project was working, plus some outcome performance measures.

A filtering mechanism is then set up in Excel, where you can choose a specific sub-set of the attributes that are of interest (aka our hypothesis), and then all the projects that have these attributes are automatically highlighted.

The performance of this sub-set of projects is then tested, using a simple device called a Confusion Matrix, which tells us what proportion of the selected projects have "successful" versus "unsuccessful" outcomes (if that is the outcome of interest). Ideally, if we have selected the right set of attributes then the proportion of "successful" projects in the selected sub-set will be greater than their prevalence rate in the whole set of projects.
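The test can be sketched in a few lines of Python; the projects, attribute names and outcomes below are hypothetical, purely for illustration:

```python
# Each project: its attributes and whether it was judged "successful".
# All names and data here are hypothetical, for illustration only.
projects = [
    ({"local_partner": 1, "pilot_phase": 1}, "successful"),
    ({"local_partner": 1, "pilot_phase": 0}, "successful"),
    ({"local_partner": 1, "pilot_phase": 1}, "unsuccessful"),
    ({"local_partner": 0, "pilot_phase": 1}, "unsuccessful"),
    ({"local_partner": 0, "pilot_phase": 0}, "unsuccessful"),
]

hypothesis = {"local_partner": 1}  # the attribute sub-set being tested

def confusion_matrix(projects, hypothesis):
    """Count true/false positives and negatives for the hypothesis."""
    tp = fp = fn = tn = 0
    for attrs, outcome in projects:
        selected = all(attrs.get(k) == v for k, v in hypothesis.items())
        if selected and outcome == "successful":
            tp += 1          # selected and successful
        elif selected:
            fp += 1          # selected but unsuccessful
        elif outcome == "successful":
            fn += 1          # successful but missed by the hypothesis
        else:
            tn += 1          # correctly left out
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion_matrix(projects, hypothesis)
precision = tp / (tp + fp)               # success rate within the sub-set
prevalence = (tp + fn) / len(projects)   # success rate in the whole set
print(precision, prevalence)  # the hypothesis helps if precision > prevalence
```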

But given that the number of possible hypotheses available to test doubles with each extra attribute added into the data set, this sort of intuitive/theory led search could take us a long time to find the best available solution. Especially when the data set has a large number of attributes.

This is where Solver helps. Using its evolutionary algorithm the Solver add-in provides a quick means of searching a very large space of possible solutions. To do this there are three parameters which need to be set, before setting it to work. The first is the Objective, which is the value you want to maximise. I usually choose the "Accuracy" measure in the Confusion Matrix. The second is the range of cells whose values can be varied. These are the ones identifying the set of project attributes, which can be used to make a hypothesis. They can be set to present (1), absent (0) or not relevant (2). The third is the Constraints, which limit the values that these variable cells can take, e.g. not negative, and nothing other than these three types of values.

In the space of a minute Solver then explores up to 70,000 possible combinations of project attributes to find the combination that generates the most accurate prediction, i.e. a set of projects with the highest possible proportion of "successful" projects. In my recent trials, using a real data set, the accuracy levels have been up around 90%. I have been able to compare the results with those found using another algorithm which I have written about in earlier posts here, a Decision Tree algorithm. Despite being quite different algorithms, the solutions that both have found (i.e. the specific set of project attributes) have been very similar in content, and both had the same level of accuracy.

An important side benefit of this approach to finding out "what works" is that by manually tweaking the solutions found by Solver you can measure the extent to which each attribute in the winning solution makes a difference to its overall accuracy. This is the kind of question many donors want to answer re the projects they fund, when they ask "What difference does my particular contribution make?"

If you want help setting up Excel to try out this method, and have a data set ready to work on, feel free to contact me for advice: rick.davies@gmail.com, or leave a comment below.