Friday, August 21, 2015

Clustering projects according to similarities in outcomes they achieve

Among some users of LogFrames it is verboten to have more than one Purpose-level (i.e. outcome) statement. They are likely to argue that where there are multiple intended outcomes a project's efforts will be dissipated and will ultimately be ineffective. However, a reasonable counter-argument would be that in some cases multiple outcome measures may simply be a more nuanced description of an outcome that others might want to insist is expressed in a singular form.

The "problem" of multiple outcome measures becomes more common when we look at portfolios of projects, where there may be one or two over-arching objectives but it is recognised that there are multiple pathways to their achievement. Or it may be recognised that individual projects will want to try different mixes of strategies, rather than just one alone.

How can an evaluator deal with multiple outcomes, and data on these? Some years ago, one strategy that I used was to gather the project staff together to identify, for each Output, its expected relative causal contribution to each of the project Outcomes. These judgements were expressed as individual values that added up to 100 percentage points per Outcome, plotted in an (Excel) Outputs x Outcomes matrix, and projected onto a screen for all to see, argue over and edit. The results enabled us to prioritise which Output-to-Outcome linkages to give further attention to, and to identify, in aggregate, which Outputs would need more attention than others.
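For readers who would rather see the matrix idea outside Excel, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the Output and Outcome names, the percentage values, the helper function names and the threshold of 50 points are all hypothetical and are not taken from the original workshop.

```python
# Hypothetical workshop judgements (illustrative values only).
# Each row is an Output; each column is an Outcome.
# Values are percentage points of expected causal contribution.
outcomes = ["Outcome A", "Outcome B", "Outcome C"]
contribution = {
    "Output 1": [50, 10, 0],
    "Output 2": [30, 20, 40],
    "Output 3": [20, 70, 60],
}


def outcome_totals(matrix):
    """Sum each Outcome column; each should come to 100 points."""
    return [sum(column) for column in zip(*matrix.values())]


def rank_outputs(matrix):
    """Rank Outputs by their total contribution across all Outcomes."""
    totals = {output: sum(row) for output, row in matrix.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def strongest_links(matrix, outcome_names, threshold=50):
    """List Output-to-Outcome links at or above a chosen threshold."""
    return [
        (output, outcome_names[j], value)
        for output, row in matrix.items()
        for j, value in enumerate(row)
        if value >= threshold
    ]


print(outcome_totals(contribution))
print(rank_outputs(contribution))
print(strongest_links(contribution, outcomes))
```

The column check catches judgements that do not add up to 100 per Outcome, the ranking gives the "in aggregate" view of which Outputs matter most, and the list of strongest links is one possible way of picking out linkages for further attention.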

There is also another possible approach. More recently I have been exploring the potential uses of clustering modules within the RapidMiner data mining package. I have a data set of 34 projects with data on their achievements on 11 different outcome measures. A month ago I was contracted to develop some predictive models for each of these outcomes, which I did. But it now occurs to me that doing so may be somewhat redundant, in that there may not really be 11 different types of project performance. Rather, it is possible that there are a smaller number of clusters of projects, and within each of these there are projects having similar patterns of achievement across the various possible outcomes.

With this in mind I have been exploring the use of two different clustering algorithms: k-Means clustering and DBSCAN clustering. Both are described in practically useful detail in Kotu and Deshpande's book "Predictive Analytics and Data Mining".

With k-Means you have to specify the number of clusters you are looking for (k), which may be useful in some circumstances, but I would prefer to find an "ideal" number. This could be the number of clusters at which the cases within each cluster are most similar to each other, compared with the results for other numbers of clusters applied to the same cases. The performance metrics for k-Means clustering allow this kind of assessment to be made. The best performing clustering result I found identified four clusters. With DBSCAN you don't nominate any preferred number of clusters, but it turns out there are other parameters you do need to set, which also affect the result, including the number of clusters found. But again, you can compare and assess these results using a performance measure, which I did. However, in this case the best performing result was two clusters rather than four!
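The search for an "ideal" k can be sketched in plain Python, for those without RapidMiner to hand. This is a rough sketch under stated assumptions and not the RapidMiner procedure: the twelve projects and their scores on two outcome measures are invented, the centroids are initialised with a simple farthest-point rule so that the run is repeatable, and the silhouette score is swapped in as the performance measure (RapidMiner offers its own cluster performance operators, which may rank results differently). DBSCAN is not covered here.

```python
import math

# Hypothetical data: 12 projects scored on two outcome measures (0-100).
projects = [
    (10, 10), (12, 14), (14, 11), (11, 13),
    (80, 15), (82, 12), (85, 17), (83, 14),
    (45, 85), (48, 88), (50, 84), (47, 90),
]


def dist(a, b):
    """Euclidean distance between two projects."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, k, iterations=50):
    """Basic k-Means with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from its nearest centroid.
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return labels


def silhouette(points, labels):
    """Mean silhouette score: higher means tighter, better separated clusters."""
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, p in enumerate(points):
        own = [
            dist(p, q)
            for j, q in enumerate(points)
            if labels[j] == labels[i] and j != i
        ]
        if not own:
            scores.append(0.0)  # usual convention for single-member clusters
            continue
        a = sum(own) / len(own)
        b = min(
            sum(dist(p, q) for q, lab in zip(points, labels) if lab == c)
            / labels.count(c)
            for c in clusters
            if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Try several values of k and keep the one with the best score.
scores = {k: silhouette(projects, k_means(projects, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(scores)
print("Best k:", best_k)
```

On this toy data the three obvious groups score best. Real project data will rarely be so tidy, and a different performance measure or a different clustering algorithm may well point to a different number of clusters.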

What to do? Talk to the owners of the data, who know the details of the cases involved, and show them the alternative clusterings, including information on which projects belong to which clusters. Then ask them which clustering makes the most sense, i.e. is most interpretable, given their knowledge of these projects.

And then what? Having identified the preferred clustering model, it would then make sense to go back to the full data set and develop predictive models for these clusters, i.e. to find what package of project attributes will best predict the particular cluster of outcome achievements that is of interest.