Thursday, March 28, 2019

Where there is no (decent / usable) Theory of Change...



I have been reviewing a draft evaluation report in which two key points are made about the relevant Theory of Change:

  • A comprehensive assessment of the extent to which expected outcomes were achieved (effectiveness) was not carried out, as the xxx TOC defines these only in broad terms.
  • ...this assessment was also hindered by the lack of a consistent outcome monitoring system.
I am sure this situation is not unique to this program. 

Later in the same report, I read about the evaluation's sampling strategy. As with many other evaluations I have seen, the aim was to sample a diverse range of locations in a way that was maximally representative of the diversity of how and where the program was working. This is quite a common approach and a reasonable one at that.

But it did strike me later on that this intentionally diverse sample was an underexploited resource. If 15 different locations were chosen, one could imagine a 15 x 15 matrix. Each of the cells in the matrix could be used to describe how a row location compared to a column location. In practice, only half the matrix would be needed, because each relationship would be mentioned twice e.g. Row location A and its relation to Column location J would also be covered by Row location J and its relation to Column location A.

What sort of information would go in such cells? Obviously, there could be a lot to choose from. But one option would be to ask key stakeholders, especially those funding and/or managing any two compared locations. I would suggest they be asked something like this:
  • "What do you think is the most significant difference between these two locations/projects, in the ways they are working?"
And then ask a follow-up question...
  • "What difference do you think this difference will make?"
The answers are potential (if...then...) hypotheses, worth testing by an evaluation team. In a matrix generated by a sample of 15 locations, this exercise could generate ((15*15)-15)/2 = 105 potentially useful hypotheses. These could then be subject to a prioritisation / filtering exercise, which should include considerations of their evaluability (Davies, 2013): more specifically, how they relate to any Theory of Change, whether relevant data is available, and whether any stakeholders are interested in the answers.
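The arithmetic above can be checked with a minimal sketch. The location labels A to O are placeholders of my own (the report's locations are not named here), and the record fields are hypothetical names for the answers to the two questions.

```python
from itertools import combinations

# Hypothetical location labels; the report's actual locations are not named.
locations = [chr(ord("A") + i) for i in range(15)]  # "A" .. "O"

# Each unordered pair is one cell in the upper half of the 15 x 15 matrix,
# so (A, J) appears once and (J, A) does not appear at all.
pairs = list(combinations(locations, 2))

# Same count as the formula in the text: ((n * n) - n) / 2.
n = len(locations)
expected = (n * n - n) // 2

# One empty record per pair, ready to hold the stakeholders' answers.
hypotheses = [
    {"row": a, "column": b, "difference": None, "expected_consequence": None}
    for a, b in pairs
]

print(len(pairs), expected)
```

Each record could then be filled in with the answers to the two questions and scored during the prioritisation / filtering exercise.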

Doing so might also help address a more general problem, which I have noted elsewhere (Davies, 2018) and which was also a characteristic of the evaluation mentioned above: the prevalence in evaluation ToRs of open-ended evaluation questions, rather than hypothesis-testing questions:
"While they may refer to the occurrence of specific outcomes or interventions, their phrasings do not include expectations about the particular causal pathways that are involved. In effect these open-ended questions imply either that those posting the questions either know nothing, or they are not willing to put what they think they know on the table as testable propositions. Either way this is bad news, especially if the stakeholders have any form of programme funding or programme management responsibilities. While programme managers are typically accountable for programme implementation it seems they and their donors are not being held accountable for accumulating testable knowledge about how these programmes actually work. Given the decades-old arguments for more adaptive programme management, it’s about time this changed (Rondinelli, 1993; DFID, 2018)." (Davies, 2018)



Saturday, March 09, 2019

On using clustering algorithms to help with sampling decisions



I have spent the last two days in a training workshop run by BigML, a company that provides very impressive, coding-free, online machine learning services. One of the sessions was on the use of clustering algorithms, an area I have some interest in, but have not done much with, over the last year or so. The whole two days were very much centered around data and the kinds of analyses that could be done using different algorithms, and with more aggregated workflow processes.

Independently, over the previous two weeks, I have had meetings with the staff of two agencies in two different countries, both at different stages of carrying out an evaluation of a large set of their funded projects. By large, I mean 1000+ projects. One is at the early planning stage, the other is now in the inception stage. In both evaluations, the question of what sort of sampling strategy to use was a real concern.

My most immediate inclination was to think of using a stratified sampling process, where the first unit of analysis would be the country, then the projects within each country. In one of the two agencies, the projects were all governance related, so an initial country level sampling process seemed to make a lot of sense. Otherwise, the governance projects would risk being decontextualized. There were already some clear distinctions between countries in terms of how these projects were being put to work, within the agency's country strategy. These differences could have consequences. The articulation of any expected consequences could provide some evaluable hypotheses, giving the evaluation a useful focus, beyond the usual endless list of open-ended questions typical of so many evaluation Terms of Reference.

This led me to speculate on other ways of generating such hypotheses, such as getting key staff managing these projects to do pile/card sorting exercises to sort countries, then projects, into pairs of groups, separated by a difference that might make a difference. These distinctions could reflect ideas embedded in an overarching theory of change, or more tacit and informal theories in the heads of such staff, which may nevertheless still be influential because they were operating (but perhaps untested) assumptions. They would provide other sources of what could be evaluable hypotheses.

However, regardless of whether it was a result of a systematic project document review or pile sorting exercises, you could easily end up with many different attributes that could be used to describe projects and then used as the basis of a stratified sampling process. One evaluation team seemed to be facing this challenge right now, struggling to decide what attributes to choose. (PS: this problem can arise either from having too many theories or no theory at all.)

This is where clustering algorithms, like K-means clustering, could come in handy. On the BigML website you can upload a data set (e.g. projects with their attributes) and then do a one-click cluster analysis. This will find clusters of projects that have a number of interesting features: (a) Similarity within clusters is maximised, (b) Dissimilarity between clusters is maximised and visualised, (c) It is possible to identify what are called "centroids", i.e. the attribute values which are most central to the identity of a cluster.
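For readers who want to see what sits behind the one click, here is a minimal pure-Python sketch of K-means. It is not BigML's implementation; the two project attributes (imagined as budget and duration, scaled to 0-1) and all function names are my own assumptions, and in this sketch a centroid is simply the mean attribute values of a cluster's members.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two attribute tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Attribute-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def k_means(points, k, iterations=100, seed=0):
    """Return (cluster label per point, list of centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    assignments = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clusters are stable
        assignments = new_assignments
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = mean_point(members)
    return assignments, centroids

# Hypothetical projects described by two attributes scaled to 0-1.
projects = [
    (0.10, 0.20), (0.15, 0.10), (0.20, 0.25),
    (0.80, 0.90), (0.85, 0.80), (0.90, 0.95),
]
labels, centroids = k_means(projects, k=2)
print(labels)
print(centroids)
```

With real project data the attributes would need to be put on comparable scales first, and the choice of k is itself a judgement call.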

These features are relevant to sampling decisions. A sample from within a cluster will have a high level of generalisability within that cluster because all cases within that cluster are maximally similar. Secondly, other clusters can be found which range in their degree of difference from that cluster. This is useful if you want to find two contrasting clusters that might capture a difference that makes a difference.
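A sketch of how cluster membership could then drive the sample: draw a fixed number of projects at random from within each cluster. The project IDs, cluster labels and the function name `sample_per_cluster` are hypothetical, made up for illustration.

```python
import random
from collections import defaultdict

def sample_per_cluster(project_ids, cluster_labels, per_cluster, seed=0):
    """Draw up to `per_cluster` projects at random from each cluster."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for pid, label in zip(project_ids, cluster_labels):
        by_cluster[label].append(pid)
    # Small clusters contribute all of their members.
    return {
        label: rng.sample(members, min(per_cluster, len(members)))
        for label, members in sorted(by_cluster.items())
    }

# Hypothetical project IDs and the cluster each one was assigned to.
project_ids = [f"P{i:03d}" for i in range(1, 11)]
cluster_labels = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]

sample = sample_per_cluster(project_ids, cluster_labels, per_cluster=2)
print(sample)
```

The fixed seed only makes the draw repeatable; it is not a recommendation about how many cases per cluster would be enough.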

I can imagine two types of analysis that might be interesting here:
1. Find a maximally different pair of clusters (A and B) and see if a set of attributes found to be associated with an outcome of interest in A is also present in B. This might be indicative of how robust that association is.
2. Find a maximally similar pair of clusters (A and C) and see if incremental alterations to a set of attributes associated with an outcome in A mean the outcome is not found associated in C. This might be indicative of how significant each attribute is.

These two strategies could be read as (1) Vary the context, (2) Vary the intervention.
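One way to find such pairs, sketched under my own assumptions, is to compare the distances between cluster centroids: the largest distance suggests the pair for strategy 1, the smallest the pair for strategy 2. The centroid values and cluster names below are invented for illustration, and Euclidean distance is just one possible measure of difference.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available from Python 3.8

# Hypothetical centroids: mean attribute values per cluster, scaled to 0-1.
centroids = {
    "A": (0.10, 0.20, 0.90),
    "B": (0.90, 0.80, 0.10),
    "C": (0.15, 0.25, 0.80),
    "D": (0.50, 0.50, 0.50),
}

# Distance between every unordered pair of clusters.
distances = {
    (a, b): dist(centroids[a], centroids[b])
    for a, b in combinations(sorted(centroids), 2)
}

most_different = max(distances, key=distances.get)  # candidate for strategy 1
most_similar = min(distances, key=distances.get)    # candidate for strategy 2

print("most different:", most_different)
print("most similar:", most_similar)
```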

For more information, check out this BigML video tutorial on cluster analysis. I found it very useful.

PS: I have also been exploring BigML's Association Rule facility. This could be very helpful as another means of analysing the contents of a given cluster of cases. This analysis will generate a list of attribute associations, ranked by different measures of their significance. Examining such a list could help evaluators widen their view of the possible causal configurations that are present.
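To give a feel for what such a ranked list contains, here is a rough pure-Python sketch that scores simple one-to-one rules (if X then Y) using support, confidence and lift, three measures commonly used in association rule mining. The cases and attribute names are invented, and I am not claiming this mirrors how BigML computes or ranks its rules.

```python
from itertools import permutations

# Hypothetical cases in one cluster, each described by the attributes present.
cases = [
    {"local_partner", "training", "outcome_achieved"},
    {"local_partner", "training", "outcome_achieved"},
    {"local_partner", "outcome_achieved"},
    {"training"},
    {"local_partner", "training"},
    {"grants"},
]

def support(items):
    """Share of cases that contain every attribute in `items`."""
    return sum(1 for case in cases if items <= case) / len(cases)

attributes = sorted(set().union(*cases))
rules = []
for antecedent, consequent in permutations(attributes, 2):
    both = support({antecedent, consequent})
    if both == 0:
        continue  # the two attributes never occur together
    confidence = both / support({antecedent})  # P(consequent | antecedent)
    lift = confidence / support({consequent})  # above 1: more often than chance
    rules.append((antecedent, consequent, both, confidence, lift))

# Rank by lift, highest first.
rules.sort(key=lambda rule: rule[4], reverse=True)
for antecedent, consequent, sup, conf, lift in rules:
    print(f"{antecedent} -> {consequent}: "
          f"support={sup:.2f} confidence={conf:.2f} lift={lift:.2f}")
```

A high-lift rule is only a pointer to a configuration worth examining, not evidence of causation.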