Wednesday, April 24, 2024

Developing and using a Configurational Theory of Change within an evaluation

 

Figure 1

Ho-hum, yet another evaluation brand being promoted in an already crowded marketplace. FFS...

Yes, I think this reaction is understandable, but I think there is something here captured under this title (Developing and using a Configurational Theory of Change... ) which has potential value.  I will try to explain...

Many evaluators make use of theories of change, as part of a theory-based approach to evaluation. Many theories of change are described in some type of diagrammatic form. And a typical feature of those diagrams is their convergent nature. That is, they start off with a range of different types of inputs and activities which follow various causal pathways towards a limited number of final outcomes.

This image is almost the complete opposite of what happens in actual practice on the ground. Financial inputs come from a limited number of sources. These become available to a small range of partners, who carry out their own range of activities in a variety of different locations, each with its own population, including those intended and not intended to be affected. This description is of course a simplification, but it applies to many development aid programme designs. The point I'm making here is that this in-reality process is not convergent, it is divergent! It seems that the diagrammatic theories of change I have described are a type of Procrustean bed.

This blog posting has been most immediately prompted by a report I have just reviewed on potential evaluation strategies for a large national level climate finance strategy (CFS). The theory of change describes multiple causal pathways connecting the initial provision of government finance through to four types of expected impacts. In two of these causal pathways alone, the number of projects being funded is in the hundreds. The report struggled with the issue of how to measure the expected impacts, given the scale and likely diversity of events on the ground, and with the corresponding challenge of how to sample those projects. Part of my diagnosis of the problem here was the evaluation team's measurement-led approach, and the weakness of the conceptual framework, i.e. the incapacity of the theory of change to capture the diversity of what was taking place.

Describing the alternative to my client is now my challenge. I think the alternative has two parts. Firstly, one should start at the beginning, where the money becomes available, and then follow the money (and the people responsible) as it gets distributed according to its intended purposes. If things are not happening as expected early on in this process then this affects expectations of what might and might not be observable later on in the form of 'outcomes' or 'impacts'. Put crudely, there is no point trying to observe the impact of something that has not yet been delivered. And in the case of strategies like the CFS, a large part of success can simply be getting the money where it should be spent.

Secondly, as money is distributed from a central fund, decisions are going to be made about how it should be parcelled out in different amounts for different purposes through different institutions. Each time that happens, the decisions that have been made about how to do this are hopefully not random. Evaluating how those decisions were made may not necessarily be all that useful, because often there will be opaque mini, meso and macro political processes involved. But the announced decisions may include some intentionally explicit expectations about the official purposes of different allocations. Interviews with those responsible for those allocations might also elicit more informal and more current expectations about what might be the short- and longer-term effects of some of these allocations, when compared to others. The point I am emphasising here is that sometimes we can come to evaluative judgements not through the use of any overriding predetermined criteria, but by using a more inductive process, where we compare one option to another. This is an excuse for me to quote Marx (G):

Friend says to Marx – 'Life is difficult'.

Marx replies to friend – 'Compared to what?'

This type of inductive comparative evaluation doesn't have to be completely free form. It is conceivable, for example, that we could look at two tranches of government climate finance funding and ask those with proximate responsibilities for that funding what difference there might be between those blocks of funding, in terms of how each might meet one or more of the OECD criteria. (These range in their concerns from the more immediate issues of coherence and efficiency to later concerns with effectiveness and impact.) Respondents' answers, in the form of expectations, can be seen as mini theories, a.k.a. hypotheses, that might then be testable through the gathering of relevant data.

Before these questions can be posed, the cases that are going to be compared would need to be identified. The 'cases' in this example would be particular blocks of funding. Further along the implementation process the cases could be partners who are receiving funding, or activities that those partners are implementing, or communities those activities are directed towards. Nevertheless, at any point along this chain there is still a challenge, which is how to select cases for comparison. For example, if we are looking at a particular budget document which distributes funding into multiple purpose categories, we will be faced with the question of which of these categories to compare.

One way forward is to let the interviewed person decide, especially if they have responsibilities in this area. Using hierarchical card sorting (HCS), the interviewer starts with a request, which is phrased like this: 'What is the most significant difference between all these budget categories in terms of how they will achieve the objectives of the climate finance strategy? Please sort the budget categories into two piles according to this difference and then explain it to me'. Having identified piles of types of cases that can be compared, the respondent can then be asked for details about their expectations of the cases in one pile versus the other (See FN1). The same question can then be reiterated by focusing on each of those two piles in turn and getting the respondent to break them into two smaller sub-piles. When their answers are followed by explanations, this will help differentiate expectations in further detail.

Figure 2

Figure 2 shows the results of such an exercise, where the respondents were NGO staff responsible for the development and management of a portfolio of projects. They were asked to sort the projects into two piles according to "What they saw as the most significant difference between the projects, of a kind that would make a difference to what they could achieve". Their choices generated the tree structure. They were then asked to make a series of binary choices at each branching point, identifying which of the two types of projects described there "they expected to be most successful, in terms of the extent to which they will contribute to the achievement of the overall objectives of the portfolio". Their choices are shown by the red links. In this diagram their responses have been sorted such that the preferred red option is always shown above the non-preferred option. The aggregate result is a ranked set of 8 types of projects, with the highest rank (1) at the top. Each of these types is not an isolated category of its own, but part of a configuration that can be read along each branch, from left to right.

Here are some of the type descriptions and the reasons why one versus the other was selected as most likely to contribute to the portfolio objectives. Further discussion would be needed to establish how the presence/absence of these characteristics could be identified on the ground.
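To make the step from binary choices to a ranking more concrete, here is a minimal Python sketch of how a sorted tree can be read off as a ranked list of configurations. The `Pile` class, the `ranked_configurations` function and the nesting of the example tree are all my own illustrative assumptions; they are not taken from Figure 2 or from any existing HCS tool. The only idea borrowed from the text is that the preferred option at each branching point is placed first.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Pile:
    """One pile of cases produced by a binary sort (hypothetical structure)."""
    label: str
    preferred: Optional["Pile"] = None  # sub-pile expected to be more successful
    other: Optional["Pile"] = None      # sub-pile expected to be less successful


def ranked_configurations(pile: Pile, path: Tuple[str, ...] = ()) -> List[str]:
    """List leaf configurations from highest to lowest expected success."""
    path = path + (pile.label,)
    children = [child for child in (pile.preferred, pile.other) if child is not None]
    if not children:
        # A leaf: the configuration is the whole branch, read from the root
        return [" > ".join(path)]
    ranked: List[str] = []
    for child in children:  # the preferred branch is always visited first
        ranked.extend(ranked_configurations(child, path))
    return ranked


# Assumed nesting, for illustration only: where these types sit in the
# real tree is not stated in the text.
tree = Pile(
    "All projects",
    preferred=Pile("Wider focus",
                   preferred=Pile("Locally driven"),
                   other=Pile("UK driven")),
    other=Pile("Local focus",
               preferred=Pile("Locally driven"),
               other=Pile("UK driven")),
)

for rank, configuration in enumerate(ranked_configurations(tree), start=1):
    print(rank, configuration)
```

With three levels of binary sorting instead of two, the same function would return eight ranked configurations.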

Wider focus

Aim to influence wider policy and environment, and have more sustainable and wider impact beyond children and their families.

Likely to be more successful: Because it will have a wider reach and be more sustainable

Local focus

More hands on work with children on a day to day basis. Impact may be sustained but it will be limited to children and their families.

Likely to be less successful:

Locally driven

Partner and the projects are locally rooted, driven by local needs and priorities. They are more likely to “get it right”. They can’t walk away when Comic Relief funding ends. More likely to be sustainable.

Likely to be more successful: More embedded in the context, will outlast the project, be more responsive.

UK driven

UK driven projects, almost sub-contracting. They have a set end-point.

Likely to be less successful:

There is a larger question here of course, which also relates to sampling. Who are you going to interview in this way? The suggestion above was to 'follow the money'. In other words, to follow lines of responsibility and interview people about the domains of activity they are responsible for, using HCS as a means of structuring the discussion. There is a strategy choice here between what are known as breadth-first and depth-first search strategies. From a given point in a flow of funds (and of responsibilities) there can be distributions going in different directions, each of which could be explored. Following all of these is a form of breadth-first search. Alternatively, the focus could be on just one of those distributions, following the subsequent distribution of funding and responsibility further down one (or a few) lines. This is a form of depth-first search. Which of those search strategies to pursue is probably a matter to be decided by the evaluation client. But the choice may also need to be adaptive, informed by what the evaluation team found in prior interviews.


Courtesy Jacky Lieu: Comparison of Breadth-First Search and Depth-First Search: Understanding Their Methods and Uses 
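For readers who prefer code to diagrams, the two search strategies can be sketched in Python as different orders of visiting the same flow of funds. The `FLOWS` dictionary and every name in it are invented for illustration. In an evaluation the "visit" would be an interview, and a real search would probably stop early rather than cover every node.

```python
from collections import deque

# Hypothetical flow of funds: each key passes funding on to the listed recipients.
FLOWS = {
    "Central fund": ["Ministry A", "Ministry B"],
    "Ministry A": ["Partner A1", "Partner A2"],
    "Ministry B": ["Partner B1"],
    "Partner A1": ["Community X"],
    "Partner A2": [],
    "Partner B1": ["Community Y"],
    "Community X": [],
    "Community Y": [],
}


def breadth_first(flows, start):
    """Visit everyone at one level of distribution before going a level deeper."""
    order, queue, seen = [], deque([start]), {start}
    while queue:
        node = queue.popleft()
        order.append(node)
        for recipient in flows.get(node, []):
            if recipient not in seen:
                seen.add(recipient)
                queue.append(recipient)
    return order


def depth_first(flows, start):
    """Follow one line of distribution to its end before returning to the others."""
    order, stack, seen = [], [start], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Reversed so that the first-listed recipient is followed first
        stack.extend(reversed(flows.get(node, [])))
    return order


print(breadth_first(FLOWS, "Central fund"))
print(depth_first(FLOWS, "Central fund"))
```

The breadth-first order reaches both ministries before any partner, while the depth-first order reaches Community X before it has looked at Ministry B at all.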

But what about aggregation?


If you followed my suggested approach, the closer you got to the people whose lives were of final/main concern, the smaller the segments of all the funding you would be looking at. The cases within these segments would be more comparable than when looked at as part of a larger group, allowing more customised, context-specific assessments of expected and actual impact. But how would you / the evaluation team then be able to make any overall statement about the strategy as a whole?

The way forward is to think of performance measurement in slightly different terms than just using a simple indicator-based measure. Imagine a scatter plot, with one dimension X describing relative, i.e. ranked, expectations of achievement and the other dimension Y describing ranked actual/observed/assessed achievements. The entities in the scatter plot are the groups of cases in the smallest available sub-categories that were developed. Their rank position, relative to each other, is evident when all the binary assessments of expected performance are generated through the process described above. See here for more on how this is done. The scatter plot can in turn be summarised in at least two different ways: using a measure of rank correlation (of how achievement relates to expectations) and using Classification Accuracy, if and when a minimum rank position of achievement is identified. Equally importantly, qualitative descriptions can be given of cases that exemplify performance that most meets expectations, and the reverse, along with positive and negative deviants (outliers).
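As a rough illustration of the two summary measures, here is a small Python sketch. The rank data are invented. I have assumed that neither ranking contains ties (so the simple Spearman formula applies) and that the same cutoff rank is applied to both expected and observed achievement. Both are simplifying assumptions of mine, not part of the method described above.

```python
def spearman_rho(expected_ranks, observed_ranks):
    """Spearman rank correlation, valid when neither ranking contains ties."""
    n = len(expected_ranks)
    d_squared = sum((e - o) ** 2 for e, o in zip(expected_ranks, observed_ranks))
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))


def classification_accuracy(expected_ranks, observed_ranks, cutoff):
    """Share of groups where expected and observed ranks fall on the same side of the cutoff."""
    agreements = sum(
        (e <= cutoff) == (o <= cutoff)
        for e, o in zip(expected_ranks, observed_ranks)
    )
    return agreements / len(expected_ranks)


# Invented data: eight types of projects, rank 1 = highest
expected = [1, 2, 3, 4, 5, 6, 7, 8]
observed = [2, 1, 3, 5, 4, 8, 6, 7]

print(round(spearman_rho(expected, observed), 3))            # 0.881
print(classification_accuracy(expected, observed, cutoff=4))  # 0.75
```

Here the fourth and fifth types swap sides of the cutoff, so six of the eight types are classified as expected.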

What we could end up with is a tree structure documenting multiple routes to both high and low performance, implemented in varyingly different  contexts (describable at different levels of scale).

Other scatter plot designs are more relevant to assessments of strategies. The ranking generated by Figure 2 was plotted against the age of the projects and their grant size, which might be expected to be influenced by the contents of a funding strategy. Neither of these two measures showed any relationship to perceived strategic priorities!

To be continued....


PS1: When asking about expected effects of one type of allocation versus another, it may make sense to encourage a focus on more immediately expected effects first, and then later ones. They may be more likely, more easily articulated and more evaluable.

PS2: Hughes-McLure, S. (2022). Follow the money. Environment and Planning A: Economy and Space, 54(7), 1299–1322. https://doi.org/10.1177/0308518X221103267










Friday, December 22, 2023

Using the Confusion Matrix as a general-purpose analytic framework


Background

This posting has been prompted by work I have done this year for the World Food Programme (WFP) as a member of their Evaluation Methods Advisory Panel (EMAP). One task was to carry out a review, along with colleague Mike Reynolds, of the methods used in the 2023 Country Strategic Plans evaluations. You will be able to read about these, and related work, in a forthcoming report on the panel's work, which I will link to here when it becomes available.

One of the many findings of potential interest was: "there were relatively very few references to how data would be analysed, especially compared to the detailed description of data collection methods". In my own experience, this problem is widespread, found well beyond WFP. In the same report I proposed the use of what is known as the Confusion Matrix, as a general purpose analytic framework. Not as the only framework, but as one that could be used alongside more specific frameworks associated with particular intervention theories such as those derived from the social sciences.

What is a Confusion Matrix?

A Confusion Matrix is a type of truth table,  i.e., a table representing all the logically possible combinations of two variables or characteristics. In an evaluation context these two characteristics could be the presence and absence of an intervention, and the presence and absence of an outcome.  An intervention represents a specific theory (aka model), which includes a prediction that a specific type of outcome will occur if the intervention is implemented.  In the 2 x 2 version you can see above, there are four types of possibilities:

  1. The intervention is present and the outcome is present. Cases like this are known as True Positives.
  2. The intervention is present but the outcome is absent. Cases like this are known as False Positives.
  3. The intervention is absent and the outcome is absent. Cases like this are known as True Negatives.
  4. The intervention is absent but the outcome is present. Cases like this are known as False Negatives.
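
The four possibilities listed above amount to a very small piece of logic, sketched here in Python. The function name `classify` and the example cases are my own inventions for illustration.

```python
from collections import Counter


def classify(intervention_present: bool, outcome_present: bool) -> str:
    """Place one case in a cell of the 2 x 2 Confusion Matrix."""
    if intervention_present:
        return "TP" if outcome_present else "FP"
    return "FN" if outcome_present else "TN"


# Invented cases, each recorded as (intervention present?, outcome present?)
cases = [
    (True, True), (True, True), (True, False),
    (False, False), (False, True), (False, False),
]

counts = Counter(classify(intervention, outcome) for intervention, outcome in cases)
print(counts["TP"], counts["FP"], counts["FN"], counts["TN"])  # 2 1 1 2
```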

Common uses of the Confusion Matrix

The use of Confusion Matrices is most commonly associated with the field of machine learning and predictive analytics, but it has much wider application. Its uses extend to the fields of medical diagnostic testing, predictive maintenance, fraud detection, customer churn prediction, remote sensing and geospatial analysis, cyber security, computer vision, and natural language processing. In these applications the Confusion Matrix is populated by the number of cases falling into each of the four categories. These numbers are in turn the basis of a wide range of performance measures, which are described in detail in the Wikipedia article on the Confusion Matrix. A selection of these is described here, in this blog on the use of the EvalC3 Excel app.
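A few of the better-known performance measures can be computed from the four counts as follows. This is a minimal Python sketch using standard textbook definitions and invented counts. It is not the EvalC3 implementation, and it assumes that none of the denominators is zero.

```python
def performance_measures(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard measures derived from the four cells (assumes non-zero denominators)."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),  # share of all cases correctly predicted
        "precision": tp / (tp + fp),                  # share of predicted positives that were right
        "recall": tp / (tp + fn),                     # share of actual positives that were found
        "specificity": tn / (tn + fp),                # share of actual negatives that were found
        "f1": 2 * tp / (2 * tp + fp + fn),            # balance of precision and recall
    }


# Invented counts for illustration
measures = performance_measures(tp=30, fp=10, fn=5, tn=55)
for name, value in measures.items():
    print(f"{name}: {value:.2f}")
```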

The claim

Although the use of a Confusion Matrix is  commonly associated with quantitative analyses of performance, such as the accuracy of predictive models, it can also be a useful framework for thinking in more qualitative terms. This is a less well known and publicised use, which I elaborate on below. It is the inclusion of this wider potential use that is the basis of my claim that the Confusion Matrix can be seen as a general-purpose analytic framework.

The supporting arguments

The claim has at least four main arguments:

  1. The structure of the Confusion Matrix serves as a useful reminder and checklist, that at least four different kinds of cases should be sought after, when constructing and/or evaluating a claim that X (e.g. an intervention) leads to Y (e.g. an outcome).
    1. True Positive cases, which we will usually start looking for first of all. At worst, this is all we look for.
    2. False Positive cases, which we are often advised to look for, but often don't invest much time in finding. Here we can learn what does not work, and why.
    3. False Negative cases, which we probably look for even less often. Here we can learn what else works, and perhaps why.
    4. True Negative cases, because sometimes there are asymmetric causes at play, i.e. not just the absence of the expected causes.
  2. The contents of the Confusion Matrix help us to identify interventions that are necessary, sufficient or both. This can be practically useful knowledge.
    1. If there are no FP cases, this suggests an intervention is sufficient for the outcome to occur. The more cases we investigate without finding an FP, the stronger this suggestion is. But if even one FP is found, that tells us the intervention is not sufficient. Single cases can be informative. Large numbers of cases are not always needed.
    2. If there are no FN cases, this suggests an intervention is necessary for the outcome to occur. The more cases we investigate without finding an FN, the stronger this suggestion is. But if even one FN is found, that tells us the intervention is not necessary.
    3. If there are no FP or FN cases, this suggests an intervention is sufficient and necessary for the outcome to occur. The more cases we investigate without finding an FP or FN, the stronger this suggestion is. But if even one FP or FN is found, that tells us that the intervention is not sufficient or not necessary, respectively.
  3. The contents of the Confusion Matrix help us identify the type and scale of errors and their acceptability. FP and FN cases are two different types of error that have different consequences in different contexts. A brain surgeon will be looking for an intervention that has a very low FP rate, because errors in brain surgery can be fatal, and so cannot be recovered from. On the other hand, a stock market investor is likely to be looking for a more general purpose model, with few FNs. However, it only has to be right 55% of the time to still make them money. So a high rate of FPs may not be a big concern. They can recover their losses through further trading. In the field of humanitarian assistance the corresponding concerns are with coverage (reaching all those in need, i.e. minimising False Negatives) and leakage (minimising inclusion of those not in need, i.e. False Positives). There are Confusion Matrix based performance measures for both kinds of error and for the degree to which both kinds of error are balanced (See the Wikipedia entry).
  4. The contents of the Confusion Matrix can help us identify useful case studies for comparison purposes. These can include:
    1. Cases which exemplify the True Positive results, where the model (e.g an intervention) correctly predicted the presence of the outcome. Look within these cases to find any likely causal mechanisms connecting the intervention and outcome. Two sub-types can be useful to compare:
      1. Modal cases, which represent the most common characteristics seen in this group, taking all comparable attributes into account, not just those within the prediction model. 
      2. Outlier cases, which represent those which were most dissimilar to all other cases in this group, apart from having the same prediction model characteristics
    2. Cases which exemplify the False Positives, where the model incorrectly predicted the presence of the outcome. There are at least two possible explanations that can be explored:
      1. In the False Positive cases, there are one or more other factors that all the cases have in common, which are blocking the model configuration from working, i.e. delivering the outcome.
      2. In the True Positive cases, there are one or more other factors that all the cases have in common, which are enabling the model configuration to work, i.e. to deliver the outcome, but which are absent in the False Positive cases.
        1. Note: For these comparisons, the TP and FP cases should be maximally similar in their case attributes. I think this is called MSDO (most similar, different outcome) based case selection.
    3. Cases which exemplify the False Negatives, where the outcome occurred despite the absence of the attributes of the model. There are three possibilities of interest here:
      1. There may be some False Negative cases that have all but one of the attributes found in the prediction model. These cases would be worth examining, in order to understand why the absence of a particular attribute that is part of the predictive model does not prevent the outcome from occurring. There may be some counter-balancing enabling factor at work, enabling the outcome.
      2. It is possible that some cases have been classed as FNs because they were missing specific data on crucial attributes that would otherwise have classed them as TPs.
      3. Other cases may represent genuine alternatives, which need within-case investigation to identify the attributes that appear to make them successful 
    4. Cases which exemplify the True Negatives, where the absence of the attributes of the model is associated with the absence of the outcome.
      1. Normally these are seen as not being of much interest. But there may be cases here with all but one of the intervention attributes. If found, then the missing attribute may be viewed as:
        1. A necessary attribute, without which the outcome cannot occur.
        2. An INUS attribute, i.e. an attribute that is Insufficient but Necessary in a configuration that is Unnecessary but Sufficient for the outcome (See Befani, 2016). It would then be worth investigating how these critical attributes have their effects by doing a detailed within-case analysis of the cases with the critical missing attribute.
      2. Cases may become TNs for two reasons. The first, and most expected, is that the causes of positive outcomes are absent. The second, which is worth investigating, is that there are additional and different causes at work which are causing the outcome to be absent. The first of these is described as causal symmetry, the second of these is described as causal asymmetry. Because of the second possibility it is worthwhile paying close attention to TN cases to identify the extent to which symmetrical causes or asymmetrical causes are at work. The findings could have significant implications for any intervention that is being designed. Here a useful comparison would be between maximally similar TP and TN cases.
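
Two of the arguments above lend themselves to a short Python sketch: reading necessity and sufficiency off the cell counts, and picking maximally similar cases from two cells for comparison. The function names, attribute names and cases are all invented for illustration. I have also added one assumption of my own, that at least one TP must exist before either claim is suggested, so that an empty matrix does not count as evidence.

```python
from itertools import product


def necessity_sufficiency(tp: int, fp: int, fn: int) -> dict:
    """What the counts suggest (not prove) about the intervention."""
    return {
        "suggests_sufficient": tp > 0 and fp == 0,  # no case had the intervention without the outcome
        "suggests_necessary": tp > 0 and fn == 0,   # no case had the outcome without the intervention
    }


def most_similar_pair(group_a: dict, group_b: dict):
    """Return (case_a, case_b, n_differences) for the pair differing on the fewest attributes."""
    best = None
    for (name_a, attrs_a), (name_b, attrs_b) in product(group_a.items(), group_b.items()):
        differences = sum(attrs_a[key] != attrs_b[key] for key in attrs_a)
        if best is None or differences < best[2]:
            best = (name_a, name_b, differences)
    return best


print(necessity_sufficiency(tp=12, fp=0, fn=3))
# {'suggests_sufficient': True, 'suggests_necessary': False}

# Hypothetical binary attributes outside the prediction model
tp_cases = {
    "Case 1": {"urban": 1, "local_partner": 1, "multi_year": 0},
    "Case 2": {"urban": 0, "local_partner": 1, "multi_year": 1},
}
fp_cases = {
    "Case 3": {"urban": 1, "local_partner": 0, "multi_year": 1},
    "Case 4": {"urban": 0, "local_partner": 1, "multi_year": 0},
}

print(most_similar_pair(tp_cases, fp_cases))  # ('Case 1', 'Case 4', 1)
```

The same pairing function could be applied to TP and TN cases. The single attribute on which the selected pair differs is then a candidate for closer within-case investigation.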

Resources

Some of you may know that I have built the Confusion Matrix into the design of EvalC3, an Excel app for cross-case analysis that combines measurement concepts from the disparate fields of machine learning and QCA (Qualitative Comparative Analysis). With fair winds, this should become available as a free to use web app in early 2024, courtesy of a team at Sheffield Hallam University. There you will be able to explore and exploit the uses of the Confusion Matrix for both quantitative and qualitative analyses.