Tuesday, December 02, 2025

Objectives as data: The potential uses of updatable outcome targets

 The context

A specialist agency is funding more than 40 partner organisations, each working in a different part of the country but with the same overall objective of increasing people's levels of physical activity (because of the positive health consequences). These partners are often working with quite different communities, and all have a substantial degree of independence in how they work towards the overall objective.

Some agency representatives have asked about the nature of the target that the program as a whole is working towards, and have emphasised how essential it is that there be clarity in this area. By target they mean an actual number: specifically, the percentage of people self-reporting that they achieve a certain level of physical activity each week, as identified by an annual survey that is already underway and will be repeated in future.

Possible responses

In principle it would be possible to set a target for the proportion of the population reporting being physically active, such as 75%. But it would be very hard to identify an optimal target percentage, given the diversity of partner localities, and of the communities within these.

Relative targets may be more appropriate, such as a 25% increase in reported activity levels, especially if partners were each asked to identify what they think are achievable percentage increases in their own localities within the next survey period. This estimation would take place in a context where these partners already have experience working in those locations, identifying some of the things that work and don't work. My hypothesis, yet to be tested, is that these partners will make quite conservative estimates. If so, this might come as some surprise to the donor and perhaps lead to some revision of their own expectations.

Taking this idea further, partners could be periodically asked if they wanted to adjust their expectations of the change that could be achieved upwards or downwards, in the time remaining in the intervention's lifespan, subject to being able to explain the rationale for doing so. My second hypothesis is that this number, and the accompanying commentary, could be a valid and useful form of progress reporting in its own right.

Making sense of the responses

An assessment of overall progress over a longer time scale would need to consider both the scale of ambitions and the extent of their achievement. These can't be combined into one number based on a simple formula, because any such number could be achieved by adjustment of expectations and/or performance. However, it could be usefully represented by a scatterplot, with data points reflecting each of the partners, of the kind shown below.

The location of partners in different quadrants suggests different implications about how the different partners should be managed (a minimal sketch of this classification follows the list below):

  • High ambition/low achievement: May need additional support, capacity building, or problem-solving
  • Low ambition/low achievement: May need fundamental partnership restructuring or exit considerations
  • High ambition/high achievement: Candidates for scaling, sharing learning, reduced support intensity
  • Modest ambition/high achievement: Opportunities to stretch ambitions
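
As a minimal sketch of this classification (all partner names, cut-off values and percentage-point figures below are invented for illustration), expected and achieved changes could be sorted into the four quadrants like this:

```python
# Minimal sketch: classify partners into ambition/achievement quadrants.
# Partner names, cut-offs and figures are invented for illustration only.

partners = {
    # partner: (expected % point increase, achieved % point increase)
    "Partner A": (5.0, 1.0),
    "Partner B": (1.5, 1.0),
    "Partner C": (6.0, 5.5),
    "Partner D": (2.0, 4.0),
}

AMBITION_CUTOFF = 3.0      # illustrative threshold for high vs modest ambition
ACHIEVEMENT_CUTOFF = 3.0   # illustrative threshold for high vs low achievement

def quadrant(expected, achieved):
    ambition = "High ambition" if expected >= AMBITION_CUTOFF else "Modest ambition"
    achievement = "high achievement" if achieved >= ACHIEVEMENT_CUTOFF else "low achievement"
    return f"{ambition}/{achievement}"

for name, (expected, achieved) in partners.items():
    print(name, "->", quadrant(expected, achieved))
```

In practice the cut-offs might be medians of the observed values rather than fixed numbers, so that the quadrants always partition the partners into comparable groups.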

This framework also provides plenty of potentially useful analytic questions:

  • Are ambitions increasing or decreasing?
  • Is the gap between expected and actual narrowing or widening?
  • For a given level of actual achievement, did differences in expectations have any role or consequences?
  • For a given level of expected change, what might explain the differences in the partners' actual achievements?
  • How do individual partners' positions within this matrix change over time? Are there distinct types of trajectories, and how can these differences be explained?

In summary

A single numerical value based on the data in this matrix will provide a meaningless simplification. 

In contrast, a scatterplot visualisation can generate multiple potentially useful perspectives. 

It is more useful to see targets as necessarily malleable responses to changing conditions, than as unarguable reference points.

Postscript

There is a type of reinforcement learning algorithm known as Temporal Difference (TD) Learning that embodies a very similar process. It is described as "a model-free reinforcement learning method that learns by updating its predictions based on the difference between future predictions and the current estimate". Model-free means it has no built-in model of the world it is working in.

When implemented as a human process it is vulnerable to gaming, because the agents (humans) are aware of the system's mechanics, unlike the neural networks or simplified agents typically used in computational TD learning. But one adaptation, suggested by Gemini AI, is to "reward partners not just for the +/- gap, but for the accuracy of their final predictions over multiple cycles". Relatively higher accuracy, over multiple time periods, might be indicative of potentially generalisable/replicable delivery capacity, usable beyond the current context.
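
As a minimal sketch of the analogy (the learning rate, the figures and the partner data below are all invented, and this is not a description of the agency's actual process), a partner's running expectation can be nudged towards each new estimate, and the accuracy of their predictions tracked across cycles:

```python
# Minimal sketch of a temporal-difference-style update of an outcome expectation.
# All figures are invented; this illustrates the analogy, not an actual procedure.

def td_update(current_estimate, new_estimate, learning_rate=0.5):
    """Move the current expectation part of the way towards the latest estimate."""
    return current_estimate + learning_rate * (new_estimate - current_estimate)

expectation = 25.0                             # initial expected increase (%)
partner_revisions = [20.0, 15.0, 12.0, 13.0]   # revised estimates in later periods
for new_estimate in partner_revisions:
    expectation = td_update(expectation, new_estimate)
    print(f"updated expectation: {expectation:.1f}%")

# The adaptation suggested above: score partners on how close their final
# predictions were to what was actually observed, over multiple cycles.
predicted = [10.0, 12.0, 11.0]
observed = [8.0, 11.5, 11.0]
mean_abs_error = sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)
print(f"mean absolute prediction error: {mean_abs_error:.2f} percentage points")
```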

Tuesday, June 18, 2024

On two types of Theories of Change: Temporal and atemporal, and how they might be bridged



There are two quite different ways of representing theories of change – of the kind that might be useful when planning and monitoring development programmes of one kind or another.

The first kind is seen in representational devices such as the Logical Framework, Logic Models and boxes-and-arrows type diagrams. These differentiate events according to their location at different points over time, taking place between the initial provision of funding, its allocation and use, and then its subsequent effects and final impacts. These are temporal models.

The second kind, seen much less often, appears in the analyses generated by Qualitative Comparative Analysis (QCA) and simple machine learning methods known as Decision Trees or Classifiers. Here the theory is in the form of multiple configurations of different attributes that are associated with the desired outcome, and with its absence. Those attributes may be of the intervention and/or its context. The defining feature of this approach is the focus on cases and differences between cases, rather than different points or periods in time. These cases are often geographical entities, or groups or persons, which have some persistence over time. They are effectively atemporal models.
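
As a minimal sketch of this atemporal, case-based style of analysis (the case attributes and outcomes below are invented, and scikit-learn is assumed to be available), a decision tree can be grown from a small truth-table-like dataset of cases:

```python
# Minimal sketch of a configurational, case-based analysis using a decision tree.
# Case attributes and outcomes are invented; requires scikit-learn.
from sklearn.tree import DecisionTreeClassifier, export_text

attributes = ["local_partner", "cash_component", "female_led"]
cases = [          # each row is one case, 1 = attribute present, 0 = absent
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
]
outcome = [1, 1, 0, 0, 1, 0]   # 1 = desired outcome present, 0 = absent

tree = DecisionTreeClassifier(max_depth=2).fit(cases, outcome)
print(export_text(tree, feature_names=attributes))  # configurations as branches
```

Each branch of the printed tree is, in effect, one configuration of attributes associated with the presence or absence of the outcome.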

Each of these approaches has its own merits. A theory of change which describes the sequence of expected events over time, and how they relate to each other, is useful for planning, monitoring and evaluation purposes. But it runs the risk of assuming a homogeneity of effects across all locations where it is implemented. On the other hand, a QCA-type configurational approach helps us identify diversity in contexts and implementations, and its consequences. But it may not have any immediate management consequences about what needs to be done when.

One of my current interests is exploring the possibility of combining these two approaches, such that we have theories of change that differentiate those events over time, while also differentiating cases across space where those events may or may not be happening. 

One paper which I've just been told about explores these possibilities, as seen from a QCA starting point: Pagliarin, S., & Gerrits, L. (2020). Trajectory-based Qualitative Comparative Analysis: Accounting for case-based time dynamics. Methodological Innovations, 13. In this paper the authors introduce the innovative idea of cases as different periods of time in the same location, where each of those subsequent periods of time may have various attributes of interest present or absent, along with an outcome of interest being present or absent. This approach seems to have potential for enabling a systematic approach to within-case investigations, complementing what might have been prior cross-case investigations. There is the potential to identify specific attributes, or combinations of these, which are necessary or sufficient for changes to take place within a given case.

Somewhat tangentially...

The same paper reminded me of some evaluation fieldwork I did in Burkina Faso in 1992, where I was interviewing farmers about the history of their development of a small market garden using irrigation water obtained from a nearby lake. Looking back at the history of the garden, which I think was about six years old at the time, I asked them to identify the most significant change that had taken place during that period. They identified the installation of the water pump in year 198?, and pointed out how it expanded the scale of their cultivation thereafter. I can remember also asking, but with less recall of what they then said, follow-up questions about the most significant change that had taken place in each smaller time period either side of that event, and then its consequences. I was in effect asking them to carve up the history of the garden into segments, and sub-segments, of time defined not by calendar but by key events, each of which had consequences. These were in effect temporal "cases". Each of these had a configuration of multiple attributes, i.e. attributes of the nested set of time periods that it belonged to. Associated with each of these were differing judgements about the productivity of the market garden. But with our team's time in short supply, I never got the opportunity to gather a full data set, so to speak.

Another of my current interests, prompted by the above conjectures, is the possible use of a specific form of Hierarchical Card Sorting (HCS) as a means of introducing a temporal element into case-based configurational analysis. The HCS process generates a tree structure of nested binary distinctions between cases. It is conceivable that different broad criteria could be introduced for the type of differences being identified at each level of the branching structure. For example, at the top level the "most significant difference" being sought could be specified as being "in terms of funding received", then at the next level, "in terms of outputs generated", and so on (Criteria 1, 2, 3 etc in Figure 1 below).

Figure 1 below







Wednesday, April 24, 2024

Developing and using a Configurational Theory of Change within an evaluation

 

Figure 1

Ho-hum, yet another evaluation brand being promoted in an already crowded marketplace. FFS...

Yes, I think this reaction is understandable, but I think there is something captured under this title (Developing and using a Configurational Theory of Change...) which has potential value. I will try to explain...

Many evaluators make use of theories of change, as part of a theory-based approach to evaluation. Many theories of change are described in some type of diagrammatic form. And a typical feature of those diagrams is their convergent nature. That is, they start off with a range of different types of inputs and activities which follow various causal pathways towards a limited number of final outcomes.

This image is almost the complete opposite of what happens in actual practice on the ground. Financial inputs come from a limited number of sources, these become available to a small range of partners who carry out their own range of activities, in a variety of different locations, each with their own populations, including those intended and not intended to be affected. This description is of course a simplification, but it applies to many development aid programme designs. The point I'm making here is that this in-reality process is not convergent, it is divergent! It seems that the diagrammatic theories of change I have described are a type of Procrustean bed.

This blog posting has been most immediately prompted by a report I have just reviewed on potential evaluation strategies for a large national-level climate finance strategy (CFS). The theory of change describes multiple causal pathways connecting the initial provision of government finance through to four expected types of impacts. Within two of these causal pathways alone the number of projects being funded is in the hundreds. The report struggled with the issue of how to measure the expected impacts given the scale and likely diversity of events on the ground, and the corresponding challenge of how to sample those projects. Part of my diagnosis of the problem here was the evaluation team's measurement-led approach, and the weakness of the conceptual framework, i.e. the incapacity of the theory of change to capture the diversity of what was taking place.

Describing the alternative to my client is now my challenge. I think the alternative has two parts. Firstly, one should start at the beginning, where the money becomes available, and then follow the money (and the people responsible) as it gets distributed according to its intended purposes. If things are not happening as expected early on in this process then this affects expectations of what might and might not be observable later on in the form of 'outcomes' or 'impacts'. Put crudely, there is no point trying to observe the impact of something that has not yet been delivered. And in the case of strategies like the CFS, a large part of success can simply be getting the money to where it should be spent.

Secondly, as money is distributed from a central fund, decisions are going to be made about how it should be parcelled out in different amounts, for different purposes, through different institutions. Each time that happens, the decisions that have been made about how to do this are hopefully not random. Evaluating how those decisions were made may not necessarily be all that useful, because often there will be opaque micro, meso and macro political processes involved. But the announced decisions may include some intentionally explicit expectations about the official purposes of different allocations. Interviews with those responsible for those allocations might also elicit more informal and more current expectations about what might be the short and longer term effects of some of these allocations, when compared to others. The point I am emphasising here is that sometimes we can come to evaluative judgements not through the use of any overriding predetermined criteria, but by using a more inductive process, where we compare one option to another. This is an excuse for me to quote Marx (G):

Friend says to Marx – 'Life is difficult'.

Marx replies to friend – 'Compared to what?'

This type of inductive comparative evaluation doesn't have to be completely free-form. It is conceivable, for example, that we could look at two tranches of government climate finance funding and ask (those with proximate responsibilities for that funding) what difference there might be between those blocks of funding in terms of how each might meet one or more of the OECD criteria (these range in their concerns from the more immediate issues of coherence and efficiency to later concerns with effectiveness and impact). Respondents' answers, in the form of expectations, can be seen as mini theories a.k.a. hypotheses that might then be testable through the gathering of relevant data.

Before these questions can be posed, the cases that are going to be compared would need to be identified. The 'cases' in this example would be particular blocks of funding. Further along the implementation process the cases could be partners who are receiving funding, or activities that those partners are implementing, or communities those activities are directed towards. Nevertheless, at any point along this chain there is still a challenge: how to select cases for comparison. For example, if we are looking at a particular budget document which distributes funding into multiple purpose categories, we will be faced with the question of which of these categories to compare.

One way forward is to let the interviewed person decide, especially if they have responsibilities in this area. Using hierarchical card sorting (HCS), the interviewer starts with a request, phrased like this: "What is the most significant difference between all these budget categories in terms of how they will achieve the objectives of the Climate Finance Strategy? Please sort the budget categories into two piles according to this difference and then explain it to me". Having identified piles of types of cases that can be compared, the respondent can then be asked for details about their expectations of the cases in one pile versus the other (see FN1). The same question can then be reiterated by focusing on each of those two piles in turn and getting the respondent to break them into two smaller sub-piles. When their answers are followed by explanations this will help differentiate expectations in further detail.

Figure 2 (click on to enlarge)

Figure 2 shows the results of such an exercise, where the respondents were NGO staff responsible for the development and management of a portfolio of projects. They were asked to sort the projects into two piles according to "what they saw as the most significant difference between the projects, of a kind that would make a difference to what they could achieve". Their choices generated the tree structure. They were then asked to make a series of binary choices at each branching point, identifying which of the two types of projects described there "they expected to be most successful, in terms of the extent to which they will contribute to the achievement of the overall objectives of the portfolio". Their choices are shown by the red links. In this diagram their responses have been sorted such that the preferred red option is always shown above the non-preferred option. The aggregate result is a ranked set of 8 types of projects, with the highest rank (1) at the top. Each of these types is not an isolated category of its own, but part of a configuration that can be read along each branch, from left to right.
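
A minimal sketch of this aggregation step (the tree and labels below are invented placeholders, not the Figure 2 data): if each branching point records which of its two sub-branches was expected to be more successful, a traversal that always follows the preferred branch first reproduces the ranking.

```python
# Minimal sketch: deriving a ranking of project types from binary
# "expected to be more successful" choices at each branch of an HCS tree.
# The tree and its labels are invented placeholders.

# Each internal node is a (preferred, non_preferred) pair; leaves are type labels.
hcs_tree = (
    (("Type 1", "Type 2"), ("Type 3", "Type 4")),   # preferred half of the sort
    (("Type 5", "Type 6"), ("Type 7", "Type 8")),   # non-preferred half
)

def ranked_types(node):
    """Visit preferred branches first, yielding leaves in rank order."""
    if isinstance(node, str):
        yield node
        return
    preferred, non_preferred = node
    yield from ranked_types(preferred)
    yield from ranked_types(non_preferred)

for rank, project_type in enumerate(ranked_types(hcs_tree), start=1):
    print(rank, project_type)
```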

Here are some of the type descriptions, and the reasons why one versus the other was selected as most likely to contribute to the portfolio objectives. Further discussion would be needed to establish how the presence/absence of these characteristics could be identified on the ground.

Wider focus

Aim to influence wider policy and environment, and have more sustainable and wider impact beyond children and their families.

Likely to be more successful: Because it will have a wider reach and be more sustainable

Local focus

More hands on work with children on a day to day basis. Impact may be sustained but it will be limited to children and their families.

Likely to be less successful:

Locally driven

Partner and the projects are locally rooted, driven by local needs and priorities. They are more likely to “get it right”. They can’t walk away when Comic Relief funding ends. More likely to be sustainable.

Likely to be more successful: More embedded in the context, will outlast the project, be more responsive.

UK driven

UK driven projects, almost sub-contracting. They have a set end-point.

Likely to be less successful:

There is a larger question here, of course, that also relates to sampling. Who are you going to interview in this way? The suggestion above was to 'follow the money'. In other words, to follow lines of responsibility and interview people about the domains of activity they are responsible for, using HCS as a means of structuring the discussion. There is a strategy choice here between what are known as breadth-first and depth-first search strategies. From a given point in a flow of funds (and of responsibilities) there can be distributions going in different directions, each of which could be explored. Following all of these is a form of breadth-first search. Alternatively the focus could be just on one of those distributions, following the subsequent allocation of funding and responsibility further down one (or a few) lines. This is a form of depth-first search. Which of those search strategies to pursue is probably a matter to be decided by the evaluation client. But it may also need to be adaptive, informed by what was found by the evaluation team in prior interviews.
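
A minimal sketch of the two search strategies (the funding flow below is an invented tree, not the CFS structure):

```python
# Minimal sketch: breadth-first vs depth-first exploration of a funding flow,
# here represented as an invented tree of distributions.
from collections import deque

funding_flow = {
    "Central fund": ["Ministry A", "Ministry B"],
    "Ministry A": ["Programme A1", "Programme A2"],
    "Ministry B": ["Programme B1"],
    "Programme A1": ["Partner A1.1", "Partner A1.2"],
    "Programme A2": [],
    "Programme B1": ["Partner B1.1"],
    "Partner A1.1": [], "Partner A1.2": [], "Partner B1.1": [],
}

def breadth_first(start):
    """Interview all recipients at one level before going any deeper."""
    queue, order = deque([start]), []
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(funding_flow[node])
    return order

def depth_first(start):
    """Follow one line of distribution as far as it goes before backtracking."""
    stack, order = [start], []
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(reversed(funding_flow[node]))
    return order

print("Breadth-first interview order:", breadth_first("Central fund"))
print("Depth-first interview order:", depth_first("Central fund"))
```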


Courtesy Jacky Lieu: Comparison of Breadth-First Search and Depth-First Search: Understanding Their Methods and Uses 

But what about aggregation?


If you followed my suggested approach, the closer you got to the people whose lives were of final/main concern, the smaller the segments of all the funding you would be looking at. These segments would be more comparable with each other than with the programme as a whole, and would allow more customised, context-specific assessments of expected and actual impact. But how would you, or the evaluation team, then be able to make any overall statement about the strategy as a whole?

The way forward is to think of performance measurement in slightly different terms than just using a simple indicator-based measure. Imagine a scatter plot, with one dimension (X) describing relative, i.e. ranked, expectations of achievement and the other dimension (Y) describing ranked actual/observed/assessed achievements. The entities in the scatter plot are the groups of cases in the smallest available sub-categories that were developed. Their rank position, relative to each other, becomes evident when all the binary assessments of expected performance are generated through the process described above. See here for more on how this is done. The scatter plot can in turn be summarised in at least two different ways: using a measure of rank correlation (how achievement relates to expectations) and using Classification Accuracy, if and when a minimum rank position of achievement is identified. Equally importantly, qualitative descriptions can be given of cases that exemplify performance that most meets expectations, and the reverse, along with positive and negative deviants (outliers).
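
As a minimal sketch of these two summary measures (the ranks and the cut-off below are invented, and scipy is assumed to be available):

```python
# Minimal sketch: summarising an expectations-vs-achievements scatter plot
# with a rank correlation and a classification accuracy. Ranks are invented.
from scipy.stats import spearmanr

expected_rank = [1, 2, 3, 4, 5, 6, 7, 8]   # ranked expectations of achievement
actual_rank   = [2, 1, 4, 3, 6, 5, 8, 7]   # ranked assessed achievement

rho, _ = spearmanr(expected_rank, actual_rank)
print(f"Spearman rank correlation: {rho:.2f}")

# Classification accuracy, once a minimum rank position counts as "achieved".
# Here, being ranked in the top 4 on a dimension is treated as a positive case.
threshold = 4
predicted_positive = [r <= threshold for r in expected_rank]
actual_positive = [r <= threshold for r in actual_rank]
matches = sum(p == a for p, a in zip(predicted_positive, actual_positive))
print(f"Classification accuracy: {matches / len(expected_rank):.0%}")
```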

What we could end up with is a tree structure documenting multiple routes to both high and low performance, implemented in a variety of different contexts (describable at different levels of scale).

Other scatter plot designs are more relevant to assessments of strategies. The ranking generated by Figure 2 was plotted against the age of the projects and their grant size, both of which might be expected to be influenced by the contents of a funding strategy. Neither of these two measures showed any relationship to perceived strategic priorities!

To be continued....


PS1: When asking about expected effects of one type of allocation versus another, it may make sense to encourage a focus on more immediately expected effects first, and then later ones. They may be more likely, more easily articulated and more evaluable.

PS2: Hughes-McLure, S. (2022). Follow the money. Environment and Planning A: Economy and Space, 54(7), 1299–1322. https://doi.org/10.1177/0308518X221103267










Friday, December 22, 2023

Using the Confusion Matrix as a general-purpose analytic framework


Background

This posting has been prompted by work I have done this year for the World Food Programme (WFP) as a member of their Evaluation Methods Advisory Panel (EMAP). One task was to carry out a review, along with colleague Mike Reynolds, of the methods used in the 2023 Country Strategic Plans evaluations. You will be able to read about these, and related work, in a forthcoming report on the panel's work, which I will link to here when it becomes available.

One of the many findings of potential interest was: "there were relatively very few references to how data would be analysed, especially compared to the detailed description of data collection methods". In my own experience, this problem is widespread, found well beyond WFP. In the same report I proposed the use of what is known as the Confusion Matrix, as a general purpose analytic framework. Not as the only framework, but as one that could be used alongside more specific frameworks associated with particular intervention theories such as those derived from the social sciences.

What is a Confusion Matrix?

A Confusion Matrix is a type of truth table, i.e. a table representing all the logically possible combinations of two variables or characteristics. In an evaluation context these two characteristics could be the presence and absence of an intervention, and the presence and absence of an outcome. An intervention represents a specific theory (aka model), which includes a prediction that a specific type of outcome will occur if the intervention is implemented. In the 2 x 2 version you can see above, there are four types of possibilities (a minimal sketch of how cases can be tallied into these four cells follows the list below):

  1. The intervention is present and the outcome is present. Cases like this are known as True Positives
  2.  The intervention is present but the outcome is absent. Cases like this are known as False Positives. 
  3. The intervention is absent and the outcome is absent. Cases like this are known as True Negatives
  4. The intervention is absent but the outcome is present. Cases like this are known as False Negatives. 
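
As a minimal sketch of how the four cells can be populated (the case records below are invented):

```python
# Minimal sketch: tallying cases into the four cells of a 2 x 2 Confusion Matrix.
# Each invented case records whether the intervention and the outcome were present.
cases = [
    {"intervention": True,  "outcome": True},   # True Positive
    {"intervention": True,  "outcome": False},  # False Positive
    {"intervention": False, "outcome": False},  # True Negative
    {"intervention": False, "outcome": True},   # False Negative
    {"intervention": True,  "outcome": True},   # another True Positive
]

counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
for case in cases:
    if case["intervention"]:
        counts["TP" if case["outcome"] else "FP"] += 1
    else:
        counts["FN" if case["outcome"] else "TN"] += 1

print(counts)  # e.g. {'TP': 2, 'FP': 1, 'TN': 1, 'FN': 1}
```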
Common uses of the Confusion Matrix

The use of Confusion Matrices is most commonly associated with the field of machine learning and predictive analytics, but it has much wider application. This includes the fields of medical diagnostic testing, predictive maintenance, fraud detection, customer churn prediction, remote sensing and geospatial analysis, cyber security, computer vision, and natural language processing. In these applications the Confusion Matrix is populated by the number of cases falling into each of the four categories. These numbers are in turn the basis of a wide range of performance measures, which are described in detail in the Wikipedia article on the Confusion Matrix. A selection of these is described here, in this blog on the use of the EvalC3 Excel app.

The claim

Although the use of a Confusion Matrix is commonly associated with quantitative analyses of performance, such as the accuracy of predictive models, it can also be a useful framework for thinking in more qualitative terms. This is a less well known and publicised use, which I elaborate on below. It is the inclusion of this wider potential use that is the basis of my claim that the Confusion Matrix can be seen as a general-purpose analytic framework.

The supporting arguments

The claim has at least four main arguments:

  1. The structure of the Confusion Matrix serves as a useful reminder, and checklist, that at least four different kinds of cases should be sought when constructing and/or evaluating a claim that X (e.g. an intervention) leads to Y (e.g. an outcome).
    1. True Positive cases, which we will usually start looking for first of all. At worst, this is all we look for.
    2. False Positive cases, which we are often advised to look for, but often don't invest much time in actually doing so. Here we can learn what does not work, and why.
    3. False Negative cases, which we probably look for even less often. Here we can learn what else works, and perhaps why.
    4. True Negative cases, because sometimes there are asymmetric causes at play, i.e. not just the absence of the expected causes.
  2. The contents of the Confusion Matrix help us to identify interventions that are necessary, sufficient, or both. This can be practically useful knowledge (see the sketch after this list).
    1. If there are no FP cases, this suggests an intervention is sufficient for the outcome to occur. The more cases we investigate without finding an FP, the stronger this suggestion is. But if only one FP is found, that tells us the intervention is not sufficient. Single cases can be informative. Large numbers of cases are not always needed.
    2. If there are no FN cases, this suggests an intervention is necessary for the outcome to occur. The more cases we investigate without finding an FN, the stronger this suggestion is. But if only one FN is found, that tells us the intervention is not necessary.
    3. If there are no FP or FN cases, this suggests an intervention is both sufficient and necessary for the outcome to occur. The more cases we investigate without finding an FP or FN, the stronger this suggestion is. But if only one FP, or one FN, is found, that tells us that the intervention is not sufficient, or not necessary, respectively.
  3. The contents of the Confusion Matrix help us identify the type and scale of errors, and their acceptability. FP and FN cases are two different types of error that have different consequences in different contexts. A brain surgeon will be looking for an intervention that has a very low FP rate, because errors in brain surgery can be fatal and cannot be recovered from. On the other hand, a stockmarket investor is likely to be looking for a more general-purpose model, with few FNs. However, it only has to be right 55% of the time to still make them money, so a high rate of FPs may not be a big concern: they can recover their losses through further trading. In the field of humanitarian assistance the corresponding concerns are with coverage (reaching all those in need, i.e. minimising False Negatives) and leakage (minimising inclusion of those not in need, i.e. False Positives). There are Confusion Matrix based performance measures for both kinds of error, and for the degree to which the two kinds of error are balanced (see the Wikipedia entry).
  4. The contents of the Confusion Matrix can help us identify useful case studies for comparison purposes. These can include:
    1. Cases which exemplify the True Positive results, where the model (e.g an intervention) correctly predicted the presence of the outcome. Look within these cases to find any likely causal mechanisms connecting the intervention and outcome. Two sub-types can be useful to compare:
      1. Modal cases, which represent the most common characteristics seen in this group, taking all comparable attributes into account, not just those within the prediction model. 
      2. Outlier cases, which represent those which were most dissimilar to all other cases in this group, apart from having the same prediction model characteristics.
    2. Cases which exemplify the False Positives, where the model incorrectly predicted the presence of the outcome. There are at least two possible explanations that can be explored:
      1. In the False Positive cases, there are one or more other factors that all the cases have in common, which are blocking the model configuration from working i.e. delivering the outcome
      2. In the True Positive cases, there are one or more other factors that all the cases have in common, which are enabling the model configuration to work, i.e. deliver the outcome, but which are absent in the False Positive cases
        1. Note: For comparisons with TP cases, TP and FP cases should be maximally similar in their case attributes. I think this is called MSDO (most similar, different outcome) based case selection.
    3. Cases which exemplify the False Negatives, where the outcome occurred despite the absence of the attributes of the model. There are three possibilities of interest here:
      1. There may be some False Negative cases that have all but one of the attributes found in the prediction model. These cases would be worth examining, in order to understand why the absence of a particular attribute that is part of the predictive model does not prevent the outcome from occurring. There may be some counter-balancing enabling factor at work, enabling the outcome.
      2. It is possible that some cases have been classed as FNs because they were missing specific data on crucial attributes that would have otherwise classed them as TPs.
      3. Other cases may represent genuine alternatives, which need within-case investigation to identify the attributes that appear to make them successful 
    4. Cases which exemplify the True Negatives, where the absence of the attributes of the model is associated with the absence of the outcome.
      1. Normally these are seen as not being of much interest. But there may be cases here with all but one of the intervention attributes. If found, then the missing attribute may be viewed as:
        1. A necessary attribute, without which the outcome cannot occur
        2. An INUS attribute i.e. an attribute that is Insufficient but Necessary in a configuration that is Unnecessary but Sufficient for the outcome (See Befani, 2016). It would then be worth investigating how these critical attributes have their effects by doing a detailed within-case analysis of the cases with the critical missing attribute.
      2. Cases may become TNs for two reasons. The first, and most expected, is that the causes of positive outcomes are absent. The second, which is worth investigating, is that there are additional and different causes at work which are causing the outcome to be absent. The first of these is described as causal symmetry, the second as causal asymmetry. Because of the second possibility it is worthwhile paying close attention to TN cases, to identify the extent to which symmetrical or asymmetrical causes are at work. The findings could have significant implications for any intervention that is being designed. Here a useful comparison would be between maximally similar TP and TN cases.
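
As a minimal sketch of arguments 2 and 3 above (the counts below are invented), the necessity/sufficiency suggestions and some error-related measures can be read straight off the four cell counts:

```python
# Minimal sketch: reading necessity/sufficiency suggestions and error measures
# from Confusion Matrix counts. The counts below are invented for illustration.
TP, FP, TN, FN = 18, 0, 9, 3

suggests_sufficient = (FP == 0)   # intervention never present without the outcome
suggests_necessary = (FN == 0)    # outcome never present without the intervention
print("Suggests sufficient:", suggests_sufficient)
print("Suggests necessary:", suggests_necessary)

# Error-related measures (using the humanitarian framing as an analogy):
coverage = TP / (TP + FN)                        # share of outcome cases reached
leakage = FP / (TP + FP) if (TP + FP) else 0.0   # share of covered cases that were errors
accuracy = (TP + TN) / (TP + FP + TN + FN)
print(f"coverage={coverage:.2f}, leakage={leakage:.2f}, accuracy={accuracy:.2f}")
```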
Resources

Some of you may know that I have built the Confusion Matrix into the design of EvalC3, an Excel app for cross-case analysis that combines measurement concepts from the disparate fields of machine learning and QCA (Qualitative Comparative Analysis). With fair winds, this should become available as a free-to-use web app in early 2024, courtesy of a team at Sheffield Hallam University. There you will be able to explore and exploit the uses of the Confusion Matrix for both quantitative and qualitative analyses.



Saturday, October 28, 2023

Beyond summarisation by AI and/or editors – readers can now interrogate full transcripts of meeting discussions

Over the last two months, a small group of us have been managing an MSC Monthly Online Gathering. In each meeting we have recorded the discussions and then generated a transcript, both using Otter.AI. I have then used Claude AI to generate a one-page summary of each discussion. That in itself seems likely to be useful to both attendees and non-attendees (though I have yet to obtain feedback on this meeting output). You can view two AI summaries of discussions in the October meeting here:

https://mande.co.uk/wp-content/uploads/2023/10/18th-October-MSC-AM-Rick.pdf

https://mande.co.uk/wp-content/uploads/2023/10/18th-October-PM-Konny.pdf

But why not jump ahead and give people more than a simple feedback opportunity? Let's enable them to question the full text of the transcript, in their own individual way, albeit after being informed about the overall topics covered during the discussion via the AI summaries above. This is now possible using a third-party app known as Pickaxe. Here you can design an AI prompt that can then be made publicly usable, preloaded with a given discussion transcript.

Here are links to the two very simple Pickaxe public prompts I have developed, which you can now use to interrogate the two discussions.

AM session 
PM session

To ask follow-up questions, click on "Go to Chat".

If you try these out, I will get feedback in the form of a visible record of how you used it. You could also provide feedback on this experience, using the Comment function below.

Give it a go, now...!


Postscript 31 October

I think the performance of Pickaxe on this task is poor, compared to that of Claude AI on the same task. I will be disabling this implementation in the next day or so.


Thursday, August 31, 2023

Evaluating thematic coding and text summarisation work done by artificial intelligence (LLM)


Evaluation is a core part of the workings of artificial intelligence algorithms. It is something that can be built in, in the shape of specific segments of code. But it is also an additional human element which needs to complement and inform the subsequent use of any outputs of artificial intelligence systems.

If we take supervised machine learning algorithms as one of the simpler forms of artificial intelligence, all of these have a very simple basic structure. Their operations involve the reiteration of search followed by evaluation. For example, we have a dataset which describes a number of cases, which could be different locations where a particular development intervention is taking place. Each of these cases has a number of attributes which we think may be useful predictors of an outcome we are interested in. In addition, some of those predictors (or combinations thereof) might reflect some underlying causal mechanisms which it would be useful for us to know about. The simplest form of machine learning will involve what is called an exhaustive or brute force search of each possible combination of those attributes (defined in terms of their presence or absence, in this simple example). Taking one combination at a time, the algorithm will evaluate whether it predicted the outcome or not, and then store that judgement. Reiterating the process, it will then compare the next judgement to the earlier judgement and replace the earlier judgement if the new one is better. And so on, until all possible combinations have been evaluated and compared to the previous best judgement. In more complex machine learning algorithms involving artificial neural networks the evaluation and feedback processes can be much more complex, but the abstract description still fits.
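
As a minimal sketch of that search-and-evaluate loop (the attributes, cases and outcomes below are invented):

```python
# Minimal sketch of an exhaustive ("brute force") search over attribute
# combinations, each evaluated by its prediction accuracy. Data are invented.
from itertools import combinations

attributes = ["training", "local_partner", "follow_up"]
cases = [
    {"training": 1, "local_partner": 1, "follow_up": 0, "outcome": 1},
    {"training": 1, "local_partner": 0, "follow_up": 1, "outcome": 0},
    {"training": 0, "local_partner": 1, "follow_up": 1, "outcome": 1},
    {"training": 0, "local_partner": 0, "follow_up": 0, "outcome": 0},
]

def accuracy(model):
    """Predict the outcome wherever all the model's attributes are present."""
    correct = 0
    for case in cases:
        predicted = all(case[a] == 1 for a in model)
        correct += int(predicted == bool(case["outcome"]))
    return correct / len(cases)

best_model, best_score = None, -1.0
for size in range(1, len(attributes) + 1):
    for model in combinations(attributes, size):   # every possible combination
        score = accuracy(model)
        if score > best_score:                      # keep the better judgement
            best_model, best_score = model, score

print("best model:", best_model, "with accuracy:", best_score)
```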

What I'm interested in talking about here is what happens outside the block of code that does this type of processing. Specifically, the products that are produced and how we humans can evaluate their value. This is territory where a lot of effort has already been expended, most notably on the subject of algorithmic fairness and what is known as the alignment problem. These could be crudely described as representing short- and long-term concerns respectively. I won't be exploring that literature here, interesting and important as it is.

What I will be talking about here is two examples of my own current experiments with the use of one AI application known as Claude AI, used to do some forms of qualitative data analysis. In the field that I work in, which is largely to do with international development aid programs, a huge amount of qualitative data i.e text is generated and I think it is fair to say that its analysis is a lot more problematic than when we are dealing with many forms of quantitative data. So the arrival of large language model (LLM) versions of artificial intelligence appears to offer some interesting opportunities for making some usable progress in this difficult area.

The text data that I have been working with has been generated by participants in a participatory scenario planning process, carried out using ParEvo.org, and implemented by the International Civil Society Centre in Germany this year. The full details of that exercise will be available soon in an ICSC publication. The exercise generated a branching tree structure of storylines about the future, built from 109 paragraphs of text, contributed by 15 participants, over eight iterations. What I will be describing here concerns two types of analysis of that text data.

Text summarisation

[this section has been redrafted] The first was a text summarisation task, where I asked Claude AI to produce one-sentence headline summaries of each of these 109 texts. Text summarisation is a very common application of LLMs. This it did quickly, as usual, and the results looked plausible. But by now I had also learned to be appropriately sceptical and was asking myself how 'accurate' these headlines were. I could examine each headline and its associated text, but this would take time. So I tried another approach.

I opened up a new prompt window in Claude AI and uploaded two files: one containing the headlines, and the other containing each of the 109 texts preceded by an identification number. I then asked Claude AI to match each headline with the text it best described, and to display the results using the ID number of the text (rather than its full contents) and the predicted associated headline. This process has some similarities with back translation. What I was interested in here was how well it could reassign the headlines to their original texts. If it did well, this would give me some confidence in the accuracy of its analytic processes, and might obviate the need for a manual check of each headline's fit with its content.

My first attempt was a clear failure, with a classification accuracy of only 21%. On examination this was caused by the way I had formatted the uploaded data. The second attempt, using two separate data files, was more successful: this time the classification accuracy was 63%. Given that the 27% error could occur at two stages (headline creation and headline matching), it could be argued that the classification error was more like half this value, i.e. 13.5%, and so the classification accuracy was more like 76.5%. At this point it seemed worthwhile to also examine the misclassifications (a back-translation stage called reconciliation), i.e. which headline was mismatched with which. An examination of the false classifications suggested that around 40% of the mismatches may have been because of words the headlines had in common, despite the full headlines being different.
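
A minimal sketch of how this back-translation check can be scored (the IDs and headline labels below are invented, not the actual ParEvo data):

```python
# Minimal sketch: scoring the headline "back translation" check - what share
# of headlines were re-matched to the text they were originally written for?
# IDs and headline labels are invented for illustration.

original_pairing = {1: "H1", 2: "H2", 3: "H3", 4: "H4"}   # text ID -> its headline
llm_rematching = {1: "H1", 2: "H3", 3: "H3", 4: "H4"}     # text ID -> headline the LLM assigned

matches = sum(original_pairing[i] == llm_rematching[i] for i in original_pairing)
print(f"classification accuracy: {matches / len(original_pairing):.0%}")

# Listing the mismatches supports the "reconciliation" step described above
mismatches = {i: (original_pairing[i], llm_rematching[i])
              for i in original_pairing if original_pairing[i] != llm_rematching[i]}
print("mismatches (text ID: original vs assigned):", mismatches)
```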

Where does that leave me? With some confidence in the headline generation process. But could we do better? Could we find a better way to generate reproducible headlines? See further below, where I talk about ensemble methods.

Content analysis

The second task was a type of content analysis. Because of a specific interest, I had separated a subset of the 109 paragraphs into two groups: the first of which had been the subject of further narrative development by the participants (aka surviving storylines), and the second being those which were not developed any further (aka extinct storylines). I asked Claude AI to analyse the subset of the texts in terms of three attributes: the vocabulary, the style of writing, and the genre. Then, for each attribute, to sort the texts into two groups, and describe what each group had in common and how they differed from the other group. It then did so. Here is an image of its output.

But how can I evaluate this output? If I looked at one of the texts in a particular group, would I find the attributes that Claude AI was telling me the group it belonged to possessed? In order to make this form of verification easier, and smaller in scale, I gave Claude AI a follow-up task: for each of the two groups under each of the three attributes of the text, Claude AI should provide the ID number of an exemplar body of text which best represented the presence of the characteristics that were described. This it was able to do, and in my first use of the specific case examples I found that 9/10 did fit the summary description provided for the group. This strategy is similar to another one which I've used with GPT4, when trying to extract specific information about evaluation methods used in a set of evaluation reports. There I asked it to provide page or paragraph references for any claim about what methods were being used in the evaluation. Broadly speaking, in a large majority of cases, these page references pointed to relevant sections of text.

My second strategy was another version of back translation, connecting concrete instances with pre-existing abstract descriptions. This time I opened a new prompt session, still within Claude AI, and uploaded a file containing the same subset of paragraphs. Then in the prompt window I copied and pasted the description of the attributes of the three sets of two groups identified earlier (without information on which text belonged to which group). I then asked Claude AI to identify which paragraphs of text fitted which of the 3 x 2 groups, which it did. I then collated the results of the two tasks in an Excel file, which you can see here below (click on image to magnify it). The green cells are where the predicted group matches the original group, and the yellow cells are where there were mismatches. The overall classification accuracy was 67%, which is better than chance but not great either. I should also add that this was done with prompt information that included the IDs of the exemplars mentioned above (a format called "one-shot learning").


What was I evaluating when I was doing these "reverse translations"? It could probably be described as a test of, or search for, some form of construct validity. Was there any stable concept involved? 

Ensemble methods

Given the two results reported above, which were better than chance but not much better, what else could be done? There is one possible way forward, which might give us more confidence in the products generated by LLM analyses. Both Claude AI and ChatGPT4, and probably others, allow users to hit a Retry button, to generate another response to the same prompt. These responses will usually vary, and the degree of variation can be controlled by a parameter known as "temperature".

An ensemble approach in this context would be to generate multiple responses using the same prompt and then use some type of aggregation process to find the best result, similar to "wisdom of crowds" processes. In its simplest form this would, for example, involve counting the number of times each different headline was proposed for the same item of text, and selecting the one with the highest count. This approach will work where you have predefined categories as "targets". Those categories could have been developed inductively (as above) or deductively, from prior theory. It may even be possible to design a prompt script that includes multiple generation steps, and even the aggregation and evaluation stages.
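
A minimal sketch of that simplest aggregation step (the headlines below are invented):

```python
# Minimal sketch: majority voting over headlines generated in repeated runs
# of the same prompt for one item of text. Example headlines are invented.
from collections import Counter

proposals = [
    "Coastal cities adapt", "Coastal cities adapt", "Cities retreat inland",
    "Coastal cities adapt", "Cities retreat inland", "Coastal cities adapt",
    "Seawalls fail", "Coastal cities adapt", "Cities retreat inland",
    "Coastal cities adapt",
]

votes = Counter(proposals)
best_headline, count = votes.most_common(1)[0]
print(f"selected headline: '{best_headline}' ({count}/{len(proposals)} runs)")
```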

But to begin with I will probably focus on testing a manual version of the process. I will report on some experiments with this approach in the next few days....

Update 02/09/23: A yet to be tested draft prompt that could automate the process


A lesson learned on the way: I initially wrote out a rough draft of a Claude AI prompt that might help automate the process I've described above. I then asked Claude AI to convert this into a prompt which would be understood and would generate reliable and interpretable results. When it did this, it was clear that part of my intentions had not been understood correctly (however you interpret the word understood). This could be just an epiphenomenon, in the sense of it only being generated by this particular enquiry. Or it could point to a deeper or more structurally embedded analytic risk that would have consequences if I actually asked Claude AI to implement the rough draft in its original form (as distinct from simply refining that text as a prompt). The latter possibility concerned me, so I edited the prompt text that had been revised by Claude AI to remove the misunderstood part of the process. The version you see above is Claude AI's interpretation of my revised version, which I think will now work. Let's see...!

Update 03/09/23: It looks like the ensemble method may work as expected. Using 10 iterations only, which is a small number compared to how ensembles are normally used, the classification accuracy increased to 84%. In the data displayed about the number of times each predicted headline was matched to a given text, there were 4 instances where there were ties. There were also 8 instances where the best match was still only found in fewer than 5 of the 10 iterations. More iterations might generate more definitive best matches and increase the accuracy rate. The correct match was already visible in the second- and third-ranking best matches of 4 of the 18 incorrectly matched headlines.

Another lesson learned, perhaps: careful wording of prompts is important; the more explicit the instructions are, the better. I learned to preface the word "match" with a more specific "analyze the content of all the numbered texts in File 1 and identify which one the headline best describes". Careful formatting of the text data files was also potentially important: making it clear where each text began and ended, and removing any formatting artifacts that could cause confusion.

And because of experiences with such sensitivities, I think I should re-do the whole analysis, to see if I generate the same or similar results!!!

Ensembles of brittle prompts?

I just came across a glimpse of a paper, "Prompt Ensembles Make LLMs More Reliable", which is a different version of the idea I explored above. Here the prompt that is in use is also varied, from iteration to iteration.

 





Friday, May 26, 2023

Finding useful distinctions between different futures

 

This blog posting is a response to Joseph Voros's informative blog posting about the Futures Cone. It is a useful contribution inasmuch as it helps us think about the future in terms of different sets of possibilities. Here is a copy of his edited version.

Figure 1: Voros, 2017


My alternative, shown below, was developed in the context of supporting ParEvo.org explorations of alternative futures. It has some similarities and differences. For a start, here is the diagram.

Figure 2: Sets and sub-sets of alternative futures Davies, 2023

I will now list Joseph's explanations of each of the terms he used, and how they might relate to mine (in red).


  • Possible – these are those futures that we think ‘might’ happen, based on some future knowledge we do not yet possess, but which we might possess someday (e.g., warp drive). I think these fall in the grey area above (which also contains the dark and light green).
  • Plausible – those we think ‘could’ happen based on our current understanding of how the world works (physical laws, social processes, etc). I think these fall somewhere within the green matrix.
  • Probable – those we think are ‘likely to’ happen, usually based on (in many cases, quantitative) current trends. These probably fall within the Likely row of the green matrix
  • Preferable – those we think ‘should’ or ‘ought to’ happen: normative value judgements as opposed to the mostly cognitive, above. There is also of course the associated converse class—the un-preferred futures—a ‘shadow’ form of anti-normative futures that we think should not happen nor ever be allowed to happen (e.g., global climate change scenarios comes to mind). These probably fall within the Desirable column of the green matrix.
  • Projected – the (singular) default, business as usual, ‘baseline’, extrapolated ‘continuation of the past through the present’ future. This single future could also be considered as being ‘the most probable’ of the Probable futures. As suggested above, probably at the most likely end of the Likely row in the above green matrix
  • (Predicted) – the future that someone claims ‘will’ happen. I briefly toyed with using this category for a few years quite some time ago now, but I ended up not using it anymore because it tends to cloud the openness to possibilities (or, more usefully, the ‘preposter-abilities’!) that using the full Futures Cone is intended to engender. Probably also at the most likely end of the Likely row in the above green matrix
Preposterous events are not really covered. Perhaps they are at the extreme end of the Unlikely events with known probabilities, i.e. zero likelihood.

 

Though lacking in alliteration, my schema does have some more practically useful features.

The primary additional feature is that for each different kind of future there are some conjectured consequences, in terms of likely appropriate responses. Some of these are shown in red:

  • Organisational "slack", i.e. uncommitted resources or reserves that could enable responses to the unforeseen (though, of course, not every kind of unforeseen event)
  • Fringe investments, such as blue sky research, can be appropriate where a possibility is in sight but its likelihood of happening is far from clear
  • Robust responses are those that might work, though not necessarily be the most effective or most efficient, across a span of possibilities having varying probabilities and desirabilities
  • Customised responses are those more tailored to specific combinations of un/likely and un/desirable events. The following more detailed version of the green matrix describes some major possible variations of this kind
Figure 3
Where to next?

I would like to hear from readers their views on the possible utility of these distinctions. And whether any other distinctions could be added to or replace those I have used. 

Postscript 2025 01 22

I came across this useful matrix view in
Luís, A., Garnett, K., Pollard, S. J. T., Lickorish, F., Jude, S., & Leinster, P. (2021). Fusing strategic risk and futures methods to inform long-term strategic planning: Case of water utilities. Environment Systems & Decisions, 41(4), 523–540. https://doi.org/10.1007/s10669-021-09815-1




Monday, March 06, 2023

How can evaluators practically think about multiple Theories of Change in a particular context?


This blog posting has been prompted by participation in two recent events. One was some work I was doing with the ICRC, reviewing Terms of Reference for an evaluation. The other was listening in as a participant to this week's European Investment Bank conference titled "Picking up the pace: Evaluation in a rapidly changing world".

When I was reviewing some Terms of Reference for an evaluation I noticed a gap which I have seen many times before. While there was a reasonable discussion of the types of information that would need to be gathered there was a conspicuous absence of any discussion of how that data would be analysed. My feedback included the suggestion that the Terms of Reference needed to ask the evaluation team for a description of the analytical framework they would use to analyse the data they were collecting.

The first two sessions of this week's EIB conference were on the subject of foresight and evaluation. In other words, how evaluators can think more creatively and usefully about possible futures – a subject of considerable interest to me. You might notice that I've referred to futures rather than the future, intentionally emphasising the fact that there may be many different kinds of futures, and that with some exceptions (e.g. climate change) it is not easy to identify which of these will actually eventuate.

To be honest, I wasn't too impressed with the ideas that came up in this morning's discussion about how evaluators could pay more attention to the plurality of possible futures. On the other hand, I did feel some sympathy for the panel members who were put on the spot to answer some quite difficult questions on this topic.

Benefiting from the luxury of more time to think about this topic, I would like to make a suggestion that might be practically usable by evaluators, and worth considering by commissioners of evaluations. The suggestion concerns how an evaluation team could realistically give attention not just to a single "official" Theory of Change about an intervention, but to multiple relevant Theories of Change about an intervention and its expected outcomes. In doing so I hope to address both issues I have raised above: (a) the need for an evaluation team to have a conceptual framework structuring how it will analyse the data it collects, and (b) the need to think about more than one possible future and how it might be realised, i.e. more than one Theory of Change.

The core idea is to make use of something which I have discussed many times previously in this blog: what is known as the Confusion Matrix to those involved in machine learning, and more generally described simply as a truth table – one that describes four types of possibilities. It takes the following form:

In the field of machine learning the main interest in the Confusion Matrix is the associated performance measures that can be generated, and used to analyse and assess the performance of different predictive models.  While these are of interest, what I want to talk about here is how we can use the same framework to think about different types of theories, as distinct from different types of observed results.

There are four different types of Theories of Change that can be seen in the Confusion Matrix. The first (1) describes what is happening when an intervention is present and the expected outcome of that intervention is present. This is the familiar territory of the kind of Theories of Change that an evaluator will be asked to examine.

The second (2) describes what is happening when an intervention is present but the expected outcome of that intervention is absent. This theory would describe what additional conditions are present, or what expected conditions are absent, which make a difference – leading to the expected outcome being absent. When it comes to analysing data on what actually happened, identifying these conditions can lead to modification of the first (1) Theory of Change such that it becomes a better predictor of the outcome and there are fewer False Positives (found in cell 2). Ideally, the fewer False Positives the better. But from a theory development point of view there should always be some situations described in cell 2, because there will never be an all-encompassing theory that works everywhere. There will always be boundary conditions beyond which the theory is not expected to work. So an important part of an evaluation is not just to refine the theory about what works (1), but also to refine the theory of the circumstances in which it will not be expected to work (2), sometimes known as conditions or boundary conditions.

The third theory (3) describes what is happening when the intervention is absent but nevertheless the outcome is present. Consideration of this possibility involves recognition of what is known as "multi-finality", i.e. that some events can arise from multiple alternative causal conditions (or combinations of causal conditions). It's not uncommon to find advice to evaluators that they should consider alternative theories to those they are currently focused on, for example in the literature on contribution analysis. But it strikes me that this is often close to a ritualistic requirement, or at least treated that way in practice. In this perspective, alternative theories are a potential threat to the theory being focused on (1). But a much more useful perspective would be to treat these alternative theories as potentially useful other courses of action that an agent could take, which warrant serious attention in their own right. And if they are shown to have some validity this does not by definition mean that the main theory of change (1) is wrong. It simply means that there are alternative ways of achieving the outcome, which can only be a bonus finding.

The fourth theory (4) describes what is happening when the intervention is absent and the outcome is also absent. In its simplest interpretation, it may be that the actual absence of the attributes of the intervention is the reason why the outcome is not present. But this can't be assumed. There may be other factors which have been more important causes, for example the occurrence of an earthquake, or the holding of a very contested election. This possibility is captured by the term "asymmetric causality", i.e. that the causes of something not happening may not simply be the absence of the causes of something happening. Knowing about these other possible causes of the desired outcome not happening is surely important, in addition to and alongside knowing about how an intervention does cause the outcome. Knowing more about these causes might help other parties, with other interventions in mind, move cases with this experience from being True Negatives (4) to being False Negatives (3).

In summary, I think there is an argument for evaluators not being too myopic when thinking about the Theories of Change they need to pay attention to. It should not be all about testing the first (1) type of Theory of Change, and considering all the other possibilities simply as challengers, which may or may not then be dismissed. Each of the other types of theories (2, 3, 4) is important and useful in its own right and deserves attention.