Sunday, March 22, 2026

Introducing Rank Order Counterfactuals (ROC)



A counterfactual is a description of what would have happened if an intervention had not taken place. The use of randomised control groups is one way to construct a counterfactual. A population of people is randomly assigned to either a control group or an intervention group, and differences in the outcomes of the two groups are then compared. If the difference is statistically significant, a plausible causal claim can be made that the difference in outcomes is due to the intervention.

As might be expected, there are plenty of circumstances in which social programmes are designed and implemented where it is simply not practical to organise a randomised control group. In addition, the comparisons made will be between the averages of the two groups. In many social programmes, however, such averages are of limited practical use, because the implementation contexts are so varied that no single "solution" is likely to be applicable. Average effects can still be informative at a high level, but they need to be complemented by methods that take contextual diversity seriously.

I'm currently working with an evaluation team that is examining a large-scale public health programme in the United Kingdom, covering many different locations and involving many different types of local partnerships, but with one common outcome of concern: increasing people's physical activity levels in their daily lives. In their work the evaluation team is already making use of a causal configurational approach to the understanding of what works for whom in what circumstances. It is finding different configurations of causal conditions across these locations that are associated with changes in activity levels. This approach is consistent with the high level of diversity in locations, partnerships and interventions.

But what it does not yet have is a counterfactual: a defensible description of what might have happened in these locations in the absence of this intervention. This is where the idea of a rank order counterfactual becomes relevant. By a rank‑order counterfactual I mean a very specific kind of "what would have happened otherwise". Instead of trying to predict the exact outcome that would have been achieved in each location without an intervention, we can start by asking a simpler, comparative question: which location would probably have changed more, and which less, if the intervention had never existed? The answer will be in the form of a rank ordering of locations, from those with the most expected change to those with the least. That ranking would be constructed from all available baseline information, trends, and contextual knowledge. This proposed approach falls into a category of counterfactuals known as "logically constructed counterfactuals", and it aligns well with configurational evaluation because it focuses on patterns of relative change across diverse contexts.

A subsequent evaluation of those same locations should also be able to generate a new rank ordering, this time based on observed outcomes. The counterfactual and actual rankings can then be compared, using a scatterplot and correlation measures. The scatterplot is also visually powerful for communication: it lets people see at a glance which locations behave as expected and which stand out as surprises. If the intervention had no effect, the observed and counterfactual rankings should broadly agree, producing a strong positive correlation. If the intervention had positive, or perhaps even negative, effects, this should not happen: we might see various locations that are outliers from the expected trend. When locations we expected to be "natural leaders" did not improve much, and those we expected to be "natural laggards" moved to the top of the league table, that pattern is a signal that the intervention may have been influential, and it gives the evaluation team clear cases where alternative explanations should be probed. The task of the evaluation is then to probe those alternative explanations, not to assume the intervention is the only possible cause. The rankings are not a substitute for theory‑based evaluation; they are a way to make its claims sharper and more testable. The focus on ranking differences can convert a vague theory ("we think these factors matter") into a concrete, specific prediction about which locations should do better.
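As a minimal sketch of this comparison step, the following Python fragment computes a Spearman rank correlation between a counterfactual and an observed ranking, and flags the locations with the largest rank shifts. The location names, ranks and the outlier threshold are all invented for illustration.

```python
# Hypothetical example: six locations ranked 1 (most expected change) to 6 (least).
counterfactual = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}
observed       = {"A": 1, "B": 5, "C": 2, "D": 4, "E": 6, "F": 3}

def spearman_rho(r1, r2):
    """Spearman rank correlation for two complete rankings of the same items (no ties)."""
    n = len(r1)
    d2 = sum((r1[k] - r2[k]) ** 2 for k in r1)
    return 1 - (6 * d2) / (n * (n ** 2 - 1))

rho = spearman_rho(counterfactual, observed)

# Outliers: locations whose observed rank departs most from expectation.
shifts = {k: observed[k] - counterfactual[k] for k in counterfactual}
outliers = sorted(k for k, s in shifts.items() if abs(s) >= 3)
print(f"rho = {rho:.2f}, outliers = {outliers}")
```

Here location B (an expected "natural leader" that slipped) and location F (an expected "laggard" that climbed) would be the cases where alternative explanations are probed first.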

The sensitivity of the rank comparison process will depend on the number of ranked items. The more rank positions there are, the more sensitive the comparison will be to differences in performance, which is good. But, as research on sorting algorithms shows, the time required to generate a complete ordering, using any of the well-known methods, can be significant: it grows faster than proportionally to the number of items, though far slower than exponentially. In addition to the extra time required, a rank order counterfactual will require a stronger evidential base as the number of rank positions increases.

When a large number of locations are involved in an intervention one practical way of addressing this tension is to use a stratified random sample, and to generate the rankings for that sample only. Another approach to managing large numbers of locations is to think of ranked bands of locations rather than individual rankings for each location. What should be of interest, then, are systematic shifts in band membership between the counterfactual and actual observations – for example, locations expected to be in the “low‑change” band turning up in the “high‑change” band in practice.
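A band-based comparison can be sketched in a few lines. In this hypothetical fragment the band labels, location names and assignments are invented; the point is simply to tabulate transitions between expected and observed bands and list the locations that moved.

```python
from collections import Counter

# Invented example: expected (counterfactual) vs observed band membership.
expected_band = {"A": "high", "B": "high", "C": "mid", "D": "mid", "E": "low", "F": "low"}
observed_band = {"A": "high", "B": "mid", "C": "mid", "D": "high", "E": "high", "F": "low"}

# Count transitions between expected and observed bands.
transitions = Counter((expected_band[k], observed_band[k]) for k in expected_band)

# Locations that changed band; systematic upward moves from the "low-change"
# band would be the kind of signal described above.
movers = sorted(k for k in expected_band if expected_band[k] != observed_band[k])
print(transitions)
print(movers)
```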

In this way, rank‑order counterfactuals do not replace theory‑based evaluation, but sharpen it: they turn general expectations about context into explicit, testable predictions about who should have changed most in the absence of the programme. In work which I hope to document at the European Evaluation Society conference later this year, I will explain how a hierarchical card sorting process was used to generate argument- and evidence-based counterfactual rank orderings, and how an LLM was used to support this process.


Saturday, March 07, 2026

Making implicit knowledge explicit, contestable and usable: an introduction to The Ethnographic Explorer



Caveat Emptor: I delegated the writing of this blog posting to Claude AI (Sonnet 4.6), based on an extended prior dialogue on the subject below, then a summary prompt of what was wanted. My post-production edits were quite limited.
--o0o--

There is a problem that turns up repeatedly in evaluation practice — and in many other fields — that many of us work around rather than solve directly. The problem is this: the people who know most about a programme, a portfolio, or a set of cases often cannot easily say what they know, or why they make the judgements they do.

This is not a failure of intelligence. It is the normal condition of what Michael Polanyi called tacit knowledge — the "we know more than we can tell" problem that is endemic to any field built on experience and judgement. The challenge for evaluation is to find structured ways of drawing it out.

The theoretical anchor: information as difference

The approach behind The Ethnographic Explorer draws on a deceptively simple idea from Gregory Bateson: information is a difference that makes a difference. What we notice, and consider significant, is always defined by contrast — not by properties of things in isolation. A project is "successful" relative to others; a group is "vulnerable" in comparison to other groups in a given context. And some of those differences have noticeable consequences: they "make a difference". Together they can be seen as simple "IF...THEN..." rules.

This suggests that a structured method for eliciting knowledge should be built around comparison — specifically, around asking people to identify and articulate differences, and then to explain what those differences imply. The Hierarchical Card Sort methodology is one way of doing this and is the basis of The Ethnographic Explorer's process of inquiry.

How the Ethnographic Explorer works

There are three linked stages.

In the Sort stage, a respondent is presented with a set of cases — projects, organisations, events, beneficiaries, or any entities they know well. They are asked to divide all the cases into two groups representing the most significant difference between them, from their point of view. Each group is named, the difference is recorded, and then the respondent is asked: "What difference does this difference make?" — a question that surfaces the consequence or implication of the distinction. The process repeats on each subgroup until every group contains a single case. The result is a binary tree: a structured, hierarchical map of the respondent's view of the case set.

The tree is informative in three ways: it reveals the contents of the distinctions the respondent considers important; it identifies the limits of their knowledge (where further differences cannot be found); and it indicates the direction of their attention (where further distinctions could usefully be explored).
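The binary tree such a sort produces can be sketched as a simple nested structure. The node fields and the example cases below are my own assumptions for illustration, not the tool's actual data schema.

```python
# Each split records the named difference and the consequence it makes
# ("What difference does this difference make?"), plus the two subgroups.
def split(cases, difference, consequence, group_a, group_b):
    return {"cases": cases, "difference": difference,
            "consequence": consequence, "children": [group_a, group_b]}

def leaf(case):
    return {"cases": [case], "children": []}

# A tiny invented sort of three projects, P1-P3.
tree = split(
    ["P1", "P2", "P3"],
    "Works through local government vs directly with communities",
    "Determines how quickly activities can start",
    split(["P1", "P2"], "Urban vs rural", "Shapes outreach costs",
          leaf("P1"), leaf("P2")),
    leaf("P3"),
)

def depth(node):
    """Depth of the tree: a rough indicator of how far distinctions were pursued."""
    return 1 + max((depth(c) for c in node["children"]), default=0)

print(depth(tree))
```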

In the Compare stage, the facilitator asks comparison questions at each split in the tree. Questions can be in degree ("Which group of cases is more likely to face sustainability problems?") or in kind ("How do these two groups differ in terms of their relationship with local government?"). Degree questions produce a ranking of all cases; kind questions produce descriptive contrasts. Both types build on the structure already revealed by the sort.

In the Contrast stage, any two degree-based rankings are plotted against each other in a scatter plot. The resulting quadrant analysis shows where the two rankings agree (cases high on both, or low on both) and where they diverge (cases high on one and low on the other). Adjustable cutoff sliders allow the facilitator to explore different thresholds, and a Spearman correlation coefficient summarises the overall relationship. Outlier cases — those that diverge most between the two rankings — are often the most analytically interesting. Strong relationships can be cast as potentially useful IF...THEN rules.
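The quadrant logic can be sketched as follows. The case names, the two rankings and the single cutoff value are invented; the tool's adjustable sliders correspond to varying `cutoff` here.

```python
# Two invented degree rankings of six cases (rank 1 = highest).
rank_x = {"P1": 1, "P2": 2, "P3": 3, "P4": 4, "P5": 5, "P6": 6}
rank_y = {"P1": 2, "P2": 6, "P3": 1, "P4": 3, "P5": 4, "P6": 5}

cutoff = 3  # adjustable threshold, as with the tool's sliders

def quadrant(case):
    high_x = rank_x[case] <= cutoff  # low rank number = high position
    high_y = rank_y[case] <= cutoff
    return {(True, True): "high/high", (True, False): "high/low",
            (False, True): "low/high", (False, False): "low/low"}[(high_x, high_y)]

quadrants = {c: quadrant(c) for c in rank_x}
# Diverging cases (high on one ranking, low on the other) are the analytic outliers.
diverging = sorted(c for c, q in quadrants.items() if q in ("high/low", "low/high"))
print(quadrants, diverging)
```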

The Ethnographic Explorer

With substantial coding help from Claude AI, I have been developing a browser-based implementation of this methodology — The Ethnographic Explorer — as a standalone single-file application. This new version supersedes an earlier WordPress-based tool at ethnographic.mande.co.uk, which required a server to run. The new version requires nothing beyond a web browser.

▶ Try The Ethnographic Explorer [When you get there, click on Introduction for guidance on how to explore the tool's functions]

The tool is designed for use in a shared-screen video call with a single respondent, or screened to multiple participants in a workshop setting. The facilitator drives the interface; the respondent provides the knowledge. A typical exercise with 8–12 cases takes between 45 minutes and two hours, depending on the range of comparisons made.

A worked example: 12 Largest Cities

Seven JSON files are available alongside the app to allow you to explore its use. The first is simply a list of the cities, which you can then sort, compare and contrast. The others contain the same list, already sorted by Claude AI at my request, on the basis of "their appeal to international tourists", as seen from six different perspectives.

To load any of these datasets, click Import JSON in the top-right toolbar and select the file (after it has been downloaded to your computer).

Download 12 Largest Cities (unsorted)

Download 12 Largest Cities - Budget Backpacker view (sorted)

Download 12 Largest Cities - Food and Culture view (sorted)

Download 12 Largest Cities - Safety Conscious view (sorted)

Download 12 Largest Cities - Sustainable Tourism view (sorted)

Download 12 Largest Cities - Travel Journalist view (sorted)

Download 12 Largest Cities - Heritage Specialist view (sorted)

What the tool produces

Each exercise generates a structured record of the respondent's knowledge about a set of cases:

  • a binary sort tree, with each split labelled by the most significant difference and its consequence
  • one or more named rankings, each derived from a sequence of binary degree judgements working from the root set down through the subsets
  • one or more descriptive contrasts, capturing how subgroups differ in kind on specified attributes
  • a scatter plot for any pair of degree rankings, with quadrant analysis and Spearman correlation
  • a qualitative responses panel, collecting the in-kind descriptions at each split

The full exercise exports as a JSON file, preserving the complete tree structure, all difference descriptions, all responses, and the exercise metadata. Files can be imported to resume exactly where you left off, or shared with a colleague for further analysis.

Beyond evaluation

The underlying structure of the method applies wherever you have a set of cases and a respondent who has differentiated knowledge about them. Some other framings that could be explored with the same tool:

  • Organisational learning: a team reviews a set of completed projects, identifying the distinctions that, in retrospect, most predicted success or failure
  • Capacity assessment: a trainer sorts a cohort of staff by their readiness for different kinds of work, making the basis for those judgements explicit and discussable
  • Stakeholder analysis: a key informant sorts a set of stakeholder organisations, revealing the distinctions they consider most consequential for programme implementation
  • Policy analysis: a policy analyst sorts a set of interventions by their perceived effectiveness, then compares that ranking against a ranking of their political feasibility

In each case, the sort structure is the same, the comparison logic is identical, and only the cases, the domain framing, and the named differences change.

An invitation to experiment

The tool is best understood by using it. I would encourage anyone engaged in evaluation, learning, or knowledge management work to try loading in a set of cases they know well — even with a rough sort to start — and see what structure emerges. The Compare and Contrast stages are particularly useful for surfacing assumptions that are rarely made explicit in standard reporting.

I am continuing to develop the tool and would welcome feedback on the methodology, the interface, or applications I have not yet considered.

Accessing the code: In addition to trying the app online, you can download a copy of the code and run it independently. Go to the app online, right-click, select View Page Source, copy the entire code, paste it into a text file, rename the file to end in .html rather than .txt, and open it in a web browser. The tool runs entirely in your browser — no login required, no data sent anywhere.

Further reading

  • Polanyi, M. (1966). The Tacit Dimension. Doubleday. — the source of the "we know more than we can tell" framing.
  • Bateson, G. (1972). Steps to an Ecology of Mind. Ballantine Books. — the source of the "difference that makes a difference" framing.
  • Kelly, G.A. (1955). The Psychology of Personal Constructs. Norton. — the intellectual precursor to card sorting methods.
  • The 2025 WordPress version of the tool, with documentation and worked examples, is at ethnographic.mande.co.uk
  • Help pages for the comparison stage, with nine worked examples of question types, are at the Comparisons Help page

Tuesday, March 03, 2026

Optimising method selection: an introduction to the set covering problem — and a tool to help





Caveat Emptor: I delegated the writing of this blog posting to Claude AI (Sonnet 4.6), based on an extended prior dialogue on the subject below, then a summary prompt of what was wanted. My post-production edits were quite limited.
--o0o--

There is a class of problem that turns up repeatedly in evaluation planning, and in many other fields, that most of us solve informally and imprecisely. The problem is this: given a set of needs to be addressed, what is the smallest combination of responses that covers all of them?

In evaluation this might be: given a set of evaluation questions I need to answer, what is the minimum combination of methods that gives me adequate coverage across all of them? In public health it might be: which combination of clinics ensures every neighbourhood has access to at least one? In software testing: which set of test cases exercises every code path? In logistics: which depots can serve every delivery zone?

These are all instances of what computer scientists call the Set Covering Problem — a well-studied problem in combinatorial optimisation that dates back to the 1970s. The formal version asks: given a collection of sets, find the smallest sub-collection whose union covers all elements of a target universe.

Why it matters for evaluation practice

When designing an evaluation, practitioners face exactly this structure. We have a list of evaluation questions (or dimensions of quality, or stakeholder concerns), and a repertoire of methods — each of which can address some but not all of those questions to a satisfactory level. Choosing methods one at a time, based on familiarity or habit, rarely produces the most efficient combination. We either end up with redundant overlaps in some areas and blind spots in others, or with far more methods than the budget or timeline can support.

A more systematic approach asks: which combination of methods is both complete (covers all questions at an adequate level) and minimal (uses as few methods, and as little resource, as possible)?

How the optimisation works

For a small number of methods and questions, you could check all possible combinations by hand. But the number of combinations grows exponentially — with ten methods, there are over a thousand possible subsets to evaluate. This is where an algorithm helps.

The simplest approach is a greedy algorithm: at each step, pick the method that covers the most currently-uncovered questions, then repeat until everything is covered. This is fast and usually finds a good solution, but not necessarily the best one.
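The greedy step can be sketched in a few lines. The coverage sets below are invented for illustration (using some of the method and question names mentioned later), not the tool's actual default matrix.

```python
# Universe of requirements: the five question types.
questions = {"Descriptive", "Valuative", "Explanatory", "Predictive", "Prescriptive"}

# Invented coverage: which questions each method addresses adequately.
covers = {
    "Scenario Planning": {"Predictive", "Explanatory"},
    "Delphi":            {"Predictive", "Prescriptive"},
    "Horizon Scanning":  {"Descriptive", "Predictive"},
    "Backcasting":       {"Descriptive", "Valuative", "Explanatory"},
}

def greedy_cover(universe, sets):
    uncovered, chosen = set(universe), []
    while uncovered:
        # At each step, pick the method covering the most still-uncovered questions.
        best = max(sets, key=lambda m: len(sets[m] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("remaining questions cannot be covered")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen

solution = greedy_cover(questions, covers)
print(solution)
```

Note that greedy choices are locally best: with other matrices the result can be larger than the true minimum, which is why the exhaustive mode described next is also useful.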

A more thorough approach — exhaustive search — systematically checks all combinations up to a specified size and returns every minimal solution. This is slower but reveals the full landscape of equally-good options, which is often more useful than a single answer, particularly when cost or other practical constraints come into play.
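An exhaustive search over subsets is equally short to sketch, using the same invented coverage matrix as an assumption. It returns every cover found at the smallest subset size, which is the "full landscape of equally-good options".

```python
from itertools import combinations

questions = {"Descriptive", "Valuative", "Explanatory", "Predictive", "Prescriptive"}
covers = {
    "Scenario Planning": {"Predictive", "Explanatory"},
    "Delphi":            {"Predictive", "Prescriptive"},
    "Horizon Scanning":  {"Descriptive", "Predictive"},
    "Backcasting":       {"Descriptive", "Valuative", "Explanatory"},
}

def minimal_covers(universe, sets, max_size=4):
    # Check subsets in increasing size; stop at the first size with any cover,
    # so everything returned is equally minimal.
    for size in range(1, max_size + 1):
        found = [combo for combo in combinations(sorted(sets), size)
                 if set().union(*(sets[m] for m in combo)) >= universe]
        if found:
            return found
    return []

solutions = minimal_covers(questions, covers)
print(solutions)
```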

The Coverage Optimiser

With substantial coding help from Claude AI, I have been developing a small browser-based tool — the Coverage Optimiser — that applies both approaches to a user-defined matrix of methods and question types.

▶  Try the Coverage Optimiser [When you get there click on Introduction, to get suggestions on how to explore the functions of the app]

The default example matrix uses ten foresight methods (Scenario Planning, Delphi, Horizon Scanning, and others) rated against five question types — Descriptive, Valuative, Explanatory, Predictive, and Prescriptive — at HIGH, MEDIUM, or LOW in terms of their usefulness in building a futures perspective into an evaluation. The tool finds the minimal combinations of methods that achieve the desired coverage level across all question types.

Each solution is displayed with:

  • the methods involved, each with its cost rating
  • a question-by-question coverage check
  • the total cost of the combination
  • an overlap score — the number of questions covered by more than one method in the solution

Overlap is worth attending to: it indicates redundancy, which in evaluation terms means resilience. If one method proves impractical in the field, a solution with higher overlap is more likely to remain viable.

The matrix itself is fully editable in a dedicated Matrix Editor tab. You can rename methods and question types, adjust HIGH/MEDIUM/LOW ratings on a chosen criterion, set importance weights for each question type (0–10), set cost ratings for each method (0–10), sort rows by any column, import and export as JSON, and print or save the matrix or results as PDF.

Note: The Coverage Optimiser runs entirely in your browser — no login required, no data sent anywhere.

Beyond methods and questions

The tool is not limited to foresight methods or evaluation questions. The underlying logic applies wherever you have a set of options and a set of requirements, and where each option partially addresses some requirements. Some other framings that could be loaded into the same tool:

  • Solutions × Problems: which combination of policy interventions covers the broadest range of identified problems?
  • Stakeholders × Information needs: which combination of engagement activities ensures all key stakeholder groups have their core information needs met?
  • Data sources × Indicators: which combination of data collection instruments covers all required indicators, at minimum cost?
  • Partners × Geographic areas: which combination of implementing partners ensures all target districts are reached?

In each case the matrix structure is the same, the optimisation logic is identical, and only the labels and rating criteria change.

An invitation to experiment

The tool is best understood by using it. I would encourage anyone planning a multi-method evaluation, or foresight exercise, to load in their own methods and questions — even with rough ratings to start — and see what combinations emerge. The exhaustive search mode is particularly useful for revealing that several equally-minimal combinations exist, which opens up a more deliberate conversation about which is preferable given cost, feasibility, or complementarity.

I am continuing to develop the tool and would welcome feedback on the matrix structure, the rating scales, or applications I have not yet considered.

Those who have used EvalC3 will find a family resemblance here: both tools use systematic search to find efficient combinations — EvalC3 searching for attribute combinations that predict specific individual outcomes, the Coverage Optimiser searching for method combinations that cover multiple requirements. The underlying computational logic is related, even if the problems look different on the surface. 

Accessing the code: In addition to trying out the app as it already exists online, you can also download a copy of the code and have it working independently on your own website. Go to the app online, right-click your mouse, select View Page Source, copy the entire code, paste it into a text file, rename the file to end in .html rather than .txt, and click on that file in a directory to open it in a web browser. Simples, yes?


Further reading


  • A lot of the available reading on set covering algorithms is in the computer science domain. For a more accessible starting point, see the Wikipedia article on the Set cover problem 
  • The next blog posting explores comparisons with QCA.
  • This exercise in method development was an unplanned outcome of the writing of a book chapter on bridging the fields of evaluation and foresight. The book is...
    • Lea Kleinsorg, Jan Tobias Polak, Christian Grünwald (2027): Futures-informed evaluation: Methodological approaches and empirical applications, SpringerNature, Heidelberg.




Thursday, December 25, 2025

Extracting additional knowledge and performance from a configurational model that already has wide coverage

 
A decision tree algorithm, as available within EvalC3, can generate a classification tree (a set of predictive models) of the kind shown here.


Some of the models (each branch is a model) are very detailed (i.e. have lots of attributes) and have narrow coverage, such as HasQuotas+NotPost Conflict Situation+High Level of Human Development+Low Womens Status = Low levels of women's representation in Parliament, which covers two cases (Senegal, Tanzania).

Others are quite simple, with only two or three attributes, and can have much wider coverage, such as HasQuotas+IsPostConflict = High levels of women's representation in Parliament, which covers six cases (Burundi, Ethiopia, Mozambique, Namibia, South Africa, Uganda).

These wide-coverage models may have unexplored potential, in the form of unexploited information content within the cases they cover. The raw (i.e. numerical) outcome data for those cases alone can be examined and recalibrated, i.e. re-dichotomised into two new sub-groups representing relatively higher versus lower outcome values within that set only.

A new configurational analysis can then focus on that sub-set of cases to see if (a) any of the pre-existing attributes could predict membership of the two sub-groups, or (b) any additional attributes, based on other knowledge of these cases, could do so. The ability to predict such finer-grained performance differences would be a significant improvement.
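The recalibration step can be sketched as a median split within the covered cases. The outcome values below are invented for illustration; the real ones would come from the raw outcome data for the six cases the wide model covers.

```python
from statistics import median

# Invented raw outcome values (e.g. % women's representation) for the six
# cases covered by the wide model above.
outcomes = {"Burundi": 36, "Ethiopia": 39, "Mozambique": 43,
            "Namibia": 44, "South Africa": 46, "Uganda": 34}

# Re-dichotomise around the median of this sub-set only.
cut = median(outcomes.values())
higher = sorted(k for k, v in outcomes.items() if v > cut)
lower  = sorted(k for k, v in outcomes.items() if v <= cut)
print(cut, higher, lower)
```

The two resulting sub-groups then become the new outcome to be predicted in the follow-up configurational analysis.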

This analytic step is complementary to the move known as "pruning", where the removal of a model attribute improves coverage at the cost of precision. Here an extra attribute is sought that will improve precision, but at the cost of coverage. Perhaps it could be called "grafting"...

Postscript: But how significant will this addition to the model be? If, as above, six cases are involved, there are 2^6 = 64 ways of assigning them to two labelled groups, or 32 distinct two-group splits once the group labels are treated as interchangeable. So any one particular grouping of the cases into two sets has a 1/32, or 3.125%, chance of occurring randomly (if the cases are causally independent).
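The combinatorics can be checked in a couple of lines:

```python
# Splits of six cases into two groups.
n = 6
labelled = 2 ** n          # each case independently assigned to one of two labelled groups
unordered = labelled // 2  # halve, since the two group labels are interchangeable
print(labelled, unordered, f"{1 / unordered:.3%}")
```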

Tuesday, December 02, 2025

Objectives as data: The potential uses of updatable outcome targets

 The context

A specialist agency is funding more than 40 different partner organisations, each working in a different part of the country but with the same overall objective of increasing people's levels of physical activity (because of the positive health consequences). These partners are often working with quite different communities, and all have a substantial degree of independence in how they work towards the overall objective.

Some agency representatives have asked about the nature of the target that the programme as a whole is working towards, and have emphasised how essential it is that there be clarity in this area. By a target they mean an actual number: specifically, the percentage of people self-reporting that they achieve a certain level of physical activity each week, as identified by an annual survey that is already underway and will be repeated in future.

Possible responses

In principle it would be possible to set a target for the proportion of the population reporting being physically active, such as 75%. But it would be very hard to identify an optimal target percentage, given the diversity of partner localities and the communities within them.

Relative targets may be more appropriate, such as a 25% increase in reported activity levels. This would be especially useful if partners were each asked to identify what they think are achievable percentage increases in their own localities within the next survey period. This estimation would take place in a context where these partners already have experience of working in those locations, and of identifying some of the things that work and don't work. My hypothesis, yet to be tested, is that these partners will make quite conservative estimates. If so, this might come as some surprise to the donor and perhaps lead to some revision of their own expectations.

Taking this idea further, partners could be periodically asked if they wanted to adjust their expectations of achievable change upwards or downwards, in the time remaining in the intervention's lifespan, subject to being able to explain the rationale for doing so. My second hypothesis is that this number, and the accompanying commentary, could be a valid and useful form of progress reporting in its own right.

Making sense of the responses

An assessment of overall progress over a longer time scale would need to consider both the scale of ambitions and the extent of their achievement. These can't be combined into one number by a simple formula, because any such number could be achieved by adjusting expectations and/or performance. However, the relationship could be usefully represented by a scatterplot, with data points for each of the partners, of the kind shown below.

The location of partners in different quadrants suggests different implications for how the different partners should be managed:

  • High ambition/low achievement: May need additional support, capacity building, or problem-solving
  • Low ambition/low achievement: May need fundamental partnership restructuring or exit considerations
  • High ambition/high achievement: Candidates for scaling, sharing learning, reduced support intensity
  • Modest ambition/high achievement: Opportunities to stretch ambitions

This framework also provides plenty of potentially useful analytic questions:

  • Are ambitions increasing or decreasing?
  • Is the gap between expected and actual narrowing or widening?
  • For a given level of actual achievement, did differences in expectations have any role or consequences?
  • For a given level of expected change, what might explain the differences in the partners' actual achievements?
  • How do individual partners' positions within this matrix change over time? Are there distinct types of trajectories, and how can these differences be explained?

In summary

A single numerical value based on the data in this matrix would be a meaningless simplification.

In contrast, a scatterplot visualisation can generate multiple potentially useful perspectives. 

It is more useful to see targets as necessarily malleable responses to changing conditions, than as unarguable reference points.

Postscript

There is a type of reinforcement learning algorithm known as Temporal Difference (TD) learning that embodies a very similar process. It is described as "a model-free reinforcement learning method that learns by updating its predictions based on the difference between future predictions and the current estimate". Model-free means it has no built-in model of the world it is working in.
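The parallel can be sketched with a minimal TD(0)-style update, where a partner's expectation is nudged towards the observed change plus a discounted next-period prediction. The learning rate, discount factor and the sequence of observations are all invented for illustration.

```python
alpha, gamma = 0.5, 0.5  # learning rate and discount factor (illustrative values)
estimate = 20.0          # partner's current expectation, e.g. % increase in activity

# Each survey round yields an observed change ("reward") and the partner's
# prediction for the next period; the TD error drives the update.
rounds = [(10.0, 18.0), (12.0, 16.0), (14.0, 15.0)]
for observed, next_prediction in rounds:
    td_error = observed + gamma * next_prediction - estimate
    estimate += alpha * td_error

print(estimate)
```

The successive TD errors here play the same role as the gap between a partner's expected and actual achievement in each reporting cycle.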

When implemented as a human process it is vulnerable to gaming, because the agents (humans) are aware of the system's mechanics, unlike the neural networks or simplified agents typically used in computational TD learning. But one adaptation, suggested by Gemini AI, is to "reward partners not just for the +/-gap, but for the accuracy of their final predictions over multiple cycles". Relatively higher accuracy, over multiple time periods, might be indicative of potentially generalisable / replicable delivery capacity, usable beyond the current context.

Tuesday, June 18, 2024

On two types of Theories of Change: Temporal and atemporal, and how they might be bridged



There are two quite different ways of representing theories of change – of the kind that might be useful when planning and monitoring development programmes of one kind or another.

The first kind is seen in representational devices such as the Logical Framework, Logic Models and boxes-and-arrows type diagrams. These differentiate events according to their location at different points over time, taking place between the initial provision of funding, its allocation and use, and then its subsequent effects and final impacts. These are temporal models.

The second kind, seen much less often, appears in the analyses generated by Qualitative Comparative Analysis (QCA) and by simple machine learning methods known as Decision Trees or classifiers. Here the theory takes the form of multiple configurations of different attributes that are associated with a desired outcome, or its absence. Those attributes may be of the intervention and/or its context. The defining feature of this approach is the focus on cases and differences between cases, rather than on different points or periods in time. These cases are often geographical entities, or groups, or persons, which have some persistence over time. They are effectively atemporal models.

Each of these approaches has its own merits. A theory of change which describes the sequence of expected events over time, and how they relate to each other, is useful for planning, monitoring and evaluation purposes. But it runs the risk of assuming a homogeneity of effects across all locations where it is implemented. On the other hand, a QCA-type configurational approach helps us identify diversity in contexts and implementations, and its consequences. But it may not have any immediate management consequences, about what needs to be done when.

One of my current interests is exploring the possibility of combining these two approaches, such that we have theories of change that differentiate those events over time, while also differentiating cases across space where those events may or may not be happening. 

One paper which I've just been told about is exploring these possibilities, as seen from a QCA starting point: Pagliarin, S., & Gerrits, L. (2020). Trajectory-based Qualitative Comparative Analysis: Accounting for case-based time dynamics. Methodological Innovations, 13. In this paper the authors introduce the innovative idea of cases as different periods of time in the same location, where each of those subsequent periods of time may have various attributes of interest present or absent, along with an outcome of interest being present or absent. This approach seems to have potential for enabling a systematic approach to within-case investigations, complementing what might have been prior cross-case investigations. There is the potential to identify specific attributes, or combinations of these, which are necessary or sufficient for changes to take place within a given case.

Somewhat tangentially...

The same paper reminded me of some evaluation fieldwork I did in Burkina Faso in 1992, where I was interviewing farmers about the history of their development of a small market garden using irrigation water obtained from a nearby lake. Looking back at the history of the garden, which I think was about six years old at the time, I asked them to identify the most significant change that had taken place during that period. They identified the installation of the water pump in year 198?, and pointed out how it expanded the scale of their cultivation thereafter. I can remember also asking, but with less recall of what they then said, follow-up questions about the most significant change that had taken place in each smaller time period either side of that event, and then its consequences. I was in effect asking them to carve up the history of the garden into segments, and sub-segments, of time defined not by calendar but by key events, each of which had consequences. These were in effect temporal "cases". Each of these had a configuration of multiple attributes, i.e. the attributes of the nested set of time periods that it belonged to. Associated with each of these were differing judgements about the productivity of the market garden. But with our team's time in short supply, I never got the opportunity to gather a full data set, so to speak. 

Another of my current interests, prompted by the above conjectures, is the possible use of a specific form of Hierarchical Card Sorting (HCS) as a means of introducing a temporal element into case-based configurational analysis. The HCS process generates a tree structure of nested binary distinctions between cases. It is conceivable that different broad criteria could be introduced for the type of differences being identified at each level of the branching structure. For example, at the top level the "most significant difference" being sought could be specified as being "in terms of funding received", then at the next level, "in terms of outputs generated", and so on (Criteria 1, 2, 3 etc. in Figure 1 below).

Figure 1 below







Wednesday, April 24, 2024

Developing and using a Configurational Theory of Change within an evaluation

 

Figure 1

Ho-hum, yet another evaluation brand being promoted in an already crowded marketplace. FFS... 

Yes, I think this reaction is understandable, but I think there is something here captured under this title (Developing and using a Configurational Theory of Change... ) which has potential value.  I will try to explain...

Many evaluators make use of theories of change, as part of a theory-based approach to evaluation. Many theories of change are described in some type of diagrammatic form. And a typical feature of those diagrams is their convergent nature. That is, they start off with a range of different types of inputs and activities which follow various causal pathways towards a limited number of final outcomes.

This image is almost the complete opposite of what happens in actual practice on the ground. Financial inputs come from a limited number of sources; these become available to a small range of partners who carry out their own range of activities, in a variety of different locations, each with their own populations, including those intended and not intended to be affected. This description is of course a simplification, but it applies to many development aid programme designs. The point I'm making here is that this in-reality process is not convergent, it is divergent! It seems like the diagrammatic theories of change I have described are a type of Procrustean bed.

This blog posting has been most immediately prompted by a report I have just reviewed on potential evaluation strategies for a large national-level climate finance strategy (CFS). The theory of change describes multiple causal pathways connecting the initial provision of government finance through to four expected types of impacts. Within two of these causal pathways alone, the number of projects being funded is in the hundreds. The report struggled with the issue of how to measure the expected impacts given the scale and likely diversity of events on the ground, and the corresponding challenge of how to sample those projects. Part of my diagnosis of the problem here was the evaluation team's measurement-led approach, and the weakness of the conceptual framework, i.e. the incapacity of the theory of change to capture the diversity of what was taking place.

Describing the alternative to my client is now my challenge. I think the alternative has two parts. Firstly, one should start at the beginning, where the money becomes available, and then follow the money (and the people responsible) as it gets distributed according to its intended purposes. If things are not happening as expected early on in this process, then this affects expectations of what might and might not be observable later on in the form of 'outcomes' or 'impacts'. Put crudely, there is no point trying to observe the impact of something that has not yet been delivered. And in the case of strategies like the CFS, a large part of success can simply be getting the money to where it should be spent.

Secondly, as money is distributed from a central fund, decisions are going to be made about how it should be parcelled out in different amounts, for different purposes, through different institutions. Each time that happens, the decisions that have been made about how to do this are hopefully not random. Evaluating how those decisions were made may not necessarily be all that useful, because often there will be opaque micro, meso and macro political processes involved. But the announced decisions may include some intentionally explicit expectations about the official purposes of different allocations. Interviewing those responsible for those allocations might also elicit more informal and more current expectations about what might be the short and longer term effects of some of these allocations, when compared to others. The point I am emphasising here is that sometimes we can come to evaluative judgements not through the use of any overriding predetermined criteria, but by using a more inductive process, where we compare one option to another. This is an excuse for me to quote Marx (G): 

Friend says to Marx – 'Life is difficult'.

Marx replies to friend – 'Compared to what?'

This type of inductive comparative evaluation doesn't have to be completely free-form. It is conceivable, for example, that we could look at two tranches of government climate finance funding and ask (those with proximate responsibilities for that funding) what differences there might be between those blocks of funding in terms of how each might meet one or more of the OECD criteria (these range in their concerns from the more immediate issues of coherence and efficiency to later concerns with effectiveness and impact). Respondents' answers, in the form of expectations, can be seen as mini-theories a.k.a. hypotheses that might then be testable through the gathering of relevant data.  

Before these questions can be posed, the cases that are going to be compared would need to be identified. The 'cases' in this example would be particular blocks of funding. Further along the implementation process the cases could be partners who are receiving funding, or activities that those partners are implementing, or communities those activities are directed towards. Nevertheless, at any point along this chain there is still a challenge, which is how to select cases for comparison. For example, if we are looking at a particular budget document which distributes funding into multiple purpose categories, we will be faced with the question of which of these categories to compare.

One way forward is to let the interviewed person decide, especially if they have responsibilities in this area. Using hierarchical card sorting (HCS), the interviewer starts with a request phrased like this: "What is the most significant difference between all these budget categories in terms of how they will achieve the objectives of the Climate Finance Strategy? Please sort the budget categories into two piles according to this difference and then explain it to me". Having identified piles of types of cases that can be compared, the respondent can then be asked for details about their expectations of the cases in one pile versus the other (See FN1). The same question can then be reiterated by focusing on each of those two piles in turn and getting the respondent to break them into two smaller sub-piles. When their answers are followed by explanations this will help differentiate expectations in further detail.

Figure 2 (click on to enlarge)

Figure 2 shows the results of such an exercise, where the respondents were NGO staff responsible for the development and management of a portfolio of projects. They were asked to sort the projects into two piles according to "what they saw as the most significant difference between the projects, of a kind that would make a difference to what they could achieve". Their choices generated the tree structure. They were then asked to make a series of binary choices at each branching point, identifying which of the two types of projects described there they expected to be most successful, "in terms of the extent to which they will contribute to the achievement of the overall objectives of the portfolio". Their choices are shown by the red links. In this diagram their responses have been sorted such that the preferred red option is always shown above the non-preferred option. The aggregate result is a ranked set of 8 types of projects, with the highest rank (1) at the top. Each of these types is not an isolated category of its own, but part of a configuration that can be read along each branch, from left to right.  
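One way to formalise how the binary choices at each branching point aggregate into a rank order is to read each leaf's root-to-leaf path as binary digits, with the preferred branch counting as 0. This is my own sketch of the arithmetic, not necessarily the exact procedure used to produce Figure 2:

```python
# Illustrative sketch: derive the rank order of leaf "types" from an HCS
# tree in which, at every branch, one of the two sub-piles was judged
# more likely to succeed. Reading a leaf's path as binary digits
# (preferred branch = 0) gives its rank: all-preferred = rank 1.

def rank_of_leaf(path):
    """path: sequence of 0/1 choices from root to leaf, where 0 means
    the branch judged 'more successful' at that split."""
    rank = 0
    for choice in path:
        rank = rank * 2 + choice
    return rank + 1  # ranks start at 1

# A depth-3 tree, as in Figure 2, yields 8 types:
paths = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
ranks = {p: rank_of_leaf(p) for p in paths}
# (0, 0, 0) -> rank 1 (preferred at every split); (1, 1, 1) -> rank 8
```

This matches the visual convention described above: sorting preferred options to the top of the diagram produces exactly this top-to-bottom rank order.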

Here are some of the type descriptions and the reasons why one versus the other was selected as most likely to contribute to the portfolio objectives. Further discussion would be needed to establish how the presence/absence of these characteristics could be identified on the ground.

Wider focus

Aim to influence wider policy and environment, and have more sustainable and wider impact beyond children and their families.

Likely to be more successful: Because it will have a wider reach and be more sustainable

Local focus

More hands on work with children on a day to day basis. Impact may be sustained but it will be limited to children and their families.

Likely to be less successful:

Locally driven

Partner and the projects are locally rooted, driven by local needs and priorities. They are more likely to “get it right”. They can’t walk away when Comic Relief funding ends. More likely to be sustainable.

Likely to be more successful: More embedded in the context, will outlast the project, be more responsive.

UK driven

UK driven projects, almost sub-contracting. They have a set end-point.

Likely to be less successful:

There is a larger question here of course that also relates to sampling. Who are you going to interview in this way? The suggestion above was to 'follow the money'. In other words, to follow lines of responsibility and interview people about the domains of activity they are responsible for, using HCS as a means of structuring the discussion. There is a strategic choice here between what are known as breadth-first and depth-first search strategies. From a given point in a flow of funds (and of responsibilities) there can be distributions going in different directions, each of which could be explored. Following all of these is a form of breadth-first search. Alternatively, the focus could be just on one of those distributions, following the subsequent flow of funding and responsibility further down one (or a few) lines. This is a form of depth-first search. Which of these search strategies to pursue is probably a matter to be decided by the evaluation client. But it may also need to be adaptive, informed by what the evaluation team found in prior interviews.
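A small sketch may make the two strategies concrete. The funding tree below is invented; the two functions return the order in which funding bodies (and hence interviewees) would be visited:

```python
from collections import deque

# Illustrative only: a made-up funding tree, mapping each funder to
# the recipients it distributes money to.
funding = {
    "central fund": ["ministry A", "ministry B"],
    "ministry A": ["region A1", "region A2"],
    "ministry B": ["region B1"],
    "region A1": [], "region A2": [], "region B1": [],
}

def bfs(tree, root):
    """Breadth-first: visit all recipients at one level before going deeper."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(tree[node])
    return order

def dfs(tree, root):
    """Depth-first: follow one line of funding all the way down before
    backtracking to the next."""
    order = [root]
    for child in tree[root]:
        order.extend(dfs(tree, child))
    return order

# bfs(...) interviews both ministries before any region;
# dfs(...) follows ministry A down to its regions first.
```

The choice between the two maps directly onto the interviewing dilemma: breadth-first gives early coverage of the whole portfolio at shallow depth, depth-first gives early depth on one line of responsibility at the cost of coverage.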


Courtesy Jacky Lieu: Comparison of Breadth-First Search and Depth-First Search: Understanding Their Methods and Uses 

But what about aggregation?


If you followed my suggested approach, the closer you got to the people whose lives were of final/main concern, the smaller the segments of all the funding you would be looking at. These would be more comparable with each other than when viewed as part of a larger group, allowing more customised, context-specific assessments of expected and actual impact. But how would you / the evaluation team then be able to make any overall statement about the strategy as a whole?

The way forward is to think of performance measurement in slightly different terms than just using a simple indicator-based measure. Imagine a scatter plot, with one dimension X describing relative i.e. ranked expectations of achievement, and the other dimension Y describing ranked actual/observed/assessed achievements. The entities in the scatter plot are the groups of cases in the smallest available sub-categories that were developed. Their rank position, relative to each other, is evident when all the binary assessments of expected performance are generated through the process described above. See here for more on how this is done. The scatter plot can in turn be summarised in at least two different ways: using a measure of rank correlation (how achievement relates to expectations) and using Classification Accuracy, if and when a minimum rank position of achievement is identified. Equally importantly, qualitative descriptions can be given of cases that exemplify performance that most meets expectations, and the reverse, along with positive and negative deviants (outliers).
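The two summary measures can be sketched as follows, using invented rankings of eight sub-categories. The Spearman formula used here assumes no tied ranks, and the accuracy threshold is just an example:

```python
# Illustrative sketch: summarising an expectations-vs-achievements scatter
# plot with (a) Spearman rank correlation and (b) classification accuracy
# against a minimum achievement rank. All data are made up.

def spearman(expected_ranks, actual_ranks):
    """Spearman's rho for two rankings with no tied ranks."""
    n = len(expected_ranks)
    d_squared = sum((e - a) ** 2 for e, a in zip(expected_ranks, actual_ranks))
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))

def classification_accuracy(expected_ranks, actual_ranks, threshold):
    """Share of cases where 'expected high achiever' (expected rank <=
    threshold) agrees with 'actual high achiever' (actual rank <= threshold)."""
    agree = sum((e <= threshold) == (a <= threshold)
                for e, a in zip(expected_ranks, actual_ranks))
    return agree / len(expected_ranks)

expected = [1, 2, 3, 4, 5, 6, 7, 8]   # ranked expectations
actual   = [2, 1, 3, 5, 4, 8, 6, 7]   # ranked observed achievement

rho = spearman(expected, actual)   # close to 1: achievement met expectations
acc = classification_accuracy(expected, actual, threshold=4)
```

A high rho says expectations were well calibrated across the whole ranking; the accuracy figure says how often the "above the minimum rank" judgement would have been correct.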

What we could end up with is a tree structure documenting multiple routes to both high and low performance, implemented in widely differing contexts (describable at different levels of scale).

Other scatter plot designs are more relevant to assessments of strategies. The ranking generated by Figure 2 was plotted against the age of the projects and their grant size, which might be expected to be influenced by the contents of a funding strategy. Neither of these two measures showed any relationship to perceived strategic priorities!

To be continued....


PS1: When asking about expected effects of one type of allocation versus another, it may make sense to encourage a focus on more immediately expected effects first, and then later ones. They may be more likely, more easily articulated and more evaluable.

PS2: Hughes-McLure, S. (2022). Follow the money. Environment and Planning A: Economy and Space, 54(7), 1299–1322. https://doi.org/10.1177/0308518X221103267










Friday, December 22, 2023

Using the Confusion Matrix as a general-purpose analytic framework


Background

This posting has been prompted by work I have done this year for the World Food Programme (WFP) as member of their Evaluation Methods Advisory Panel (EMAP). One task was to carry out a review, along with colleague Mike Reynolds, of the methods used in the 2023 Country Strategic Plans evaluations. You will be able to read about these, and related work, in a forthcoming report on the panel's work, which I will link to here when it becomes available.

One of the many findings of potential interest was: "there were relatively very few references to how data would be analysed, especially compared to the detailed description of data collection methods". In my own experience, this problem is widespread, found well beyond WFP. In the same report I proposed the use of what is known as the Confusion Matrix, as a general purpose analytic framework. Not as the only framework, but as one that could be used alongside more specific frameworks associated with particular intervention theories such as those derived from the social sciences.

What is a Confusion Matrix?

A Confusion Matrix is a type of truth table, i.e., a table representing all the logically possible combinations of two variables or characteristics. In an evaluation context these two characteristics could be the presence and absence of an intervention, and the presence and absence of an outcome. An intervention represents a specific theory (aka model), which includes a prediction that a specific type of outcome will occur if the intervention is implemented. In the 2 x 2 version you can see above, there are four types of possibilities:

  1. The intervention is present and the outcome is present. Cases like this are known as True Positives.
  2. The intervention is present but the outcome is absent. Cases like this are known as False Positives.
  3. The intervention is absent and the outcome is absent. Cases like this are known as True Negatives.
  4. The intervention is absent but the outcome is present. Cases like this are known as False Negatives.
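As a minimal sketch, the four categories above can be tallied directly from two lists of booleans recording, for each case, whether the intervention and the outcome were present. The case data here are invented:

```python
# Illustrative: tallying cases into the four Confusion Matrix cells from
# two parallel lists of booleans (intervention present?, outcome present?).

def confusion_matrix(intervention, outcome):
    cells = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for i, o in zip(intervention, outcome):
        if i and o:
            cells["TP"] += 1   # intervention present, outcome present
        elif i and not o:
            cells["FP"] += 1   # intervention present, outcome absent
        elif not i and not o:
            cells["TN"] += 1   # intervention absent, outcome absent
        else:
            cells["FN"] += 1   # intervention absent, outcome present
    return cells

cases_intervention = [True, True, True, False, False, False]
cases_outcome      = [True, True, False, False, True, False]
cells = confusion_matrix(cases_intervention, cases_outcome)
# -> {"TP": 2, "FP": 1, "TN": 2, "FN": 1}
```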
Common uses of the Confusion Matrix

The use of Confusion Matrices is most commonly associated with the field of machine learning and predictive analytics, but it has much wider application. These include the fields of medical diagnostic testing, predictive maintenance, fraud detection, customer churn prediction, remote sensing and geospatial analysis, cyber security, computer vision, and natural language processing. In these applications the Confusion Matrix is populated by the number of cases falling into each of the four categories. These numbers are in turn the basis of a wide range of performance measures, which are described in detail in the Wikipedia article on the Confusion Matrix. A selection of these is described in this blog on the use of the EvalC3 Excel app.

The claim

Although the use of a Confusion Matrix is commonly associated with quantitative analyses of performance, such as the accuracy of predictive models, it can also be a useful framework for thinking in more qualitative terms. This is a less well known and publicised use, which I elaborate on below. It is the inclusion of this wider potential use that is the basis of my claim that the Confusion Matrix can be seen as a general-purpose analytic framework.

The supporting arguments

The claim has at least four main arguments:

  1. The structure of the Confusion Matrix serves as a useful reminder and checklist that at least four different kinds of cases should be sought after, when constructing and/or evaluating a claim that X (e.g. an intervention) led to Y (e.g. an outcome). 
    1. True Positive cases, which we will usually start looking for first of all. At worst, this is all we look for.
    2. False Positive cases, which we are often advised to look for, but often don't invest much time in actually doing so. Here we can learn what does not work, and why.
    3. False Negative cases, which we probably seek even less often. Here we can learn what else works, and perhaps why.
    4. True Negative cases, because sometimes there are asymmetric causes at play, i.e. not just the absence of the expected causes.
  2. The contents of the Confusion Matrix help us to identify interventions that are necessary, sufficient, or both. This can be practically useful knowledge.
    1. If there are no FP cases, this suggests an intervention is sufficient for the outcome to occur. The more cases we investigate without finding an FP, the stronger this suggestion is. But if only one FP is found, that tells us the intervention is not sufficient. Single cases can be informative. Large numbers of cases are not always needed.
    2. If there are no FN cases, this suggests an intervention is necessary for the outcome to occur. The more cases we investigate without finding an FN, the stronger this suggestion is. But if only one FN is found, that tells us the intervention is not necessary. 
    3. If there are no FP or FN cases, this suggests an intervention is both sufficient and necessary for the outcome to occur. The more cases we investigate without finding an FP or FN, the stronger this suggestion is. But if even one FP, or one FN, is found, that tells us that the intervention is not sufficient, or not necessary, respectively. 
  3. The contents of the Confusion Matrix help us identify the type and scale of errors, and their acceptability. FP and FN cases are two different types of error that have different consequences in different contexts. A brain surgeon will be looking for an intervention that has a very low FP rate, because errors in brain surgery can be fatal, and so cannot be recovered from. On the other hand, a stock market investor is likely to be looking for a more general purpose model, with few FNs. However, it only has to be right 55% of the time to still make them money, so a high rate of FPs may not be a big concern. They can recover their losses through further trading. In the field of humanitarian assistance the corresponding concerns are with coverage (reaching all those in need, i.e. minimising False Negatives) and leakage (minimising inclusion of those not in need, i.e. False Positives). There are Confusion Matrix based performance measures for both kinds of error, and for the degree to which both kinds of error are balanced (see the Wikipedia entry).
  4. The contents of the Confusion Matrix can help us identify useful case studies for comparison purposes. These can include:
    1. Cases which exemplify the True Positive results, where the model (e.g. an intervention) correctly predicted the presence of the outcome. Look within these cases to find any likely causal mechanisms connecting the intervention and outcome. Two sub-types can be useful to compare:
      1. Modal cases, which represent the most common characteristics seen in this group, taking all comparable attributes into account, not just those within the prediction model. 
      2. Outlier cases, which represent those which were most dissimilar to all other cases in this group, apart from having the same prediction model characteristics
    2. Cases which exemplify the False Positives, where the model incorrectly predicted the presence of the outcome. There are at least two possible explanations that can be explored:
      1. In the False Positive cases, there are one or more other factors that all the cases have in common, which are blocking the model configuration from working, i.e. from delivering the outcome
      2. In the True Positive cases, there are one or more other factors that all the cases have in common, which are enabling the model configuration to work, i.e. to deliver the outcome, but which are absent in the False Positive cases
        1. Note: For comparisons with TP cases, TP and FP cases should be maximally similar in their case attributes. I think this is called MSDO (most similar, different outcome) based case selection
    3. Cases which exemplify the False Negatives, where the outcome occurred despite the absence of the attributes of the model. There are three possibilities of interest here:
      1. There may be some False Negative cases that have all but one of the attributes found in the prediction model. These cases would be worth examining, in order to understand why the absence of a particular attribute that is part of the predictive model does not prevent the outcome from occurring. There may be some counter-balancing enabling factor at work, enabling the outcome.
      2. It is possible that some cases have been classed as FNs because they lacked specific data on crucial attributes that would otherwise have classed them as TPs.
      3. Other cases may represent genuine alternatives, which need within-case investigation to identify the attributes that appear to make them successful 
    4. Cases which exemplify the True Negatives, where the absence of the attributes of the model is associated with the absence of the outcome.
      1. Normally these are seen as not being of much interest. But there may be cases here with all but one of the intervention attributes. If found, then the missing attribute may be viewed as: 
        1. A necessary attribute, without which the outcome cannot occur
        2. An INUS attribute i.e. an attribute that is Insufficient but Necessary in a configuration that is Unnecessary but Sufficient for the outcome (See Befani, 2016). It would then be worth investigating how these critical attributes have their effects by doing a detailed within-case analysis of the cases with the critical missing attribute.
      2. Cases may become TNs for two reasons. The first, and most expected, is that the causes of positive outcomes are absent. The second, which is worth investigating, is that there are additional and different causes at work which are causing the outcome to be absent. The first of these is described as causal symmetry, the second as causal asymmetry. Because of the second possibility it is worthwhile paying close attention to TN cases, to identify the extent to which symmetrical or asymmetrical causes are at work. The findings could have significant implications for any intervention that is being designed. Here a useful comparison would be between maximally similar TP and TN cases.
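The rules of thumb in point 2, and the humanitarian coverage measure mentioned in point 3, can be sketched as simple checks on the cell counts. The counts below are invented:

```python
# Illustrative sketch: with no FP cases the evidence is consistent with
# the intervention being sufficient; with no FN cases, with it being
# necessary. A single counter-example refutes either claim.

def sufficiency_necessity(cells):
    return {
        "consistent_with_sufficient": cells["FP"] == 0,
        "consistent_with_necessary": cells["FN"] == 0,
    }

def coverage(cells):
    """Humanitarian 'coverage': share of those in need who were reached,
    i.e. TP / (TP + FN)."""
    return cells["TP"] / (cells["TP"] + cells["FN"])

# No FPs found so far: consistent with the intervention being sufficient,
# though three FNs already show it is not necessary
a = sufficiency_necessity({"TP": 12, "FP": 0, "TN": 5, "FN": 3})

# One FN and two FPs observed: neither sufficiency nor necessity holds
b = sufficiency_necessity({"TP": 12, "FP": 2, "TN": 5, "FN": 1})
```

Note that "consistent with" is the right reading: absence of counter-examples strengthens the suggestion with every additional case investigated, but never proves it.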
Resources

Some of you may know that I have built the Confusion Matrix into the design of EvalC3, an Excel app for cross-case analysis that combines measurement concepts from the disparate fields of machine learning and QCA (Qualitative Comparative Analysis). With fair winds, this should become available as a free-to-use web app in early 2024, courtesy of a team at Sheffield Hallam University. There you will be able to explore and exploit the uses of the Confusion Matrix for both quantitative and qualitative analyses.