Sunday, March 22, 2026

Introducing Rank Order Counterfactuals (ROC)



A counterfactual is a description of what would have happened if an intervention had not taken place. The use of randomised control groups is one way to construct a counterfactual: a population of people is randomly assigned to either a control group or an intervention group, and differences in the outcomes of those two groups are then compared. If the difference is statistically significant, a plausible causal claim can be made that the difference in outcomes is due to the intervention.

As might be expected, there are plenty of circumstances, when social programmes are designed and implemented, where it is simply not practical to organise a randomised control group. In addition, the comparisons made will be between averages of the two groups. However, in many social programmes such averages are of limited practical use, because the implementation contexts are so varied that no single “solution” is likely to be applicable. Average effects can still be informative at a high level, but they need to be complemented by methods that take contextual diversity seriously.

I'm currently working with an evaluation team that is examining a large-scale public health programme in the United Kingdom, covering many different locations and involving many different types of local partnerships, but with one common outcome of concern: increasing people's physical activity levels in their daily lives. The evaluation team is already making use of a causal configurational approach to understanding what works for whom in what circumstances. It is finding different configurations of causal conditions across these locations that are associated with changes in activity levels. This approach is consistent with the high level of diversity in locations, partnerships and interventions.

But what it does not yet have is a counterfactual: a defensible description of what might have happened in these locations in the absence of this intervention. This is where the idea of a rank order counterfactual becomes relevant. By a rank‑order counterfactual I mean a very specific kind of “what would have happened otherwise”. Instead of trying to predict the exact outcome that would have been achieved in each location without an intervention, we can start by asking a simpler, comparative question: which locations would probably have changed more, and which less, if the intervention had never existed? The answer will be in the form of a rank ordering of locations, from most to least expected change, constructed from all available baseline information, trends, and contextual knowledge. This proposed approach falls into a category of counterfactuals known as "logically constructed counterfactuals", and it aligns well with configurational evaluation because it focuses on patterns of relative change across diverse contexts.

A subsequent evaluation of those same locations should also be able to generate a new rank ordering, based on observed outcomes. These counterfactual and actual rankings can then be compared, using a scatterplot and correlation measures. The scatterplot is also visually powerful for communication: it lets people see at a glance which locations behave as expected and which ones stand out as surprises. If the intervention had no effect, the observed and counterfactual rankings should be much the same, with locations falling close to the diagonal. If the intervention had positive, or perhaps even negative, effects, this should not happen: we might see various locations that are outliers from that expected trend. When locations we expected to be “natural leaders” did not improve much, and those we expected to be “natural laggards” moved to the top of the league table, that pattern is a signal that the intervention may have been influential, and it gives the evaluation team clear cases where alternative explanations should be probed, rather than assuming the intervention is the only possible cause. The rankings are not a substitute for theory‑based evaluation; they are a way to make its claims sharper and more testable. The focus on ranking differences can convert a vague theory (“we think these factors matter”) into a concrete, specific prediction about which locations should do better.
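
The comparison step above can be sketched in a few lines of Python. The location labels, rank values, and the "more than two positions" surprise threshold below are all invented for illustration:

```python
def spearman_rho(rank_a, rank_b):
    """Spearman correlation for two tie-free rankings of the same items."""
    n = len(rank_a)
    d_squared = sum((rank_a[loc] - rank_b[loc]) ** 2 for loc in rank_a)
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))

# 1 = most expected/observed change, 6 = least (hypothetical values)
counterfactual = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}
observed       = {"A": 5, "B": 2, "C": 1, "D": 4, "E": 6, "F": 3}

rho = spearman_rho(counterfactual, observed)

# Locations whose rank shifted by more than two positions are the
# surprises worth investigating for alternative explanations
surprises = [loc for loc in counterfactual
             if abs(counterfactual[loc] - observed[loc]) > 2]
```

A correlation near 1 would suggest the programme made little difference to relative change; a weak or negative correlation, combined with a short list of large movers, points to where follow-up inquiry should focus.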

The sensitivity of the rank comparison process will depend on the number of ranked items. The more rank positions there are, the more sensitivity there will be to differences in performance, which is good. But, as shown in research on sorting algorithms, the time required to generate a complete ordering, using any of the well-known methods, can be significant, growing faster than proportionally to the number of items, though far slower than exponentially. In addition to the extra time required, a rank order counterfactual will require a stronger evidential base where the number of rank positions is greater.
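
To give a feel for that growth rate, here is a quick back-of-the-envelope calculation: comparison-based sorting needs on the order of n·log₂(n) pairwise judgements, so the elicitation burden grows faster than the number of locations itself:

```python
import math

# Approximate pairwise judgements needed to fully rank n items,
# using the n*log2(n) figure for comparison-based sorting
burden = {n: round(n * math.log2(n)) for n in (10, 20, 40, 80)}
```

Roughly: 33 comparisons for ten locations, but about 506 for eighty, so eight times the locations costs around fifteen times the comparisons.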

When a large number of locations are involved in an intervention, one practical way of addressing this tension is to use a stratified random sample, and to generate rankings for that sample only. Another approach is to think in terms of ranked bands of locations rather than individual rankings for each location. What should then be of interest are systematic shifts in band membership between the counterfactual and actual rankings – for example, locations expected to be in the “low‑change” band turning up in the “high‑change” band in practice.
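
The banding idea can be sketched as follows; the location labels, ranks, and band size of two are invented for illustration:

```python
def band_of(rank, band_size):
    """Convert a rank position into a band index (band 0 = highest-change)."""
    return (rank - 1) // band_size

# Hypothetical counterfactual and observed rankings (1 = most change)
counterfactual = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}
observed       = {"A": 5, "B": 2, "C": 1, "D": 4, "E": 6, "F": 3}

# Locations whose band membership shifted between the two rankings
shifted = [loc for loc in counterfactual
           if band_of(counterfactual[loc], 2) != band_of(observed[loc], 2)]
```

Working with bands means small rank wobbles within a band are ignored, and only the larger, more defensible shifts between bands need explaining.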

In this way, rank‑order counterfactuals do not replace theory‑based evaluation, but sharpen it: they turn general expectations about context into explicit, testable predictions about who should have changed most in the absence of the programme. In work which I hope to present at the European Evaluation Society conference later this year, I will explain how a hierarchical card sorting process was used to generate argument- and evidence-based counterfactual rank orderings, and how an LLM was used to support this process.


Saturday, March 07, 2026

Making implicit knowledge explicit, contestable and usable: an introduction to The Ethnographic Explorer



Caveat Emptor: I delegated the writing of this blog posting to Claude AI (Sonnet 4.6), based on an extended prior dialogue on the subject below, then a summary prompt of what was wanted. My post-production edits were quite limited.
--o0o--

There is a problem that turns up repeatedly in evaluation practice — and in many other fields — that many of us work around rather than solve directly. The problem is this: the people who know most about a programme, a portfolio, or a set of cases often cannot easily say what they know, or why they make the judgements they do.

This is not a failure of intelligence. It is the normal condition of what Michael Polanyi called tacit knowledge — the "we know more than we can tell" problem that is endemic to any field built on experience and judgement. The challenge for evaluation is to find structured ways of drawing it out.

The theoretical anchor: information as difference

The approach behind The Ethnographic Explorer draws on a deceptively simple idea from Gregory Bateson: information is a difference that makes a difference. What we notice, and consider significant, is always defined by contrast — not by properties of things in isolation. A project is "successful" relative to others; a group is "vulnerable" in comparison to other groups in a given context. And some of those differences have noticeable consequences, they "make a difference". Together they can be seen as simple "IF...THEN..." rules.

This suggests that a structured method for eliciting knowledge should be built around comparison — specifically, around asking people to identify and articulate differences, and then to explain what those differences imply. The Hierarchical Card Sort methodology is one way of doing this and is the basis of The Ethnographic Explorer's process of inquiry.

How the Ethnographic Explorer works

There are three linked stages.

In the Sort stage, a respondent is presented with a set of cases — projects, organisations, events, beneficiaries, or any entities they know well. They are asked to divide all the cases into two groups representing the most significant difference between them, from their point of view. Each group is named, the difference is recorded, and then the respondent is asked: "What difference does this difference make?" — a question that surfaces the consequence or implication of the distinction. The process repeats on each subgroup until every group contains a single case. The result is a binary tree: a structured, hierarchical map of the respondent's view of the case set.

The tree is informative in three ways: it reveals the contents of the distinctions the respondent considers important; it identifies the limits of their knowledge (where further differences cannot be found); and it indicates the direction of their attention (where further distinctions could usefully be explored).
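
The binary tree the sort produces can be represented as a simple nested structure. The field names, case labels, and differences below are illustrative only, not the tool's actual data schema:

```python
# A sketch of a Hierarchical Card Sort tree: each split records the
# named difference, its stated consequence, and two named subgroups;
# leaves hold a single case.
tree = {
    "difference": "community-led vs council-led",
    "consequence": "community-led projects adapt faster to local needs",
    "children": [
        {"name": "community-led",
         "difference": "urban vs rural",
         "consequence": "rural projects depend more on volunteer transport",
         "children": [
             {"name": "urban", "cases": ["Project A"]},
             {"name": "rural", "cases": ["Project B"]},
         ]},
        {"name": "council-led", "cases": ["Project C"]},
    ],
}

def leaves(node):
    """Walk the tree and collect the single-case groups at the bottom."""
    if "children" not in node:
        return node["cases"]
    return [case for child in node["children"] for case in leaves(child)]
```

Where a node still holds multiple cases but no further difference, that is exactly the "limit of knowledge" the tree makes visible.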

In the Compare stage, the facilitator asks comparison questions at each split in the tree. Questions can be in degree ("Which group of cases is more likely to face sustainability problems?") or in kind ("How do these two groups differ in terms of their relationship with local government?"). Degree questions produce a ranking of all cases; kind questions produce descriptive contrasts. Both types build on the structure already revealed by the sort.

In the Contrast stage, any two degree-based rankings are plotted against each other in a scatter plot. The resulting quadrant analysis shows where the two rankings agree (cases high on both, or low on both) and where they diverge (cases high on one and low on the other). Adjustable cutoff sliders allow the facilitator to explore different thresholds, and a Spearman correlation coefficient summarises the overall relationship. Outlier cases — those that diverge most between the two rankings — are often the most analytically interesting. Strong relationships can be cast as potentially useful IF...THEN rules.
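
The quadrant classification amounts to a few lines of logic. The two ranking names, the case labels, and the cutoff below are invented for illustration:

```python
def quadrants(rank_x, rank_y, cutoff):
    """Assign each case to a scatter-plot quadrant.

    Ranks run 1..n with 1 = highest; a case counts as 'high' on an
    axis when its rank is at or above the cutoff position.
    """
    out = {"high-high": [], "high-low": [], "low-high": [], "low-low": []}
    for case in rank_x:
        x = "high" if rank_x[case] <= cutoff else "low"
        y = "high" if rank_y[case] <= cutoff else "low"
        out[f"{x}-{y}"].append(case)
    return out

# Two hypothetical degree rankings of the same four cases
sustainability = {"P1": 1, "P2": 2, "P3": 3, "P4": 4}
local_support  = {"P1": 4, "P2": 1, "P3": 2, "P4": 3}

q = quadrants(sustainability, local_support, cutoff=2)
# The "high-low" and "low-high" lists hold the outlier cases
```

Moving the cutoff is what the sliders in the tool do: cases migrate between quadrants, and the interesting question is which cases stay in the off-diagonal quadrants across a range of thresholds.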

The Ethnographic Explorer

With substantial coding help from Claude AI, I have been developing a browser-based implementation of this methodology — The Ethnographic Explorer — as a standalone single-file application. This new version supersedes an earlier WordPress-based tool at ethnographic.mande.co.uk, which required a server to run. The new version requires nothing beyond a web browser.

▶ Try The Ethnographic Explorer [When you get there, click on Introduction for guidance on how to explore the tool's functions]

The tool is designed for use in a shared-screen video call with a single respondent, or screened to multiple participants in a workshop setting. The facilitator drives the interface; the respondent provides the knowledge. A typical exercise with 8–12 cases takes between 45 minutes and two hours, depending on the range of comparisons made.

A worked example: 12 Largest cities

Seven JSON files are available alongside the app to let you explore its use. The first is simply a list, which you can then sort, compare and contrast. The other six are the same list, already sorted by Claude AI at my request, on the basis of “their appeal to international tourists, as seen from 6 different perspectives”.

To load a dataset, click Import JSON in the top-right toolbar and select the file (after downloading it to your computer).

Download 12 Largest Cities (unsorted)

Download 12 Largest Cities - Budget Backpacker view (sorted)

Download 12 Largest Cities - Food and Culture view (sorted)

Download 12 Largest Cities - Safety Conscious view (sorted)

Download 12 Largest Cities - Sustainable Tourism view (sorted)

Download 12 Largest Cities - Travel Journalist view (sorted)

Download 12 Largest Cities - Heritage Specialist view (sorted)

What the tool produces

Each exercise generates a structured record of the respondent's knowledge about a set of cases:

  • a binary sort tree, with each split labelled by the most significant difference and its consequence
  • one or more named rankings, each derived from a sequence of binary degree judgements working from the root set down through the subsets
  • one or more descriptive contrasts, capturing how subgroups differ in kind on specified attributes
  • a scatter plot for any pair of degree rankings, with quadrant analysis and Spearman correlation
  • a qualitative responses panel, collecting the in-kind descriptions at each split

The full exercise exports as a JSON file, preserving the complete tree structure, all difference descriptions, all responses, and the exercise metadata. Files can be imported to resume exactly where you left off, or shared with a colleague for further analysis.
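
As a rough illustration of what a resumable, shareable record can look like, here is a Python sketch of a JSON round trip. The field names and contents are invented, not the tool's actual export schema:

```python
import json

# Invented structure -- illustrates a self-contained exercise record
# that survives an export/import round trip
exercise = {
    "metadata": {"title": "Completed projects review", "cases": ["P1", "P2"]},
    "tree": {
        "difference": "funder-driven vs demand-driven",
        "consequence": "demand-driven projects retained staff longer",
        "children": [
            {"name": "funder-driven", "cases": ["P1"]},
            {"name": "demand-driven", "cases": ["P2"]},
        ],
    },
    "rankings": {"sustainability risk": {"P1": 1, "P2": 2}},
    "responses": [{"split": "root", "answer": "They differ in local buy-in."}],
}

saved = json.dumps(exercise, indent=2)   # export to file contents
restored = json.loads(saved)             # import to resume later
```

Because everything lives in one nested structure, a colleague who imports the file sees the full tree, rankings, and qualitative responses exactly as they were left.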

Beyond evaluation

The underlying structure of the method applies wherever you have a set of cases and a respondent who has differentiated knowledge about them. Some other framings that could be explored with the same tool:

  • Organisational learning: a team reviews a set of completed projects, identifying the distinctions that, in retrospect, most predicted success or failure
  • Capacity assessment: a trainer sorts a cohort of staff by their readiness for different kinds of work, making the basis for those judgements explicit and discussable
  • Stakeholder analysis: a key informant sorts a set of stakeholder organisations, revealing the distinctions they consider most consequential for programme implementation
  • Policy analysis: a policy analyst sorts a set of interventions by their perceived effectiveness, then compares that ranking against a ranking of their political feasibility

In each case, the sort structure is the same, the comparison logic is identical, and only the cases, the domain framing, and the named differences change.

An invitation to experiment

The tool is best understood by using it. I would encourage anyone engaged in evaluation, learning, or knowledge management work to try loading in a set of cases they know well — even with a rough sort to start — and see what structure emerges. The Compare and Contrast stages are particularly useful for surfacing assumptions that are rarely made explicit in standard reporting.

I am continuing to develop the tool and would welcome feedback on the methodology, the interface, or applications I have not yet considered.

Accessing the code: In addition to trying the app online, you can download a copy of the code and run it independently. Go to the app online, right-click, select View Page Source, copy the entire code, paste it into a text file, rename the file to end in .html rather than .txt, and open it in a web browser. The tool runs entirely in your browser — no login required, no data sent anywhere.

Further reading

  • Polanyi, M. (1966). The Tacit Dimension. Doubleday. — the source of the "we know more than we can tell" framing.
  • Bateson, G. (1972). Steps to an Ecology of Mind. Ballantine Books. — the source of the "difference that makes a difference" framing.
  • Kelly, G.A. (1955). The Psychology of Personal Constructs. Norton. — the intellectual precursor to card sorting methods.
  • The 2025 WordPress version of the tool, with documentation and worked examples, is at ethnographic.mande.co.uk
  • Help pages for the comparison stage, with nine worked examples of question types, are at the Comparisons Help page

Tuesday, March 03, 2026

Optimising method selection: an introduction to the set covering problem — and a tool to help





Caveat Emptor: I delegated the writing of this blog posting to Claude AI (Sonnet 4.6), based on an extended prior dialogue on the subject below, then a summary prompt of what was wanted. My post-production edits were quite limited.
--o0o--

There is a class of problem that turns up repeatedly in evaluation planning, and in many other fields, that most of us solve informally and imprecisely. The problem is this: given a set of needs to be addressed, what is the smallest combination of responses that covers all of them?

In evaluation this might be: given a set of evaluation questions I need to answer, what is the minimum combination of methods that gives me adequate coverage across all of them? In public health it might be: which combination of clinics ensures every neighbourhood has access to at least one? In software testing: which set of test cases exercises every code path? In logistics: which depots can serve every delivery zone?

These are all instances of what computer scientists call the Set Covering Problem — a well-studied problem in combinatorial optimisation that dates back to the 1970s. The formal version asks: given a collection of sets, find the smallest sub-collection whose union covers all elements of a target universe.

Why it matters for evaluation practice

When designing an evaluation, practitioners face exactly this structure. We have a list of evaluation questions (or dimensions of quality, or stakeholder concerns), and a repertoire of methods — each of which can address some but not all of those questions to a satisfactory level. Choosing methods one at a time, based on familiarity or habit, rarely produces the most efficient combination. We either end up with redundant overlaps in some areas and blind spots in others, or with far more methods than the budget or timeline can support.

A more systematic approach asks: which combination of methods is both complete (covers all questions at an adequate level) and minimal (uses as few methods, and as little resource, as possible)?

How the optimisation works

For a small number of methods and questions, you could check all possible combinations by hand. But the number of combinations grows exponentially — with ten methods, there are over a thousand possible subsets to evaluate. This is where an algorithm helps.

The simplest approach is a greedy algorithm: at each step, pick the method that covers the most currently-uncovered questions, then repeat until everything is covered. This is fast and usually finds a good solution, but not necessarily the best one.
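
The greedy heuristic fits in a few lines of Python. The methods below and the question types each one covers adequately are hypothetical ratings for illustration, not the tool's default matrix:

```python
def greedy_cover(methods, questions):
    """Repeatedly pick the method covering the most still-uncovered
    questions, until everything is covered."""
    uncovered = set(questions)
    chosen = []
    while uncovered:
        best = max(methods, key=lambda m: len(methods[m] & uncovered))
        if not methods[best] & uncovered:
            raise ValueError(f"no method covers: {uncovered}")
        chosen.append(best)
        uncovered -= methods[best]
    return chosen

# Hypothetical coverage: which question types each method addresses well
methods = {
    "Scenario Planning": {"Descriptive", "Predictive", "Prescriptive"},
    "Delphi":            {"Predictive", "Valuative"},
    "Horizon Scanning":  {"Descriptive", "Explanatory"},
    "Causal Mapping":    {"Explanatory", "Valuative"},
}
questions = {"Descriptive", "Valuative", "Explanatory",
             "Predictive", "Prescriptive"}

plan = greedy_cover(methods, questions)
```

On this toy matrix the greedy choice happens to be optimal (two methods suffice), but on other matrices the first, locally-best pick can force a larger total selection.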

A more thorough approach — exhaustive search — systematically checks all combinations up to a specified size and returns every minimal solution. This is slower but reveals the full landscape of equally-good options, which is often more useful than a single answer, particularly when cost or other practical constraints come into play.
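
Exhaustive search can also be sketched briefly. The coverage sets here are hypothetical, chosen so that two equally small solutions exist, to show the "landscape of options" point:

```python
from itertools import combinations

def minimal_covers(methods, questions):
    """Check combinations in increasing size order and return every
    combination of the smallest size that covers all questions."""
    names = list(methods)
    for size in range(1, len(names) + 1):
        found = [combo for combo in combinations(names, size)
                 if set().union(*(methods[m] for m in combo)) >= questions]
        if found:
            return found  # all covers of the minimal size
    return []

# Hypothetical coverage sets for illustration
methods = {
    "Scenario Planning": {"Descriptive", "Predictive", "Prescriptive"},
    "Delphi":            {"Predictive", "Valuative"},
    "Horizon Scanning":  {"Descriptive", "Explanatory"},
    "Causal Mapping":    {"Explanatory", "Valuative"},
    "Trend Analysis":    {"Descriptive", "Valuative", "Explanatory"},
}
questions = {"Descriptive", "Valuative", "Explanatory",
             "Predictive", "Prescriptive"}

solutions = minimal_covers(methods, questions)
```

Here two distinct pairs both achieve full coverage, which is exactly the situation where cost, feasibility, or overlap can decide between otherwise equally-minimal options.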

The Coverage Optimiser

With substantial coding help from Claude AI, I have been developing a small browser-based tool — the Coverage Optimiser — that applies both approaches to a user-defined matrix of methods and question types.

▶  Try the Coverage Optimiser [When you get there, click on Introduction to get suggestions on how to explore the functions of the app]

The default example matrix uses ten foresight methods (Scenario Planning, Delphi, Horizon Scanning, and others) rated HIGH, MEDIUM, or LOW against five question types — Descriptive, Valuative, Explanatory, Predictive, and Prescriptive — in terms of their usefulness in building a futures perspective into an evaluation. The tool finds the minimal combinations of methods that achieve the desired coverage level across all question types.

Each solution is displayed with:

  • the methods involved, each with its cost rating
  • a question-by-question coverage check
  • the total cost of the combination
  • an overlap score — the number of questions covered by more than one method in the solution

Overlap is worth attending to: it indicates redundancy, which in evaluation terms means resilience. If one method proves impractical in the field, a solution with higher overlap is more likely to remain viable.
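
The overlap score itself is a small calculation; a sketch with hypothetical coverage sets:

```python
from collections import Counter

def overlap_score(solution, methods):
    """Count the questions covered by more than one method in a solution --
    redundancy that doubles as resilience if one method falls through."""
    counts = Counter(q for m in solution for q in methods[m])
    return sum(1 for n in counts.values() if n > 1)

# Hypothetical coverage sets for illustration
methods = {
    "Scenario Planning": {"Descriptive", "Predictive", "Prescriptive"},
    "Delphi":            {"Predictive", "Valuative"},
    "Horizon Scanning":  {"Descriptive", "Explanatory"},
}

score = overlap_score(list(methods), methods)
```

In this toy solution two questions (Descriptive and Predictive) are each covered twice, so losing any one method still leaves those questions addressed.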

The matrix itself is fully editable in a dedicated Matrix Editor tab. You can rename methods and question types, adjust HIGH/MEDIUM/LOW ratings against a chosen criterion, set importance weights for each question type (0–10), set cost ratings for each method (0–10), sort rows by any column, import and export as JSON, and print or save the matrix or results as PDF.

Note: The Coverage Optimiser runs entirely in your browser — no login required, no data sent anywhere.

Beyond methods and questions

The tool is not limited to foresight methods or evaluation questions. The underlying logic applies wherever you have a set of options and a set of requirements, and where each option partially addresses some requirements. Some other framings that could be loaded into the same tool:

  • Solutions × Problems: which combination of policy interventions covers the broadest range of identified problems?
  • Stakeholders × Information needs: which combination of engagement activities ensures all key stakeholder groups have their core information needs met?
  • Data sources × Indicators: which combination of data collection instruments covers all required indicators, at minimum cost?
  • Partners × Geographic areas: which combination of implementing partners ensures all target districts are reached?

In each case the matrix structure is the same, the optimisation logic is identical, and only the labels and rating criteria change.

An invitation to experiment

The tool is best understood by using it. I would encourage anyone planning a multi-method evaluation, or foresight exercise, to load in their own methods and questions — even with rough ratings to start — and see what combinations emerge. The exhaustive search mode is particularly useful for revealing that several equally-minimal combinations exist, which opens up a more deliberate conversation about which is preferable given cost, feasibility, or complementarity.

I am continuing to develop the tool and would welcome feedback on the matrix structure, the rating scales, or applications I have not yet considered.

Those who have used EvalC3 will find a family resemblance here: both tools use systematic search to find efficient combinations — EvalC3 searching for attribute combinations that predict specific individual outcomes, the Coverage Optimiser searching for method combinations that cover multiple requirements. The underlying computational logic is related, even if the problems look different on the surface. 

Accessing the code: In addition to trying out the app, as it already exists online, you can also download a copy of the code and have it working independently on your own website. Go to the app online, right-click your mouse, select View Page Source, copy the entire code, paste it into a text file, rename the file to end in .html rather than .txt, and click on that file to open it in a web browser. Simples, yes?


Further reading


  • A lot of the available reading on set covering algorithms is in the computer science domain. For a more accessible starting point, see the Wikipedia article on the Set cover problem 
  • The next blog posting explores comparisons with QCA.
  • This exercise in method development was an unplanned outcome of the writing of a book chapter on bridging the fields of evaluation and foresight. The book is:
    • Lea Kleinsorg, Jan Tobias Polak, Christian Grünwald (2027): Futures-informed evaluation: Methodological approaches and empirical applications, SpringerNature, Heidelberg.