Rick On the Road: 2026

Sunday, May 31, 2026

Cluster analysis for evaluation purposes, and how a hybrid Human-LLM approach can help

Friend to Groucho Marx: “Life is difficult”

Groucho Marx to Friend: “Compared to what?”

Comparison is intrinsic to evaluation. Not only between one thing and another, but more often, between one category of things and another. Typologies are involved in many of the comparisons evaluators have to make (between people, activities, outcomes, locations, et cetera). Typologies can spring forth from our minds, but there are also systematic methods for developing them, broadly described as clustering methods. It is this second grouping that I want to discuss here.

What types of clustering methods are there?

Using one prompt to start off with, Claude identified for me 33 different clustering methods, which were organised into 9 different categories. These categories were defined largely on the basis of differences in computational mechanisms and mathematical principles which are involved in the operation of the clustering methods. But there were a couple of groups which were organised using different criteria relating to who does the task (human or algorithm) and what the method is for.

I then did some experimentation of my own, getting Claude to cluster the methods, using two different clustering methods. One is called a “maximum spanning tree”, which connects methods according to which method is most similar to which other method. You can see this in figure 1 below, which is best read by double clicking on the image to get greater magnification. The second experiment used an agglomerative hierarchical clustering to produce what is called a dendrogram i.e. a tree structure displaying nested categories of methods. You can see this in figure 2 below, again probably best inspected by magnifying the image first. I like both of these, for reasons that will become clearer below.
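For readers who want to see the mechanics, here is a minimal sketch of how a maximum spanning tree can be built from pairwise similarities. It is not the procedure Claude used to produce Figure 1; the method names and similarity scores are invented for illustration, and the function `maximum_spanning_tree` is a hypothetical helper based on Kruskal's algorithm.

```python
def maximum_spanning_tree(names, similarity):
    """Kruskal's algorithm, taking the most similar pairs first."""
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the representative of n's group.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    edges = sorted(similarity.items(), key=lambda kv: kv[1], reverse=True)
    tree = []
    for (a, b), sim in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # keep the link only if it joins two separate groups
            parent[root_a] = root_b
            tree.append((a, b, sim))
    return tree


# Invented similarity scores between four methods (illustrative only).
methods = ["k-means", "hierarchical", "pile sorting", "card sorting"]
similarity = {
    ("k-means", "hierarchical"): 0.6,
    ("k-means", "pile sorting"): 0.1,
    ("k-means", "card sorting"): 0.2,
    ("hierarchical", "pile sorting"): 0.3,
    ("hierarchical", "card sorting"): 0.4,
    ("pile sorting", "card sorting"): 0.9,
}
tree = maximum_spanning_tree(methods, similarity)
for a, b, sim in tree:
    print(f"{a} -- {b} (similarity {sim})")
```

With four methods the tree always has three links: the strongest connections that still join every method into one structure.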

Figure 1: Maximum Spanning Tree

Figure 2: Dendrogram

Introducing the Text Cluster Analysis Lab
I developed this mini-app recently, in May this year, with very substantial help from Claude AI. You can view (and copy) the app by following this link. You will see that one of the eight workflow tabs builds a dendrogram.

How does it compare?

When I asked Claude to compare this app with the list of 33 clustering methods, it identified similarities with a number of methods, including hierarchical clustering, latent class analysis, and pile sorting. But its best fitting answer was that the method is a hybrid. “Its uniqueness comes precisely from joining a human-sorting logic (criteria generated from the material, framed by the researcher's domain statement) to an algorithmic back end (binary scoring + agglomerative clustering). No single method in the set occupies that position”. The Lab's nearest relatives — hierarchical clustering, LCA, pile/card sorting — sit in three different families i.e. separate branches of the spanning tree and dendrogram. In this respect the Lab can claim novelty. Hopefully the basis of this claim will become clearer as I explain the Lab in some detail.

The inputs

By Claude's estimate, around 75% of cluster analysis methods work with quantitative data only; the rest work with text only (15%) or with either text or numbers (10%). The Lab works with multiple bodies of text. More specifically, the focus of the Lab is on similarities and differences between texts, not on internal structure within a text, as is the case with much thematic coding. The texts I have used while testing the mini-app are a set of storylines about alternative futures, collaboratively developed by participants in a ParEvo.org exercise. I also have plans to analyse a set of Most Significant Change stories.

Another important input is the choices users make about various settings, in each of the eight tabs making up the staged workflow. These include what version of Claude to use at various stages of the analysis, the temperature setting for each stage, and most importantly, the precise wording of the prompt that Claude will respond to, which tells it what to look for. In the Reliability and Cluster tabs choices are made about which identified sorting criteria to include, and what clustering settings to use. All these choices are important, as will be discussed below. They are what makes this approach hybrid: part human, part automated processes.

The outputs

The first is a list of attributes which differentiate one group of texts from another, together with a matrix in which rows list the texts, columns describe the attributes of those texts, and cells record the presence or absence of each attribute in each text. The text attributes are identified by a Claude AI search and comparison of the texts' contents, operating within user-defined context and scope settings. Attributes that fail to discriminate between the texts (those present in all of them or none of them) are filtered out automatically, because they contribute nothing to distinguishing one text from another.
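The matrix and the filtering step can be pictured with a small sketch. The storyline names and attributes below are invented, and `drop_non_discriminating` is a hypothetical helper rather than the Lab's actual code; it simply assumes the matrix is held as a dictionary of 0/1 values.

```python
# A toy text-by-attribute matrix: 1 = attribute present, 0 = absent.
# All names and values are invented for illustration.
matrix = {
    "Storyline 1": {"climate shock": 1, "set in the future": 1,
                    "community-led response": 0, "alien contact": 0},
    "Storyline 2": {"climate shock": 0, "set in the future": 1,
                    "community-led response": 1, "alien contact": 0},
    "Storyline 3": {"climate shock": 1, "set in the future": 1,
                    "community-led response": 1, "alien contact": 0},
}


def drop_non_discriminating(matrix):
    """Remove attributes that are present in every text or in none."""
    attributes = list(next(iter(matrix.values())))
    keep = [a for a in attributes
            if len({row[a] for row in matrix.values()}) > 1]
    return {text: {a: row[a] for a in keep} for text, row in matrix.items()}


filtered = drop_non_discriminating(matrix)
print(list(filtered["Storyline 1"]))
```

Here "set in the future" (present everywhere) and "alien contact" (present nowhere) are both removed, because neither helps tell one storyline from another.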

A second matrix is constructed as the results of a "back-translation" type of rater reliability test. This enables users to remove from use those attributes which are unreliably identifiable and to identify texts whose analyses are less reliable than others.

The third output is a dendrogram, representing a nested classification of the texts, using the user's selection of relevant text attributes. This is built using an agglomerative process, firstly finding pairs of texts which are most similar, then pairs of pairs of texts which are most similar, et cetera. When building the tree, the user also chooses how similarity between texts is measured (using either a Hamming or a Jaccard distance) and how clusters are linked together as they merge. Each branch of the tree structure includes tool-tip information on how the attributes of that cluster of texts differ from its sibling branch. This is systematically identified by the app, and humanly verifiable. It is not a subjective judgement, as is the case with a number of other types of clustering processes.
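The agglomerative process described above can be sketched in a few lines. This is an illustration under stated assumptions, not the Lab's code: the four texts and their attribute vectors are invented, Hamming distance is used, and clusters are joined by average linkage (one of several possible linkage rules). The function names are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Share of attributes on which two texts differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def agglomerate(vectors, distance=hamming):
    """Repeatedly merge the two closest clusters; return the merges in order."""
    clusters = [(name,) for name in vectors]
    merges = []
    while len(clusters) > 1:
        def linkage(pair):
            # Average linkage: mean distance over all cross-cluster text pairs.
            left, right = pair
            dists = [distance(vectors[i], vectors[j])
                     for i in left for j in right]
            return sum(dists) / len(dists)

        left, right = min(combinations(clusters, 2), key=linkage)
        merges.append((left, right, linkage((left, right))))
        clusters = [c for c in clusters if c not in (left, right)]
        clusters.append(left + right)
    return merges


# Invented presence/absence vectors for four texts over four attributes.
vectors = {
    "T1": (1, 1, 0, 0),
    "T2": (1, 1, 0, 1),
    "T3": (0, 0, 1, 1),
    "T4": (0, 1, 1, 1),
}
merges = agglomerate(vectors)
for left, right, dist in merges:
    print(left, "+", right, "at distance", round(dist, 3))
```

The order of the merges is what the dendrogram draws: the earliest merges are the most similar pairs, and the last merge joins the two top-level branches.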

The fourth output is an open-ended Claude chat-type query facility, where the user can ask questions about a cluster of texts: (a) on its own, (b) in comparison to its sibling cluster in the dendrogram, or (c) in comparison to the whole set of texts. This can be done with or without uploading additional context information.

The fifth output is a set of exportable products, including a detailed provenance statement describing how all the products have been produced, a listing of all the identified text attributes, a copy of the text-by-attributes matrix, the rater reliability assessment, a copy of the dendrogram, a copy of any query dialogue, and a breakdown of token use and token costs for each stage of each analysis. The details of the analysis process, including settings and contents generated, can also be saved as a JSON file and reimported for reuse.

A sixth output supports the choice of criteria itself. Because the criteria matter more than any other input choice (they define the matrix from which everything else follows) it is worth being able to compare different sets of them and being able to choose deliberately from within these, rather than accepting whatever a single run happens to produce. The Selection tab lets you place the criteria from two or more runs side by side, each shown with quantitative measures of its performance: how reliably it was identified, how many texts it applies to, and how much its work overlaps that of the other criteria. Alongside these per-criterion figures are measures for the set as a whole, including its overall reliability and a measure of how well the set as a whole tells the texts apart. From this you can hand-pick a set of criteria drawn from across the runs, add criteria of your own wording if a distinction you care about was missed, and then re-score the texts against just that curated set. The result is a new analysis like any other, which can be clustered, reliability-tested, and queried in turn. A short built-in guide explains how these measures relate to one another — for instance, why a criterion that applies to very few texts can look misleadingly distinctive, or why minimising overlap is not always the right aim — and points to the wider feature-selection literature for those who want to follow the ideas further.

For much more detail, go to this Introduction tab, on the Lab site

If this is the solution, what was the problem?

Eight of the nine clustering categories (29 of the 33 methods) identified by Claude can produce identifiable clusters of texts through replicable, transparent, deterministic processes — but then require a subjective human judgement to name and describe those clusters. That seems weirdly self-contradictory: a rigorous sorting process handed off, at the last step, to an undocumented act of interpretation. The remaining four methods (pile sorting, card sorting, Q Methodology, and Repertory Grid) are subjective throughout, because of their ethnographic orientation, so there is not such a visible internal contradiction.

A second problem concerns the interpretability of the dimensions within which clusters are located. At least eight of the methods rely on abstract derived axes — dimensions that exist and could in principle be examined, but which are statistical composites a non-statistician can't readily interpret (PCA, Factor analysis, ICA, t-SNE, UMAP, MDS, Spectral clustering, and SOM). The difficulty is not just that these axes are hard to read; it is that many of them can't be explained simply to a non-specialist without misrepresenting what they are.

A third problem is unaccountable input choices, such as the number of clusters, topics, classes, or dimensions to be identified. Yes, it is true that every method involves choices, including the Lab. But what matters is not the number of choices but how defensible each one is. Choices can differ in at least four ways: whether the choice is visible in the output or buried behind it; whether it is checkable against some standard after the fact; whether a poor choice fails visibly rather than silently; and whether it is documented. A choice that is buried, uncheckable, silently failing, and undisclosed is the least defensible; one that is visible, checkable, self-signalling, and recorded is the most.

My assessment

In summary, the Lab addresses the naming problem and the dimensions problem directly, and improves on, without completely escaping, the input-choices problem.

On naming, the dendrogram's tooltips identify how each cluster differs from its sibling by reporting which attributes it has that the sibling lacks. This is read straight off the matrix; it is systematic and humanly verifiable, not a subjective label imposed after the fact. In addition, the query facility can be used to generate names for the clusters based on their contents, which can then be tested for their reliability through another form of back-translation. But it should be remembered that this is a reliability test, not a validity test. That is, a back-translation can confirm a name is applied consistently across the texts, but not that it is the right name for what they share. That needs human checking.
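To show what "read straight off the matrix" can mean, here is a small sketch. The exact rule the Lab applies is not spelled out here, so this assumes a strict version: an attribute distinguishes a cluster if every text in the cluster has it and no text in the sibling does. The texts, attributes, and the function `distinguishing_attributes` are all invented for illustration.

```python
# Invented text-by-attribute matrix (1 = present, 0 = absent).
matrix = {
    "T1": {"hopeful ending": 1, "names institutions": 1, "first-person voice": 0},
    "T2": {"hopeful ending": 1, "names institutions": 0, "first-person voice": 0},
    "T3": {"hopeful ending": 0, "names institutions": 1, "first-person voice": 1},
    "T4": {"hopeful ending": 0, "names institutions": 0, "first-person voice": 1},
}


def distinguishing_attributes(matrix, cluster, sibling):
    """Attributes held by all texts in `cluster` and by none in `sibling`.

    This strict all-versus-none rule is an assumption made for the sketch.
    """
    attributes = list(next(iter(matrix.values())))
    return [a for a in attributes
            if all(matrix[t][a] == 1 for t in cluster)
            and all(matrix[t][a] == 0 for t in sibling)]


left, right = ["T1", "T2"], ["T3", "T4"]
left_only = distinguishing_attributes(matrix, left, right)
right_only = distinguishing_attributes(matrix, right, left)
print("Left branch has, sibling lacks:", left_only)
print("Right branch has, sibling lacks:", right_only)
```

Anyone can check the result by looking at the four rows of the matrix, which is the sense in which the comparison is humanly verifiable.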

On interpretability of dimensions, the dendrogram works at two levels. Its tree structure (a nested classification of groups within groups) is intuitively understandable to most readers without any statistical background; they can see directly which texts joined first and which groups merged later. Its horizontal axis, representing degree of similarity, is less immediately self-explanatory, but it can be explained in one plain sentence: similarity is the degree to which two texts share the same set of attributes. That similarity is computed using one of two standard measures: Hamming or Jaccard distance.
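The two measures differ in one respect that is easy to show with a toy pair of texts. Hamming distance counts every attribute on which two texts disagree, so shared absences make texts look more alike. Jaccard distance ignores attributes that neither text has. The vectors below are invented, and the functions are a plain restatement of the standard definitions rather than the Lab's own code.

```python
def hamming_distance(a, b):
    """Share of all attributes on which the two texts disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jaccard_distance(a, b):
    """1 minus (attributes both have) / (attributes at least one has)."""
    either = sum(1 for x, y in zip(a, b) if x or y)
    both = sum(1 for x, y in zip(a, b) if x and y)
    return 0.0 if either == 0 else 1 - both / either


# Two invented texts scored on six attributes; four are absent from both.
text_a = (1, 1, 0, 0, 0, 0)
text_b = (1, 0, 0, 0, 0, 0)

print("Hamming:", round(hamming_distance(text_a, text_b), 3))
print("Jaccard:", round(jaccard_distance(text_a, text_b), 3))
```

The same pair looks close under Hamming (they disagree on one attribute in six) and fairly distant under Jaccard (they share only one of the two attributes that either has), which is why the choice between the two is worth making deliberately.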

On input choices, as noted above, the Lab involves many of these, as other methods do. The dendrogram's similarity and linkage choices are inherited from its hierarchical-clustering parentage and are similar to those of other methods in the degree of technical knowledge involved. But its single most consequential input choice (the criteria used to rate the texts) is much more defensible, in four ways: the criteria are visible (they are the matrix columns, in plain language), checkable (the reliability test back-translates them and flags unreliable ones), self-signalling (a non-discriminating criterion is filtered out automatically rather than quietly distorting the result), and documented (they are recorded in the provenance statement).

More recently, with the development of the Selection tab, the criteria are now also improvable: alternative sets of criteria can be compared on these measures and a better one converged upon, rather than living with whatever a single run produced. The Lab does not remove the input-choices problem, but its strengths lie where it most matters, where the choice can actually be inspected and challenged. The criteria matter most because they define the matrix from which everything else follows: the dendrogram, the sibling comparisons, and the reliability test are all computed from the presence and absence of those attributes, so a choice made well or badly there propagates through every later step.

Returning to the bigger picture

One other criterion for assessing cluster analysis methods is utility or usefulness. A dendrogram is a much more useful structure than a simple list of clusters, each of which is different from the other. A dendrogram is a nested set of groups and subgroups, and as such can provide macro, meso, and micro level perspectives on the entities that have been grouped.

In addition, the binary branching structure means that at each branching point a pair comparison can be made. Pair comparisons, as distinct from comparisons of large numbers of entities, are most useful when the entities being compared are multidimensional. This is because the mind can only hold so many differences in view at once: comparing two entities that differ on many dimensions is manageable, but comparing ten such entities simultaneously is not, because the number of differences to track at once becomes overwhelming. The fewer dimensions of difference between entities, the more of them that can be compared at any one time. The constraint is cognitive, a limit on human attention, rather than mathematical.

Comparisons of sibling groups in the dendrogram can be past or future oriented in the type of question being asked. Comparison questions can be quantitative, as in which of these is more or less x, or qualitative, as in how is this group different from this group, in terms of x. Comparisons can be retrospective and evaluative, or prospective and planning orientated, and it is this second, forward-looking use that turns the structure from a record of how things are into a guide for what to do next. Most importantly, a dendrogram can be used as a kind of decision tree, not simply seen as a passive classification. It can be used in this way by humans, with or without the assistance of AI models like Claude.

What's next?

When I first wrote this, the open question was optimisation: if the criteria are the most consequential choice, can the Lab help you make that choice well rather than just make it visible? Three things seemed worth optimising — the reliability of the set of criteria, their resolution (their ability to tell the texts apart), and the number of criteria needed to do so. I noted that the last two might be in tension: you can always tell more texts apart by adding more criteria, so a high-resolution set achieved lazily, by piling criteria on, is no great achievement.

The Selection tab, described above, now addresses the first two directly — it puts reliability and resolution in front of you as measures you can read and act on. Looking into the near future, I think the tension between resolution and parsimony will be best handled not by having the tool pick an answer, but by making the trade-off visible: for a set of criteria you have curated, how few of them do you actually need, and what does each additional one buy you in resolution? A set might reach ninety per cent of its discriminating power with four criteria and crawl to a hundred only by adding six more — in which case the four may be the better choice, depending on what those criteria mean. That is a frontier of diminishing returns, and seeing its shape is more useful than being handed a single "optimal" set, because the choice of where to stop on it is a judgement about the criteria's meaning that only the analyst can make. A specification for this exists; building it is the next step.
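The shape of that frontier can be sketched with a toy example. This is not the specification mentioned above. It assumes one simple definition of resolution (the share of text pairs told apart by at least one criterion) and a greedy ordering of criteria; the matrix, the criterion names, and both functions are invented for illustration.

```python
from itertools import combinations


def resolution(matrix, criteria):
    """Share of text pairs told apart by at least one of the given criteria."""
    pairs = list(combinations(matrix, 2))
    split = sum(any(matrix[a][c] != matrix[b][c] for c in criteria)
                for a, b in pairs)
    return split / len(pairs)


def greedy_frontier(matrix):
    """Add criteria one at a time, always taking the one that helps most."""
    remaining = list(next(iter(matrix.values())))
    chosen, frontier = [], []
    while remaining:
        best = max(remaining, key=lambda c: resolution(matrix, chosen + [c]))
        chosen.append(best)
        remaining.remove(best)
        frontier.append((best, resolution(matrix, chosen)))
    return frontier


# Invented matrix: criterion c3 duplicates c1, so it adds no resolution.
matrix = {
    "T1": {"c1": 1, "c2": 1, "c3": 1},
    "T2": {"c1": 1, "c2": 0, "c3": 1},
    "T3": {"c1": 0, "c2": 1, "c3": 0},
    "T4": {"c1": 0, "c2": 0, "c3": 0},
}
frontier = greedy_frontier(matrix)
for criterion, res in frontier:
    print(f"after adding {criterion}: resolution {res:.2f}")
```

In this toy case two criteria already separate every pair of texts, and the third buys nothing. Whether to stop at two is still a judgement about what the criteria mean, which the sketch cannot make.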

There is a wider point here worth recording. This problem (finding the smallest set of things that adequately covers a set of requirements) is a recognised one, the set covering problem, and I have met it before in a different guise: an earlier tool of mine, the Coverage Optimiser, searches for the smallest combination of foresight methods that covers a set of evaluation questions. Choosing the fewest criteria that still tell a set of texts apart is the same shape of problem, transposed. It also has close relatives in Qualitative Comparative Analysis and in the feature-selection methods used in machine learning. The Lab's expected contribution is not to solve the set covering problem automatically, but to make the available solution(s) something an evaluator can see, inspect, and decide on.
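For readers unfamiliar with the set covering problem, a brute-force sketch shows its shape. This is not how the Coverage Optimiser works internally; the method names, question labels, and the function `smallest_cover` are hypothetical, and trying every combination is only practical for small sets.

```python
from itertools import combinations


def smallest_cover(options, requirements):
    """Try combinations of increasing size; return the first that covers all."""
    for size in range(1, len(options) + 1):
        for combo in combinations(options, size):
            covered = set().union(*(options[name] for name in combo))
            if requirements <= covered:
                return combo
    return None  # no combination covers every requirement


# Invented example: which evaluation questions each method can address.
options = {
    "Method A": {"Q1", "Q2"},
    "Method B": {"Q2", "Q3"},
    "Method C": {"Q3", "Q4"},
    "Method D": {"Q1", "Q4"},
}
requirements = {"Q1", "Q2", "Q3", "Q4"}
cover = smallest_cover(options, requirements)
print(cover)
```

Note that more than one pair covers all four questions here (Method B with Method D would also do). The search returns the first it finds, which is one reason to show the available solutions rather than a single answer.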

The patron saint of evaluation?

Footnote: The term "decision tree" can mean two related but distinct things. In machine learning it is an algorithm used to induce prediction rules from data, to classify cases or predict values (the classification and regression-tree family). In decision analysis and operations research, the older sense, it is a decision-support model in which people map out choices, chance events, and their consequences to compare the expected value or utility of competing options. It is this second, human-facing sense I am referring to when suggesting that a dendrogram can be read as a decision tree: not a model that makes the decision, but a structure that lays out the comparisons a decision-maker can make, at macro, meso and micro levels of detail. See Wikipedia, "Decision tree" and "Decision tree learning" and The Decision Lab, "Decision Tree Analysis."

Tuesday, April 07, 2026

Rethinking how we share evaluation methods

This year I have been experimenting with a different approach to making evaluation methods more accessible and reusable. Working with Claude AI, I've developed a series of mini-apps, each implementing a specific method of analysis as a single self-contained HTML webpage.

What makes this approach worth sharing?

1. No dependency on AI services to use them. Once built, the app runs entirely in the browser. Anyone can copy the HTML file and use it independently, with no subscription, no login, and no connectivity requirements.

2. Data stays with the user. There's no server, no database, no cloud storage. Data is uploaded from, and downloaded to, the user's own device as a JSON file. For work involving sensitive information this matters.

3. Surprisingly fast to build with Claude AI. Turning a method of analysis into a working customised tool takes a fraction of the time you might expect, even for non developers.

4. Collective potential. If practitioners share analysis methods AND the tools to implement them, then others can use them directly or adapt them, with AI assistance, for their own context. The barriers to entry are low.

5. Easy to check for viruses. Being only a single webpage most widely used virus protection software should be able to scan any such mini-apps very quickly and thoroughly.

I've documented a number of examples so far, at mandenews.blogspot.com/2026.

If you're working on evaluation methods and curious as to whether this model fits your context, I'm happy to discuss.

Postscript: Some tips on desirable features of a mini-app for the above purposes:

1. App name, that captures its purpose

2. Author's name, along with copyright symbol and year. A comment in the underlying HTML should spell out under what conditions the mini-app code can be edited and redistributed.

3. Tabs across the top that give access to a discrete set of processing steps, in sequence, all preceded by an Introduction tab that spells out the purpose and workings of the mini-app, tab by tab.

4. Import and Export icons for managing JSON or CSV files

5. A JSON file with an example data set that can be imported into the app, and which is explained in the Introduction tab.

...to be continued...

Saturday, March 28, 2026

When rankings tell different stories: an introduction to the Rank Explorer

Caveat Emptor: I delegated the writing of this blog posting to Claude AI (Sonnet 4.6), based on an extended prior dialogue on the subject below, then a summary prompt of what was wanted. My post-production edits were quite limited.

--o0o--

There is a situation that turns up repeatedly in evaluation and research practice that is easy to overlook precisely because it looks like a data analysis problem rather than a methodological one. The situation is this: you have a set of cases that have been ranked on multiple factors, along with a ranking of their outcomes, and you want to understand what the relationship is between the factors and the outcome.

This sounds straightforward. And to a degree it is — you can correlate each factor against the outcome, identify the strongest relationships, and build a composite ranking that aggregates them. These are useful things to do. But they share a common assumption that is not always warranted: that the factors combine additively, that each one contributes independently to the outcome, and that a case that scores well on several factors will therefore tend to score well overall.

That assumption is often wrong. And when it is wrong, the gap between your composite ranking and the actual outcomes is telling you something important.

The problem with additive aggregation

Consider a concrete example. You have 63 local authority areas. You have ranked them on ten factors thought to be associated with population-level physical activity — access to green space, deprivation levels, sports facility density, and others. You have also ranked them on an outcome measure. You build a composite ranking from the factors, correlate it with the outcome, and find it predicts reasonably well — perhaps a Spearman r of 0.75.

That is a decent result. But it is hiding something. Some areas with strong factor rankings are performing poorly on the outcome; others with weak factor rankings are performing well. If you look closely, there are two or three quite different combinations of factors that each seem sufficient, on their own, to predict a good outcome. These are not variations on the same story; they are distinct causal pathways. However, an additive composite averages across them, and in doing so obscures the structure.

This is what researchers in the QCA tradition call equifinality — multiple routes to the same outcome. Additive methods cannot find it. A decision tree can.

What the Rank Explorer does

The Rank Explorer is a browser-based companion tool to The Ethnographic Explorer. It is designed to import the rankings data that TEE generates from its Contrast tab, though it will also accept ranking data from any other source in the same CSV format.

[▶ Try the Rank Explorer]

The tool has four analysis tabs. The first three — Individual Factors, Composite Builder, and Scatter Plot — provide standard additive analysis: Spearman correlations for each factor, composite rankings using several aggregation methods (equal-weight, correlation-weighted, stepwise greedy, and exhaustive search), and a scatter plot that visualises how well your composite ranking predicts the outcome, with adjustable classification thresholds.

The fourth tab, Pathway Explorer, is where the configurational logic comes in. It builds an optimal classification tree over your data using exhaustive search: at each node, every available factor and every possible rank cut-off is tested, and the split that best separates high-outcome from low-outcome cases is chosen. The result is a tree that shows which specific combinations of factor ranks distinguish the two groups, displayed in a row-by-level icicle layout that makes the branching structure easy to follow.
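The search at a single node can be sketched as follows. This is an illustration rather than the Rank Explorer's code: the six cases are invented, rank 1 is assumed to be the best rank, and the quality of a split is assumed to be classification accuracy when each side predicts its majority label. The function `best_split` is hypothetical.

```python
def best_split(cases, factors):
    """Test every factor and every rank cut-off; return the most accurate split."""
    best = None
    for factor in factors:
        # Cut-offs are the observed ranks, except the largest (nothing beyond it).
        for cutoff in sorted({c[factor] for c in cases})[:-1]:
            left = [c["high"] for c in cases if c[factor] <= cutoff]
            right = [c["high"] for c in cases if c[factor] > cutoff]
            # Each side predicts its majority label; count the cases that agree.
            correct = (max(sum(left), len(left) - sum(left))
                       + max(sum(right), len(right) - sum(right)))
            accuracy = correct / len(cases)
            if best is None or accuracy > best[2]:
                best = (factor, cutoff, accuracy)
    return best


# Invented cases: factor ranks (1 = best, assumed) and a high/low outcome label.
cases = [
    {"name": "A", "green": 1, "travel": 4, "high": True},
    {"name": "B", "green": 2, "travel": 1, "high": True},
    {"name": "C", "green": 3, "travel": 6, "high": True},
    {"name": "D", "green": 4, "travel": 2, "high": False},
    {"name": "E", "green": 5, "travel": 5, "high": False},
    {"name": "F", "green": 6, "travel": 3, "high": False},
]
factor, cutoff, accuracy = best_split(cases, ["green", "travel"])
print(f"{factor}: rank {cutoff} or better (accuracy {accuracy:.0%})")
```

A full tree would repeat the same search within each resulting group, down to the chosen depth.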

The tree is not just a visual. Each leaf node shows which cases ended up there, whether they were correctly classified, and the conditions that led to that grouping. A pathway summary below the tree lists the conditions for each leaf in plain language — for instance, "Active travel infrastructure: rank 5 or better AND Deprivation index: rank 8 or better."

When the gap between additive and configurational results is itself a finding

One of the more useful diagnostics the tool enables is comparing the classification accuracy of the best composite ranking against that of the decision tree. If both are similar, the additive story is probably adequate. If the tree substantially outperforms the composite — reaching, say, 90% or 100% accuracy where the composite only reached 75% — that gap is a finding in itself: the causal structure in the data is better described by conjunctions of conditions than by sums of contributions.

This matters for intervention design. If a high-outcome classification requires both good green space access and good active travel infrastructure (rather than either being substitutable for the other), then improving one without the other may produce no discernible effect. Additive analysis will not surface that conclusion; configurational analysis will.

A note on scale, depth, and selective deepening

The tree-building algorithm uses exhaustive search, which is thorough but computationally intensive. With datasets of 60–70 cases and 10 factors, a depth-3 tree typically builds in a few seconds. Depth 4 or beyond is a different matter: computation time increases steeply, and more importantly, deeper trees on small datasets will often find spurious distinctions — patterns that reflect the quirks of the sample rather than anything real.

The Rank Explorer addresses this through a subgroup analysis feature that may be a modest innovation in decision tree practice for small-N datasets. Once a tree has been built, each leaf node displays an Analyse subgroup → button. Clicking it filters the dataset to only the cases in that leaf and opens a fresh analysis session for that group alone — with all four tabs, including the Pathway Explorer, reconfigured for the subgroup. The outcome cut-off resets automatically to the median of the subgroup's outcome ranks, so the high/low distinction remains balanced within the smaller group.
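The median reset can be illustrated in a few lines. The case names and ranks are invented, `relabel_subgroup` is a hypothetical helper, and the sketch assumes that a lower rank number means a better outcome.

```python
from statistics import median


def relabel_subgroup(outcome_ranks, leaf_cases):
    """Re-split one leaf's cases into high/low around the subgroup's own median."""
    subgroup = {name: outcome_ranks[name] for name in leaf_cases}
    cutoff = median(subgroup.values())
    # Assumption: rank 1 is best, so "high outcome" means at or below the cut-off.
    labels = {name: rank <= cutoff for name, rank in subgroup.items()}
    return cutoff, labels


# Invented outcome ranks for the full dataset, and the cases in one leaf.
outcome_ranks = {"A": 2, "B": 9, "C": 14, "D": 20, "E": 31, "F": 40}
leaf_cases = ["B", "C", "E", "F"]

cutoff, labels = relabel_subgroup(outcome_ranks, leaf_cases)
print("Subgroup cut-off:", cutoff)
print(labels)
```

Case B counts as "high" within this leaf even though its full-dataset rank of 9 is unremarkable, which is exactly why the labels must be read as relative to the subgroup.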

This allows selective deepening of a specific branch without rebuilding the entire tree at greater depth. If one leaf contains 24 cases that the main tree could not separate further, the subgroup analysis asks a different and analytically legitimate question: within this group, what distinguishes the relatively better-performing cases from the worse ones? The answer applies conditionally — only to cases that reached that leaf — but that conditionality is precisely what makes it interpretable. A banner remains visible throughout the subgroup session as a reminder that the high/low labels are relative to the subgroup, not the full dataset.

For larger or more complex datasets, the stepwise greedy method in the Composite Builder tab is a useful preliminary step: it adds factors to the composite one at a time, selecting whichever remaining factor most improves the correlation with the outcome at each step. The resulting path table shows the marginal contribution of each factor, making it straightforward to identify a smaller subset that carries most of the predictive weight — before running the Pathway Explorer on that reduced set.
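A minimal sketch of the stepwise greedy idea follows. It is not the Composite Builder's code: the data are invented, the composite is assumed to be a simple sum of factor ranks, and the sketch keeps adding factors even when the correlation falls so that each marginal contribution is visible. All function names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the (tie-aware) ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def stepwise_greedy(factors, outcome):
    """Add one factor at a time, choosing the one giving the best composite."""
    chosen, path, remaining = [], [], list(factors)
    while remaining:
        def score(name):
            columns = [factors[f] for f in chosen + [name]]
            composite = [sum(vals) for vals in zip(*columns)]
            return spearman(composite, outcome)

        best = max(remaining, key=score)
        path.append((best, score(best)))
        chosen.append(best)
        remaining.remove(best)
    return path


# Invented ranks for six cases on three factors, plus an outcome ranking.
outcome = [1, 2, 3, 4, 5, 6]
factors = {
    "green space": [1, 2, 3, 5, 4, 6],
    "deprivation": [2, 1, 4, 3, 6, 5],
    "facilities": [6, 5, 4, 3, 2, 1],
}
path = stepwise_greedy(factors, outcome)
for name, rho in path:
    print(f"+ {name}: composite correlation {rho:.3f}")
```

In this toy data the correlation rises when the second factor is added and falls when the third is, which is the kind of pattern the path table makes visible.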

Beyond TEE data

The tool is designed as a TEE companion but is not restricted to it. Any CSV with a column of case names and a set of ranking columns will load correctly. Evaluation practitioners who have generated case rankings through other means — expert scoring panels, secondary data, peer comparison exercises — can use the same analytical workflow.

Some framings that could map onto the same tool:

Programme portfolios: rank a set of projects on design-quality dimensions and an outcome measure, then identify which combinations of design features distinguish the most successful from the rest
Organisational assessments: rank a set of partner organisations on capability dimensions, use the tree to find which combinations are most predictive of delivery performance
Cross-country comparison: rank a set of countries or regions on contextual factors alongside a development indicator, and look for the configurational patterns that additive index approaches miss

In each case the structure is the same: cases, factor rankings, an outcome ranking, and the question of what the relationship looks like once you stop assuming it is additive.

An invitation to experiment

The tool is best explored with data you already have. If you have ever built a composite index and felt that it was not quite capturing something you could see in the data, or have had the experience of an outlier case that your model consistently misclassifies, the Pathway Explorer is a reasonable next step. Loading your own data, building a tree at depth 2 or 3, and comparing the pathway classification against your composite should take no more than a few minutes.

I am continuing to develop both tools and would welcome feedback on the approach, the interface, or uses I have not considered.

Accessing the code: The Rank Explorer runs entirely in your browser — no login required, no data is transmitted anywhere. To save your own copy of the code, open the tool, right-click, select View Page Source, copy the entire code, paste it into a text file, rename it to end in .html rather than .txt, and open it in any web browser.

Sunday, March 22, 2026

Introducing Rank Order Counterfactuals (ROC)

A counterfactual is a description of what would have happened if an intervention had not taken place. The use of randomised control groups is one way to construct a counterfactual. A population of people is randomly assigned to either a control group or an intervention group. Differences in the outcomes of those populations are then compared. If the difference is statistically significant, then a plausible causal claim can be made that the difference in outcomes is due to the intervention.

As might be expected, there are plenty of circumstances in which social programmes are designed and implemented where it is simply not practical to organise a randomised control group. In addition, the comparisons that are made will be between averages of the two groups. However, in many social programmes such averages are of limited practical use, because the implementation contexts are so varied and no single “solution” is likely to be applicable. Average effects can still be informative at a high level, but they need to be complemented by methods that take contextual diversity seriously.

I'm currently working with an evaluation team that is examining a large-scale public health programme in the United Kingdom, covering many different locations and involving many different types of local partnerships, but with one common outcome of concern: increasing people's physical activity levels in their daily lives. In their work the evaluation team is already making use of a causal configurational approach to the understanding of what works for whom in what circumstances. It is finding different configurations of causal conditions across these locations that are associated with changes in activity levels. This approach is consistent with the high level of diversity in locations, partnerships and interventions.

But what it does not yet have is a counterfactual, a defensible description of what might have happened in these locations in the absence of this intervention. This is where the idea of a rank order counterfactual becomes relevant. By a rank‑order counterfactual I mean a very specific kind of “what would have happened otherwise.” Instead of trying to predict the exact outcome that would have been achieved in each location without an intervention, we can start by asking a simpler, comparative question: which location would probably have changed more, and which less, if the intervention had never existed? The answer will be in the form of a rank ordering of locations, from those with more to less expected change. That ranking would be constructed based on all available baseline information, trends, and contextual knowledge. This proposed approach falls into a category of counterfactuals known as "logically constructed counterfactuals", and it aligns well with configurational evaluation because it focuses on patterns of relative change across diverse contexts.

A subsequent evaluation of those same locations should also be able to generate a new rank ordering, which is based on observed outcomes. These counterfactual and actual rankings can then be compared, using a scatterplot and correlation measures. The scatterplot is also visually powerful for communication: it lets people see at a glance which locations behave as expected and which ones stand out as surprises. If the intervention had no effect we should see a linear relationship: the observed and counterfactual rankings should be the same. If the intervention had positive, or perhaps even negative, effects, this should not happen. We might see various locations which are outliers from that expected trend. When locations we expected to be “natural leaders” did not improve much, and those we expected to be “natural laggards” moved to the top of the league table, that pattern is a signal that the intervention may have been influential, and it gives the evaluation team clear cases where alternative explanations should be probed. The task of the evaluation is then to probe those alternative explanations, not to assume the intervention is the only possible cause. The rankings are not a substitute for theory‑based evaluation; they are a way to make its claims sharper and more testable. The focus on ranking differences can convert a vague theory (“we think these factors matter”) into a concrete, specific prediction about which locations should do better.
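The comparison of the two rankings can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the six locations and their ranks are invented, rank 1 means "most change", there are no tied ranks, and the function names `spearman_rho` and `rank_shifts` are hypothetical rather than part of any existing tool.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman's rho for two rankings of the same items, assuming no ties."""
    n = len(rank_a)
    d_squared = sum((rank_a[item] - rank_b[item]) ** 2 for item in rank_a)
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))


def rank_shifts(counterfactual, observed):
    """Rank shift per location, largest first.

    A positive shift means the location ranked higher in practice
    than the counterfactual ranking predicted.
    """
    shifts = {loc: counterfactual[loc] - observed[loc] for loc in counterfactual}
    return sorted(shifts.items(), key=lambda pair: abs(pair[1]), reverse=True)


# Invented example data: rank 1 = most change expected / observed.
counterfactual = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}
observed = {"A": 5, "B": 2, "C": 3, "D": 4, "E": 6, "F": 1}

rho = spearman_rho(counterfactual, observed)
shifts = rank_shifts(counterfactual, observed)
print(round(rho, 2), shifts[:2])
```

In this made-up case the expected laggard F moves to the top and the expected leader A drops to fifth place, so these two locations head the list of shifts and would be the first candidates for closer investigation. A rho close to 1 would instead correspond to the "no effect" expectation described above.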

The sensitivity of the rank comparison process will depend on the number of ranked items. The more rank positions there are, the more sensitivity there will be to differences in performance, which is good. But, as shown in research on sorting algorithms, the time required to generate a complete sorting can be significant. Even with the most efficient comparison-based methods, the number of comparisons grows faster than proportionally to the number of items (on the order of n log n), though far slower than exponentially. In addition to the extra time required, a rank order counterfactual will require a stronger evidential base where the number of rank positions is greater.
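To get a feel for this growth, the sketch below counts the pairwise judgements needed to rank n items with a merge sort, and sets that against the number of judgements needed if every pair were compared. Merge sort is used here only as one familiar example of an efficient comparison-based method. The `prefer` function stands in for a human judgement about which of two locations should rank higher, and all names are hypothetical.

```python
import math
import random


def merge_sort_count(items, prefer):
    """Rank items using pairwise judgements.

    prefer(a, b) returns True when a should be ranked above b.
    Returns (ranked_list, number_of_judgements_used).
    """
    if len(items) <= 1:
        return list(items), 0
    mid = len(items) // 2
    left, used_left = merge_sort_count(items[:mid], prefer)
    right, used_right = merge_sort_count(items[mid:], prefer)
    merged, used = [], used_left + used_right
    i = j = 0
    while i < len(left) and j < len(right):
        used += 1  # one pairwise judgement
        if prefer(left[i], right[j]):
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, used


def comparison_table(sizes, seed=0):
    """For each n: (n, judgements used by merge sort, judgements for all pairs)."""
    rng = random.Random(seed)
    rows = []
    for n in sizes:
        items = list(range(n))
        rng.shuffle(items)
        ranked, used = merge_sort_count(items, lambda a, b: a < b)
        assert ranked == sorted(items)
        rows.append((n, used, n * (n - 1) // 2))
    return rows


for n, used, all_pairs in comparison_table([8, 16, 32, 64]):
    print(n, used, all_pairs, n * math.ceil(math.log2(n)))
```

The last printed column is an upper bound of n times log2(n), rounded up, for this method. Comparing it with the all-pairs column shows why a structured sorting process matters once the number of locations grows.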

When a large number of locations are involved in an intervention one practical way of addressing this tension is to use a stratified random sample, and to generate the rankings for that sample only. Another approach to managing large numbers of locations is to think of ranked bands of locations rather than individual rankings for each location. What should be of interest, then, are systematic shifts in band membership between the counterfactual and actual observations – for example, locations expected to be in the “low‑change” band turning up in the “high‑change” band in practice.
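The banding idea can be sketched as a simple cross-tabulation of counterfactual band against observed band. The nine locations, the choice of three equal-sized bands, and the function names are all assumptions made for illustration only.

```python
from collections import Counter


def to_band(rank, n_items, n_bands=3):
    """Map a rank (1 = most change) to a band index (0 = highest-change band)."""
    return (rank - 1) * n_bands // n_items


def band_shifts(counterfactual, observed, n_bands=3):
    """Cross-tabulate band membership and list locations that changed band."""
    n = len(counterfactual)
    table = Counter()
    movers = []
    for loc, cf_rank in counterfactual.items():
        cf_band = to_band(cf_rank, n, n_bands)
        ob_band = to_band(observed[loc], n, n_bands)
        table[(cf_band, ob_band)] += 1
        if cf_band != ob_band:
            movers.append((loc, cf_band, ob_band))
    return table, movers


# Invented example: nine locations, identical rankings except L1 and L9 swap.
counterfactual = {f"L{i}": i for i in range(1, 10)}
observed = dict(counterfactual, L1=9, L9=1)

table, movers = band_shifts(counterfactual, observed)
print(movers)
```

Cells of `table` off the diagonal, such as `(2, 0)`, count locations expected to be in the low-change band that turned up in the high-change band.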

In this way, rank‑order counterfactuals do not replace theory‑based evaluation, but sharpen it: they turn general expectations about context into explicit, testable predictions about who should have changed most in the absence of the programme. In work which I hope to present at the European Evaluation Society conference later this year, I will explain how the hierarchical card sorting process was used to generate argument- and evidence-based counterfactual rank orderings, and how an LLM was used to support this process.

Saturday, March 07, 2026

Making implicit knowledge explicit, contestable and usable: an introduction to The Ethnographic Explorer

--o0o--

There is a problem that turns up repeatedly in evaluation practice — and in many other fields — that many of us work around rather than solve directly. The problem is this: the people who know most about a programme, a portfolio, or a set of cases often cannot easily say what they know, or why they make the judgements they do.

This is not a failure of intelligence. It is the normal condition of what Michael Polanyi called tacit knowledge — the "we know more than we can tell" problem that is endemic to any field built on experience and judgement. The challenge for evaluation is to find structured ways of drawing it out.

The theoretical anchor: information as difference

The approach behind The Ethnographic Explorer draws on a deceptively simple idea from Gregory Bateson: information is a difference that makes a difference. What we notice, and consider significant, is always defined by contrast, not by properties of things in isolation. A project is "successful" relative to others; a group is "vulnerable" in comparison to other groups in a given context. And some of those differences have noticeable consequences: they "make a difference". Together they can be seen as simple "IF...THEN..." rules.

This suggests that a structured method for eliciting knowledge should be built around comparison — specifically, around asking people to identify and articulate differences, and then to explain what those differences imply. The Hierarchical Card Sort methodology is one way of doing this and is the basis of The Ethnographic Explorer's process of inquiry.

How the Ethnographic Explorer works

There are three linked stages.

In the Sort stage, a respondent is presented with a set of cases — projects, organisations, events, beneficiaries, or any entities they know well. They are asked to divide all the cases into two groups representing the most significant difference between them, from their point of view. Each group is named, the difference is recorded, and then the respondent is asked: "What difference does this difference make?" — a question that surfaces the consequence or implication of the distinction. The process repeats on each subgroup until every group contains a single case. The result is a binary tree: a structured, hierarchical map of the respondent's view of the case set.
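The structure produced by the Sort stage can be pictured as a small data structure. The sketch below is not the code of The Ethnographic Explorer; the class name `SortNode`, its fields, the helper `unfinished`, and the four example projects are all hypothetical, chosen only to mirror the steps described above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SortNode:
    cases: List[str]
    label: str = ""
    difference: str = ""   # the most significant difference between the two subgroups
    consequence: str = ""  # "What difference does this difference make?"
    left: Optional["SortNode"] = None
    right: Optional["SortNode"] = None

    def is_leaf(self):
        return self.left is None and self.right is None

    def split(self, left_cases, right_cases, left_label, right_label,
              difference, consequence):
        """Divide this group's cases into two named, non-empty subgroups."""
        if not left_cases or not right_cases:
            raise ValueError("Both groups must contain at least one case")
        if sorted(list(left_cases) + list(right_cases)) != sorted(self.cases):
            raise ValueError("The two groups must contain every case exactly once")
        self.difference = difference
        self.consequence = consequence
        self.left = SortNode(list(left_cases), left_label)
        self.right = SortNode(list(right_cases), right_label)
        return self.left, self.right


def unfinished(node):
    """Groups that still hold more than one case and have not yet been split."""
    if node.is_leaf():
        return [node] if len(node.cases) > 1 else []
    return unfinished(node.left) + unfinished(node.right)


# Invented example: four projects, two splits made so far.
root = SortNode(["P1", "P2", "P3", "P4"], label="All projects")
urban, rural = root.split(["P1", "P2"], ["P3", "P4"], "Urban", "Rural",
                          difference="Location of the project",
                          consequence="Rural projects face higher delivery costs")
urban.split(["P1"], ["P2"], "Older", "Newer",
            difference="Year the project started",
            consequence="Older projects have more outcome data")

print([node.label for node in unfinished(root)])
```

The sort is complete when `unfinished(root)` returns an empty list, that is, when every group contains a single case.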

The tree is informative in three ways: it reveals the contents of the distinctions the respondent considers important; it identifies the limits of their knowledge (where further differences cannot be found); and it indicates the direction of their attention (where further distinctions could usefully be explored).

In the Compare stage, the facilitator asks comparison questions at each split in the tree. Questions can be in degree ("Which group of cases is more likely to face sustainability problems?") or in kind ("How do these two groups differ in terms of their relationship with local government?"). Degree questions produce a ranking of all cases; kind questions produce descriptive contrasts. Both types build on the structure already revealed by the sort.
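One way a complete ranking can fall out of a series of degree judgements is sketched below. It assumes that at every split the respondent names the subgroup with more of the attribute, and that all cases in that subgroup then rank above all cases in the other. Whether the tool implements it exactly this way is an assumption. The tree encoding (nested pairs), the path keys, and the function name are hypothetical.

```python
def rank_from_judgements(tree, judgements, path=""):
    """Turn degree judgements at each split into a ranking of all cases.

    tree: a case name (a leaf) or a (left, right) pair of subtrees.
    judgements: maps the path of each split ('' = root, then 'L', 'R',
    'LL', 'LR', ...) to 'L' or 'R', naming the subgroup judged to have
    MORE of the attribute in question.
    """
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    left_ranked = rank_from_judgements(left, judgements, path + "L")
    right_ranked = rank_from_judgements(right, judgements, path + "R")
    if judgements[path] == "L":
        return left_ranked + right_ranked
    return right_ranked + left_ranked


# Invented example: five cases sorted into a binary tree.
tree = (("P1", "P2"), ("P3", ("P4", "P5")))

# "Which group is more likely to face sustainability problems?"
judgements = {"": "R", "L": "L", "R": "R", "RR": "L"}

ranking = rank_from_judgements(tree, judgements)
print(ranking)
```

Because the same tree is reused, each new degree question needs only one judgement per split, which here means four judgements to rank five cases.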

In the Contrast stage, any two degree-based rankings are plotted against each other in a scatter plot. The resulting quadrant analysis shows where the two rankings agree (cases high on both, or low on both) and where they diverge (cases high on one and low on the other). Adjustable cutoff sliders allow the facilitator to explore different thresholds, and a Spearman correlation coefficient summarises the overall relationship. Outlier cases, those that diverge most between the two rankings, are often the most analytically interesting. Strong relationships can be cast as potentially useful IF...THEN rules.
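The quadrant logic can be illustrated with a short sketch. The two rankings, the cutoff values, and the function name `quadrants` are invented for the example, and rank 1 is taken to mean "highest".

```python
def quadrants(rank_x, rank_y, cutoff_x, cutoff_y):
    """Assign each case to a quadrant, given two rankings and two cutoffs.

    Ranks use 1 = highest. A case counts as 'high' on a ranking when its
    rank is at or above the cutoff position (rank <= cutoff).
    """
    result = {"high-high": [], "high-low": [], "low-high": [], "low-low": []}
    for case in rank_x:
        x = "high" if rank_x[case] <= cutoff_x else "low"
        y = "high" if rank_y[case] <= cutoff_y else "low"
        result[f"{x}-{y}"].append(case)
    return result


# Invented example: perceived effectiveness versus political feasibility.
effectiveness = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}
feasibility = {"A": 2, "B": 5, "C": 1, "D": 6, "E": 3, "F": 4}

result = quadrants(effectiveness, feasibility, cutoff_x=3, cutoff_y=3)
print(result)
```

Moving either cutoff plays the role of the sliders: the cases in the two off-diagonal quadrants ("high-low" and "low-high") are the ones where the rankings disagree.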

The Ethnographic Explorer

With substantial coding help from Claude AI, I have been developing a browser-based implementation of this methodology — The Ethnographic Explorer — as a standalone single-file application. This new version supersedes an earlier WordPress-based tool at ethnographic.mande.co.uk, which required a server to run. The new version requires nothing beyond a web browser.

▶ Try The Ethnographic Explorer [When you get there, click on Introduction for guidance on how to explore the tool's functions]

The tool is designed for use in a shared-screen video call with a single respondent, or screened to multiple participants in a workshop setting. The facilitator drives the interface; the respondent provides the knowledge. A typical exercise with 8–12 cases takes between 45 minutes and two hours, depending on the range of comparisons made.

A worked example: 12 Largest cities

Seven JSON files are available alongside the app to allow you to explore its use. The first is simply a list, which you can then sort, compare and contrast. The other six contain the same list, already sorted by Claude AI at my request on the basis of the cities' appeal to international tourists, as seen from six different perspectives.

To load any of these datasets, download the file to your computer, then click Import JSON in the top-right toolbar and select the file.

Download 12 Largest Cities (unsorted)

Download 12 Largest Cities - Budget Backpacker view (sorted)

Download 12 Largest Cities - Food and Culture view (sorted)

Download 12 Largest Cities - Safety Conscious view (sorted)

Download 12 Largest Cities - Sustainable Tourism view (sorted)

Download 12 Largest Cities - Travel Journalist view (sorted)

Download 12 Largest Cities - Heritage Specialist view (sorted)

What the tool produces

Each exercise generates a structured record of the respondent's knowledge about a set of cases:

a binary sort tree, with each split labelled by the most significant difference and its consequence
one or more named rankings, each derived from a sequence of binary degree judgements working from the root set down through the subsets
one or more descriptive contrasts, capturing how subgroups differ in kind on specified attributes
a scatter plot for any pair of degree rankings, with quadrant analysis and Spearman correlation
a qualitative responses panel, collecting the in-kind descriptions at each split

The full exercise exports as a JSON file, preserving the complete tree structure, all difference descriptions, all responses, and the exercise metadata. Files can be imported to resume exactly where you left off, or shared with a colleague for further analysis.

Beyond evaluation

The underlying structure of the method applies wherever you have a set of cases and a respondent who has differentiated knowledge about them. Some other framings that could be explored with the same tool:

Organisational learning: a team reviews a set of completed projects, identifying the distinctions that, in retrospect, most predicted success or failure
Capacity assessment: a trainer sorts a cohort of staff by their readiness for different kinds of work, making the basis for those judgements explicit and discussable
Stakeholder analysis: a key informant sorts a set of stakeholder organisations, revealing the distinctions they consider most consequential for programme implementation
Policy analysis: a policy analyst sorts a set of interventions by their perceived effectiveness, then compares that ranking against a ranking of their political feasibility

In each case, the sort structure is the same, the comparison logic is identical, and only the cases, the domain framing, and the named differences change.

An invitation to experiment

The tool is best understood by using it. I would encourage anyone engaged in evaluation, learning, or knowledge management work to try loading in a set of cases they know well — even with a rough sort to start — and see what structure emerges. The Compare and Contrast stages are particularly useful for surfacing assumptions that are rarely made explicit in standard reporting.

I am continuing to develop the tool and would welcome feedback on the methodology, the interface, or applications I have not yet considered.

Accessing the code: In addition to trying the app online, you can download a copy of the code and run it independently. Go to the app online, right-click, select View Page Source, copy the entire code, paste it into a text file, rename the file to end in .html rather than .txt, and open it in a web browser. The tool runs entirely in your browser: no login required, no data sent anywhere.

Tuesday, March 03, 2026

Optimising method selection: an introduction to the set covering problem — and a tool to help

Caveat Emptor: I delegated the writing of this blog posting to Claude AI (Sonnet 4.6), based on an extended prior dialogue on the subject below, then a summary prompt of what was wanted. My post-production edits were quite limited.

--0O0--

There is a class of problem that turns up repeatedly in evaluation planning, and in many other fields, that most of us solve informally and imprecisely. The problem is this: given a set of needs to be addressed, what is the smallest combination of responses that covers all of them?

In evaluation this might be: given a set of evaluation questions I need to answer, what is the minimum combination of methods that gives me adequate coverage across all of them? In public health it might be: which combination of clinics ensures every neighbourhood has access to at least one? In software testing: which set of test cases exercises every code path? In logistics: which depots can serve every delivery zone?

These are all instances of what computer scientists call the Set Covering Problem — a well-studied problem in combinatorial optimisation that dates back to the 1970s. The formal version asks: given a collection of sets, find the smallest sub-collection whose union covers all elements of a target universe.

Why it matters for evaluation practice

When designing an evaluation, practitioners face exactly this structure. We have a list of evaluation questions (or dimensions of quality, or stakeholder concerns), and a repertoire of methods — each of which can address some but not all of those questions to a satisfactory level. Choosing methods one at a time, based on familiarity or habit, rarely produces the most efficient combination. We either end up with redundant overlaps in some areas and blind spots in others, or with far more methods than the budget or timeline can support.

A more systematic approach asks: which combination of methods is both complete (covers all questions at an adequate level) and minimal (uses as few methods, and as little resource, as possible)?

How the optimisation works

For a small number of methods and questions, you could check all possible combinations by hand. But the number of combinations grows exponentially — with ten methods, there are over a thousand possible subsets to evaluate. This is where an algorithm helps.

The simplest approach is a greedy algorithm: at each step, pick the method that covers the most currently-uncovered questions, then repeat until everything is covered. This is fast and usually finds a good solution, but not necessarily the best one.
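A minimal sketch of the greedy approach follows. It is not the Coverage Optimiser's own code; the function name, the three methods, and the six questions are made up, and "covers" is simplified to a yes/no relationship.

```python
def greedy_cover(coverage, universe):
    """Greedy set cover.

    coverage: maps each method to the set of questions it covers adequately.
    universe: all questions that need to be covered.
    Returns the methods chosen, in the order they were picked.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the method covering the most currently-uncovered questions.
        best = max(coverage, key=lambda m: len(coverage[m] & uncovered))
        gained = coverage[best] & uncovered
        if not gained:
            raise ValueError(f"No method covers: {sorted(uncovered)}")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Invented example in which the greedy choice is not the smallest one.
questions = {"Q1", "Q2", "Q3", "Q4", "Q5", "Q6"}
coverage = {
    "Survey": {"Q1", "Q2", "Q3", "Q4"},
    "Interviews": {"Q1", "Q2", "Q5"},
    "Case studies": {"Q3", "Q4", "Q6"},
}

chosen = greedy_cover(coverage, questions)
print(chosen)
```

Here the greedy rule picks Survey first because it covers four questions, and then still needs both other methods, ending up with three. Interviews plus Case studies would have covered everything with two, which illustrates why greedy gives a good answer but not necessarily the best one.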

A more thorough approach — exhaustive search — systematically checks all combinations up to a specified size and returns every minimal solution. This is slower but reveals the full landscape of equally-good options, which is often more useful than a single answer, particularly when cost or other practical constraints come into play.
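An exhaustive search of this kind can be sketched with the standard library's `itertools.combinations`. Again this is an illustration rather than the tool's implementation: the function name, the `max_size` parameter, and the example data are assumptions.

```python
from itertools import combinations


def minimal_covers(coverage, universe, max_size=None):
    """Return every smallest combination of methods that covers the universe.

    Combinations are checked in order of increasing size, so the first
    size at which any complete cover is found is the minimum.
    """
    universe = set(universe)
    methods = sorted(coverage)
    max_size = max_size or len(methods)
    for size in range(1, max_size + 1):
        found = [
            combo for combo in combinations(methods, size)
            if set().union(*(coverage[m] for m in combo)) >= universe
        ]
        if found:
            return found
    return []


# Invented example with two equally small solutions.
questions = {"Q1", "Q2", "Q3", "Q4", "Q5", "Q6"}
coverage = {
    "Survey": {"Q1", "Q2", "Q3", "Q4"},
    "Interviews": {"Q1", "Q2", "Q5"},
    "Case studies": {"Q3", "Q4", "Q6"},
    "Focus groups": {"Q3", "Q4", "Q6"},
}

solutions = minimal_covers(coverage, questions)
print(solutions)
```

Both two-method solutions are returned, which leaves the choice between them to be made on other grounds such as cost or feasibility.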

The Coverage Optimiser

With substantial coding help from Claude AI, I have been developing a small browser-based tool — the Coverage Optimiser — that applies both approaches to a user-defined matrix of methods and question types.

▶ Try the Coverage Optimiser [When you get there click on Introduction, to get suggestions on how to explore the functions of the app]

The default example matrix uses ten foresight methods (Scenario Planning, Delphi, Horizon Scanning, and others) rated against five question types — Descriptive, Valuative, Explanatory, Predictive, and Prescriptive — at HIGH, MEDIUM, or LOW in terms of their usefulness in building a futures perspective into an evaluation. The tool finds the minimal combinations of methods that achieve the desired coverage level across all question types.
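The step from a HIGH/MEDIUM/LOW matrix to a yes/no coverage relationship can be sketched as a simple threshold. The ratings below are invented and are not the tool's default values; the method names, the `LEVELS` mapping, and the function name are likewise hypothetical.

```python
LEVELS = {"LOW": 1, "MEDIUM": 2, "HIGH": 3}


def coverage_at(ratings, required="HIGH"):
    """For each method, the question types it covers at or above `required`."""
    threshold = LEVELS[required]
    return {
        method: {q for q, rating in row.items() if LEVELS[rating] >= threshold}
        for method, row in ratings.items()
    }


# Invented ratings for two placeholder methods against three question types.
ratings = {
    "Method A": {"Descriptive": "HIGH", "Predictive": "MEDIUM", "Prescriptive": "LOW"},
    "Method B": {"Descriptive": "LOW", "Predictive": "HIGH", "Prescriptive": "MEDIUM"},
}

at_high = coverage_at(ratings, "HIGH")
at_medium = coverage_at(ratings, "MEDIUM")
print(at_high)
print(at_medium)
```

Lowering the required level from HIGH to MEDIUM enlarges each method's coverage set, so fewer methods are typically needed to cover every question type.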

Each solution is displayed with:

the methods involved, each with its cost rating
a question-by-question coverage check
the total cost of the combination
an overlap score — the number of questions covered by more than one method in the solution

Overlap is worth attending to: it indicates redundancy, which in evaluation terms means resilience. If one method proves impractical in the field, a solution with higher overlap is more likely to remain viable.
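Following the definition above, an overlap score can be computed by counting the questions covered by more than one method in a solution. The function name and the example data are assumptions made for illustration.

```python
def overlap_score(solution, coverage):
    """Number of questions covered by more than one method in the solution."""
    counts = {}
    for method in solution:
        for question in coverage[method]:
            counts[question] = counts.get(question, 0) + 1
    return sum(1 for c in counts.values() if c > 1)


# Invented example data.
coverage = {
    "Survey": {"Q1", "Q2", "Q3", "Q4"},
    "Interviews": {"Q1", "Q2", "Q5"},
    "Case studies": {"Q3", "Q4", "Q6"},
}

lean = overlap_score(("Interviews", "Case studies"), coverage)
redundant = overlap_score(("Survey", "Interviews", "Case studies"), coverage)
print(lean, redundant)
```

The two-method solution has no overlap, so losing either method leaves questions uncovered. The three-method solution costs more but has four questions covered twice.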

The matrix itself is fully editable in a dedicated Matrix Editor tab. You can rename methods and question types, adjust HIGH/MEDIUM/LOW ratings on a chosen criterion, set importance weights for each question type (0–10), set cost ratings for each method (0–10), sort rows by any column, import and export as JSON, and print or save the matrix or results as PDF.

Note: The Coverage Optimiser runs entirely in your browser — no login required, no data sent anywhere.

Beyond methods and questions

The tool is not limited to foresight methods or evaluation questions. The underlying logic applies wherever you have a set of options and a set of requirements, and where each option partially addresses some requirements. Some other framings that could be loaded into the same tool:

Solutions × Problems: which combination of policy interventions covers the broadest range of identified problems?
Stakeholders × Information needs: which combination of engagement activities ensures all key stakeholder groups have their core information needs met?
Data sources × Indicators: which combination of data collection instruments covers all required indicators, at minimum cost?
Partners × Geographic areas: which combination of implementing partners ensures all target districts are reached?

In each case the matrix structure is the same, the optimisation logic is identical, and only the labels and rating criteria change.

An invitation to experiment

The tool is best understood by using it. I would encourage anyone planning a multi-method evaluation, or foresight exercise, to load in their own methods and questions — even with rough ratings to start — and see what combinations emerge. The exhaustive search mode is particularly useful for revealing that several equally-minimal combinations exist, which opens up a more deliberate conversation about which is preferable given cost, feasibility, or complementarity.

I am continuing to develop the tool and would welcome feedback on the matrix structure, the rating scales, or applications I have not yet considered.

Those who have used EvalC3 will find a family resemblance here: both tools use systematic search to find efficient combinations — EvalC3 searching for attribute combinations that predict specific individual outcomes, the Coverage Optimiser searching for method combinations that cover multiple requirements. The underlying computational logic is related, even if the problems look different on the surface.

Accessing the code: In addition to trying out the app, as it already exists online, you can also download a copy of the code and have it working independently on your own website. Go to the app online, right click your mouse, select View Page Source, copy the entire code, paste it into a txt file, rename the text file to end in .html not .txt, click on that file in a directory to open it up in a web browser. Simples, yes?

Rick Davies

Creative Commons