Rick On the Road: May 2026

Friend to Groucho Marx: “Life is difficult”

Groucho Marx to Friend: “Compared to what?”

Comparison is intrinsic to evaluation. Not only between one thing and another but, more often, between one category of things and another. Typologies are involved in many of the comparisons evaluators have to make (between people, activities, outcomes, locations, et cetera). Typologies can spring forth from our minds, but there are also systematic methods for developing them, broadly described as clustering methods. It is these systematic methods that I want to discuss here.

What types of clustering methods are there?

Starting from a single prompt, Claude identified 33 different clustering methods for me, organised into nine categories. These categories were defined largely by differences in the computational mechanisms and mathematical principles on which the methods operate. But a couple of the groups were organised using different criteria: who does the task (human or algorithm) and what the method is for.

I then did some experimentation of my own, getting Claude to cluster the methods in two different ways. The first is called a “maximum spanning tree”, which connects methods according to which method is most similar to which other method. You can see this in figure 1 below, which is best read by double-clicking on the image to get greater magnification. The second experiment used agglomerative hierarchical clustering to produce what is called a dendrogram, i.e. a tree structure displaying nested categories of methods. You can see this in figure 2 below, again probably best inspected by magnifying the image first. I like both of these, for reasons that will become clearer below.
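
For readers who want to see the mechanics, here is a minimal sketch of how a maximum spanning tree can be built from a table of pairwise similarities. It is not the code Claude used to produce figure 1. The four method names are taken from this post, but the similarity scores are hypothetical, and the function name is my own.

```python
def maximum_spanning_tree(names, similarity):
    """Kruskal's algorithm, taking edges from most to least similar."""
    edges = sorted(
        ((similarity[i][j], i, j)
         for i in range(len(names)) for j in range(i + 1, len(names))),
        reverse=True,
    )
    parent = list(range(len(names)))  # union-find forest, one root per group

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # this edge joins two groups without making a loop
            parent[root_i] = root_j
            tree.append((names[i], names[j], weight))
    return tree


# Hypothetical similarity scores (0 to 1) between four methods.
names = ["PCA", "Factor analysis", "Pile sorting", "Card sorting"]
similarity = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
tree = maximum_spanning_tree(names, similarity)
for a, b, weight in tree:
    print(f"{a} -- {b} (similarity {weight})")
```

With these made-up scores the tree keeps the two strongest links and then the single best bridge between the two pairs, which is the property that makes such a diagram easy to read.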

Figure 1: Maximum Spanning Tree

Figure 2: Dendrogram

Introducing the Text Cluster Analysis Lab
I developed this mini-app recently, in May this year, with very substantial help from Claude AI. You can view (and copy) the app by following this link. You will see that one of the eight workflow tabs builds a dendrogram.

How does it compare?

When I asked Claude to compare this app with the list of 33 clustering methods, it identified similarities with a number of methods, including hierarchical clustering, latent class analysis (LCA), and pile sorting. But its best-fitting answer was that the method is a hybrid: “Its uniqueness comes precisely from joining a human-sorting logic (criteria generated from the material, framed by the researcher's domain statement) to an algorithmic back end (binary scoring + agglomerative clustering). No single method in the set occupies that position.” The Lab's nearest relatives (hierarchical clustering, LCA, and pile/card sorting) sit in three different families, i.e. on separate branches of the spanning tree and dendrogram. In this respect the Lab can claim novelty. Hopefully the basis of this claim will become clearer as I explain the Lab in some detail.

The inputs

By Claude's estimate, around 75% of cluster analysis methods work with quantitative data only; the rest work with text only (15%) or with either text or numbers (10%). The Lab works with multiple bodies of text. More specifically, the focus of the Lab is on similarities and differences between texts, not on internal structure within a text, as is the case with much thematic coding. The texts I have used while testing the mini-app are a set of storylines about alternative futures, collaboratively developed by participants in a ParEvo.org exercise. I also have plans to analyse a set of Most Significant Change stories.

Another important input is the choices users make about various settings, in each of the eight tabs making up the staged workflow. These include what version of Claude to use at various stages of the analysis, the temperature setting for each stage, and most importantly, the precise wording of the prompt that Claude will respond to, which tells it what to look for. In the Reliability and Cluster tabs choices are made about which identified sorting criteria to include, and what clustering settings to use. All these choices are important, as will be discussed below. They are what makes this approach hybrid: part human, part automated processes.

The outputs

The first is a list of attributes that differentiate one group of texts from another, together with a matrix in which rows list the texts, columns list the attributes, and cells record the presence or absence of each attribute in each text. The text attributes are identified by a Claude AI search and comparison of the texts' contents, operating within user-defined context and scope settings. Attributes that fail to discriminate between the texts (those present in all of them or none of them) are filtered out automatically, because they contribute nothing to distinguishing one text from another.
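
The filtering rule just described can be sketched in a few lines. This is an illustration of the rule only, not the Lab's own code; the storyline labels, the attribute names, and the function name are all hypothetical.

```python
def drop_non_discriminating(rows):
    """Keep only attributes that are present in some texts and absent in others.

    rows: {text_id: {attribute: 0 or 1}}
    """
    attributes = list(next(iter(rows.values())))
    kept = []
    for attr in attributes:
        values = {row[attr] for row in rows.values()}
        if len(values) > 1:  # both 0 and 1 occur, so the attribute separates texts
            kept.append(attr)
    return kept


# Hypothetical text-by-attribute matrix for three storylines.
rows = {
    "Storyline 1": {"describes a future": 1, "technology-led": 1, "set on Mars": 0},
    "Storyline 2": {"describes a future": 1, "technology-led": 0, "set on Mars": 0},
    "Storyline 3": {"describes a future": 1, "technology-led": 1, "set on Mars": 0},
}
kept = drop_non_discriminating(rows)
print(kept)
```

Here the first attribute is present in every storyline and the third in none, so only the second survives.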

A second matrix records the results of a "back-translation" type of rater reliability test. This enables users to drop attributes that cannot be reliably identified and to spot texts whose analyses are less reliable than others.
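
How such a test might be summarised can be sketched as follows. A loud caveat: this post does not spell out how the Lab scores its back-translation test, so the sketch simply assumes the test yields a second set of 0/1 ratings that can be compared cell by cell with the first. The data and function names are hypothetical.

```python
def agreement_by_attribute(first, second):
    """Share of texts on which the two ratings of each attribute agree."""
    texts = list(first)
    attributes = list(first[texts[0]])
    return {
        attr: sum(first[t][attr] == second[t][attr] for t in texts) / len(texts)
        for attr in attributes
    }


def agreement_by_text(first, second):
    """Share of attributes on which the two ratings of each text agree."""
    return {
        t: sum(first[t][a] == second[t][a] for a in first[t]) / len(first[t])
        for t in first
    }


# Hypothetical first ratings and second (back-translated) ratings.
first = {
    "T1": {"A": 1, "B": 0}, "T2": {"A": 0, "B": 1},
    "T3": {"A": 1, "B": 1}, "T4": {"A": 0, "B": 0},
}
second = {
    "T1": {"A": 1, "B": 1}, "T2": {"A": 0, "B": 1},
    "T3": {"A": 1, "B": 0}, "T4": {"A": 0, "B": 0},
}
by_attribute = agreement_by_attribute(first, second)
by_text = agreement_by_text(first, second)
print(by_attribute)
print(by_text)
```

Under this assumption, attribute B (agreement on only half the texts) would be a candidate for removal, and texts T1 and T3 would be flagged as less reliably analysed.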

The third output is a dendrogram, representing a nested classification of the texts, using the user's selection of relevant text attributes. This is built using an agglomerative process, firstly finding pairs of texts which are most similar, then pairs of pairs of texts which are most similar, et cetera. When building the tree, the user also chooses how similarity between texts is measured (using either a Hamming or a Jaccard distance) and how clusters are linked together as they merge. Each branch of the tree structure includes tool-tip information on how the attributes of that cluster of texts differ from those of its sibling branch. This is systematically identified by the app, and humanly verifiable. Unlike the cluster descriptions produced by a number of other clustering processes, it is not a subjective judgement.
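
The agglomerative process can be illustrated with a bare-bones sketch. It is not the Lab's implementation: it assumes a raw Hamming count as the distance and average linkage as the merging rule (one of several possible linkage choices), and the four texts and their attribute scores are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of attributes on which two texts differ."""
    return sum(x != y for x, y in zip(a, b))


def agglomerate(vectors):
    """Repeatedly merge the two closest clusters; return the merge history.

    vectors: {text_id: tuple of 0/1 attribute scores}
    """
    clusters = [(name,) for name in vectors]
    merges = []
    while len(clusters) > 1:

        def average_distance(pair):
            a, b = pair
            total = sum(hamming(vectors[x], vectors[y]) for x in a for y in b)
            return total / (len(a) * len(b))

        a, b = min(combinations(clusters, 2), key=average_distance)
        merges.append((a, b, average_distance((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical attribute scores for four texts.
vectors = {
    "T1": (1, 1, 0, 0),
    "T2": (1, 1, 0, 1),
    "T3": (0, 0, 1, 1),
    "T4": (0, 0, 1, 0),
}
merges = agglomerate(vectors)
for left, right, distance in merges:
    print(left, "+", right, "at distance", distance)
```

The merge history is exactly what a dendrogram draws: the two tight pairs join first, and the pair of pairs joins last at a much greater distance.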

The fourth output is an open-ended Claude chat-type query facility, where the user can ask questions about a cluster of texts: (a) on its own, (b) in comparison to its sibling cluster in the dendrogram, or (c) in comparison to the whole set of texts. Each of these can be done with or without uploading additional context information.

The fifth output is a set of exportable products, including a detailed provenance statement describing how all the products have been produced, a listing of all the identified text attributes, a copy of the text-by-attributes matrix, the rater reliability assessment, a copy of the dendrogram, a copy of any query dialogue, and a breakdown of token use and token costs for each stage of each analysis. The details of the analysis process, including settings and contents generated, can also be saved as a JSON file and reimported for reuse.

A sixth output supports the choice of criteria itself. Because the criteria matter more than any other input choice (they define the matrix from which everything else follows) it is worth being able to compare different sets of them and being able to choose deliberately from within these, rather than accepting whatever a single run happens to produce. The Selection tab lets you place the criteria from two or more runs side by side, each shown with quantitative measures of its performance: how reliably it was identified, how many texts it applies to, and how much its work overlaps that of the other criteria. Alongside these per-criterion figures are measures for the set as a whole, including its overall reliability and a measure of how well the set as a whole tells the texts apart. From this you can hand-pick a set of criteria drawn from across the runs, add criteria of your own wording if a distinction you care about was missed, and then re-score the texts against just that curated set. The result is a new analysis like any other, which can be clustered, reliability-tested, and queried in turn. A short built-in guide explains how these measures relate to one another — for instance, why a criterion that applies to very few texts can look misleadingly distinctive, or why minimising overlap is not always the right aim — and points to the wider feature-selection literature for those who want to follow the ideas further.
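
Two of the per-criterion measures mentioned above can be illustrated very simply. The definitions here are my own simplified assumptions, not necessarily the formulas the Selection tab uses: coverage is taken as the number of texts a criterion applies to, and overlap as the share of texts on which two criteria give the same score. The criterion names and data are hypothetical.

```python
def coverage(rows, criterion):
    """Number of texts the criterion applies to."""
    return sum(row[criterion] for row in rows.values())


def overlap(rows, first, second):
    """Share of texts on which two criteria give the same score."""
    same = sum(row[first] == row[second] for row in rows.values())
    return same / len(rows)


# Hypothetical scores for four texts on three criteria.
rows = {
    "T1": {"state-led": 1, "top-down": 1, "ends in conflict": 0},
    "T2": {"state-led": 1, "top-down": 1, "ends in conflict": 0},
    "T3": {"state-led": 0, "top-down": 0, "ends in conflict": 0},
    "T4": {"state-led": 0, "top-down": 0, "ends in conflict": 1},
}
print(coverage(rows, "ends in conflict"))
print(overlap(rows, "state-led", "top-down"))
print(overlap(rows, "state-led", "ends in conflict"))
```

In this made-up case the first two criteria duplicate each other completely, while the third applies to a single text, the kind of criterion that can look misleadingly distinctive.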

For much more detail, go to the Introduction tab on the Lab site.

If this is the solution, what was the problem?

Eight of the nine clustering categories (29 of the 33 methods) identified by Claude can produce identifiable clusters of texts through replicable, transparent, deterministic processes — but then require a subjective human judgement to name and describe those clusters. That seems weirdly self-contradictory: a rigorous sorting process handed off, at the last step, to an undocumented act of interpretation. The remaining four methods (pile sorting, card sorting, Q Methodology, and Repertory Grid) are subjective throughout, because of their ethnographic orientation, so there is not such a visible internal contradiction.

A second problem concerns the interpretability of the dimensions within which clusters are located. At least eight of the methods rely on abstract derived axes — dimensions that exist and could in principle be examined, but which are statistical composites a non-statistician can't readily interpret (PCA, Factor analysis, ICA, t-SNE, UMAP, MDS, Spectral clustering, and SOM). The difficulty is not just that these axes are hard to read; it is that many of them can't be explained simply to a non-specialist without misrepresenting what they are.

A third problem is unaccountable input choices, such as the number of clusters, topics, classes, or dimensions to be identified. Yes, it is true that every method involves choices, including the Lab. But what matters is not the number of choices but how defensible each one is. Choices can differ in at least four ways: whether the choice is visible in the output or buried behind it; whether it is checkable against some standard after the fact; whether a poor choice fails visibly rather than silently; and whether it is documented. A choice that is buried, uncheckable, silently failing, and undisclosed is the least defensible; one that is visible, checkable, self-signalling, and recorded is the most.

My assessment

In summary, the Lab addresses the naming problem and the dimensions problem directly, and improves on, without completely escaping, the input-choices problem.

On naming, the dendrogram's tooltips identify how each cluster differs from its sibling by reporting which attributes it has that the sibling lacks. This is read straight off the matrix; it is systematic and humanly verifiable, not a subjective label imposed after the fact. In addition, the query facility can be used to generate names for the clusters based on their contents, which can then be tested for their reliability through another form of back translation. But it should be remembered that this is a reliability test, not a validity test. That is, a back-translation can confirm that a name is applied consistently across the texts, but not that it is the right name for what they share. That needs human checking.
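
What "read straight off the matrix" means can be shown with a small sketch. The exact rule the Lab's tooltips apply is not given in this post, so the sketch assumes the strictest version: an attribute is reported only if every text in the cluster has it and no text in the sibling does. The storylines, attribute names, and function name are hypothetical.

```python
def sibling_differences(rows, cluster, sibling):
    """Attributes present in every text of `cluster` and in no text of `sibling`."""
    attributes = list(next(iter(rows.values())))
    return [
        attr for attr in attributes
        if all(rows[t][attr] == 1 for t in cluster)
        and all(rows[t][attr] == 0 for t in sibling)
    ]


# Hypothetical text-by-attribute matrix.
rows = {
    "Storyline 1": {"technology-led": 1, "state-led": 0, "ends in conflict": 1},
    "Storyline 2": {"technology-led": 1, "state-led": 0, "ends in conflict": 0},
    "Storyline 3": {"technology-led": 0, "state-led": 1, "ends in conflict": 0},
    "Storyline 4": {"technology-led": 0, "state-led": 1, "ends in conflict": 1},
}
left = ["Storyline 1", "Storyline 2"]
right = ["Storyline 3", "Storyline 4"]
print(sibling_differences(rows, left, right))
print(sibling_differences(rows, right, left))
```

Anyone can check the result by looking at the four rows, which is the sense in which the description is verifiable rather than interpretive.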

On interpretability of dimensions, the dendrogram works at two levels. Its tree structure (a nested classification of groups within groups) is intuitively understandable to most readers without any statistical background; they can see directly which texts joined first and which groups merged later. Its horizontal axis, representing degree of similarity, is less immediately self-explanatory, but it can be explained in one plain sentence: similarity is the degree to which two texts share the same set of attributes. That similarity is computed using one of two standard measures: Hamming or Jaccard distance.
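
The difference between the two measures is easiest to see in a worked example. The sketch below uses the standard definitions (Hamming as the share of attributes on which two texts differ, Jaccard as one minus the share of jointly present attributes among those present in either text); the two example texts are hypothetical.

```python
def hamming_distance(a, b):
    """Share of attributes on which the two texts differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jaccard_distance(a, b):
    """1 minus (attributes present in both / attributes present in either)."""
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    return 0.0 if either == 0 else 1 - both / either


# Two hypothetical texts that share three absent attributes and nothing else.
text_a = (1, 0, 0, 0, 0)
text_b = (0, 1, 0, 0, 0)
print(hamming_distance(text_a, text_b))
print(jaccard_distance(text_a, text_b))
```

Hamming treats a shared absence as a point of agreement, so these two texts look fairly close; Jaccard ignores shared absences, so they look as far apart as possible. Which is more appropriate depends on whether the absence of an attribute is informative in the material at hand.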

On input choices, as noted above, the Lab involves many of these, as other methods do. The dendrogram's similarity and linkage choices are inherited from its hierarchical-clustering parentage and demand a similar degree of technical knowledge to those of other methods. But its single most consequential input choice (the criteria used to rate the texts) is much more defensible, in four ways: the criteria are visible (they are the matrix columns, in plain language), checkable (the reliability test back-translates them and flags unreliable ones), self-signalling (a non-discriminating criterion is filtered out automatically rather than quietly distorting the result), and documented (they are recorded in the provenance statement).

More recently, with the development of the Selection tab, the criteria are now also improvable: alternative sets of criteria can be compared on these measures and a better one converged upon, rather than living with whatever a single run produced. The Lab does not remove the input-choices problem, but its strengths lie where it most matters, where the choice can actually be inspected and challenged. The criteria matter most because they define the matrix from which everything else follows: the dendrogram, the sibling comparisons, and the reliability test are all computed from the presence and absence of those attributes, so a choice made well or badly there propagates through every later step.

Returning to the bigger picture

One other criterion for assessing cluster analysis methods is utility or usefulness. A dendrogram is a much more useful structure than a simple list of clusters, each of which is different from the others. A dendrogram is a nested set of groups and subgroups, and as such can provide macro, meso, and micro level perspectives on the entities that have been grouped.

In addition, the binary branching structure means that at each branching point a pair comparison can be made. Pair comparisons, as distinct from comparisons of large numbers of entities, are most useful when the entities being compared are multidimensional. This is because the mind can only hold so many differences in view at once: comparing two entities that differ on many dimensions is manageable, but comparing ten such entities simultaneously is not, because the number of differences to track at once becomes overwhelming. The fewer dimensions of difference between entities, the more of them that can be compared at any one time. The constraint is cognitive, a limit on human attention, rather than mathematical.

Comparisons of sibling groups in the dendrogram can vary in the type of question being asked. They can be quantitative, as in which of these is more or less x, or qualitative, as in how is this group different from that group in terms of x. They can be retrospective and evaluative, or prospective and planning oriented, and it is this second, forward-looking use that turns the structure from a record of how things are into a guide for what to do next. Most importantly, a dendrogram can be used as a kind of decision tree, not simply seen as a passive classification. It can be used by humans, with or without the assistance of AI models like Claude.

What's next?

When I first wrote this, the open question was optimisation: if the criteria are the most consequential choice, can the Lab help you make that choice well rather than just make it visible? Three things seemed worth optimising — the reliability of the set of criteria, their resolution (their ability to tell the texts apart), and the number of criteria needed to do so. I noted that the last two might be in tension: you can always tell more texts apart by adding more criteria, so a high-resolution set achieved lazily, by piling criteria on, is no great achievement.

The Selection tab, described above, now addresses the first two directly — it puts reliability and resolution in front of you as measures you can read and act on. Looking into the near future, I think the tension between resolution and parsimony will be best handled not by having the tool pick an answer, but by making the trade-off visible: for a set of criteria you have curated, how few of them do you actually need, and what does each additional one buy you in resolution? A set might reach ninety per cent of its discriminating power with four criteria and crawl to a hundred only by adding six more — in which case the four may be the better choice, depending on what those criteria mean. That is a frontier of diminishing returns, and seeing its shape is more useful than being handed a single "optimal" set, because the choice of where to stop on it is a judgement about the criteria's meaning that only the analyst can make. A specification for this exists; building it is the next step.
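
Since this feature is not yet built, the following is only a sketch of what such a frontier could look like, not the specification mentioned above. It assumes resolution means the share of text pairs that differ on at least one chosen criterion, and it adds criteria greedily, one at a time, by whichever buys the most resolution. Both the definition and the greedy strategy are my assumptions, and the data are hypothetical.

```python
from itertools import combinations


def resolution(rows, chosen):
    """Share of text pairs that differ on at least one chosen criterion."""
    pairs = list(combinations(rows, 2))
    told_apart = sum(
        any(rows[a][c] != rows[b][c] for c in chosen) for a, b in pairs
    )
    return told_apart / len(pairs)


def greedy_frontier(rows, criteria):
    """Add criteria one at a time, recording the resolution reached at each step."""
    chosen, curve = [], []
    remaining = list(criteria)
    while remaining:
        best = max(remaining, key=lambda c: resolution(rows, chosen + [c]))
        gain = resolution(rows, chosen + [best]) - resolution(rows, chosen)
        if gain <= 0:  # nothing left adds any discriminating power
            break
        chosen.append(best)
        remaining.remove(best)
        curve.append((best, resolution(rows, chosen)))
    return curve


# Hypothetical scores for four texts on three criteria.
rows = {
    "T1": {"A": 1, "B": 1, "C": 0},
    "T2": {"A": 1, "B": 0, "C": 0},
    "T3": {"A": 0, "B": 1, "C": 0},
    "T4": {"A": 0, "B": 0, "C": 1},
}
curve = greedy_frontier(rows, ["A", "B", "C"])
for criterion, reached in curve:
    print(criterion, round(reached, 2))
```

In this toy case two criteria tell every pair of texts apart and the third adds nothing. A greedy search like this is not guaranteed to find the smallest possible set, which is one more reason to show the curve to the analyst rather than present its end point as the answer.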

There is a wider point here worth recording. This problem (finding the smallest set of things that adequately covers a set of requirements) is a recognised one, the set covering problem, and I have met it before in a different guise: an earlier tool of mine, the Coverage Optimiser, searches for the smallest combination of foresight methods that covers a set of evaluation questions. Choosing the fewest criteria that still tell a set of texts apart is the same shape of problem, transposed. It also has close relatives in Qualitative Comparative Analysis and in the feature-selection methods used in machine learning. The Lab's expected contribution is not to solve the set covering problem automatically, but to make the available solution(s) something an evaluator can see, inspect, and decide on.
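
For readers unfamiliar with the set covering problem, here is a tiny worked instance. It is a generic brute-force illustration and makes no claim about how the Coverage Optimiser works; the method labels and question labels are hypothetical placeholders.

```python
from itertools import combinations


def smallest_cover(options, requirements):
    """Return the first smallest combination of options that covers every requirement.

    options: {option_name: set of requirements that option covers}
    Returns None if even all options together fall short.
    """
    for size in range(1, len(options) + 1):
        for combo in combinations(sorted(options), size):
            covered = set().union(*(options[name] for name in combo))
            if requirements <= covered:
                return combo
    return None


# Hypothetical: which evaluation questions each method can address.
options = {
    "Method A": {"Q1", "Q2"},
    "Method B": {"Q2", "Q3"},
    "Method C": {"Q3"},
    "Method D": {"Q1", "Q3", "Q4"},
}
requirements = {"Q1", "Q2", "Q3", "Q4"}
cover = smallest_cover(options, requirements)
print(cover)
```

Trying every combination like this is only practical for small numbers of options, because the number of combinations grows very quickly; the set covering problem is NP-hard in general. There can also be more than one smallest cover, which is why the plural "solution(s)" matters.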

The patron saint of evaluation?

Footnote: The term "decision tree" can mean two related but distinct things. In machine learning it is an algorithm used to induce prediction rules from data, to classify cases or predict values (the classification and regression-tree family). In decision analysis and operations research, the older sense, it is a decision-support model in which people map out choices, chance events, and their consequences to compare the expected value or utility of competing options. It is this second, human-facing sense I am referring to when suggesting that a dendrogram can be read as a decision tree: not a model that makes the decision, but a structure that lays out the comparisons a decision-maker can make, at macro, meso and micro levels of detail. See Wikipedia, "Decision tree" and "Decision tree learning" and The Decision Lab, "Decision Tree Analysis."

Sunday, May 31, 2026

Cluster analysis for evaluation purposes, and how a hybrid Human-LLM approach can help

Rick Davies

Creative Commons