Sunday, May 31, 2026

Cluster analysis for evaluation purposes, and how a hybrid Human-LLM approach can help

Friend to Groucho Marx: “Life is difficult”

Groucho Marx to Friend: “Compared to what?”


Comparison is intrinsic to evaluation. Not only between one thing and another, but more often, between one category of things and another. Typologies are involved in many of the comparisons evaluators have to make (between people, activities, outcomes, locations, et cetera). Typologies can spring forth from our minds, but there are also systematic methods for developing them, broadly described as clustering methods. It is these systematic methods that I want to discuss here.


What types of clustering methods are there?


Using one prompt to start off with, Claude identified for me 33 different clustering methods, which were organised into 9 different categories. These categories were defined largely on the basis of differences in computational mechanisms and mathematical principles which are involved in the operation of the clustering methods. But there were a couple of groups which were organised using different criteria relating to who does the task (human or algorithm) and what the method is for. 


I then did some experimentation of my own, getting Claude to cluster the methods, using two different clustering methods. One is called a “maximum spanning tree”, which connects methods according to which method is most similar to which other method. You can see this in figure 1 below, which is best read by double-clicking on the image to get greater magnification. The second experiment used agglomerative hierarchical clustering to produce what is called a dendrogram, i.e. a tree structure displaying nested categories of methods. You can see this in figure 2 below, again probably best inspected by magnifying the image first. I also experimented with some other types of classifications of clustering methods, but I won’t go into detail on those here.

    

Figure 1: Maximum Spanning Tree
Figure 2: Dendrogram
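
For readers curious about how a maximum spanning tree is built, here is a minimal sketch in Python. It is not the procedure Claude used to produce Figure 1, which I have not inspected; it is simply the standard Kruskal approach, taking the most similar pairs first and skipping any link that would close a loop. The item labels A to D and their similarity scores are invented for illustration.

```python
def maximum_spanning_tree(names, similarity):
    """Kruskal's algorithm, considering the most similar pairs first.

    names: list of item labels.
    similarity: dict mapping frozenset({a, b}) -> similarity score.
    Returns a list of (a, b, score) links forming the tree.
    """
    parent = {n: n for n in names}

    def find(n):
        # Follow parent pointers to the representative of n's group.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Highest similarity first.
    edges = sorted(
        ((score, *sorted(pair)) for pair, score in similarity.items()),
        key=lambda e: -e[0],
    )
    tree = []
    for score, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:  # linking a and b does not create a loop
            parent[ra] = rb
            tree.append((a, b, score))
    return tree


# Hypothetical items and similarity scores, for illustration only.
names = ["A", "B", "C", "D"]
similarity = {
    frozenset({"A", "B"}): 0.9,
    frozenset({"A", "C"}): 0.5,
    frozenset({"B", "C"}): 0.4,
    frozenset({"A", "D"}): 0.1,
    frozenset({"B", "D"}): 0.2,
    frozenset({"C", "D"}): 0.15,
}
tree = maximum_spanning_tree(names, similarity)
```

In this toy case the B to C link is dropped because A already connects both, which is why a spanning tree shows each item's strongest connection rather than every connection.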

 

Introducing the Text Cluster Analysis Lab

I developed this recently, in May this year, with very substantial help from Claude AI.  
You can take a quick preview of the app by following this link. 


How does it compare? When I asked Claude to compare this app with the list of 33 clustering methods, it identified similarities with a number of methods, including hierarchical clustering, latent class analysis, and pile sorting. But its best fitting answer was that the method is a hybrid. “Its uniqueness comes precisely from joining a human-sorting logic (criteria generated from the material, framed by the researcher's domain statement) to an algorithmic back end (binary scoring + agglomerative clustering). No single method in the set occupies that position”. The Lab's nearest relatives — hierarchical clustering, LCA, pile/card sorting — sit in three different families i.e. separate branches of the spanning tree and dendrogram. In this respect the Lab can claim novelty.


If this is the solution, what was the problem? Eight of the nine clustering categories (29 of the 33 methods) identified by Claude can produce identifiable clusters of texts through replicable, transparent, deterministic processes — but then require a subjective human judgement to name and describe those clusters. That has always struck me as oddly self-contradictory: a rigorous sorting process handed off, at the last step, to an unrecorded act of interpretation. The remaining four methods (pile sorting, card sorting, Q Methodology, and Repertory Grid) are subjective throughout, by virtue of their ethnographic orientation, so the contradiction does not arise for them in the same way.


A second problem concerns the interpretability of the dimensions within which clusters are graphically located. At least eight of the methods rely on *abstract derived axes* — dimensions that exist and could in principle be examined, but which are statistical composites a non-statistician cannot readily interpret (PCA, Factor analysis, ICA, t-SNE, UMAP, MDS, Spectral clustering, and SOM). The difficulty is not merely that these axes are hard to read; it is that several of them cannot be explained to a non-specialist in plain language without misrepresenting what they are.


A third problem concerns unaccountable input choices — the number of clusters, topics, classes, or dimensions to be identified. Here I think a blanket criticism is too easy, and too easily deflected, because every method involves choices and so, of course, does the Lab. What matters is not how many choices a method demands but how defensible each one is, and choices differ on four counts: whether the choice is *visible* in the output or buried behind it; whether it is *checkable* against some standard after the fact; whether a poor choice *fails visibly* rather than silently; and whether it is *documented*. A choice that is buried, uncheckable, silently failing, and undisclosed is the least defensible; one that is visible, checkable, self-signalling, and recorded is the most.


Before turning to how the Lab fares against these three problems, it is worth noting that the assessment below draws on several of its features — a text-by-attribute matrix, a reliability test, a dendrogram with comparative tooltips, automatic filtering of weak attributes, and a provenance record. Each is described more fully in the features section further down; here I refer to them only as far as the argument requires.


My assessment is that the Lab addresses the naming problem and the dimensions problem directly, and improves on — without escaping — the input-choices problem.


On naming, the dendrogram's tooltips identify how each cluster differs from its sibling by reporting which attributes it has that the sibling lacks. This is read straight off the matrix; it is systematic and humanly verifiable, not a subjective label imposed after the fact.
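
The logic of that comparison can be sketched in a few lines of Python. This is an illustration under my own assumptions, not the Lab's actual code: I assume here that a cluster "has" an attribute when every text in it has that attribute, and that its sibling "lacks" it when none of its texts do. The text names and attribute labels are hypothetical.

```python
def distinguishing_attributes(matrix, cluster, sibling):
    """Attributes held by every text in `cluster` and by no text in `sibling`.

    matrix: dict mapping text name -> set of attributes present in that text.
    """
    held_by_all = set.intersection(*(matrix[t] for t in cluster))
    held_by_any_sibling = set.union(*(matrix[t] for t in sibling))
    return sorted(held_by_all - held_by_any_sibling)


# Hypothetical texts and attributes, for illustration only.
matrix = {
    "T1": {"mentions budget", "names a donor"},
    "T2": {"mentions budget", "names a donor", "reports outcomes"},
    "T3": {"reports outcomes", "cites beneficiaries"},
    "T4": {"reports outcomes"},
}
left = distinguishing_attributes(matrix, ["T1", "T2"], ["T3", "T4"])
right = distinguishing_attributes(matrix, ["T3", "T4"], ["T1", "T2"])
```

Under this strict rule the first cluster is distinguished by two attributes, while the second has none, because "reports outcomes" also appears in T2. A looser rule (for example, "most texts" rather than "all texts") would give different answers, which is why the rule itself needs to be stated.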


On interpretability of dimensions, the dendrogram works at two levels. Its tree structure — a nested classification of groups within groups — is intuitively understandable to most readers without any statistical background; one can see directly which texts joined first and which groups merged later. Its horizontal axis, representing degree of similarity, is less immediately self-explanatory, but it can be explained honestly in one plain sentence: similarity is the degree to which two texts share the same set of attributes. (For those who want the mechanics, that similarity is computed using one of two standard measures, Hamming or Jaccard distance, both of which are straightforward to explain given a little more space.) This is the categorical difference from the derived-axis methods: the Lab's axis has a truthful plain-language gloss, whereas a principal component or a t-SNE coordinate does not.
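
For those who do want the mechanics, here is a minimal sketch of the two measures in Python, applied to rows of a presence/absence matrix. The function names are mine, and I am assuming the Hamming distance is expressed as a proportion of attributes; the Lab may scale it differently.

```python
def hamming_distance(a, b):
    """Proportion of attributes on which two texts disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jaccard_distance(a, b):
    """1 minus (attributes both have) / (attributes at least one has)."""
    both = sum(x and y for x, y in zip(a, b))
    either = sum(x or y for x, y in zip(a, b))
    return 0.0 if either == 0 else 1 - both / either


# Two hypothetical texts rated on six attributes (1 = present, 0 = absent).
text_a = [1, 1, 0, 0, 0, 0]
text_b = [1, 0, 0, 0, 0, 0]

h = hamming_distance(text_a, text_b)   # disagree on 1 of 6 attributes
j = jaccard_distance(text_a, text_b)   # share 1 of the 2 attributes either has
```

The two measures differ in how they treat shared absences. Hamming counts two texts that both lack an attribute as agreeing; Jaccard ignores those cases and looks only at attributes that at least one of the texts has. In the example this makes the same pair of texts look close by Hamming and only half-similar by Jaccard.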


On input choices, the Lab makes a comparable number to other methods — the distance metric, the linkage method, the cut point for the number of clusters, the domain framing, and the temperature. It would be dishonest to claim it sidesteps the problem. Its metric and linkage choices, in particular, are inherited from its hierarchical-clustering parentage and are exactly the buried, weakly-checkable kind I have just criticised elsewhere. But its single most consequential input choice — the criteria used to rate the texts — sits in the opposite, defensible quadrant on all four counts: the criteria are visible (they are the matrix columns, in plain language), checkable (the reliability test back-translates them and flags unreliable ones), self-signalling (a non-discriminating criterion is filtered out automatically rather than quietly distorting the result), and documented (they are recorded in the provenance statement). The Lab does not remove the input-choices problem; it relocates the choice that matters most into the part of that space where the choice can actually be inspected and challenged.


The features of the Lab, in more detail.  Having referred to several of these already, here I describe the full set — the inputs the Lab works on, and the five kinds of output it produces.


The inputs: These are multiple bodies of text. By Claude's estimate around 75% of cluster analysis methods work with quantitative data only, while the rest work with text only (15%) or with either text or numbers (10%). More specifically, the focus of the Lab is on similarities and differences between texts, not on internal structure within a text (as is the case with much thematic coding).



The outputs: There are five types. The first is a matrix, where rows describe the texts, columns describe attributes of those texts, and cells record the presence or absence of each attribute. The attributes are identified by a Claude AI search and comparison of the texts' contents, operating within a user-defined context and scope. Attributes that fail to discriminate between the texts (those present in all of them or none of them) are filtered out automatically, since they contribute nothing to distinguishing one text from another. The completed matrix is supported by a full description of each text attribute (not shown below).
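
The filtering rule is simple enough to show in a short Python sketch. This illustrates the rule as described above, not the Lab's own code, and the attribute labels are hypothetical.

```python
def drop_non_discriminating(matrix, attributes):
    """Remove attributes that are present in all texts or in none.

    matrix: list of rows, one per text, each a list of 0/1 values.
    attributes: list of column labels, in the same order as the row values.
    """
    n_texts = len(matrix)
    keep = [
        j for j in range(len(attributes))
        if 0 < sum(row[j] for row in matrix) < n_texts
    ]
    kept_attributes = [attributes[j] for j in keep]
    kept_matrix = [[row[j] for j in keep] for row in matrix]
    return kept_matrix, kept_attributes


# Hypothetical attributes and ratings for three texts.
attributes = ["is a report", "mentions budget", "names a donor", "written in verse"]
matrix = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
]
kept_matrix, kept_attributes = drop_non_discriminating(matrix, attributes)
```

Here "is a report" is dropped because every text has it, and "written in verse" is dropped because none does. Only the two middle columns can tell one text from another.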


A second matrix is constructed as the results of a "back-translation" type of rater reliability test. This enables users to remove from use those attributes which are unreliably identifiable and to identify texts whose analyses are less reliable than others.
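
How such a second matrix might be summarised can be sketched as follows. I should stress the assumptions: this does not reproduce the Lab's back-translation procedure, it only shows one simple way of comparing a first set of ratings with a second, by calculating the rate of agreement for each attribute and for each text. The ratings are invented.

```python
def agreement_rates(first, second):
    """Compare two presence/absence matrices of the same shape.

    Returns (by_attribute, by_text): the proportion of matching cells
    in each column and in each row.
    """
    n_texts, n_attrs = len(first), len(first[0])
    by_attribute = [
        sum(first[i][j] == second[i][j] for i in range(n_texts)) / n_texts
        for j in range(n_attrs)
    ]
    by_text = [
        sum(first[i][j] == second[i][j] for j in range(n_attrs)) / n_attrs
        for i in range(n_texts)
    ]
    return by_attribute, by_text


# Hypothetical ratings: four texts (rows) by three attributes (columns).
first = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
second = [[1, 0, 0], [0, 1, 1], [1, 1, 1], [0, 0, 1]]
by_attribute, by_text = agreement_rates(first, second)
```

In this example the third attribute agrees in only half the texts, so it would be a candidate for removal, and the first and third texts are rated less consistently than the others.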


The third output is a dendrogram, representing a nested classification of the texts using the user's selection of relevant text attributes. This is built using an agglomerative process, firstly finding pairs of texts which are most similar, then pairs of pairs of texts which are most similar, et cetera. When building the tree, the user also chooses how similarity between texts is measured (using either a Hamming or a Jaccard distance) and how clusters are linked together as they merge. These are genuine analyst choices, and ones the app is comparatively quiet about, a point I raised in the assessment above. Each branch of the tree structure includes tool-tip information on how the attributes of that cluster of texts differ from those of its sibling branch. This is systematically identified by the app, and humanly verifiable. Unlike the naming of clusters in a number of other types of clustering processes, it is not a subjective judgement.
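
The agglomerative process can be sketched in plain Python. This is a bare-bones illustration, not the Lab's implementation: I have assumed a raw Hamming count as the distance and average linkage as the rule for joining clusters, and both are only one of the options a user could choose. The texts and ratings are invented.

```python
from itertools import combinations


def hamming(a, b):
    """Number of attributes on which two texts disagree."""
    return sum(x != y for x, y in zip(a, b))


def agglomerate(rows):
    """Repeatedly merge the two closest clusters until one remains.

    rows: dict mapping text name -> list of 0/1 attribute values.
    Returns the merges in order, as (cluster_1, cluster_2, distance).
    """
    def linkage(c1, c2):
        # Average linkage: mean distance over all cross-cluster pairs.
        total = sum(hamming(rows[a], rows[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in rows]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: linkage(*p))
        merges.append((c1, c2, linkage(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Hypothetical ratings of four texts on four attributes.
rows = {
    "T1": [1, 1, 0, 0],
    "T2": [1, 1, 0, 1],
    "T3": [0, 0, 1, 1],
    "T4": [0, 0, 1, 0],
}
merges = agglomerate(rows)
```

The list of merges is, in effect, the dendrogram read from the leaves towards the root: T1 joins T2, T3 joins T4, and the two pairs join last at a much greater distance. Changing the distance measure or the linkage rule can change that order, which is why those choices matter.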


The fourth output is an open-ended Claude chat-type query facility, where the user can ask questions about a cluster of texts on their own, or in comparison to their sibling cluster in the dendrogram, or to the whole set of texts.


The fifth output is a set of exportable products, including a detailed provenance statement describing how all the products have been produced, a listing of all the identified text attributes, a copy of the text-by-attributes matrix, the rater reliability assessment, a copy of the dendrogram, and a copy of any query dialogue. The results of the complete analysis process can be saved as a JSON file and reimported for reuse.


Other features of note


1. Users can choose which of the different Claude models to use for each task in the workflow.


2. The estimated and actual token costs are tracked at a number of stages. These costs are incurred through the app's use of a user-specific API. In my experience so far, working with about a dozen different documents, these costs range between US$1 and US$3 per complete analysis.


For much more detail, go to this Introduction tab on the Lab site.