Friend to Groucho Marx: “Life is difficult”
Groucho Marx to Friend: “Compared to what?”
Comparison is intrinsic to evaluation. Not only between one thing and another, but more often, between one category of things and another. Typologies are involved in many of the comparisons evaluators have to make (between people, activities, outcomes, locations, et cetera). Typologies can spring forth from our minds, but there are also systematic methods for developing them, broadly described as clustering methods. It is this second grouping that I want to discuss here.
What types of clustering methods are there?
Using one prompt to start off with, Claude identified for me 33 different clustering methods, which were organised into 9 different categories. These categories were defined largely on the basis of differences in computational mechanisms and mathematical principles which are involved in the operation of the clustering methods. But there were a couple of groups which were organised using different criteria relating to who does the task (human or algorithm) and what the method is for.
I then did some experimentation of my own, getting Claude to cluster the methods, using two different clustering methods. One is called a “maximum spanning tree”, which connects methods according to which method is most similar to which other method. You can see this in figure 1 below, which is best read by double clicking on the image to get greater magnification. The second experiment used an agglomerative hierarchical clustering to produce what is called a dendrogram i.e. a tree structure displaying nested categories of methods. You can see this in figure 2 below, again probably best inspected by magnifying the image first. I like both of these, for reasons that will become clearer below.
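For readers curious about how a maximum spanning tree is put together, here is a minimal sketch using Kruskal's algorithm. It is illustrative only: the four method names and the similarity scores are made up for the example, and this is not necessarily how Claude produced Figure 1.

```python
def maximum_spanning_tree(names, similarity):
    """Kruskal's algorithm, taking edges from most to least similar."""
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the representative of n's component.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    edges = sorted(similarity.items(), key=lambda kv: kv[1], reverse=True)
    tree = []
    for (a, b), score in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # keep the edge only if it joins two components
            parent[root_a] = root_b
            tree.append((a, b, score))
    return tree


# Hypothetical similarity scores (0 to 1) between four methods.
methods = ["k-means", "k-medoids", "hierarchical", "pile sorting"]
similarity = {
    ("k-means", "k-medoids"): 0.9,
    ("k-means", "hierarchical"): 0.6,
    ("k-medoids", "hierarchical"): 0.5,
    ("k-means", "pile sorting"): 0.1,
    ("k-medoids", "pile sorting"): 0.2,
    ("hierarchical", "pile sorting"): 0.4,
}
mst = maximum_spanning_tree(methods, similarity)
for a, b, score in mst:
    print(f"{a} -- {b} ({score})")
```

Every method ends up connected to the tree through its strongest available link, which is why the result always has one fewer link than there are methods.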
Figure 1: Maximum Spanning Tree
Figure 2: Dendrogram
Introducing the Text Cluster Analysis Lab
I developed this mini-app recently, in May this year, with very substantial help from Claude AI. You can take a look at the app by following this link. You will see that one of the seven workflow tabs builds a dendrogram.
How does it compare?
When I asked Claude to compare this app with the list of 33 clustering methods, it identified similarities with a number of methods, including hierarchical clustering, latent class analysis (LCA), and pile sorting. But its best fitting answer was that the method is a hybrid. “Its uniqueness comes precisely from joining a human-sorting logic (criteria generated from the material, framed by the researcher's domain statement) to an algorithmic back end (binary scoring + agglomerative clustering). No single method in the set occupies that position.” The Lab's nearest relatives (hierarchical clustering, LCA, and pile/card sorting) sit in three different families, i.e. separate branches of the spanning tree and dendrogram. In this respect the Lab can claim novelty. Hopefully the basis of this claim will become clearer as I explain the Lab in some detail.
The inputs
By Claude's estimate, around 75% of cluster analysis methods work with quantitative data only; the rest work with text only (15%) or with either text or numbers (10%). The Lab works with multiple bodies of text. More specifically, the focus of the Lab is on similarities and differences between texts, not on internal structure within a text, as is the case with much thematic coding.
Users also make choices about various settings. These include what version of Claude to use at various stages of the analysis, the temperature setting for each stage, and most importantly, the precise wording of the prompt that Claude will respond to, which tells it what to look for.
The outputs
The first output is a matrix, where rows describe the texts, columns describe attributes of those texts, and cells record the presence or absence of each attribute in each text. The attributes are identified by a Claude AI search and comparison of the texts' contents, operating within user-defined context and scope settings. Attributes that fail to discriminate between the texts (those present in all of them or none of them) are filtered out automatically, since they contribute nothing to distinguishing one text from another. The completed matrix is supported by a full description of each text attribute (not shown below).
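The filtering step can be sketched in a few lines. This is a minimal illustration, not the Lab's own code; the three texts and four attribute names are hypothetical.

```python
def drop_non_discriminating(matrix):
    """Remove attributes that are present in every text or in none."""
    texts = list(matrix)
    attributes = list(matrix[texts[0]])
    keep = [
        a for a in attributes
        if 0 < sum(matrix[t][a] for t in texts) < len(texts)
    ]
    return {t: {a: matrix[t][a] for a in keep} for t in texts}


# Hypothetical text-by-attribute matrix: 1 = present, 0 = absent.
matrix = {
    "Text A": {"mentions cost": 1, "names a donor": 1,
               "written in English": 1, "cites a law": 0},
    "Text B": {"mentions cost": 1, "names a donor": 0,
               "written in English": 1, "cites a law": 0},
    "Text C": {"mentions cost": 0, "names a donor": 0,
               "written in English": 1, "cites a law": 0},
}
filtered = drop_non_discriminating(matrix)
print(list(filtered["Text A"]))
```

In this example "written in English" is dropped because all three texts have it, and "cites a law" is dropped because none do.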
A second matrix is constructed from the results of a "back-translation" type of rater reliability test. This enables users to remove from use those attributes which cannot be reliably identified, and to identify texts whose analyses are less reliable than others.
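One simple way to picture such a test is as a comparison between two rating passes over the same texts. The sketch below assumes that the test yields a second presence/absence matrix and that reliability is measured as plain percentage agreement; both are my assumptions for illustration, as are the attribute names and the 0.75 threshold, and the Lab's actual procedure may differ.

```python
def agreement_rates(first, second):
    """Share of matching cells, per attribute (column) and per text (row)."""
    texts = list(first)
    attributes = list(first[texts[0]])
    by_attribute = {
        a: sum(first[t][a] == second[t][a] for t in texts) / len(texts)
        for a in attributes
    }
    by_text = {
        t: sum(first[t][a] == second[t][a] for a in attributes) / len(attributes)
        for t in texts
    }
    return by_attribute, by_text


# Hypothetical ratings from the original pass and from a second pass.
first = {"A": {"x": 1, "y": 0}, "B": {"x": 1, "y": 1},
         "C": {"x": 0, "y": 1}, "D": {"x": 0, "y": 0}}
second = {"A": {"x": 1, "y": 1}, "B": {"x": 1, "y": 0},
          "C": {"x": 0, "y": 1}, "D": {"x": 0, "y": 0}}

by_attribute, by_text = agreement_rates(first, second)
THRESHOLD = 0.75  # hypothetical cut-off for "reliable enough"
unreliable = [a for a, rate in by_attribute.items() if rate < THRESHOLD]
print(by_attribute, by_text, unreliable)
```

Reading the result by column points to attributes worth removing; reading it by row points to texts whose analysis deserves a second look.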
The third output is a dendrogram, representing a nested classification of the texts, using the user's selection of relevant text attributes. This is built using an agglomerative process, first finding the pairs of texts which are most similar, then the pairs of pairs which are most similar, et cetera. When building the tree, the user also chooses how similarity between texts is measured (using either a Hamming or a Jaccard distance) and how clusters are linked together as they merge. Each branch of the tree structure includes tool-tip information on how the attributes of that cluster of texts differ from its sibling branch. This is systematically identified by the app, and humanly verifiable. It is not a subjective judgement, as is the case with a number of other types of clustering processes.
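The agglomerative process can be sketched as follows. This is a deliberately small illustration under my own assumptions (four hypothetical texts, Hamming distance, and complete linkage as the default), not the Lab's implementation.

```python
from itertools import combinations


def hamming(a, b):
    """Share of attributes on which two texts disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def agglomerate(rows, distance=hamming, linkage=max):
    """Repeatedly merge the two closest clusters; return the merge history.

    linkage=max gives complete linkage, linkage=min gives single linkage.
    """
    clusters = [(name,) for name in rows]
    merges = []
    while len(clusters) > 1:
        def gap(pair):
            left, right = pair
            return linkage(
                distance(rows[p], rows[q]) for p in left for q in right
            )

        left, right = min(combinations(clusters, 2), key=gap)
        merges.append((left, right, gap((left, right))))
        clusters = [c for c in clusters if c not in (left, right)]
        clusters.append(left + right)
    return merges


# Hypothetical presence/absence rows for four texts.
rows = {
    "A": [1, 1, 0, 0],
    "B": [1, 1, 0, 1],
    "C": [0, 0, 1, 1],
    "D": [0, 1, 1, 1],
}
merges = agglomerate(rows)
for left, right, height in merges:
    print(left, "+", right, "at distance", height)
```

The merge history is what a dendrogram draws: each merge is a join in the tree, and the distance at which it happens sets how far along the similarity axis that join sits.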
The fourth output is an open-ended Claude chat-type query facility, where the user can ask questions about a cluster of texts (a) on its own, (b) in comparison to its sibling cluster in the dendrogram, or (c) in comparison to the whole set of texts. This can be done with or without uploading additional context information.
The fifth output is a set of exportable products, including a detailed provenance statement describing how all the products have been produced, a listing of all the identified text attributes, a copy of the text-by-attributes matrix, the rater reliability assessment, a copy of the dendrogram, a copy of any query dialogue, and a breakdown of token use and token costs for each stage of each analysis. The details of the analysis process, including settings and contents generated, can also be saved as a JSON file and reimported for reuse.
For much more detail, go to the Introduction tab on the Lab site.
If all this is the solution, what was the problem?
Eight of the nine clustering categories (29 of the 33 methods) identified by Claude can produce identifiable clusters of texts through replicable, transparent, deterministic processes, but then require a subjective human judgement to name and describe those clusters. That seems weirdly self-contradictory: a rigorous sorting process handed off, at the last step, to an undocumented act of interpretation. The remaining four methods (pile sorting, card sorting, Q Methodology, and Repertory Grid) are subjective throughout, because of their ethnographic orientation, so there is not such a visible internal contradiction.
A second problem concerns the interpretability of the dimensions within which clusters are located. At least eight of the methods rely on abstract derived axes — dimensions that exist and could in principle be examined, but which are statistical composites a non-statistician cannot readily interpret (PCA, Factor analysis, ICA, t-SNE, UMAP, MDS, Spectral clustering, and SOM). The difficulty is not just that these axes are hard to read; it is that many of them can't be explained simply to a non-specialist without misrepresenting what they are.
A third problem is unaccountable input choices, such as the number of clusters, topics, classes, or dimensions to be identified. Here a blanket criticism is easy to make, and to dismiss, because every method involves choices, including the Lab. What matters is not how many choices are involved but how defensible each one is. Choices can differ in at least four ways: whether the choice is visible in the output or buried behind it; whether it is checkable against some standard after the fact; whether a poor choice fails visibly rather than silently; and whether it is documented. A choice that is buried, uncheckable, silently failing, and undisclosed is the least defensible; one that is visible, checkable, self-signalling, and recorded is the most.
My assessment
In summary, the Lab addresses the naming problem and the dimensions problem directly, and improves on, without completely escaping, the input-choices problem.
On naming, the dendrogram's tooltips identify how each cluster differs from its sibling by reporting which attributes it has that the sibling lacks. This is read straight off the matrix; it is systematic and humanly verifiable, not a subjective label imposed after the fact. In addition, the query facility can be used to generate names for the clusters based on their contents, which can then be tested for their reliability through another form of back translation. But it should be remembered that this is a reliability test, not a validity test.
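Reading a sibling difference off the matrix can be sketched like this. The rule used here (an attribute counts only if every text in the cluster has it and no text in the sibling does) is a strict one that I am assuming for illustration; the Lab's own rule may be more graded. The texts and attribute names are hypothetical.

```python
def distinguishing_attributes(matrix, cluster, sibling):
    """Attributes present in every text of `cluster` and in none of `sibling`."""
    attributes = list(matrix[cluster[0]])
    return [
        a for a in attributes
        if all(matrix[t][a] for t in cluster)
        and not any(matrix[t][a] for t in sibling)
    ]


# Hypothetical text-by-attribute matrix: 1 = present, 0 = absent.
matrix = {
    "A": {"mentions cost": 1, "names a donor": 1, "cites a law": 0},
    "B": {"mentions cost": 1, "names a donor": 0, "cites a law": 0},
    "C": {"mentions cost": 0, "names a donor": 0, "cites a law": 1},
    "D": {"mentions cost": 0, "names a donor": 1, "cites a law": 1},
}
left_only = distinguishing_attributes(matrix, ["A", "B"], ["C", "D"])
right_only = distinguishing_attributes(matrix, ["C", "D"], ["A", "B"])
print(left_only, right_only)
```

Anyone can check the result by looking at the relevant rows and columns of the matrix, which is the sense in which the difference is humanly verifiable.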
On interpretability of dimensions, the dendrogram works at two levels. Its tree structure (a nested classification of groups within groups) is intuitively understandable to most readers without any statistical background; they can see directly which texts joined first and which groups merged later. Its horizontal axis, representing degree of similarity, is less immediately self-explanatory, but it can be explained in one plain sentence: similarity is the degree to which two texts share the same set of attributes. That similarity is computed using one of two standard measures, Hamming or Jaccard distance.
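The two measures can be written out in a few lines. This is a minimal sketch using the standard definitions for binary data; the two example texts are hypothetical, and the Hamming distance is expressed as a proportion rather than a raw count.

```python
def hamming_distance(a, b):
    """Share of attributes on which the two texts disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jaccard_distance(a, b):
    """1 minus (attributes present in both / attributes present in either)."""
    both = sum(x and y for x, y in zip(a, b))
    either = sum(x or y for x, y in zip(a, b))
    return 1 - both / either if either else 0.0


# Two hypothetical texts rated on six attributes (1 = present, 0 = absent).
text_1 = [1, 1, 0, 0, 0, 0]
text_2 = [1, 0, 0, 0, 0, 0]

print(hamming_distance(text_1, text_2))
print(jaccard_distance(text_1, text_2))
```

The two texts differ on one attribute out of six, so the Hamming distance is small. The Jaccard distance is larger because it ignores the four attributes that both texts lack and looks only at the two that at least one of them has. Which measure suits a given analysis depends on whether a shared absence should count as a similarity.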
On input choices, as noted above, the Lab involves many of these, as other methods do. The dendrogram's metric and linkage choices are inherited from its hierarchical-clustering parentage and are similar to those of other methods in their degree of technical obscurity. But its single most consequential input choice (the criteria used to rate the texts) is much more defensible, in four ways: the criteria are visible (they are the matrix columns, in plain language), checkable (the reliability test back-translates them and flags unreliable ones), self-signalling (a non-discriminating criterion is filtered out automatically rather than quietly distorting the result), and documented (they are recorded in the provenance statement). The Lab does not remove the input-choices problem, but its strengths lie where it matters most, where the choice can actually be inspected and challenged. The criteria matter most because they define the matrix from which everything else follows: the dendrogram, the sibling comparisons, and the reliability test are all computed from the presence and absence of those attributes, so a choice made well or badly there propagates through every later step.
The patron saint of evaluation?

