Tuesday, November 05, 2019

Combining the use of the Confusion Matrix as a visualisation tool with a Bayesian view of probability


Caveat: This blog posting is a total re-write of an earlier version on the same subject. Hopefully, this one will be more coherent and more useful!


Quick Summary
In this revised blog I:
1. Explain what a Confusion Matrix is and what Bayes' theorem says
2. Explain three possible uses for Bayes' theorem when combined with a Confusion Matrix

What is a Confusion Matrix?


A Confusion Matrix is a tabular structure that displays the four possible combinations of two types of events, each of which may have happened or not happened. Wikipedia provides a good description.

Here is an example, with real data, taken from an EvalC3 analysis.


                                 Outcome present    Outcome absent
    Model attributes present         8 (TP)             4 (FP)
    Model attributes absent          1 (FN)            13 (TN)

    TP = True Positive, FP = False Positive, FN = False Negative, TN = True Negative

In this example, the top row of the table tells us that when the attributes of a particular predictive model (as identified by EvalC3) are present, there are 8 cases where the expected outcome is also present (True Positives), but there are also 4 cases where the expected outcome is not present (False Positives). Among the remaining cases (all in the bottom row), which do not have the attributes of the predictive model, there is one case where the outcome is nevertheless present (False Negative) and 13 cases where the outcome is not present (True Negatives). As can be seen in the Wikipedia article, and also in EvalC3, there is a range of performance measures that can be used to tell us how well this particular predictive model is performing, and all of these measures are based on particular combinations of the values in this Confusion Matrix. A sketch of a few of these measures is shown below.
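For readers who like to see the arithmetic, here is a minimal sketch in Python (an illustration only, not EvalC3 itself) of three commonly used performance measures computed from the counts above:

    TP, FP, FN, TN = 8, 4, 1, 13

    accuracy  = (TP + TN) / (TP + FP + FN + TN)  # overall proportion classified correctly
    precision = TP / (TP + FP)                   # a.k.a. positive predictive value
    recall    = TP / (TP + FN)                   # a.k.a. sensitivity, true positive rate

    print(round(accuracy, 2), round(precision, 2), round(recall, 2))  # 0.81 0.67 0.89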

Bayes' theorem


According to Wikipedia, 'Bayes' theorem (alternatively Bayes's theorem, Bayes's law or Bayes's rule) describes the probability of an event, based on prior knowledge of conditions that might be related to the event'.

The Bayes formula reads as follows:

P(A|B) = P(B|A) × P(A) / P(B)

where:

P(A|B) = the probability of A, given the presence of B
P(B|A) = the probability of B, given the presence of A
P(A) = the probability of A
P(B) = the probability of B

This formula can be calculated using data represented within a Confusion Matrix. Using the example above, the outcome being present = A in the formula, and the model attributes being present = B. The formula then tells us the probability that the outcome is present when the model attributes are present, i.e. how likely a predicted positive is to be a True Positive. Here is how the various parts of the formula can be calculated from the Confusion Matrix, along with some alternative names for them:

P(A) = (TP + FN) / (TP + FP + FN + TN), i.e. the prevalence of the outcome
P(B) = (TP + FP) / (TP + FP + FN + TN), i.e. the proportion of cases where the model attributes are present
P(B|A) = TP / (TP + FN), also known as the true positive rate, sensitivity, or recall
P(A|B) = TP / (TP + FP), also known as the precision, or positive predictive value
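As a quick check, here is a minimal sketch showing that the Bayes formula, applied to the counts in the example above, gives the same answer as computing TP/(TP+FP) directly:

    TP, FP, FN, TN = 8, 4, 1, 13
    N = TP + FP + FN + TN

    p_A         = (TP + FN) / N   # P(A): outcome present (prevalence) = 9/26
    p_B         = (TP + FP) / N   # P(B): model attributes present = 12/26
    p_B_given_A = TP / (TP + FN)  # P(B|A): true positive rate = 8/9

    p_A_given_B = p_B_given_A * p_A / p_B  # Bayes' theorem
    print(round(p_A_given_B, 2))           # 0.67, the same as TP/(TP+FP)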


So far in this blog posting, the Bayes formula simply provides one additional means of evaluating the usefulness of prediction models found through the use of machine learning algorithms, whether in EvalC3 or by other means.

Process Tracing application


But I'm more interested here in the use of the Bayes formula for process tracing purposes, something that Barbara Befani has written about. Process tracing is all about articulating and evaluating conjectured causal processes in detail. A process tracing analysis is usually focused on one case or instance, not multiple cases. It is a within-case rather than cross-case method of analysis. 

In this context, the rows and columns of the Confusion Matrix have slightly different names. The columns describe whether a theory is true or not, and the rows describe whether evidence of a particular kind is present or not. More importantly, the values in the cells are not numbers of actual observed cases. Rather, they are the analyst's interpretation of what are described as the "conditional probabilities" of what is happening in one case or instance. In the two cells of the first column, the analyst enters their own probability estimates, between zero and one, reflecting the likelihood (a) that the evidence would be present if the theory is true, and (b) that the evidence would be absent if the theory is true. In the two cells of the second column, the analyst enters their own probability estimates, again between zero and one, reflecting the likelihood (a) that the evidence would be present if the theory is not true, and (b) that the evidence would be absent if the theory is not true.

Here is a notional example. The theory is that a man was the murderer. The available evidence suggests that the murderer would have needed exceptional strength.

The analyst also needs to enter their "priors", that is, their belief about the overall prevalence of the theory being true, i.e. how often the murderer is a man. Wikipedia suggests that 80% of murders are committed by men. These prior probabilities are entered in the third row of the Confusion Matrix, as shown below. The main cell values are then updated in the light of those new values, as also shown below.

Using the Bayes formula provided above, we can now calculate P(A|B), i.e. the probability that a man was the murderer, given that the evidence was found: P(A|B) = TP/(TP+FP) = 0.97.
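Here is a minimal sketch of that calculation. Since the cell values of the original table are not reproduced in this text, the two conditional probabilities below are illustrative assumptions, chosen so that the arithmetic reproduces the 0.97 result:

    prior_H        = 0.80  # P(theory true): 80% of murders are committed by men
    p_E_given_H    = 0.80  # P(evidence present | theory true) - an assumed value
    p_E_given_notH = 0.10  # P(evidence present | theory not true) - an assumed value

    TP = p_E_given_H * prior_H           # prior-weighted "True Positive" cell
    FP = p_E_given_notH * (1 - prior_H)  # prior-weighted "False Positive" cell

    posterior = TP / (TP + FP)           # P(A|B) = TP/(TP+FP)
    print(round(posterior, 2))           # 0.97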

"Naive Bayes" 


This is another useful application, based on an algorithm of the same name, described here.
On that web page, an example is given of a data set where each row describes three attributes of a car (color, type and origin) and whether the car was stolen or not. Predictive models (Bayesian or otherwise) could be developed to identify how well each of these attributes, on its own, predicts whether a car is stolen or not. In addition, we may want to know how good a predictor the combination of all three of these individual predictors would be. But the dataset does not include any examples covering that particular combination of attributes.

The article then explains how the probabilities associated with each of the three attributes can be combined to predict whether a car is stolen or not:

1. Calculate (TP/(TP+FP)) for color * (TP/(TP+FP)) for type * (TP/(TP+FP)) for origin 
2. Calculate (FP/(TP+FP)) for color * (FP/(TP+FP)) for type * (FP/(TP+FP)) for origin
3. Compare the two calculated values. If the first is higher, classify the car as most likely stolen; if it is lower, classify it as most likely not stolen (see the sketch below).
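Here is a minimal sketch of that comparison, following the three steps above. The per-attribute counts are invented for illustration; they are not the figures from the linked article:

    # Hypothetical (TP, FP) counts for each attribute of the car in question
    attributes = {
        "color":  (3, 2),  # e.g. 3 stolen and 2 not-stolen cars share this color
        "type":   (1, 3),
        "origin": (2, 3),
    }

    p_stolen, p_not_stolen = 1.0, 1.0
    for tp, fp in attributes.values():
        p_stolen     *= tp / (tp + fp)  # step 1: TP/(TP+FP) per attribute
        p_not_stolen *= fp / (tp + fp)  # step 2: FP/(TP+FP) per attribute

    # Step 3: compare the two products
    print("stolen" if p_stolen > p_not_stolen else "not stolen")  # here: not stolen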

A caution: Naive Bayes calculations assume (as the name suggests) that each of the attributes in the predictive model is independent of the others, conditional on the outcome. This may not always be the case.

In summary


The Bayes formula seems to have three uses:

1. As an additional performance measure when evaluating predictive models generated by any algorithm, or other means. Here the cell values do represent numbers of individual cases.

2. As a way of measuring the probability of a particular causal mechanism working as proposed, within the context of a process-tracing exercise. Here the cell values are conjectures about relative probabilities relating to a specific case, not numbers of individual cases.

3. As a way of measuring the probability of a combination of predictive models being a good predictor of an outcome of concern. Here the cell values could represent either multiple real cases or conjectured probabilities (part of a Bayesian analysis of a causal mechanism) regarding events within one case only.






Saturday, October 19, 2019

On finding the weakest link...



Last week I read and responded to a flurry of email exchanges that were prompted by Jonathan Morell circulating a think piece titled 'Can Knowledge of Evolutionary Biology and Ecology Inform Evaluation?'. Putting aside the details of the subsequent discussions, many of the participants agreed that evaluation theory and practice could benefit from more actively seeking out relevant ideas from other disciplines.

So when I was reading Tim Harford's column in this weekend's Financial Times, titled 'The weakest link in the strong Nobel winner', I was very interested in this section:
Then there’s Prof Kremer’s O-ring Theory of Development, which demonstrates just how far one can see from that comfortable armchair. The failure of vulnerable rubber “O-rings” destroyed the Challenger space shuttle in 1986; Kremer borrowed that image for his theory, which — simply summarised — is that for many production processes, the weakest link matters.
Consider a meal at a fancy restaurant. If the ingredients are stale, or the sous-chef has the norovirus, or the chef is drunk and burns the food, or the waiter drops the meal in the diner’s lap, or the lavatories are backing up and the entire restaurant smells of sewage, it doesn’t matter what else goes right. The meal is only satisfactory if none of these things go wrong.
If you do a Google search for more information about the O-ring Theory of Development, you will find there is a lot more to the theory than this, much of it very relevant to evaluators. Prof Kremer is an economist, by the way.
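The intuition can be shown with a few lines of arithmetic. In this sketch (the step names and quality scores are invented, loosely following Harford's restaurant example), overall quality is the product of the quality of each step, so a single weak link dominates the result:

    from math import prod

    steps = {
        "ingredients": 0.95,
        "sous-chef":   0.95,
        "chef":        0.95,
        "waiter":      0.30,  # the weak link
        "lavatories":  0.95,
    }

    print(round(prod(steps.values()), 2))  # ~0.24, despite four strong steps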

This quote was of interest to me because in the last week I have been having discussions with a big agency in London about how to go ahead with an evaluation of one of their complex programs. By complex, in this instance, I mean a program that is not easily decomposable into multiple parts – where it might otherwise be possible to do some form of cross-case analysis, using either observational data or experimental data. We have been talking about strategies for identifying multiple alternative causal pathways that might be at work, connecting the program's interventions with the outcomes it is interested in. I'll be reporting more on this in the near future, I hope.

But let's go right now to a position a bit further along, where an evaluation team has identified which causal pathway(s) are most valuable, plausible and relevant. In those circumstances, particularly in a large complex program, the causal pathway itself could be quite long, with many elements or segments. This in itself is not a bad thing: the more segments of a causal pathway that can be examined, the more vulnerable to disproof the theory about that causal pathway is. In principle that is a good thing, especially if the theory is then not disproved, because it means it is a pretty good theory. On the other hand, a causal pathway with many segments or steps poses a problem for an evaluation team, in terms of where they are going to allocate their resource-limited attention.

What I like about the paragraph from Tim Harford's column is the sensible advice it provides to an evaluation team in this type of context: look first for the weakest link in the causal pathway. Of course, that raises the question of what we mean by the weakest link. A link may be weak in terms of its verifiability or its plausibility, or in other ways. My inclination at this point would be to focus on the weakest link in terms of plausibility. Your thoughts on this would be appreciated. How one would go about identifying such weak links would also need attention. Two obvious choices would be to use either expert judgement or different stakeholders' perspectives on the question, or, probably better, a combination of both.

Postscript: I subsequently discovered some other related musings:

