Showing posts sorted by relevance for query QCA. Sort by date Show all posts

Friday, July 26, 2013

A reverse QCA?



I have been talking to a project manager who needed some help clarifying their Theory of Change (and maybe the project design itself). The project aims to improve the working relationships between a particular organisation (A) and a number of organisations they work with (Bs). There is already a provisional scale that could be used to measure the baseline state of relationships, and changes in those relationships thereafter. Project activities designed to help improve the relationships have already been identified and should be reasonably easy to monitor. But the expected impacts of the improved relationships on what the Bs do elsewhere via their other relationships have not been clarified or agreed to, and in all likelihood they could be many and varied. They will probably be easier to identify and categorise after the activities have been carried out than at any planning stage.

I have been considering the possible usefulness of QCA as a means of analysing the effectiveness of the project. The cases will be the various relationships between A and Bs that are assisted in different ways. The conditions will be different forms of assistance provided as well as differences in the context of these relationships (e.g. the people, organisations and communities involved). The outcome of interest will be the types of changes in the relationships between A and Bs. Not especially problematic, I hope.

Then I thought: perhaps one could do a reverse QCA analysis to identify associations between specific types of relationship changes and the many different kinds of impacts that were subsequently observed on other relationships. The conditions in this analysis would be various categories of observed change (with data on their presence and absence). The configurations of conditions identified by the QCA analysis would in effect be a succinct typology of impact configurations associated with each kind of relationship change, as distinct from the causal configurations sought via a conventional QCA.

This reversal of the usual QCA analysis should be possible and legitimate because relations between conditions and outcomes are set theoretic relations, not temporal relationships. My next step will be to find out if someone has already tried to do this elsewhere (that I could learn from). These days this is highly likely.

Postscript 1: The same sort of reverse analyses could be done with Decision Tree algorithms, whose potential for use in evaluations has been discussed in earlier postings on this blog and elsewhere.

Postscript 2: I am slowly working my way through this comprehensive account of QCA, published last year:
Schneider, Carsten Q., and Claudius Wagemann. 2012. Set-Theoretic Methods for the Social Sciences: A Guide to Qualitative Comparative Analysis. Cambridge University Press.

Tuesday, May 19, 2015

How to select which hypotheses to test?



I have been reviewing an evaluation that has made use of QCA (Qualitative Comparative Analysis). An important part of the report is the section on findings, which lists a number of hypotheses that have been tested and the results of those tests. All of these are fairly complex, involving a configuration of different contexts and interventions, as you might expect in a QCA oriented evaluation. There were three main hypotheses, which in the results section were disaggregated into six more specific hypotheses. The question for me, which has much wider relevance, is how do you select hypotheses for testing, given the limited time and resources available in any evaluation?

The evaluation team have developed three different data sets, each with 11 cases, and with 6, 6 and 9 attributes of these cases (shown in columns), known as "conditions" in QCA jargon. This means there are 2^6 + 2^6 + 2^9 = 64 + 64 + 512 = 640 possible combinations of these conditions that could be associated with and cause the outcome of interest. Each of the hypotheses being explored by the evaluation team represents one of these configurations. In this type of situation, the task of choosing appropriate hypotheses seems a little like looking for a needle in a haystack.
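The arithmetic above can be checked in a few lines. This is only a minimal sketch, assuming every condition is binary; the function name is illustrative.

```python
def n_configurations(n_conditions: int) -> int:
    """Number of distinct present/absent combinations of binary conditions."""
    return 2 ** n_conditions

# The three data sets have 6, 6 and 9 conditions respectively.
conditions_per_data_set = [6, 6, 9]
total = sum(n_configurations(k) for k in conditions_per_data_set)
print(total)  # 640
```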

It seems there are at least three options, which could be combined. The first is to review the literature and find what claims (supported by evidence) are made there about "what works" and select from these those that are worth testing e.g. one that seems to have wide practical use, and/or one that could have different and significant program design implications if it is right or wrong. This seems to be the approach that the evaluation team has taken, though I am not so sure to what extent they have used the programming implications as an associated filter.

The second approach is to look for constituencies of interest among the staff of the client who has contracted the evaluation. There have been consultations, but it is not clear what sort of constituencies each of the tested hypotheses have. There were some early intimations that some of the hypotheses that were selected are not very understandable. That is clearly an important issue, potentially limiting the usage of the evaluation findings.

The third approach is an inductive search, using QCA or other software, for configurations of conditions associated with an outcome that have both a high level of consistency (i.e. they are always associated with the presence (or the absence) of an outcome) and coverage (i.e. they apply to a large proportion of the outcomes of interest). In their barest form these configurations can be considered as hypotheses. I was surprised to find that this approach had not been used, or at least reported on, in the evaluation report I read. If it had been used but no potentially useful configurations found then this should have been reported (as a fact, not a fault).

Somewhat incidentally, I have been playing around with the design of an Excel worksheet and managed to build in a set of formulae for automatically testing how well different configurations of conditions of particular interest (aka hypotheses) account for a set of outcomes of interest, for a given data set. The tests involve measures taken from QCA (consistency and coverage, as above) and from machine learning practice (known as a Confusion Matrix). This set-up provides an opportunity to do some quick filtering of a larger number of hypotheses than an evaluation team might initially be willing to consider (i.e. the 6 above). It would not be as efficient a search as the QCA algorithm, but it would be a search that could be directed according to specific interests. Ideally this directed search process would identify configurations that are both necessary and sufficient (for more than a small minority of outcomes). A second best result would be those that are necessary but insufficient, or vice versa. (I will elaborate on these possibilities and their measurement in another blog posting.)
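The kind of screening described here can also be sketched outside Excel. What follows is a minimal Python illustration, not the worksheet itself: the data set, the attribute names and the hypotheses are all invented for the example. Each hypothesis is written as the attribute values it requires, and each case is counted into one of four groups according to whether the configuration and the outcome are present.

```python
from collections import Counter

# Hypothetical data set: binary attributes plus a binary outcome per case.
cases = [
    {"training": 1, "funding": 1, "urban": 0, "outcome": 1},
    {"training": 1, "funding": 1, "urban": 1, "outcome": 1},
    {"training": 1, "funding": 0, "urban": 1, "outcome": 0},
    {"training": 0, "funding": 1, "urban": 0, "outcome": 1},
    {"training": 0, "funding": 0, "urban": 1, "outcome": 0},
    {"training": 1, "funding": 1, "urban": 0, "outcome": 0},
]

def screen(hypothesis, cases, outcome_key="outcome"):
    """Count cases by (configuration present?, outcome present?)."""
    counts = Counter()
    for case in cases:
        fits = all(case[attr] == value for attr, value in hypothesis.items())
        counts[(fits, case[outcome_key] == 1)] += 1
    return (
        counts[(True, True)],    # configuration present, outcome present
        counts[(True, False)],   # configuration present, outcome absent
        counts[(False, True)],   # configuration absent, outcome present
        counts[(False, False)],  # configuration absent, outcome absent
    )

# Hypothetical hypotheses to screen in one pass.
hypotheses = {
    "training AND funding": {"training": 1, "funding": 1},
    "funding": {"funding": 1},
}
results = {name: screen(h, cases) for name, h in hypotheses.items()}
for name, counts in results.items():
    print(name, counts)
```

The four counts are all that is needed to calculate the consistency, coverage and Confusion Matrix measures for each hypothesis.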

The wider point to make here is that with the availability of a quick screening capacity the evaluation team, in its consultations with the client, should then be able to broaden the focus of useful discussions away from what are currently quite specific hypotheses, and towards the contents of a menu of a limited number of conditions that can make up not only these hypotheses but also alternative versions. It is the choice of these particular conditions that will really make the difference to the scale and usability of the results of a QCA oriented evaluation. More optimistically, the search facility could even be made available online, for continued use by those interested in the evaluation results and their possible variants.

The Excel file for quick hypotheses testing is here: http://wp.me/afibj-1ux




Wednesday, May 20, 2015

Evaluating the performance of binary predictions


(Updated 2015 06 06)

Background: This blog posting has its origins in a recent review of a QCA oriented evaluation, in which a number of hypotheses were proposed and then tested using a QCA type data set. In these data sets, cases (projects) are listed row by row and the attributes of these projects are listed in columns. Additional columns to the right describe associated outcomes of interest. The attributes of the projects may include features of the context as well as the interventions involved. The cell values in the data sets were binary (1 = attribute present, 0 = not present), though there are other options.

When such a data set is available a search can be made for configurations of conditions that are strongly associated with an outcome of interest. This can be done inductively or deductively. Inductive searches involve the use of systematic search processes (aka algorithms), of which there are a number available. QCA uses the Quine–McCluskey algorithm. Deductive searches involve the development of specific hypotheses from a body of theory, for example about the relationship between the context, intervention and outcome.

Regardless of which approach is used, the resulting claims of association need evaluation. There are a number of different approaches to doing this that I know of, and probably more. All involve, in the simplest form, the analysis of a truth table in this form:


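In this truth table the cell values refer to the number of cases that have each combination of configuration and outcome. For further reference below I will label each cell as A and B (top row) and C and D (bottom row). Judging from the formulae used below, the rows refer to the configuration and the columns to the outcome:

                            Outcome present    Outcome absent
  Configuration present            A                  B
  Configuration absent             C                  D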

The first approach to testing is a statistical approach. I am sure that there are a number of ways of doing this, but the one I am most familiar with is the widely used Chi-Square test. Results will be seen as most statistically significant when all cases are in the A and D cells. They will be least significant when they are equally distributed across all four cells.
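The two extremes mentioned here can be illustrated with the textbook Pearson chi-square statistic for a 2x2 table. This is a minimal sketch, assuming no continuity correction and using invented cell counts; the function names are illustrative.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("an empty row or column leaves the statistic undefined")
    return n * (a * d - b * c) ** 2 / denominator

def p_value_1df(chi_square):
    """Upper-tail probability of a chi-square value with one degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2))

all_in_a_and_d = chi_square_2x2(5, 0, 0, 5)  # every case in cells A and D
evenly_spread = chi_square_2x2(5, 5, 5, 5)   # cases spread equally over all cells
print(all_in_a_and_d, p_value_1df(all_in_a_and_d))
print(evenly_spread, p_value_1df(evenly_spread))
```

With all cases in A and D the statistic is at its maximum for that number of cases, and with an even spread it is zero.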

The second approach to testing is the one used by QCA. There are two performance measures. One is Consistency, which is the proportion of the cases where the configuration is present in which the outcome is also present (=A/(A+B)). The other is Coverage, which is the proportion of all outcomes that are associated with the configuration (=A/(A+C)).
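As a small worked example of these two formulae, here is a sketch using invented cell counts (A=6, B=2, C=3); the function names are mine.

```python
def consistency(a, b):
    """Share of cases with the configuration that also have the outcome."""
    return a / (a + b) if (a + b) else None

def coverage(a, c):
    """Share of cases with the outcome that also have the configuration."""
    return a / (a + c) if (a + c) else None

# Hypothetical counts: A=6, B=2, C=3.
print(consistency(6, 2))         # 0.75
print(round(coverage(6, 3), 2))  # 0.67
```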

When some of the cells have 0 values three categorical judgements can also be made. If only cell B is empty then it can be said that the configuration is Sufficient but not Necessary. Because there are still values in cell C this means there are other ways of achieving the outcome in addition to this configuration.

If only cell C is empty then it can be said that the configuration is Necessary but not Sufficient. Because there are still values in cell B this means there are other additional conditions that are needed to ensure the outcome.

If cells B and C are empty then it can be said that the configuration is both Necessary and Sufficient.

In all three situations there only needs to be one case to be found in a previously empty cell(s) to disprove the standing proposition. This is a logical test, not a statistical test.
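These three judgements amount to a simple logical check on which cells are empty. A minimal sketch follows; the function name is illustrative, and treating an empty cell A as "no supporting cases" is my own assumption rather than something stated above.

```python
def categorical_judgement(a, b, c, d):
    """Classify a configuration from the four cell counts of the truth table."""
    if a == 0:
        # Assumption: with no cases in cell A there is nothing to support a claim.
        return "No supporting cases"
    if b == 0 and c == 0:
        return "Necessary and Sufficient"
    if b == 0:
        return "Sufficient but not Necessary"
    if c == 0:
        return "Necessary but not Sufficient"
    return "Neither Necessary nor Sufficient"

print(categorical_judgement(4, 0, 3, 5))  # Sufficient but not Necessary
print(categorical_judgement(4, 2, 0, 5))  # Necessary but not Sufficient
print(categorical_judgement(4, 0, 0, 5))  # Necessary and Sufficient
```

A single new case landing in a previously empty cell changes the result, which is the sense in which this is a logical rather than a statistical test.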

The third approach is one used in the field of machine learning, where the above matrix is known as a Confusion Matrix. Here there is a profusion of performance measures available (at least 13). Some of the more immediately useful measures are:
  • Accuracy: (A+D)/(A+B+C+D), which is similar to but different from the Chi Square measure above
  • Precision: A/(A+B), also called Positive predictive value, which corresponds to QCA consistency
  • True negative rate: D/(B+D), also called Specificity
  • False negative rate: C/(A+C)
  • False positive rate: B/(B+D)
  • Recall: A/(A+C), also called Sensitivity or True positive rate, which corresponds to QCA coverage
  • Negative predictive value: D/(C+D)
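
These measures can all be calculated from the same four cell counts. The sketch below uses the names that are standard in the machine learning literature (cell A holds the true positives, B the false positives, C the false negatives and D the true negatives); the counts are invented and the function names are illustrative.

```python
def ratio(numerator, denominator):
    """Return numerator/denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None

def confusion_measures(a, b, c, d):
    """a: true positives, b: false positives, c: false negatives, d: true negatives."""
    return {
        "accuracy": ratio(a + d, a + b + c + d),
        "precision": ratio(a, a + b),            # same as QCA consistency
        "recall": ratio(a, a + c),               # same as QCA coverage
        "specificity": ratio(d, b + d),
        "false_positive_rate": ratio(b, b + d),
        "false_negative_rate": ratio(c, a + c),
        "negative_predictive_value": ratio(d, c + d),
    }

measures = confusion_measures(a=6, b=2, c=3, d=9)
for name, value in measures.items():
    print(f"{name}: {value:.2f}")
```
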
In addition to these three types of tests there are three other criteria that are worth taking into account as well: simplicity, diversity and similarity.

Simplicity: Essentially the same idea as that captured in Occam's Razor. This is that simpler configurations are preferable, all other things being equal. For example: A+F+J leads to D is a simpler hypothesis than A+X+Y+W+F leads to D. Complex configurations can have a better fit with the data, but at the cost of being poor at generalising to other contexts. In Decision Tree modelling this is called "over-fitting" and the solution is "pruning", i.e. cutting back on the complexity of the configuration. Simplicity has practical value when it comes to applying tested hypotheses in real life programmes: simpler hypotheses are easier to communicate and to implement. Simplicity can be measured at two levels: (a) the number of attributes in a configuration that is associated with an outcome, and (b) the number of configurations needed to account for an outcome.

Diversity: The diversity of configurations is simply the number of different specific configurations in a data set. It can be made into a comparable measure by calculating it as a percentage of the total number possible. The total number possible is 2 to the power of A where A = number of kinds of attributes in the data set. A bigger percentage = more diversity.

If you want to find how "robust" a hypothesis is, you could calculate the diversity present in the configurations of all the cases covered by the hypothesis (i.e. not just the attributes specified by the hypotheses, which will be all the same). If that percentage is large this suggests the hypothesis works in a greater diversity of circumstances, a feature that could be of real practical value.
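Both uses of the diversity measure can be sketched as follows. The data set and the hypothesis are invented, and I have kept the same denominator (2 to the power of the number of attributes) for the covered subset so that the two percentages are comparable; that is a choice made for this sketch.

```python
def diversity(configurations):
    """Distinct configurations as a share of the 2**k that are possible."""
    n_attributes = len(configurations[0])
    return len(set(configurations)) / 2 ** n_attributes

# Hypothetical data set: five cases described by three binary attributes.
data = [(1, 1, 0), (1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1)]
overall = diversity(data)

# Robustness of a hypothesis: diversity among only the cases it covers.
# Assumed hypothesis for illustration: the first attribute is present.
covered = [row for row in data if row[0] == 1]
among_covered = diversity(covered)

print(overall)        # 0.5
print(among_covered)  # 0.375
```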

This notion of diversity is to some extent implicit in the Coverage measure. More coverage implies more diversity of circumstances. But two hypotheses with the same coverage (i.e. proportion of cases they apply to) could be working in circumstances with quite different degrees of diversity (i.e. the cases covered were much more diverse in their overall configurations).

Similarity: Each row in a QCA like data set is a string of binary values. The similarity of these configurations of attributes can be measured in a number of ways:
  • Jaccard index, the number of positions where both configurations have the value 1 (i.e. the same attribute is present in both), as a proportion of the positions where at least one of them has the value 1.
  • Hamming distance, the number of positions at which the corresponding values in two configurations are different. This counts mismatches on both 0 and 1 values, whereas Jaccard only looks at 1 values.
These measures are relevant in two ways, which are discussed in more detail further down this post:
  • If you want to find a "representative" case in a data set, you would look for the case with the lowest average Hamming distance to all the other cases in the data set
  • If you wanted to compare the two most similar cases, you would look for the pair of cases with the lowest Hamming distance.
Similarity can be seen as a third facet of diversity, a measure of the distance between any two types of cases. Stirling (2007) used the term disparity to describe the same thing.
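The two measures, and the two uses of Hamming distance listed above, can be sketched in a few lines of Python. The four cases are invented, the function names are illustrative, and scoring two all-zero configurations as identical under Jaccard is an assumption of this sketch.

```python
from itertools import combinations

def hamming(x, y):
    """Number of positions at which two configurations differ."""
    return sum(a != b for a, b in zip(x, y))

def jaccard(x, y):
    """Positions where both are 1, over positions where at least one is 1."""
    both = sum(a == 1 and b == 1 for a, b in zip(x, y))
    either = sum(a == 1 or b == 1 for a, b in zip(x, y))
    return both / either if either else 1.0  # assumption: all-zero pair = identical

# Hypothetical cases, each a string of five binary attributes.
cases = {
    "P1": (1, 1, 0, 1, 0),
    "P2": (1, 1, 0, 0, 0),
    "P3": (0, 1, 1, 1, 1),
    "P4": (1, 0, 0, 1, 1),
}

def representative(cases):
    """Case with the lowest average Hamming distance to all other cases."""
    def mean_distance(name):
        others = [config for other, config in cases.items() if other != name]
        return sum(hamming(cases[name], o) for o in others) / len(others)
    return min(cases, key=mean_distance)

def most_similar_pair(cases):
    """Pair of cases with the lowest Hamming distance between them."""
    return min(combinations(cases, 2),
               key=lambda pair: hamming(cases[pair[0]], cases[pair[1]]))

print(representative(cases))     # P1
print(most_similar_pair(cases))  # ('P1', 'P2')
```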

Choosing relevant criteria: It is important to note that the relevance of these different association tests and criteria will depend on the context. A surgeon would want a very high level of consistency, even if it was at the cost of low coverage (i.e. applicable only in a limited range of situations). However, a stock market investor would be happy with a consistency of 0.55 (i.e. 55%), especially if it had wide coverage. Even more so if that wide coverage contained a high level of diversity. Returning to the medical example, false positives might have different consequences to false negatives, e.g. unnecessary surgery versus unnecessary deaths. In other non-medical circumstances, false positives may be more expensive mistakes than false negatives.

Applying the criteria: My immediate interest is in the use of these kinds of tests for two evaluation purposes. The first is selective screening of hypotheses about causal configurations that are worth more time intensive investigations, an issue raised in a recent blog. The candidates for selection would be:
  • Configurations that are Sufficient and not Necessary or Necessary but not Sufficient. 
    • Among these, configurations which were Sufficient but not Necessary, and with high coverage should be selected, 
    • And configurations which were Necessary but not Sufficient, and with high consistency, should also be selected. 
  • Plus all configurations that were Sufficient and Necessary (which are likely to be less common)
The second purpose is to identify implications for more time consuming within-case investigations. These are essential in order to identify the causal mechanisms at work that connect the conditions that are associated in a given configuration. As I have argued elsewhere, associations are a necessary but insufficient basis for a strong claim of causation. Evidence of mechanisms is like muscles on the bones of a body, enabling it to move.

Having done the filtering suggested above, the following kinds of within-case investigations would seem useful:
  • Are there any common causal mechanisms underlying all the cases covered by a configuration found to be Necessary and Sufficient, i.e. those within cell A? 
    • A good starting point would be a case within this set of cases that had the lowest average Hamming distance, i.e. one with the highest level of similarity with all the other cases. 
    • Once one or more plausible mechanisms were discovered in that case a check could be made to see if they are present in other cases in that set. This could be done in two ways: (a) incrementally, by examining adjacent cases, i.e. cases with the lowest Hamming distance from the representative case, (b) by partitioning the rest of the cases, and examining a case with a median level Hamming distance, i.e. half way between being the most similar and most different cases.
  • Where the configuration is Necessary but not Sufficient, how do the cases in cell B differ from those in cell A, in ways that might shed more light on how the same configuration leads to different outcomes? This is what has been called a MostSimilarDifferentOutcome (MSDO) comparison.
    • If there are many cases this could be quite a challenge, because the cases could differ on many dimensions (i.e. on many attributes). But using the Hamming distance measure we could make this problem more manageable by selecting the pair of cases, one from cell A and one from cell B, that had the lowest possible Hamming distance. Then a within-case investigation could find additional undocumented differences that account for some or all of the difference in outcomes. 
      • That difference could then be incorporated into the current hypothesis (and data set) enabling more cases from cell B to now be found in cell A, i.e. Consistency would be improved
  • Where the configuration is Sufficient but not Necessary, in what ways are the cases in cell C the same as those in cell A, in ways that might shed more light on how the same outcome is achieved by different configurations? This is what has been called a MostDifferentSimilarOutcome (MDSO) comparison.
    • As above, if there are many cases this could be quite a challenge. Here I am less clear, but de Meur et al (page 72) say the correct approach is "...one has to look for similarities in the characteristics of initiatives that differ the most from each other; firstly the identification of the most differing pair of cases and secondly the identification of similarities between those two cases". The within-case investigation should look for undocumented similarities that account for some of the similar outcomes. 
      • That similarity could then be incorporated into the current hypothesis (and data set) enabling more cases from cell C to now be found in cell A, i.e. Coverage would be improved
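
The selection of pairs for the MSDO and MDSO comparisons described above can be sketched with Hamming distance. The cases and their groupings into cells A, B and C are invented for illustration, with the hypothesis assumed to be "the first two attributes are both present"; the function names are mine.

```python
from itertools import product

def hamming(x, y):
    """Number of positions at which two configurations differ."""
    return sum(a != b for a, b in zip(x, y))

def pick_pair(group_1, group_2, chooser):
    """Pick the cross-group pair with the min or max Hamming distance."""
    return chooser(product(group_1, group_2),
                   key=lambda pair: hamming(group_1[pair[0]], group_2[pair[1]]))

# Hypothetical cases, grouped by truth table cell.
cell_a = {"P1": (1, 1, 0, 1, 0), "P2": (1, 1, 1, 0, 0)}  # configuration and outcome present
cell_b = {"P5": (1, 1, 0, 1, 1), "P6": (1, 1, 1, 1, 1)}  # configuration present, outcome absent
cell_c = {"P7": (0, 0, 1, 0, 1), "P8": (0, 1, 0, 1, 0)}  # configuration absent, outcome present

msdo_pair = pick_pair(cell_a, cell_b, min)  # most similar, different outcome
mdso_pair = pick_pair(cell_a, cell_c, max)  # most different, same outcome
print(msdo_pair)  # ('P1', 'P5')
print(mdso_pair)  # ('P1', 'P7')
```

The selected pairs are then the starting points for the within-case investigations, looking for undocumented differences in the first pair and undocumented similarities in the second.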


Monday, June 14, 2021

Paired case comparisons as an alternative to a configurational analysis (QCA or otherwise)

[Take care, this is still very much a working draft!] Criticisms and comments welcome though

The challenge

The other day I was asked for some advice on how to implement a QCA type of analysis within an evaluation that was already fairly circumscribed in its design. Circumscribed both by the commissioner and by the team proposing to carry out the evaluation. The commissioner had already indicated that they wanted a case study orientated approach and had even identified the maximum number of case studies that they wanted to see (ten). While the evaluation team could see the potential use of a QCA type analysis they were already committed to undertaking a process type evaluation, and did not want a QCA type analysis to dominate their approach. In addition, it appeared that there already was a quite developed conceptual framework that included many different factors which might be contributory causes to the outcomes of interest.

As is often the case, there seemed to be a shortage of cases and an excess of potentially explanatory variables. In addition, there were doubts within the evaluation team as to whether a thorough QCA analysis would be possible or justifiable given the available resources and priorities.

Paired case comparisons as the alternative

My first suggestion to the evaluation team was to recognise that there is some middle ground between across-case analysis involving medium to large numbers of cases, and a within-case analysis. As described by Rihoux and Ragin (2009) a QCA analysis will use both, going back and forth, using one to inform the other, over a number of iterations. The middle ground between these two options is case comparisons, particularly comparisons of pairs of cases. Although in the situation described above there will be a maximum of 10 cases that can be explored, the number of pairs of these cases that can be compared is still quite big (45). With these sorts of numbers some sort of strategy is necessary for making choices about the types of pairs of cases that will be compared. Fortunately there is already a large literature on case selection. My favourite summary is the one by Gerring, J., & Cojocaru, L. (2015). Case-Selection: A Diversity of Methods and Criteria. 
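The figure of 45 is the number of ways of choosing 2 cases out of 10, which can be checked by listing the pairs. A minimal sketch, with placeholder case names:

```python
from itertools import combinations
from math import comb

case_ids = [f"Case {i}" for i in range(1, 11)]  # ten placeholder cases
pairs = list(combinations(case_ids, 2))         # every unordered pair

print(len(pairs))   # 45
print(comb(10, 2))  # 45, i.e. 10 * 9 / 2
```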

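My suggested approach was to use what is known as the Confusion Matrix as the basis for structuring the choice of cases to be compared. A Confusion Matrix is a simple truth table, showing a combination of two sets of possibilities (rows and columns), and the incidence of those possibilities (cell values). Its layout, with the four types of cases described below in place of the cell values, is as follows:

                                   Outcome present    Outcome absent
  Attributes fit the theory         True Positive      False Positive
  Attributes do not fit the theory  False Negative     True Negative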


Inside the Confusion Matrix are four types of cases: 
  1. True Positives where there are cases with attributes that fit my theory and where the expected outcome is present
  2. False Positives, where there are cases with attributes that fit my theory but where the expected outcome is absent
  3. False Negatives, where there are cases which do not have attributes that fit my theory but where nevertheless the outcome is present
  4. True Negatives, where there are cases which do not have attributes that fit my theory and where the outcome is absent as expected
Both QCA and supervised machine learning approaches are good at identifying individual attributes (or packages of attributes) which are good predictors of when outcomes are present or when they are absent, in other words where there are large numbers of True Positive and True Negative cases. They are also good at showing the incidence of exceptions: the False Positives and False Negatives. But this type of cross-case led analysis does not seem to be available as an option to the evaluation team I have mentioned above.
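The four types of cases can be assigned mechanically once each case has been judged on two questions: does it have the attributes that fit the theory, and is the outcome present? A minimal sketch, with invented cases and an illustrative function name:

```python
from collections import Counter

def cell(fits_theory, outcome_present):
    """Name the Confusion Matrix cell that a case belongs to."""
    if fits_theory:
        return "True Positive" if outcome_present else "False Positive"
    return "False Negative" if outcome_present else "True Negative"

# Hypothetical cases: (name, fits the theory?, outcome present?)
cases = [
    ("Case 1", True, True),
    ("Case 2", True, False),
    ("Case 3", False, True),
    ("Case 4", False, False),
    ("Case 5", True, True),
]
tally = Counter(cell(fits, outcome) for _, fits, outcome in cases)
print(tally)
```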

1. Starting with True Positives

So my suggestion has been to look at the 10 cases that they have at hand, and start by focusing in on those cases where the outcome is present (first column). Focus on the case that is most similar to others with the outcome present, because findings about this case may be more likely to apply to others (see below on measuring similarity). When examining that case identify one or more attributes which are the most likely explanation for the outcome being present. Note here that this initial theory is coming from a single within-case analysis, not a cross-case analysis. The evaluation team will now have a single case in the category of True Positive. 

2. Comparing False Positives and True Positives

The next step in the analysis is to identify cases which can be provisionally described as a False Positive. Start by finding a case which has the outcome absent. Does it have the same theory-relevant attributes as the True Positive? If so, retain it as a False Positive. Otherwise, move it to the True Negative category. Repeat this move for all remaining cases with the outcome absent. From among all those qualifying as False Positives, find one which is otherwise as similar as possible in all its other attributes to the True Positive case. This type of analysis choice is called MSDO, standing for "most similar design, different outcome" (see the de Meur references below). Also see below on how to measure this form of similarity. 

The aim here is to find how the causal mechanisms at work differ. One way to explore this question is to look for an additional attribute that is present in the True Positive case but absent in the False Positive case, despite those cases otherwise being most similar. Or, an attribute that is absent in the True Positive but present in the False Positive case. In the former case the missing attribute could be seen as a kind of enabling factor, whereas in the latter case it could be seen as more like a blocking factor. If neither can be found by comparison of coded attributes of the cases then a more intensive examination of raw data on the case might still identify them, and lead to an update or elaboration of the theory behind the True Positive case. Alternately, that examination might suggest measurement error is the problem and that the False Positive case needs to be reclassified as True Positive.
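The comparison of coded attributes between a True Positive and its most similar False Positive can be sketched as below. The two cases and their attribute names are invented, and the function name is illustrative.

```python
def candidate_factors(true_positive, false_positive):
    """List attributes that might act as enabling or blocking factors."""
    # Present in the True Positive, absent in the False Positive.
    enabling = [attr for attr, value in true_positive.items()
                if value == 1 and false_positive[attr] == 0]
    # Absent in the True Positive, present in the False Positive.
    blocking = [attr for attr, value in true_positive.items()
                if value == 0 and false_positive[attr] == 1]
    return enabling, blocking

# Hypothetical pair of cases that share the theory-relevant attributes.
tp_case = {"training": 1, "funding": 1, "local_champion": 1, "staff_turnover": 0}
fp_case = {"training": 1, "funding": 1, "local_champion": 0, "staff_turnover": 1}

enabling, blocking = candidate_factors(tp_case, fp_case)
print(enabling)  # ['local_champion']
print(blocking)  # ['staff_turnover']
```

If both lists come back empty, the coded attributes cannot explain the difference in outcomes and the raw case data would need to be examined.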

3. Comparing False Negatives and True Positives

The third step in the analysis is to identify at least one most relevant case which can be described as a False Negative.  This False-Negative case should be one that is as different as possible in all its attributes to the True Positive case. This type of analysis choice is called MDSO, standing for "most different design, same outcome". 

The aim here should be to identify whether the same or a different causal mechanism is at work, when compared to that seen in the True Positive case. One way to explore this question is to look for one or more attributes that both the True Positive and False Negative case have in common, despite otherwise being "most different". If found, and if associated with the causal theory in the True Positive case, then the False Negative case can now be reclassed as a True Positive. The theory describing the now two True Positive cases can now be seen as provisionally "necessary" for the outcome, until another False Negative case is found and examined in a similar fashion. If the causal mechanism seems to be different then the case remains as a False Negative.
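The search for what two "most different" cases have in common can be sketched in the same way. The cases and attribute names are invented, and the function name is illustrative.

```python
def shared_attributes(case_1, case_2):
    """Attributes on which two cases have the same coded value."""
    return {attr: value for attr, value in case_1.items()
            if case_2[attr] == value}

# Hypothetical True Positive and its most different False Negative.
tp_case = {"training": 1, "funding": 1, "local_champion": 1, "urban": 1, "staff_turnover": 0}
fn_case = {"training": 0, "funding": 0, "local_champion": 1, "urban": 0, "staff_turnover": 1}

shared = shared_attributes(tp_case, fn_case)
print(shared)  # {'local_champion': 1}
```

Whether a shared attribute is actually part of the causal story is still a matter for within-case investigation; the comparison only narrows down where to look.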

Both the second and third step comparisons described above will help: (a) elaborate the details, and (b) establish the limits of the scope of the theory identified in step one. This suggested process makes use of the Confusion Matrix as a kind of very simple chess board, where pieces (aka cases) are introduced on to the board, one at a time, and then sometimes moved to other adjacent positions (depending on their relation to other pieces on the board). Or, the theory behind their chosen location is updated.

If there are only ten cases available to study, and these have an even distribution of outcomes present and absent, then this three step process of analysis could be reiterated five times, i.e. once for each case where the outcome was present. With two comparisons per iteration, this would involve up to 10 case comparisons, out of the 45 possible.

Measuring similarity

The above process depends on the ability to make systematic and transparent judgements about similarity. One way of doing this, which I have previously built into an Excel app called EvalC3, is to start by describing each case with a string of binary coded attributes of the same kind as used in QCA, and in some forms of supervised machine learning. An example set of workings can be seen in this Excel sheet, showing an imagined data set of 10 cases, with 10 different attributes, and then the calculation and use of Hamming distance as the similarity measure to choose cases for the kinds of comparisons described above. That list of attributes, and the Hamming distance measure, is likely to need to be updated as the investigation of False Positives and False Negatives proceeds.

Incidentally, the more attributes that have been coded per case, the more discriminating this kind of approach can become. This is in contrast to cross-case analysis, where an increase in the number of attributes per case is usually problematic.

Related sources

For some of my earlier thoughts on case comparative analysis see here. These were developed for use within the context of a cross-case analysis process. But the argument above is about how to proceed when the starting point is a within-case analysis.

See also:
  • Nielsen, R. A. (2014). Case Selection via Matching
  • de Meur, G., Bursens, P., & Gottcheiner, A. (2006). MSDO/MDSO Revisited for Public Policy Analysis. In B. Rihoux & H. Grimm (Eds.), Innovative Comparative Methods for Policy Analysis (pp. 67–94). Springer US. 
  • de Meur, G., & Gottcheiner, A. (2012). The Logic and Assumptions of MDSO–MSDO Designs. In The SAGE Handbook of Case-Based Methods (pp. 208–221). 
  • Rihoux, B., & Ragin, C. C. (Eds.). (2009). Configurational Comparative Methods: Qualitative Comparative Analysis (QCA) and Related Techniques. Sage. Pages 28-32 for a description of "MSDO/MDSO: A systematic  procedure for matching cases and conditions". 
  • Goertz, G. (2017). Multimethod research, causal mechanisms, and case studies: An integrated approach. Princeton University Press.

Thursday, April 19, 2012

Data mining algorithms as evaluation tools


For years now I have been in favour of theory-led evaluation approaches. Many of the previous postings on this website are evidence of this. But this post is about something quite different, about a particular form of data mining, how to do it and how it might be useful. Some have argued that data mining is radically different from hypothesis-led research (or evaluation, for that matter). Others have argued that there are some important continuities and complementarities (Yu, 2007).

Recently I have started reading about different data mining algorithms, especially the use of what are called classification trees and genetic algorithms (GAs). The latter was the subject of my recent post, about whether we could evolve models of development projects as well as design them. Genetic algorithms are software embodiments of the evolutionary algorithm (i.e. iterated variation, selection, retention) at work in the biological world. They are good for exploring large possibility spaces and for coming up with new solutions that may not be nearby to current practice.

I had wondered if this idea could be connected to the use of Qualitative Comparative Analysis (QCA), a method of identifying configurations of attributes (e.g. of development projects) associated with a particular type of outcomes (e.g. reduced household poverty). QCA is a theory-led approach, which uses very basic forms of data about attributes (i.e. categorical), then describes configurations of these attributes using Boolean logic expressions, and analyses these with the help of software that can compare and manipulate these statements. The aim is to come up with a minimal number of simple “IF…THEN” type statements describing what sorts of conditions are associated with particular outcomes. This is potentially very useful for development aid managers who are often asking about “what works where in what circumstances”. (But before then there is the challenge of getting on top of the technical language required to be able to do QCA).

My initial thought was whether genetic algorithms could be used to evolve and test statements describing different configurations, as distinct from constructing them one by one on the basis of a current theory. This might lead to quicker resolution, and perhaps discoveries that had not been suggested by current theory.

As described in my previous post, there is already a simple GA built into Excel, known as Solver. What I could not work out was how to represent logic elements like AND, NOT, OR in such a way that Solver could vary them to create different statements representing different configurations of existing attributes.  In the process of trying to sort out this problem I discovered that there is a  whole literature on GAs and rule discovery (rules as in IF-THEN statements). Around the same time, a technical adviser from FrontlineSolver suggested I try a different approach to the automated search for association rules. This involved the use of Classification Trees, a tool which has the merit of producing results which are readable by ordinary mortals, unlike the results of some other data mining methods. 

An example!

This Excel file contains a small data set, which has previously been used for QCA analysis. It contains 36 cases, each with 4 attributes and 1 outcome of interest. The cases relate to different ethnic minorities in countries across Europe and the extent to which there has been ethnic political mobilisation in their countries (being the outcome of interest). Both the attributes and outcomes are coded as either 0 or 1 meaning absent or present. 

With each case having up to four different attributes there could be 16 different combinations of attributes (2 to the power of 4). A classification algorithm in XLMiner software (and others like it) is able to automatically sort through these possibilities to find the simplest classification tree that can correctly point to where the different outcomes take place. XLMiner produced the following classification tree, which I have annotated and will walk through below.



We start at the top with the attribute “large”, referring to the size of the linguistic subnation within their own country. Those that are large have then been divided according to whether their subnational region is “growing” or not. Those that are not have then been divided into those who are a relatively “wealthy” group within their nation and those who are not. The smaller linguistic subnations have also been divided into those who are a relatively wealthy group within their nation and those who are not, and those who are relatively wealthy are then divided into those whose subnational region speaks and writes in its own language and those whose region does not. The square nodes at the end of each “branch” indicate the outcome associated with these combinations of conditions - where there has been ethnic political mobilisation (1) or not (0). Under each square node are the ethnic groups placed in that category. These results fit with the original data in Excel (right column).

This is my summary of the rules described by the classification tree:
  • IF a linguistic subnation’s economy is large AND growing THEN ethnic political mobilisation will be present [14 of 19 positive cases]
  • IF a linguistic subnation’s economy is large, NOT growing AND is relatively wealthy THEN ethnic political mobilisation will be present [2 of 19 positive cases]
  • IF a linguistic subnation’s economy is NOT large AND is relatively wealthy AND speaks and writes in its own language THEN ethnic political mobilisation will be present [3 of 19 positive cases]
Both QCA and classification trees have procedures for simplifying the association rules that are found. With classification trees there is an automated “pruning” option that removes redundant parts. My impression is that there are no redundant parts in the above tree, but I may be wrong.
These rules are, in realist evaluation terminology, describing three different configurations of possible causal processes. I say "possible" because what we have above are associations. Like correlation coefficients, they don't necessarily mean causation. However, they are at least candidate configurations of causal processes at work.
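The three rules summarised above can also be written as a single Boolean test. The following is a minimal sketch only: the attribute names are shorthand assumed for illustration (they are not the column headings of the Excel file), and it enumerates the 16 logically possible configurations rather than the 36 real cases.

```python
from itertools import product

# Attribute names are shorthand assumed for this sketch; they are not
# the column headings of the original Excel file.
ATTRIBUTES = ("large", "growing", "wealthy", "own_language")


def predicts_mobilisation(large, growing, wealthy, own_language):
    """Return True if any of the three summary rules applies."""
    rule_1 = large and growing
    rule_2 = large and not growing and wealthy
    rule_3 = (not large) and wealthy and own_language
    return bool(rule_1 or rule_2 or rule_3)


# Enumerate all 2**4 = 16 logically possible configurations of 0/1 values.
configurations = list(product((0, 1), repeat=len(ATTRIBUTES)))
positive = [c for c in configurations if predicts_mobilisation(*c)]

for config in positive:
    print(dict(zip(ATTRIBUTES, config)))
print(len(positive), "of", len(configurations), "configurations predict mobilisation")
```

Written this way, the three rules are mutually exclusive and together cover 8 of the 16 possible configurations. Whether each of those configurations is actually occupied by real cases is a separate question that only the data set can answer.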

The origins of this data set and its coding are described in pages 137-149 of The Comparative Method: Moving Beyond Qualitative and Quantitative Strategies by Charles C. Ragin, viewable on Google Books. Also discussed there is the QCA analysis of this data set and its implications for different theories of ethnic political mobilisation. My thanks to Charles Ragin for making the data set available.

I think this type of analysis, by both QCA and classification tree algorithms, has considerable potential use in the evaluation of development programs. Because it uses nominal data the range of data sources that can be used is much wider than statistical methods that need ratio or interval scale data. Nominal data can either be derived from pre-existing more sophisticated data (by using cut-off points to create categories) or be collected as primary data, including by participatory processes such as card/pile sorting and ranking exercises. The results in the form of IF…THEN rules should be of practical use, if only in the first instance as a source of hypotheses needing further testing by more detailed inquiries. 

There are some fields of development work where large amounts of potentially useful, but rarely used, data are generated on a continuing basis, such as microfinance services and, to a lesser extent, health and education services. Much of the demand for data mining capacity has come from industries that are finding themselves flooded with data, but lack the means to exploit it. This may well be the case with more development agencies in the future, as they make more use of interactive websites and mobile phone data collection methods and the like.

For those who are interested, there is a range of software worth exploring in addition to the package I have mentioned above. See these lists: A and B. I have a particular interest in GATree, which uses a genetic algorithm to evolve the best fitting classification tree, and to avoid the problem of being stuck in a “local optimum”. There is also another type of algorithm with the delightful name of Random Forests, which uses the “wisdom of crowds” principle to find the best fitting classification tree. But note the caveat: “Unlike decision trees, the classifications made by Random Forests are difficult for humans to interpret”. These and other algorithms are in use by participants in the Kaggle competitions online, which themselves could be considered as a kind of semi-automated meta-algorithm (i.e. an algorithm for finding useful algorithms). Lots to explore!

PS: I have just found and tested another package, called XLSTAT, that also generates classification trees. Here is a graphic showing the same result as found above, but in more detail. (Click on the image to enlarge it)

PS 29 April 2012: In QCA distinctions are made between a condition being "necessary" and/or "sufficient" for the outcome to occur. In the simplest setting a single condition can be a necessary and sufficient cause. In more complex settings a single condition may be a necessary part of a configuration of conditions which itself is sufficient but not necessary, for example a "growing" economy in the right branch of the first tree above. In classification trees the presence/absence of the necessary/sufficient conditions can easily be observed. If a condition appears in all "yes" branches of the tree (= different configurations) then it is "necessary". If a condition appears along with another in a given "yes" branch of a tree then it is not "sufficient". "Wealthy" is a condition that appears necessary but not sufficient. See more on this distinction in a more recent post: Representing different combinations of causal conditions

PS 4 May 2012: I have just discovered there is what looks like a very good open source data mining package called RapidMiner, which comes with a whole stack of training videos, and a big support and development community


PS 29 May 2012: Pertinent comment from Dilbert 

PS 3 June 2012: Prediction versus explanation: I have recently found a number of web pages on the issue of prediction versus explanation. Data mining methods can deliver good predictions. However, information relevant to good predictions does not always provide good explanations, e.g. smoking may be predictive of teenage pregnancy but it is not a cause of it (see interesting exercise here). So is data mining a waste of time for evaluators? On reflection it occurred to me that it depends on the circumstances and how the results of any analysis are to be used. In some circumstances the next steps may be to choose between existing alternatives. For example, which organisation or project to fund. Here good predictive knowledge, based on data about past performance, would be valuable. In other circumstances a new intervention may need to be designed from the ground up. Here access to some explanatory knowledge about possible causal relationships would be especially relevant. On further reflection, even where a new intervention has to be designed it is likely that it will involve choices of various modules (e.g. kinds of staff, kinds of activities) where knowledge of their past performance record is very relevant. But so would be a theory about their likely interactions.

At the risk of being too abstract, it would seem that a two-way relationship is needed: proposed explanations need to be followed by tested predictions, and successful predictions need to be followed by verified explanations.











Friday, March 16, 2012

Can we evolve explanations of observed outcomes?


In mathematics and computer science, an optimization problem is the problem of finding the best solution from all feasible solutions. There are various techniques for doing so.

Science as a whole can be seen as an optimisation process, involving a search for explanations that have the best fit with observed reality.

In evaluation we often have a similar task, of identifying what aspects of one or more project interventions best explain the observed outcomes of interest. For example, the effects of various kinds of improvements in health systems on rates of infant mortality. This can be done in two ways. One is by looking internally at the design of a project, at its expected workings and then trying to find evidence of whether it did so in practice. This is the territory of theory-led evaluation. The other way is to look externally, at alternative explanations involving other influences, and to seek to test those. This is ostensibly good practice but not very common in reality, because it can be time consuming and to some extent inconclusive, in that there may always be other explanations not yet identified and thus untested. This is where randomised control trials (RCTs) come in. Randomised allocation of subjects between control and intervention groups nullifies the possible influence of other external causes. Qualitative Comparative Analysis (QCA) takes a slightly different approach, searching for multiple possible configurations of conditions which are both necessary and sufficient to explain all observed outcomes (both positive and negative instances).

The value of theory led approaches, including QCA, is that the evaluator’s theories help the search for relevant data, amongst the myriad of possibly relevant design characteristics, and combinations thereof. The absence of a clear theory of change is often one reason why baseline surveys are so expansive in contents, but yet rarely used. Without a half way decent theory we can easily get lost. It is true that "There is nothing as practical as a good theory" (Kurt Lewin)

The alternative to theory led approaches

There is however an alternative search process which does not require a prior theory, known as the evolutionary algorithm, the kernel of the process of evolution. The evolutionary processes of variation, selection and retention, iterated many times over, have been able to solve many complex optimisation problems such as the design of a bird that can both fly long distances and dive deep in the sea for fish to eat. Genetic algorithms (GA) are embodiments of the same kinds of process in software programs, in order to solve problems of interest to scientists and businesses. These are useful in two respects. One is the ability to search very large combinatorial spaces very quickly. The other is that they can come up with solutions involving particular combinations of attributes that might not have been so obvious to a human observer.

Development projects have attributes that vary. These include both the context in which they operate and the mechanisms by which they seek to work. There are many possible combinations of these attributes, but only some of these are likely to be associated with achieving a positive impact on people’s lives. If they were relatively common then implementing development aid projects would not be so difficult. The challenge is how to find the right combination of attributes. Trial and error by varying project designs and their implementation on the ground is a good idea in principle, but in practice it is slow. There is also a huge amount of systemic memory loss, for various reasons including poor or non-existent communications between various iterations of a project design taking place in different locations.

Can we instead develop models of projects, which combine real data about the distribution of project attributes with variable views of their relative importance in order to generate an aggregate predicted result? This expected result can then be compared to an observed result (ideally from independent sources).  By varying the influence of the different attributes a range of predicted results can be generated, some of which may be more accurate than others. The best way to search this large space of possibilities is by using a GA. Fortunately Excel now includes a simple GA add-in, known as Solver.

The following spreadsheet shows a very basic example of what such a model could look like, using a totally fictitious data set. The projects and their observed scores on four attributes (A-D) are shown on the left. Below them is a set of weights, reflecting the possible importance of each attribute for the aggregate performance of the projects. The Expected Outcome score for each project is the sum of the score on each attribute x the weight for that score.  In other words the more a project has an important attribute (or combination of these) the higher will be its Expected Outcome score. That score is important only as a relative measure, relative to that of the other projects in the model.

The Expected Outcome score for each project is then compared to an Observed Outcome measure (ideally converted to a comparable scale), and the difference is shown as the Prediction Error. On the bottom left is an aggregate measure of prediction error, the Standard Deviation. The original data can be found in this Excel file.
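For readers who prefer code to spreadsheets, the same calculation can be sketched in a few lines. This is an illustration under stated assumptions, not a copy of the Excel file: the project scores and observed outcomes are invented, the weights are treated as percentages so that the Expected Outcome stays on the same scale as the attribute scores, and the sample standard deviation is used.

```python
from statistics import stdev

# Hypothetical data: scores on attributes A-D, then an observed outcome.
# These numbers are invented for illustration; they are not from the Excel file.
PROJECTS = {
    "Project 1": ([3, 1, 4, 2], 2.0),
    "Project 2": ([5, 2, 1, 4], 4.0),
    "Project 3": ([2, 4, 3, 1], 1.5),
    "Project 4": ([4, 3, 2, 5], 4.5),
}

# Equal weights, i.e. no prior view about which attribute matters most.
WEIGHTS = [25, 25, 25, 25]


def expected_outcome(scores, weights):
    """Weighted sum of attribute scores, rescaled by the total weight
    (an assumption made here to keep scores and outcomes comparable)."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


def prediction_errors(projects, weights):
    """Expected minus observed outcome, one value per project."""
    return [expected_outcome(scores, weights) - observed
            for scores, observed in projects.values()]


def error_sd(projects, weights):
    """Aggregate error measure: sample standard deviation of the errors."""
    return stdev(prediction_errors(projects, weights))


for name, (scores, observed) in PROJECTS.items():
    print(name, expected_outcome(scores, WEIGHTS), observed)
print("SD of prediction errors:", round(error_sd(PROJECTS, WEIGHTS), 3))
```

Changing the values in `WEIGHTS` and re-running the script is the manual equivalent of what the search process automates.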

 

The initial weights were set at 25 for each attribute, in effect reflecting the absence of any view about which might be more important. With those weights, the SD of the Prediction Errors was 1.25. After 60,000+ iterations in the space of 1 minute the SD had been reduced to 0.97. This was achieved with this new combination of weights: Attribute A: 19, Attribute B: 0, Attribute C: 19, Attribute D: 61. The substantial error that remains can be considered as due to causal factors outside of the model (i.e. as is described by the list of attributes)[1].
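The kind of search involved can be illustrated with a very small mutation-and-selection loop. To be clear about what this is: it is not the algorithm inside Solver, and it is simpler than a full genetic algorithm (there is no population and no crossover). It only shows the variation, selection and retention cycle. The project data are invented, and the function names are assumptions made for this sketch.

```python
import random
from statistics import stdev

# Hypothetical projects: (scores on attributes A-D, observed outcome).
PROJECTS = [
    ([3, 1, 4, 2], 2.0),
    ([5, 2, 1, 4], 4.0),
    ([2, 4, 3, 1], 1.5),
    ([4, 3, 2, 5], 4.5),
    ([1, 5, 5, 3], 3.0),
]


def error_sd(weights):
    """SD of (expected - observed), with weights treated as percentages."""
    total = sum(weights)
    errors = [sum(s * w for s, w in zip(scores, weights)) / total - observed
              for scores, observed in PROJECTS]
    return stdev(errors)


def mutate(weights, rng):
    """Variation: move one point of weight from one attribute to another."""
    new = list(weights)
    giver, taker = rng.sample(range(len(new)), 2)
    if new[giver] > 0:
        new[giver] -= 1
        new[taker] += 1
    return new


def evolve(weights, generations=5000, seed=0):
    """Selection and retention: keep a mutant only if it lowers the error."""
    rng = random.Random(seed)
    best, best_sd = list(weights), error_sd(weights)
    for _ in range(generations):
        candidate = mutate(best, rng)
        candidate_sd = error_sd(candidate)
        if candidate_sd < best_sd:
            best, best_sd = candidate, candidate_sd
    return best, best_sd


start = [25, 25, 25, 25]
best, best_sd = evolve(start)
print("start:", start, round(error_sd(start), 3))
print("best: ", best, round(best_sd, 3))
```

Because only improvements are retained, the final error can never be worse than the starting error. A loop like this can still get stuck in a local optimum, which is one reason real GAs maintain a population of candidate solutions.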

It seems that it is also possible to find the least appropriate solutions, i.e. those which make the least accurate Outcome Predictions. Using the GA set to find the maximum error, it was found that in the above example a 100% weighting given to Attribute A generated an SD of 1.87. This is the nearest that such an evolutionary approach comes to disproving a theory.

GAs deliver functional rather than logical proofs that certain explanations are better than others. Unlike logical proofs, they are not immortal. With more projects included in the model it is possible that there may be a fitter solution, which applies to this wider set. However, the original solution to the smaller set would still stand.

Models of complex processes can sometimes be sensitive to starting conditions. Different results can be generated from initial settings that are very similar. This was not the case in this exercise, with widely different initial weightings evolving and converging on almost identical sets of final weightings (e.g. 19, 0, 19, 62 versus 61), producing the same final error rate. This robustness is probably due to the absence of feedback loops in the model, which could be created where the weighted score of one attribute affected those of another. That would be a much more complex model, possibly worth exploring at another time.

Small changes in Attribute scores made a more noticeable difference to the Prediction Error. In the above model varying Project 8’s score on attribute A from 3 to 4 increases the average error by 0.02. Changes in other cells varied in direction of their effects. In more realistic models with more kinds of attributes and more project cases the results are likely to be less sensitive to such small differences in attribute scores.

The heading of this post asks “Can we evolve explanations of observed outcomes?” My argument above suggests that in principle it should be possible. However there is a caveat. A set of weighted attributes that are associated with success might better be described as the ingredients of an explanation. Further investigative work would be needed to find out how those attributes actually interact together in real life.  Before then, it would be interesting to do some testing of this use of GAs on real project datasets.

Your comments please...

PS 6 April 2012: I have just come across the Kaggle website. This site hosts competitions to solve various kinds of prediction problems (re both past and future events) using a data set available to all entrants, and gives prizes to the winner - who must provide not only their prediction but the algorithm that generated the prediction. Have a look. Perhaps we should outsource the prediction and testing of results of development projects via this website? :-) Though..., even to do this the project managers would still have a major task on hand: to gather and provide reliable data about implementation characteristics, as well as measures of observed outcomes... Though...this might be easier with some projects that generate lots of data, say micro-finance or education system projects.

 View this Australian TV video, explaining how the site works and some of its achievements so far. And the Fast Company interview of the CEO

PS 9 April 2012: I have just discovered that there is a whole literature on the use of genetic algorithms for rule discovery: "In a nutshell, the motivation for applying evolutionary algorithms to data mining is that evolutionary algorithms are robust search methods which perform a global search in the space of candidate solutions (rules or another form of knowledge representation)" (Freitas, 2002). The rules referred to are typically "IF...THEN..." type statements.






[1] Bear in mind that this example set of attribute scores and observed outcome measures is totally fictitious, so the inability to find a really good set of fitting attributes should not be surprising. In reality some sets of attributes will not be found co-existing because of their incompatibility e.g. corrupt project management plus highly committed staff



Monday, May 25, 2015

Characterising purposive samples



In some situations it is not possible to develop a random sample of cases to examine for evaluation purposes. There may be more immediate challenges, such as finding enough cases with sufficient information and sufficient quality of information.

The problem then is knowing to what extent, if at all, the findings from this purposive sample can be generalised, even in the more informal sense of speculating on the relevance of findings to other cases in the same general population.

One way this process can be facilitated is by "characterising" the sample, a term I have taken from elsewhere. It means to describe the distinctive features of something. This could best be done using attributes or measures that can, and probably already have been, used to describe the wider population where the sample came from. For example, the sample of people could be described as being of average age of 35 versus 25 in the whole population, and 35% women versus 55% in the wider population. This seems a rather basic idea, but it is not always applied.

Another more holistic way of doing so is to measure the diversity of the sample. This is relatively easy to do when the data set associated with the sample is in binary form, as for example is used in QCA analysis (i.e. cases are rows, columns are attributes and cell values of 0 or 1 indicate if the attribute was absent or present).

As noted in earlier blog postings, Simpson's Reciprocal Index is a useful measure of diversity. This takes into account two aspects of diversity: (a) richness, which in a data set could be seen in the number of unique configurations of attributes found across all the cases (think metaphorically of organisms as cases, chromosomes as configurations and genes as attributes) and (b) evenness, which could be seen in the relative number of cases having particular configurations. When the number of cases is evenly distributed across all configurations this is seen as being more diverse than when the number of cases per configuration varies.
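A minimal sketch of how this index could be calculated for a binary data set follows. It assumes the common form of the index, 1 divided by the sum of squared proportions of cases per configuration; other variants exist (for example, with a finite-sample correction), and the two small example data sets are invented.

```python
from collections import Counter


def simpson_reciprocal(cases):
    """1 / sum(p_i ** 2), where p_i is the share of cases having
    configuration i. Each case is a sequence of 0/1 attribute values."""
    counts = Counter(tuple(case) for case in cases)
    n = len(cases)
    return 1 / sum((k / n) ** 2 for k in counts.values())


# Hypothetical samples with two binary attributes each.
even_sample = [(0, 0), (0, 1), (1, 0), (1, 1)]    # one case per configuration
uneven_sample = [(0, 0), (0, 0), (0, 0), (1, 1)]  # cases bunched together

print(simpson_reciprocal(even_sample))
print(simpson_reciprocal(uneven_sample))
```

With this form of the index the maximum possible score equals the number of unique configurations, reached only when cases are spread evenly across them, so the evenly spread sample scores higher than the bunched one.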

The degree of diversity in a data set can have consequences. Where a data set has little diversity in terms of "richness" there is a possibility that configurations that are identified by QCA or other algorithm-based methods will have limited external validity, because they may easily be contradicted by cases outside the sample data set that are different from already encountered configurations. A simple way of measuring this form of diversity is to calculate the number of unique configurations in the sample data set as a percentage of the total number possible, given the number of binary attributes in the sample data set (which is 2 to the power of the number of attributes). The higher the percentage, the less risk that the findings will be contradicted by configurations found in new sets of data (all other things being constant).
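The percentage described above is simple to compute. The sketch below assumes every case is coded on the same set of binary attributes; the function name and the example cases are invented for illustration.

```python
def richness_percentage(cases):
    """Unique configurations observed, as a percentage of the
    2 ** (number of attributes) configurations logically possible."""
    n_attributes = len(cases[0])
    observed = len({tuple(case) for case in cases})
    possible = 2 ** n_attributes
    return 100 * observed / possible


# Hypothetical sample: four cases coded on three binary attributes.
sample = [(1, 0, 1), (1, 0, 1), (0, 1, 1), (0, 0, 0)]
print(richness_percentage(sample))
```

Here three of the eight possible configurations are present, so five configurations remain unobserved and could yet contradict any rule derived from the sample.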

Where a data set has little diversity in terms of "evenness" (balance) it will be more difficult to assess the consistency of any configuration's association with an outcome, compared to others, because there will be more cases associated with some configurations than others. Where there are more cases of a given configuration there will be more opportunities for its consistency of association with an outcome to be challenged by contrary cases.

My suggestion therefore is that when results are published from the analysis of purposive samples there should be adequate characterisation of the sample, both in terms of: (a) simple descriptive statistics available on the sample and wider population, and (b) the internal diversity of the sample, relative to the maximum scores possible on the two aspects of diversity.



Friday, December 22, 2023

Using the Confusion Matrix as a general-purpose analytic framework


Background

This posting has been prompted by work I have done this year for the World Food Programme (WFP) as a member of their Evaluation Methods Advisory Panel (EMAP). One task was to carry out a review, along with colleague Mike Reynolds, of the methods used in the 2023 Country Strategic Plans evaluations. You will be able to read about these, and related work, in a forthcoming report on the panel's work, which I will link to here when it becomes available.

One of the many findings of potential interest was: "there were relatively very few references to how data would be analysed, especially compared to the detailed description of data collection methods". In my own experience, this problem is widespread, found well beyond WFP. In the same report I proposed the use of what is known as the Confusion Matrix, as a general purpose analytic framework. Not as the only framework, but as one that could be used alongside more specific frameworks associated with particular intervention theories such as those derived from the social sciences.

What is a Confusion Matrix?

A Confusion Matrix is a type of truth table,  i.e., a table representing all the logically possible combinations of two variables or characteristics. In an evaluation context these two characteristics could be the presence and absence of an intervention, and the presence and absence of an outcome.  An intervention represents a specific theory (aka model), which includes a prediction that a specific type of outcome will occur if the intervention is implemented.  In the 2 x 2 version you can see above, there are four types of possibilities:

  1. The intervention is present and the outcome is present. Cases like this are known as True Positives.
  2. The intervention is present but the outcome is absent. Cases like this are known as False Positives.
  3. The intervention is absent and the outcome is absent. Cases like this are known as True Negatives.
  4. The intervention is absent but the outcome is present. Cases like this are known as False Negatives.
Common uses of the Confusion Matrix

The use of Confusion Matrices is most commonly associated with the field of machine learning and predictive analytics, but it has much wider application. Other fields of use include medical diagnostic testing, predictive maintenance, fraud detection, customer churn prediction, remote sensing and geospatial analysis, cyber security, computer vision, and natural language processing. In these applications the Confusion Matrix is populated by the number of cases falling into each of the four categories. These numbers are in turn the basis of a wide range of performance measures, which are described in detail in the Wikipedia article on the Confusion Matrix. A selection of these is described here, in this blog on the use of the EvalC3 Excel app.
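As a rough illustration of how the matrix is populated and then used, here is a minimal sketch. The cases are invented, the function names are assumptions, and only three of the many available performance measures are shown.

```python
def confusion_counts(cases):
    """Count TP, FP, TN and FN from (intervention, outcome) pairs,
    where 1 means present and 0 means absent."""
    labels = {(1, 1): "TP", (1, 0): "FP", (0, 0): "TN", (0, 1): "FN"}
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for intervention, outcome in cases:
        counts[labels[(int(bool(intervention)), int(bool(outcome)))]] += 1
    return counts


def performance(counts):
    """A few standard measures; None where the denominator is zero."""
    tp, fp, tn, fn = (counts[k] for k in ("TP", "FP", "TN", "FN"))
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else None,
        "precision": tp / (tp + fp) if (tp + fp) else None,  # TP / all predicted positives
        "recall": tp / (tp + fn) if (tp + fn) else None,     # TP / all actual positives
    }


# Hypothetical cases: (intervention present?, outcome present?)
cases = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1)]
counts = confusion_counts(cases)
print(counts)
print(performance(counts))
```

Precision falls as False Positives accumulate and recall falls as False Negatives accumulate, which is why the two measures are usually read together.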

The claim

Although the use of a Confusion Matrix is  commonly associated with quantitative analyses of performance, such as the accuracy of predictive models, it can also be a useful framework for thinking in more qualitative terms. This is a less well known and publicised use, which I elaborate on below. It is the inclusion of this wider potential use that is the basis of my claim that the Confusion Matrix can be seen as a general-purpose analytic framework.

The supporting arguments

The claim has at least four main arguments:

  1. The structure of the Confusion Matrix serves as a useful reminder and checklist that at least four different kinds of cases should be sought when constructing and/or evaluating a claim that X (e.g. an intervention) led to Y (e.g. an outcome).
    1. True Positive cases, which we will usually start looking for first of all. At worst, this is all we look for.
    2. False Positive cases, which we are often advised to look for, but often don't invest much time in actually finding. Here we can learn what does not work and why.
    3. False Negative cases, which we probably look for even less often. Here we can learn what else works, and perhaps why.
    4. True Negative cases, because sometimes there are asymmetric causes at play, i.e. not just the absence of the expected causes.
  2. The contents of the Confusion Matrix help us to identify interventions that are necessary, sufficient or both. This can be practically useful knowledge.
    1. If there are no FP cases, this suggests an intervention is sufficient for the outcome to occur. The more cases we investigate without finding an FP, the stronger this suggestion is. But if even one FP is found, that tells us the intervention is not sufficient. Single cases can be informative. Large numbers of cases are not always needed.
    2. If there are no FN cases, this suggests an intervention is necessary for the outcome to occur. The more cases we investigate without finding an FN, the stronger this suggestion is. But if even one FN is found, that tells us the intervention is not necessary.
    3. If there are no FP or FN cases, this suggests an intervention is sufficient and necessary for the outcome to occur. The more cases we investigate without finding an FP or FN, the stronger this suggestion is. But if even one FP or FN is found, that tells us that the intervention is not sufficient or not necessary, respectively.
  3. The contents of the Confusion Matrix help us identify the type and scale of errors and their acceptability. FP and FN cases are two different types of error that have different consequences in different contexts. A brain surgeon will be looking for an intervention that has a very low FP rate, because errors in brain surgery can be fatal and so cannot be recovered from. On the other hand, a stockmarket investor is likely to be looking for a more general purpose model, with few FNs. However, it only has to be right 55% of the time to still make them money. So a high rate of FPs may not be a big concern. They can recover their losses through further trading. In the field of humanitarian assistance the corresponding concerns are with coverage (reaching all those in need, i.e. minimising False Negatives) and leakage (minimising inclusion of those not in need, i.e. False Positives). There are Confusion Matrix based performance measures for both kinds of error and for the degree to which both kinds of error are balanced (see the Wikipedia entry).
  4. The contents of the Confusion Matrix can help us identify usefull case studies for comparison purposes. These can include
    1. Cases which exemplify the True Positive results, where the model (e.g an intervention) correctly predicted the presence of the outcome. Look within these cases to find any likely causal mechanisms connecting the intervention and outcome. Two sub-types can be useful to compare:
      1. Modal cases, which represent the most common characteristics seen in this group, taking all comparable attributes into account, not just those within the prediction model. 
      2. Outlier cases, which represent those which were most dissimilar to all other cases in this group, apart from having the same prediction model characteristics
    2. Cases which exemplify the False Positives, where the model incorrectly predicted the presence of the outcome.There are at least two possible explanations that can be explored:
      1. In the False Positive cases, there are one or more other factors that all the cases have in common, which are blocking the model configuration from working i.e. delivering the outcome
      2. In the True Positive cases, there are one or more other factors that all the cases have in common, which are enabling the model configuration from working i.e. delivering the outcome, but which are absent in the False Positive cases
        1. Note: For comparisons with TPs cases, TP and FP cases should be maximally  similar in their case attributes. I think this is called  MSDO (most similar, different outcome) based case selection
    3. Cases which exemplify the False Negatives, where the outcome occurred despite the absence the attributes of the model. There are three possibilities of interest here:
      1. There may be some False Negative cases that have all but one of the attributes found in the prediction model. These cases would be worth examining, in order to understand why the absence of a particular attribute that is part of the predictive model does not prevent the outcome from occurring. There may be some counter-balancing factor at work, enabling the outcome.
      2. It is possible that some cases have been classed as FNs because they were missing specific data on crucial attributes that would have otherwise classed them as TPs.
      3. Other cases may represent genuine alternatives, which need within-case investigation to identify the attributes that appear to make them successful 
    4. Cases which exemplify the True Negatives, where the absence of the attributes of the model is associated with the absence of the outcome.
      1. Normally these are seen as not being of much interest. But there may be cases here with all but one of the intervention attributes. If found then the missing attribute may be viewed as:
        1. A necessary attribute, without which the outcome cannot occur
        2. An INUS attribute i.e. an attribute that is Insufficient but Necessary in a configuration that is Unnecessary but Sufficient for the outcome (See Befani, 2016). It would then be worth investigating how these critical attributes have their effects by doing a detailed within-case analysis of the cases with the critical missing attribute.
      2. Cases may become TNs for two reasons. The first, and most expected, is that the causes of positive outcomes are absent. The second, which is worth investigating, is that there are additional and different causes at work which are causing the outcome to be absent. The first of these is described as causal symmetry, the second as causal asymmetry. Because of the second possibility, it is worthwhile paying close attention to TN cases to identify the extent to which symmetrical or asymmetrical causes are at work. The findings could have significant implications for any intervention that is being designed. Here a useful comparison would be between maximally similar TP and TN cases.
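
The case selection steps above could be sketched in Python along the following lines. This is only an illustration under my own assumptions, not a description of how any particular tool does it: the case names, the attributes (`training`, `funding`, `partner`, `urban`, `prior_contact`) and the two-attribute model are all made up, attributes are assumed to be binary, and similarity between cases is assumed to be measured by Hamming distance (the number of attributes on which two cases differ). "Modal" is read here as the case with the smallest total distance to the others in its group, and "outlier" as the case with the largest, which is one possible interpretation among several.

```python
from collections import defaultdict

OUTCOME = "outcome"
MODEL = {"training": 1, "funding": 1}  # hypothetical prediction model

# Hypothetical cases: binary attributes plus the observed outcome.
CASES = {
    "c1": {"training": 1, "funding": 1, "partner": 1, "urban": 1, "prior_contact": 0, "outcome": 1},
    "c2": {"training": 1, "funding": 1, "partner": 1, "urban": 1, "prior_contact": 1, "outcome": 1},
    "c3": {"training": 1, "funding": 1, "partner": 0, "urban": 0, "prior_contact": 1, "outcome": 1},
    "c4": {"training": 1, "funding": 1, "partner": 0, "urban": 0, "prior_contact": 0, "outcome": 0},
    "c5": {"training": 1, "funding": 0, "partner": 1, "urban": 1, "prior_contact": 1, "outcome": 1},
    "c6": {"training": 0, "funding": 0, "partner": 1, "urban": 0, "prior_contact": 1, "outcome": 1},
    "c7": {"training": 0, "funding": 1, "partner": 0, "urban": 1, "prior_contact": 0, "outcome": 0},
    "c8": {"training": 0, "funding": 0, "partner": 0, "urban": 0, "prior_contact": 0, "outcome": 0},
}

def mismatches(attrs, model):
    """Number of model attributes that the case does not match."""
    return sum(attrs[k] != v for k, v in model.items())

def cell(attrs, model):
    """Confusion Matrix cell (TP, FP, FN or TN) for one case."""
    predicted = mismatches(attrs, model) == 0
    observed = attrs[OUTCOME] == 1
    if predicted:
        return "TP" if observed else "FP"
    return "FN" if observed else "TN"

def distance(a, b):
    """Hamming distance over all attributes except the outcome."""
    return sum(a[k] != b[k] for k in a if k != OUTCOME)

def group_cases(cases, model):
    """Sort case names into the four Confusion Matrix cells."""
    groups = defaultdict(list)
    for name, attrs in cases.items():
        groups[cell(attrs, model)].append(name)
    return dict(groups)

def modal_and_outlier(cases, names):
    """Case with the smallest, and the largest, total distance to its group."""
    total = {n: sum(distance(cases[n], cases[m]) for m in names) for n in names}
    return min(total, key=total.get), max(total, key=total.get)

def near_misses(cases, names, model):
    """Cases that have all but one of the model attributes."""
    return [n for n in names if mismatches(cases[n], model) == 1]

def most_similar(cases, name, candidates):
    """Candidate closest to the named case, for MSDO-style comparisons."""
    return min(candidates, key=lambda c: distance(cases[name], cases[c]))

groups = group_cases(CASES, MODEL)
modal_tp, outlier_tp = modal_and_outlier(CASES, groups["TP"])
fn_near = near_misses(CASES, groups["FN"], MODEL)
tn_near = near_misses(CASES, groups["TN"], MODEL)
msdo_pairs = {fp: most_similar(CASES, fp, groups["TP"]) for fp in groups["FP"]}

print(groups)
print("Modal TP:", modal_tp, "| Outlier TP:", outlier_tp)
print("FN near misses:", fn_near, "| TN near misses:", tn_near)
print("FP paired with most similar TP:", msdo_pairs)
```

With this toy data, c2 comes out as the modal TP case and c3 as the outlier, c5 and c7 are the "all but one attribute" cases among the FNs and TNs respectively, and the FP case c4 is paired with c3 as its most similar TP. The same `most_similar` function could be used to pair TP cases with maximally similar TN cases. None of this replaces the within-case investigation; it only narrows down which cases to look at first.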
Resources

Some of you may know that I have built the Confusion Matrix into the design of EvalC3, an Excel app for cross-case analysis that combines measurement concepts from the disparate fields of machine learning and QCA (Qualitative Comparative Analysis). With fair winds, this should become available as a free-to-use web app in early 2024, courtesy of a team at Sheffield Hallam University. There you will be able to explore and exploit the uses of the Confusion Matrix for both quantitative and qualitative analyses.