Friday, June 26, 2015

Evolving better performing hypotheses, using Excel



You may not know this, but there is an Excel add-in called Solver. This can be used to evolve better solutions to computation problems.

It can also be used to identify hypotheses that have a better fit with available data. Here is how I have been using Solver in Excel....

I start by entering a data set into Excel that is made up of cases (e.g.25 projects), listed row by row. The columns then describe attributes of those projects, e.g. as captured in project completion reports. These attributes can include aspects of the project context, the way the project was working plus some outcome performance measures.

A filtering mechanism is then set up in Excel, where you can chose a specific sub-set of the attributes that are of interest (aka our hypothesis), and then all the projects that have these attributes are automatically highlighted.

The performance of this sub-set of projects is then tested, using a simple device called a Confusion Matrix , which tells us what proportion of the selected projects have "successful" versus "unsuccessful" outcomes (if that is the outcome of interest). Ideally, if we have selected the right set of attributes then the proportion of "successful" projects in the selected sub-set will be greater than their prevalence rate in the whole set of projects.

But given that the number of possible hypotheses available to test doubles with each extra attribute added into the data set, this sort of intuitive/theory led search could take us a long time to find the best available solution. Especially when the data set has a large number of attributes.

This is where Solver helps. Using its evolutionary algorithm the Solver add-in provides a quick means of searching a very large space of possible solutions. To do this there are three parameter which need to be set, before setting it to work. The first is the Objective, which is the value you want to maximise. I usually choose the "Accuracy" measure in the Confusion Matrix. The second is the range of cells whose values can be varied. These are the ones identifying the set of project attributes, which can be used to make a hypothesis. They can be set to (1), absent (0) or not relevant (2). The third is is the Constraints, which limit the values that these variable cells can take e.g. not negative, and nothing other than these three types of values.

In the space of a minute Solver then explores up to 70,000 possible combinations of project attributes to find the combination that generate the most accurate prediction i.e. a set of projects with the highest possible proportion of "successful" projects. In my recent trials, using a real data set, the accuracy levels have been up around 90%. I have been able to compare the results with those found using another algorithm which I have written about in earlier posts here, called a  Decision Tree algorithm. Despite being quite different algorithms, the solutions that both algorithms have found (i.e. the specific set of project attributes) have been very similar in content (i.e. the attributes in the solution), and both had the same level of accuracy.

An important side benefit of this approach to finding out "what works" is that by manually tweaking the solutions found by Solver you can measure the extent to which each attribute in the winning solution makes a difference to its overall accuracy. This is the kind of question many donors want to answer re the projects they fund, when they ask "What difference does my particular contribution make?"

If you want help setting up Excel to try out this method, and have a data set ready to work on, feel free to contact me for advice: rick.davies@gmail.com, or leave a comment below

Tuesday, June 23, 2015

Is QCA it's own worst enemy?

[As you may have read elsewhere on this blog] QCA stands for Qualitative Comparative Analysis. It is a method that is finding increased use as an evaluation tool, especially for exploring claims about the causal role of interventions of various kinds. What I like about it is its ability to recognize and analyse complex causal configurations, which have some fit with the complexity of the real world as we know it.

What I don't like about it is its complexity, it can sometimes be annoyingly obscure and excessively complicated. This is a serious problem if you want to see the method being used more widely and if you want the results to be effectively communicated and properly understood. I have seen instances recently where this has been such a problem that it threatened to derail an ongoing evaluation.

In this blog post I want to highlight where the QCA methodology is unnecessary complex and and suggest some ways to avoid this type of problem. In fact I will start with the simple solution, then explain how QCA manages to make it more complex.

Let me start with a relatively simple perspective. QCA analyses fall in to the broad category of "classifiers". These include a variety of algorithmic processes for deciding what category various instances belongs to. For example which types of projects were successful or not in achieving their objectives.

I will start with a two by two table, a Truth Table, showing the various possible results that can be found, by QCA and other methods. Configuration X here is a particular combination of conditions that an analysis has found to be associated with the presence of an outcome. The Truth Table helps us identify just how good that association is, by comparing the incidences where the configuration is present or absent with the incidences where the outcome is present or absent.


As I have explained in an earlier blog, one way of assessing the adequately of the result shown in such a matrix is by using a statistical test such as Chi Square, to see if the distribution is significantly different from what a chance distribution would look like.There are only two possible results when the outcome is present: the association is statistically significant or it is not.

However, if you import the ideas of Necessary and/or Sufficient causes the range of interesting results increases. The matrix can now show four possible types of results when the outcome is present:

  1. The configuration of conditions is Necessary and Sufficient for the outcome to be present. Here cells C and B would be empty of cases
  2. The configuration of conditions is Necessary but Insufficient for the outcome to be present. Here cell C would be empty of cases
  3. The configuration of conditions is Unnecessary but Sufficient for the outcome to be present. Here cell  B would be empty of cases
  4. The configuration of conditions is Unnecessary and Insufficient for the outcome to be present. Here no cells would be empty of cases
The interesting thing about the first three options is that they are easy to disprove. There only needs to be one case found in the cell(s) meant to be empty, for that claim to be falsified.

And we can provide a lot more nuance to the type 4 results, by looking at the proportion of cases found in cells B and C, relative to cell A. The proportion of A/(A+B) tells us about the consistency of the results, in the simple sense of consistency of results found via an examination of a QCA Truth Table. The proportion of A/(A+B) tells us about the coverage of the results, as in the proportion of all present outcomes that exist that were identified by the configuration. 

So how does QCA deal with all this? Well, as far as I can see, it does so in a way makes it more complex than necessary. Here I am basing my understanding mainly on Schneider and Wagemann's account of QCA.
  1. Firstly, they leave aside the simplest notions of Necessity and Sufficiency as described above, which are based on a categorical notion of Necessity and Sufficiency i..e a configuration either is or is not Sufficient etc. One of the arguments I have seen for doing this is these types of results are rare and part of this may be due to measurement error, so we should take  more generous/less demanding view of what constitutes Necessity and Sufficiency
  2. Instead they focus on Truth Tables with results as shown below (classed as 4. Unnecessary and Insufficient above). They then propose ways of analyzing these in terms of having degrees of Necessity and Sufficiency conditions. This involves two counter-intuitive mirror-opposite ways of measuring the consistency and coverage of the results, according to whether the focus is on analyzing the extent of Sufficiency or Necessity conditions (see Chapter 5 for details)
  3. Further complicating the analysis is the introduction of a minimum thresholds for the consistency of Necessity and Sufficiency conditions (because the more basic categorical idea has been put aside). There is no straightforward basis for defining these levels. It is suggested that they depend on the nature of the problem being identified.

  Configuration X contains conditions which are neither Necessary or Sufficient 

Using my strict interpretation of Sufficiency and Necessity there is no need for a consistency measure where a condition (or configuration) is found to be Sufficient but Unnecessary, because there will be no cases in cell B. Likewise, there is no need for a coverage measure where a condition (or configuration) is found to be Necessary but Insufficient, because there will be no cells in cell C,

We do need to know the consistency where a condition (or configuration) is Necessary but Insufficient, and the the coverage, where where a condition (or configuration) is found to be Sufficient but Unnecessary.

Monday, May 25, 2015

Characterising purposive samples



In some situations it is not possible to develop a random sample of cases to examine for evaluation purposes. There may be more immediate challenges, such as finding enough cases with sufficient information and sufficient quality of information.

The problem then is knowing to what extent, if at all, the findings from this purposive sample can be generalised, even in the more informal sense of speculating on the relevance of findings to other cases in the same general population.

One way this process can be facilitated is by "characterising" the sample, a term I have taken from elsewhere. It means to describe the distinctive features of something. This could best be done using attributes or measures that can, and probably already have been, used to describe the wider population where the sample came from. For example, the sample of people could be described as being of average age of 35 versus 25 in the whole population, and 35% women versus 55% in the wider population. This seems a rather basic idea, but it is not always applied.

Another more holistic way of doing so is to measure the diversity of the sample. This is relatively easy to do when the data set associated with the sample is in binary form, as for example is used in QCA analysis (i.e. cases are rows, columns are attributes and cell values of 0 or 1 indicate if the attributes was absent or present)

As noted in earlier blog postings,Simpsons Reciprocal Index is a useful measure of diversity. This takes into account two aspects of diversity: (a) richness, which in a data set could be seen in the number of unique configurations of attributes found across all the cases( think metaphorically of organisms - cases, chromosomes-configurations and genes-attributes) and (b) evenness, which could be seen in the relative number of cases having particular configurations. When the number of cases is evenly distributed across all configurations this is seen as being more diverse than when the number of cases per configuration varies.

The degree of diversity in a data base can have consequences. Where a data set that has little diversity in terms of "richness" there is a possibility that configurations that are identified by QCA or other algorithmic based methods, will have limited external validity, because they may easily be contradicted by cases outside the sample data set that are different from already encountered configurations. A simple way of measuring this form of diversity is to calculate the original number of unique configurations in the sample data set as a percentage of the total number possible, given the number of binary attributes in the sample data set (which is 2 to the power of the number of attributes). The higher the percentage, the less risk that the findings will be contradicted by configurations found in new sets of data (all other things being constant).

Where a data set has little diversity in terms of "balance" it will be more difficult to assess the consistency of any configuration's association with an outcome, compared to others, because there will be more cases associated with some configurations than others. Where there are more cases of a given configuration there will be more opportunities for its consistency of association with an outcome to be challenged by contrary cases.

My suggestion therefore is that when results are published from the analysis of purposive samples there should be adequate characterisation of the sample, both in terms of: (a) simple descriptive statistics available on the sample and wider population, and (b) the internal diversity of the sample, relative to the maximum scores possible on the two aspects of diversity.



Wednesday, May 20, 2015

Evaluating the performance of binary predictions


(Updated 2015 06 06)

Background: This blog posting has its origins in a recent review of a QCA oriented evaluation, in which a number of hypotheses were proposed and then tested using a QCA type data set. In these data set cases (projects) are listed row by row and the attributes of these projects are listed in columns. Additional columns to the right describe associated outcomes of interest. The attributes of the projects may include features of the context as well as the interventions involved. The cell values in the data sets were binary (1=attribute present, 0= not present), though there are other options.

When such a data set is available a search can be made for configurations of conditions that are strongly associated with an outcome of interest. This can be done inductively or deductively. Inductive searches involve the uses of systematic search processes (aka algorithms), of which there are a number available. QCA uses the Quine–McCluskey algorithm. Deductive searches involve the development of specific hypotheses from a body of theory, for example about the relationship between the context, intervention and outcome.

Regardless of which approach is used, the resulting claims of association need evaluation. There are a number of different approaches to doing this that I know of, and probably more. All involve, in the simplest form, the analysis of a truth table in this form:


In this truth table the cell values refer to the number of cases that have each combination of configuration and outcome. For further reference below I will label each cell as A and B (top row) and C and D (bottom row)

The first approach to testing is a statistical approach. I am sure that there are a number of ways of doing this, but the one I am most familiar with is the widely used Chi-Square test. Results will be seen as most statistically significant when all cases  are in the A and D cells. They will be least significant when they are equally distributed across all four cells.

The second approach to testing is the one used by QCA. There are two performance measures. One is Consistency, which is the proportion of all cases where the configuration is present and the outcome is also present (=A/(A+B)). The other is Coverage, which is the proportion of all outcomes that are associated with the configuration (=(A/(A+C)).

When some of the cells have 0 values three categorical judgements can also be made. If only cell B is empty then it can be said that the configuration is Sufficient but not Necessary. Because there are still values in cell C this means there are other ways of achieving the outcome in addition to this configuration.

If only cell C is empty then it can be said that the configuration is Necessary but not Sufficient. Because there are still values in cell B this means there are other additional conditions that are needed to ensure the outcome.

If cells B and C are empty then it can be said that the configuration is both Necessary and Sufficient

In all three situations there only needs to be one case to be found in a previously empty cell(s) to disprove the standing proposition. This is a logical test, not a statistical test.

The third approach is one used in the field of machine learning, where the above matrix is known as a Confusion Matrix. Here there is a profusion of performance measures available (at least 13). Some of the more immediately useful measures are:
  • Accuracy: (A+D)/(A+B+C+D), which is similar to but different from the Chi Square measure above
  • True Positives: A/(A+B), also called Precision, which corresponds to QCA consistency
  • True Negatives: D/(B+D)
  • False Positives: C/(A+C)
  • False Negatives: B/(B+D)
  • Positive predictive value: A/(A+C), also called Recall, which corresponds to QCA coverage
  • Negative predictive value: D/(C+D)
In addition to these three types of tests there are three other criteria that are worth taking into account as well: simplicity, diversity and similarity

Simplicity: Essentially the same idea as that captured in Occam's Razor. This is that simpler configurations are preferable, all other things being equal. For example: A+F+J leads to D is a simpler hypothesis than A+X+Y+W+F leads to D. Complex configurations can have a better fit with the data, but at the cost of being poor at generalising to other contexts. In Decision Tree modelling this is called "over-fitting" and solution is "pruning", i.e. cutting back on the complexity of the configuration.  Simplicity has practical value, when it comes to applying tested hypotheses in real life programmes. They are easier to communicate and to implement. Simplicity can be measured at two levels: (a) the number of attributes in a configuration that is associated with an outcome, (b) and the number of configurations needed to account for an outcome.

Diversity: The diversity of configurations is simply the number of different specific configurations in a data set. It can be made into a comparable measure by calculating it as a percentage of the total number possible. The total number possible is 2 to the power of A where A = number of kinds of attributes in the data set. A bigger percentage = more diversity.

If you want to find how "robust" a hypothesis is, you could calculate the diversity present in the configurations of all the cases covered by the hypothesis (i.e. not just the attributes specified by the hypotheses, which will be all the same). If that percentage is large this suggests the hypothesis works in a greater diversity of circumstances, a feature that could be of real practical value.

This notion of diversity is to some extent implicit in the Coverage measure. More coverage implies more diversity of circumstances. But two hypotheses with the same coverage (i.e. proportion of cases they apply to) could be working in circumstances with quite different degrees of diversity (i.e. the cases covered were much more diverse in their overall configurations).

Similarity: Each row in a QCA like data set is a string of binary values. The similarity of these configurations of attributes can be measured in a number of ways:
  • Jaccard index, the proportion of all instances in two configurations where the binary value 1 is present in the same position i.e. the same attribute is present.
  • Hamming distance, the number of positions at which the corresponding values in two configurations are different. This includes the values 0 and 1, whereas Jaccard only looks at 1 values
These measures are relevant in two ways, which are discussed in more detail further down this post:
  • If you want to find a "representative" case in a  data set, you would look for the case with the lowest average Hamming distance in the whole data set
  • If you wanted to compare the two most similar cases, you would look for the pair of cases with the lowest Hamming distance.
Similarity can be seen as a third facet of diversity, a measure of the distance between any two types of cases. Stirling (2007) used the term disparity to describe the same thing.

Choosing relevant criteria: It is important to note that the relevance of these different association tests and criteria will depend on the context. A surgeon would want a very high level of consistency, even if it was at the cost of low coverage (i.e. applicable only in a limited range of situations). However, a stock market investor would be happy with a consistency of 0.55 (i.e 55%), especially if it had wide coverage. Even more so if that wide coverage contained a high level of diversity. Returning to the medical example, a false positive might have different consequences to false negatives e.g. unnecessary surgery versus unnecessary deaths. In other non-medical circumstances, false positives may be more expensive mistakes than false negatives.

Applying the criteria: My immediate interest is in the use of these kinds of tests for two evaluation purposes. The first is selective screening of hypotheses about causal configurations that are worth more time intensive investigations, an issue raised in a recent blog.
  • Configurations that are Sufficient and not Necessary or Necessary but not Sufficient. 
    • Among these, configurations which were Sufficient but not Necessary, and with high coverage should be selected, 
    • And configurations which were Necessary but not Sufficient, and with high consistency, should also be selected. 
  • Plus all configurations that were Sufficient and Necessary (which are likely to be less common)
The second purpose is to identify implications for more time consuming within-case investigations. These are essential, in order to identify casual mechanism at work that connect the conditions that are associated in a given configuration. As I have argued elsewhere, associations are a necessary but insufficient basis for a strong claim of causation. Evidence of mechanisms is like muscles on the bones of a body, enabling it to move.

Having done the filtering suggested above, the following kinds of within-case investigations would seem useful:
  • Are there any common casual mechanisms underlying all the cases  found to be Necessary and Sufficient, i.e those within cell A? 
    • A good starting point would be a case within this set of cases that had the lowest average Hamming distance, i.e. one with the highest level of similarity with all the other cases. 
    • Once one or more plausible mechanism were discovered in that case a check could be made to see if they are present in other cases in that set, this could be done in two ways: (a) incrementally, by examining adjacent cases, i.e cases with the lowest Hamming distance from the representative case, (b) by partitioning the rest of the cases, and examining a case with a median level Hamming distance, i.e. half way between being the most similar and most different cases.
  • Where the configuration is Necessary but not Sufficient, how do the cases in cell B differ from those in cell A, in ways that might shed more light on how the same configuration leads to different outcomes? This is what has been called a MostSimilarDifferentOutcome (MSDO) comparison,
    • If there are many cases this could be quite a challenge, because the cases could differ on many dimensions (i.e. on many attributes). But using the Hamming distance measure we could make this problem more manageable by selecting a case from cell A and B that had the lowest possible Hamming distance. Then a within-case investigation could find additional undocumented differences that account for some or all of the difference in outcomes. 
      • That difference could then be incorporated into the current hypothesis (and data set) enabling more cases from cell B to now be found in cell A i..e Consistency would be improved
  • Where the configuration is Sufficient but not Necessary,in what ways are the cases in cell C the same as those in cell A, in ways that might shed more light on how the same outcome is achieved by different configurations? This is what has been called a MostDifferentSimilarOutcome (MDSO) comparison,
    • As above, if there are many cases this could be quite a challenge. Here I am less clear, but de Meur et al (page 72) say the correct approach is "...one has to look for similarities in the characteristics of initiatives that differ the most from each other; firstly the identification of the most differing pair of cases and secondly the identification of similarities between those two cases" The within-case investigation should look for undocumented similarities that account for some of the similar outcomes. 
      • That difference could then be incorporated into the current hypothesis (and data set) enabling more cases from cell C to now be found in cell A i..e Coverage would be improved