
Tuesday, October 07, 2014

Comparing QCA and Decision Tree models - an ongoing discussion



This blog post is a continuation of a dialogue based on Michaela Raab and Wolfgang Stuppert's EVAW blog. I would have preferred to post my response via their blog's Comment facility, but it can't cope with long responses or hypertext links. They in turn have had difficulty posting comments on my YouTube site, where this EES presentation, Triangulating the results of Qualitative Comparative Analyses (EES Dublin 2014), can be seen. It was this presentation that prompted their response here on their blog.

Hi Michaela and Wolfgang

Thanks for going to the trouble of responding in detail to my EES presentation.

Before responding in detail I should point out to readers that the EES presentation was about triangulation, and how to compare QCA and Decision Tree models when applied to the same data set. In my view it is unlikely that either of these methods will produce the "best" results in all circumstances. The interesting challenge is to develop ways of thinking about how to compare and choose between specific models generated by these, and other comparable, methods of analysis. The penultimate slide (#17) in the presentation highlights the options I think we can try out when faced with different kinds of differences between models.

The rest of this post responds to particular points that have been made by Michaela and Wolfgang, and then makes a more general conclusion.

Re "1. The decision tree analysis is not based on the same data set as our QCA". This is correct. I was in a bit of a quandary because, while the original data set was fuzzy set (i.e. there were intermediate values between 0 and 1), the solutions that were found were described in binary form, i.e. the conditions and outcomes either were or were not present. I did produce a Decision Tree with the fuzzy-set data but I had no easy means of comparing the results with the binary results of the QCA model. That said, Michaela and Wolfgang are right in expecting that such a model would be more complex and have more configurations.

Re "2. Decision tree analysis is compared with a type of QCA solution that is not meant to maximise parsimony." I agree that "If the purpose was to compare the parsimony of QCA results with those of decision trees, then the 'parsimonious' QCA solution should be used". But the intermediate solution was the one available to me, and parsimony was not the only criterion of interest in my presentation. Accuracy (or consistency, in QCA terms) was also of interest. But it was the difference in parsimony that stood out the most in this particular model comparison.

Re "3. The decision tree analysis performs less well than stated in the presentation". Here I think I disagree. The focus of the presentation is on the consistency of those configurations that predict effective evaluations only (indicated in the tree diagram by squares with a 0.0 value rather than a 1.0 value), not the whole model. Among the three configurations that predict effective evaluations the consistency was 82%. Slide 15 may have confused the discussion because the figures there refer to coverage rather than consistency (I should have made this clear).
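For readers unfamiliar with these two measures, here is a minimal sketch of how crisp-set consistency and coverage can be calculated for a single configuration. The data are invented; the formulas are the standard crisp-set definitions (consistency = cases with the configuration and the outcome / cases with the configuration; coverage = cases with the configuration and the outcome / cases with the outcome).

```python
# Crisp-set consistency and coverage for one configuration (hypothetical data)
cases = [
    # (has_configuration, has_outcome) for each case
    (1, 1), (1, 1), (1, 0), (0, 1), (0, 0), (1, 1), (0, 1), (1, 0),
]

config_and_outcome = sum(1 for c, o in cases if c == 1 and o == 1)
config_total = sum(1 for c, o in cases if c == 1)
outcome_total = sum(1 for c, o in cases if o == 1)

consistency = config_and_outcome / config_total   # how reliably the configuration predicts the outcome
coverage = config_and_outcome / outcome_total     # how much of the outcome the configuration accounts for

print(f"consistency = {consistency:.2f}, coverage = {coverage:.2f}")
```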

Re "none of the paths in our QCA is redundant". The basis for my claim here was some simple colour coding of each case according to which QCA configuration applied to them. Looking back at the Excel file, it appears to me that cases 14 and 16 were covered by two configurations and cases 16 and 32 by another two configurations. BUT bear in mind this was done with the binary (crisp) data, not the fuzzy-valued data. (The two configurations that did not seem to cover unique cases were quanqca*sensit*parti_2 and qualqca*quanqca*sensit*compevi_3.) The important point here is not that redundancy is "bad", but that where it is found it can prompt us to think about how to investigate such cases if and when they arise (including when two different models provide alternative configurations for the same cases).

Re "4. The decision tree consistency measure is less rigorous than in QCA". I am not sure that this matters in the case of the comparison at hand, but it may matter when other comparisons are made. I say this because, on the measures given on slide 13, the QCA model actually seems to perform better than the Decision Tree model. BUT again, a possibly confounding factor is the use of crisp versus fuzzy values behind the two measures. There is nevertheless a positive message here, which is to look carefully into how the consistency measures are calculated for any two models being compared. On a wider note, there is an extensive array of performance measures for Decision Tree (aka classification) models that can be summarised in a structure known as a Confusion Matrix. Here is a good summary of these: http://www.saedsayad.com/model_evaluation_c.htm
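As an illustration, here is a minimal sketch of a confusion matrix and a few of the measures that can be derived from it, using hypothetical predictions from a classification model. The scikit-learn library is one of many tools that will compute these; the example data are invented.

```python
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

# Hypothetical observed outcomes (1 = effective evaluation) and model predictions
observed  = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Rows = observed classes, columns = predicted classes
print(confusion_matrix(observed, predicted))

print("accuracy :", accuracy_score(observed, predicted))    # overall proportion correct
print("precision:", precision_score(observed, predicted))   # of cases predicted 1, how many were 1
print("recall   :", recall_score(observed, predicted))      # of cases that were 1, how many were predicted 1
```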

Moving on, I am pleased that Michaela and Wolfgang have taken this extra step: "Intrigued by the idea of 'triangulating' QCA results with decision tree analysis, we have converted our QCA dataset into a binary format (as Rick did, see point 1 above) and conducted a csQCA with that data". Their results show that the QCA model does better in three of four comparisons (twice on consistency levels and once on number of configurations). However, we differ in how to measure the performance of the Decision Tree model. Their count of configurations seems to involve double counting (4+4 for both types of outcome), whereas I count 3 and 2, reflecting the total of 5 that exist in the tree. On this basis I see the Decision Tree model doing better on parsimony for both types of outcome, but the QCA model doing better on consistency for both types of outcome.

What would be really interesting to explore, now that we have two more comparable models, is how much overlap there is between the configurations found by the two analyses, and what the actual contents of those configurations are, i.e. the specific conditions involved. That is what will probably be of most interest to the donor (DFID) who funded the EVAW work. The findings could have operational consequences.

In addition to exploring the concrete differences between models based on the same data, I think one other area that will be interesting to explore is how often the best levels of parsimony and accuracy can be found in a single model, versus one only being available at the cost of the other. I suspect QCA may privilege consistency whereas Decision Tree algorithms might not do so, but this may simply reflect the analysis settings used for a particular analysis. This question has some wider relevance, since some parties might want to prioritise accuracy whereas others might want to prioritise parsimony. For example, a stock market investor could do well with a model that has 55% accuracy, whereas a surgeon might need 98%. Others might want to optimise both.

And a final word of thanks is appropriate, to Michaela and Wolfgang for making their data set publicly available for others to analyse. This is all too rare an event, but hopefully one that will become more common in the future, encouraged by donors and modeled by examples such as theirs.


Monday, April 20, 2015

In defense of the (careful) use of algorithms and the need for dialogue between tacit (expertise) and explicit (rules) forms of knowledge



This blog posting is a response to the following paper now available online
Greenhalgh, T., Howick, J., Maskrey, N., 2014. Evidence based medicine: a movement in crisis? BMJ 348, http://www.bmj.com/content/348/bmj.g3725
Background: Chris Roche passed this very interesting paper on to me, received via "Kate", who posted a comment on Chris's posting "What has cancer taught me about the links between medicine and development?", which can be found on Duncan Green's "From Poverty to Power" blog.

The paper is interesting in the first instance because both the debate and practice around evidence-based policy and practice seem to be much further ahead in the field of medicine than in the field of development aid (...broad generalisation that this is...).

It is also of interest to reflect on the problems and solutions copied below and to think how many of these kinds of issues can also be seen in development aid programs.

 According to the paper, the problems with the current version of evidence based medicine include:

  1. Distortion of the evidence based brand: "The first problem is that the evidence based 'quality mark' has been misappropriated and distorted by vested interests. In particular, the drug and medical devices industries increasingly set the research agenda. They define what counts as disease ... They also decide which tests and treatments will be compared in empirical studies and choose (often surrogate) outcome measures for establishing 'efficacy'."
  2. Too much evidence: "The second aspect of evidence based medicine's crisis (and yet, ironically, also a measure of its success) is the sheer volume of evidence available. In particular, the number of clinical guidelines is now both unmanageable and unfathomable. One 2005 audit of a 24 hour medical take in an acute hospital, for example, included 18 patients with 44 diagnoses and identified 3679 pages of national guidelines (an estimated 122 hours of reading) relevant to their immediate care."
  3. Marginal gains and a shift from disease to risk: "Large trials designed to achieve marginal gains in a near saturated therapeutic field typically overestimate potential benefits (because trial samples are unrepresentative and, if the trial is overpowered, effects may be statistically but not clinically significant) and underestimate harms (because adverse events tend to be under detected or under reported)."
  4. Overemphasis on following algorithmic rules: "Well intentioned efforts to automate use of evidence through computerised decision support systems, structured templates, and point of care prompts can crowd out the local, individualised, and patient initiated elements of the clinical consultation."
  5. Poor fit for multi-morbidity: "Multi-morbidity (a single condition only in name) affects every person differently and seems to defy efforts to produce or apply objective scores, metrics, interventions, or guidelines."
The paper's proposed solutions or ways forward include:
  1. Individualised for the patient: Real evidence based medicine has the care of individual patients as its top priority, asking, “what is the best course of action for this patient, in these circumstances, at this point in their illness or condition?” It consciously and reflexively refuses to let process (doing tests, prescribing medicines) dominate outcomes (the agreed goal of management in an individual case). 
  2. Judgment not rules. Real evidence based medicine is not bound by rules.  
  3. Aligned with professional, relationship based care.  Research evidence may still be key to making the right decision—but it does not determine that decision. Clinicians may provide information, but they are also trained to make ethical and technical judgments, and they hold a socially recognised role to care, comfort, and bear witness to suffering.
  4. Public health dimension . Although we have focused on individual clinical care, there is also an important evidence base relating to population level interventions aimed at improving public health (such as pricing and labelling of consumables, fluoridation of water, and sex education). These are often complex, multifaceted programmes with important ethical and practical dimensions, but the same principles apply as in clinical care. 
  5. Delivering real evidence based medicine. To deliver real evidence based medicine, the movement’s stakeholders must be proactive and persistent. Patients (for whose care the movement exists) must demand better evidence, better presented, better explained, and applied in a more personalised way with sensitivity to context and individual goals.
  6. Training must be reoriented from rule following. Critical appraisal skills—including basic numeracy, electronic database searching, and the ability systematically to ask questions of a research study—are prerequisites for competence in evidence based medicine. But clinicians need to be able to apply them to real case examples.
  7. Evidence must be usable as well as robust. Another precondition for real evidence based medicine is that those who produce and summarise research evidence must attend more closely to the needs of those who might use it
  8. Publishers must raise the bar. This raises an imperative for publishing standards. Just as journal editors shifted the expression of probability from potentially misleading P values to more meaningful confidence intervals by requiring them in publication standards, so they should now raise the bar for authors to improve the usability of evidence, and especially to require that research findings are presented in a way that informs individualised conversations.
  9. ...and more
While many of these complaints and claims make a lot of sense, I think there is also a risk of "throwing the baby out with the bathwater" if care is not taken with some of them. I will focus on a couple of ideas that run through the paper.

The risk lies in seeing two alternative modes of practice as exclusive choices. One is rule based, focused on average effects when trying to meet common needs in populations; the other is expertise based, focused on the specific and often unique needs of individuals. Parallels could be drawn between different types of aid programs, e.g. centrally planned and nationally rolled out services meeting basic needs like water supply or education, versus much more person-centred participatory rural development programs.

Alternatively, one can see these two approaches as having complementary roles that can help and enrich each other. The authors describe one theory of learning which probably applies in many fields, including medicine: "...beginning with the novice who learns the basic rules and applies them mechanically with no attention to context. The next two stages involve increasing depth of knowledge and sensitivity to context when applying rules. In the fourth and fifth stages, rule following gives way to expert judgments, characterised by rapid, intuitive reasoning informed by imagination, common sense, and judiciously selected research evidence and other rules". During this process a lot of explicit knowledge becomes tacit, and almost automated, with conscious attention left for the more case-specific features of a situation. It is an economic use of human cognitive powers. Michael Polanyi wrote about this process years ago (1966, The Tacit Dimension).

The other side of this process is when tacit knowledge gets converted into explicit knowledge. That's what some anthropologists and ethnographers do. They seek to get into the inner world of their subjects and to make it accessible to others. One practitioner whose work interests me in particular is Christina Gladwin, who wrote a book on Ethnographic Decision Trees in 1989. This was all about eliciting how people, like small farmers in West Africa, made decisions about what crops to plant. The result was a decision tree model that summarised all the key choices farmers could make, and the final outcomes those different choices would lead to. This was not a model of how they actually thought, but a model of how different combinations of choices were associated with different outcomes of interest. These decision trees are not so far removed from those used in medical practice today.

A new farmer coming into the same location could arguably make use of such a decision tree to decide what crops to plant. Alternatively, they could work with one of the farmers for a number of seasons, which might then cover all the eventualities in the decision tree, and learn from that direct experience. But this would take much more time. In this type of setting, explicit rule-based knowledge is an easier and quicker means of transferring knowledge between people. Rule-based knowledge that can be quickly and reliably communicated is also testable knowledge: following the same pattern of rules may or may not always lead to the expected outcome in another context.

And now a word about algorithms. An algorithm is a clearly defined sequence of steps that will lead to a desired end, sometimes involving iteration until that end state gets closer. A sequence of choices in a decision tree is an algorithm: at each choice point the answer dictates what choice is made next. These are the rules mentioned in the paper above. There are also algorithms for constructing such algorithms. On this blog I have made a number of postings about QCA and (automated) Decision Tree models, both of which are means of constructing testable causal models. Both involve computerised processes for finding rules that best predict outcomes of interest. I think they have a lot of potential in the field of development aid.
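To make the point concrete, here is a minimal sketch of a decision tree expressed as an algorithm, a sequence of choice points each of which dictates the next step. The conditions and crop names are invented for illustration; they are not taken from Gladwin's actual models.

```python
def recommend_crop(soil_is_wet: bool, has_market_access: bool, labour_is_scarce: bool) -> str:
    """A toy ethnographic-style decision tree: each question narrows the choice."""
    if soil_is_wet:
        if has_market_access:
            return "rice for sale"
        return "rice for home consumption"
    if labour_is_scarce:
        return "cassava"  # low-labour fallback crop
    return "maize"

# Example: a dry plot, no market access, plenty of household labour
print(recommend_crop(soil_is_wet=False, has_market_access=False, labour_is_scarce=False))
```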

But returning to the problems of evidence based medicine, it is very important to note that algorithms are means of achieving specific goals. Deciding which goals need to be pursued remains a very human choice. Even within the use of both QCA and (automated) Decision tree modeling users have to decide the extent to which they want to focus on finding rules that are very accurate or those which are less accurate but which apply to a wider range of circumstances (usually simple rather than complex rules).

So, in summary, in any move towards evidence based practice, we need to ensure that tacit and explicit forms of knowledge build upon each other rather than getting separated as different and competing forms of knowledge. And while we should develop, test and use good algorithms, we should remember they are always means to an end, and we remain responsible for choosing the ends we are trying to achieve.

Postscript 2015 05 04: Please also read this recent cautionary analysis of the use of algorithms for the purposes of public policy implementation. The author points out that algorithms can embody and perpetuate cultural biases. How is that possible? It is possible because all evidence-based algorithms are developed using historical data, i.e. data sets of what has happened in the past. Those data sets, e.g. of arrest and conviction data in a given city, reflect historical practice by human institutions in that city, with all their biases, conscious and not so conscious. They don't reflect ideal practice, simply the actual practice at the time. Where an algorithm is not based on analysis of historical data, it may have its origins in a more ethnographic study of the practice of human experts in the domain of interest. Their practice, and their interpretations of their practice, are equally subject to cultural biases. The analysis by Virginia Eubanks includes four useful suggestions to counter these risks, one of which is that "We need to learn more about how policy algorithms work" by demanding more transparency about the design of a given algorithm and its decisions. But this may not be possible, or in some cases publicly desirable. One alternative method of interest is the algorithmic audit.

Thursday, April 19, 2012

Data mining algorithms as evaluation tools


For years now I have been in favour of theory-led evaluation approaches. Many of the previous postings on this website are evidence of this. But this post is about something quite different: a particular form of data mining, how to do it, and how it might be useful. Some have argued that data mining is radically different from hypothesis-led research (or evaluation, for that matter). Others have argued that there are some important continuities and complementarities (Yu, 2007).

Recently I have started reading about different data mining algorithms, especially what are called classification trees and genetic algorithms (GAs). The latter were the subject of my recent post, about whether we could evolve models of development projects as well as design them. Genetic algorithms are software embodiments of the evolutionary algorithm (i.e. iterated variation, selection, retention) at work in the biological world. They are good for exploring large possibility spaces and for coming up with new solutions that may not be close to current practice.

I had wondered if this idea could be connected to the use of Qualitative Comparative Analysis (QCA), a method of identifying configurations of attributes (e.g. of development projects) associated with a particular type of outcome (e.g. reduced household poverty). QCA is a theory-led approach, which uses very basic forms of data about attributes (i.e. categorical), describes configurations of these attributes using Boolean logic expressions, and analyses these with the help of software that can compare and manipulate such statements. The aim is to come up with a minimal number of simple "IF…THEN" type statements describing what sorts of conditions are associated with particular outcomes. This is potentially very useful for development aid managers, who are often asking "what works where, in what circumstances?" (But before then there is the challenge of getting on top of the technical language required to be able to do QCA.)
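As a rough illustration of the underlying idea (not of any particular QCA software), here is a minimal sketch that builds a truth table from crisp-set data: cases are grouped by their configuration of attributes, and each configuration is checked for how consistently it is followed by the outcome. The attribute names and data are invented.

```python
from collections import defaultdict

# Hypothetical crisp-set data: each case has three conditions (A, B, C) and an outcome
cases = [
    {"A": 1, "B": 0, "C": 1, "outcome": 1},
    {"A": 1, "B": 0, "C": 1, "outcome": 1},
    {"A": 0, "B": 1, "C": 1, "outcome": 0},
    {"A": 1, "B": 1, "C": 0, "outcome": 1},
    {"A": 0, "B": 0, "C": 0, "outcome": 0},
    {"A": 1, "B": 1, "C": 0, "outcome": 0},
]

# Group cases by configuration and count how often the outcome is present
truth_table = defaultdict(lambda: [0, 0])   # configuration -> [n_cases, n_with_outcome]
for case in cases:
    config = (case["A"], case["B"], case["C"])
    truth_table[config][0] += 1
    truth_table[config][1] += case["outcome"]

for config, (n, n_out) in sorted(truth_table.items()):
    print(f"A={config[0]} B={config[1]} C={config[2]}  cases={n}  consistency={n_out / n:.2f}")
```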

My initial thought was whether genetic algorithms could be used to evolve and test statements describing different configurations, as distinct from constructing them one by one on the basis of a current theory. This might lead to quicker resolution, and perhaps to discoveries that had not been suggested by current theory.

As described in my previous post, there is already a simple GA built into Excel, known as Solver. What I could not work out was how to represent logic elements like AND, NOT, OR in such a way that Solver could vary them to create different statements representing different configurations of existing attributes. In the process of trying to sort out this problem I discovered that there is a whole literature on GAs and rule discovery (rules as in IF-THEN statements). Around the same time, a technical adviser from FrontlineSolver suggested I try a different approach to the automated search for association rules. This involved the use of Classification Trees, a tool which has the merit of producing results that are readable by ordinary mortals, unlike the results of some other data mining methods.

An example!

This Excel file contains a small data set, which has previously been used for QCA analysis. It contains 36 cases, each with 4 attributes and 1 outcome of interest. The cases relate to different ethnic minorities in countries across Europe and the extent to which there has been ethnic political mobilisation in their countries (being the outcome of interest). Both the attributes and outcomes are coded as either 0 or 1 meaning absent or present. 

With each case having up to four different attributes, there could be 16 different combinations of attributes. A classification algorithm in the XLMiner software (and others like it) is able to automatically sort through these possibilities to find the simplest classification tree that can correctly point to where the different outcomes take place. XLMiner produced the following classification tree, which I have annotated and will walk through below.
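For readers who prefer a scriptable route, here is a minimal sketch of the same kind of analysis using Python's scikit-learn rather than XLMiner. The column names echo the attributes discussed below, but the data rows are invented placeholders, not the actual 36-case data set.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Placeholder data in the same 0/1 format as the Excel file (not the real 36 cases)
data = pd.DataFrame({
    "large":   [1, 1, 0, 0, 1, 0, 1, 0],
    "growing": [1, 0, 0, 1, 1, 0, 0, 0],
    "wealthy": [0, 1, 1, 0, 1, 1, 0, 0],
    "writes_own_language": [0, 1, 1, 0, 1, 0, 1, 0],
    "mobilisation": [1, 1, 1, 0, 1, 0, 0, 0],   # outcome of interest
})

features = ["large", "growing", "wealthy", "writes_own_language"]
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(data[features], data["mobilisation"])

# Print the fitted tree as a set of readable IF-THEN splits
print(export_text(tree, feature_names=features))
```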



We start at the top with the attribute "large", referring to the size of the linguistic subnation within its own country. Those that are large have then been divided according to whether their subnational region is "growing" or not. Those that are not growing have then been divided into those that are a relatively "wealthy" group within their nation and those that are not. The smaller linguistic subnations have also been divided into those that are a relatively wealthy group within their nation and those that are not, and those that are relatively wealthy are then divided according to whether the subnational group speaks and writes in its own language or not. The square nodes at the end of each "branch" indicate the outcome associated with these combinations of conditions: where there has been ethnic political mobilisation (1) or not (0). Under each square node are the ethnic groups placed in that category. These results fit with the original data in Excel (right column).

This is my summary of the rules described by the classification tree (a minimal code sketch of these rules follows the list):
  • IF a linguistic subnation’s economy is large AND growing THEN ethnic political mobilisation will be present [14 of 19 positive cases]
  • IF a linguistic subnation’s economy is large, NOT growing AND is relatively wealthy THEN ethnic political mobilisation will be present [2 of 19 positive cases]
  • IF a linguistic subnation’s economy is NOT large AND is relatively wealthy AND speaks and writes in its own language THEN ethnic political mobilisation will be present [3 of 19 positive cases]
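The three rules above can also be written directly as code. This is just a restatement of the tree's "yes" branches, a sketch with condition names chosen by me; any case not matching one of the three configurations falls through to a prediction of no mobilisation.

```python
def predicts_mobilisation(large: bool, growing: bool, wealthy: bool, writes_own_language: bool) -> bool:
    """Restates the three 'yes' branches of the classification tree as IF-THEN rules."""
    if large and growing:
        return True
    if large and not growing and wealthy:
        return True
    if not large and wealthy and writes_own_language:
        return True
    return False

# Example: a small, relatively wealthy subnation that writes in its own language
print(predicts_mobilisation(large=False, growing=False, wealthy=True, writes_own_language=True))  # True
```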
Both QCA and classification trees have procedures for simplifying the association rules that are found. With classification trees there is an automated “pruning” option that removes redundant parts. My impression is that there are no redundant parts in the above tree, but I may be wrong.
These rules are, in realist evaluation terminology, describing three different configurations of possible causal processes. I say "possible" because what we have above are associations. Like correlation coefficients, they don't necessarily mean causation. However, they are at least candidate configurations of causal processes at work.

The origins of this data set and its coding are described in pages 137-149 of The Comparative Method: Moving Beyond Qualitative and Quantitative Strategies by Charles C. Ragin, viewable on Google Books. Also discussed there is the QCA analysis of this data set and its implications for different theories of ethnic political mobilisation. My thanks to Charles Ragin for making the data set available.

I think this type of analysis, by both QCA and classification tree algorithms, has considerable potential use in the evaluation of development programs. Because it uses nominal data, the range of data sources that can be drawn on is much wider than for statistical methods that need ratio or interval scale data. Nominal data can either be derived from pre-existing, more sophisticated data (by using cut-off points to create categories) or be collected as primary data, including by participatory processes such as card/pile sorting and ranking exercises. The results, in the form of IF…THEN rules, should be of practical use, if only in the first instance as a source of hypotheses needing further testing by more detailed inquiries.

There are some fields of development work where large amounts of potentially useful, but rarely used, data are generated on a continuing basis, such as microfinance services and, to a lesser extent, health and education services. Much of the demand for data mining capacity has come from industries that find themselves flooded with data but lack the means to exploit it. This may well be the case with more development agencies in the future, as they make more use of interactive websites, mobile phone data collection methods and the like.

For those who are interested, there is a range of software worth exploring in addition to the package I have mentioned above. See these lists: A and B. I have a particular interest in GATree, which uses a genetic algorithm to evolve the best fitting classification tree and to avoid the problem of being stuck in a "local optimum". There is also another type of algorithm with the delightful name of Random Forests, which uses the "wisdom of crowds" principle to find the best fitting classification tree. But note the caveat: "Unlike decision trees, the classifications made by Random Forests are difficult for humans to interpret". These and other algorithms are in use by participants in the Kaggle competitions online, which themselves could be considered as a kind of semi-automated meta-algorithm (i.e. an algorithm for finding useful algorithms). Lots to explore!
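For completeness, here is a minimal sketch of fitting a Random Forest with scikit-learn on the same kind of 0/1 data as the earlier tree example (again with invented placeholder rows). It illustrates the interpretability trade-off mentioned above: you get variable importances rather than a single readable tree.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Placeholder 0/1 data (not the real data set)
data = pd.DataFrame({
    "large":   [1, 1, 0, 0, 1, 0, 1, 0],
    "growing": [1, 0, 0, 1, 1, 0, 0, 0],
    "wealthy": [0, 1, 1, 0, 1, 1, 0, 0],
    "writes_own_language": [0, 1, 1, 0, 1, 0, 1, 0],
    "mobilisation": [1, 1, 1, 0, 1, 0, 0, 0],
})

features = ["large", "growing", "wealthy", "writes_own_language"]
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(data[features], data["mobilisation"])

# No single tree to read, but we can see which attributes mattered most on average
for name, importance in zip(features, forest.feature_importances_):
    print(f"{name}: {importance:.2f}")
```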

PS: I have just found and tested another package, called XLSTAT, that also generates classification trees. Here is a graphic showing the same result as found above, but in more detail. (Click on the image to enlarge it)

PS 29 April 2012: In QCA, distinctions are made between a condition being "necessary" and/or "sufficient" for the outcome to occur. In the simplest setting a single condition can be a necessary and sufficient cause. In more complex settings a single condition may be a necessary part of a configuration of conditions which is itself sufficient but not necessary. For example, a "growing" economy in the right branch of the first tree above. In classification trees the presence/absence of necessary/sufficient conditions can easily be observed. If a condition appears in all "yes" branches of the tree (= different configurations) then it is "necessary". If a condition appears along with another in a given "yes" branch of a tree then it is not "sufficient". "Wealthy" is a condition that appears necessary but not sufficient. See more on this distinction in a more recent post: Representing different combinations of causal conditions
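These two checks can be mechanised. The sketch below inspects a set of "yes" branches, each represented as the set of conditions on that path, and reports which conditions are necessary (present in every branch) and which are never sufficient alone (they only ever appear alongside other conditions). The branch definitions are illustrative, not a re-derivation of the tree above.

```python
# Each "yes" branch of a tree, written as the set of conditions along that path (illustrative)
yes_branches = [
    {"large", "growing"},
    {"large", "not_growing", "wealthy"},
    {"not_large", "wealthy", "writes_own_language"},
]

all_conditions = set().union(*yes_branches)

# Necessary: appears in every branch that leads to the outcome
necessary = [c for c in all_conditions if all(c in branch for branch in yes_branches)]

# Not sufficient on its own: in every branch it belongs to, it appears alongside other conditions
not_sufficient = [c for c in all_conditions
                  if all(len(branch) > 1 for branch in yes_branches if c in branch)]

print("necessary conditions:", necessary or "none")
print("conditions that are never sufficient alone:", sorted(not_sufficient))
```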

PS 4 May 2012: I have just discovered there is what looks like a very good open source data mining package called RapidMiner, which comes with a whole stack of training videos, and a big support and development community


PS 29 May 2012: Pertinent comment from Dilbert 

PS 3 June 2012: Prediction versus explanation: I have recently found a number of web pages on the issue of prediction versus explanation. Data mining methods can deliver good predictions. However, information relevant to good predictions does not always provide good explanations, e.g. smoking may be predictive of teenage pregnancy but it is not a cause of it (see the interesting exercise here). So is data mining a waste of time for evaluators? On reflection it occurred to me that it depends on the circumstances and how the results of any analysis are to be used. In some circumstances the next step may be to choose between existing alternatives, for example, which organisation or project to fund. Here good predictive knowledge, based on data about past performance, would be valuable. In other circumstances a new intervention may need to be designed from the ground up. Here access to some explanatory knowledge about possible causal relationships would be especially relevant. On further reflection, even where a new intervention has to be designed, it is likely that it will involve choices between various modules (e.g. kinds of staff, kinds of activities) where knowledge of their past performance record is very relevant. But so would be a theory about their likely interactions.

At the risk of being too abstract, it would seem that a two-way relationship is needed: proposed explanations need to be followed by tested predictions, and successful predictions need to be followed by verified explanations.











Tuesday, April 16, 2013

Another perspective on the uses of control groups


I have been reading Eric Siegel's book on Predictive Analytics. Though it is a "pop science" account, with the usual "this will change the world" subtitle, it is definitely a worthwhile read.

In chapter 7 he talks the reader through what are called "uplift models": Decision Tree models that can not only differentiate groups who respond differently to an intervention, but also show how differently they respond when compared to a control group where there is no intervention. All this is in the context of companies marketing their products to the population at large, not the world of development aid organisations.

(Temporarily putting aside the idea of uplift models...) In this chapter he happens to use the matrix below, to illustrate the different possible sets of consumers that exist, given two types of scenarios that can be found where both a control and intervention group are being used.
But what happens if we re-label the matrix, using more development project type language? Here is my revised version below:
 

Looking at this new matrix, it struck me that evaluators of development projects may have a relatively impoverished view of the potential uses of control groups. Normally the focus is on the net difference in improvement between households in the control and intervention groups: how big is it, and is it statistically significant? In other words, how many of those in the intervention group were really "self-help'ers" who would have improved anyway, versus "need-help'ers" who would not have improved without the intervention.

But this leaves aside two other sets of households who also surely deserve at least equal attention. One is the "hard cases", those that did not improve in either setting, possibly the poorest of the poor. How often are their numbers identified with the same attention to statistical detail? The other is the "confused", who improved in the control group but not in the intervention group. Perhaps these are the ones we should really worry about, or at least be able to enumerate. Evaluators are often asked, in their ToRs, to also give attention to negative project impacts, but how often do we systematically look for such evidence?

Okay, but how will we recognise these groups? One way is to look at the distribution of cases that are possible. Each group can be characterised by how cases are distributed in the control and intervention group, as shown below. The first group (in green) are probably "self-help'ers" because the same proportion also improved in the control group. The second group are more likely to be "need-help'ers" because fewer people improved in the control group. The third group are likely to be the "confused" because more of them did not improve in the intervention group than in the control group. The fourth group are likely to be the "hard cases" if the same high proportion did not improve in the control group either.
At an aggregate level only one of the four outcome combinations shown above can be observed at any one time. This is the kind of distribution I found in the data set collected during a 2012 impact assessment of a rural livelihoods project in India. Here the overall distribution suggests that the “need-helpers” have benefited. 

How do we find if and where the other groups are? One way of doing this is to split the total population into sub-groups, using one household attribute at a time, to see what difference it makes to the distribution of results. For example, I thought that households' wealth ranking might be associated with differences in outcomes. So I examined the distribution of outcomes for the poorest and the least poor of the four wealth-ranked groups. In the poorest group, those who benefited were the "need-help'ers", but in the "well-off" group those who benefited were the "self-help'ers", perhaps as expected.






There are still the two other kinds of outcomes that might exist in some sub-groups – the "hard cases" and the "confused". How can we find where they are? At this point my theory-directed search fails me. I have no idea where to look for them. There are too many household attributes in the data set to consider manually examining how different their particular distribution of outcomes is from the aggregate distribution.

This is the territory where an automated algorithm would be useful. Using one attribute at a time, it would split the main population into two sub-groups and search for the attribute that made the biggest difference. The difference to look for would be extremity of range, as measured by the Standard Deviation. The reason for this approach is that the most extreme range would be where one cell in the control group was 100% and the other was 0%, and similarly in the intervention group. These would be pure examples of the four types of outcome distribution shown above. [Note that in the two wealth-ranked sub-groups above, the Standard Deviation of the distributions was 14% and 15%, versus 7% in the whole group.]
This is the same sort of work that a Decision Tree algorithm does, except that Decision Trees usually search for binary outcomes and use different "splitting" criteria. I am not sure if they can use the Standard Deviation, or another measure which would deliver the same results (i.e. identify the four possible types of outcome).
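Here is a minimal sketch of the search described above, under my own assumptions about the data layout: each household record carries 0/1 attributes, a group label (control or intervention) and a 0/1 improvement flag. For each attribute-based split, the four cell percentages (improved / not improved, in control and intervention) are computed, and the sub-group with the most extreme spread (highest standard deviation) is reported. Attribute names and data rows are invented.

```python
import statistics

# Hypothetical household records: 0/1 attributes, group membership and outcome
households = [
    {"poorest": 1, "female_headed": 0, "group": "intervention", "improved": 1},
    {"poorest": 1, "female_headed": 1, "group": "control",      "improved": 0},
    {"poorest": 0, "female_headed": 0, "group": "intervention", "improved": 1},
    {"poorest": 0, "female_headed": 1, "group": "control",      "improved": 1},
    {"poorest": 1, "female_headed": 0, "group": "intervention", "improved": 0},
    {"poorest": 0, "female_headed": 1, "group": "control",      "improved": 1},
    {"poorest": 1, "female_headed": 1, "group": "intervention", "improved": 1},
    {"poorest": 0, "female_headed": 0, "group": "control",      "improved": 0},
]

def cell_percentages(records):
    """Return the four cell percentages: % improved and % not improved, in control and intervention."""
    cells = []
    for group in ("control", "intervention"):
        members = [r for r in records if r["group"] == group]
        improved = sum(r["improved"] for r in members)
        cells += [100 * improved / len(members), 100 * (len(members) - improved) / len(members)]
    return cells

# Search one attribute at a time for the sub-group with the most extreme distribution
best = None
for attribute in ("poorest", "female_headed"):
    for value in (0, 1):
        subgroup = [r for r in households if r[attribute] == value]
        spread = statistics.pstdev(cell_percentages(subgroup))
        if best is None or spread > best[0]:
            best = (spread, attribute, value)

print(f"most extreme sub-group: {best[1]}={best[2]} (SD of cell percentages = {best[0]:.1f}%)")
```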


Friday, March 28, 2014

The challenges of using QCA



This blog posting is a response to my reading of the Inception Report written by the team who are undertaking a review of evaluations of interventions relating to violence against women and girls. The process of the review is well documented in a dedicated blog – EVAW Review

The Inception Report is well worth reading, which is not something I say about many evaluation reports! One reason is to benefit from the amount of careful attention the authors have given to the nuts and bolts of the process. Another is to see the kind of intensive questioning the process has been subjected to by the external quality assurance agents and the considered responses by the evaluation team. I found that many of the questions that came to my mind while reading the main text of the report were dealt with when I read the annex containing the issues raised by SEQUAS and the team’s responses to them.

I will focus on one issue that is a challenge for both QCA and data mining methods like Decision Trees (which I have discussed elsewhere on this blog): the ratio of conditions to cases. In QCA, conditions are attributes of the cases under examination that are provisionally considered as possible parts of causal configurations that explain at least some of the outcomes. After an exhaustive search and selection process the team has ended up with a set of 39 evaluations they will use as cases in a QCA analysis. After a close reading of these and other sources they have come up with a list of 20 conditions that might contribute to 5 different outcomes. With 20 different conditions there are 2^20 (i.e. 1,048,576) different possible configurations that could explain some or all of the outcomes. But there are only 39 evaluations, which at best will represent only 0.004% of the possible configurations. In QCA the remaining 1,048,537 are known as "logical remainders". Some of these can usually be used in a QCA analysis through a process of making explicit assumptions, e.g. about particular configurations plus outcomes which by definition would be impossible in real life. However, from what I understand of QCA practice, logical remainders would not usually exceed 50% of all possible configurations.

The review team has dealt with this problem by summarising the 20 conditions and 5 outcomes into 5 conditions and one outcome. This means there are 2^5 (i.e. 32) possible causal configurations, which is more reasonable considering there are 39 cases available to analyse. However, there is a price to be paid for this solution, which is the increased level of abstraction/generality in the terms used to describe the conditions. This makes the task of coding the known cases more challenging, and it will make the task of interpreting the results and then generalising from them more challenging as well. You can see the two versions of their model in the diagram below, taken from their report.
 
What fascinated me was the role of evaluation method in this model (see "Convincing methodology"). It is only one of five conditions that could explain some or all of the outcomes. It is quite possible, therefore, that all or some of the case outcomes could be explained without the use of this condition. This is quite radical, considering the centrality of evaluation methodology in much of the literature on evaluations. It may also be worrying to DFID, in that one of their expectations of this review was that it would "generate a robust understanding of the strengths, weaknesses and appropriateness of evaluation approaches and methods". The other potential problem is that even if methodology is shown to be an important condition, its singular description does not provide any means of discriminating between forms which are more or less helpful.

The team seems to have responded to this problem by proposing additional QCA analyses, where there will be an additional condition that differentiates cases according to whether they used qualitative or quantitative methods. However, reviewers have still questioned whether this is sufficient. The team in return have commented that they will "add to the model further conditions that represent methodological choice after we have fully assessed the range of methodologies present in the set, to be able to differentiate between common methodological choices". It will be interesting to see how they go about doing this while avoiding the problem of "insufficient diversity" of cases already mentioned above.

One possible way forward has been illustrated in a recent CIFOR Working Paper (Sehring et al., 2013), and is also covered in Schneider and Wagemann (2012). They have illustrated how it is possible to do a "two-step QCA", which differentiates between remote and proximate conditions. In the VAWG review this could take the form of an analysis of conditions other than methodology first, then a second analysis focusing on a number of methodology conditions. This process essentially reduces a larger number of remote conditions down to a smaller number of configurations that do make a difference to outcomes, which are then included in the second level of the analysis using the more proximate conditions. It has the effect of reducing the number of logical remainders. It will be interesting to see if this is the direction that the VAWG review team are heading in.

PS 2014 03 30: I have found some further references to two-level QCA:
 And for people wanting a good introduction to QCA, see