For years now I have been in favour of theory-led evaluation
approaches. Many of the previous postings on this website are evidence of this.
But this post is about something quite different, about a particular form of data mining, how to do it
and how it might be useful. Some have argued that data mining is radically different from hypothesis-led research (or evaluation, for that matter). Others have argued that there are some important continuities and complementarities (Yu, 2007).
Recently I have started reading about different data mining algorithms, especially the
use of what are called classification trees and genetic algorithms (GAs). The
latter was the subject of my recent
post, about whether we could evolve models of development projects as well
as design them. Genetic algorithms are software embodiments of the evolutionary
algorithm (i.e. iterated variation, selection, retention) at work in the
biological world. They are good for exploring large possibility spaces and for
coming up with new solutions that may not be close to current practice.
I had wondered if this idea could be connected to the use of
Qualitative Comparative Analysis (QCA),
a method of identifying configurations of attributes (e.g. of development
projects) associated with a particular type of outcome (e.g. reduced household
poverty). QCA is a theory-led approach, which uses very basic forms of data
about attributes (i.e. categorical), then describes configurations of these
attributes using Boolean logic expressions, and analyses these with the help of
software that can compare and manipulate these statements. The aim is to come
up with a minimal number of simple “IF…THEN” type statements describing what
sorts of conditions are associated with particular outcomes. This is
potentially very useful for development aid managers who are often asking about
“what works where, in what circumstances”. (Though first there is the
challenge of getting on top of the technical language required to do
QCA.)
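To make the idea concrete, here is a minimal Python sketch of the crisp-set coding QCA uses: cases coded 0/1 on each attribute, and a candidate configuration expressed as a Boolean test. The attribute names and case values are hypothetical illustrations, not data from any real study.

```python
# Minimal sketch of crisp-set (0/1) coding and a candidate configuration.
# Attribute names and values are hypothetical, for illustration only.
cases = [
    {"large": 1, "growing": 1, "wealthy": 0, "outcome": 1},
    {"large": 1, "growing": 0, "wealthy": 1, "outcome": 1},
    {"large": 0, "growing": 0, "wealthy": 0, "outcome": 0},
]

def rule(case):
    """A candidate configuration: IF large AND growing THEN outcome present."""
    return case["large"] == 1 and case["growing"] == 1

# Count how many cases the rule classifies in line with the coded outcome.
hits = sum(rule(c) == bool(c["outcome"]) for c in cases)
print(f"{hits} of {len(cases)} cases consistent with the rule")  # → 2 of 3
```

QCA software does essentially this across all candidate configurations, then minimises the set of rules using Boolean algebra.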
My initial thought
was whether genetic algorithms could be used to evolve and test statements
describing different configurations, as distinct from constructing them one by
one on the basis of a current theory. This might lead to quicker resolution,
and perhaps discoveries that had not been suggested by current theory.
As described in my previous post, there is already a simple
GA built into Excel, known as Solver.
What I could not work out was how to represent logic elements like AND, NOT, OR
in such a way that Solver could vary them to create different statements
representing different configurations of existing attributes. In the process of trying to sort out this
problem I discovered that there is a
whole literature on GAs and rule discovery
(rules as in IF-THEN statements). Around the same time, a technical adviser
from FrontlineSolver suggested I try a
different approach to the automated search for association rules. This involved
the use of Classification
Trees, a tool which has the merit of producing results which are readable
by ordinary mortals, unlike the results of some other data mining methods.
An example!
This
Excel file contains a small data set, which has previously been used for
QCA analysis. It contains 36 cases, each with 4 attributes and 1 outcome of
interest. The cases relate to different ethnic minorities in countries across Europe
and the extent to which there has been ethnic political mobilisation in their countries
(being the outcome of interest). Both the attributes and outcomes are coded as
either 0 or 1, meaning absent or present respectively.
With four attributes, each either present or absent, there are 16 (i.e. 2⁴)
possible combinations of attributes. A classification algorithm in XLMiner (and other software like it) is able to
automatically sort through these possibilities to find the simplest
classification tree that can correctly point to where the different outcomes take
place. XLMiner produced the
following classification tree, which I have annotated and will walk through
below.
We start at the top with the attribute “large”, referring to
the size of the linguistic subnation within its own country. Those that are
large have then been divided according to whether their subnational region is
“growing” or not. Those that are not growing have then been divided into those
that form a relatively “wealthy” group within their nation and those that do not. The
smaller linguistic subnations have also
been divided into those that form a relatively wealthy group within their nation
and those that do not, and those that are relatively wealthy are then divided
according to whether their subnational region speaks and writes in its own language or
not. The square nodes at the end of each “branch” indicate the outcome
associated with these combinations of conditions - where there has been ethnic
political mobilisation (1) or not (0). Under each square node are the ethnic
groups placed in that category. These results fit with the original data in
Excel (right column).
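For readers who want to experiment, the same kind of tree can be fitted with scikit-learn's DecisionTreeClassifier. The 0/1 rows below are made-up illustrations in the style of the coding described above, not the actual Ragin dataset, so the fitted tree is only a sketch of the method.

```python
# Fit a classification tree to made-up crisp-set data with scikit-learn.
# These rows imitate the 0/1 coding described in the post; they are NOT
# the actual Ragin dataset.
from sklearn.tree import DecisionTreeClassifier, export_text

features = ["large", "growing", "wealthy", "own_language"]
X = [
    [1, 1, 0, 0],  # large and growing
    [1, 1, 1, 1],
    [1, 0, 1, 0],  # large, not growing, wealthy
    [1, 0, 0, 0],
    [0, 0, 1, 1],  # not large, wealthy, own language
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
]
y = [1, 1, 1, 0, 1, 0, 0, 0]  # 1 = mobilisation present, 0 = absent

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
# export_text prints the tree as nested IF-THEN splits, readable by
# "ordinary mortals".
print(export_text(tree, feature_names=features))
```

The printed output is a text rendering of the same sort of branching diagram XLMiner produces.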
This is my summary of the rules described by the classification
tree:
- IF a linguistic subnation’s economy is large AND growing THEN ethnic political mobilisation will be present [14 of 19 positive cases]
- IF a linguistic subnation’s economy is large, NOT growing AND is relatively wealthy THEN ethnic political mobilisation will be present [2 of 19 positive cases]
- IF a linguistic subnation’s economy is NOT large AND is relatively wealthy AND speaks and writes in its own language THEN ethnic political mobilisation will be present [3 of 19 positive cases]
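The three rules can also be written out directly as Boolean tests, which makes explicit that they are alternatives: satisfying any one of them predicts the outcome. The short attribute names are my own shorthand for the conditions above.

```python
# The three tree-derived rules as explicit Boolean tests over a case's
# 0/1 attributes (short names are my shorthand for the conditions above).
def rule1(c):  # IF large AND growing
    return c["large"] == 1 and c["growing"] == 1

def rule2(c):  # IF large AND NOT growing AND wealthy
    return c["large"] == 1 and c["growing"] == 0 and c["wealthy"] == 1

def rule3(c):  # IF NOT large AND wealthy AND own language
    return c["large"] == 0 and c["wealthy"] == 1 and c["own_language"] == 1

def predicts_mobilisation(c):
    # The rules are alternatives: any one being satisfied predicts the outcome.
    return rule1(c) or rule2(c) or rule3(c)

example = {"large": 0, "growing": 0, "wealthy": 1, "own_language": 1}
print(predicts_mobilisation(example))  # satisfies rule3 → True
```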
Both QCA and classification trees have procedures for simplifying the association rules that are found. With classification trees there is an automated “pruning” option that removes redundant parts. My impression is that there are no redundant parts in the above tree, but I may be wrong.
These rules are, in realist evaluation terminology, describing three different configurations of possible causal processes. I say "possible" because what we have above are associations. Like correlation coefficients, they do not necessarily mean causation. However, they are at least candidate configurations of causal processes at work.
The origins of this data set and its coding are described in
pages 137-149 of The Comparative Method: Moving Beyond Qualitative and
Quantitative Strategies by Charles C. Ragin, viewable on Google
Books. Also discussed there is the QCA analysis of this data set and its
implications for different theories of ethnic political mobilisation. My thanks to Charles Ragin for making the data set available.
I think this type of analysis, by both QCA and
classification tree algorithms, has considerable potential use in the
evaluation of development programs. Because it uses nominal data, the range of
data sources that can be drawn on is much wider than for statistical methods that need
ratio or interval scale data. Nominal data can either be derived from pre-existing
more sophisticated data (by using cut-off points to create categories) or be
collected as primary data, including by participatory processes such as card/pile
sorting and ranking exercises. The results in the form of IF…THEN rules should be of practical use,
if only in the first instance as a source of hypotheses needing further testing
by more detailed inquiries.
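As a small illustration of deriving nominal codes from more sophisticated data, here is one way a "wealthy" 0/1 code might be created from interval-scale income figures using a cut-off. The figures and the median cut-off are arbitrary choices for illustration only.

```python
# Derive a 0/1 "wealthy" code from interval-scale income data via a cut-off.
# The figures and the (upper) median cut-off are arbitrary illustrations.
incomes = [120, 340, 95, 410, 280, 150]
cutoff = sorted(incomes)[len(incomes) // 2]  # upper median = 280
wealthy = [1 if x >= cutoff else 0 for x in incomes]
print(wealthy)  # → [0, 1, 0, 1, 1, 0]
```

Where the cut-off is placed is itself a substantive judgement, and one that can change which configurations emerge from the analysis.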
There are some fields of development work where large
amounts of potentially useful, but rarely used, data are generated on a
continuing basis, such as microfinance services and, to a lesser extent, health and
education services. Much of the demand for data mining capacity has come from
industries that are finding themselves flooded with data, but lack the means to
exploit it. This may well be the case with more development agencies in the
future, as they make more use of interactive websites and mobile phone data
collection methods and the like.
For those who are interested, there is a range of software
worth exploring in addition to the package I have mentioned above. See these
lists: A
and B. I have a particular interest in GATree, which uses a genetic
algorithm to evolve the best fitting classification tree, and so avoid the
problem of being stuck in a “local optimum”. There is
also another type of algorithm with the delightful name of Random Forests, which
uses the “wisdom of crowds” principle to find the best fitting classification
tree. But note the caveat: “Unlike decision trees, the
classifications made by Random Forests are difficult for humans to interpret”.
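For comparison, here is a minimal Random Forests sketch using scikit-learn, on made-up 0/1 data of the kind discussed in this post. It illustrates the caveat: the ensemble can predict and rank feature importances, but there is no single readable tree to inspect.

```python
# Minimal Random Forests sketch on made-up 0/1 data. The forest's vote
# can predict, but there is no single readable tree behind the prediction.
from sklearn.ensemble import RandomForestClassifier

X = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
]
y = [1, 1, 1, 0, 0, 0]

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# Feature importances give a rough ranking of conditions, but no IF-THEN rules.
print(forest.feature_importances_)
```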
These and other algorithms are in use by participants in the Kaggle competitions online,
which themselves could be considered as a kind of semi-automated meta-algorithm
(i.e. an algorithm for finding useful algorithms). Lots to explore!
PS: I have just found and tested another package, called XLSTAT, that also generates classification trees. Here is a graphic showing the same result as found above, but in more detail. (Click on the image to enlarge it)
PS 29 April 2012: In QCA distinctions are made between a condition being "necessary" and/or "sufficient" for the outcome to occur. In the simplest setting a single condition can be a necessary and sufficient cause. In more complex settings a single condition may be a necessary part of a configuration of conditions which is itself sufficient but not necessary - for example, a "growing" economy in the right branch of the first tree above. In classification trees the presence/absence of necessary/sufficient conditions can easily be observed. If a condition appears in all "yes" branches of the tree (= different configurations) then it is "necessary". If a condition appears along with another in a given "yes" branch of a tree then it is not "sufficient". "Wealthy" is a condition that appears necessary but not sufficient. See more on this distinction in a more recent post: Representing different combinations of causal conditions
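These necessary/sufficient checks are easy to mechanise. In this toy sketch each "yes" branch is represented as the set of conditions on the path to a positive leaf; the branch sets here are hypothetical, not those of the tree above.

```python
# Toy sketch of mechanising the necessary/sufficient checks. Each "yes"
# branch is the set of conditions on the path to a positive leaf; these
# branch sets are hypothetical, not the tree's actual ones.
yes_branches = [{"A", "B"}, {"A", "C"}]

def necessary(cond):
    # Necessary: the condition appears in every positive configuration.
    return all(cond in branch for branch in yes_branches)

def sufficient(cond):
    # Sufficient: some positive configuration consists of it alone.
    return any(branch == {cond} for branch in yes_branches)

print(necessary("A"), sufficient("A"))  # → True False
```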
PS 4 May 2012: I have just discovered what looks like a very good open source data mining package called RapidMiner, which comes with a whole stack of training videos and a big support and development community.
PS 29 May 2012: Pertinent comment from Dilbert
PS 3 June 2012: Prediction versus explanation: I have recently found a number of web pages on the issue of prediction versus explanation. Data mining methods can deliver good predictions. However, information relevant to good predictions does not always provide good explanations, e.g. smoking may be predictive of teenage pregnancy but it is not a cause of it (see interesting exercise here). So is data mining a waste of time for evaluators? On reflection it occurred to me that it depends on the circumstances and how the results of any analysis are to be used. In some circumstances the next step may be to choose between existing alternatives - for example, which organisation or project to fund. Here good predictive knowledge, based on data about past performance, would be valuable. In other circumstances a new intervention may need to be designed from the ground up. Here access to some explanatory knowledge about possible causal relationships would be especially relevant. On further reflection, even where a new intervention has to be designed it is likely that it will involve choices of various modules (e.g. kinds of staff, kinds of activities) where knowledge of their past performance record is very relevant. But so would be a theory about their likely interactions.
At the risk of being too abstract, it would seem that a two-way relationship is needed: proposed explanations need to be followed by tested predictions, and successful predictions need to be followed by verified explanations.