Friday, June 01, 2012

Representing different combinations of causal conditions


This week I attended a workshop on QCA (Qualitative Comparative Analysis). QCA is a useful approach to analysing possible causality in small-n situations, i.e. where there are not many cases to examine (e.g. villages or districts), and where perhaps only categorical data is available. Equally importantly, QCA enables the identification of different configurations of conditions associated with observed outcomes in a set of cases. In that respect it shares the ambitions of the school of Realist Evaluation (Pawson and Tilley). The downside is that QCA findings are expressed in Boolean logic, which is not exactly user friendly. For example, here is the result of one analysis:

 
Clue: in Boolean notation the symbol "+" means OR and the symbol "*" means AND. The letters in upper case refer to conditions present and the letters in lower case refer to conditions absent.
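
As a purely hypothetical illustration (not the actual result referred to above), an expression such as A*b + C would read as "(A present AND B absent) OR (C present)", and could be checked directly in code:

    # Hypothetical illustration of Boolean (QCA-style) notation: "A*b + C"
    # reads as (A present AND B absent) OR (C present).
    def configuration_met(A, B, C):
        return (A and not B) or C

    print(configuration_met(A=True, B=False, C=False))   # True: first term satisfied
    print(configuration_met(A=False, B=True, C=False))   # False: neither term satisfied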

Decision trees


In parallel I have been reading about and testing some data mining methods, especially classification trees (see recent blog). These are also able to identify multiple configurations of causal conditions. In addition they produce user friendly results in the form of tree diagrams, which are easy to read and understand. The same kind of decision trees can be used to represent the results of QCA analyses. In fact they can be used in a wide range of ways, including more participatory and ethnographic forms of inquiry (see Ethnographic Decision Models). From an evaluation perspective I think Decision Trees could be a very valuable tool, one which could help us answer the frequently asked question of "what works well in what circumstances". This is because they can provide summary statements of the various combinations of conditions that lead to desired outcomes in complex circumstances.

In the first set of graphics below I have shown how Decision Trees can represent four important and distinct types of causal combinations. These relate to whether an observed condition can be seen as a Necessary and/or Sufficient cause. The graphic is followed by four fictional data sets, each of which contains one of the causal combinations shown in the graphic (highlighted in yellow). Double click on the graphic to make it easier to read.
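
As a rough sketch of the logic behind these distinctions (using an invented mini data set, not the fictional data sets in the graphic), the necessity and sufficiency of a single condition can be checked like this:

    # Sketch only: an invented set of cases, each with one condition and one outcome (1 = present).
    cases = [
        {"condition": 1, "outcome": 1},
        {"condition": 1, "outcome": 1},
        {"condition": 0, "outcome": 0},
        {"condition": 1, "outcome": 0},
    ]

    # Necessary: every case with the outcome also has the condition present.
    necessary = all(c["condition"] for c in cases if c["outcome"])

    # Sufficient: every case with the condition present also has the outcome.
    sufficient = all(c["outcome"] for c in cases if c["condition"])

    print(necessary, sufficient)  # True False: necessary but not sufficient in this invented data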
 

Implications for evaluation work


There has been a lot of discussion amongst evaluators of development projects about whether it is appropriate to talk about causal attribution versus causal contribution, and in the latter case, how causal contribution can be best described. Descriptions of particular conditions in terms of whether they are necessary and/or sufficient are one way of doing so, especially when made visible in particular Decision Tree structures.

When necessary or sufficient conditions (1,2,3) are believed to be present this should provide some focus for evaluation efforts, enabling the direction of scarce evaluation attention towards the most vulnerable part of an explanatory model.

It has been argued that the most common causal configuration is 4, where an intervention is a necessary part of a package, but the package itself is not essential because other packages can also generate the same results. If so, this suggests the need for some modesty by development agencies in their claims about making an impact, and some generosity of views about the importance of other influences.

How do decision trees relate to Theories of Change?

 

The comparator here is the kind of diagrammatic Theories of Change seen in Funnell and Rogers (2011) Purposeful Program Theory. A common feature of most of their examples is that they show a sequence of events over time, leading to an expected outcome. We could call them causal pathway ToC. In my view these would include LogFrames, although some people don't consider these as embodying a ToC.

I would argue that Decision Trees can also describe a ToC, but there are significant differences:

1. Decision Trees tend to describe multiple configurations that as a set can explain all observed outcomes. ToC, especially LogFrames, tend to describe a single configuration that will lead to one desired outcome. In doing so each part of the configuration appears to be necessary but not sufficient for the expected outcome.

2. Decision Trees describe configurations but not sequences. It is important to note that in Decision Trees there is no causal direction implied by relative positions in the branch structure, unlike in a ToC.  The sequence of conditions associated together along a branch could in theory be in any order. What matters is what conditions are associated with what.

3. Decision Tree models are testable. Unlike most causal pathway ToC (at least those that I know of), Decision Trees can be generated directly from one data set (i.e. a training set), and they can then be tested against another data set (i.e. test data) containing new cases with the same kinds of attributes and outcomes. These tests examine not only whether the predicted outcome happened when the expected attributes were present, but also whether the predicted outcome did not happen when the expected attributes were absent.

Causal pathway ToC are testable, by examining whether their implementation leads to the achievement of target values on performance indicators. The opposite possibility is also testable in principle, by observing if expected outcomes were absent when events in the causal pathway did not take place, via the use of a control group. However, compared to Decision Tree models, this kind of testing is much more laborious, and requires considerable upfront preparation.
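
By contrast, the train-then-test cycle for a Decision Tree (point 3 above) can be sketched in a few lines of code. A minimal sketch using scikit-learn and invented binary data (the column names are illustrative only):

    # Minimal sketch: fit a Decision Tree on a training set, then test it on held-out cases.
    # The binary data below is invented for illustration only.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, export_text
    from sklearn.metrics import accuracy_score

    data = pd.DataFrame({
        "condition_a": [1, 1, 0, 0, 1, 0, 1, 0, 1, 0],
        "condition_b": [1, 0, 1, 0, 1, 1, 0, 0, 1, 1],
        "outcome":     [1, 0, 0, 0, 1, 0, 1, 0, 1, 0],
    })

    X, y = data[["condition_a", "condition_b"]], data["outcome"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
    print(export_text(tree, feature_names=["condition_a", "condition_b"]))  # the learned rules
    print("test accuracy:", accuracy_score(y_test, tree.predict(X_test)))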

Despite the differences there is also some potential inter-operability between Decision Tree models and causal pathway ToC:

1. An expected causal sequence of events in a ToC (e.g. in a LogFrame) could be represented in a Decision Tree as a collection of attributes all located in one branch. Looking in the reverse direction, different branches of Decision Trees can be seen as constituents of separate causal pathways in ToCs that have more of a network than a chain structure.

2. While Logframes may be suitable for individual projects, Decision Tree models may be suitable for portfolios of projects, capturing the difference in contexts and interventions that are involved in different projects.

3. Decision trees have some compatibility with Realist Evaluators' ways of thinking about change. The Realist Evaluation formulation of "Context + Mechanism = Outcome" type configurations can easily be represented in the above tables by grouping conditions into two broad categories, Context and Mechanism conditions, alongside the Outcome condition.

Decision tree analysis of QCA data set

 

 Decision Tree algorithms can be used as a means of triangulating the results generated by other methods such as QCA. 

The following table of data can be found in a paper on "Women’s Representation in Parliament: A Qualitative Comparative Analysis" by Krook (2010).

The values in this table were then converted to binary values, using various cut-off values explained in the paper, resulting in the table below.

In Krook's paper this data was analysed using QCA. I have taken the same data set and used RapidMiner to generate the following Decision Tree, which enables you to find all cases where women's participation in national parliament was high (defined as above 17%).

The same result was found via the QCA analysis:

Translated this means:

IF Quotas AND Post-conflict situation THEN high % women in Parliament [= far right branch]
OR
IF Women's status is high AND Post-conflict situation THEN high % women in Parliament [= 3rd from left branch]
OR
IF Quotas AND NO post-conflict situation AND women's status is high THEN high % women in Parliament [= 3rd from right branch]

Assessing the performance of decision trees

 

Relative to causal pathway ToC, there are many systematic ways to assess the performance of Decision Trees.

 1. When used for description purposes

There are two useful measures of the performance of decision trees when they have been developed as summary representations of known cases:

1. Purity: Are the cases found at the end of a branch all of one kind of outcome (i.e. pure), or a mixture of kinds?

2. Coverage: What proportion of all positive cases that exist are found at the end of any particular branch? In data mining exercises branches that have low coverage are often "pruned", i.e. removed from the model, to reduce the complexity of the model (and thus help increase its generalisability).
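
A back-of-the-envelope version of these two measures for a single branch, using invented case counts, would be:

    # Sketch: purity and coverage for one branch, using invented case counts.
    positives_in_branch = 14      # cases at the end of the branch where the outcome is present
    cases_in_branch = 16          # all cases at the end of the branch
    positives_in_dataset = 19     # all positive cases in the whole data set

    purity = positives_in_branch / cases_in_branch          # 0.88: how "pure" the branch end is
    coverage = positives_in_branch / positives_in_dataset   # 0.74: share of all positives it captures

    print(round(purity, 2), round(coverage, 2))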

QCA uses similar measures of consistency and coverage. See page 84 of the fsQCA manual

Decision Trees can also be compared in terms of their simplicity, with simpler being better. The simplest measure is the number of branches in the tree, relative to the total number of cases (fewer = better). Another is the number of attributes used in the tree, relative to all available (fewer = better).

2. When used for prediction purposes

After having been developed as good descriptive models, decision trees are often then used as predictive models. At that stage different performance criteria come into play.

The most important metric is prediction accuracy: the ability of the Decision Tree to accurately classify new cases. From what I have read, it seems that a commonly cited minimum level of accuracy is 80%, but the rationale for this cut-off point is unclear. Both predictive and descriptive accuracy can be measured using a Confusion Matrix.

"I wanted to add that a typical trade-off analysis is done with learners in general (and decision trees are no exception) that compares model accuracy within a data set to model accuracy at classifying new data. A more generalizable model would be more favorable for predictive analysis. A more accurate, specialized model would be good for understanding a particular data set. Limiting the tree-depth is (in my opinion) probably the fastest way to explore these trade-offs."[from rakirk on RapidMiner blog]

Greater descriptive accuracy risks what data mining specialists call "over-fitting" - that is, after a certain point is reached, the descriptive model's ability to accurately predict outcomes in a new set of cases will start to diminish (a classic trade-off between internal and external validity).
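
A rough sketch of that trade-off, using synthetic data rather than any of the data sets discussed here: as tree depth grows, training (descriptive) accuracy rises while test (predictive) accuracy can start to fall.

    # Sketch: train vs. test accuracy and a confusion matrix, using synthetic data,
    # to show the over-fitting trade-off as tree depth increases.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score, confusion_matrix

    X, y = make_classification(n_samples=200, n_features=8, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    for depth in (2, 4, None):   # None = grow the tree until every leaf is pure
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
        print("depth", depth,
              "train accuracy", round(accuracy_score(y_train, tree.predict(X_train)), 2),
              "test accuracy", round(accuracy_score(y_test, tree.predict(X_test)), 2))

    print(confusion_matrix(y_test, tree.predict(X_test)))  # rows = actual, columns = predicted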

Moore et al. (2001) provide criteria that mix both descriptive and predictive purposes. In their view "... the most desirable trees are:
1.  Accurate (low generalization error rates)
2.  Parsimonious (representing and generalizing the relationships succinctly)
3.  Non-trivial (producing interesting results)
4.  Feasible (time and resources)
5. Transparent and interpretable (providing high level representations of and insights into the data relationships, regularities, or trends)"

More information on decision trees, which is not maths intensive!

New software


PS July 2012: I have just found out about BigML, an online service where you can upload data, create Decision Tree models, test them and use them to generate predictions. So far it looks very good, although still under development. I have been offered extra invitation codes, which I can share with interested readers. If you are interested, email rick at gmail.com

I have been experimenting with two data sets on BigML, one is the results of a 2006 household poverty survey in Vietnam (596 cases, 23 attributes), and the other is a randomly generated data set (102 cases, 11 attributes).

A Decision Tree model of the household poverty data has the potential to enable people to do two things:

  • Find classification rules that identify households with poverty scores in a given range, e.g. above a known poverty level. Useful if you want to target assistance to specific groups.
  • Find the poverty score of a specific household with a given set of attributes. Useful if you want to see if it is eligible for a targeted package of assistance.
Here is a graphic of the BigML Decision Tree model. It's unorthodox in that it does not display branches with negative cases, but this approach does simplify the layout of complex trees. On the right of the tree is the decision rule associated with the highlighted branch. The outcome it predicts (the leaf at the end of the branch) is the Basic Necessity Survey (BNS) poverty score for the households in that group (32 in the right side branch).

This tree has been minimally pruned, and shows branch ends containing 1% or more of all cases (i.e. 5 or more in this case). The highlighted branch shows one classification rule that accounts for about 8% of all households above the poverty line. All the green nodes in the tree account for around 92% of all households above the poverty line. The remainder will be found when the other colored "leaf" nodes are clicked on.

My main finding from this exercise is that there is no classification rule that accounts for a large proportion of cases. The largest is one rule (Bathroom + Motorbike + Pesticide pump + Stone built house) that accounts for 31% of households above the poverty line. My interpretation is that this finding reflects the diversity of causal influences present, the most important probably being the agency of the households themselves.
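
As a sketch of how the coverage of such a rule could be checked against the raw survey data (the file name, column names and cut-off below are invented stand-ins, not the actual BNS variables):

    # Sketch: what share of not-poor households does one classification rule account for?
    # File name, column names and the cut-off are invented stand-ins for the BNS survey data.
    import pandas as pd

    households = pd.read_csv("bns_survey.csv")        # hypothetical file of 0/1 survey answers
    poverty_line = 25                                 # hypothetical BNS score cut-off

    rule = (
        (households["bathroom"] == 1)
        & (households["motorbike"] == 1)
        & (households["pesticide_pump"] == 1)
        & (households["stone_house"] == 1)
    )
    not_poor = households["bns_score"] > poverty_line

    coverage = (rule & not_poor).sum() / not_poor.sum()
    print("rule covers {:.0%} of households above the poverty line".format(coverage))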






PS 15 July 2012: Although at the start of this blog I made a clear distinction between four types of situations, where a condition or attribute is necessary and/or sufficient, it could be argued that there are degrees of necessity. If a complex decision tree has 25 branches (or explanatory rules), as in the above example, a certain condition may be present in many of the branches (as a necessary but not sufficient part of a package that is itself sufficient but not necessary, i.e. INUS). For example, having a watch is a condition present in 4 of the 25 branches. One way of looking for conditions that are relatively necessary is to look at the upper levels of the tree. Having a bathroom is relatively necessary: it is a necessary part of 14 of the 25 branches. This is still a fairly crude measure; we also need to take into account what proportion of all the cases are covered by these 14 branches. In this example, the 14 branches cover 70% of all the cases (households). Having a stone built house is not a strictly necessary condition for being judged not-poor, but it is a fairly necessary condition!
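
A sketch of this "degrees of necessity" idea, treating the tree as a list of rules (each rule being the set of conditions along one branch), with invented contents:

    # Sketch: how "relatively necessary" is a condition, if the tree is expressed as rules?
    # Each rule = (set of conditions along one branch, number of cases at the branch end).
    # The rules and counts below are invented for illustration.
    rules = [
        ({"bathroom", "motorbike"}, 120),
        ({"bathroom", "stone_house"}, 90),
        ({"watch", "television"}, 40),
        ({"bathroom", "watch"}, 30),
    ]

    total_cases = sum(n for _, n in rules)

    for condition in ("bathroom", "watch"):
        in_branches = sum(1 for conds, _ in rules if condition in conds)
        covered = sum(n for conds, n in rules if condition in conds)
        print(condition, "appears in", in_branches, "of", len(rules), "branches,",
              "covering {:.0%} of cases".format(covered / total_cases))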

PS 18 July 2012: One dimension of the structure of a Decision Tree is its "diversity". After Stirling (2007), diversity can be seen as a mix of variety (number of branches), balance (spread of cases across those branches) and disparity (distance between the ends of each branch, measured by degrees - the number of intervening linkages). A rougher measure is simply the number of branches x the number of kinds of attributes making up all those branches. Diversity suggests, to me, a larger number of causes at work. How does this diversity connect to notions of complexity? Diversity and complexity are not simply one and the same thing. My reading is that complexity = diversity + structure (relationships between diverse entities). I need to go back and read / finish reading Page, S. (2011) "Diversity and Complexity", and "Diversity versus Complexity" by Shahid Naeem (2001).
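
One rough way of computing the variety and balance components just described, from the spread of cases across branch ends (again with invented counts), is sketched below; here balance is approximated by Shannon evenness.

    # Sketch: variety (number of branches) and balance (evenness of the spread of cases
    # across branches), two of Stirling's three diversity components. Counts are invented.
    import math

    cases_per_branch = [120, 90, 40, 30]

    variety = len(cases_per_branch)
    total = sum(cases_per_branch)
    shares = [n / total for n in cases_per_branch]

    # Shannon evenness: 1.0 means cases are spread perfectly evenly across the branches.
    balance = -sum(p * math.log(p) for p in shares) / math.log(variety)

    print(variety, round(balance, 2))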

Thursday, May 24, 2012

A perspective on "Value for Money" relationships


The constituents of "value for money"


Matrices can be a useful means of showing the results of different combinations of things. In this matrix I show how three different performance attributes can be seen as the results of different combinations of change in unit costs and effectiveness.
Source: Department of Crude Measures

PS: DFID and ICAI documents talk about Value for Money (VfM) as being made up of three elements: Economy, Efficiency and Effectiveness. But if we take VfM literally, as being about a relationship between value and money, then two of these three elements don't belong. Economy is just about money and effectiveness is just about value. For more, perhaps too much, on ideas about VfM, see this list of documents at www.mande.co.uk

Another take on definitions

My client is faced with the task of comparing multiple diverse projects within a policy portfolio. I have to think about what sort of comparisons are possible in this context. I come up with the following matrix:
Applying this simple set of distinctions may not be so easy. At what point would you be able to say two or more interventions were the same kind and scale? Or that the outcomes of two or more interventions were the same kind and scale?

Thursday, April 19, 2012

Data mining algorithms as evaluation tools


For years now I have been in favour of theory-led evaluation approaches. Many of the previous postings on this website are evidence of this. But this post is about something quite different: a particular form of data mining, how to do it, and how it might be useful. Some have argued that data mining is radically different from hypothesis-led research (or evaluation, for that matter). Others have argued that there are some important continuities and complementarities (Yu, 2007).

Recently I have started reading about different data mining algorithms, especially the use of what are called classification trees and genetic algorithms (GAs). The latter were the subject of my recent post, about whether we could evolve models of development projects as well as design them. Genetic algorithms are software embodiments of the evolutionary algorithm (i.e. iterated variation, selection, retention) at work in the biological world. They are good for exploring large possibility spaces and for coming up with new solutions that may not be close to current practice.

I had wondered if this idea could be connected to the use of Qualitative Comparative Analysis (QCA), a method of identifying configurations of attributes (e.g. of development projects) associated with a particular type of outcome (e.g. reduced household poverty). QCA is a theory-led approach, which uses very basic forms of data about attributes (i.e. categorical), then describes configurations of these attributes using Boolean logic expressions, and analyses these with the help of software that can compare and manipulate these statements. The aim is to come up with a minimal number of simple “IF…THEN” type statements describing what sorts of conditions are associated with particular outcomes. This is potentially very useful for development aid managers, who are often asking about “what works where in what circumstances”. (But before then there is the challenge of getting on top of the technical language required to be able to do QCA.)

My initial thought was whether genetic algorithms could be used to evolve and test statements describing different configurations, as distinct from constructing them one by one on the basis of a current theory. This might lead to quicker resolution, and perhaps discoveries that had not been suggested by current theory.

As described in my previous post, there is already a simple GA built into Excel, known as Solver. What I could not work out was how to represent logic elements like AND, NOT, OR in such a way that Solver could vary them to create different statements representing different configurations of existing attributes.  In the process of trying to sort out this problem I discovered that there is a  whole literature on GAs and rule discovery (rules as in IF-THEN statements). Around the same time, a technical adviser from FrontlineSolver suggested I try a different approach to the automated search for association rules. This involved the use of Classification Trees, a tool which has the merit of producing results which are readable by ordinary mortals, unlike the results of some other data mining methods. 

An example!

This Excel file contains a small data set, which has previously been used for QCA analysis. It contains 36 cases, each with 4 attributes and 1 outcome of interest. The cases relate to different ethnic minorities in countries across Europe and the extent to which there has been ethnic political mobilisation in their countries (being the outcome of interest). Both the attributes and outcomes are coded as either 0 or 1 meaning absent or present. 

With each case having up to four different attributes there could be 16 different combinations of attributes. A classification algorithm in XLMiner software (and others like it) is able to automatically sort through these possibilities to find the simplest classification tree that can correctly point to where the different outcomes take place. XLMiner produced the following classification tree, which I have annotated and will walk through below.



We start at the top with the attribute “large”, referring to the size of the linguistic subnation within their own country. Those that are large have then been divided according to whether their subnational region is “growing” or not. Those that are not growing have then been divided into those that are a relatively “wealthy” group within their nation and those that are not. The smaller linguistic subnations have also been divided into those that are a relatively wealthy group within their nation and those that are not, and those that are relatively wealthy are then divided according to whether their subnational region speaks and writes in its own language or not. The square nodes at the end of each “branch” indicate the outcome associated with these combinations of conditions - where there has been ethnic political mobilisation (1) or not (0). Under each square node are the ethnic groups placed in that category. These results fit with the original data in Excel (right column).

This is my summary of the rules described by the classification tree:
  • IF a linguistic subnation’s economy is large AND growing THEN ethnic political mobilisation will be present [14 of 19 positive cases]
  • IF a linguistic subnation’s economy is large, NOT growing AND is relatively wealthy THEN ethnic political mobilisation will be present [2 of 19 positive cases]
  • IF a linguistic subnation’s economy is NOT large AND is relatively wealthy AND speaks and writes in its own language THEN ethnic political mobilisation will be present [3 of 19 positive cases]
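
Read off the tree, those three rules could also be expressed as a single classification function (a sketch only, not the XLMiner output itself):

    # Sketch: the three branches above, re-expressed as IF-THEN rules in code.
    def predicts_mobilisation(large, growing, wealthy, writes_own_language):
        """True if any one of the three positive configurations is present."""
        if large and growing:
            return True
        if large and not growing and wealthy:
            return True
        if not large and wealthy and writes_own_language:
            return True
        return False

    print(predicts_mobilisation(large=True, growing=True, wealthy=False, writes_own_language=False))   # True
    print(predicts_mobilisation(large=False, growing=False, wealthy=True, writes_own_language=False))  # False
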
Both QCA and classification trees have procedures for simplifying the association rules that are found. With classification trees there is an automated “pruning” option that removes redundant parts. My impression is that there are no redundant parts in the above tree, but I may be wrong.
These rules are, in realist evaluation terminology, describing three different configurations of possible causal processes. I say "possible" because what we have above are associations. Like correlation coefficients, they don't necessarily mean causation. However, they are at least candidate configurations of causal processes at work.

The origins of this data set and its coding are described in pages 137-149 of The Comparative Method: Moving Beyond Qualitative and Quantitative Strategies by Charles C. Ragin, viewable on Google Books. Also discussed there is the QCA analysis of this data set and its implications for different theories of ethnic political mobilisation. My thanks to Charles Ragin for making the data set available.

I think this type of analysis, by both QCA and classification tree algorithms, has considerable potential use in the evaluation of development programs. Because it uses nominal data the range of data sources that can be used is much wider than statistical methods that need ratio or interval scale data. Nominal data can either be derived from pre-existing more sophisticated data (by using cut-off points to create categories) or be collected as primary data, including by participatory processes such as card/pile sorting and ranking exercises. The results in the form of IF…THEN rules should be of practical use, if only in the first instance as a source of hypotheses needing further testing by more detailed inquiries. 

There are some fields of development work where large amounts of potentially useful, but rarely used, data are generated on a continuing basis, such as microfinance services and, to a lesser extent, health and education services. Much of the demand for data mining capacity has come from industries that are finding themselves flooded with data, but lack the means to exploit it. This may well be the case with more development agencies in the future, as they make more use of interactive websites, mobile phone data collection methods and the like.

For those who are interested, there is a range of software worth exploring in addition to the packages I have mentioned above. See these lists: A and B. I have a particular interest in GATree, which uses a genetic algorithm to evolve the best fitting classification tree and to avoid the problem of being stuck in a “local optimum”. There is also another type of algorithm with the delightful name of Random Forests, which uses the “wisdom of crowds” principle to find the best fitting classification tree. But note the caveat: “Unlike decision trees, the classifications made by Random Forests are difficult for humans to interpret”. These and other algorithms are in use by participants in the Kaggle competitions online, which themselves could be considered as a kind of semi-automated meta-algorithm (i.e. an algorithm for finding useful algorithms). Lots to explore!

PS: I have just found and tested another package, called XLSTAT, that also generates classification trees. Here is a graphic showing the same result as found above, but in more detail. (Click on the image to enlarge it)

PS 29 April 2012: In QCA distinctions are made between a condition being "necessary" and/or "sufficient" for the outcome to occur. In the simplest setting a single condition can be a necessary and sufficient cause. In more complex settings a single condition may be a necessary part of a configuration of conditions which is itself sufficient but not necessary - for example, a "growing" economy in the right branch of the first tree above. In classification trees the presence/absence of necessary/sufficient conditions can easily be observed. If a condition appears in all "yes" branches of the tree (= different configurations) then it is "necessary". If a condition appears along with another in a given "yes" branch of a tree then it is not "sufficient". "Wealthy", for example, is not sufficient on its own, and because it is absent from the large-and-growing branch it is not strictly necessary either. See more on this distinction in a more recent post: Representing different combinations of causal conditions

PS 4 May 2012: I have just discovered there is what looks like a very good open source data mining package called RapidMiner, which comes with a whole stack of training videos, and a big support and development community


PS 29 May 2012: Pertinent comment from Dilbert 

PS 3 June 2012: Prediction versus explanation: I have recently found a number of web pages on the issue of prediction versus explanation. Data mining methods can deliver good predictions. However, information relevant to good predictions does not always provide good explanations, e.g. smoking may be predictive of teenage pregnancy but it is not a cause of it (see an interesting exercise here). So is data mining a waste of time for evaluators? On reflection it occurred to me that it depends on the circumstances and how the results of any analysis are to be used. In some circumstances the next step may be to choose between existing alternatives, for example which organisation or project to fund. Here good predictive knowledge, based on data about past performance, would be valuable. In other circumstances a new intervention may need to be designed from the ground up. Here access to some explanatory knowledge about possible causal relationships would be especially relevant. On further reflection, even where a new intervention has to be designed, it is likely that it will involve choices of various modules (e.g. kinds of staff, kinds of activities) where knowledge of their past performance record is very relevant. But so would be a theory about their likely interactions.

At the risk of being too abstract, it would seem that a two-way relationship is needed: proposed explanations need to be followed by tested predictions, and successful predictions need to be followed by verified explanations.











Thursday, April 05, 2012

Criteria for assessing the evaluability of Theories of Change


2019 05 21 Update: Please also see

Evaluability Assessments: Reflections on a review of the literature. Davies, R., Payne, L., 2015. Evaluation 21, 216–231. PDF copy


2012: Our team has recently begun work on an evaluability assessment of an agency's work in a particular policy area, covering many programs in many countries. Part of our brief is to examine the evaluability of the programs' Theory of Change (ToC). 

In order to do this we clearly need to identify some criteria for assessing the evaluability of ToC. I initially identified five which I thought might be appropriate, and then put these out to the members of the MandE NEWS email list for comment. Many comments were quickly forthcoming. In all, a total of 20 people responded in the space of two days (Thanks to Bali, Dwiagus, Denis, Bob, Helene, Mustapha, Justine, Claude, Alex, Alatunji, Isabel, Sven, Irene, Francis, Erik, Dinesh, Rebecca, John, Rajan and Nick).

Caveats and clarifications

What I have presented below is my current perspective on the issue of evaluability criteria, as informed by these responses. It is not intended to be an objective and representative description of the responses (Look here for a copy of all the comments received). (You can also download this posting as a pdf.)

The word "evaluable" needs some clarification. In the literature on evaluability assessments it has two meanings. The main one is that it is possible to evaluate something. For example, if the theory is clear and the data is available. The second meaning is more practically oriented. The theory may be clear and the data available, but the theory may be so implausible that it is simply not worth expending resources on its evaluation. Or there may be a perfectly good ToC, but if no one owns it apart from a consultant who visited the project six months ago, so it might be questionable whether expensive resources should be invested in its evaluation.

We also need to distinguish between an evaluable ToC and a “good” ToC. A ToC may be evaluable because the theory is clear and plausible, and relevant data is available. But as the program is implemented, or following its evaluation, it might be discovered that the ToC was wrong, that people or institutions don’t work the way the theory expected them to. It was a “bad” ToC. Alternatively, it is also possible that a ToC may turn out to be good, but the poor way it was initially expressed made it un-evaluable, until remedial changes were made.

This brings us to a third clarification. My minimalist definition of a ToC is quite simple: “the description of a sequence of events that is expected to lead to a particular desired outcome”. Such a description could be in text, tables, diagrams or a combination of these. Falling within the scope of this definition we could of course find ToC that are evaluable and those that are not so evaluable.

A possible list of criteria for assessing the evaluability of a Theory of Change (Version 2)
• Understandable
  o Do the individual readers of the ToC find it easy to understand? Is the text understandable? If used, is the diagram clear?
  o Do different people interpret the ToC in the same way?
  o Do different documents give consistent representations of the same ToC?
• Verifiable
  o Are the events described in a way that could be verified? This is the same territory as that of Objectively Verifiable Indicators (OVIs) and Means of Verification (MoVs) found in LogFrames.
• Testable
  o Are there identifiable causal links between the events? Often there are not.
  o Are the linked events parts of an identifiable causal pathway?
• Explained
  o Are there explanations of how the connections are expected to work? Connections are common, explanations of the causal process involved are much less so.
  o Have the underlying assumptions been made explicit? (also duplicated below)
• Complete
  o Does what might be a long chain of events make a connection between the intervening agent and the intended beneficiaries (/the target of their actions)? In a recent ToC that I have seen, the ToC is quite detailed at the beneficiary end, but surprisingly vague and unspecific towards the agent’s end, even though that is where accountability might be more immediately expected.
• Inclusive (a better term is needed here)
  o Does the ToC encompass the diversity of contexts it is meant to cover? In ToC covering whole portfolios of projects there could be a substantial diversity of contexts and interventions. Does the ToC provide room for these without sacrificing too much in terms of verifiability and testability? See Modular Theories of Change: A means of coping with diversity and change? for some views on how to respond to this challenge.
• Justifiable (new)
  o Is there evidence supporting the sequence of events in the ToC? Either from past studies, previous projects, and/or from a situation analysis/baseline study or the like which is part of the design/inception stage of the current project.
• Plausible (new)
  o Where there is no prior evidence, is the sequence of events plausible, given what is known about the intervention and the context?
  o Have the underlying assumptions been made explicit?
  o Have contextual factors been recognised as important mediating variables?
• Owned
  o Can those responsible for the contents of the ToC be identified?
  o How widely owned is the ToC?
  o Do their views have any consequences?
• Embedded
  o Are the contents of the ToC also referred to in other documents that will help ensure that it is operationalized?

Weighting

It was sensibly suggested that some criteria were more important than others. One contributor argued that if you can establish that the causal links in a ToC are evidence-based then "ownership will and shall follow".

In individual evaluability assessments a simple sense of their relative priority may be sufficient. When comparisons need to be made of the evaluability of multiple programs, it may be necessary to think about weighted scoring mechanisms/checklists. 

Purpose

It was suggested that the criteria used would depend on the purpose for which the ToC was created. An understanding of the Purpose could therefore inform the weighting given to the different criteria.
Prior to consulting the email list members I had drafted a list of three possible purposes that could generate different kinds of evaluation questions, which an evaluability assessment would need to consider. They were:
• If the purpose of the ToC was to set direction
  o Then we need to ask: were programs designed accordingly?
• If the purpose of the ToC was to make a prediction
  o Then we need to ask: did the programs subsequently turn out this way?
• If the purpose of the ToC was to provide a summation
  o Then we need to ask: is this an accurate picture of what actually happened?

One criticism of the inclusion of prediction was that most ToC are nothing like scientific models and because of this they are typically insufficient in their contents to generate any attributable predictions.  This may be true in the sense that scientific predictions aim to be generalisable, albeit subject to specific conditions e.g. that gravity behaves the same way in different parts of the universe. But most program ToC have much more location-specific predictions in mind, e.g. about the effects of a particular intervention in a particular place. There are interesting exceptions however, such as a ToC about a whole portfolio of programs, or a ToC about a whole policy area that might be operationalised through investment portfolios managed in a range of countries. There the criticism of incapacity may be more relevant.

The same critic proposed an alternative purpose to prediction, one where simplicity might be more of a virtue than a liability. A ToC may aim to communicate or generate insight, by focusing on the core of an idea that is driving or inspiring a program. If so, then evaluation questions could focus on how the ToC has changed the users’ understanding of the issues involved. This question about effects could be extended to include the effects of participation in the process whereby the ToC was developed.
PS: A similar point was made by another contributor, in a parallel related discussion on the KBF email list, who distinguished between two purposes:
  • to model a situation to better understand it and programme around it
  • to simplify a complex situation to help explain it to others and persuade them of the logic of your proposed intervention (e.g. for funding).
...noting that “in practice there is often a tradeoff between the explanatory and persuasive aspects of the underlying logic”.

Issues arising about criteria

The following issues were raised.
• Process and Product: The list above is largely about the ToC product, not the process whereby it was created. Some argued there needed to be a participatory process of development to ensure the ToC was “aligned with the needs of beneficiaries and the national objectives”. However, others argued that “ToC are not ‘development projects’ that must be aligned with the Paris Declaration, but rather tools that must be rigorous, applied without ‘complaisance’”. The hoped-for reality might lie in between: ToC typically are associated with specific project interventions, and the extent of their ownership is relevant to answering the practical aspects of evaluability. On the other hand, the rigour of their use as tools will affect their usefulness and whether they can be evaluated. The product-oriented criteria given above do include two criteria that may reflect the effects of a good development process, i.e. ownership and embeddedness.
• Ownership: It was argued that ownership is not a criterion of a good ToC; often the consensus in science has been proved wrong. But in the above list the criterion of ownership is relevant to whether the ToC is worth evaluating; it is not a criterion of the value of the belief or understanding represented by the ToC. It could be argued that widely owned views of how a project is working are eminently worth evaluating, because of the risk that they are wrong.
• On the other hand, this approach might lead to the view that ToC with few owners should not be evaluated. This view was in effect questioned by an example cited of an evaluator coming up with their own alternative ToC, based on prior evaluation studies and research, in contrast to the politically motivated views of the official in charge of a program. This brings us back to the criteria listed above, and the idea of weighting them according to context (ownership versus justifiability).
• Relevance: This proposed criterion begs the question: relevant to whom? Ownership of the ToC (voluntary or mandated) would seem to signify a degree of relevance.
• Falsifiability: It was argued that this is the pre-eminent criterion of a good scientific theory, and one which needs more attention by development agencies when thinking about the ToC behind their interventions. The criteria in the list above address this to some extent by inquiring about the existence of clear causal links, along with good explanations for how they are expected to work. Perhaps “good” needs to be replaced by falsifiable, though I worry about setting the bar too high when most ToC I see barely manage to crawl. Many decent ToC do include multiple causal links. The more there are, the more vulnerable they are to disproof, because only one link needs to fail for the ToC not to work. This could be seen as a crude measure of falsifiability.
• Flexibility: Although it was suggested that ToC be flexible and adaptable, this view is contentious, in that it seems to contradict the need for specificity (by being verifiable, testable, and explained) and thus falsifiability. However, there is no in-principle reason why a ToC can’t be changed. If it is, it becomes a different ToC, subject to a separate evaluation. It is not the same one as before. The only point to note here is that the findings of the adapted version would not validate the content of the earlier version.
• Lack of adaptability may also be a problem. It was suggested that evaluators should ask “When has the ToC been reviewed and how has it been adapted in the light of implementation experience, M&E data, dialogue and consultation with stakeholders?” If the answer is not for a long time, then there may be doubts about its current relevance, which could be reflected in limited ownership.
• Clarity of logic as well as evidence: One commentator suggested that it should be made clear whether a given cause is both “necessary and sufficient”, presumably as distinct from alternative combinations of these terms. Necessity and sufficiency is a demanding criterion, and it is arguable whether many programs would satisfy it, or perhaps even should.
• Simplicity: This suggested requirement (captured by Occam’s razor) is not as simple a requirement as it might sound. It will always be in tension with its opposite (captured by Ashby’s Law of Requisite Variety), which is that a theory must have sufficient internal complexity to describe the complexity of the events it is seeking to describe. Along the same lines, some commentators asked whether enough detail was provided, the lack of which can affect verifiability and testability. Simplicity may win out as the more important criterion where a ToC is primarily intended as a communication tool.
• Justifiability was highlighted as important. Plausibility was questioned: “What does that really mean? If based on common sense then it is incompatible with being evidence based! If humanity had to rely on common sense, the earth would still be flat!!” Plausibility is clearly not a good evaluation finding. But it is a useful finding for an evaluability assessment. If a ToC is not plausible then it makes no sense to go any further with the design of an evaluation. Justifiability is evidence of a good ToC, and is a judgement that might follow an evaluation. However, it might also be obvious before an evaluation, through an evaluability assessment, and lead to a decision that a further evaluation would not be useful.
Informed sources mentioned by contributors

Connell, J.P. & Kubisch, A.C. (1998) Applying a theory of change approach to the evaluation of comprehensive community initiatives: progress, prospects and problems, in: K. Fulbright-Anderson, A.C. Kubisch & J.P. Connell (Eds) New Approaches to Evaluating Community Initiatives. Volume 2: Theory, measurement and analysis (Queenstown, The Aspen Institute).  [courtesy of  John Mayne]

Connell and Kubisch suggest a number of attributes of a good theory of change.
• It should be plausible. Does common sense or prior evidence suggest that the activities, if implemented, will lead to desired results?
• It should be agreed. Is there reasonable agreement with the theory of change as postulated?
• It should be embedded. Is the theory of change embedded in a broader social and economic context, where other factors and risks likely to influence the desired results are identified?
• It should be testable. Is the theory of change specific enough to measure its assumptions in credible and useful ways?

Other sources that may be of interest
PS 30 April 2012: See also HIVOS posting on "How can I recognise a good quality Theory of Change?"