Sunday, June 09, 2019

Extracting additional value from the analysis of QuIP data



James Copestake, Marlies Morsink and Fiona Remnant are the authors of "Attributing Development Impact: The Qualitative Impact Protocol Casebook" published this year by Practical Action
As the title suggests, the book is all about how to attribute development impact, using qualitative data - an important topic that will be of interest to many. The book contents include:
  • Two introductory chapters:
    • 1. Introducing the causal attribution challenge and the QuIP
    • 2. Comparing the QuIP with other approaches to development impact evaluation
  • Seven chapters describing case studies of its use in Ethiopia, Mexico, India, Uganda, Tanzania, Eastern Africa, and England
  • A final chapter synthesising issues arising in these case studies
  • An appendix detailing guidelines for QuIP use

QuIP is innovative in many respects, perhaps most notably in the way data is gathered and coded. Neither the field researchers nor the communities of interest are told which organisation's interventions are being evaluated, an approach known as "double blindfolding". The aim is to mitigate the risk of "confirmation bias", as much as is practical in a given context.

The process of coding the qualitative data that is collected is also interesting. The focus is on identifying causal pathways in the qualitative descriptions of change obtained through one-to-one interviews and focus group discussions. In Chapter One the authors explain how the QuIP process uses a triple coding approach, which divides each reported causal pathway into three elements:

  • Drivers of change (causes): What led to change, positive or negative?
  • Outcomes (effects): What change/s occurred, positive or negative?
  • Attribution: What is the strength of association between the causal claim and the activity or project being evaluated?
"Once all change data is coded then it is possible to use frequency counts to tabulate and visualise the data in many ways, as the chapters that follow illustrate". An important point to note here is that although text is being converted to numbers, because of the software which has been developed, it is always possible to identify the source text for any count that is used. And numbers are not the only basis which conclusions are reached about what has happened, the text of respondents' narratives are also very important sources.

That said, what interests me most at present are the emerging options for the analysis of the coded data. Data collation and analysis was initially based on a custom-designed Excel file, used because 99% of evaluators and program managers are already familiar with Excel. More recently, however, investment has been made in the development of a customised version of MicroStrategy, a free desktop data analysis and visualisation dashboard. This enables field researchers and evaluation clients to "slice and dice" the collated data in many different ways, without risk of damaging the underlying data, and with a minimal learning curve. One of the options within MicroStrategy is to visualise the relationships between all the identified drivers and outcomes as a network structure. This is of particular interest to me, and is something I have been exploring with the QuIP team and with Steve Powell, who has been working with them.

The network structure of driver and outcome relationships 


One way QuIP coded data has been tabulated is in the form of a matrix, where rows = drivers, columns = outcomes, and cell values = the number of reports of a connection between a given row and a given column (see the tables on pages 67 and 131). In these matrices we can see that some drivers affect multiple outcomes and some outcomes are affected by multiple drivers. By themselves, the contents of these matrices are not easy to interpret, especially as they get bigger. One matrix provided to me had 95 drivers, 80 outcomes and 254 linkages between them. Some form of network visualisation is an absolute necessity.
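As a minimal sketch of this tabulation step, the snippet below builds a driver x outcome matrix from a table of coded causal claims and converts it into a directed network. The column names and example rows are hypothetical, for illustration only, not the QuIP coding scheme itself.

```python
# Sketch: tabulate coded causal claims as a driver x outcome matrix, then turn
# that matrix into a directed network. All names below are hypothetical.
import pandas as pd
import networkx as nx

claims = pd.DataFrame([
    # one row per coded causal claim (driver -> outcome) per respondent
    {"respondent": "R01", "driver": "Alternative income", "outcome": "Increased income"},
    {"respondent": "R02", "driver": "Alternative income", "outcome": "Increased income"},
    {"respondent": "R02", "driver": "Rain / recovering from drought", "outcome": "Healthier livestock"},
    {"respondent": "R03", "driver": "Healthier livestock", "outcome": "Increased income"},
])

# rows = drivers, columns = outcomes, cell values = number of reported connections
matrix = pd.crosstab(claims["driver"], claims["outcome"])
print(matrix)

# the same matrix as a directed graph: one edge per non-zero cell, weighted by its count
g = nx.DiGraph()
for driver, row in matrix.iterrows():
    for outcome, count in row.items():
        if count > 0:
            g.add_edge(driver, outcome, weight=int(count))
print(g.edges(data=True))
```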

Figure 1 is a network visualisation of the 254 connections between drivers and outcomes. The red nodes are reported drivers and the blue nodes are the outcomes they have reportedly led to. Green nodes are outcomes that have in turn been drivers of other outcomes (I have deliberately left the node labels off in this example). While this was generated using Ucinet/Netdraw, the same structure can also be generated in MicroStrategy.


Figure 1

It is clear from a quick glance that there is still more complexity here than can easily be made sense of, most notably in the "hairball" on the right.

One way of partitioning this complexity is to focus on a specific "ego network" of interest. An ego network is a combination of (a) an outcome, plus (b) all the other drivers and outcomes it is immediately linked to, plus (c) the links between those. MicroStrategy already provides (a) and (b) but could probably be tweaked to also provide (c). In Ucinet/Netdraw it is also possible to define the width of the ego network, i.e. how many links out from the ego to go when collecting connections to, and between, its "alters". Figure 2 shows one ego network that can be seen in the dense cluster in Figure 1.

Figure 2


Within this selective view, we can see more clearly the different causal pathways to an outcome. There are also a number of feedback loops here, between pairs of outcomes (3) and among larger groups of outcomes (2).

PS: Ego networks can be defined in relation to a node representing either a driver or an outcome. If a driver is selected as the ego, the resulting network view provides an "effects of a cause" perspective; if an outcome is selected, it provides a "causes of an outcome" perspective.
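For those working outside Ucinet/Netdraw or MicroStrategy, here is a minimal sketch of both perspectives, assuming a directed driver-to-outcome graph g like the one built in the earlier snippet; the function name and defaults are my own, not part of any QuIP tooling.

```python
# Sketch: extract an ego network around a chosen node, with a configurable
# radius, from a directed driver -> outcome graph g (see earlier sketch).
import networkx as nx

def ego_view(g, ego, radius=1, causes_of=True):
    # causes_of=True looks "upstream" (the causes of an outcome);
    # causes_of=False looks "downstream" (the effects of a cause)
    graph = g.reverse(copy=True) if causes_of else g
    ego_net = nx.ego_graph(graph, ego, radius=radius)  # keeps ego, alters, and the links between them
    return ego_net.reverse(copy=True) if causes_of else ego_net

# e.g. causes_of_income = ego_view(g, "Increased income", radius=1, causes_of=True)
```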

Understanding specific connections


Each of the links in the above diagrams, and in the source matrices, has a specific "strength" based on how that relationship was coded. In the above example, these values were "citation counts", meaning one count per domain per respondent. Associated with each of these counts are the text sources, which can shed more light on what those connections meant.

What is missing from the above diagrams is numerical information about the nodes, i.e. the frequency of mentions of drivers and outcomes. The same is the case for the tabulated data examples in the book (pages 67, 131). But that data is within reach.

Here in Figure 7 and 8 is a simplified matrix. and associated network diagram, taken from this publication: QuIP and the Yin/Yang of Quant and Qual: How to navigate QuIP visualisations"

In Figure 7 I  have added row and column summary values in red, and copied these in white on to the respective nodes in Figure 8. These provide values for the nodes, as distinct from the connections between them.


Why bother? Link strength by itself is not all that meaningful. Link strengths need to be seen in context, specifically: (a) how often the associated driver was reported at all, and (b) how often the associated outcome was reported at all.  These are the numbers in white that I have added to the nodes in the network diagram above.

Once this extra information is provided we can insert it into a Confusion Matrix and use it to generate two items of missing information: (a) the number of False Positives (in the top right cell), and (b) the number of False Negatives (in the bottom left cell). In Figure 3, I have used Confusion Matrices to describe two of the links in the Figure 8 diagram.


Figure 3
It now becomes clear that there is an argument for saying that the link with a value of 5, between "Alternative income" and "Increased income", is more important than the link with a value of 14, between "Rain/recovering from drought" and "Healthier livestock".

The reason? Although the "Rain/recovering from drought" link looks stronger (14 versus 5), there is more chance that the expected outcome will occur when "Alternative income" is present. With "Rain/recovering from drought" the expected outcome happens in only 14 of 33 cases, whereas with "Alternative income" it happens in 5 of 7 cases.
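The arithmetic behind these comparisons can be sketched as below, using the 14 of 33 and 5 of 7 counts quoted above and the 73 "Increased income" total mentioned a little further below. The "Healthier livestock" outcome total and the overall case count are hypothetical placeholders, needed only to complete the remaining cells.

```python
# Sketch of the Confusion Matrix arithmetic: cells derived from a link's
# citation count plus the two node totals. outcome_total=50 and n_total=120
# below are hypothetical placeholders; the other numbers are quoted in the text.
def confusion_cells(link, driver_total, outcome_total, n_total):
    tp = link                      # driver and outcome both reported
    fp = driver_total - link       # driver reported, outcome not (top right cell)
    fn = outcome_total - link      # outcome reported, driver not (bottom left cell)
    tn = n_total - tp - fp - fn    # neither reported
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

rain_to_livestock = confusion_cells(link=14, driver_total=33, outcome_total=50, n_total=120)
alt_income_to_income = confusion_cells(link=5, driver_total=7, outcome_total=73, n_total=120)

# chance that the expected outcome occurs when the driver is present
print(14 / 33)  # ~0.42 for Rain/recovering from drought -> Healthier livestock
print(5 / 7)    # ~0.71 for Alternative income -> Increased income
```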

When this analysis is repeated for all the links where there is data (6 in Figure 8 above), it turns out that only two links are of this kind, where the outcome is more likely to be present when the driver is present. The second is the link between "Increased price of livestock" and "Increased income", as shown in the Confusion Matrix in Figure 4 below.

Figure 4
There are some other aspects of this kind of analysis worth noting. When "Increased price of livestock" is compared to the link discussed above (Alternative income...), it accounts for a bigger proportion of the cases where the outcome was reported, i.e. 11/73 versus 5/73.

One can also imagine situations where the top right cell (False Positive) is zero. In this case the driver appears to be sufficient for the outcome, i.e. where it is present the outcome is present. And one can imagine situations where the bottom left cell (False Negative) is zero. In this case the driver appears to be necessary for the outcome, i.e. where it is not present the outcome is also not present.


Filtered visualisations using Confusion Matrix data



When data from a Confusion Matrix is available, this provides analysts with additional options for generating filtered views of the network of reported causes (a sketch of both filters follows the list below). These are:

  1. Show only those connected drivers which seem to account for most instances of a reported outcome, i.e. where the number of True Positives (top left cell) exceeds the number of False Negatives (bottom left cell).
  2. Show only those connected drivers which are more often associated with the presence of a reported outcome (the True Positive, top left cell) than with its absence (the False Positive, top right cell).
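Here is a minimal sketch of those two filters, together with the two special cases noted earlier (apparent sufficiency and apparent necessity). It assumes each link is described by Confusion Matrix cells like those returned by the confusion_cells() sketch above.

```python
# Sketch: link filters based on Confusion Matrix cells (a dict with keys
# "TP", "FP", "FN", "TN", as in the earlier confusion_cells() sketch).
def keep_if_tp_exceeds_fn(cells):
    # filter 1: the driver accounts for most instances of the reported outcome
    return cells["TP"] > cells["FN"]

def keep_if_tp_exceeds_fp(cells):
    # filter 2: the driver is more often associated with the outcome's presence than its absence
    return cells["TP"] > cells["FP"]

def appears_sufficient(cells):
    # driver present -> outcome always present
    return cells["FP"] == 0

def appears_necessary(cells):
    # driver absent -> outcome always absent
    return cells["FN"] == 0

# e.g. filtered_links = {name: c for name, c in all_links.items() if keep_if_tp_exceeds_fp(c)}
```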

Drivers accounting for the majority of instances of the reported outcome


Figure 5 is a filtered version of the blue, green and red network diagram shown in Figure 1 above. A filter has retained links where the True Positive value in the top left cell of the Confusion Matrix (i.e. the link value) is greater than the associated False Negative value in the bottom left cell. This presents a very different picture to the one in Figure 1.

Figure 5


Key: Red nodes = drivers, Blue nodes = outcomes, Green nodes = outcomes that were also in the role of drivers.

Drivers more often associated with the presence of a reported outcome than its absence


In Figure 6 below, a filter has retained only those links where the True Positive value in the top left cell of the Confusion Matrix (i.e. the link value) is greater than the associated False Positive value in the top right cell.

Figure 6

Other possibilities


While there are many interesting possibilities for how to analyse QuIP data, one option does not yet seem available: the possibility of identifying instances of "configurational causality". By this I mean packages of causes that must be jointly present for an outcome to occur. When we look at the rows in Figure 7, it seems we have lists of single causes, each of which can account for some of the instances of the outcome of interest. And when we look at Figures 2 and 8, we can see that there is more than one way of achieving an outcome. But we can't easily identify any "causal packages" that might be at work.

I am wondering to what extent this is a limitation built into the coding process, or whether better use could be made of the existing coded information. Perhaps the way the row and column summary values are generated in Figure 7 needs rethinking.

The existing network diagrams provide no information about which connections were reported by whom. In Figure 2, take the links into "Increased income" from "Part of organisation x project" and "Social Cash Transfer (Gov)". Each of these links could have been reported by a different set of people, or they could have been reported by the same set of people. If the latter, this could be an instance of "configurational causality". To be more confident, we would need to establish that people who reported only one of the two drivers did not also report the outcome.

Because all QuIP coded values can be linked back to specific sources and their texts, it seems that this sort of analysis should be possible. But it will take some programming work to make this kind of analysis quick and easy.

PS 1: Actually maybe not so difficult. All we need is a matrix where:

  • Rows = respondents
  • Columns = drivers & outcomes mentioned by respondents
  • Cell values = 1 or 0, indicating whether the row respondent did or did not mention that column driver or outcome
Then use QCA, or EvalC3, or other machine learning software, to find predictable associations between one or more drivers and any outcome of interest. Then check these associations against the text details of each mention, to see whether a causal role is referred to and is plausible.
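As a minimal sketch of what that matrix and a first-pass search could look like (before handing over to QCA or EvalC3), the snippet below brute-forces pairs of drivers and checks how often their joint presence is accompanied by the outcome. The respondent IDs and data values are invented for illustration.

```python
# Sketch: respondent x mention matrix (1/0), plus a brute-force search for
# pairs of drivers whose joint presence co-occurs with an outcome of interest.
# All data values and respondent IDs below are invented.
from itertools import combinations
import pandas as pd

mentions = pd.DataFrame(
    {
        "Part of organisation x project": [1, 1, 0, 1, 0, 0],
        "Social Cash Transfer (Gov)":     [1, 1, 1, 0, 0, 1],
        "Alternative income":             [0, 0, 1, 0, 1, 0],
        "Increased income":               [1, 1, 0, 0, 1, 0],  # the outcome of interest
    },
    index=[f"R{i:02d}" for i in range(1, 7)],
)

outcome = "Increased income"
drivers = [c for c in mentions.columns if c != outcome]

for a, b in combinations(drivers, 2):
    both_present = (mentions[a] == 1) & (mentions[b] == 1)
    if both_present.any():
        # proportion of respondents mentioning both drivers who also mention the outcome
        rate = mentions.loc[both_present, outcome].mean()
        print(f"{a} AND {b}: outcome mentioned in {rate:.0%} of joint cases")
```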

That said, the text evidence would not necessarily provide the last word (pun unintended). It is possible that a respondent may mention various driver-outcome relationships, e.g. A>B, C>D, and A>E, but not C>E. Yet, when analysing data from multiple respondents, we might find a consistent co-presence of references to C and E (though no report of an actual causal relationship between them). The explanation may simply be that in the confines of a specific interview there was not the time, or the inclination, to mention this additional specific relationship.

In response...

James Copestake has elaborated on this final section as follows: "We have discussed this a lot, and I agree it is an area in need of further research. You suggest that there may be causal configurations in the source text which our coding system is not yet geared up to tease out. That may be true and is something we are working on. But there are two other possibilities. First, that interviewers and interviewing guidelines are not primed as much as they could be to identify these. Second, respondents' narrative habits (linked to how they think and the language at their disposal) may constrain people from telling configurational stories. This means the research agenda for exploring this issue goes beyond looking at coding."

PS: Also of interest: Attributing development impact: lessons from road testing the QuIP. James Copestake, January 2019

Saturday, May 18, 2019

Evaluating innovation...



Earlier this week I sat in on a very interesting UKES 2019 Evaluation Conference presentation "Evaluating grand challenges and innovation" by Clarissa Poulson and Katherine May (of IPE Tripleline).

The difficulty of measuring and evaluating innovation reminded me of similar issues I struggled with many decades ago when doing the Honours year of my Psychology degree, at ANU. I had a substantial essay to write on the measurement of creativity! My faint memory of this paper is that I did not make much progress on the topic.

After the conference, I did a quick search to find how innovation is defined and measured. One distinction that is often made is between invention and innovation. It seems that innovation = invention + use.  The measurement of the use of an invention seems relatively unproblematic. But if the essence of the invention aspect of innovation is newness or difference, then how do you measure that?

While listening to the conference presentation I thought there were some ideas that could be usefully borrowed from work I am currently doing on the evaluation and analysis of scenario planning exercises. I made a presentation on that work in this year's UKES conference (PowerPoint here).

In that presentation, I explained how participants' text contributions to developing scenarios (developed in the form of branching storylines) could be analyzed in terms of their diversity. More specifically, three dimensions of diversity, as conceptualised by Stirling (1998):
  • Variety: Numbers of types of things 
  • Balance: Numbers of cases of each type 
  • Disparity: Degree of difference between each type 
Disparity seemed to be the hardest to measure, but there are measures used within the field of Social Network Analysis (SNA) that can help. In SNA the distance between actors, or other kinds of nodes in a network, is measured in terms of "geodesic distance", i.e. the number of links on the shortest path between any two nodes of interest. There are various forms of distance measure, but one simple one is "Closeness", here taken as the sum of the geodesic distances between a node and all other nodes in the network (Borgatti et al., 2018). This suggested to me one possible way forward in measuring the newness aspect of an innovation.

Perhaps counter-intuitively, one would ask the inventor/owner of an innovation to identify which other product, in a particular population of products, their product is most similar to. All other unnamed products would be, by definition, more different. Repeating this question for all owners of the products in the population would generate what SNA people call an "adjacency matrix", where a cell value (1 or 0) tells us whether or not a specific row item is seen as most similar to a specific column item. Such a matrix can then be visualised as a network structure, and closeness values can be calculated for all nodes in that network using SNA software (I use UCINET/Netdraw). Some nodes will be less close to other nodes than others, and that is a measure of their difference, or "disparity".
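The same calculation can be sketched outside UCINET, assuming each owner's single "most similar" nomination is treated as an undirected tie. The product names below are hypothetical, and "Closeness" is used here in the sense defined above, i.e. the sum of geodesic distances.

```python
# Sketch: build a network from "most similar product" nominations and compute
# each node's sum of geodesic distances to all other nodes (the "Closeness"
# measure used in this post). Product names are hypothetical.
import networkx as nx

most_similar_to = {
    "Product A": "Product B",
    "Product B": "Product A",
    "Product C": "Product B",
    "Product D": "Product C",
    "Product E": "Product D",
}

g = nx.Graph()
g.add_edges_from(most_similar_to.items())  # nominations treated as undirected ties

for node in g.nodes:
    lengths = nx.shortest_path_length(g, source=node)  # geodesic distances from this node
    print(node, sum(lengths.values()))                 # larger sum = more different from the rest
```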

Here is a simulated example, generated using UCINET. The blue nodes are the products. Larger blue nodes are more distant, i.e. more different, from all the other nodes. Node 7 has the largest Closeness measure (28), i.e. it is the most different, whereas node 6 has the smallest Closeness measure (18), i.e. it is the least different.

There are two other advantages to this kind of network perspective. The first is that it is possible to identify the level of diversity in the population as a whole: SNA software can calculate the average closeness of all nodes in a network to all others. Here is an example of a network where nodes are much more distant from each other than in the example above.


The second advantage is that a network visualisation, like the first one above, makes it possible to identify any clusters of products, i.e. products that are each most similar to each other. No example is shown here, but you can imagine one!

So, three advantages of this measurement approach:
1. Identification of how relatively different a given product or process is
2. Identification of diversity in a whole population of products
3. Identification of types of differences (clusters of self-similar products) within that population.

Having identified a means of measuring degrees of newness or difference (and perhaps categorising types of these), the correlation between these and different forms of product usage could then be explored.

PS: I will add a few related papers of interest here:

Measuring multidimensional novelty


Sometimes a new entity may be novel in multiple respects, but in each respect only when compared to a different entity. For example, I have recently reviewed how my participatory scenario planning app ParEvo is innovative with respect to (a) its background theory, (b) how it is implemented, and (c) how the results are represented. In each area there was a different "most similar" comparator practice.

The same network visualisation approach can be taken as above. The difference is that the new entity will have links to multiple existing entities, not just one, and the link to each entity will have a varying "weight", reflecting the number of attributes it shares with that entity. The aggregate value of the link weights for a genuinely novel entity will be lower than those of the existing entities.

Information on the nature of the shared attributes can be identified in at least two ways:
(a) content analysis of the entities, if they are bodies of text (as in my own recent examples)
(b) card/pile sorting of the entities by multiple respondents

In both cases this will generate a matrix of data known as a two-mode network, in which rows represent entities and columns represent their attributes (from the content analysis) or their pile membership.
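A minimal sketch of the weighted-links idea is given below: a two-mode entity x attribute matrix, from which the link weight between the new entity and each existing practice is simply the count of attributes they share. Apart from ParEvo, the entity names and the attribute values are hypothetical.

```python
# Sketch: two-mode (entity x attribute) matrix, and weighted links from a new
# entity to each existing practice = number of shared attributes. Apart from
# ParEvo, all names and attribute values here are hypothetical.
import pandas as pd

two_mode = pd.DataFrame(
    [[1, 1, 0, 1],   # ParEvo (the new entity)
     [1, 0, 0, 1],   # existing practice P
     [0, 1, 1, 0],   # existing practice Q
     [0, 0, 1, 1]],  # existing practice R
    index=["ParEvo", "Practice P", "Practice Q", "Practice R"],
    columns=["attribute 1", "attribute 2", "attribute 3", "attribute 4"],
)

new_entity = "ParEvo"
others = two_mode.drop(index=new_entity)

# link weight to each existing entity = count of shared attributes
link_weights = others.dot(two_mode.loc[new_entity])
print(link_weights)
# a low aggregate weight, relative to the overlaps among the existing entities,
# suggests the new entity is comparatively novel
print(link_weights.sum())
```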

Novelty and Most Significant Change

The Most Significant Change (MSC) technique is a participatory approach to impact monitoring and evaluation, described in detail in the 2005 MSC Guide. The core of the approach is a question that asks "In your opinion, what was the most significant change that took place in ...[location]...over the last ...[time period]?" This is then followed up by questions seeking both descriptive details and an explanation of why the respondent thinks the change is most significant to them.

A common (but not essential) part of MSC use is a subsequent content analysis of the collected MSC stories of change. This involves identifying the different themes running through the stories, then coding the presence of these themes in each MSC story. One of the outputs will be a matrix, where rows = MSC stories, columns = themes, and cell values = the presence or absence of a particular column theme in a particular row story.

Such matrices can easily be imported into network analysis and visualisation software (e.g. Ucinet & Netdraw) and displayed as a network structure. Here the individual nodes represent individual MSC stories and individual themes, and the links show which story has which theme present (a two-mode matrix). The matrix can also be converted into two different types of one-mode matrix, in which (a) stories are connected to stories by the number of themes they share, and (b) themes are connected to themes by the number of stories they share.

Returning to the focus on novelty: looking at the stories-by-stories one-mode network, our attention should be on (a) story nodes on the periphery of the network, and (b) story nodes with a low total number of shared themes with other nodes (found by adding up their link values). Network software usually enables filtering by multiple means, including by link values, so this will help focus on nodes that have both characteristics.
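Reusing the projection step sketched earlier, the snippet below builds both one-mode matrices from a hypothetical stories x themes matrix and ranks stories by the total number of themes they share with other stories, as a rough screen for candidate novel stories.

```python
# Sketch: stories x themes matrix -> both one-mode projections, then rank
# stories by total shared themes (low totals = candidate "novel" stories).
# Story and theme labels, and the cell values, are hypothetical.
import pandas as pd

stories_by_themes = pd.DataFrame(
    [[1, 1, 0],
     [1, 1, 1],
     [0, 0, 1],
     [1, 0, 0]],
    index=["Story 1", "Story 2", "Story 3", "Story 4"],
    columns=["Theme A", "Theme B", "Theme C"],
)

stories_by_stories = stories_by_themes.dot(stories_by_themes.T)   # shared themes
themes_by_themes = stories_by_themes.T.dot(stories_by_themes)     # shared stories

for story in stories_by_stories.index:
    stories_by_stories.loc[story, story] = 0   # ignore each story's overlap with itself

# stories sharing the fewest themes with other stories come first
print(stories_by_stories.sum(axis=1).sort_values())
```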

I think this kind of analysis could add a lot of value to the use of MSC as a means of searching for significant forms of change, in addition to the participatory analytic process already built into the MSC process.


Thursday, March 28, 2019

Where there is no (decent / usable) Theory of Change...



I have been reviewing a draft evaluation report in which two key points are made about the relevant Theory of Change:

  • A comprehensive assessment of the extent to which expected outcomes were achieved (effectiveness) was not carried out, as the xxx TOC defines these only in broad terms.
  •  ...this assessment was also hindered by the lack of a consistent outcome monitoring system.
I am sure this situation is not unique to this program. 

Later in the same report, I read about the evaluation's sampling strategy. As with many other evaluations I have seen, the aim was to sample a diverse range of locations in a way that was maximally representative of the diversity of how and where the program was working. This is quite a common approach, and a reasonable one at that.

But it did strike me later that this intentionally diverse sample was an underexploited resource. If 15 different locations were chosen, one could imagine a 15 x 15 matrix. Each cell in the matrix could be used to describe how a row location compares to a column location. In practice only half the matrix would be needed, because each relationship would otherwise appear twice: row location A's relation to column location J is also covered by row location J's relation to column location A.

What sort of information would go in such cells? Obviously, there could be a lot to choose from. But one option would be to ask key stakeholders, especially those funding and/or managing any two compared locations. I would suggest they be asked something like this:
  • "What do you think is the most significant difference between these two locations/projects, in the ways they are working?"
And then ask a follow-up question...
  • "What difference do you think this difference will make?"
The answers are potential (if...then...) hypotheses, worth testing by an evaluation team. In a matrix generated by a sample of 15 locations, this exercise could generate ((15 x 15) - 15) / 2 = 105 potentially useful hypotheses, which could then be subject to a prioritisation/filtering exercise. That exercise should include consideration of their evaluability (Davies, 2013): more specifically, how they relate to any Theory of Change, whether relevant data is available, and whether any stakeholders are interested in the answers.
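As a small sketch of the mechanics, the snippet below generates the 105 unordered pairs implied by a 15-location sample and attaches the two questions to each pair; the location names are placeholders.

```python
# Sketch: enumerate the unordered pairs of sampled locations and attach the two
# comparison questions to each pair. Location names are placeholders.
from itertools import combinations

locations = [f"Location {i}" for i in range(1, 16)]   # 15 sampled locations
pairs = list(combinations(locations, 2))
print(len(pairs))   # ((15 x 15) - 15) / 2 = 105

questions = [
    "What do you think is the most significant difference between these two "
    "locations/projects, in the ways they are working?",
    "What difference do you think this difference will make?",
]
interview_guide = [(a, b, questions) for a, b in pairs]
```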

Doing so might also help address a more general problem, which I have noted elsewhere (Davies, 2018), and which was also a characteristic of the evaluation mentioned above: the prevalence in evaluation ToRs of open-ended evaluation questions, rather than hypothesis-testing questions:
"While they may refer to the occurrence of specific outcomes or interventions, their phrasings do not include expectations about the particular causal pathways that are involved. In effect these open-ended questions imply that those posting the questions either know nothing, or they are not willing to put what they think they know on the table as testable propositions. Either way this is bad news, especially if the stakeholders have any form of programme funding or programme management responsibilities. While programme managers are typically accountable for programme implementation, it seems they and their donors are not being held accountable for accumulating testable knowledge about how these programmes actually work. Given the decades-old arguments for more adaptive programme management, it's about time this changed (Rondinelli, 1993; DFID, 2018)." (Davies, 2018)



Saturday, March 09, 2019

On using clustering algorithms to help with sampling decisions



I have spent the last two days in a training workshop run by BigML, a company that provides very impressive, coding-free, online machine learning services. One of the sessions was on the use of clustering algorithms, an area I have some interest in, but have not done much with, over the last year or so. The whole two days were very much centered around data and the kinds of analyses that could be done using different algorithms, and with more aggregated workflow processes.

Independently, over the previous two weeks, I have had meetings with the staff of two agencies in two different countries, both at different stages of carrying out an evaluation of a large set of their funded projects. By large, I mean 1000+ projects. One is at the early planning stage, the other is now in the inception stage. In both evaluations, the question of what sort of sampling strategy to use was a real concern.

My most immediate inclination was to think of using a stratified sampling process, where the first unit of analysis would be the country, then the projects within each country. In one of the two agencies, the projects were all governance related, so an initial country level sampling process seemed to make a lot of sense. Otherwise, the governance projects would risk being decontextualized. There were already some clear distinctions between countries in terms of how these projects were being put to work, within the agency's country strategy. These differences could have consequences. The articulation of any expected consequences could provide some evaluable hypotheses, giving the evaluation a useful focus, beyond the usual endless list of open-ended questions typical of so many evaluation Terms of Reference.

This led me to speculate on other ways of generating such hypotheses. Such as getting key staff managing these projects to do pile/card sorting exercises to sort countries, then projects, into pairs of groups, separated by a difference that might make a difference. These distinctions could reflect ideas embedded in an overarching theory of change, or more tacit and informal theories in the heads of such staff, which may nevertheless still be influential because they were operating (but perhaps untested) assumptions. They would provide other sources of what could be evaluable hypotheses.

However, regardless of whether it was the result of a systematic project document review or of pile sorting exercises, you could easily end up with many different attributes that could be used to describe projects and then be used as the basis of a stratified sampling process. One evaluation team seemed to be facing this challenge right now: struggling to decide which attributes to choose. (PS: this problem can arise either from having too many theories or from having no theory at all.)

This is where clustering algorithms, like K-means clustering, could come in handy. On the BigML website you can upload a data set (e.g. projects with their attributes) and then do a one-click cluster analysis. This will find clusters of projects, and the analysis has a number of interesting features: (a) similarity within clusters is maximised, (b) dissimilarity between clusters is maximised and visualised, and (c) it is possible to identify what are called "centroids", i.e. the specific attributes which are most central to the identity of a cluster.

These features are relevant to sampling decisions. A sample from within a cluster will have a high level of generalisability within that cluster, because all cases within the cluster are maximally similar. Secondly, other clusters can be found which range in their degree of difference from that cluster. This is useful if you want to find two contrasting clusters that might capture a difference that makes a difference.
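This is not BigML's implementation, but a minimal sketch of the same clustering step using scikit-learn's KMeans, with randomly generated project attribute data purely for illustration. The inter-centroid distance matrix at the end is one way of picking maximally different (or maximally similar) clusters to sample from.

```python
# Sketch: K-means clustering of project attribute data, then distances between
# cluster centroids to help choose contrasting clusters. The data is random,
# purely for illustration; BigML's one-click clustering is a separate tool.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
projects = rng.random((1000, 8))      # 1000 projects x 8 (scaled) attributes

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(projects)
labels = kmeans.labels_               # cluster membership for each project
centroids = kmeans.cluster_centers_   # the attribute profile most central to each cluster

# pairwise distances between centroids: large values = maximally different clusters,
# small non-zero values = maximally similar clusters
print(cdist(centroids, centroids))
```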

I can imagine two types of analysis that might be interesting here:
1. Find a maximally different cluster (A and B) and see if a set of attributes found to be associated with an outcome of interest in A is also present in B. This might be indicative of how robust that association is
2. Find a maximally similar pair of clusters (A and C) and see whether incremental alterations to a set of attributes associated with an outcome in A mean that the outcome is no longer found to be associated in C. This might be indicative of how significant each attribute is.

These two strategies could be read as (1) Vary the context, (2) Vary the intervention

For more information, check out this BigML video tutorial on cluster analysis. I found it very useful

PS: I have also been exploring BigML's Association Rule facility. This could be very helpful as another means of analysing the contents of a given cluster of cases. The analysis generates a list of attribute associations, ranked by different measures of their significance. Examining such a list could help evaluators widen their view of the possible causal configurations that are present.



Saturday, July 14, 2018

Two versions of the Design Triangle - for choosing evaluation methods


Here is one version, based on Stern et al (2012) BROADENING THE RANGE OF DESIGNS AND METHODS FOR IMPACT EVALUATIONS:


A year later, in a review of the literature on the use of evaluability assessments, I proposed a similar but different version:



In this diagram, "Evaluation Questions" are subsumed within the wider category of "Stakeholder demands". "Programme Attributes" have been disaggregated into "Project Design" (especially Theory of Change) and "Data Availability". "Available Designs" in effect disappears into the background; if there were a 3D version, it would sit behind "Evaluation Design".

Wednesday, July 19, 2017

Transparent Analysis Plans


Over the past years, I have read quite a few guidance documents on how to do M&E. Looking back at this literature, one thing that strikes me is how little attention is given to data analysis, relative to data collection. There are gaps both in (a) guidance on "how to do it" and (b) guidance on how to be transparent and accountable for what you planned to do and then actually did. In this blog post, I want to provide some suggestions that might help fill that gap.

But first a story, to provide some background. In 2015 I did some data analysis for a UK consultancy firm. They had been managing a "Challenge Fund", a grant-making facility funded by DFID, for the previous five years, and in the process had accumulated lots of data. When I looked at the data I found approximately 170 fields. There were many different analyses that could be made from this data, even bearing in mind the one approach we had discussed and agreed on: the development of some predictive models concerning the outcomes of the funded projects.

I resolved this by developing a "data analysis matrix", seen below. The categories in the left column and top row refer to different sub-groups of fields in the data set. The cells refer to the possibility of analyzing the relationship between the row sub-group of data and the column sub-group of data. The colored cells are those data relationships the stakeholders decided would be analyzed, and the initials in the cells refer to the stakeholder wanting that analysis. Equally importantly, the blank cells indicate what will not be analyzed.

We added a summary row at the bottom and a summary column to the right. The cells in the summary row signal the relative importance given to the events in each column. The cells in the summary column signal the relative confidence in the quality of the data available in the row sub-groups. Other forms of meta-data could also be provided in such summary rows and columns, which could help inform stakeholders' choices about which relationships between the data should be analyzed.



A more general version of the same kind of matrix can be used to show the different kinds of analysis that can be carried out with any set of data. In the matrices below, the row and column letters refer to different variables / attributes / fields in a data set. There are three main types of analysis illustrated in these matrices, and three sub-types:
  • Univariate - looking at one measure only
  • Bivariate - looking at the relationships between two measures
  • Multivariate - looking at the relationship between multiple measures
But within the multivariate option there are three alternatives, looking at:
    • Many to one relationships
    • One to many relationships
    • Many to many relationships

On the right side of each matrix below, I have listed some of the forms of each kind of analysis.

What I am proposing is that studies or evaluations that involve data collection and analysis should develop a transparent analysis plan, using a "data analysis matrix" of the kind shown above. At a minimum, cells should contain information about which relationships will be investigated. This does not mean investigators can't change their minds later on as the study or evaluation progresses. But it does mean that both original intentions and final choices will be more visible and accountable.
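As a minimal sketch of how such a plan could be recorded, the snippet below sets up a small matrix with hypothetical field sub-groups and stakeholder initials; filled cells name who asked for that row-by-column analysis, and empty cells make explicit what will not be analyzed.

```python
# Sketch: a "data analysis matrix" recorded as a simple table. The sub-group
# names and stakeholder initials below are hypothetical.
import pandas as pd

sub_groups = ["Grantee profile", "Project activities", "Outputs", "Outcomes"]
plan = pd.DataFrame("", index=sub_groups, columns=sub_groups)

plan.loc["Grantee profile", "Outcomes"] = "RD"        # analysis requested by one stakeholder
plan.loc["Project activities", "Outputs"] = "RD, KM"  # analysis requested by two stakeholders
print(plan)   # empty cells = relationships that will not be analyzed
```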


Postscript: For details of the study mentioned above, see LEARNING FROM THE CIVIL SOCIETY CHALLENGE FUND: PREDICTIVE MODELLING Briefing Paper. September 2015

Monday, October 31, 2016

...and then a miracle happens (or two or three)


Many of you will be familiar with this cartoon, used in many texts on the use of Theories of Change.
If you look at diagrammatic versions of Theories of Change you will see two types of graphic elements: nodes and links between the nodes. Nodes are always annotated, describing what is happening at that point in the process of change. But the links between nodes are typically not annotated with any explanatory text. Occasionally (10% of the time in the first 300 pages of Funnell and Rogers' book on Purposeful Program Theory) the links might be of different types, e.g. thick versus thin lines, or dotted versus continuous lines. The links tell us there is a causal connection, but they rarely tell us what kind of causal connection is at work. In that respect the point of Sidney Harris's cartoon applies to a large majority of graphic representations of Theories of Change.

In fact there are two types of gaps that should be of concern. One is the nature of the individual links between nodes. The other is how a given set of links converging on a node works as a group, or not, as the case may be. Here is an example from the USAID Learning Lab web page. Look at the brown node in the centre, influenced by the six green events below it.

In this part of the diagram there are a number of possible ways of interpreting the causal relationships between the six green events and the brown event they all connect to:

The first three are binary possibilities, where events either are or are not important:

1. Some or all of these events are necessary for the brown event to occur.
2. Some or all of the events are sufficient for the brown event to occur.
3. None of the events are individually necessary or sufficient, but two or more combinations of them are sufficient.

The last two are more continuous:
4. The more of these events that are present (and the more of each of them), the more the brown event will be present.
5. The relationship may not be linear, but exponential, s-shaped, or a more complex polynomial shape (likely if there are feedback loops present).

These various possibilities have different implications for how this part of the Theory of Change could be evaluated. Individually necessary or sufficient events will be relatively easy to test for. Finding combinations that are necessary or sufficient will be more challenging, because there are potentially many of them (2^6 = 64 in the above case); a sketch of such a search follows below. Likewise, finding linear and other kinds of continuous relationships would require more sophisticated measurement. Michael Woolcock (2009) has written on the importance of thinking through what kinds of impact trajectories our various contextualised Theories of Change might suggest we will find in this area.
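Here is a minimal sketch of what such a combination search involves, using invented case data purely to show the mechanics of an "apparently sufficient" check across the subsets of the six events.

```python
# Sketch: enumerate combinations of the six contributing events and flag those
# that appear sufficient for the brown event, i.e. whenever all events in the
# combination are present, the brown event is also present. Case data invented.
from itertools import combinations

events = ["E1", "E2", "E3", "E4", "E5", "E6"]
cases = [
    # (events observed in the case, did the brown event occur?)
    ({"E1", "E2", "E3"}, True),
    ({"E1", "E2"}, True),
    ({"E2", "E4"}, False),
    ({"E5", "E6"}, False),
]

def appears_sufficient(combo, cases):
    outcomes = [occurred for observed, occurred in cases if combo <= observed]
    return bool(outcomes) and all(outcomes)

for size in range(1, len(events) + 1):
    for combo in combinations(events, size):       # 2^6 - 1 = 63 non-empty subsets
        if appears_sufficient(set(combo), cases):
            print("apparently sufficient:", combo)
```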

Of course the gaps I have pointed out are only one part of the larger graphic Theory of Change shown above. The brown event is itself only one of a number of inputs into other events shown further above, where the same question arises about how they variously combine.

So, it turns out that Sidney Harris's cartoon is really a gentle understatement of how much more we need to specify before we have an evaluable Theory of Change on our hands.

Tuesday, August 09, 2016

Three ways of thinking about linearity



Describing change in "linear" terms is seen as bad form these days. But what does this term linear mean? Or perhaps more usefully, what could it mean?

In its simplest sense it just means one thing happening after another, as in a Theory of Change that describes an Activity leading to an Output leading to an Outcome leading to an Impact. Until time machines are invented, we can't escape from this form of linearity.

Another perspective on linearity is captured by Michael Woolcock's 2009 paper on different kinds of impact trajectories. One of these is linear: for every x increase in an output there is a y increase in impact. In a graph plotting outputs against impacts, the relationship appears as a straight line. Woolcock's point was that there are many other shapes of relationship to be seen in different development projects. Some might be upwardly curving, reflecting exponential growth arising from some form of feedback loop, whereby increased impact facilitates increased outputs. Others may be much less ordered in appearance, as various contending social forces magnify and moderate a project's output-to-impact relationship, with the balance of their influences changing over time. Woolcock's main point, if I recall correctly, was that any attempt to analyse a project's impact has to give some thought to the expected shape of the impact trajectory before planning how to collect and analyse evidence about the scale of impact and its causes.

The third perspective on linearity comes from computer and software design. Here the contrast is made between linear (serial) and parallel processing of data. With linear processing, all tasks are undertaken somewhere within a single sequence. With parallel processing, many tasks are undertaken at the same time, within different serial processes. The process of evolution is a classic example of parallel processing: each organism, in its interactions with its environment, is testing out the viability of a new variant of the species' genome. In development projects parallel processing is also endemic, in the form of different communities receiving different packages of assistance, and then making different uses of those packages, with resulting differences in the outcomes they experience.

In evaluation-oriented discussions of complexity thinking, a lot of attention is given to unpredictability arising from the non-linear nature of change over time, of the kind described by Woolcock. But it is important to note that there are various identifiable forms of change trajectory that lie in between simple linear trajectories and chaotic, unpredictable ones. Evaluation planning needs to think carefully about the whole continuum of possibilities here.

The complexity discussion gives much less attention to the third view of non-linearity, where diversity is the most notable feature. Diversity can arise not only from intentional and planned differences in project interventions but also from unplanned or unexpected responses to what may have been planned as standardized interventions. My experience suggests that all too often the assumption is made, at least tacitly, that interventions have been delivered in a standardized manner. If instead the default assumption was heterogeneity, then evaluation plans would need to spell out how this heterogeneity would be dealt with. If this were done, evaluations might become more effective in identifying "what works in what circumstances", including identifying localized innovations that had potential for wider application.


Saturday, July 16, 2016

EvalC3 - an Excel-based package of tools for exploring and evaluating complex causal configurations


Over the last few years I have been exposed to two different approaches to identifying and evaluating complex causal configurations within sets of data describing the attributes of projects and their outcomes. One is Qualitative Comparative Analysis (QCA) and the other is Predictive Analytics (particularly Decision Tree algorithms). Both can work with binary data, which is easier to access than numerical data, but both require specialist software, which takes time and effort to learn how to use.

In the last year I have spent some time and money, in association with a software company called Aptivate (Mark Skipper in particular), developing an Excel-based package which will do many of the things that both of the above software packages can do, as well as provide some additional capacities that neither has.

This is called EvalC3, and it is now available free to people who are interested in testing it out, either using their own data and/or some example data sets that are available. The "manual" on how to use EvalC3 is a supporting website of the same name, found here: https://evalc3.net/ There is also a short introductory video here.

Its purpose is to enable users: (a) to identify sets of project & context attributes which are  good predictors of the achievement of an outcome of interest,  (b) to compare and evaluate the performance of these predictive models, and (c) to identify relevant cases for follow-up within-case investigations to uncover any causal mechanisms at work.

The overall approach is based on the view that "association is a necessary but insufficient basis for a strong claim about causation", which is a more useful perspective than simply saying "correlation does not equal causation". While the process involves systematic quantitative cross-case comparisons, its use should be informed by within-case knowledge at both the pre-analysis planning and post-analysis interpretation stages.
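This is not EvalC3 itself, but a minimal sketch of the kind of cross-case test it supports: a candidate prediction model is a set of binary attributes, a case "fires" the model if it has all of them, and performance is summarised in confusion-matrix terms. The attribute names and case data are invented.

```python
# Sketch (not EvalC3 itself): test a candidate prediction model, defined as a
# set of binary attributes, against binary case data. Attribute names and case
# data are invented for illustration.
cases = [
    # (attributes present in the case, was the outcome achieved?)
    ({"local_partner", "prior_funding"}, True),
    ({"local_partner"}, True),
    ({"prior_funding"}, False),
    (set(), False),
]

def test_model(model, cases):
    tp = sum(1 for attrs, outcome in cases if model <= attrs and outcome)
    fp = sum(1 for attrs, outcome in cases if model <= attrs and not outcome)
    fn = sum(1 for attrs, outcome in cases if not model <= attrs and outcome)
    tn = sum(1 for attrs, outcome in cases if not model <= attrs and not outcome)
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

print(test_model({"local_partner"}, cases))   # {'TP': 2, 'FP': 0, 'FN': 0, 'TN': 2}
```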

The EvalC3 tools are organised in a work flow as shown below:



The selling points:




  • EvalC3 is free, and distributed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
  • It uses Excel, which many people already have and know how to use
  • It uses binary data. Numerical data can be converted to binary, but not the other way around
  • It combines manual hypothesis testing with algorithm-based (i.e. automated) searches for well-performing predictive models
  • There are four different algorithms that can be used
  • Prediction models can be saved and compared
  • There are case-selection strategies for follow-up case comparisons to identify any causal mechanisms at work "underneath" the prediction models

If you would like to try using EvalC3 email rick.davies at gmail.com

Skype video support can be provided in some instances. i.e. if your application is of interest to me :-)

Monday, March 07, 2016

Why I am sick of (some) Evaluation Questions!


[Beginning of rant] Evaluation questions are a cop-out, and not only that, they are an expensive cop-out. Donors commissioning evaluations should not be posing lists of sundry open-ended questions about how their funded activities are working and/or having an impact.

They should have at least some idea of what is working (or not) and they should be able to  articulate these ideas. Not only that, they should be willing, and even obliged, to use evaluations to test those claims. These guys are spending public monies, and the public hopefully expects that they have some idea about what they are doing i.e. what works. [voice of inner skeptic: they are constantly rotated through different jobs, so probably don't have much idea about what is working, at all]

If open ended evaluation questions were replaced by specific claims or hypotheses then evaluation efforts could be much more focused and in-depth, rather than broad ranging and shallow. And then we might have some progress in the accumulation of knowledge about what works.

The use of swathes of open-ended evaluation questions also relates to the subject of institutional memory about what has worked in the past. The use of open-ended questions suggests that either little has been retained from the past, OR what has been retained is not deemed to be of any value. Alas and alack, all is lost, either way. [end of rant]

Background: I am reviewing yet another inception report, which includes a lot of discussion about how evaluation questions will be developed. Some example questions being considered:
How can we value ecosystem goods and services and biodiversity?  

How does capacity building for better climate risk management at the institutional level translate into positive changes in resilience?

What are the links between protected/improved livelihoods and the resilience of people and communities, and what are the limits to livelihood-based approaches to improving resilience?

Friday, March 04, 2016

Why we should also pay attention to "what does not work"


There is no shortage of research on poverty and how people become poor and often remain poor.

Back in the 1990s (ancient times indeed, at least in the aid world :-) a couple of researchers in Vietnam were looking at the nutrition status of children in poor households. In the process they came across a small number of households where the child was well nourished, despite the household being poor. These families' feeding practices were investigated and the lessons learned were then disseminated throughout the community. The existence of such positive outliers from a dominant trend was later called "positive deviance", and this subsequently became the basis of a large field of research and development practice. You can read more on the Positive Deviance Initiative website.

From my recent reading of the work done by those associated with this movement the main means that has been used to find positive deviance cases has been participatory investigations by the communities themselves. I have no problem with this.

But because I have been somewhat obsessed with the potential applications of predictive modeling over the last few years, I have wondered whether the search for positive deviance could be carried out on a much larger scale, using relatively non-participatory methods. More specifically, using data mining methods aimed at developing predictive models. Predictive models are association rules that perform well in predicting an outcome of interest: for example, that projects with x, y, z attributes in contexts with a, b, and c attributes will lead to project outcomes that are above average in achieving their objectives.

The core idea is relatively simple. As well as developing predictive models of what does work (the most common practice), we should also develop predictive models of what does not work. It is quite likely that many of these models will be imperfect, in the sense that there are likely to be some False Positives. In this type of analysis, FPs will be cases where the development outcome did take place, despite all the conditions being favorable to it not taking place. These are the candidate "Positive Deviants", which would then be worth investigating in detail via case studies, and it is at this stage that participatory methods of inquiry would be appropriate.

Here is a simple example, using some data collated and analysed by Krook in 2010 on factors affecting levels of women's participation in parliaments in Africa. Elsewhere in this blog I have shown how this data can be analysed using Decision Tree algorithms to develop predictors of when women's participation will be high versus low. I have re-presented the Decision Tree model below.
In this predictive model the absence of quotas for women in parliament is a good predictor of low levels of their participation in parliaments. 13 of the 14 countries with no quotas have low levels of women's participation. The one exception, the False Positive of this prediction rule and an example of "positive deviance", is the case of Lesotho, where despite the absence of quotas there is a (relatively) high level of women's participation in parliament. The next question is why this is so, and then whether the causes are transferable to other countries with no quotas for women. This avenue was not explored in the Krook paper, but it could be a practically useful next step.
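A minimal sketch of surfacing such cases is given below. The rule (no quota predicts low participation) is the one described above; apart from the Lesotho example quoted in the text, the country names and values are invented.

```python
# Sketch: surface "positive deviants" as the false positives of a
# "what does not work" rule (no quota -> low participation). Apart from the
# Lesotho example quoted in the text, the data below is invented.
import pandas as pd

countries = pd.DataFrame(
    {
        "quota": [0, 0, 0, 1, 1],
        "high_participation": [0, 0, 1, 1, 0],
    },
    index=["Country A", "Country B", "Lesotho", "Country D", "Country E"],
)

rule_fires = countries["quota"] == 0   # the rule predicts low participation for these cases
false_positives = countries[rule_fires & (countries["high_participation"] == 1)]
print(false_positives.index.tolist())  # candidate positive deviants, e.g. ['Lesotho']
```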

Postscript: I was pleased to see that the Positive Deviance Initiative website now has a section on the potential uses of predictive analytics (aka predictive modelling), and that they are seeking to pilot methods in this area with other interested parties.