Monday, June 14, 2021

Paired case comparisons as an alternative to a configurational analysis (QCA or otherwise)

[Take care, this is still very much a working draft! Criticisms and comments are welcome though.]

The challenge

The other day I was asked for advice on how to implement a QCA type of analysis within an evaluation that was already fairly circumscribed in its design, both by the commissioner and by the team proposing to carry out the evaluation. The commissioner had already indicated that they wanted a case study oriented approach and had even specified the maximum number of case studies that they wanted to see (ten). While the evaluation team could see the potential use of a QCA type of analysis, they were already committed to undertaking a process type evaluation and did not want a QCA type of analysis to dominate their approach. In addition, there already was a quite developed conceptual framework that included many different factors which might be contributory causes of the outcomes of interest.

As is often the case, there seemed to be a shortage of cases and an excess of potentially explanatory variables. In addition, there were doubts within the evaluation team as to whether a thorough QCA analysis would be possible or justifiable given the available resources and priorities.

Paired case comparisons as the alternative

My first suggestion to the evaluation team was to recognise that there is some middle ground between a cross-case analysis involving medium to large numbers of cases, and a within-case analysis. As described by Rihoux and Ragin (2009), a QCA analysis will use both, going back and forth, using one to inform the other, over a number of iterations. The middle ground between these two options is case comparison, particularly comparisons of pairs of cases. Although in the situation described above there will be a maximum of 10 cases that can be explored, the number of pairs of these cases that can be compared is still quite big (45). With these sorts of numbers, some strategy is needed for choosing the types of pairs of cases that will be compared. Fortunately there is already a large literature on case selection. My favourite summary is Gerring, J., & Cojocaru, L. (2015), Case-Selection: A Diversity of Methods and Criteria.

My suggested approach was to use what is known as a Confusion Matrix as the basis for structuring the choice of cases to be compared. A Confusion Matrix is a simple truth table, showing a combination of two sets of possibilities (rows and columns), and the incidence of those possibilities (cell values). For example, as follows:


Inside the Confusion Matrix are four types of cases: 
  1. True Positives: cases with attributes that fit my theory, and where the expected outcome is present
  2. False Positives: cases with attributes that fit my theory, but where the expected outcome is absent
  3. False Negatives: cases without the attributes that fit my theory, but where the outcome is nevertheless present
  4. True Negatives: cases without the attributes that fit my theory, and where the outcome is absent as expected
Both QCA and supervised machine learning approaches are good at identifying individual attributes (or packages of attributes) which are good predictors of when outcomes are present or absent, in other words at finding large numbers of True Positive and True Negative cases, and at flagging the incidence of exceptions: the False Positives and False Negatives. But this type of cross-case analysis did not seem to be available as an option to the evaluation team mentioned above.
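For readers who want to see the mechanics, here is a minimal sketch of how cases could be sorted into the four cells. The case data, attribute names, and the provisional "theory" are all invented for illustration; a real analysis would use the evaluation's own coded cases.

```python
# Minimal sketch: sorting cases into the four Confusion Matrix cells.
# The cases, attribute names, and the 'theory' below are invented for illustration.

cases = {
    "Case01": {"training": 1, "funding": 1, "outcome": 1},
    "Case02": {"training": 1, "funding": 0, "outcome": 0},
    "Case03": {"training": 0, "funding": 1, "outcome": 1},
    "Case04": {"training": 0, "funding": 0, "outcome": 0},
}

# A provisional theory: the outcome is expected wherever 'training' is present.
def fits_theory(attributes):
    return attributes["training"] == 1

def classify(attributes):
    predicted = fits_theory(attributes)
    actual = attributes["outcome"] == 1
    if predicted and actual:
        return "True Positive"
    if predicted and not actual:
        return "False Positive"
    if not predicted and actual:
        return "False Negative"
    return "True Negative"

for name, attrs in cases.items():
    print(name, classify(attrs))
```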

1. Starting with True Positives

So my suggestion has been to look at the 10 cases at hand, and to start by focusing on those cases where the outcome is present (the first column of the matrix). Focus on the case that is most similar to the others with the outcome present, because findings about this case may be more likely to apply to them (see below on measuring similarity). When examining that case, identify one or more attributes which are the most likely explanation for the outcome being present. Note that this initial theory comes from a single within-case analysis, not a cross-case analysis. The evaluation team will now have a single case in the True Positive category.

2. Comparing False Positives and True Positives

The next step in the analysis is to identify cases which can be provisionally described as False Positives. Start by finding a case where the outcome is absent. Does it have the same theory-relevant attributes as the True Positive? If so, retain it as a False Positive. Otherwise, move it to the True Negative category. Repeat this for all remaining cases where the outcome is absent. From among all those qualifying as False Positives, find one which is otherwise as similar as possible in all its other attributes to the True Positive case. This type of choice is called MSDO, standing for "most similar design, different outcome" (see the de Meur references below). See also the note below on how to measure this form of similarity.

The aim here is to find how the causal mechanisms at work differ. One way to explore this question is to look for an additional attribute that is present in the True Positive case but absent in the False Positive case, despite those cases otherwise being most similar. Or, an attribute that is absent in the True Positive but present in the False Positive case. In the former case the missing attribute could be seen as a kind of enabling factor, whereas in the latter case it could be seen as more like a blocking factor. If neither can be found by comparing the coded attributes of the cases, then a more intensive examination of raw data on the cases might still identify them, and lead to an update or elaboration of the theory behind the True Positive case. Alternatively, that examination might suggest measurement error is the problem and that the False Positive case needs to be reclassified as a True Positive.
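A rough sketch of this step 2 comparison, with invented case codings: among the candidate False Positives it picks the one most similar to the True Positive on its other attributes (the MSDO choice), then lists possible enabling and blocking factors.

```python
# Illustrative sketch of the step 2 comparison (MSDO): attribute names and codings
# are invented; similarity here is a simple count of matching attribute values.

true_positive = {"training": 1, "funding": 1, "local_partner": 1, "urban": 0, "outcome": 1}

false_positives = {
    "CaseA": {"training": 1, "funding": 1, "local_partner": 0, "urban": 0, "outcome": 0},
    "CaseB": {"training": 1, "funding": 0, "local_partner": 0, "urban": 1, "outcome": 0},
}

def similarity(a, b, ignore=("outcome",)):
    keys = [k for k in a if k not in ignore]
    return sum(a[k] == b[k] for k in keys)

# Most similar design, different outcome: the FP closest to the TP on its other attributes.
best_match = max(false_positives, key=lambda n: similarity(true_positive, false_positives[n]))
fp = false_positives[best_match]

enabling = [k for k in true_positive if k != "outcome" and true_positive[k] == 1 and fp[k] == 0]
blocking = [k for k in true_positive if k != "outcome" and true_positive[k] == 0 and fp[k] == 1]

print("Most similar False Positive:", best_match)
print("Possible enabling factors (present in TP, absent in FP):", enabling)
print("Possible blocking factors (absent in TP, present in FP):", blocking)
```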

3. Comparing False Negatives and True Positives

The third step in the analysis is to identify at least one relevant case which can be described as a False Negative. This False Negative case should be one that is as different as possible in all its attributes from the True Positive case. This type of choice is called MDSO, standing for "most different design, same outcome".

The aim here is to identify whether the same or a different causal mechanism is at work, compared to that seen in the True Positive case. One way to explore this question is to look for one or more attributes that both the True Positive and False Negative cases have in common, despite otherwise being "most different". If such an attribute is found, and if it is associated with the causal theory in the True Positive case, then the False Negative case can be reclassified as a True Positive. The theory describing the now two True Positive cases can then be seen as provisionally "necessary" for the outcome, until another False Negative case is found and examined in a similar fashion. If the causal mechanism seems to be different, the case remains a False Negative.

Both the second and third step comparisons described above will help to: (a) elaborate the details, and (b) establish the limits of the scope of the theory identified in step one. This suggested process uses the Confusion Matrix as a kind of very simple chess board, where pieces (aka cases) are introduced onto the board one at a time, and then sometimes moved to adjacent positions (depending on their relation to other pieces on the board), or the theory behind their chosen location is updated.

If there are only ten cases available to study, and these have an even distribution of outcomes present and absent, then this three-step process of analysis could be reiterated five times, i.e. once for each case where the outcome was present. This would involve up to 10 case comparisons, out of the 45 possible.

Measuring similarity

The above process depends on the ability to make systematic and transparent judgements about similarity. One way of doing this, which I have previously built into an Excel app called EvalC3, is to start by describing each case with a string of binary-coded attributes of the same kind as used in QCA and in some forms of supervised machine learning. An example set of workings can be seen in this Excel sheet, showing an imagined data set of 10 cases with 10 different attributes, and then the calculation and use of Hamming distance as the similarity measure to choose cases for the kinds of comparisons described above. That list of attributes, and the Hamming distance measures based on it, will likely need to be updated as the investigation of False Positives and False Negatives proceeds.
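The Hamming distance calculation itself is very simple. Here is a minimal sketch with an imagined set of 10 cases and 10 binary attributes (randomly generated here, purely for illustration):

```python
# A minimal sketch of the similarity measure described above: cases coded as
# strings of binary attributes (randomly generated for illustration), compared
# using Hamming distance (the number of positions at which two cases differ).

import random

random.seed(1)
cases = {f"Case{i:02d}": [random.randint(0, 1) for _ in range(10)] for i in range(1, 11)}

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# Distance from Case01 to every other case, sorted from most to least similar.
distances = sorted(
    (hamming(cases["Case01"], attrs), name) for name, attrs in cases.items() if name != "Case01"
)
for d, name in distances:
    print(f"{name}: Hamming distance {d}")
```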

Incidentally, the more attributes that have been coded per case, the more discriminating this kind of approach can become. This is in contrast to cross-case analysis, where an increase in the number of attributes per case is usually problematic.

Related sources

For some of my earlier thoughts on case comparative analysis see here. Those were developed for use within the context of a cross-case analysis process. The argument above is about how to proceed when the starting point is a within-case analysis.

See also:
  • Nielsen, R. A. (2014). Case Selection via Matching
  • de Meur, G., Bursens, P., & Gottcheiner, A. (2006). MSDO/MDSO Revisited for Public Policy Analysis. In B. Rihoux & H. Grimm (Eds.), Innovative Comparative Methods for Policy Analysis (pp. 67–94). Springer US. 
  • de Meur, G., & Gottcheiner, A. (2012). The Logic and Assumptions of MDSO–MSDO Designs. In The SAGE Handbook of Case-Based Methods (pp. 208–221). 
  • Rihoux, B., & Ragin, C. C. (Eds.). (2009). Configurational Comparative Methods: Qualitative Comparative Analysis (QCA) and Related Techniques. Sage. Pages 28-32 for a description of "MSDO/MDSO: A systematic  procedure for matching cases and conditions". 
  • Goertz, G. (2017). Multimethod research, causal mechanisms, and case studies: An integrated approach. Princeton University Press.

Monday, May 24, 2021

The potential use of Scenario Planning methods to help articulate a Theory of Change


Over the past few months I have been engaged in discussions with other members of the Association of Professional Futurists (APF) Evaluation Task Force about how activities and outcomes in the field of foresight/alternative futures/scenario planning can usefully be evaluated.

Just recently the subject of Theories of Change has come up, and it struck me that there are at least three ways of looking at Theories of Change in this context:

The first perspective: A particular scenario (i.e. an elaborated view of the future) can contain within it a particular theory of change. One view of the future may imply that technological change will be the main driver of what happens. Another might emphasise the major long-term causal influence of demographic change.

The second perspective: Those organising a scenario planning exercise are also likely to have, either explicitly or implicitly or a mixture of both, a Theory of Change about how their exercise is expected to influence the participants, and about the influence those participants will have on others.

The third perspective looks in the opposite direction and raises the possibility that in other settings a Theory of Change may contain a particular type of future scenario. I'm thinking here particularly of Theories of Change as used by organisations planning economic and/or social interventions in developed and developing economies. This territory has been explored recently by Derbyshire (2019) in "Use of scenario planning as a theory-driven evaluation tool", Futures & Foresight Science, 1(1), 1–13. In that paper he puts forward a good argument for the use of scenario planning methods as a way of developing improved Theories of Change: improved, firstly, through a much more detailed articulation of the causal processes involved; secondly, through more adequate attention to risks and unintended consequences; and thirdly, through more adequate involvement of stakeholders in these two processes.

Both the task force discussions and my revisiting of the paper by Derbyshire have prompted me to think about the potential use of a ParEvo exercise as a means of articulating the contents of a Theory of Change for a development intervention. And to start to reach out to people who might be interested in testing such possibilities. The following possibilities come to mind:

1.  A ParEvo exercise could be set up to explore what happens when X project is set up in Y circumstances with Z resources and expectations. A description of this initial setting would form the seed paragraph(s) of the ParEvo exercise. The subsequent iterations would describe the various possible developments taking place over a series of calendar periods, reflecting the expected lifespan of the intervention, and perhaps a limited period thereafter. The participants would be, or act in the role of, different stakeholders in the intervention. Commentators on the emerging storylines could be independent parties with different forms of expertise relevant to the intervention and its context.

2.  As with all previous ParEvo exercises to date, after the final iteration there would be an evaluation stage, completed by at least the participants and the commentators, but possibly also by others in observer roles. You can see a copy of a recent evaluation survey form here, showing the types of evaluative judgements that would be sought from those involved and observing.

3.  There seem to be at least two possible ways of using the storylines that have been generated to inform the design of a Theory of Change. One is to take whole storylines as units of analysis. For example, a storyline evaluated as both most likely and most desirable by more participants than any other storyline would seem an immediately useful source of detailed information about a causal pathway that should go into a Theory of Change. Other storylines identified as most likely but least desirable would warrant attention as risks that also need to be built into a Theory of Change, along with any potential means of preventing and/or mitigating those risks. Storylines identified as least likely but most desirable would warrant attention as opportunities, also to be built into a Theory of Change, along with means of enabling and exploiting those opportunities.

4.  The second possible approach would give less respect to the existing branch structure, and focus more on the contents of individual contributions, i.e. paragraphs in the storylines. Individual contributions could be sorted into categories familiar to those developing Theories of Change: activities, outputs, outcomes, and impacts. These could then be recombined into one or more causal pathways that the participants thought were both possible and desirable. In effect, a kind of linear jigsaw puzzle. If the four categories of event types were seen as too rigid a schema (a reasonable complaint!), but still an unfortunate necessity, they could be introduced after the recombination process rather than before. Either way, it would probably be useful to include another evaluation stage, making a comparative evaluation of the different combinations of contributions that had been created, using the same metrics as are already being used with existing ParEvo exercises.


       More ideas will follow..


     The beginnings of a bibliography...

Derbyshire, J. (2019). Use of scenario planning as a theory-driven evaluation tool. FUTURES & FORESIGHT SCIENCE, 1(1), 1–13. https://doi.org/10.1002/ffo2.1
Ganguli, S. (2017). Using Scenario Planning to Surface Invisible Risks (SSIR). Stanford Social Innovation Review. https://ssir.org/articles/entry/using_scenario_planning_to_surface_invisible_risks














Sunday, March 21, 2021

Mapping the "structure of cooperation": Adding the time dimension and thinking about further analyses

 

In October 2020 I wrote the first blog post of this name, based on some experiences with analysing the results of a ParEvo exercise. (ParEvo is a web-assisted participatory scenario planning process.)

The focus of that blog posting was a scatter plot of the kind shown below. 

Figure 1: Blue nodes = ParEvo exercise participants. Indegree and Outdegree explained below. Green lines = average indegree and average outdegree

The two axes describe two very basic aspects of network structures, including human social networks. Indegree, in the above example, is the number of other participants who built on that participant's contributions. Outdegree is the number of other participants' contributions that a participant built on. Combining these two measures we can generate (in classic consultants' 2 x 2 matrix style!) four broad categories of behavior, as labelled above. Behaviors, not types of people, because in the above instance we have no idea how generalisable the participants' behaviors are across different contexts.
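The classification can be reproduced directly from a downloaded adjacency matrix. The sketch below uses an invented matrix of four participants and, as in the scatter plot above, the average indegree and outdegree as the cutoff values.

```python
# Sketch of the four-way classification, assuming an adjacency matrix in which
# cell [row][col] = 1 means the row participant built on the column participant's
# contribution. Participant names and values are invented.

participants = ["P1", "P2", "P3", "P4"]
adjacency = [
    # P1 P2 P3 P4   (columns: whose contribution was built on)
    [0, 1, 1, 0],   # P1 built on P2 and P3
    [0, 0, 1, 0],   # P2 built on P3
    [0, 0, 0, 0],   # P3 built on no one else
    [1, 1, 1, 0],   # P4 built on P1, P2 and P3
]

outdegree = {p: sum(row) for p, row in zip(participants, adjacency)}
indegree = {p: sum(row[j] for row in adjacency) for j, p in enumerate(participants)}

mean_in = sum(indegree.values()) / len(participants)
mean_out = sum(outdegree.values()) / len(participants)

def behaviour(p):
    hi_in, hi_out = indegree[p] >= mean_in, outdegree[p] >= mean_out
    if hi_in and hi_out:
        return "Bridging (exploitation)"
    if hi_in:
        return "Leading"
    if hi_out:
        return "Following"
    return "Isolating (exploration)"

for p in participants:
    print(p, "indegree:", indegree[p], "outdegree:", outdegree[p], "->", behaviour(p))
```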

There is another way of labelling two of the quarters of the scatter plot, using a distinction widely used in evolutionary theory and the study of organisational behavior (March, 1991; Wilden et al., 2019). Bridging behavior can be seen as a form of "exploitation" behavior, i.e. it involves making use of others' prior contributions, and in turn having one's contributions built on by others. Isolating behavior can be seen as a form of "exploration" behavior, i.e. building storylines with minimal help from other participants. General opinion suggests that there is no single ideal balance of these two approaches; rather, the appropriate balance is thought to be context dependent. In stable environments exploitation is thought to be more relevant, whereas in unstable environments exploration is seen as more relevant.

What does interest me is the possibility of applying this updated analytical framework to other contexts, in particular to: (a) citation networks, and (b) systems mapping exercises. I will explore citation networks first. Here is an example of a citation network extracted from a public online bibliographic database covering the field of computer science. Any research funding programme will be able to generate such data, both from funding applications and from the research publications subsequently generated.

Figure 2: A network of published papers, linked by cited references


Looking at the indegree and outdegree attributes of all the documents within this network, the average indegree (and outdegree) was 3.9. When this was used as a cutoff value for identifying the four types of cooperation behavior, their distribution was as follows:

  • Isolating / exploration = 59% of publications
  • Leading = 17%
  • Following = 15%
  • Bridging / exploitation = 8%
Their location within the Figure 2 network diagram is shown below in this set of filtered views.

Figure 3: Top view = all four types, Yellow view = Bridging/Exploitation, Blue = Following, Red = Leading, Green = Isolating/Exploration

It makes some sense to find the bridging/exploitation type papers in the center of the network, and the isolating/exploration type papers more scattered and especially out in the disconnected peripheries. 

It would be interesting to see whether the apparently high emphasis on exploration found in this data set would be found in other research areas. 

The examination of citation networks suggests a third possible dimension to the cooperation structure scatter plot: time, represented in the above example as year of publication. Not surprisingly, the oldest papers have the higher indegrees and the newest papers the lower. Older papers (by definition, within an age-bounded set of papers) also have lower outdegree compared to newer papers. But what is interesting here is the potential occurrence of outliers, of two types: "rising stars" and "laggards". That is, new papers with higher than expected indegree ("rising stars") and old papers with lower than expected indegree ("laggards", or a better name??), as seen in the imagined examples (a) and (b) below.
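One rough way to flag such outliers is to compare each paper's indegree with the average for papers published in the same year. The data below is invented; a real analysis would use the downloaded citation network.

```python
# Illustrative sketch of flagging "rising stars" and "laggards": papers whose
# citation indegree is well above or below what is typical for their year of
# publication. Paper IDs, years, indegrees and the threshold are all invented.

from collections import defaultdict

papers = [  # (paper id, year of publication, indegree)
    ("p1", 2010, 8), ("p2", 2010, 9), ("p3", 2010, 2),
    ("p4", 2018, 1), ("p5", 2018, 2), ("p6", 2018, 7),
]

by_year = defaultdict(list)
for pid, year, indeg in papers:
    by_year[year].append(indeg)
year_mean = {y: sum(v) / len(v) for y, v in by_year.items()}

threshold = 3  # how far from the year average counts as an outlier (arbitrary here)
for pid, year, indeg in papers:
    gap = indeg - year_mean[year]
    if gap >= threshold:
        print(f"{pid} ({year}): indegree {indeg} is well above its year's average -> possible 'rising star'")
    elif gap <= -threshold:
        print(f"{pid} ({year}): indegree {indeg} is well below its year's average -> possible 'laggard'")
```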

Another implication of considering the time dimension is the possibility of tracking the pathways of individual authors over time, across the scatter plot space. Their strategies may change over time. "If we take the scientist .. it is reasonable to assume that his/her optimal strategy as a graduate student should differ considerably from his/her optimal strategy once he/she received tenure" (Berger-Tal et al., 2014). They might start by exploring, then following, then bridging, then leading.

Figure 4: Red line = Imagined career path of one publication author. A and B = "Rising Star" and "Laggard" authors


There seem to be two types of opportunities present here for further analyses:
  1. Macro-level analysis of differences in the structure of cooperation across different fields of research. Are there significant differences in the scatter plot distribution of behaviors? If so, to what extent are these differences associated with different types of outcomes across those fields? And if so, is there a plausible causal relationship that could be explored and even tested?
  2. Micro-level analysis of differences in the behavior of individual researchers within a given field. Do individuals tend to stick to one type of cooperation behavior (as categorised above), or is their behavior more variable over time? If the latter, is there any relatively common trajectory? What are the implications of these micro-level behaviors for the balance of exploration and exploitation taking place in a particular field?






Thursday, January 28, 2021

Connecting Scenario Planning and Theories of Change


This blog posting was prompted by Tom Aston’s recent comment at the end of an article about theories of change and their difficulties.  There he said “I do think that there are opportunities to combine Theories Of Change with scenario planning. In particular, context monitoring and assumption monitoring are intimately connected. So, there’s an area for further exploration”

Scenario planning, in its various forms, typically generates multiple narratives about what might happen in the future. A Theory of Change does something similar but in a different way. It is usually in a more diagrammatic rather than narrative form. Often it is simply about one particular view of how change might happen, i.e. a particular causal pathway or package thereof. But in more complex network representations, Theories of Change do implicitly present multiple views of the future, in as much as there are multiple causal pathways that can work through these networks.
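To make that last point concrete, here is a toy sketch in which a Theory of Change is represented as a directed network and every causal pathway from an activity to the final impact is enumerated. The node names and links are invented purely for illustration.

```python
# Toy sketch: a Theory of Change as a directed network, with every causal pathway
# from an activity node to the impact node enumerated. Nodes and links are invented.

toc = {
    "training":           ["better practices", "new networks"],
    "grants":             ["better practices"],
    "better practices":   ["higher yields"],
    "new networks":       ["higher yields", "policy voice"],
    "higher yields":      ["improved wellbeing"],
    "policy voice":       ["improved wellbeing"],
    "improved wellbeing": [],
}

def pathways(node, target, path=None):
    path = (path or []) + [node]
    if node == target:
        yield path
    for nxt in toc.get(node, []):
        yield from pathways(nxt, target, path)

for start in ("training", "grants"):
    for p in pathways(start, "improved wellbeing"):
        print(" -> ".join(p))
```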

ParEvo is a participatory approach to scenario planning which I have developed and which has some relevance to discussion of the relationship between scenario planning and Theories of Change. ParEvo differs from many scenario planning methods in that it typically generates a larger number of alternative narratives about the future, and these narratives precede rather than follow a more abstract analysis of the causal processes that might be at work generating them. My notion is that this narrative-first approach makes fewer cognitive demands on the participants, and is an easier activity to get participants engaged in from the beginning. Another point worth noting about the narratives is that they are collectively constructed, by different self-identified combinations of (anonymised) participants.

At the end of a ParEvo exercise participants are asked to rate all the surviving storylines in terms of their likelihood of happening in real life and their desirability. These ratings can then be displayed in a scatterplot, of the kind shown in the two examples below. The numbered points in each scatterplot are IDs for specific storylines generated in the same ParEvo exercise. Each of the two scatterplots represents a different ParEvo exercise.

 



The location of particular storylines in a scatterplot has consequences. I would argue that storylines in the likely but undesirable quadrant of the scatterplot deserve the most immediate attention. They constitute risks which, if at all possible, need to be prevented, or at least responded to appropriately when they do take place. The storylines in the unlikely but desirable quadrant probably justify the next lot of attention. This is the territory of opportunity. The focus here would be on identifying ways of enabling aspects of those developments to take place.

Then attention could move to the likely and desirable quadrant.  Here attention could be given to the relationship between what is anticipated in the storylines and any pre-existing Theory Of Change.  The narratives in this quadrant may suggest necessary revisions to the Theory Of Change.  Or, the Theory of Change may highlight what is missing or misconceived in the narratives. The early reflections on the risk and opportunity quadrants might also have implications for revisions to the Theory Of Change.

The fourth quadrant contains those storylines which are seen as unlikely and undesirable. Perhaps the appropriate response here is simply to periodically check and update the judgements about their likelihood and undesirability.

These four views can be likened to the different views seen from within a car. There is the front view, which is concerned with likely and desirable events, our expected and intended direction of change. Then there are two peripheral views, to the right and left, which are concerned with risks and opportunities, present in the desirable but unlikely and the undesirable but likely quadrants. Then there is the rear view, out the back, looking at undesirable and unlikely events.

In this explanation I have talked about storylines in different quadrants, but in the actual scatterplots developed so far the picture is a bit more complex. Some storylines are way out in the corners of the scatterplot and clearly need attention, but others are more muted and mixed in their position characteristics, so prioritising which of these to give attention to first versus later could be a challenge.

There is also a less visible third dimension to this scatterplot. Some of the participants' judgements about likelihood and desirability were not unanimous. These are the red dots in the scatterplot above. In these instances some resolution of differences of opinion about the storylines would need to be the first priority. However, it is likely that some of these differences will not be resolvable, so these particular storylines will fall into the category of "Knightian uncertainties", where probabilities are simply unknown. These types of developments can't be planned for in the same way as the others, where some judgements about likelihood could be made. This is the territory where bet hedging strategies are appropriate, a strategy seen both in evolutionary biology and in human affairs. Bet hedging is a response which will be functional in most situations but optimal in none. For example, the accumulation of capital reserves in a company provides insurance against unexpected shocks, but at the cost of efficient use of capital.

There are some other opportunities for connecting thinking about Theories of Change and the multiple alternative futures that can be identified through a ParEvo process. These relate to systems-type modelling that can be done by extracting keywords from the narratives and mapping their co-occurrence in the paragraphs that make up those narratives, using social network analysis visualisation software. I will describe these in more detail in the near future, hopefully.
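As a rough indication of what that keyword co-occurrence mapping involves, here is a minimal sketch that counts how often pairs of keywords appear in the same paragraph and prints an edge list that network visualisation software could import. The keywords and paragraphs are invented.

```python
# Sketch of keyword co-occurrence mapping: count how often pairs of keywords
# appear in the same paragraph, producing a weighted edge list. All data invented.

from itertools import combinations
from collections import Counter

keywords = {"drought", "migration", "credit", "conflict"}
paragraphs = [
    "A severe drought pushes migration towards the coast.",
    "Access to credit softens the impact of the drought.",
    "Migration and conflict over land increase together.",
]

cooccurrence = Counter()
for text in paragraphs:
    present = {k for k in keywords if k in text.lower()}
    for a, b in combinations(sorted(present), 2):
        cooccurrence[(a, b)] += 1

# Each line below is an edge: keyword, keyword, weight
for (a, b), weight in cooccurrence.items():
    print(a, b, weight)
```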


Tuesday, December 15, 2020

The implications of complex program designs: Six proposals worth exploring?

Last week I was involved in a seminar discussion of a draft CEDIL paper reviewing methods that can be used to evaluate complex interventions. That discussion prompted the following speculations, which could have practical implications for the evaluation of complex interventions.

Caveat: As might be expected, any discussion in this area will hinge on the definition of complexity. My provisional definition of complexity is based on a network perspective, something I've advocated for almost two decades now (Davies, 2003). That is, the degree of complexity depends on the number of nodes (e.g. people, objects or events), and on the density and diversity of the types of interactions between them. Some might object and say that what I have described here is simply something which is complicated rather than complex. But I think I can be fairly confident in saying that as you move along this scale of increasing complexity (as I have defined it here), the behaviour of the network will become more unpredictable. I think unpredictability, or at least difficulty of prediction, is a fairly widely recognised characteristic of complex systems (but see the Footnote).
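A toy sketch of that network-based definition, with invented nodes and links: complexity is read off from the number of nodes, the density of links, and the diversity of link types. The scoring here is purely illustrative, not a validated index.

```python
# Toy sketch of the network-based view of complexity: number of nodes, density of
# links, and diversity of link types. Example data and scoring are invented.

from collections import Counter
from math import log

# Each link: (from, to, type of interaction)
links = [
    ("farmer", "trader", "sells_to"),
    ("trader", "bank", "borrows_from"),
    ("NGO", "farmer", "trains"),
    ("bank", "NGO", "funds"),
]
nodes = {n for a, b, _ in links for n in (a, b)}

n = len(nodes)
density = len(links) / (n * (n - 1))             # share of possible directed links present
type_counts = Counter(t for _, _, t in links)
# Shannon entropy of link types as a simple diversity measure
diversity = -sum((c / len(links)) * log(c / len(links)) for c in type_counts.values())

print(f"nodes={n}, density={density:.2f}, link-type diversity={diversity:.2f}")
```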

The proposals:

Proposal 1. As the complexity of an intervention increases, the task of model development (e.g. a Theory of Change), especially model specification, becomes increasingly important relative to that of model testing. This is because there are more and more parameters that could make a difference or be "wrongly" specified.

Proposal 2. When the confident specification of model parameters becomes more difficult, then perhaps model testing will become more like an exploratory search of a combinatorial space than focused hypothesis testing. This probably has implications for the types of methods that can be used, for example more attention to the use of simulations, or predictive analytics.

Proposal 3. In this situation where more exploration is needed, where will all the relevant empirical data come from to test the effects of different specifications? Might it be that as complexity increases there is more and more need for monitoring (time-series) data, relative to evaluation (once-off) data?

Proposal 4. And if a complex intervention may lead to complex effects, in terms of behaviour over time, then the timing of any collection of relevant data becomes important. A once-off data collection would capture the state of the intervention-plus-context system at one point in an impact trajectory that could actually take many different shapes (e.g. linear, sinusoidal, exponential, etc.). The conclusions drawn could be seriously misleading.
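A small illustration of the point, with invented trajectory shapes and numbers: the same once-off measurement time can make very different trajectories look alike or unlike, and the picture changes depending on when the measurement is taken.

```python
# Sketch: a once-off measurement at one moment can give a misleading picture of an
# impact trajectory. The trajectory shapes and values below are invented.

import math

def linear(t):      return t / 10
def plateau(t):     return min(1.0, t / 4)
def sinusoidal(t):  return 0.5 + 0.5 * math.sin(t)

trajectories = {"linear": linear, "plateau": plateau, "sinusoidal": sinusoidal}

measurement_time = 5  # e.g. a single mid-project data collection
for name, f in trajectories.items():
    print(f"{name:>10}: value at t={measurement_time} is {f(measurement_time):.2f}, "
          f"value at t=10 is {f(10):.2f}")
```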

Proposal 5. And going back to model specification, what sort of impact trajectory is the intervention aiming for? One where change happens then plateaus, or one where there is an ongoing increase? This needs specification because it will affect the timing and type of data collection needed.

Proposal 6. And there may be implications for the process of model building. As the intervention gets more complex, in terms of nodes in the network, there will be more actors involved, each of whom will have a view on how the parts, and perhaps the whole package, are and should be working, and on the role of their particular part in that process. Participatory, or at least consultative, design approaches would seem to become more necessary.

Are there any other implications that can be identified? Please use the Comment facility below.

Footnote: Yes, I know you can also find complex (as in difficult to predict) behaviour in relatively simple systems, like a logistic equation that describes the interaction between predator and prey populations. And there may be some quite complex systems (by my definition) that are relatively stable. My definition of complexity is more probabilistic than deterministic.

Friday, December 11, 2020

"If you want to think outside of the box, you first need to find the box" - some practical evaluative thinking about Futures Literacy




Over the last two days, I have participated in a Futures Literacy Lab, run by Riel Miller and organised as part of UNESCO's Futures Literacy Summit. Here are some off-the-cuff reflections.

Firstly the definition of futures literacy. I could not find a decent one, but my search was brief so I expect readers of this blog posting will quickly come up with a decent one. Until then this is my provisional interpretation. Futures literacy includes two types of skills, both of which need to be mastered, although some people will be better at one type than the other:


1. The ability to generate many different alternative views of what might happen in the future.


2. The ability to evaluate a diversity of alternative views of the future, using a range of potentially relevant criteria.

There is probably also a third skill, i.e. the ability to extract useful implications for action from the above two activities.

The process that I took part in highlighted to me (perhaps not surprising because I'm an evaluator) the importance of the second type of skill above - evaluation. There are two reasons I can think of for taking this view:


1. The ability to critically evaluate one's ideas (e.g. multiple different views of the possible future) is an essential metacognitive skill. There is no value in being able to generate many imagined futures if one is then incapable of sorting the "wheat from the chaff" - however that may be defined.


2. The ability to evaluate a diversity of alternative views of the future, can actually have a useful feedback effect, enabling us to improve the way we search for other imagined futures


Here is my argument for the second claim. In the first part of the exercise yesterday each participant was asked to imagine a possible future development in the way that evaluation will be done, and the role of evaluators, in the year 2050. We were asked to place these ideas on Post-It Notes on an online whiteboard, on a linear scale that ranged between Optimistic and Pessimistic. 

Then a second and orthogonal scale was introduced, which ranged from "I can make a difference" to "I can't make a difference". When that second axis was introduced we were asked to move our Post-It Notes to a new position that represented both our view of the idea's possibility and our ability to make a difference to that event. These two steps can be seen as a form of self-evaluation of our own imagined futures. Here is the result (don't bother trying to read the note details).


Later on, as the process proceeded, we were encouraged to "think out of the box". But how do you do that... how do you know what is "out of the box"? Unless you deliberately go to extremes, with the associated risk that whatever you come up with will be less useful (however defined).

Looking back at that task now, it strikes me that what the above scatterplot does is show you where the box is, so to speak. And, by contrast, where outside the box is located. "Inside the box" is the part of the scatterplot where the biggest concentration of posts is located. The emptiest, and thus most "out of the box", area is the top right quadrant. There is only one Post-It Note there. So, if more out-of-the-box thinking is needed in this particular exercise setting, then perhaps we should start brainstorming about optimistic future possibilities of a kind where I think "I can't make a difference" - now there is a challenge!

The above example can be considered a kind of toy model, a simple version of a larger and more complex range of possible applications. That is, any combination of evaluative dimensions will generate a combinatorial space, which will be densely populated with ideas about possible futures in some areas and empty in others. To explore those empty areas we will need to do some imaginative thinking at a higher level of abstraction, i.e. about the different kinds of evaluative dimensions that might be relevant. My impression is that this meta-territory has not yet been explored very much. When you look at the futures/foresight literature the most common evaluative dimensions are those of "possibility" and "desirability" (the ones I have used myself, within the ParEvo app). But there must be others that are also relevant in various circumstances.
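Here is a toy sketch of that idea, using the two dimensions from the exercise above (the note positions are invented): place the imagined-future notes in the evaluative space, then report the emptiest quadrant, i.e. where "out of the box" thinking might be directed.

```python
# Toy sketch of "finding the box": place imagined-future notes in a 2x2 space
# defined by two evaluative dimensions, then report the emptiest quadrant.
# All note positions are invented.

from collections import Counter

# (optimism: -1 pessimistic .. +1 optimistic, agency: -1 "I can't make a difference" .. +1 "I can")
notes = [(-0.8, 0.2), (-0.5, 0.6), (-0.2, 0.9), (0.3, 0.7), (0.6, 0.4), (-0.7, -0.3)]

def quadrant(x, y):
    return ("optimistic" if x >= 0 else "pessimistic",
            "can make a difference" if y >= 0 else "can't make a difference")

counts = Counter(quadrant(x, y) for x, y in notes)
all_quadrants = [(a, b) for a in ("optimistic", "pessimistic")
                 for b in ("can make a difference", "can't make a difference")]
emptiest = min(all_quadrants, key=lambda q: counts.get(q, 0))
print("Least populated ('out of the box') quadrant:", emptiest,
      "with", counts.get(emptiest, 0), "notes")
```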

Postscript 2020 12 11: This afternoon we had a meeting to review the Futures Literacy Lab experience. In that meeting one of the facilitators produced this definition of Futures Literacy, which I have visibly edited, to improve it :-)



 Lots more to be discussed here, for example:

1. Different search strategies that can be used to find interesting alternate futures. For example, random search, and "the adjacent possible" searches are two that come to mind

2. Ways of getting more value from the alternate futures already identified e.g. by recombination 

3. Ways of mapping the diversity of alternate futures that have already been identified, e.g. using network maps of the kind I discussed earlier on this blog (Evaluating Innovation)

4. The potential worth of getting independent third parties to review/evaluate the (a) contents generated by participants, and (b) participants' self-evaluations of their content


For an earlier discussion of mine that might be of interest, see 

"Evaluating the Future"Podcast and paper prepared with and for the EU Evaluation Support Services Unit, 2020




Monday, December 07, 2020

Has the meaning of impact evaluation been hijacked?

 



This morning I have been reading, with interest, Giel Ton's 2020 paper: 
Development policy and impact evaluation: Contribution analysis for learning and accountability in private sector development 

I have one immediate reaction, which I must admit I have been storing up for some time. It is to do with what I would call the hijacking of the meaning or definition of 'impact evaluation'. These days impact evaluation seems to be all about causal attribution. But I think this is an overly narrow definition, and almost self-serving of the interests of those trying to promote methods specifically dealing with causal attribution, e.g. experimental studies, realist evaluation, contribution analysis and process tracing. (PS: This is not something I am accusing Giel of doing!)

 I would like to see impact evaluations widen their perspective in the following way:

1. Description: Spend time describing the many forms of impact a particular intervention is having. I think the technical term here is multifinality. In a private-sector development programme, multifinality is an extremely likely phenomenon.  I think Giel has in effect said so at the beginning of his paper: " Generally, PSD programmes generate outcomes in a wide range of private sector firms in the recipient country (and often also in the donor country), directly or indirectly."

 2. Valuation: Spend time seeking relevant participants’ valuations of the different forms of impact they are experiencing and/or observing. I'm not talking here about narrow economic definitions of value, but the wider moral perspective on how people value things - the interpretations and associated judgements they make. Participatory approaches to development and evaluation in the 1990s gave a lot of attention to people's valuation of their experiences, but this perspective seems to have disappeared into the background in most discussions of impact evaluation these days. In my view, how people value what is happening should be at the heart of evaluation, not an afterthought. Perhaps we need to routinely highlight the stem of the word Evaluation.

3. Explanation: Yes, do also seek explanations of how different interventions worked and failed to work (aka causal attribution), paying attention of course to heterogeneity, both in the form of equifinality and multifinality. Please note: I am not arguing that causal attribution should be ignored, just that it should be placed within a wider perspective! It is part of the picture, not the whole picture.

4. Prediction: And in the process, don't be too dismissive of the value of identifying reliable predictions that may be useful in future programmes, even if the causal mechanisms are not known or perhaps are not even there. When it comes to future events, there are some that we may be able to change or influence, because we have accumulated useful explanatory knowledge. But there are also many which we acknowledge are beyond our ability to change, but where, with good predictive knowledge, we may still be able to respond appropriately.

Two examples, one contemporary, one very old: If someone could give me a predictive model of sharemarket price movements that had even a modest 55% accuracy I would grab it and run, even though the likelihood of finding any associated causal mechanism would probably be very slim.  Because I’m not a billionaire investor, I have no expectation of being able to use an explanatory model to actually change the way markets behave.  But I do think I could respond in a timely way if I had relevant predictive knowledge.

Similarly with the movements of the sun: people have had predictive knowledge about the movement of the sun for millennia, and this informed their agricultural practices. But even now that we have much improved explanatory knowledge about the sun's movement, few would think that this will help us change the way the seasons progress.

 I will now continue reading Giel's paper…


2021 02 19: I have just come across a special issue of the Evaluation journal of Australasia, on the subject of values. Here is the Editorial section.

Sunday, December 06, 2020

Quality of Evidence criteria that can be applied to Most Significant Change (MSC) stories

 


Two recent documents have prompted me to do some thinking on this subject

If we view Most Significant Change (MSC) stories as evidence of change (and what people think about those changes) what should we look for in terms of quality - what are the attributes of quality we should look for?

Some suggestions that others might like to edit or add to, or even delete...

1. There is clear ownership of an MSC story and the reasons for its selection by the storyteller. Without this, there is no possibility of clarification of any elements of the story and its meaning, let alone more detailed investigation/verification

2. There was some protection against random/impulsive choice. The person who told the story was asked to identify a range of changes that had happened, before being asked to identify the one which was most significant 

3. There was some protection against interpreter/observer error. If another person recorded the story, did they read back their version to the storyteller, to enable them to make any necessary corrections?

4. There has been no violation of ethical standards: Confidentiality has been offered and then respected. Care has been taken not only with the interests of the storyteller but also of those mentioned in a story.

5. Have any intended sources of bias been identified and explained? Sometimes it may be appropriate to ask about " most significant changes caused by....xx..." or "most significant changes of ...x ...type"

6. Have any unintended sources of bias been anticipated and responded to? For example, by also asking about "most significant negative changes " or "any other changes that are most significant"?

7. There is transparency of sources. If stories were solicited from a number of people, we know how these people were identified and who was excluded and why so. If respondents were self-selected we know how they compare to those that did not self-select.

8. There is transparency of the selection process: If multiple stories were initially collected and the most significant of these were then selected, reported and used elsewhere, the details of the selection process should be available, including (a) who was involved, (b) how choices were made, and (c) the reasons given for the final choice(s) made.

9. Fidelity: Has the written account of why a selection panel chose a story as most significant done the participants' discussion justice? Was it sufficiently detailed, as well as being truthful?

10. Have potential biases in the selection processes been considered? Do most of the finally selected most significant change stories come from people of one kind versus another e.g. men rather than women, one ethnic or religious group versus others? In other words, is the membership of the selection panel transparent? (thanks to Maleeha below).

11.    your thoughts here on.. (using the Comment facility below).

Please note 

1. That in focusing here on "quality of evidence" I am not suggesting that the only use of MSC stories is to serve as forms of evidence. Often the process of dialogue is immensely important and it is the clarification of values and who values what and why so, that is most important. And there are also bound to be other purposes also served

2. (Perhaps the same point, expressed in another way) The above list is intentionally focused on minimal rather than optimal criteria. As noted above, a major part of the MSC process is about the discovery of what is of value, to the participants.  

For more on the MSC technique, see the resources here.




Wednesday, October 28, 2020

Mapping the structure of cooperation


Over the last year or so I have been developing a web application known as ParEvo. The purpose of ParEvo is to enable people to take part in a participatory scenario planning process, online. How the process works is described in detail on this website. The main point I need to make clear in this post is that the process consists of people writing short paragraphs of text describing what might happen next. Participants choose which previously written paragraphs their paragraphs should be added to. In turn, other participants may choose to add their own paragraphs of text to these. The net result is a series of branching storylines describing alternative futures, which can vary in the way they are constructed, i.e. in who was involved in the construction of which storyline.

One of the advantages of using ParEvo is that data can be downloaded showing whose text contribution was added to whose. While the ParEvo app does show all the contributions and how they connect into different storylines in the form of a tree structure, it does this in an anonymous way: it is not possible for participants, or observers, to see who wrote which contributions. However, an exercise facilitator can download the otherwise hidden data on whose contribution was added to whose. This data can be downloaded in the form of an "adjacency matrix", which lists the participants by row and the same participants by column. The cells in the matrix show whether the row participant added a contribution to an existing contribution made by the column participant. This kind of matrix data is easy to then visualise as a social network structure. Here is an anonymised example from one ParEvo exercise.

Blue nodes = participants. Grey links = contributions to the pointed participant. Red links = reciprocated contributions. Big nodes have many links, small nodes have few links
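For those wanting to do this themselves, here is a minimal sketch of turning a downloaded adjacency matrix into an edge list that network drawing tools can import. The file contents, participant names and values below are invented; the assumed layout is one row and one column per participant, with cell values counting how often the row participant built on the column participant's contributions.

```python
# Sketch: convert an adjacency matrix (CSV format assumed) into a weighted edge list.
# In practice the text below would come from the downloaded file, e.g.
# open("parevo_adjacency.csv").read() - the file name is hypothetical.

import csv
import io

csv_text = """\
,P1,P2,P3,P4
P1,0,2,1,0
P2,0,0,1,0
P3,0,0,0,0
P4,1,1,2,0
"""

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)[1:]          # column participants
edges = []
for row in reader:
    if not row:
        continue
    source = row[0]                # row participant
    for target, value in zip(header, row[1:]):
        if int(value) > 0:
            edges.append((source, target, int(value)))

for source, target, weight in edges:
    print(f"{source} built on {target}'s contributions {weight} time(s)")
```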

Another way of summarising the structure of participation is to create a scatterplot, as in the example shown below. The X-axis represents the number of other participants who have added contributions to one's own contributions (SNA term: indegree). The Y-axis represents the number of other participants whose contributions one has added one's own contributions to (SNA term: outdegree). The data points in the scatterplot identify the individual participants in the exercise and their characteristics as described by the two axes. The four corners of the scatterplot can be seen as four extreme types of participation:

– Isolates: who only build on their own contributions and nobody else builds on these
– Leaders: who only build on their own contributions, but others also build on these
– Followers: who only build on others' contributions, but others do not build on theirs
– Connectors: who built on others' contributions and others build on theirs

The maximum value of the Y-axis is defined by the number of iterations in the exercise. The maximum value of the X-axis is defined by the number of participants in the exercise. The graph below needs updating to show an X-axis maximum value of 10, not 8




One observer of this ParEvo exercise commented: "It makes sense to me: those three leaders are the three most senior staff in the group, and it makes sense that they might have produced contributions that others would follow, and that they might be the people most sure of their own narrative."

What interested me was the absence of any participants in the Isolates and Connectors corners of the scatter plot. The absence of isolates is probably a good thing within an organisation, though it could mean a reduced diversity of ideas overall. The absence of Connectors seems more problematic - it might suggest a situation where there are multiple conceptual silos/cliques that are not "talking" to each other. It will be interesting to see in other ParEvo exercises what this scatter plot structure looks like, and how the owners of those exercises interpret them.


Saturday, September 26, 2020

EvalC3 versus QCA - compared via a re-analysis of one data set


I was recently asked whether EvalC3 could be used to do a synthesis study, analysing the results from multiple evaluations. My immediate response was yes, in principle. But it probably needs more thought.

I then recalled that I had seen somewhere an Oxfam synthesis study of multiple evaluation results that used QCA. This is the reference, in case you want to read it, which I suggest you do.

Shephard, D., Ellersiek, A., Meuer, J., & Rupietta, C. (2018). Influencing Policy and Civic Space: A meta-review of Oxfam’s Policy Influence, Citizen Voice and Good Governance Effectiveness Reviews | Oxfam Policy & Practice. Oxfam. https://policy-practice.oxfam.org.uk/publications/*

Like other good examples of QCA analyses in practice, this paper includes the original data set in an appendix, in the form of a truth table.  This means it is possible for other people like me to reanalyse this data using other methods that might be of interest, including EvalC3.  So, this is what I did.

The Oxfam dataset includes five conditions, a.k.a. attributes, of the programs that were evaluated, along with two outcomes, each pursued by some of the programs. In total there was data on the attributes and outcomes of twenty-two programs concerned with expanding civic space and fifteen programs concerned with policy influence. These were subject to two separate QCA analyses.

The analysis of civic space outcomes

In the Oxfam analysis of the fifteen programs concerned with expanding civic space, the QCA analysis found four "solutions", a.k.a. combinations of conditions, which were associated with the outcome of expanded civic space. Each of these combinations of conditions was found to be sufficient for the outcome to occur. Together they accounted for the outcomes found in 93%, or fourteen, of the fifteen cases. But there was overlap in the cases covered by each of these solutions, leaving open the question of which solution best fitted/explained those cases. Six of the fourteen cases had two or more solutions that fitted them.

In contrast, the EvalC3 analysis found two predictive models (i.e. solutions) which were associated with the outcome of expanded civic space. Each of these combinations of conditions was found to be sufficient for the outcome to occur. Together they accounted for all fifteen cases where the outcome occurred. In addition, there was no overlap in the cases covered by each of these models.

The analysis of policy influencing outcomes

In the Oxfam analysis of the twenty-two programs concerned with policy influencing, the QCA analysis found two solutions associated with the outcome of successful policy influence. Each of these was sufficient for the outcome, and together they accounted for all the outcomes. But there was some overlap in coverage: one of the six cases was covered by both solutions.

In contrast, the EvalC3 analysis found one predictive model which was necessary and sufficient for the outcome, and which accounted for all the outcomes achieved.

Conclusions?

Based on parsimony alone, the EvalC3 solutions/predictive models would be preferable. But parsimony is not the only appropriate criterion by which to evaluate a model. Arguably a more important criterion is the extent to which a model fits the details of the cases covered, when those cases are closely examined. So, really, what the EvalC3 analysis has done is to generate some extra models that need close attention, in addition to those already generated by the QCA analysis. The number of cases covered by multiple models has been increased.

In the Oxfam study, there was no follow-on attention given to resolving what was happening in the cases that were identified by more than one solution/predictive model.  In my experience of reading other QCA analyses, this lack of follow-up is not uncommon.

However, in the Oxfam study, for each of the solutions found at least one detailed description was given of an example case that the solution covered. In principle, this is good practice. But unfortunately, as far as I can see, it was not clear whether that particular case was exclusively covered by that solution, or was also covered by another. Even among those cases which were exclusively covered by a solution, there are still choices that need to be made (and explained) about how to select particular cases as exemplars and/or for a detailed examination of any causal mechanisms at work.

QCA software does not provide any help with this task. However, I did find some guidance in a specialist text on QCA: Schneider, C. Q., & Wagemann, C. (2012). Set-Theoretic Methods for the Social Sciences: A Guide to Qualitative Comparative Analysis. Cambridge University Press. https://doi.org/10.1017/CBO9781139004244 (parts of this book are a heavy read, but overall it is very informative). In section 11.4, titled Set-Theoretic Methods and Case Selection, the authors note: "Much emphasis is put on the importance of intimate case knowledge for a successful QCA. As a matter of fact, the idea of QCA as a research approach and of going back-and-forth between ideas and evidence largely consists of combining comparative within-case studies and QCA as a technique. So far, the literature has focused mainly on how to choose cases prior to and during, but not after, a QCA, where by QCA we here refer to the analytic moment of analysing a truth table. It is therefore puzzling that little systematic and specific guidance has so far been provided on which cases to select for within-case studies based on the results of, i.e. after, a QCA…" The authors then go on to provide some guidance (a total of 7 pages out of 320).

In contrast to QCA software, EvalC3 has a number of built-in tools, and some associated guidance on the EvalC3 website, on how to think about case selection as a step between cross-case analysis and subsequent within-case analysis. One of the steps in the seven-stage EvalC3 workflow (Compare Models) is the generation of a table that compares the case coverage of multiple selected alternative models found by one's analysis to that point. This enables the identification of cases which are covered by two or more models. These types of cases would clearly warrant subsequent within-case investigation.
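The sketch below illustrates the kind of comparison the Compare Models step produces (it is not EvalC3 code): which cases each model covers, and which cases are covered by more than one model. The model definitions and case data are invented.

```python
# Rough sketch of a model-coverage comparison: which cases are covered by which
# models, and which cases are covered by more than one model. Data invented.

cases = {
    "C1": {"A": 1, "B": 1, "C": 0},
    "C2": {"A": 1, "B": 0, "C": 1},
    "C3": {"A": 0, "B": 1, "C": 1},
    "C4": {"A": 1, "B": 1, "C": 1},
}

# Each 'model' is a set of conditions that must all be present
models = {
    "Model 1": {"A", "B"},
    "Model 2": {"B", "C"},
}

coverage = {m: {c for c, attrs in cases.items() if all(attrs[k] == 1 for k in conds)}
            for m, conds in models.items()}

for m, covered in coverage.items():
    print(m, "covers", sorted(covered))

# Cases covered by every model listed (candidates for within-case follow-up)
overlap = set.intersection(*coverage.values())
print("Cases covered by more than one model:", sorted(overlap))
```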

Another step in the EvalC3 workflow, called Compare Cases, provides a further means of identifying specific cases for follow-up within-case investigations. In this worksheet individual cases can be identified as modal or extreme examples within various categories that may be of interest, e.g. True Positives, False Positives, et cetera. It is also possible to identify, for a chosen case, which other case is most similar and which most different to it, when all the attributes available in the dataset are considered. These measurement capacities are backed up by technical advice on the EvalC3 website about the particular types of questions that can be asked in relation to different types of cases, selected on the basis of their similarities and differences. Your comments on these suggested strategies would be very welcome.

I should explain...

...why the models found by EvalC3 were different from those found by the QCA analysis. QCA software finds solutions, i.e. predictive models, by reducing all the configurations found in a truth table down to the smallest possible set, using a minimisation algorithm known as the Quine-McCluskey algorithm.

In contrast, EvalC3 provides users with a choice of four different search algorithms, combined with multiple alternative performance measures that can be used to automatically assess the results generated by those search algorithms. All algorithms have their strengths and weaknesses, in terms of the kinds of results they can and cannot find, including the QCA Quine-McCluskey algorithm and the simple machine learning algorithms built into EvalC3. I think the Quine-McCluskey algorithm has particular problems with datasets which have limited diversity, in other words where the cases represent only a small proportion of all the possible combinations of the conditions documented in the dataset, whereas the simple search algorithms built into EvalC3 don't experience this as a difficulty. This is my conjecture, not yet rigorously tested.
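To illustrate the contrast, here is a simplified sketch of the kind of brute-force search an EvalC3-style tool can perform instead of Quine-McCluskey minimisation: every combination of conditions is scored as a candidate predictive model using a chosen performance measure (simple accuracy here). The conditions and case data are invented, and this is not EvalC3's actual code.

```python
# Simplified sketch of an exhaustive search over condition combinations, each
# scored as a candidate predictive model. Conditions, cases and the performance
# measure (accuracy) are chosen purely for illustration.

from itertools import combinations

cases = [  # (attributes, outcome)
    ({"A": 1, "B": 1, "C": 0}, 1),
    ({"A": 1, "B": 0, "C": 1}, 0),
    ({"A": 0, "B": 1, "C": 1}, 1),
    ({"A": 1, "B": 1, "C": 1}, 1),
    ({"A": 0, "B": 0, "C": 1}, 0),
]
conditions = ["A", "B", "C"]

def accuracy(model):
    correct = 0
    for attrs, outcome in cases:
        predicted = all(attrs[c] == 1 for c in model)
        correct += int(predicted == bool(outcome))
    return correct / len(cases)

results = []
for size in range(1, len(conditions) + 1):
    for model in combinations(conditions, size):
        results.append((accuracy(model), model))

for score, model in sorted(results, reverse=True):
    print(f"{' AND '.join(model)}: accuracy {score:.2f}")
```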

[In the above analyses, the cases in the two data sets represented 47% and 68% of all the possible configurations, given the presence of five different conditions.]

While the EvalC3 results described above did differ from the QCA analyses, they were not in outright contradiction. The same has been my experience when reanalysing other QCA datasets: EvalC3 will either simply duplicate the QCA findings or produce variations on them, often variations which perform better.