Monday, March 06, 2023

How can evaluators practically think about multiple Theories of Change in a particular context?


This blog posting has been prompted by participation in two recent events. One was some work I was doing with the ICRC, reviewing Terms of Reference for an evaluation. The other was listening in as a participant to this week's European Investment Bank conference titled "Picking up the pace: Evaluation in a rapidly changing world".

When I was reviewing some Terms of Reference for an evaluation I noticed a gap which I have seen many times before. While there was a reasonable discussion of the types of information that would need to be gathered, there was a conspicuous absence of any discussion of how that data would be analysed. My feedback included the suggestion that the Terms of Reference needed to ask the evaluation team for a description of the analytical framework they would use to analyse the data they were collecting.

The first two sessions of this week's EIB conference were on the subject of foresight and evaluation – in other words, how evaluators can think more creatively and usefully about possible futures, a subject of considerable interest to me. You might notice that I've referred to futures rather than the future, intentionally emphasising the fact that there may be many different kinds of futures, and that with some exceptions (e.g. climate change) it is not easy to identify which of these will actually eventuate.

To be honest, I wasn't too impressed with the ideas that came up in this morning's discussion about how evaluators could pay more attention to the plurality of possible futures. On the other hand, I did feel some sympathy for the panel members who were put on the spot to answer some quite difficult questions on this topic.

Benefiting from the luxury of more time to think about this topic, I would like to make a suggestion that might be practically usable by evaluators, and worth considering by commissioners of evaluations. The suggestion is about how an evaluation team could realistically give attention not just to a single "official" Theory of Change about an intervention, but to multiple relevant Theories of Change about an intervention and its expected outcomes. In doing so I hope to address both issues I have raised above: (a) the need for an evaluation team to have a conceptual framework structuring how it will analyse the data it collects, and (b) the need to think about more than one possible future and how that might be realised, i.e. more than one Theory of Change.

The core idea is to make use of something which I have discussed many times previously in this blog: what is known as the Confusion Matrix to those involved in machine learning, and more generally described simply as a truth table – one that describes four types of possibilities. It takes the following form:
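                          Expected outcome present    Expected outcome absent
Intervention present      1. True Positive            2. False Positive
Intervention absent       3. False Negative           4. True Negative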

In the field of machine learning the main interest in the Confusion Matrix lies in the associated performance measures that can be generated and used to analyse and assess the performance of different predictive models. While these are of interest, what I want to talk about here is how we can use the same framework to think about different types of theories, as distinct from different types of observed results.
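For readers unfamiliar with those performance measures, here is a minimal sketch (Python, with invented cell counts) of the kind of measures meant:

```python
# Illustrative counts for the four cells of a Confusion Matrix (hypothetical numbers)
tp, fp, fn, tn = 40, 10, 5, 45

accuracy  = (tp + tn) / (tp + fp + fn + tn)  # proportion of all cases correctly classified
precision = tp / (tp + fp)  # when the model predicts the outcome, how often it is actually present
recall    = tp / (tp + fn)  # of all cases where the outcome is present, how many the model finds

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```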

There are four different types of Theories of Change that can be seen in the Confusion Matrix. The first (1) describes what is happening when an intervention is present and the expected outcome of that intervention is present. This is the familiar territory of the kind of Theory of Change that an evaluator will be asked to examine.

The second (2) describes what is happening when an intervention is present but the expected outcome of that intervention is absent. This theory would describe what additional conditions are present, or what expected conditions are absent, which make a difference – leading to the expected outcome being absent. When it comes to analysing data on what actually happened, identifying these conditions can lead to modification of the first (1) Theory of Change such that it becomes a better predictor of the outcome and there are fewer False Positives (found in cell 2). Ideally, the fewer False Positives the better. But from a theory development point of view there should always be some situations described in cell 2, because there will never be an all-encompassing theory that works everywhere. There will always be boundary conditions beyond which the theory is not expected to work. So an important part of an evaluation is not just to refine the theory about what works (1), but also to refine the theory of the circumstances in which it is not expected to work (2) – sometimes known as conditions or boundary conditions.

The third theory (3) describes what is happening when the intervention is absent but nevertheless the outcome is present. Consideration of this possibility involves recognition of what is known as "equifinality", i.e. that some events can arise from multiple alternative causal conditions (or combinations of causal conditions). It's not uncommon to find advice to evaluators that they should consider alternative theories to those they are currently focused on – for example, in the literature on contribution analysis. But it strikes me that this is often close to a ritualistic requirement, or at least treated that way in practice. In this perspective alternative theories are a potential threat to the theory being focused on (1). A much more useful perspective would be to treat these alternative theories as potentially useful other courses of action that an agent could take, which warrant serious attention in their own right. And if they are shown to have some validity this does not by definition mean that the main theory of change (1) is wrong. It simply means that there are alternative ways of achieving the outcome, which can only be a bonus finding.

The fourth theory (4) describes what is happening when the intervention is absent and the outcome is also absent. In its simplest interpretation, it may be that the actual absence of the attributes of the intervention is the reason why the outcome is not present. But this can't be assumed. There may be other factors which have been more important causes – for example, the presence of an earthquake, or the holding of a very contested election. This possibility is captured by the term "asymmetric causality", i.e. that the causes of something not happening may not simply be the absence of the causes of something happening. Knowing about these other possible causes of the desired outcome not happening is surely important, in addition to and alongside knowing about how an intervention does cause the outcome. Knowing more about these causes might help other parties, with other interventions in mind, move cases with this experience from being True Negatives (4) to being False Negatives (3).

In summary, I think there is an argument for evaluators not being too myopic when thinking about the Theories of Change they need to pay attention to. It should not be all about testing the first (1) type of Theory of Change, with all the other possibilities considered simply as challengers, which may or may not then be dismissed. Each of the other types of theories (2, 3, 4) is important and useful in its own right and deserves attention.



Tuesday, October 18, 2022

Four types of futures that should be covered by a Theory of Change


ParEvo.org is a web app that enables the collaborative exploration of alternative futures, online. In the evaluation stage, participants are asked to identify which of the surviving storylines fall into each of these categories:

  • Most desirable
  • Least desirable
  • Most likely
  • Least likely
In one part of the analysis of storylines generated during a ParEvo exercise, the storylines are plotted on a scatter plot where the two dimensions are likelihood and desirability, as seen in this example:
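A minimal sketch (Python, with invented storyline ratings) of the kind of plot referred to:

```python
import matplotlib.pyplot as plt

# Hypothetical ratings for five surviving storylines (scales are illustrative)
storylines   = ["A", "B", "C", "D", "E"]
likelihood   = [0.8, 0.3, 0.6, 0.2, 0.9]
desirability = [0.7, 0.9, 0.2, 0.4, 0.5]

fig, ax = plt.subplots()
ax.scatter(likelihood, desirability)
for name, x, y in zip(storylines, likelihood, desirability):
    ax.annotate(name, (x, y))  # label each storyline point
ax.set_xlabel("Likelihood")
ax.set_ylabel("Desirability")
ax.set_title("ParEvo storylines plotted by likelihood and desirability")
plt.show()
```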


Most Theories of Change that I have come across, when working as an evaluator, focus on a future that is seen as desirable and likely (as in expected). At best, the undesirable futures will be mentioned in an accompanying section on risks and their management.

A less myopic approach might be useful, one which would orient the users of the Theory of Change to a more adaptive stance towards the future.

One way forward would be to think of a four-part Theory of Change, each part of which has different implications, as follows:
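In outline (the cell layout here is an assumed reconstruction):

                 Unlikely                        Likely
Desirable        Desirable but unlikely          Desirable and likely (the usual ToC focus)
Undesirable      Undesirable and unlikely        Undesirable but likely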


The top right cell may already be covered by a Theory of Change. In the two cells covering desirable-but-unlikely and undesirable-but-likely futures, it would be useful to have ordered lists that describe the relevant events, what needs to be done before they happen, and what needs to be done after they happen. In the unlikely-and-undesirable cell, plans for monitoring the status of these events need to be spelled out, and updated on an ongoing basis.



Thursday, October 13, 2022

We need more doubt and uncertainty!


This week the Swedish Evaluation Society (SVUK) is holding its annual conference. I took part in a session today on Theories of Change. The first part of my presentation summarised the points I made in a 2018 CEDIL Inception Report titled 'Theories of Change: Technical Challenges with Evaluation Consequences'. Following the presentation I was asked by Gustav Petersson, the discussant, whether we should pay more attention to the process of generating diagrammatic Theories of Change. I could only agree, reflecting that, for example, it is not uncommon for a representative of a conference working group to summarise a very comprehensive and in-depth discussion in all too brief and succinct terms when reporting back to a plenary – leaving out, or understating, the uncertainties, ambiguities and disagreements. Similarly, the completed version of a diagrammatic Theory of Change is likely to suffer from the same limitations, being an overly simplified version of the much more complex and nuanced discussions between those involved in its construction that went on beforehand.

Later in the day I was reminded of the section in The Hitchhiker's Guide to the Galaxy where Vroomfondel, representing a group of striking philosophers, shouted: "That's right! We demand rigidly defined areas of doubt and uncertainty!"

I'm inclined to make a similar kind of request of those developing Theories of Change, and of those subsequently charged with assessing the evaluability of the associated intervention, including its Theory of Change. What I mean is that the description of a Theory of Change should make it clear which parts of the theory the owner(s) of that theory are more confident in and which they are less confident in, along with descriptions of the nature of the doubt or uncertainty and its causes – e.g. first-hand experience, or supporting evidence (or the lack of it) from other sources.

Those undertaking an evaluability assessment could go a step further and convert various specific forms of doubt and uncertainty into evaluation questions that could form an important part of the Terms of Reference for an evaluation. This might go some way to remedying another problem discussed during the session, which is the all too common (in my experience) phenomenon of Terms of Reference only making generic references to an intervention's Theory of Change – for example, by asking in broad terms about "what works and in what circumstances", rather than asking for the testing of various specific parts of that theory, which would arguably be more useful, and a better use of limited time and resources.

The bottom line: the articulation of a Theory of Change should conclude with a list of important evaluation questions. Unless there are good reasons to the contrary, those questions should then appear in the Terms of Reference for a subsequent evaluation.



PS: Vroomfondel is a philosopher. He appears in chapter 25 of The Hitchhiker's Guide to the Galaxy, along with his colleague Majikthise, as a representative of the Amalgamated Union of Philosophers, Sages, Luminaries and Other Thinking Persons (AUPSLOTP; the BBC TV version inserts 'Professional' before 'Thinking'). The Union is protesting about Deep Thought, the computer which is being asked to determine the Answer to the Ultimate Question of Life, the Universe and Everything. See https://hitchhikers.fandom.com/wiki/Vroomfondel



Thursday, June 30, 2022

Using ParEvo to conduct thought experiments


I have just had an interesting conversation with an NGO network who have been developing some criteria to: (a) help speed up the approval and release of funding in humanitarian emergencies, while (b) at the same time minimising the risk of poor use of those funds.

They think these criteria are useful but are not entirely sure whether those seeking funding will agree.  So they are exploring ways of testing out their applicability through a wider consultation process.

One way of doing this, which we have been discussing, involves the use of ParEvo.org. The plan is that a group of participants representing potential grantees will develop a set of storylines which start off with a particular organisation seeking funding for a particular humanitarian emergency. A branching structure of possible subsequent storyline developments will then be articulated through the usual ParEvo process.

After those storylines have been developed there will be an evaluation phase, as is now common practice with most ParEvo exercises. At this point the participants will be asked two generic types of questions (and variations on these), as described below:

1.  Which of the criteria in the current framework would be most likely to help avoid or mitigate the problems seen in storyline X? (Answer=Description & Explanation) 

  • and if the answer is none, are there any other criteria that could be included in the framework that might have helped?

2.  Which of the storylines in the current exercise would have most benefited from criterion X in the current framework, in the sense that the problems described there would have been avoided or mitigated? (Answer=Description & Explanation) 

  • and if the answer is none, does this suggest that the criterion is irrelevant and could be removed?
Postscript: One interesting thing about this type of thought experiment is that the theory (the proposed funding criteria) and the possible realities it may be applied to (where the theory may or may not work as expected) are constructed by different parties who are independent of each other. This is not usually the case with thought experiments, and could be seen as a positive variation.

Stay tuned to see if and when this idea flies, then soars or crashes.


Courtesy https://xkcd.com/

For more on thought experiments, see Armchair science



Friday, June 17, 2022

Alternative futures as "search strategies"




When you read the phrase "search strategy" this may bring to mind what you need when you are doing a literature search on the Internet. Or you may be thinking about different forms of supervised machine learning, which involve different types of search strategies. For example, in my Excel-based EvalC3 prediction modelling app there are four different search strategies that users can choose from, to help find the most accurate predictive model describing what combinations of attributes are the best predictor of a particular outcome. Or you may have heard of James March, an organisational theorist who in 1981 wrote a paper called 'A model of adaptive organizational search', where he talks about how organisations find the right new technologies to develop and explore. This is probably the closest thing to the type of search process that I'm describing below.

Right now I am in the process of helping some other consultants design a ParEvo exercise, in which recipients of research grants from the same foundation will collaboratively develop a number of alternative storylines describing how their efforts to ensure the uptake and use of their research findings take place (and sometimes fail to take place) over the coming three years. Because these are descriptions of possible futures they are inherently a form of fiction. But please note they are not an attempt at "predicting" fiction. Rather, they are more like a form of 'preparedness enabling' fiction.

As part of the planning process for this exercise we have had to articulate our expectations of what will come out of it, in terms of possible desirable benefits for both the participants and the foundation. In other words, the beginnings of a Theory of Change, which needs to be supplemented by details of how the exercise will best be run in this particular instance, and thus hopefully deliver these results.

When thinking about reasonable expectations for this exercise I came up with the following possibilities, which are now under discussion:

1. Participants will hear different interpretations and views of:
  1. What other participants mean when they use the term "research uptake"
  2. What successful, and unsuccessful, research uptake looks like in its various forms, to various participants
  3. How the process of research uptake can be facilitated, and inhibited, by a range of factors – some within researchers' control and some beyond their control.
2. This experience may then inform how each of the participants proceeds with their own work on facilitating research uptake

3. The storylines that are generated by the end of the exercise will provide the participants and the XXXX trust with a flexible set of expectations against which actual progress with research uptake can be compared at a later date.

So, my current thinking is that what we have here is a description of a particular kind of search strategy, where the objectives worth pursuing and the means of achieving them are both being explored at the same time, at least within the ParEvo exercise. Though other things will also be happening after the exercise, hopefully involving some use of the ideas generated during the exercise (see possibility 2 above).

There is also another facet of the idea of search strategies which needs to be mentioned here. When search is used in a machine learning context it is always accompanied by an evaluation function, which determines whether the search continues or comes to a stop because the best possibility has now been identified (a 'stopping rule', I think, is the term involved). So, of the three possibilities listed above, the last one describes the possibility of an evaluation function. Exactly how it will work needs more thinking, but I think it will be along the lines of asking participants in the prior exercise to identify the extent to which their experience in the interim period has fitted any of the storylines that were developed earlier, in what ways it has and has not, and why so in both cases. Stay tuned...
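To make the stopping rule idea concrete, here is a minimal sketch (Python; the candidate generation and scoring functions are purely illustrative) of a search loop with an evaluation function and a stopping rule:

```python
import random

def generate_candidate():
    # Purely illustrative: a candidate is a random set of 10 binary attribute choices
    return [random.randint(0, 1) for _ in range(10)]

def evaluate(candidate):
    # Purely illustrative evaluation function: score each candidate between 0 and 1
    return sum(candidate) / len(candidate)

best, best_score = None, 0.0
for step in range(1000):
    candidate = generate_candidate()
    score = evaluate(candidate)
    if score > best_score:
        best, best_score = candidate, score
    if best_score >= 0.95:  # the stopping rule: good enough, so stop searching
        break

print(best_score)
```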




Thursday, April 28, 2022

Budgets as theories


A government has a new climate policy. It outlines how climate investments will be spread across a number of different ministries, and implemented by those ministries using a range of modalities. Some funding will be channelled to various multilateral organisations. Some will be spent directly by the ministries. Some will be channelled on to the private sector. At some stage in the future this government wants to evaluate the impact of this climate policy. But before then it has been suggested that an evaluability assessment might be useful, to ask if, how, and when such an evaluation might be feasible.

This could be a challenge for those with the task of undertaking the evaluability assessment, and even for those planning the Terms of Reference for that evaluability assessment. The climate policy is not yet finalised. And if the history of most government policy statements (that I have seen) holds any lessons, it is that you can't expect to see a very clearly articulated Theory of Change of the kind that you might expect to find in the design of a particular aid programme.

My provisional suggestion at this stage is that the evaluability assessment should treat the government's budget, particularly those parts involving funding of climate investments, as a theory of what is intended, and treat the actual flows of funding that subsequently occur as the implementation of that theory. My naïve understanding of the budget is that it consists of categories of funding, along with subcategories and sub-subcategories, et cetera – in other words, a type of tree structure involving a nested series of choices about where more versus less funds should go. So, the first task of an evaluability assessment would be to map out the theory, i.e. the intentions as captured by budget statements at different levels of detail, moving from national to ministerial and then to smaller units thereafter, and to comment on the adequacy of these descriptions and any gaps that need to be addressed.
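To illustrate the tree-structure idea, a minimal sketch (Python, with invented ministries and amounts) of a budget seen as a nested series of allocation choices:

```python
# Hypothetical budget: nested categories of funding, forming a tree structure
budget = {
    "National": {
        "Ministry of Energy": {
            "Renewables programme": 120,  # amounts in millions, invented
            "Grid modernisation": 80,
        },
        "Ministry of Agriculture": {
            "Climate-smart farming": 40,
            "Irrigation subsidies": 60,
        },
    }
}

def walk(node, path=()):
    """List every allocation choice, from the national level down to the smallest unit."""
    for name, child in node.items():
        if isinstance(child, dict):
            yield from walk(child, path + (name,))
        else:
            yield path + (name,), child

for path, amount in walk(budget):
    print(" > ".join(path), "=", amount)
```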

This exercise on its own will not be sufficient as an explication of the climate policy theory, because it will not tell us how these different flows of funding are expected to do their work. One option would be to follow each flow down to its 'final recipient', if such a thing can actually be identified. But that would be a lot of work and would probably leave us with a huge diversity of detailed mechanisms. Alternatively, one might do this on a sampling basis, but how would appropriate samples be selected?

There is an alternative, which could be seen as a necessity, that could then be complemented by a sampling process. This would involve examining each binary choice, starting from the very top of the budget structure, and asking 'key informants' questions about why climate funding was present in one category but not the other, or more in one category than the other. This question on its own might have limited value, because budgeting decisions are likely to have a complex and often muddy history, and the responses received might have a substantial element of 'constructed rationality'. Nevertheless the answers could provide some useful context. 

A more useful follow-up question would be to then ask the same informants about their expectations of differences in the performance of the climate financing flowing via category X versus category Y. Followed by a question about how they expect to hear about the achievement of that performance, if at all. Followed by a question about what they would most like to know about performance in this area. Here performance could be seen in terms of a continuum of behaviours, ranging from simple delivery of the amount of funds as originally planned, to their complete expenditure, followed by some form of reporting on outputs and outcomes, and maybe even some form of evaluation, reporting on some form of change.

These three follow-up questions would address three facets of an evaluability assessment (EA): (a) the ToC – about expected changes, (b) data availability, (c) stakeholder interests. Questions would involve two types of comparisons: funding versus no funding, and more versus less funding. The fourth EA question, about the surrounding institutional context, typically asks about the factors that may enable and/or limit an evaluation of what actually happened (more on evaluability assessments here).

There will of course be complications in this sort of approach. Budget documents will not simply be a nested series of binary choices; at each level there may be multiple categories available rather than just two. However, informants could be asked to identify 'the most significant difference' between all these categories, in effect introducing an intermediary binary category. There could also be a great number of different levels to the budget documents, with each new level in effect doubling the number of choices and associated questions that need to be asked. Prioritisation of enquiries would be needed, possibly based on a 'follow the (biggest amount of) money' principle. It is also possible that quite a few informants will have limited ideas or information about the binary comparisons they are asked about. A wider selection of informants might help fill that gap. Finally, there is the question of how to 'validate' the views expressed about expected differences in performance, the availability of performance information, and relevant questions about performance. Validation might take the form of a survey, across a wider constituency of stakeholders within the organisation of interest, of the views expressed by the informants.

PS: Re this comment in the third paragraph above: "And to treat the actual flows of funding that subsequently occur as the implementation of that theory". One challenge the EA team might find is that while it may have access to detailed budget documents, in many places it may not yet be clear where funds have been tagged as climate finance spending. That itself would be an important EA finding.

To be continued...

Sunday, April 24, 2022

Making small samples of large populations useful


I was recently contacted by someone working for a consulting firm that has a contract to evaluate the implementation of a large-scale health program covering a huge number of countries. Their client had questioned their choice of six countries as case studies, encouraging the consultancy firm to expand the number of country case studies, apparently because they thought this would make the sample of country cases more representative of the population of countries as a whole. However, the consulting firm wasn't planning to aggregate the results of the six country case studies and then make a claim about the generalisability of findings across the whole population of countries. Quite the opposite: the intention was that each country case study would provide a detailed perspective on one or more particular issues that was well exemplified by that case.

In our discussions, I ended up suggesting a strategy that might satisfy both parties, in that it addressed to some extent the question of generalisable knowledge while at the same time being designed to exploit the particularities of individual country cases. My suggestion was relatively simple, although implementing it might take a bit of work, making use of whatever data is available on the full population of countries. The suggestion was that for each individual case study the first step in the process would be to identify and explain the interesting particularities of that case, within the context of the evaluation's objectives. Then the evaluation team would look through whatever data is available on the whole population of countries, with the aim of identifying a sub-set of other countries that share similar characteristics (perhaps both generic {political, socio-economic indicators} and issue-specific) with the case study country. These would then be assumed to be the countries where the case study findings and recommendations could be most relevant. 
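A minimal sketch (Python, with invented indicator profiles) of that matching step:

```python
# Hypothetical indicator profiles (e.g. political, socio-economic, issue-specific scores)
countries = {
    "CaseStudyCountry": [0.7, 0.3, 0.9],
    "CountryA": [0.6, 0.4, 0.8],
    "CountryB": [0.1, 0.9, 0.2],
    "CountryC": [0.8, 0.2, 0.9],
}

def distance(a, b):
    # Simple Euclidean distance between two indicator profiles
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

case = countries["CaseStudyCountry"]
similar = sorted(
    (name for name in countries if name != "CaseStudyCountry"),
    key=lambda name: distance(case, countries[name]),
)
print(similar)  # other countries, ordered by similarity to the case study country
```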

As shown in the diagram below, it is possible that the sub-sets of countries relevant to each case study country might overlap to some extent. Even when one case study country is examined it is possible that it might have more than one particularity of interest, each of whose analysis might be usefully generalised to a limited number of other countries. And those different sub-sets of countries may themselves overlap to some extent (not shown below).  


Green nodes = case study countries
Red nodes = remainder of the whole population 
Red nodes connected to green nodes = countries that might find green node country case study findings relevant
Unconnected red nodes = Parts of whole population where case study findings not expected to have any relevance

Another possibility, perhaps seen as inadvisable in normal circumstances, would be to identify the countries relevant to any case study analysis after the fact, not necessarily or only before. After the case study had actually been carried out there would be much more information available on the relevant particularities of the case study country, which might make it easier to identify which other countries these findings were most relevant to. However, the client of the evaluation might need to be given some reassurance in advance – for example, by ensuring that at least some of these (red node) countries were identified at the beginning, before the case studies were underway.

PS: It is possible to quantify the nature of this kind of sampling. For example, in the above diagram
Total number of cases =  37  (red and green). 
Case study cases = 5 (14%)  of all cases
Relevant-to-case-study cases = 17 (46%) of all cases
Relevant-to->1-case-study cases = 3 (8%) of all cases
Not-relevant-to-case-study* cases = 15 (40%) of all cases 

*Bear in mind also that in many evaluations case studies will not be the only means of inquiry. For example, there are probably internal and external data sets that can be gathered and analysed re the whole set of 37 countries.

Conclusion: We should not be thinking in terms of binary options. It is not true that a case is either part of a representative sample of a whole population, or of interest only in itself. It can be relevant to a sub-set of the population.



Thursday, November 25, 2021

Choosing between simpler and more complex versions of a Theory of Change


Background: Over the last few months I have been involved as a member of the Evaluation Task Force convened by the Association of Professional Futurists – futurists being people who explore alternative futures using various foresight and scenario planning methods. The intention is to help strengthen the evaluation capacity of those doing this kind of work.

One part of this work will involve the development of various forms of introductory materials and guidelines documents. These will inevitably include discussion of the use of Theories of Change, and questions about appropriate levels of detail and complexity that they should involve.

In my dialogues with other Task Force members I have recently made the following comments, which may be of wider interest:

As already noted, a ToC can take various forms, from very simple linear versions to very complex network versions. 

I have a hypothesis that may be useful when we are developing guidance on use of ToC by futurists. In fact I have two hypotheses:

H1: A simple linear ToC is more likely to be appropriate when dealing with outcomes that are closest in time to a given foresight activity of interest. Outcomes that are more distant in time, happening long after the foresight activity has finished, would be better represented in a ToC that takes a more complex network (i.e. systems map type) form.

Why so? As time passes after a foresight activity, more and more other forces, of various kinds, are likely to come into play and influence the longer term outcome of interest. As a proportion of all influences, the foresight activity will grow progressively smaller and smaller. A type of ToC that takes into account this widening set of influences would seem essential.

H2: This need for a progressively more complex ToC, as the outcome of interest is located further away in time, can be moderated by a second variable: the social distance between those involved in the foresight activity and those involved in the outcome of interest. [Social distance is measured in social network analysis (SNA) terms by units known as "degrees", i.e. the number of person-to-person linkages needed for information to flow between one person and another.] So, if the outcome is a change in the functioning of the same organisation that the foresight exercise participants themselves belong to, this distance will be short, relative to an outcome relating to another organisation altogether, where there may be few if any direct links between the exercise participants and the staff of that organisation.

The implications of these two perspectives could be graphically represented in a scatter plot or two-by-two matrix e.g.
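One possible version, based on H1 and H2 above (the layout is an assumed reconstruction):

                                    Outcome close in time      Outcome distant in time
Participants socially close         Simple linear ToC          More complex ToC
Participants socially distant       More complex ToC           Most complex (systems map) ToC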


On reflection, this view probably needs some more articulation. Social distance will probably not be present in the form of a single pathway through a network of actors, especially given that any foresight activity will typically involve multiple participants, each with their own access to relevant networks. So there may be a third relevant dimension to think about here: the diversity of the participants. Greater diversity is plausibly associated with a greater range of social (and causal) pathways to the outcome of interest, and thus the need for more complex representations of the Theory of Change.





Monday, November 01, 2021

Exploring counterfactual histories of an intervention


Background

I and others are providing technical advisory support to the evaluation of a large complex multilateral health intervention, one which is still underway. The intervention has multiple parts implemented by different partners, and the surrounding context is changing. The intervention design is being adapted as time moves on. Probably not a unique situation.

The proposal

As one of a number of parts of a multi-year evaluation process I might suggest the following:

1. A timeline is developed describing what the partners in the intervention see as key moments in its history, defined as points where decisions have been taken to change, stop, or continue a particular course(s) of action. 

2. Take each of those moments as a kind of case study, which the evaluation team then elaborates in detail: (a) the options that were discussed, and others that were not, (b) the rationales for choosing for and against the discussed options at the time, (c) the current merit of those assessments, as seen in the light of subsequent events. [See more details below] 

The objective? 

To identify (a) how well the intervention has responded to changing circumstances and (b) any lessons that might be relevant to the future of the intervention, or generalisable to other similar interventions.

This seems like a form of (contemporary and micro-level) historical research, investigating actual versus possible causal pathways. It seems different from a Theory of Change (ToC) based evaluation, where the focus is on what was expected to happen and then what did happen. With this proposed historical research into decisions taken, the primary reference point is what did happen, and then what could have happened. 

It also seems different from what I understand to be a process tracing form of inquiry where, I think, the focus is on a particular hypothesised causal pathway, not the consideration of multiple alternative possible pathways, as would be the case within each of the series of decision making case studies proposed here. There may be a breadth rather than depth of inquiry difference here. [Though I may be over-emphasising the difference – I am just reading Mahoney, 2013 on the use of process tracing in historical research]

The multiple possible alternatives that could have been chosen are the counterfactuals I am referring to here.

The challenges?

As Henry Ford allegedly said, "History is just one damn thing after another". There are way too many events in most interventions where alternative histories could have taken off in a different direction. For example, at a normally trivial level, someone might have missed their train. So, to be practical but also systematic and transparent, the process of inquiry would need to focus on specific types of events involving particular kinds of actors – preferably ones where decisions were made about courses of action, such as Board Meetings.

And in such individual settings, how wide should the evaluation focus be? For example, only on where decisions were made to change something, or also on where decisions were made to continue doing something? And what about the absence of decisions even being considered, when they might have been expected to be considered? That is, decisions about decisions.

Reading some of the literature about counterfactual history, written by historians, there is clearly a risk of developing historical counterfactuals that stray too far from what is known to have happened, in terms of imagined consequences of consequences, etc. In response, some historians talk about the need to limit inquiries to "constrained counterfactuals" and the use of a "Minimal Rewrite Rule". [I will find out more about these]

Perhaps another way forward is to talk about counterfactual reasoning, rather than counterfactual history (Schatzberg, 2014). This seems to be closer to what the proposed line of inquiry might be all about, i.e. how the alternatives to what was actually decided and happened were considered (or not even considered) by the intervening agency. But even then, the evaluators' assessments of these reasonings would seem to necessarily involve some exploration of the consequences of these decisions, only some of which will have been observable, with others only conjectured.

The merits?

When compared to a ToC testing approach this historical approach does seem to have some merit. One of the problems of a ToC approach, particularly when applied to a complex intervention, is the multiplicity of possible causal pathways, relative to the limited time and resources available to an evaluation team. Choices usually need to be made, because not all avenues can be explored (unless some can be excluded by machine learning explorations or other quantitative processes of analysis).

However, on reflection, the contrast with a historical analysis of the reality of what actually happened is not so black and white. In large complex programmes there are typically many people working away in parallel, generating their own local and sometimes intersecting histories. There is not just one history from within which to sample decision making events. In this context a well articulated ToC may be a useful map – a means of identifying where to look for those histories in the making. 

Where next

I have since found that the evaluation team has been thinking along similar lines to myself, i.e. about the need to document and analyse the history of key decisions made. If so, the focus now should be on elaborating questions that would be practically useful, some of which are touched on above. Including:

1. How to identify and sample important decision making points

At least two options here:

1. Identify a specific type of event where it is known that relevant decisions are made, e.g. Board Meetings. This is a top-down deductive approach. The risk here is that many decisions will (and have to be) made outside of and prior to these events, and just receive official authorisation at these meetings. Snowball sampling backwards to the original decisions may be possible...

2. Try using the HCS method to partition the total span of time of interest into smaller (and nested) periods of time. Then identify the decisions that have generated the differences observed between these periods (differences which will say something about the intervention strategy). This is a more bottom-up inductive approach. 

2. How to analyse individual decisions.

This includes interesting issues such as how much use should be made of prior/borrowed theories about what constitutes good decision making, versus using a more inductive approach that emphasises understanding how the decisions were made within their own particular context. I am more in favour of the latter at present. 

Here is a VERY provisional framework/checklist for what could be examined, when looking at each decision making event:

In this context it may also be useful to think about a wider set of relevant ideas like the role of "path dependency" and "sunk costs"

3. How to aggregate/synthesise/summarise the analysis of multiple individual decision making cases

This is still being thought about, so caveat emptor:

Objectives were to identify: 

(a) how well the intervention has responded to changing circumstances. 

Possible summarising device? Rate each decision making event on the degree to which it was optimal under the circumstances, backed by a rubric explaining the rating values.

Cross-tabulate these ratings against ratings of the subsequent impact of the decision that was made? An "increased/decreased potential impact" scale? Likewise supported by a rubric (i.e. an annotated scale).

(b) any lessons that might be relevant to the future of the intervention, or generalisable to other similar intervention.

Text summary of the implications identified from the analysis of each decision making event, with priority given to the more impactful/consequential decisions?

Lot more thinking yet to be done here... 

Miscellaneous points of note hereafter...

Postscript 1: There must be a wider literature on this type of analysis, where there may be some useful experiences. "Ninety per cent of problems have already been solved in some other field. You just have to find them." McCaffrey, T. (2015), New Scientist.

Postscript 2: I just came across the idea of an "even if..." type counterfactual, as in "Even if I did catch the train, I would still not have got the job". This is where an imagined action, different from what really happened, still leads to the same outcome as the real action. 


Conceptual Thinking by Andy McMahon
Wishmi, CC BY-SA 4.0, via Wikimedia Commons

Sunday, August 22, 2021

Reconciling the need for both horizontal and vertical dimensions in a Theory of Change diagram

 

In their usual table form, Logical Frameworks are strong on their horizontal dimension but weak on their vertical dimension. On the horizontal dimension is the explanation of what kind of data will be collected and used to measure the changes that are described. This is good for accountability. On the vertical dimension is the explanation of how events at one level will connect to and cause events at another level. This is good for learning. But unfortunately LogFrames often simply provide lists of events at each level, with relatively little description of which event will connect to which, especially where multiple and mixed sets of connections might be expected between events. On the other hand, diagrammatic versions of a Theory of Change tend to be much better at explicating the various causal pathways at work, but weak on the information they provide on the horizontal dimension – on how the various events will be observed and measured. Both of these problems reflect a lack of space to do both things, and the different relative priorities pursued within those constraints.

The Donor Committee for Enterprise Development (DCED) has produced a web-page-based Theory of Change to explain its way of working, which I think points the way to reconciling these conflicting needs. At first glance, here is what you see when you visit this page of their website.

The different causal pathways are quite visible, more so than within a standard LogFrame table format. But another common weakness of diagrammatic versions of Theories of Change is the lack of explanation of what is going on within each of these pathways. The DCED addressed this problem by allowing visitors to click on a link and be taken to another web page, where they get a detailed text description of the available evidence, plus any assumptions, about the causal process(es) connecting the events joined by the arrow.

The one weakness in this DCED ToC diagram is the lack of detail about the horizontal dimension – how the various events described in the diagram will be observed/measured, and by whom, when and where. But this is clearly resolvable by using the same approach as with the links: enable users to click on any event and be taken to a web page where this information is provided for that specific event. As shown below:






Monday, July 19, 2021

Diversity and complexity? Where should we focus our attention?

This posting has been prompted by Michael Bamberger's recent two blog postings on "Building complexity into development evaluation" on the 3ie website: Evidence Matters: Towards equitable, inclusive and sustainable development.

I started to make some comments underneath each of the two postings but have now decided to try to organise and extend my thoughts here. 

My starting point is an ongoing concern about how unproductive the discussion about complexity has been (especially in relation to evaluation). Like an elephant giving birth to a mouse, has been my chosen metaphor in the past. There probably is some rhetorical overkill here, but it does express some of my felt frustration with the limited value of the now quite extended discussion.

Michael's blog postings have not allayed these concerns. My concerns start with the idea of measuring complexity: both how you do it, and how measuring would in fact be useful. Measuring complexity is the first step in Michael's proposed "practical five-step approach to address complexity-responsive evaluation in a systematic way". A lot of ink has already been spilled on the topic of measurement. A useful summary can be found in Melanie Mitchell's widely read Complexity: A Guided Tour (2009: 94-114), and in Lloyd (1998), who counted at least 40 different ways of measuring complexity. But I can't see any references back to any of these methods, suggesting that not much is being learned from past efforts, which is a pity.  

Along with the challenge of how to do it is the question of why you would want to do it – how might it be useful? The second blog posting explains that "In addition to providing stakeholders with an understanding of what complexity means for their program, the checklist also helps decide whether the program is sufficiently complex to merit the additional investment of time and resources required to conduct a complexity-focused evaluation".

The second of these outcomes might be the more observable consequence, so my first question here is: where is the cut-off point, in a checklist-derived score, that would at least inform such a decision, and how is that cut-off point justified? The checklist has 25 attribute questions spread over 4 dimensions, but this point has not yet been made clear.

My next question is how the results of this measurement exercise inform the next of the five steps: "Breaking the project into evaluable components and identifying the units of analysis for the component evaluation". So far, I have not found any answers to this question either. PS 2021 07 29: Michael did reply to a comment of mine raising the same issue, and suggested that high versus low scores on rated complexity might be one way.

Another concern, which I have already written about in my comment on the blog postings, is that complexity seems to be very much "in the eye of the beholder", i.e. depending on who you are and what you are looking for. My partner sees much more complexity in the design and behaviour of moths and butterflies than I do. A friend of mine sees much more complexity in the performance of classical music than I do. Such observations prompt me to think that perhaps we should not put too much effort into trying to objectively measure complexity. Rather, perhaps we should take a more ethnographic perspective on complexity – i.e. we should pay attention to where people are seeing complexity and where they are not, and to the implications thereof.

If we accept this suggestion, the challenge of identifying complexity remains with us, but in a different form. So I have another suggestion, which is to pay much more attention to diversity, as an important concept related to complexity. As Scott Page has well described, there is a close and complicated relationship between diversity and complexity. Nevertheless, there are some practically useful points to note about the concept of diversity.  

Firstly, the presence of diversity is indicative of the absence of a common constraint, and of the presence of many different causal influences. So it can be treated as a proxy, indicating the presence of complex processes.  

Secondly, there is an extensive, more internally consistent, and practically useful set of ways in which diversity can be measured. These mainly have their origins in studies of biodiversity, but have a much wider applicability. Diversity can also be measured in other spheres: in human relationships (using social network analysis tools) and in how people see the world (using forms of ethnographic enquiry known as pile or card sorting).
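For example, a minimal sketch (Python, with invented counts) of one widely used biodiversity-derived measure, the Shannon diversity index:

```python
import math

# Hypothetical counts of cases falling into different categories
counts = [50, 30, 15, 5]
total = sum(counts)

# Shannon diversity index: higher values = more categories, more evenly spread
shannon = -sum((n / total) * math.log(n / total) for n in counts if n > 0)
print(round(shannon, 3))
```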

Thirdly, diversity has some potentially global relevance, as a value and as an objective. Diversity of behaviour can be indicative of choice and empowerment. 

Fourthly, diversity can also be seen as an important independent variable, enabling adaptation and creativity. 

All this is not to say that diversity cannot also be problematic. Some forms of diversity in the present could severely limit the extent of diversity in the future – for example, if within a population there was a wide range of different types of guns held by households, and many different views on how and when they could legitimately be used. At the more mundane level, within organisations different kinds of tasks may benefit from different levels of diversity within the groups addressing those tasks. So diversity presents a useful and important problematic in a way that the concept of complexity does not. What forms of diversity do we want to see, and see sustained over time, and how can they be enabled? Where do we want choice, and where should we accept restriction?

Arguing for more attention to diversity, rather than complexity, does not mean there needs to be a whole new school of evaluation developed around this idea (Get Brand X Evaluation, just hot off the press! Uhh... No). It is consistent with a number of ideas already found useful, including equifinality (an outcome can arise from multiple different causes) and multifinality (a cause can have multiple different outcomes), and the idea of multiple conjunctural causation. It is also compatible with a social constructionist and interpretive perspective on social reality.




Monday, June 14, 2021

Paired case comparisons as an alternative to a configurational analysis (QCA or otherwise)

[Take care, this is still very much a working draft!] Criticisms and comments welcome though

The challenge

The other day I was asked for some advice on how to implement a QCA type of analysis within an evaluation that was already fairly circumscribed in its design – circumscribed both by the commissioner and by the team proposing to carry out the evaluation. The commissioner had already indicated that they wanted a case study orientated approach, and had even identified the maximum number of case studies that they wanted to see (ten). While the evaluation team could see the potential use of a QCA type of analysis, they were already committed to undertaking a process type evaluation, and did not want a QCA type analysis to dominate their approach. In addition, it appeared that there was already a quite developed conceptual framework that included many different factors which might be contributing causes of the outcomes of interest.

As is often the case, there seemed to be a shortage of cases and an excess of potentially explanatory variables. In addition, there were doubts within the evaluation team as to whether a thorough QCA analysis would be possible or justifiable given the available resources and priorities.

Paired case comparisons as the alternative

My first suggestion to the evaluation team was to recognise that there is some middle ground between a cross-case analysis involving medium to large numbers of cases, and a within-case analysis. As described by Rihoux and Ragin (2009), a QCA analysis will use both, going back and forth, using one to inform the other, over a number of iterations. The middle ground between these two options is case comparisons – particularly comparisons of pairs of cases. Although in the situation described above there will be a maximum of 10 cases that can be explored, the number of pairs of these cases that can be compared is still quite big (45). With this sort of number, some strategy is necessary for making choices about the types of pairs of cases that will be compared. Fortunately there is already a large literature on case selection. My favourite summary is Gerring, J., & Cojocaru, L. (2015), Case-Selection: A Diversity of Methods and Criteria. 

My suggested approach was to use what is known as the Confusion Matrix as the basis for structuring the choice of cases to be compared. A Confusion Matrix is a simple truth table, showing a combination of two sets of possibilities (rows and columns), and the incidence of those possibilities (cell values). For example, as follows:
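                                     Outcome present       Outcome absent
Case attributes fit my theory        1. True Positive      2. False Positive
Case attributes do not fit           3. False Negative     4. True Negative

(Cell values would show the number of cases of each type.)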


Inside the Confusion Matrix are four types of cases: 
  1. True Positives where there are cases with attributes that fit my theory and where the expected outcome is present
  2. False Positives, where there are cases with attributes that fit my theory but where the expected outcome is absent
  3. False Negatives, where there are cases which do not have attributes that fit my theory but where nevertheless the outcome is present
  4. True Negatives, where there are cases which do not have attributes that fit my theory and where the outcome is absent as expected
Both QCA and supervised machine learning approaches are good at identifying individual (or packages of) attributes which are good predictors of when outcomes are present or when they are absent – in other words, where there are large numbers of True Positive and True Negative cases – and the incidence of the exceptions: the False Positives and False Negatives. But this type of cross-case led analysis does not seem to be available as an option to the evaluation team mentioned above.

1. Starting with True Positives

So my suggestion has been to look at the 10 cases at hand, and start by focusing on those cases where the outcome is present (first column). Focus on the case that is most similar to the others with the outcome present, because findings about this case may be more likely to apply to the others. (See below on measuring similarity.) When examining that case, identify one or more attributes which are the most likely explanation for the outcome being present. Note here that this initial theory is coming from a single within-case analysis, not a cross-case analysis. The evaluation team will now have a single case in the category of True Positive. 

2. Comparing False Positives and True Positives

The next step in the analysis is to identify cases which can be provisionally described as False Positives. Start by finding a case which has the outcome absent. Does it have the same theory-relevant attributes as the True Positive? If so, retain it as a False Positive. Otherwise, move it to the True Negative category. Repeat this move for all remaining cases with the outcome absent. From among all those qualifying as False Positives, find one which is otherwise as similar as possible in all its other attributes to the True Positive case. This type of analysis choice is called MSDO, standing for "most similar design, different outcome" – see the de Meur reference below. Also see below on how to measure this form of similarity. 

The aim here is to find how the causal mechanisms at work differ. One way to explore this question is to look for an additional attribute that is present in the True Positive case but absent in the False Positive case, despite those cases otherwise being most similar. Or, an attribute that is absent in the True Positive but present in the False Positive case. In the former case the missing attribute could be seen as a kind of enabling factor, whereas in the latter case it could be seen as more like a blocking factor. If neither can be found by comparison of the coded attributes of the cases, then a more intensive examination of raw data on the cases might still identify them, and lead to an update/elaboration of the theory behind the True Positive case. Alternatively, that examination might suggest measurement error is the problem, and that the False Positive case needs to be reclassified as a True Positive.

3. Comparing False Negatives and True Positives

The third step in the analysis is to identify at least one most relevant case which can be described as a False Negative.  This False-Negative case should be one that is as different as possible in all its attributes to the True Positive case. This type of analysis choice is called MDSO, standing for "most different design, same outcome". 

The aim here should be to try to identify whether the same or a different causal mechanism is at work, when compared to that seen in the True Positive case. One way to explore this question is to look for one or more attributes that both the True Positive and False Negative cases have in common, despite otherwise being "most different". If found, and if associated with the causal theory in the True Positive case, then the False Negative case can now be reclassified as a True Positive. The theory describing the now two True Positive cases can be seen as provisionally "necessary" for the outcome, until another False Negative case is found and examined in a similar fashion. If the causal mechanism seems to be different, then the case remains a False Negative.

Both the second and third step comparisons described above will help to: (a) elaborate the details, and (b) establish the limits of the scope, of the theory identified in step one. This suggested process uses the Confusion Matrix as a kind of very simple chess board, where pieces (aka cases) are introduced onto the board one at a time, and then sometimes moved to other adjacent positions (depending on their relation to other pieces on the board), or the theory behind their chosen location is updated.

If there are only ten cases available to study, and these have an even distribution of outcomes present and absent, then this three-step process of analysis could be repeated five times, i.e. once for each case where the outcome was present. This would involve up to 10 case comparisons, out of the 45 possible pairwise comparisons among 10 cases (10 × 9 / 2 = 45).

Measuring similarity

The above process depends on the ability to make systematic and transparent judgements about similarity. One way of doing this, which I have previously built into an Excel app called EvalC3, is to start by describing each case with a string of binary coded attributes, of the same kind as used in QCA and in some forms of supervised machine learning. An example set of workings can be seen in this Excel sheet, showing an imagined data set of 10 cases with 10 different attributes, and then the calculation and use of Hamming distance as the similarity measure to choose cases for the kinds of comparisons described above. That list of attributes, and the Hamming distance measure, is likely to need updating as the investigation of False Positives and False Negatives proceeds.
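For what it is worth, the Hamming distance itself is a one-line calculation. A minimal sketch, here on 0/1 attribute strings of the kind used in the Excel workings (the example codings are invented):

    def hamming_distance(a: str, b: str) -> int:
        """Count the attribute positions on which two equal-length codings disagree."""
        assert len(a) == len(b)
        return sum(x != y for x, y in zip(a, b))

    print(hamming_distance("1101000110", "1001010110"))  # -> 2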

Incidentally, the more attributes that have been coded per case, the more discriminating this kind of approach can become. This is in contrast to cross-case analysis, where an increase in the number of attributes per case is usually problematic.

Related sources

For some of my earlier thoughts on case comparative analysis see here. These were developed for use within the context of a cross-case analysis process, but the argument above is about how to proceed when the starting point is a within-case analysis.

See also:
  • Nielsen, R. A. (2014). Case Selection via Matching.
  • de Meur, G., Bursens, P., & Gottcheiner, A. (2006). MSDO/MDSO Revisited for Public Policy Analysis. In B. Rihoux & H. Grimm (Eds.), Innovative Comparative Methods for Policy Analysis (pp. 67–94). Springer US.
  • de Meur, G., & Gottcheiner, A. (2012). The Logic and Assumptions of MDSO–MSDO Designs. In The SAGE Handbook of Case-Based Methods (pp. 208–221). Sage.
  • Rihoux, B., & Ragin, C. C. (Eds.). (2009). Configurational Comparative Methods: Qualitative Comparative Analysis (QCA) and Related Techniques. Sage. See pages 28–32 for a description of "MSDO/MDSO: a systematic procedure for matching cases and conditions".
  • Goertz, G. (2017). Multimethod Research, Causal Mechanisms, and Case Studies: An Integrated Approach. Princeton University Press.

Monday, May 24, 2021

The potential use of Scenario Planning methods to help articulate a Theory of Change


Over the past few months I have been engaged in discussions with other members of the Association of Professional Futurists (APF) Evaluation Task Force about how activities and outcomes in the field of foresight/alternative futures/scenario planning can usefully be evaluated.

Just recently the subject of Theories of Change has come up, and it struck me that there are at least three ways of looking at Theories of Change in this context:

The first perspective: A particular scenario (i.e. an elaborated view of the future) can contain within it a particular theory of change. One view of the future may imply that technological change will be the main driver of what happens. Another might emphasise the major long-term causal influence of demographic change.

The second perspective: Those organising a scenario planning exercise are also likely to have, either explicitly or implicitly or a mixture of both, a Theory of Change about how their exercise is expected to influence the participants, and about the influence those participants will have on others.

The third perspective looks in the opposite direction, and raises the possibility that in other settings a Theory of Change may contain a particular type of future scenario. I'm thinking here particularly of Theories of Change as used by organisations planning economic and/or social interventions in developed and developing economies. This territory has been explored recently by Derbyshire (2019), in a paper titled "Use of scenario planning as a theory-driven evaluation tool" (Futures & Foresight Science, 1(1), 1–13). In that paper he puts forward a good argument for the use of scenario planning methods as a way of developing improved Theories of Change. Improved in a number of ways: firstly, through a much more detailed articulation of the causal processes involved; secondly, through more adequate attention to risks and unintended consequences; and thirdly, through more adequate involvement of stakeholders in these two processes.

Both the task force discussions and my revisiting of the paper by Derbyshire have prompted me to think about the potential use of a ParEvo exercise as a means of articulating the contents of a Theory of Change for a development intervention, and to start to reach out to people who might be interested in testing such possibilities. The following possibilities come to mind:

1.  A ParEvo exercise could be set up to explore what happens when X project is set up in Y circumstances with Z resources and expectations.  A description of this initial setting would form the seed paragraph(s) of the ParEvo exercise. The subsequent iterations would describe the various possible developments that took place over a series of calendar periods, reflecting the expected lifespan of the intervention, and perhaps a limited period thereafter. The participants would be, or act in the role of, different stakeholders in the intervention. Commentators of the emerging storylines could be independent parties with different forms of expertise relevant to the intervention and its context. 

2.  As with all previous ParEvo exercises to date, after the final iteration there would be an evaluation stage, completed by at least the participants and the commentators, but possibly also by others in observer roles. A copy of a recent evaluation survey form can be seen here, showing the types of evaluative judgements that would be sought from those involved and observing.

3.  There seem to be at least two possible ways of using the storylines that have been generated to inform the design of a Theory of Change. One is to take whole storylines as units of analysis. For example, a storyline evaluated as both most likely and most desirable, by more participants than any other storyline, would seem an immediately useful source of detailed information about a causal pathway that should go into a Theory of Change. Other storylines identified as most likely but least desirable would warrant attention as risks that also need to be built into a Theory of Change, along with any potential means of preventing and/or mitigating those risks. Other storylines identified as least likely but most desirable would warrant attention as opportunities, also to be built into a Theory of Change, along with means of enabling and exploiting those opportunities. (See the first sketch below this list.)

4.  The second possible approach would give less respect to the existing branch structure, and focus more on the contents of individual contributions, i.e. paragraphs in the storylines. Individual contributions could be sorted into categories familiar to those developing Theories of Change: activities, outputs, outcomes, and impacts. These could then be recombined into one or more causal pathways that the participants thought were both possible and desirable. In effect, a kind of linear jigsaw puzzle. If the four categories of event types were seen as being too rigid a schema (a reasonable complaint!), but still an unfortunate necessity, they could be introduced after the recombination process, rather than before. Either way, it would probably be useful to include another evaluation stage, making a comparative evaluation of the different combinations of contributions that had been created, using the same metrics as are already being used with existing ParEvo exercises. (See the second sketch below this list.)
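As a minimal sketch of the first approach, assuming (hypothetically) that each storyline's evaluation votes have been tallied into simple counts; the field names and vote cutoffs below are invented, and a real exercise would use the survey results:

    # A hypothetical tally of evaluation votes per storyline.
    storylines = [
        {"id": "S1", "likely_votes": 7, "desirable_votes": 6},
        {"id": "S2", "likely_votes": 6, "desirable_votes": 1},
        {"id": "S3", "likely_votes": 1, "desirable_votes": 7},
    ]

    def classify(s, cutoff=4):
        """Sort a storyline into the Theory of Change roles described above."""
        likely = s["likely_votes"] >= cutoff
        desirable = s["desirable_votes"] >= cutoff
        if likely and desirable:
            return "candidate causal pathway for the Theory of Change"
        if likely:
            return "risk, with means of prevention/mitigation to be added"
        if desirable:
            return "opportunity, with means of enabling/exploiting to be added"
        return "set aside for now"

    for s in storylines:
        print(s["id"], "->", classify(s))

And a similarly hypothetical sketch of the second approach, recombining categorised contributions into candidate pathways (the category labels follow the standard Theory of Change chain; the contribution texts are invented):

    from itertools import product

    # Individual contributions sorted into Theory of Change categories.
    contributions = {
        "activities": ["train community facilitators"],
        "outputs": ["facilitated village planning meetings"],
        "outcomes": ["village plans adopted", "local budgets reallocated"],
        "impacts": ["improved local services"],
    }

    # Each combination of one contribution per category is a candidate causal
    # pathway - the "linear jigsaw puzzle" - to be evaluated comparatively.
    order = ["activities", "outputs", "outcomes", "impacts"]
    for pathway in product(*(contributions[c] for c in order)):
        print(" -> ".join(pathway))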


More ideas will follow...


The beginnings of a bibliography...

Derbyshire, J. (2019). Use of scenario planning as a theory-driven evaluation tool. FUTURES & FORESIGHT SCIENCE, 1(1), 1–13. https://doi.org/10.1002/ffo2.1
Ganguli, S. (2017). Using Scenario Planning to Surface Invisible Risks. Stanford Social Innovation Review. https://ssir.org/articles/entry/using_scenario_planning_to_surface_invisible_risks
