Monday, December 07, 2020

Has the meaning of impact evaluation been hijacked?

 



This morning I have been reading, with interest, Giel Ton's 2020 paper: 
Development policy and impact evaluation: Contribution analysis for learning and accountability in private sector development 

 I have one immediate reaction, which I must admit I have been storing up for some time. It is to do with what I would call the hijacking of the meaning or definition of 'impact evaluation'. These days impact evaluation seems to be all about causal attribution. But I think this is an overly narrow definition, and one that almost self-servingly promotes the interests of those advocating methods specifically dealing with causal attribution, e.g. experimental studies, realist evaluation, contribution analysis and process tracing. (PS: This is not something I am accusing Giel of doing!)

 I would like to see impact evaluations widen their perspective in the following way:

1. Description: Spend time describing the many forms of impact a particular intervention is having. I think the technical term here is multifinality. In a private-sector development programme, multifinality is an extremely likely phenomenon.  I think Giel has in effect said so at the beginning of his paper: " Generally, PSD programmes generate outcomes in a wide range of private sector firms in the recipient country (and often also in the donor country), directly or indirectly."

 2. Valuation: Spend time seeking relevant participants' valuations of the different forms of impact they are experiencing and/or observing. I'm not talking here about narrow economic definitions of value, but the wider moral perspective on how people value things - the interpretations and associated judgements they make. Participatory approaches to development and evaluation in the 1990s gave a lot of attention to people's valuations of their experiences, but this perspective seems to have disappeared into the background in most discussions of impact evaluation these days. In my view, how people value what is happening should be at the heart of evaluation, not an afterthought. Perhaps we need to routinely highlight the stem of the word Evaluation: valuation.

 3. Explanation: Yes, do also seek explanations of how different interventions worked or failed to work (aka causal attribution), paying attention of course to heterogeneity, in the forms of both equifinality and multifinality. Please note: I am not arguing that causal attribution should be ignored - just placed within a wider perspective! It is part of the picture, not the whole picture.

 4. Prediction: And in the process don't be too dismissive of the value of identifying reliable predictions that may be useful in future programmes, even if the causal mechanisms are not known or perhaps are not even there. When it comes to future events, there are some that we may be able to change or influence, because we have accumulated useful explanatory knowledge. But there are also many which we acknowledge are beyond our ability to change, but where, with good predictive knowledge, we may still be able to respond appropriately.

Two examples, one contemporary, one very old: If someone could give me a predictive model of sharemarket price movements that had even a modest 55% accuracy I would grab it and run, even though the likelihood of finding any associated causal mechanism would probably be very slim.  Because I’m not a billionaire investor, I have no expectation of being able to use an explanatory model to actually change the way markets behave.  But I do think I could respond in a timely way if I had relevant predictive knowledge.

 Similarly with the sun: people have had predictive knowledge about its movements for millennia, and this informed their agricultural practices. But even now that we have much improved explanatory knowledge about those movements, few would think that this will help us change the way the seasons progress.

 I will now continue reading Giel's paper…


2021 02 19: I have just come across a special issue of the Evaluation Journal of Australasia on the subject of values. Here is the Editorial section.

Sunday, December 06, 2020

Quality of Evidence criteria that can be applied to Most Significant Change (MSC) stories

 


Two recent documents have prompted me to do some thinking on this subject

If we view Most Significant Change (MSC) stories as evidence of change (and of what people think about those changes), what should we look for in terms of quality? What are the attributes of quality evidence?

Some suggestions that others might like to edit or add to, or even delete...

1. There is clear ownership of an MSC story and the reasons for its selection by the storyteller. Without this, there is no possibility of clarifying any elements of the story and its meaning, let alone more detailed investigation/verification.

2. There was some protection against random/impulsive choice. The person who told the story was asked to identify a range of changes that had happened, before being asked to identify the one which was most significant.

3. There was some protection against interpreter/observer error. If another person recorded the story, did they read back their version to the storyteller, to enable them to make any necessary corrections?

4. There has been no violation of ethical standards: Confidentiality has been offered and then respected. Care has been taken not only with the interests of the storyteller but also of those mentioned in a story.

5. Have any intended sources of bias been identified and explained? Sometimes it may be appropriate to ask about " most significant changes caused by....xx..." or "most significant changes of ...x ...type"

6. Have any unintended sources of bias been anticipated and responded to? For example, by also asking about "most significant negative changes " or "any other changes that are most significant"?

7. There is transparency of sources. If stories were solicited from a number of people, we know how these people were identified and who was excluded and why so. If respondents were self-selected we know how they compare to those that did not self-select.

8. There is transparency of the selection process: If multiple stories were initially collected, and the most significant of these were then selected, reported and used elsewhere, the details of the selection process should be available, including (a) who was involved, (b) how choices were made, and (c) the reasons given for the final choice(s) made.

9. Fidelity: Has the written account of why a selection panel chose a story as most significant done the participants' discussion justice? Was it sufficiently detailed, as well as being truthful?

10. Have potential biases in the selection processes been considered? Do most of the finally selected most significant change stories come from people of one kind versus another e.g. men rather than women, one ethnic or religious group versus others? In other words, is the membership of the selection panel transparent? (thanks to Maleeha below).

11. ...add your thoughts here (using the Comment facility below).

Please note 

1. That in focusing here on "quality of evidence" I am not suggesting that the only use of MSC stories is to serve as forms of evidence. Often the process of dialogue is immensely important, and it is the clarification of values, of who values what and why, that is most important. And there are bound to be other purposes served as well.

2. (Perhaps the same point, expressed in another way) The above list is intentionally focused on minimal rather than optimal criteria. As noted above, a major part of the MSC process is about the discovery of what is of value to the participants.

For more on the MSC technique, see the resources here.




Wednesday, October 28, 2020

Mapping the structure of cooperation


Over the last year or so I have been developing a web application known as ParEvo. The purpose of ParEvo is to enable people to take part in a participatory scenario planning process, online. How the process works is described in detail on this website. The main point that I need to make clear here in this post is that the process consists of people writing short paragraphs of text describing what might happen next. Participants choose which previously written paragraphs their paragraphs should be added to. In turn, other participants may choose to add their own paragraph of text to these. The net result is a series of branching storylines describing alternative futures, which can vary in the way that they are constructed, i.e. who was involved in the construction of which storyline.

While the ParEvo app shows all the contributions and how they connect into different storylines in the form of a tree structure, it does this anonymously – it is not possible for participants, or observers, to see who wrote which contribution. However, one of the advantages of using ParEvo is that an exercise facilitator can download the otherwise hidden data on whose text contribution was added to whose. This data can be downloaded in the form of an "adjacency matrix", with the participants listed by row and the same participants listed by column. Each cell in the matrix shows how often the row participant added a contribution to an existing contribution made by the column participant. This kind of matrix data is easy to visualise as a social network structure. Here is an anonymised example from one ParEvo exercise.

Blue nodes = participants. Grey links = contributions added to the contributions of the participant pointed to. Red links = reciprocated contributions. Big nodes have many links, small nodes have few links.

Another way of summarising the structure of participation is to create a scatterplot, as in the example shown below. The X-axis represents the number of other participants who have added contributions to one's own contributions (SNA term = indegree). The Y-axis represents the number of other participants that one has added one's own contributions to (SNA term = outdegree). The data points in the scatterplot identify the individual participants in the exercise and their characteristics as described by the two axes. The four corners of the scatterplot can be seen as four extreme types of participation:

– Isolates: who only build on their own contributions and nobody else builds on these
– Leaders: who only build on their own contributions, but others also build on these
– Followers: who only build on others' contributions, but others do not build on theirs
– Connectors: who build on others' contributions and others build on theirs

The maximum value of the Y-axis is defined by the number of iterations in the exercise. The maximum value of the X-axis is defined by the number of participants in the exercise. The graph below needs updating to show an X-axis maximum value of 10, not 8.
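For readers who want to reproduce this kind of classification from the downloaded adjacency matrix, here is a minimal Python sketch. It assumes the matrix has been read into a dictionary of dictionaries; the participant names and counts are invented for illustration, not taken from any real exercise.

```python
# Sketch: classifying ParEvo participants by who they build on and who builds on them.
# matrix[row][col] = number of times the row participant added a contribution
# to one made by the column participant. Illustrative data only.

matrix = {
    "A": {"A": 2, "B": 0, "C": 1},
    "B": {"A": 3, "B": 1, "C": 0},
    "C": {"A": 0, "B": 2, "C": 0},
}

participants = list(matrix)

def classify(p):
    # outdegree: how many *other* participants p has built on
    out_deg = sum(1 for q in participants if q != p and matrix[p].get(q, 0) > 0)
    # indegree: how many *other* participants have built on p's contributions
    in_deg = sum(1 for q in participants if q != p and matrix[q].get(p, 0) > 0)
    if out_deg == 0 and in_deg == 0:
        label = "Isolate"    # builds only on own contributions, nobody builds on theirs
    elif out_deg == 0:
        label = "Leader"     # builds only on own contributions, but others build on these
    elif in_deg == 0:
        label = "Follower"   # builds on others, but others do not build on theirs
    else:
        label = "Connector"  # builds on others and others build on theirs
    return in_deg, out_deg, label

for p in participants:
    print(p, classify(p))
```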




One observer of this ParEvo exercise commented: "It makes sense to me: those three leaders are the three most senior staff in the group, and it makes sense that they might have produced contributions that others would follow, and that they might be the people most sure of their own narrative."

What interested me was the absence of any participants in the Isolates and Connectors corners of the scatter plot. The absence of isolates is probably a good thing within an organisation, though it could mean a reduced diversity of ideas overall. The absence of Connectors seems more problematic - it might suggest a situation where there are multiple conceptual silos/cliques that are not "talking" to each other. It will be interesting to see in other ParEvo exercises what this scatter plot structure looks like, and how the owners of those exercises interpret them.


Saturday, September 26, 2020

EvalC3 versus QCA - compared via a re-analysis of one data set


I was recently asked whether EvalC3 could be used to do a synthesis study, analysing the results from multiple evaluations. My immediate response was yes, in principle. But it probably needs more thought.

I then recalled that I had seen somewhere an Oxfam synthesis study of multiple evaluation results that used QCA. This is the reference, in case you want to read it, which I suggest you do.

Shephard, D., Ellersiek, A., Meuer, J., & Rupietta, C. (2018). Influencing Policy and Civic Space: A meta-review of Oxfam's Policy Influence, Citizen Voice and Good Governance Effectiveness Reviews. Oxfam. https://policy-practice.oxfam.org.uk/publications/*

Like other good examples of QCA analyses in practice, this paper includes the original data set in an appendix, in the form of a truth table.  This means it is possible for other people like me to reanalyse this data using other methods that might be of interest, including EvalC3.  So, this is what I did.

The Oxfam dataset includes five conditions, a.k.a. attributes of the programs that were evaluated, along with two outcomes, each pursued by some of the programs. In total there was data on the attributes and outcomes of fifteen programs concerned with expanding civic space and twenty-two programs concerned with policy influence. These were subject to two different QCA analyses.

The analysis of civic space outcomes

In the Oxfam analysis of the fifteen programs concerned with expanding civic space, the QCA analysis found four "solutions", a.k.a. combinations of conditions, which were associated with the outcome of expanded civic space. Each of these combinations of conditions was found to be sufficient for the outcome to occur. Together they accounted for the outcomes found in 93%, or fourteen, of the fifteen cases. But there was overlap in the cases covered by each of these solutions, leaving open the question of which solution best fitted/explained those cases. Six of the fourteen cases had two or more solutions that fitted them.

In contrast, the EvalC3 analysis found two predictive models (=solutions) which are associated with the outcome of expanded civic space.  Each of these combinations of conditions was found to be sufficient for the outcome to occur.  Together they accounted for all fifteen cases where the outcome occurred.  In addition, there was no overlap in the cases covered by each of these models.

The analysis of policy influencing outcomes

In the Oxfam analysis of the twenty-two programs concerned with policy influencing, the QCA analysis found two solutions associated with the outcome of successful policy influence. Each of these was sufficient for the outcome, and together they accounted for all the outcomes. But there was some overlap in coverage: one of the six cases was covered by both solutions.

In contrast, the EvalC3 analysis found one predictive model which was necessary and sufficient for the outcome, and which accounted for all the outcomes achieved.

Conclusions?

Based on parsimony alone, the EvalC3 solutions/predictive models would be preferable. But parsimony is not the only appropriate criterion for evaluating a model. Arguably a more important criterion is the extent to which a model fits the details of the cases covered when those cases are closely examined. So, really, what the EvalC3 analysis has done is to generate some extra models that need close attention, in addition to those already generated by the QCA analysis. The number of cases covered by multiple models has been increased.

In the Oxfam study, there was no follow-on attention given to resolving what was happening in the cases that were identified by more than one solution/predictive model.  In my experience of reading other QCA analyses, this lack of follow-up is not uncommon.

However, in the Oxfam study for each of the solutions found at least one detailed description was given of an example case that that solution covered.  In principle, this is good practice. But unfortunately, as far as I can see, it was not clear whether that particular case was exclusively covered by that solution, or part of a shared solution.  Even amongst those cases which were exclusively covered by a solution there are still choices that need to be made (and explained) about how to select particular cases as exemplars and/or for a detailed examination of any causal mechanisms at work.  

QCA software does not provide any help with this task. However, I did find some guidance in a specialist text on QCA: Schneider, C. Q., & Wagemann, C. (2012). Set-Theoretic Methods for the Social Sciences: A Guide to Qualitative Comparative Analysis. Cambridge University Press. https://doi.org/10.1017/CBO9781139004244 (parts of this book are a heavy read, but overall it is very informative). In section 11.4, titled Set-Theoretic Methods and Case Selection, the authors note: 'Much emphasis is put on the importance of intimate case knowledge for a successful QCA. As a matter of fact, the idea of QCA as a research approach of going back-and-forth between ideas and evidence largely consists of combining comparative within-case studies and QCA as a technique. So far, the literature has focused mainly on how to choose cases prior to and during, but not after, a QCA (where by QCA we here refer to the analytic moment of analysing a truth table). It is therefore puzzling that little systematic and specific guidance has so far been provided on which cases to select for within-case studies based on the results of, i.e. after, a QCA…' The authors then go on to provide some guidance (a total of 7 pages out of 320).

In contrast to QCA software, EvalC3 has a number of built-in tools and some associated guidance on the EvalC3 website, on how to think about case selection as a step between cross-case analysis and subsequent within-case analysis.  One of the steps in the seven-stage EvalC3 workflow (Compare Models) is the generation of a table that compares the case coverage of multiple selected alternative models found by one’s analysis to that point.  This enables the identification of cases which are covered by two or more models.  These types of cases would clearly warrant subsequent within-case investigation.

Another step in the EvalC3 workflow called Compare Cases, provides another means of identifying specific cases for follow-up within-case investigations.  In this worksheet individual cases can be identified as modal or extreme examples within various categories that may be of interest e.g. True Positives, False Positives, et cetera.  It is also possible to identify for a chosen case what other case is most similar and most different to that case, when all its attributes available in the dataset are considered.  These measurement capacities are backed up by technical advice on the EvalC3 website on the particular types of questions that can be asked in relation to different types of cases selected on the basis of their similarities and differences. Your comments on these suggested strategies would be very welcome.

I should explain...

...why the models found by EvalC3 were different from those found by the QCA analysis. QCA software finds solutions, i.e. predictive models, by reducing all the configurations found in a truth table down to the smallest possible set, using a minimisation algorithm known as the Quine-McCluskey algorithm.

In contrast, EvalC3 provides users with a choice of four different search algorithms, combined with multiple alternative performance measures that can be used to automatically assess the results generated by those search algorithms. All algorithms have their strengths and weaknesses in terms of the kinds of results they can and cannot find, including the QCA Quine-McCluskey algorithm and the simple machine learning algorithms built into EvalC3. I think the Quine-McCluskey algorithm has particular problems with datasets which have limited diversity, in other words, where the cases only represent a small proportion of all the possible combinations of the conditions documented in the dataset. The simple search algorithms built into EvalC3 don't experience this as a difficulty. This is my conjecture, not yet rigorously tested.

[In the above study, the cases in the two data sets analysed represented 47% and 68% respectively of all the possible configurations, given the presence of five different conditions.]

While the EvalC3 results described above did differ from the QCA analyses, they were not in outright contradiction. The same has been my experience when I have reanalysed other QCA datasets. EvalC3 will either simply duplicate the QCA findings or produce variations on them, often ones which perform better.


Wednesday, July 29, 2020

Converting a continuous variable into a binary variable i.e. dichotomising


If you Google "dichotomising data" you will find lots of warnings that this is basically a bad idea! Why so? Because if you do so you will lose information. All those fine details of differences between observations will be lost.

But what if you are dealing with something like responses to an attitude survey? Typically these have five-point scales ranging from disagree to neutral to agree, or the like. Quite a few of the fine differences in ratings on this scale may well be nothing more than "noise", i.e. variations unconnected with the phenomenon you are trying to measure. A more likely explanation is that they reflect differences in respondents' "response styles", or something more random.

Aggregation or "binning" of observations into two classes (higher and lower) can be done in different ways. You could simply find the median value and split the observations at that point. Or, you could look for a  "natural" gap in the frequency distribution and make the split there. Or, you may have a prior theoretical reason that it makes sense to split the range of observations at some other specific point.

I have been trying out a different approach. This involves not just looking at the continuous variable I want to dichotomise, but also at its relationship with an outcome that will be of interest in subsequent analyses. This outcome could itself be a continuous variable or a binary measure.

There are two ways of doing this. The first is a relatively simple manual approach, used where the cut-off point for the outcome variable has already been decided, by one means or another. We then vary the cut-off point in the range of values for the independent variable, to see what effect this has on the numbers of observations of the outcome above and below its cut-off value. For any specific cut-off value for the independent variable, an Excel spreadsheet can be used to calculate the following:
  1. # of True Positives - where the independent variable value was high and so was the outcome variable value
  2. # of False Positives - where the independent variable value was high but the outcome variable value was low
  3. # of False Negatives - where the independent variable value was low but the outcome variable value was high
  4. # of True Negatives - where the independent variable value was low and the outcome variable value was low
When doing this we are in effect treating cut-off criteria for the independent variable as a predictor of the dependent variable.  Or more precisely, a predictor of the prevalence of observations with values above a specified cut-off point on the dependent variable.

In Excel, I constructed the following:
  • Cells for entering the raw data - the values of each variable for each observation
  • Cells for entering the cut-off points
  • Cells for defining the status of each observation  
  • A Confusion Matrix, to summarise the total number of observations with each of the four possible types described above.
  • A set of 6 widely used performance measures, calculated using the number of observations in each cell of the Confusion Matrix.
    • These performance measures tell me how good the chosen cut-off point is as a predictor of the outcome as specified. At best, all those observations fitting the cut-off criterion will be in the True Positive group and all those not fitting it would be in the True Negative group. In reality, there are also likely to be observations in the False Positive and False Negative groups.
By varying the cut-off points it is possible to find the best possible predictor, i.e. one with very few False Positives and very few False Negatives. This can be done manually when the cut-off point for the outcome variable has already been decided.

Alternatively, if the cut-off point has not been decided for the outcome variable, a search algorithm can be used to find the best combination of two cut-off points (one for the independent and one for the dependent variable). Within Excel, there is an add-in called Solver, which uses an evolutionary algorithm to do such a search and find the optimal combination.
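For those not working in Excel, here is a minimal Python sketch of the same idea. The data values are invented for illustration, accuracy is used as the performance measure, and Solver's evolutionary search is replaced by a simple exhaustive search over the observed values.

```python
# Sketch: searching for the pair of cut-off points that best dichotomises
# an independent variable as a predictor of a dichotomised outcome.
import itertools

x = [2.1, 3.4, 3.9, 4.2, 4.8, 1.7, 2.9, 4.5, 3.1, 4.9]   # independent variable (illustrative)
y = [1.0, 2.2, 3.8, 4.1, 4.6, 1.2, 2.0, 4.4, 2.5, 4.8]   # outcome variable (illustrative)

def confusion(x_cut, y_cut):
    tp = sum(1 for xi, yi in zip(x, y) if xi >= x_cut and yi >= y_cut)  # True Positives
    fp = sum(1 for xi, yi in zip(x, y) if xi >= x_cut and yi < y_cut)   # False Positives
    fn = sum(1 for xi, yi in zip(x, y) if xi < x_cut and yi >= y_cut)   # False Negatives
    tn = sum(1 for xi, yi in zip(x, y) if xi < x_cut and yi < y_cut)    # True Negatives
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

# Try every observed value as a candidate cut-off for each variable,
# and keep the pair that predicts the dichotomised outcome best.
best = max(itertools.product(sorted(set(x)), sorted(set(y))),
           key=lambda cuts: accuracy(*confusion(*cuts)))
print("best cut-offs (x, y):", best, "accuracy:", round(accuracy(*confusion(*best)), 2))
```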

Postscript 2020 11 05: An Excel file with the dichotomization formula and a built-in example data set is available here  

  
2020 08 13: Also relevant: 

Hofstad, T. (2019). QCA and the Robustness Range of Calibration Thresholds: How Sensitive are Solution Terms to Changing Calibrations? COMPASSS Working Papers, 2019–92. http://www.compasss.org/wpseries/Hofstad2019.pdf  

 This paper emphasises the importance of declaring the range of original (pre-dichotomised) values over which the performance of a predictive model remains stable.

Tuesday, April 07, 2020

Rubrics? Yes, but...




This blog posting is a response to Tom Aston's blog posting: Rubrics as a harness for complexity

I have just reviewed an evaluation of the effectiveness of policy influencing activities of programs funded by HMG as part of the International Carbon Finance Initiative.  In the technical report there are a number of uses of rubrics to explain how various judgements were made.  Here, for example, is one summarising the strength of evidence found during process tracing exercises:
  • Strong support – smoking gun (or DD) tests passed and no hoop tests (nor DDs) failed.
  • Some support – multiple straw in the wind tests passed and no hoop tests (nor DDs) failed; also, no smoking guns nor DDs passed.
  • Mixed – mixture of smoking gun or DD tests passed but some hoop tests (or DDs) failed – this required the CMO to be revised.
  • Failed – some hoop (or DD) tests failed, no double decisive or smoking gun tests passed – this required the theory to be rejected and the CMO abandoned or significantly revised. 

Another rubric described in great detail how three different levels of strength of evidence were differentiated (Convincing, Plausible, Tentative). There was no doubt in my mind that these rubrics contributed significantly to the value of the evaluation report, particularly by giving readers confidence in the judgements that were made by the evaluation team.

But… I can't help feeling that the enthusiasm for rubrics is out of proportion with their role within an evaluation. They are a useful measurement device that can make complex judgements more transparent and thus more accountable. Note the emphasis on the 'more'… There are often plenty of not-necessarily-so-transparent judgements present in the explanatory text which is used to annotate each point in a rubric scale. Take, for example, the first line of text in Tom Aston's first example here, which reads: "Excellent: Clear example of exemplary performance or very good practice in this domain: no weakness"

As noted in Tom's blog, it has been argued that rubrics have a wider value, i.e. that "rubrics are useful when trying to describe and agree what success looks like for tracking changes in complex phenomena". This is where I would definitely argue "Buyer beware", because rubrics have serious limitations in respect of this task.

The first problem is that description and valuation are separate cognitive tasks. Events that take place can be described; they can also be given a particular value by observers (e.g. good or bad). This dual process is implied in the above definition of how rubrics are useful. Both of these types of judgements are often present in a rubric's explanatory text, e.g. "Clear example of exemplary performance or very good practice in this domain: no weakness".

The second problem is that complex events usually have multiple facets, each of which has a descriptive and value aspect.  This is evident in the use of multiple statements linked by colons in the same example rubric I refer to above.

So for any point in a rubric's scale, the explanatory text has quite a big task on its hands. It has to describe a specific subset of events and give a particular value to each of those. In addition, each adjacent point on the scale has to do the same, in a way that suggests there are only small incremental differences between each of these points' judgements. And being a linear scale, this suggests, or even requires, that there is only one path from the bottom to the top of the scale. Say goodbye to equifinality!

So, what alternatives are there, for describing and agreeing on what success looks like when trying to track changes in complex phenomena?  One solution which I have argued for, intermittently, over a period of years, is the wider use of weighted checklists.  These are described at length here.  

Their design addresses the three problems mentioned above. Firstly, description and valuation are separated out as two distinct judgements. Secondly, the events that are described and valued can be quite numerous, and yet each can be separately judged on these two criteria. There is then a mechanism for combining these judgements in an aggregate scale. And there is more than one route from the bottom to the top of this aggregate scale.

“The proof is in the pudding”.  One particular weighted checklist, known as the Basic Necessities Survey, was designed to measure and track changes in household-level poverty.  Changes in poverty levels must surely qualify as ‘complex phenomena ‘.  Since its development in the 1990s, the Basic Necessities Survey has been widely used in Africa and Asia by international environment/conservation organisations.  There is now a bibliography available online describing some of its users and uses. https://www.zotero.org/groups/2440491/basic_necessities_survey/library









Friday, February 28, 2020

Temporal networks: Useful static representations of dynamic events

I have just found out about the existence of a field of study called "temporal networks". Here are two papers I came across:

Linhares, C. D. G., Ponciano, J. R., Paiva, J. G. S., Travençolo, B. A. N., & Rocha, L. E. C. (2019). Visualisation of Structure and Processes on Temporal Networks. In P. Holme & J. Saramäki (Eds.), Temporal Network Theory (pp. 83–105). Springer International Publishing. https://doi.org/10.1007/978-3-030-23495-9_5
Li, A., Cornelius, S. P., Liu, Y.-Y., Wang, L., & Barabási, A.-L. (2017). The fundamental advantages of temporal networks. Science, 358(6366), 1042–1046. https://doi.org/10.1126/science.aai7488

Here is an example of a temporal network:
Figure 1


The x-axis represents intervals of time. The y-axis represents six different actors. The curved lines represent particular connections between particular actors at particular moments in time. For example, email messages or phone calls.

In Figure 2 below, we can see a more familiar type of network structure. This is the same network as that shown in Figure 1. The difference is that it is an aggregation of all the interactions over the 24 time periods shown in Figure 1. The numbers in red refer to the number of times that each communication link was active in this whole period.

This diagram has both strengths and weaknesses. Unlike Figure 1, it shows us the overall structure of interactions. On the other hand, it obscures the possible significance of variations in the sequence within which these interactions take place over time. In a social setting involving people talking to each other, the sequencing of when different people talk to each other could make a big difference to the final state of the relationships between the people in the network.

Figure 2
How might the Figure 1 way of representing temporal networks be useful?

The first would be as a means of translating narrative accounts of events into network models of those events. Imagine that the 24 time periods are a duration of time covered by events described in a novel, and that events in periods 1 to 5 are described in one particular chapter of the novel. In that chapter, the story is all about the interactions between actors 2, 3 and 4. In subsequent chapters, their interactions with other actors are described.
Figure 3
Now, instead of a novel, imagine a narrative describing the expected implementation and effects of a particular development programme. Different stakeholders will be involved at different stages. Their relationships could be "transcribed" into a temporal network, and also then into a static network diagram (as in Figure 2) which would describe the overall set of relationships for the whole programme period.
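As a small illustration of this "transcription" step, here is a Python sketch that aggregates a temporal list of interactions (Figure 1 style) into the weighted links of a static network (Figure 2 style). The event list and actor names are invented for illustration.

```python
# Sketch: turning a temporal record of interactions into an aggregate weighted network.
from collections import Counter

# (time period, actor_a, actor_b) - each tuple is one interaction in one time period
events = [
    (1, "actor2", "actor3"), (2, "actor3", "actor4"), (3, "actor2", "actor4"),
    (7, "actor1", "actor2"), (9, "actor5", "actor6"), (12, "actor2", "actor3"),
]

# Aggregate over all time periods: edge weight = number of times the link was active
weights = Counter(frozenset((a, b)) for _, a, b in events)

for pair, w in weights.items():
    print(" -- ".join(sorted(pair)), "active", w, "time(s)")
```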

The second possible use would be to adapt the structure of a temporal network model to convert it into a temporal causal network model, such as shown in Figure 4 below. The basic structure would remain the same, with actors listed row by row and time listed column by column. The differences would be that:

  1. The nodes in the network could be heterogeneous, reflecting different kinds of activities or events undertaken/involved in by each actor, not homogeneous as in the Figure 1 example above.
  2. The connections between activities/events would be causal, in one direction or in both directions. The latter signifying a two-way exchange of some kind. In Figure 1 causality may be possible and even implied, but it can't simply be assumed.
  3. There could also be causal links between activities within the same row, meaning that one of an actor's particular activities at T1 influenced another of their activities at T3, for example. This possibility is not available in a Figure 1 type model.
  4. Some "spacer" rows and columns are useful to give the node descriptions more room and to make the connections between them more visible.

Figure 4 is a stylised example. By this I mean I have not detailed the specifics of each event or characterised the nature of the connections between them. In a real-life example this would be necessary. Space limitations on the chart would necessitate very brief titles + reference numbers or hypertext links.
Figure 4: Stylised example
While this temporal causal network looks something like a Gantt chart, it is different and better.

  1. Each row is about a specific actor, whereas in a Gantt chart each row is about a specific activity 
  2. Links between activities signal a form of causal influence, whereas in a Gantt chart they signal precedence, which may or may not have causal implications
  3. Time periods can be more flexibly and abstractly defined, so long as they follow a temporal sequence. Whereas in a Gantt chart these are more likely to be defined in specific units like days, weeks or months, or specific calendar dates


How does a temporal causal network compare to more conventional representations of Theories of Change? Results-chain versions of a Theory of Change do make use of a y-axis to represent time, but are often much less clear about the actors involved in the various events that happen over time. Too often these describe what might be called a sequence of disembodied events, i.e. abstract descriptions of key events. On the other hand, more network-like Theories of Change can be better at identifying the actors involved and the relationships between them. But it is very difficult to also capture the time dimension in a static network diagram. Associated with this problem is the difficulty of then constructing any form of text narrative about the events described in the network.

One possible problem is whether measurable indicators could be developed for each activity that is shown. Another is how longer-term outcomes, happening over a period of time, might be captured. Perhaps the activities associated with their measurement would be what would be shown in a Figure 4 type model.

Postscript: The temporal dimension of network structures is addressed in dynamic network models, such as those captured in Fuzzy Cognitive Networks. With each iteration of a dynamic network model, the states of the nodes/events/actors in the network are updated according to the nature of the links they have with others in the network. This can lead to quite complex patterns of change in the network over time. But one of the assumptions built into such models is that all relationships are re-enacted in each iteration. This is clearly not the case in our social life. Some relationships are updated daily, others much less frequently. The kind of structure shown in Figure 1 above seems a more appropriate view. But could these structures be used for simulation purposes, where all nodes would have values that are influenced by their relationships with each other?



Tuesday, November 05, 2019

Combining the use of the Confusion Matrix as a visualisation tool with a Bayesian view of probability


Caveat: This blog posting is total re-write of an earlier version on the same subject. Hopefully, this one will be more coherent and more useful!


Quick Summary
In this revised blog I:
1. Explain what a Confusion Matrix is and what Bayes Theorem says
2. Explain three possible uses for Bayes Theorem when combined with a Confusion Matrix

What is a Confusion Matrix?


A Confusion Matrix is a tabular structure that displays four possible combinations of two types of events, each of which may have happened, or not happened. Wikipedia provides a good description.

Here is an example, with real data, taken from an EvalC3 analysis.


    TP = True Positive, FP = False Positive, FN = False Negative, TN = True Negative

In this example, the top row of the table tells us that when the attributes of a particular predictive model (as identified by EvalC3) are present there are 8 cases where the expected outcome is also present (True Positives). But there are also 4 cases where the expected outcome is not also present (False Positives). In all the remaining cases (all in the bottom row), which do not have the attributes of the predictive model, there is one case where the outcome is nevertheless present (False Negative) and 13 cases where the outcome is not present (True Negative). As can be seen in the Wikipedia article, and also in EvalC3, there is a range of performance measures which can be used to tell us how well this particular predictive model is performing – and all of these measures are based on particular combinations of the values in this Confusion Matrix.
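As an illustration, here is a short Python sketch computing a few widely used performance measures from the example Confusion Matrix above. The measure names are standard ones; EvalC3 offers a range of such measures.

```python
# Sketch: performance measures computed from the example Confusion Matrix
# (8 True Positives, 4 False Positives, 1 False Negative, 13 True Negatives).
TP, FP, FN, TN = 8, 4, 1, 13

measures = {
    "accuracy":             (TP + TN) / (TP + FP + FN + TN),
    "precision (PPV)":      TP / (TP + FP),   # how often the model's 'positives' are right
    "recall (sensitivity)": TP / (TP + FN),   # share of actual positives the model finds
    "specificity":          TN / (TN + FP),
    "false positive rate":  FP / (FP + TN),
}

for name, value in measures.items():
    print(f"{name}: {value:.2f}")
```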

Bayes theorem


According to Wikipedia, 'Bayes' theorem (alternatively Bayes's theorem, Bayes's law or Bayes's rule) describes the probability of an event, based on prior knowledge of conditions that might be related to the event '

The Bayes formula reads as follows: P(A|B) = (P(B|A) × P(A)) / P(B), where:

P(A|B) = The probability of A, given the presence of B
P(B|A) = The probability of B, given the presence of A
P(A) = The probability of A
P(B) = The probability of B

This formula can be calculated using data represented within a Confusion Matrix. Using the example above, the outcome being present = A in the formula, and the model attributes being present = B in the formula. So this formula can tell us the probability of the outcome being present when the model attributes are present, i.e. the probability of finding True Positives. Here is how the various parts of the formula can be calculated, along with some alternate names for their parts:
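The original post showed this calculation as an image. As a reconstruction of the standard correspondence between the formula's parts and the Confusion Matrix cells, here is a short Python sketch using the example values above.

```python
# Sketch: reading the parts of Bayes' formula off the Confusion Matrix.
# A = outcome present, B = model attributes present.
TP, FP, FN, TN = 8, 4, 1, 13
total = TP + FP + FN + TN

P_A = (TP + FN) / total            # probability of the outcome (the "prior", or prevalence)
P_B = (TP + FP) / total            # probability of the model attributes being present
P_B_given_A = TP / (TP + FN)       # the "likelihood": attributes present, given the outcome
P_A_given_B = P_B_given_A * P_A / P_B   # Bayes' formula: the "posterior"

print(round(P_A_given_B, 2), round(TP / (TP + FP), 2))   # both equal the model's precision, 0.67 here
```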


So far, in this blog posting, the Bayes formula simply provides one additional means of evaluating the usefulness of prediction models found through the use of machine learning algorithms, using EvalC3 or other means.

Process Tracing application


But I'm more interested here in the use of the Bayes formula for process tracing purposes, something that Barbara Befani has written about. Process tracing is all about articulating and evaluating conjectured causal processes in detail. A process tracing analysis is usually focused on one case or instance, not multiple cases. It is a within-case rather than cross-case method of analysis. 

In this context, the rows and columns of the Confusion Matrix have slightly different names. The columns describe whether a theory is true or not, and the rows describe whether evidence of a particular kind is present or not. More importantly, the values in the cells are not numbers of actual observed cases. Rather, they are the analyst's interpretation of what are described as the "conditional probabilities" of what is happening in one case or instance. In the two cells in the first column, the analyst enters probability estimates, between zero and one, reflecting the likelihood (a) that the evidence would be present if the theory is true (i.e. that a man was the murderer), and (b) that the evidence would be absent if the theory is true. In the two cells in the second column, the analyst enters probability estimates, again between zero and one, reflecting the likelihood (a) that the evidence would be present if the theory is not true (i.e. that a man was not the murderer), and (b) that the evidence would be absent if the theory is not true.

Here is a notional example. The theory is that a man was the murderer. The available evidence suggests that the murderer would have needed exceptional strength.
The analyst also needs to enter their "priors". That is, their belief about the overall prevalence of the theory being true, i.e. how often men are the murderers. Wikipedia suggests that 80% of murders are committed by men. These prior probabilities are entered in the third row of the Confusion Matrix, as shown below. The main cell values are then updated in the light of those new values, as also shown below.

Using the Bayes formula provided above, we can now calculate P(A|B), i.e. the probability of a man being the murderer, given that the evidence was found. P(A|B) = TP/(TP+FP) = 0.97
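The conditional probabilities entered into the matrix are not reproduced in the text above, so the values in this Python sketch are illustrative assumptions, chosen only so that the posterior comes out near the 0.97 reported; the prior of 0.8 is taken from the text.

```python
# Sketch of the process-tracing calculation.
p_theory       = 0.8    # prior, from the text: probability a man was the murderer
p_E_given_T    = 0.95   # ASSUMED: probability the evidence is present if the theory is true
p_E_given_notT = 0.10   # ASSUMED: probability the evidence is present if the theory is false

p_E = p_E_given_T * p_theory + p_E_given_notT * (1 - p_theory)   # overall probability of the evidence
posterior = p_E_given_T * p_theory / p_E    # Bayes' formula, = TP/(TP+FP) in matrix terms
print(round(posterior, 2))                  # ~0.97 with these assumed values
```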

"Naive Bayes" 


This is another useful application, based on an algorithm of this name, described here. 
On that web page, an example is given of a data set where each row describes three attributes of a car (color, type and origin) and whether the car was stolen or not. Predictive models (Bayesian or otherwise)  could be developed to identify how well each of these attributes predicts whether a car is stolen or not. In addition, we may want to know how good a predictor the combination of all these three individual predictors is. But the dataset does not include any examples of these types of cases.

The article then explains how the probability of a combination of all three of these attributes can be used to predict whether a car is stolen or not.

1. Calculate (TP/(TP+FP)) for color * (TP/(TP+FP)) for type * (TP/(TP+FP)) for origin 
2. Calculate (FP/(TP+FP)) for color * (FP/(TP+FP)) for type * (FP/(TP+FP)) for origin
3. Compare the two calculated values. If the first is higher classify a car as most likely stolen. If it is lower, classify a car as most likely not stolen.

A caution: Naive Bayes calculations assume (as the name suggests) that each of the attributes in the predictive model is independent of the others, conditional on the outcome. This may not always be the case.
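Here is a minimal Python sketch of the three-step comparison just described. The per-attribute counts are invented for illustration and are not the figures from the linked article.

```python
# Sketch: the three-step comparison described above, for one car's combination of attributes.
attributes = {
    # attribute: (TP, FP) - cases with this attribute that were / were not stolen (illustrative)
    "color=red":       (3, 2),
    "type=sports":     (4, 1),
    "origin=domestic": (2, 3),
}

p_stolen, p_not_stolen = 1.0, 1.0
for tp, fp in attributes.values():
    p_stolen     *= tp / (tp + fp)   # step 1: product of TP/(TP+FP) across the attributes
    p_not_stolen *= fp / (tp + fp)   # step 2: product of FP/(TP+FP) across the attributes

# Step 3: classify according to whichever product is higher
print("most likely stolen" if p_stolen > p_not_stolen else "most likely not stolen")
```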

In summary


Bayes formula seems to have three uses:

1. As an additional performance measure when evaluating predictive models generated by any algorithm, or other means. Here the cell values do represent numbers of individual cases.

2.  As a way of measuring the probability of a particular causal mechanism working as proposed, within the context of a process-tracing exercise. Here the cell values are conjectures about relative probabilities relating to a specific case, not numbers of individual cases.

3.  As a way of measuring the probability of a combination of predictive models being a good predictor of an outcome of concern. Here the cell values could represent either multiple real cases or conjectured probabilities (part of a Bayesian analysis of a causal mechanism) regarding events within one case only.






Saturday, October 19, 2019

On finding the weakest link...



Last week I read and responded to a flurry of email exchanges that were prompted by Jonathan Morell circulating a think piece titled 'Can Knowledge of Evolutionary Biology and Ecology Inform Evaluation?'. Putting aside the details of the subsequent discussions, many of the participants were in agreement with the idea that evaluation theory and practice could definitely benefit by more actively seeking out relevant ideas from other disciplines.

So when I was reading Tim Harford's column in this weekend's Financial Times, titled 'The weakest link in the strong Nobel winner', I was very interested in this section:
Then there’s Prof Kremer’s O-ring Theory of Development, which demonstrates just how far one can see from that comfortable armchair. The failure of vulnerable rubber “O-rings” destroyed the Challenger space shuttle in 1986; Kremer borrowed that image for his theory, which — simply summarised — is that for many production processes, the weakest link matters.
Consider a meal at a fancy restaurant. If the ingredients are stale, or the sous-chef has the norovirus, or the chef is drunk and burns the food, or the waiter drops the meal in the diner’s lap, or the lavatories are backing up and the entire restaurant smells of sewage, it doesn’t matter what else goes right. The meal is only satisfactory if none of these things go wrong.
If you do a Google search for more information about the O-ring Theory of Development, you will find there is a lot more to the theory than this, much of it very relevant to evaluators. Prof Kremer is an economist, by the way.

This quote was of interest to me because in the last week I have been having discussions with a big agency in London about how to go ahead with an evaluation of one of their complex programs. By complex, in this instance, I mean a program that is not easily decomposable into multiple parts – where it might otherwise be possible to do some form of cross-case analysis, using either observational data or experimental data. We have been talking about strategies for identifying multiple alternative causal pathways that might be at work, connecting the program's interventions with the outcomes it is interested in. I'll be reporting more on this in the near future, I hope.

But let's go right now to a position a bit further along, where an evaluation team has identified which causal pathway(s) are most valuable/plausible/relevant. In those circumstances, particularly in a large complex program, the causal pathway itself could be quite long, with many elements or segments. This in itself is not a bad thing, because the more segments there are in a causal pathway that can be examined, the more vulnerable to disproof the theory about that causal pathway is – which in principle is a good thing – especially if the theory is not disproved – it means it's a pretty good theory. But on the other hand, a causal pathway with many segments or steps poses a problem for an evaluation team, in terms of where they are going to allocate their resource-limited attention.

What I like about the paragraph from Tim Harford's column is the sensible advice that it provides to an evaluation team in this type of context. That is, look first for the weakest link in the causal pathway. Of course, that does raise the question of what we mean by the weakest link. A link may be weak in terms of its verifiability or its plausibility, or in other ways. My inclination at this point would be to focus on the weakest link in terms of plausibility. Your thoughts on this would be appreciated. How one would go about identifying such weak links would also need attention. Two obvious choices would be either to use expert judgement or different stakeholders' perspectives on the question. Or, probably better, a combination of both.

Postscript: I subsequently discovered some other related musings:



Wednesday, October 02, 2019

Participatory design of network models: Some implications for analysis



I recently had the opportunity to view a presentation by Luke Craven. You can see it here on YouTube: https://www.youtube.com/watch?v=TxmYWwGKvro

Luke has developed an impressive software application as a means of doing what he calls a 'Systems Effects' analysis. I would describe it as a particular form of participatory network modelling. The video is well worth watching. There is some nice technology at work within this tool. For example, see how a text search algorithm can facilitate the process of coding a diversity of responses by participants into a smaller subset of usable categories. In this case, descriptions of different types of causes and effects at work.

In this blog, I want to draw your attention to one part of the presentation, which is in matrix form and which I have copied below. (Sorry for the poor quality, it's a copy of a YouTube screen.)


In social network analysis jargon this is called an "adjacency matrix". Down the left-hand side is a list of different causal factors identified by survey respondents. This list is duplicated across the top row. The cell values refer to the number of times respondents have mentioned the row factor being a cause of the column factor.

This kind of data can easily be imported into one of many different social network analysis visualisation software packages, as is pointed out by Luke in his video (I use Ucinet/NetDraw). When this is done it is possible to identify important structural features, such as some causal factors having much higher 'betweenness centrality'. Such factors will be at the intersection of multiple causal paths. So, in an evaluation context, they are likely to be well worth investigating. Luke explores the significance of some of these structural features in his video.

In this blog, I want to look at the significance of the values in the cells of this matrix, and how they might be interpreted. At first glance, one could see them as measures of the strength of a causal connection between 2 factors mentioned by a respondent. But there are no grounds for making that interpretation. It is much better to interpret those values as a description of the prevalence of that causal connection. A particular cause might be found in many locations/in the lives of many respondents, but in each setting, it might still only be a relatively minor influence compared to others that are also present there.

Nevertheless, I think a lot can still be done with this prevalence information. As I explained in a recent blog about the analysis of QuIP data, we can add additional data to the adjacency matrix in a way that will make it much more useful. This involves two steps. Firstly, we can generate column and row summary figures, so that we can identify: (a) the total number of times a column factor has been mentioned, and (b) the total number of times a row factor has been mentioned. Secondly, we can use those new values to identify how often a row cause factor has been present but a column effect factor has not been, and vice versa. I will explain this in detail with the help of an imaginary example, using a type of table known as a Confusion Matrix. (For more information about the Confusion Matrix see this Wikipedia entry.)
In this example, 'increased price of livestock' is one of the causal factors listed amongst others on the left side of an adjacency matrix of the kind shown above. And 'increased income' is one of the effect factors listed amongst others across the top row of that kind of matrix. In the green cell, the 11 refers to the number of causal connections respondents have identified between the two factors. This number would be found in a cell of an adjacency matrix which links the row factor with the column factor.

The values in the blue cells of the confusion matrix are the respective row total and column total. Knowing the green and blue values we can then calculate the yellow values. The number 62 refers to the incidence of all the other possible causal factors listed down the left side of the matrix. And the number 2 refers to the incidence of all the other possible effects listed across the top of the matrix.
PS: In Confusion Matrix jargon the green cell is referred to as a True Positive, the yellow cell with the 2 as a False Positive, and yellow cell with a 62 as a False Negative. The blank cell is known as a True Negative.
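A small Python sketch of this construction, using the figures quoted in the text (11 True Positives, with the 2 and 62 derived from the row and column totals, which are therefore taken here to be 13 and 73).

```python
# Sketch: filling out the confusion matrix for one cause-effect pair from adjacency matrix totals.
tp = 11          # respondents linking 'increased price of livestock' -> 'increased income'
row_total = 13   # total mentions of the cause, inferred so that FP = 13 - 11 = 2, as in the text
col_total = 73   # total mentions of the effect, inferred so that FN = 73 - 11 = 62, as in the text

fp = row_total - tp   # cause mentioned, but linked to other effects
fn = col_total - tp   # effect mentioned, but attributed to other causes

print("TP:", tp, "FP:", fp, "FN:", fn)
if fp == 0:
    print("cause looks sufficient for the effect (never mentioned without it)")
if fn == 0:
    print("cause looks necessary for the effect (effect never mentioned without it)")
```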

Once we have this more complete information we can then do simple analyses that tell us just how important, or not so important, the 11 mentions of the relationship between this cause and effect are. (I will duplicate some of what I've said in the previous post here.) For example, if the value of 2 was in fact a value of 0, this would be telling us that the presence of an increased price of livestock was sufficient for the outcome of increased income to be present. However, the value of 62 would be telling us that while the increased price of livestock is sufficient, it is not necessary for increased income. In fact, in most of the cases increased income arises from other causal factors.

Alternatively, we can imagine the value of 62 is now zero while the value of 2 is still present. In this situation, this would be telling us that an increased price of livestock is necessary for increased income. There are no cases where increased income has arisen in the absence of an increased price of livestock. But it may not always be sufficient. If the value 2 is still there, it is telling us that in some cases, although the increased price of livestock is necessary, it is not sufficient. Some other factor is missing or obstructing things, causing the outcome of increased income not to occur.

Alternatively, we can imagine that the value 2 is now much higher, say 30. In this context, the increased price of livestock is neither necessary nor sufficient for the outcome. In fact, more often than not it is an incorrect predictor, and it is only present in a small proportion of all the cases where there is increased income. The point being made here is that the value in the True Positive cell (11) has no significance unless it is seen in the context of the other values in the Confusion Matrix. Looking back at the big matrix at the top of this blog, we can't interpret the significance of the cell values on their own.

So far this discussion has not taken us much further than discussion in the previous blog. In that blog, I ended with the concern that while we could identify the relative importance of individual causal factors in this sort of one-to-one analysis we couldn't do the more interesting type of configurational analyses, where we might identify the relative importance of different combinations of causal factors.

I now think it may be possible. If we look back at the matrix at the top of this blog, we can imagine that there is in fact a stack of such matrices, one sitting above the other. Each of those matrices represents one respondent's responses. And the matrix at the bottom is a kind of summary matrix, where the individual cells are totals of the values of all the cells sitting immediately above them in the other matrices.

From each individual's matrix we could extract a string of data telling us which of the causal factors were reported as present (1) or absent (0), and whether a particular outcome/effect of interest was reported as present (1) or absent (0). Each of those strings can be listed as a 'case' in the kind of data set used in predictive modelling. In those datasets, each row represents a case, and each column represents an attribute of those cases, plus the outcome of interest.

Using EvalC3, an Excel predictive modelling app, it would then be possible to identify one or more configurations i.e. combinations of reported attribute/causes which are good predictors of the reported effect/outcome.
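Here is a minimal Python sketch of that extraction step, assuming each respondent's matrix is held as a dictionary. It uses the first of the two coding options discussed in the caveat below (a cause is coded as present if the respondent mentioned it at all); the factor names, outcome and respondent data are invented for illustration.

```python
# Sketch: extracting one row per respondent (a 'case') from a stack of individual
# adjacency matrices, ready for predictive modelling in EvalC3 or similar.
causes = ["price_of_livestock", "new_road", "drought"]
outcome = "increased_income"

# Each respondent's matrix: {cause: {effect: number of mentions}} (illustrative)
respondent_matrices = {
    "r1": {"price_of_livestock": {"increased_income": 2}, "drought": {"reduced_income": 1}},
    "r2": {"new_road": {"increased_income": 1}},
    "r3": {"drought": {"reduced_income": 1}},
}

rows = []
for rid, m in respondent_matrices.items():
    row = {"case": rid}
    # Code each cause as present (1) if the respondent mentioned it at all
    row.update({c: int(bool(m.get(c))) for c in causes})
    # Code the outcome as present (1) if the respondent linked any cause to it
    row[outcome] = int(any(outcome in effects for effects in m.values()))
    rows.append(row)

for r in rows:
    print(r)
```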

Caveat: There are in fact two options for the kinds of strings of data that could be extracted from the individuals' matrices. One would list whether the 'cause' attributes were mentioned as present, or not, at all. The other would only list whether a cause attribute was present or not specifically in relation to the effect/outcome of interest.