Tuesday, August 16, 2011

Evaluation methods looking for projects or projects seeking appropriate evaluation methods?


A few months ago I carried out a brief desk review of 3ie's approach to funding impact evaluations, for AusAID's Office of Development Effectiveness. One question that I did not address was "Broadly, are there other organisations providing complementary approaches to 3ie for promoting quality evaluation to fill the evidence gap in international development?"

While my report examined the pros and cons of 3ie's use of experimental methods as a preferred evaluation design, it did not look at the question of appropriate institutional structures for supporting better evaluations. Yet you could argue that choices made about institutional structures could have more consequences than those involving the specifics of particular evaluation methods. The question quoted above seems to contain a tacit assumption about institutional arrangements, i.e. that improvements in evaluation can best be promoted by funding externally located specialist centres of expertise, like 3ie. This assumption seems questionable, for two sets of reasons that I explain below. One is to do with the results such centres generate; the other concerns the neglected potential of an alternative.

In the "3ie" (Open Window) model anyone can submit a proposal for an evaluation of a project implemented by any organisation. This approach is conducive to 'cherry picking' of evaluable (by experimental methods) projects and the collection of evaluations representing a miscellany of types of projects - about which it will be hard to generate useful generalisations. Plus an unknown number of other projects possibly being left unevaluated, because they dont fit the prevailing method preferences.

In the alternative scenario the funding of evaluations would not be outsourced to any specialist centre(s). Instead, an agency like DFID would identify a portfolio of projects needing evaluation. For example, those initiatives focusing on climate change adaptation. DFID would call for proposals for their evaluation and then screen those proposals, largely as it does now, but perhaps seeking a wider range of bidders.

Unlike the present process, it would then offer funding to the bidders who had provided, say, the best 50% of the proposals, so that they could develop those proposals further in more detail. At present there is no financial incentive to do so, and any time and money already spent on developing proposals is unlikely to be recompensed, because only one bidder will get the contract.

The expected result of this "proposal development" funding would be revised and expanded proposals that outlined each bidder's proposed methodology in considerable detail, in something like an inception report. All the bidders involved at this stage would need access to a given set of project documents and at least one collective meeting with the project holders.

The revised proposals would then be assessed by DFID, but with a much greater weighting towards the technical content of the proposal than exists at present. This second-level assessment would benefit from the involvement of external specialists, as in the 3ie model. DFID Evaluation Department already does this in the case of some evaluations through the use of a quality assurance panel. The best proposal would then be funded as normal, and the evaluation carried out.

Both the winning and losing technical proposals would then be put in the public domain via the DFID website, in order to encourage cross-fertilisation of ideas, external critiquing and public accountability. This is not the case at present: all bidders operate in isolation, with no opportunities to learn from each other. The same appears to be the case with 3ie, where the full text of technical proposals is not publicly available (even for those that were successful). Making the proposals public would mean that the proposal development funding had not been wasted, even where the proposals were not successful.

In summary, with the "external centre of expertise" model there is a risk that methodological preferences become the driving force behind what gets evaluated. The alternative is a portfolio-of-projects led approach, where interim funding support is used to generate a diversity of improved evaluation proposals, which are later made accessible to all and which can then inform future proposals.

A meta-evaluation might be useful to test the efficacy of this project-led approach. Other matched kinds of projects also needing evaluation could continue to be funded by the pre-existing mechanisms (e.g. in-country DFID offices). Paired comparisons could later be made of the quality of the evaluations subsequently produced by the two different mechanisms. Although there would likely be multiple points of difference, it should be possible for DFID, and any other stakeholders, to prioritise their relative importance and come to an overall judgement of which has been most useful.

PS: 3ie seems to be heading in this direction, to some extent. 3ie now has a Policy Window, through which it has recently sought applications for the evaluation of projects belonging to a specific portfolio ("Poverty Reduction Interventions in Fiji", implemented by the Government of Fiji). Funding is available to cover the costs of the successful bidder (only) to visit Fiji "to develop a scope of work to be included in a subsequent Request for Proposal (RFP) to conduct the impact evaluation". Subject to 3ie's approval of the developed proposal, 3ie will then fund the implementation of the evaluation by the bidder. The success of this approach will be worth watching, especially its ability to ensure the evaluation of the whole portfolio of projects (which is likely to depend on 3ie having some flexibility about the methodologies used). However, I am perhaps making a risky assumption here, that the projects within the portfolio to be evaluated have not already been pre-selected on the grounds of their suitability to 3ie's preferred approach.




PS: I have been reading the [Malawi] CIVIL SOCIETY GOVERNANCE FUND - TECHNICAL SPECIFICATION REFERENCE DOCUMENT FOR POTENTIAL SERVICE PROVIDERS. In the section on the role of the Independent Evaluation Agent, it is stated that the agent will be responsible for "The commissioning and coordination of randomised control trials for two large projects funded during the first or second year of granting." This specification appears to have been made prior to the funding of any projects. So, will the fund managers feel obliged to find and fund two large projects that will be evaluable by RCTs? Fascinating, in a bizarre kind of way.

Friday, April 08, 2011

Models and reality: Dialogue through simulation


I have been finalising preparations for a training workshop on network visualisation as an evaluation tool. In the process I came across this "Causality Map for the Enhanced Evaluation Framework", for Budget Support activities.
On the surface this diagram seems realistic; budget support is a complex process. However, I will probably use this diagram to highlight what is often missing in models of development interventions. Like many others, it lacks any feedback loops, and as such it is a model that is a long way from the reality it is trying to represent in summary form. Using a distinction that is increasingly common these days (and despite my reservations about it), I think this model qualifies as complicated but not complex. If you were to assign a numerical value to each node and to each connecting relationship, the value generated at the end of the process (on the right) would always be the same.

The picture changes radically as soon as you include feedback loops, which is much easier to do when you use network rather than chain models (and where you give up using one dimension in the above type of diagram to represent the passage of time). Here below is my very simple example. This model represents five actors. They all start with a self-esteem rating of 1, but their subsequent self-esteem depends on the influence of the others they are connected to (represented by positive or negative link values, [randomly allocated]) and the self-esteem of those others.

You can see what happens when self-esteem values are recalculated to take into account those each actor is connected to, in this Excel file (best viewed with the NodeXL plugin). After ten iterations, Actor 0 has the highest self-esteem, and Actor 2 has the lowest. After 20 iterations Actor 2 has the highest self-esteem and Actor 1 has the lowest. After 30 iterations Actor 1 has the highest self-esteem and Actor 0 has the lowest. With more and more iterations the self-esteem of the actors involved might stabilise at a particular set of values, or it might repeat past patterns already seen, or maybe not.
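For readers who want to experiment without the Excel file, here is a minimal sketch of this kind of recalculation in Python. The link weights, the update rule and the tanh damping are my own illustrative assumptions rather than the exact formula used in the NodeXL spreadsheet, so the rankings it prints will not match those reported above.

```python
import numpy as np

# A toy version of the five-actor self-esteem model described above.
rng = np.random.default_rng(seed=1)

n_actors = 5
# Random directed links between actors, weighted +1 (supportive) or -1 (undermining);
# roughly half of the possible links are present, and no actor influences itself.
weights = rng.choice([-1, 0, 0, 1], size=(n_actors, n_actors))
np.fill_diagonal(weights, 0)

esteem = np.ones(n_actors)  # every actor starts with a self-esteem of 1

for iteration in range(1, 31):
    # Each actor's new self-esteem depends on the self-esteem of the actors
    # they are connected to, weighted by the positive or negative links.
    influence = weights.T @ esteem
    esteem = np.tanh(esteem + 0.5 * influence)  # tanh keeps values bounded

    if iteration in (10, 20, 30):
        ranking = np.argsort(esteem)
        print(f"After {iteration} iterations: highest = Actor {ranking[-1]}, "
              f"lowest = Actor {ranking[0]}")
```

Even with identical starting values, changing the seed (and so the pattern of positive and negative links) changes which actor ends up highest or lowest, which is the point being made here about feedback.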

There are two important points to be made here. The first is the dramatic effect of introducing feedback loops, in even the simplest of models. The aggregate results are not easily predictable, but they can be simulated. The second is that the nature of the impact seen, even in this very small complex system, is a matter of the time period under examination. Impact seen at iteration 10 is different from that at iteration 20, and different again at iteration 30. In the words of the proponents of systems perspectives on evaluation, what is seen depends on the "perspective" that is chosen (Williams and Hummelbrunner, 2009).
PS 1: Michael Woolcock has written about the need to pay more attention to a related issue, captured by the term "impact trajectory". He argues that: "...in virtually all sectors, the development community has a weak (or at best implicit or assumed) understanding of the shape of the impact trajectories associated with its projects, and even less understanding of how these trajectories vary for different kinds of project operating in different contexts, at different scales and with varying degrees of implementation effectiveness; more forcefully, I argue that the weakness of this knowledge greatly compromises our capacity to make accurate statements about project impacts, irrespective of whether they are inspired by ‘demand’ or ‘supply’ side imperatives, and even if they have been subject to the most deftly implemented randomised trial"
PS1.1: Some examples: I recall it being argued that there is a big impact on households when they first join savings and credit groups, but that the continuing impact drops to a much more modest level thereafter. On the other hand, the impact of girls completing primary school may be greatest when it reaches through to the next generation, to the number of their children and their survival rates.
There is one downside to my actors' self-esteem model, which is its almost excessive sensitivity. Small changes to any of the node or link values can significantly change the longer term impacts. This is because this simple model of a social system has no buffers or "slack". Buffers could take the form of accumulated attributes of the actors (like an individual's self-confidence arising from their lifetime experience, or a firm's accumulated inventory) and could also be provided by the wider context (like individuals having access to a wider network of friends, or firms having alternative sources of suppliers). This model could clearly be improved upon.
PS 2: I came across this quote by Duncan Watts, in a commentary on his latest book "Everything is Obvious": "when people base their decisions in part on what other people are deciding, collective outcomes become highly unpredictable". That is exactly what is happening in the self-esteem model above. Duncan Watts has written extensively on networks.
Here below is another unidirectional causal model, available on the Donor Committee for Enterprise Development website.

What I like about this example is that visitors to the website can click on the links (but not in the copy I have made above) and be taken to other pages where they will be given a detailed account of the nature of the causal processes represented by those links. This is exactly what the web was designed for. Visitors can also click on any of the boxes at the bottom and find out more about the activities that input into the whole process.

The inclusion of a feedback loop in this diagram would not be too difficult to imagine: for example, from the top box back to one of the earlier boxes, e.g. "New firms start / register". This positive feedback loop would quickly produce escalating results further up the diagram. Ideally, we would recognise that this type of outcome (simple continuous escalation) does not fit very well with our perception of what happens in reality. That awareness would then lead to further improvements to the model, so that it generates more realistic behaviour.
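To illustrate the escalation point, here is a toy calculation (not the DCED model itself); the boxes, coefficients and starting values are all invented for the sake of the example.

```python
# A toy illustration of why one positive feedback loop in an otherwise
# unidirectional chain produces continuous escalation. All numbers are invented.

new_firms = 100.0     # lower box: "New firms start / register"
top_outcome = 0.0     # top box, e.g. aggregate income growth

for year in range(1, 11):
    top_outcome = 0.2 * new_firms       # forward causation up the chain
    new_firms += 0.5 * top_outcome      # feedback: outcomes encourage more start-ups
    print(f"Year {year}: new firms = {new_firms:.0f}, top outcome = {top_outcome:.0f}")
```

Run it and the numbers grow without limit, which is exactly the kind of unrealistic behaviour that should prompt a revision of the model (for example by adding constraints or balancing loops).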
PS 24 May 2011:  In their feminist perspective on monitoring and evaluation Batliwala and Pittman have suggested that we need "to develop a “theory of constraints” to accompany our “theory of change” in any given context..." They noted that "… most tools do not allow for tracking negative change, reversals, backlash, unexpected change, and other processes that push back or shift the direction of a positive change trajectory. How do we create tools that can capture this “two steps forward, one step back” phenomenon that many activists and organizations acknowledge as a reality and in which large amounts of learning lay hidden? In women’s rights work, this is vital because as soon as advances seriously challenge patriarchal or other social power structures, there are often significant reactions and setbacks. These are not, ironically, always indicative of failure or lack of effectiveness, but exactly the opposite— this is evidence that the process was working and was creating resistance from the status quo as a result .”
But it is early days. Many development programs do not yet even have a decent unidirectional causal model of what they are trying to do. In this context, the inclusion of any sort of feedback loop would be a major improvement. As shown above, the next step that can be taken is to run simulations of those improved models by inserting numerical values in the links and functions/equations in nodes of those models. In the examples above we can see how simulations can help improve models by showing how their results do not fit with our own observations of reality. Perhaps in the future they will be seen as a useful form of pre-test, worth carrying out at the earliest stages of an evaluation.
PS 3: This blog was prompted by comments to me by Elliot Stern on the relevance of modeling and simulation to evaluation, on which I hope he has more to say.

PS 4 I am struggling through Manuel DeLanda's Philosophy and Simulation: The emergence of synthetic reason (2011), which being about simulations, relates to the contents of this post.
PS 5: I have just scanned Funnell and Rogers' very useful new book, "Purposeful Program Theory", and found 67 unidirectional models but only 15 models with one or more feedback loops (that is, 23%). This is quite disappointing. So is the advice on the use of feedback loops: "We advise against using so many feedback loops that the logic becomes meaningless. When feedback loops are incorporated, a balance needs to be struck between including all of them (because everything is related to everything else) and capturing some important ones. Showing that everything leads to everything else can make an outcome chain very difficult to understand - what we call a spaghetti junction model. Nevertheless some feedback loops can be critical to the success of a program and should be included ..." (p. 187)
Given the scarcity of models with feedback links even in this book, the risk of having too many feedback loops sounds like "a problem we would like to have". And I am not sure why an excess of feedback links should be any more probable than an excess of forward links. The concern about the understandability of models with feedback loops is, however, more reasonable, for reasons I have outlined above. When you introduce feedback loops, what were simple or complicated models start to exhibit complex behaviour. Welcome to something that is a bit closer to the real world.
PS 6: "As change makers, we should not try to design a better world. We should make better feedback loops" - the text of the last slide in Owen Barder's presentation "Development Complexity and Evolution".
PS 7: In Funnell and Rogers' book (page 486) they describe how the Intergovernmental Panel on Climate Change (IPCC) "recognised that controlled experimentation with the climate system in which the hypothesised agents of change are systematically varied in order to determine the climate's sensitivity to these agents...[is] clearly not possible". Different models of climate change were developed with different assumptions about possible contributing factors. "The observed patterns of warming, including greater warming over land than over the ocean, and their changes over time, are simulated only by the models that include anthropogenic forcing. No coupled global climate change model that has used natural forcing only has reproduced the continental warming trends in individual continents (except Antarctica) over the second half of the 20th century" (IPCC, 2001, p39)




Tuesday, March 22, 2011

A submission to the UK Independent Commission for Aid Impact (ICAI)

Background:
"ICAI is holding a public consultation to understand which areas of UK overseas aid stakeholder groups and the public believe the Commission should report on in its first three years. The consultation will run for 12 weeks from the 14th January until the 7th April 2011.
Click here to respond to the consultation
If you would like to read further information on ICAI, the consultation and the aid budget, please click on Consultation document.
If you would like to consider the questions in detail before responding, please click on Downloadable questions. You will need to access the online response form to respond".
Two initial comments:

1. An online survey is a fairly narrow approach to a public consultation. There are many other options, even if the ICAI is limited to those that can take place online rather than face to face.

2. The focus of the consultation is also narrow, i.e. "which areas of UK overseas aid stakeholder groups and the public believe the Commission should report on in its first three years". Equally important is how those areas of aid should be reported on.

On widening the process of consultation

1. All pages on the ICAI website should have associated Comment facilities, which visitors can make use of. More and more websites are being constructed with blog-type features such as these, because website managers these days expect to be interacting with their audience, not simply broadcasting to it. Built into such a facility would be an assumption that consultation is an ongoing process, not a one-off event.

2. The raw data results of the current online survey should be made publicly available, not just summaries. This is already possible, with minimal extra work required, because the survey provider (SurveyMonkey.com) is able to provide a public link to the survey results, with multiple options regarding reading, filtering and downloading the data. The more people who can access the data, the more value that might be obtained from it.  However, the most important reason for doing so is that the ICAI should be seen to be maximally transparent in its operations. Transparency will help build trust and confidence in the work and judgements of the commission.

3. Although late in the day, the ICAI should edit the Consultation page to include an invitation to people to submit their own submissions using their own words and structures.

4. The ICAI website should include an option for visitors to sign up for email notification of any changes to the website, including the main content pages and any comments made on those pages by visitors.

5. The ICAI should be open about what it will be open about. It should develop a policy on transparency and place that policy on its website. Disclosure policies are now commonplace for many large international organisations like the World Bank and IMF, and transparency in regard to international aid is high on the agenda of many governments, including the UK's. Having such a policy does not mean everything about the workings of the ICAI must be made public, but it would typically require a default assumption of openness, along with specified procedures and conditions relating to when and where information will not be disclosed.

On widening the content of consultation

1. The ICAI should be aware, if it is not already, that there continues to be intense debate about the best ways of assessing the value of international aid. This debate exists because of the multiplicity of purposes behind aid programs, the many and varied types of aid, the enormous diversity of contexts where it is provided, the wide range of people and organisations involved in its delivery, as well as some genuinely difficult issues of measurement and analysis. There are no simple and universally applicable solutions. Value for Money provides only a partial view of aid impact, and is only partially measurable. Randomised Control Trials (RCTs) can be useful for simple replicable interventions in comparable conditions, but many aid interventions are complex. The best immediate response in these circumstances is for the ICAI to be maximally transparent about the methods being used to assess aid interventions, and to be open to the wider debate.

A good starting point would be for the ICAI to make public: (a) the Terms of Reference for the "Contracted Out Service Provider" who will do the assessment work for the ICAI, and (b) the tendered proposal put forward by the winning bidder. Both of these documents refer to ways and means of doing the required work. It is also expected that there will be periodic reviews of the work of the winning bidder. The ToRs and reports of those reviews should also be publicly disclosed on the ICAI website. Finally, all the  "evaluations, reviews and investigations" to be carried out by the winning bidder on behalf of the ICAI should be publicly disclosed, as hopefully has already been agreed.

2. It would be useful, if only to help ensure that the ICAI itself delivers Value for Money, if the ICAI could clarify not only how its role will differ from that of the DFID Evaluation Department and multi-agency initiatives like 3ie, but also how it will cooperate with them to exploit any complementarities and possible synergies in their work. Complete and utter independence could lead to wasteful duplication. In the worst case various wheels could be reinvented.

For example, the OECD's Development Assistance Committee (DAC) has over the years developed a widely agreed set of evaluation criteria; however, these are nowhere to be seen in the ICAI's ToR for the "Contracted Out Service Provider". Instead, Value for Money receives repeated attention, and its definition is sourced to the National Audit Office (NAO). However, the NAO is a member of the Improvement Network, which has provided a wider perspective on Value for Money. Their website notes that effectiveness is part of Value for Money, as well as efficiency and economy. Commenting on effectiveness they note:
"Effectiveness is a measure of the impact that has been achieved, which can be either quantitative or qualitative....Outcomes should be equitable across communities, so effectiveness measures should include aspects of equity, as well as quality. Sustainability is also an increasingly important aspect of effectiveness."
Sustainability is a DAC evaluation criterion which has been around for a decade or more.
------------------------------------------------------------------------------------------------
A link to this blog posting has been emailed to c-robathan@icai.independent.gov.uk.
Other readers of this blog might like to do the same, with their own views.

PS: See Alex Jacob's March 22nd submission to the ICAI: Advice for the new aid watchdog




Wednesday, October 20, 2010

Counter-factuals and counter-theories

Thinking about the counter-factual means thinking about something that did not happen. So consider a project involving the provision of savings and credit services, with the expectation of reducing levels of poverty amongst the participating households. The counter-factual is the situation where the savings and credit services were not provided. This can either be imagined, or monitored through the use of a control group, which is a group of similar households in a similar context.

In the course of 20 years' work on monitoring and evaluation of development aid projects I have only come across one good opportunity to analyse changes in household poverty levels through the comparison of participating and non-participating households (i.e. the so-called double difference method: comparing participants and non-participants, before and after the intervention). This was in Can Loc District, Ha Tinh province, in Vietnam. In 1996 ActionAid Vietnam began a savings and credit program in Can Loc. In 1997 I helped them design and implement a baseline survey of almost 600 households, being a 10% sample of the population in three communes of Can Loc District, covering participants and non-participants in the savings and credit services (which reached about 25% of all households). This was done using the Basic Necessities Survey (BNS), an instrument that I have described in detail elsewhere.
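The double difference calculation itself is simple. Here is a minimal sketch, using invented poverty scores rather than the actual BNS results:

```python
# Double difference: (change for participants) - (change for non-participants).
# The scores below are invented purely to illustrate the arithmetic.

participants = {"before": 40.0, "after": 65.0}       # mean poverty score, higher = better off
non_participants = {"before": 45.0, "after": 60.0}

change_participants = participants["after"] - participants["before"]              # 25.0
change_non_participants = non_participants["after"] - non_participants["before"]  # 15.0

double_difference = change_participants - change_non_participants
print(f"Estimated impact (double difference): {double_difference}")               # 10.0
```

The logic is that the non-participants' change stands in for what would have happened to the participants anyway, which is exactly why the comparability problems discussed below matter so much.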

A few years later the responsibility for the project was handed over to a Vietnamese NGO called the Pro-Poor Centre (PPC), which had been formed by ex-ActionAid staff who used to work in Ha Tinh. They continued to manage the savings and credit program over the following years. In 2006, nine years after the baseline survey, an ex-ActionAid staff member who was by then working for a foundation in Hanoi held discussions with the PPC about doing a follow-up survey of the Ha Tinh households. I was brought in to assist with the re-use of the same BNS instrument as in 1997. At this stage the main interest was simply to see how much households' situations had improved over the nine year period, a period of rapid economic growth throughout much of Vietnam.

The survey went ahead, and was implemented with particular care and diligence by the PPC staff. A copy of the 2006 survey report can be found here (see pages 23-25 especially). Fortunately the PPC had carefully kept hard copy records of the 1997 baseline survey (including the sample frame) and I had also kept digital copies of the data. This meant it was possible to make a number of comparisons:
  • Of households' poverty status in 2006 compared to 1997
  • Of changes in the poverty status of households who were and were not participating in the savings and credit program during this period, i.e.:
    1. Those who had never participated
    2. Those who were in (in 1997) but dropped out (by 2006)
    3. Those who were not in (in 1997) but joined later (before 2007)
    4. Those who were always in (in 1997 and 2006)
    Somewhat to my surprise, I found what seemed an ideal set of results. Poverty levels had dropped the most in the 4th group ("always members"), then almost as much in the 2nd group ("ex-members"), less in the 3rd group ("new members") and least in the 1st group ("never members"). The 3rd group might have been expected to have changed less because over the years the project had expanded its coverage to include the less poor, reaching 43% of all households by 2006.

However, the project's focus on the poorest was also a problem. The members of the savings and credit program had not been randomly chosen, so the control group was not really a control group: the two groups were not comparable. (And at the time I had not heard of, nor do I yet know how to use, the propensity score matching method.)

The alternative to considering the effects of a counter-factual (i.e. a non-intervention) is, I guess, what could be called a "counter-theoretical": an alternative theory of what has happened, with the existing intervention.

My counter-theoretical centred on the idea of dependency ratios: poor families typically have high dependency ratios (i.e. many young children, relatively few adults). As families age this ratio changes, with dependent children growing up and becoming more able-bodied, able to take on workloads and/or generate income. Even without access to a savings and credit program, this demographic fact alone might explain why the participating families did better over the nine year period. It could also explain why the 2nd group did almost as well, if they were selected on the same basis of being the poorest, but had been participants for a shorter period of time.

    What I could have and should have done, was go back to the PPC and see what data they had on the family structure of the interviewed households. It is quite likely they would have the relevant data: ages of all family members, given their close involvement with the community. Unfortunately at that time there was not much interest in the impact assessment aspect of the survey, by either the foundation, the PPC or ActionAid, and their support was necessary for any further analysis. Perhaps I gave up too quickly…

    Nevertheless, reflection on this experience makes me wonder how often it would be well worthwhile, in the absence of good control group data, giving more attention to identifying and testing “counter-theoreticals” about the existing intervention, as part of a more rigorous process of coming to conclusions about impacts.

    PS1 3rd November 2010: I have since recalled that as part of the 2006 survey I met with the staff of ActionAid in Hanoi to explain the survey process and to solicit from them their views on the likely causes of any improvements. The attached file shows two lists, one relating to ActionAid interventions in the district, and the other relating to interventions in the same district by other organisations, including government. Micro-finance was at the top of the list of the ActionAid interventions seen as likely causes of change, but there were 7 others, as well as 12 non-ActionAid interventions that were possible causes. This raises the spectre of 12 possible alternative hypotheses, let alone various combinations of these. One approach I subsequently toyed with for generating composite predictions in this kind of multiple-location/multiple-intervention situation was the "Prediction Matrix".

PS2 3rd November 2010: The current edition of Evaluation (16(4), 2010) has an article by Nicoletta Stame, titled "What doesn't work? Three Failures, Many Answers", which includes a section on "Rival Explanations", which I have taken the liberty of copying and pasting below:
    "The link between complexity and causation has been at the centre of evaluation theory ever since and has nurtured thinking about 'plausible rival hypotheses' (Campbell, 1969). Although it was originally treated as a methodological problem of validity, it has recently been revisited from the substantive perspective of programme theory. Commenting on Campbell's interest in 'reforms', that are by definition 'complex social change', Yin contrasts two strategies of Campbell: that of the experimental design and that of using rival explanations. He concludes that the second - as Campbell himself came to admit in Campbell (199a) - is better suited to complex interventions (that are changing and multifaceted), as it is with the complex case studies that have been Yin's turf for a long times (Yin, 2000: 242). The use of rival explanations is common in other crafts journalism, detective work, forensic science and astronomy), where 'the investigator defines the most compelling explanations, tests them by fairly collecting data that can support or refute them, and - given sufficiently consistent and clear evidence - concludes that one explanation but not the others is the most acceptable' (Yin, 2000: 243). These crafts are empirical: their advantage is that while a 'whole host of societal changes may be amenable to empirical investigation', especially those where stakes are currently the highest, they are 'freed from having to impose an experimental design' ('the broader and in fact more common use of rival explanations covers real-life, not craft, rivals', Yin, 2000:248). Nonetheless, rival explanations are by no means alien to evaluation, as is shown by how Campbell himself has offered Pawson good arguments for criticizing the way systematic reviews are conducted (Pawson, 2006)."
    "The problem that remains is how to identify rival explanations. From a methodological starting point, Yin says that 'evaluation literature offers virtually no guidance on how to identify and define real-life rivals'. He proposes a typology of real-life rivals, that can variously relate to targeted interventions, to implementation, to theory to external conditions; and proposes examples of how to deal with them taken from such fields as decline in crime rates, support for industrial development, technological innovations, etc. However, Yin appears to overlook something that had indeed fascinated theory-based evaluation since its first appearance: the possible existence of different theories to explain the working of a programme, and the need to choose among them in order to test them. And - as Patton (1989: 377) has advised - it should be noted that in this way it would be possible to engage stakeholders in conceptualizing their own programme’s theories of action. Nevertheless, Yin’s contribution in its explicitness and methodological “correctness” is an important step forward."
    "Weiss responded to Win’s provocative stance. In an article entitled “What to do until the random assigners come”, she locates Yin’s contribution as the next step beyond Campbell’s ideas about plausible rival hypotheses: “where Campbell focused primarily on rival explanations stemming from methodological artifacts, Yin proposes to identify substantive rival explanations” (Weiss, 2002: 217). She describes the process whereby the evaluator “looks around and collects whatever information and qualitative data are relevant to the issue at hand” (2002: 219), in order to see “whether any [other factor, such as other programs or policies, environmental conditions, social and cultural conditions] could have brought about the kinds of outcomes that the target program was trying to affect”, thus setting up systematic inquires into the situation. Weiss concludes that alternative means to random assignment in order to solve the causality dilemma can be a “a combination of Theory-Based Evaluation and Ruling-Out” (the rival explanation)."
    I recommend the whole article...

PS3 6th December: On re-reading this post, especially Nicoletta's quote, I wondered about the potential usefulness of the "Evolving Storylines" method I developed some years ago. It could be used as a means of developing a small range of alternative histories of a project, each of which could then be subject to some testing (by focusing on the most vulnerable point in each story).

    Tuesday, October 05, 2010

    Do we need a Required Level of Failure (RLF)?


(PS: This post was previously titled "Do we need a Minimum Level of Failure (MLF)?", which could be misinterpreted as suggesting that we need to minimise the level of failure - which I definitely did not mean to suggest.)

    This week I am attending the 2010 European Evaluation Society conference in Prague. Today I have been involved in a number of interesting discussions, including how to commission better evaluations and the potential and perils of randomised control trials (RCTs). This has prompted me to resurrect an idea I have previously raised partly in jest, but which I now think deserves more serious consideration.

    Background first: RCTs have been promoted as an important means of improving the effectiveness of development aid projects. But there are also concerns that RCTs will become a dominating orthodoxy, driving out the use of other approaches to impact assessment, and in the worst case, discouraging investment in development projects which are not evaluable through the use of RCTs.

In my PhD thesis many years ago I looked at organisational learning through the lens of evolutionary epistemology. That school of thought sees evolution (through the reiteration of variation, selection and retention) as a kind of learning process, and human learning as a sub-set of that process. As I explain below, that view of the process of learning has some relevance to the current debate on how to improve aid effectiveness. It is also worth acknowledging the results of that process: evolution has been very effective in developing some extremely complex and sophisticated lifeforms, against which intentionally designed aid projects pale in comparison.

The point to be made: a common misconception is that evolution is about the "survival of the fittest". In fact this phrase, coined by Herbert Spencer, is significantly misleading. Biological evolution is NOT about the survival of the fittest, but about the non-survival of the least fit. This process leaves room for some diversity amongst those that survive, and it is this diversity that enables further evolution. The lesson here is that the process of evolution is not about picking winners according to some global standard of fitness, but about the culling of failures based on their lack of fitness to local circumstances.

This leads me to my own "modest proposal" for another route to improved aid effectiveness, an alternative to the widespread use of RCTs and the replication of the kinds of projects found to be effective by that means. This would be to build a widening consensus about the need for a defined "Minimum Level of Failure" (MLF) within the portfolio of activities funded or implemented by aid agencies. An MLF could be something like 10% of projects by value. Participating agencies would commit to publicly declaring this proportion of their projects as failed. Each of these agencies would also need to show: (a) how, in their particular operating context, they have defined these as failures, and (b) what steps they will take to avoid the replication of these failures in the future. There would be no need for a global consensus on evaluation methods, or a hegemony of methods arising through less democratic processes. PS: Using current M&E terminology, the consensus would need to be on the desired outcomes, not on the activities needed to achieve them.

I can of course anticipate, if not already hear, some protests about how unrealistic this proposal is. Let us hear these protests, especially in public. Any agency that protested would probably be implying, if not explicitly arguing, that such a failure rate would be unacceptable, because public monies and poor people's lives are at stake. However, making such a de facto claim of a 90%+ rate of success would be a seriously high risk activity, because it would be very vulnerable to disproof, probably through journalistic inquiry alone. For anyone involved with development aid programmes, a brief moment's reflection would suggest that the reality of aid effectiveness is very different: a 10% failure rate is probably way too optimistic, and in real life failures are much more common.

Perhaps the protesting agencies might be better advised to consider the upside of achieving a minimum level of failure. If taken seriously, establishing such a norm could help get the public at large, along with journalists and politicians, past the shock-horror of failure itself and into the more interesting territory of why some projects fail. It could also help raise the level of risk tolerance, and enable the exploration of more innovative approaches to the uses of aid. Both of these developments would be in addition to a progressive improvement in the average performance of development projects resulting from a periodic culling of the worst performers.

    It is possible that advocates of specific methods like RCTs (as the route to improved aid effectiveness) might also have some criticisms of the MLF proposal. They could argue that these methods will generate plenty of evidence of what does not work, and perhaps that evidence should be privileged. But the problem with this method-led solution is that there is already a body of evidence from a number of fields of scientific research that negative findings are widely under-reported. People like to publish positive findings. This may not be a big risk while RCTs are funded by one or two major actors, but it will become a systemic risk as the number of actors involved increases.  There needs to be an explicit and public focus on failure.

    Actual data on failure rates

    PS: 15th October 2010: Four days ago I posted below some information on the success and failure rates of  DFID projects. I have re-stated and re-edited that information here with additional comments:

There is some interesting data on failure within the DFID system, most notably the most recent review of Project Completion Reports (PCRs), undertaken in 2005. See the "An Analysis of Projects and Programmes in Prism 2000-2005" report available on the DFID website. The percentage (68%) of projects "defined as 'completely' or 'largely' achieving their Goals (Rated 1 or 2)" was given at the beginning of the Executive Summary, but information about failures was less prominent. Under section "8. Lessons from Project Failures" on page 61 it is stated that "There are only 23 projects [out of 453] within the sample that are rated as failing to meet their objectives (i.e. 4 or 5) and which have significant lessons" (italics added). This is equivalent to about 5% of the sampled projects.

More important are the 20% or so rated 3 = Likely to be partly achieved (see page 64). It could be argued that those with a rating of 3 should also be counted as failures, since their objectives are only likely to be partly achieved, versus largely achieved in the case of rating 2. In other words, a successful project should be defined as one likely to achieve more than 50% of its Output and Purpose objectives; the others are failures. This interpretation seems to be supported by a comment sent to me (whose author will remain anonymous): "My understanding is that projects with scores of less than 2 are under real pressure and may be quickly closed down unless they improve rapidly. I have certainly 'felt the pressure' from projects to score them 2 rather than 3. That said, I have not buckled to the pressure!"

    I think the fact that DFID at least has a performance scoring system (for all its faults), that it has done this analysis of the project scores, and that it has made the results public, probably puts it well ahead of many other aid agencies. I would like to hear about any other agencies who have done anything like this, along with comments on the strengths and weaknesses of what they have done. I would also like to see DFID repeat the 2005 exercise at the end of this year, this time with more discussion on the projects rated 3 = Likely to be partly achieved, and what subsequently happened to these projects.

PS 2nd November 2010: See Lawrence Haddad's reference here to the same set of DFID statistics, recently quoted/misused on the One Show.

    PS 3rd November 2010:  Thanks to Yu-Lan van Alphen, Programmamanager, Stichting DOEN, Amsterdam, for this book reference: Kathryn Schulz – "Being Wrong",  reviewed in the NYT. It sounds like a good read.

PS 14th February 2011: Computer programs are intolerant of programming errors, so programmers try to avoid them at all costs, not always successfully. Doing so becomes a much bigger challenge as software grows in size and complexity. Now some programmers are trying a different approach, one that involves recognising that there will always be programming errors. For more, see "'Let It Crash' Programming" by Craig Stuntz at http://blogs.teamb.com/craigstuntz/2008/05/19/37819/

PS 15th February 2011: "Why negative studies are good for health journalism, and where to find them"
"This is a guest column by Ivan Oransky, MD, who is executive editor of Reuters Health and blogs at Embargo Watch and Retraction Watch. One of the things that makes evaluating medical evidence difficult is knowing whether what's being published actually reflects reality. Are the studies we read a good representation of scientific truth, or are they full of cherry-picked data that help sell drugs or skew policy decisions?..."

    PS: 21 February 2011: See also the Admitting Failure website

    PS: 23 April 2011. See today's Bad Science column in the Guardian by Ben Goldacre, titled "I foresee that nobody will do anything about this problem", on the difficulty of getting negative findings published

PS: 23 May 2011: The above analysis of DFID project ratings focuses on the recognition of failures that have already occurred. It is also possible, and important, to take steps to ensure that failures can be recognised in the first place. A project that has no clear theory of change will be difficult to evaluate, and thus difficult to classify as a success or failure. The most common means of describing a development project's theory of change is probably a LogFrame representation. Within a reasonably well constructed LogFrame there is a sequence of "if...and...then..." statements, spelling out what is expected to happen as the project is implemented and takes effect. While there may be positive developments in a project's Goal level indicators, there also needs to be associated evidence that the expected chain of causation leading to that Goal has taken place as expected. It is not uncommon, in my experience, to find that while the expected outcomes have occurred, the outputs that were meant to contribute to those outcomes were not successfully delivered. In this situation the project cannot claim to be successful. There is, however, a more generic point to be made here. The more detailed a project's ToC is, the more vulnerable it will be to disproof: any one of the many expected causal links could be found not to have worked as expected. But if those linkages have not been disproved, then the project's claim to have contributed to any expected and observed changes is all the stronger. Willingness to allow failure to be identified strengthens the claim of any success that is observed. This seems an important observation in the case of projects where there is no possibility of making comparisons with a control group where there was no intervention. In those circumstances a ToC should be as detailed and articulate as possible.

PS: 23 May 2011: Articulating more disprovable theories of change may sound like a good idea, but it could be argued that this requirement risks locking aid agencies into a static view of the world they are working in, and one which is developed quite early in their intervention. In many settings, for example in humanitarian emergencies and highly politicised environments, aid agencies often have to revisit, revise and adapt their views of what is happening and how they should best respond. The best that might be expected in these circumstances is that those agencies are able to construct a detailed (and disprovable) history of what happened. This could actually produce better (i.e. more disprovable) results. There is some research evidence which shows that people find it easier to imagine events in some detail when they are situated in the past than to imagine the same kind of events taking place in the future[i].



[i] Bavelas, J.B. (1973) Effects of the temporal context of information, Psychological Reports, 35, 695-698, cited in Dance with Chance, by Makridakis, Hogarth and Gaba, 2009, page 189.



    Thursday, August 26, 2010

    Meta-narratives, evaluation and complexity

    A meta-narrative is a story about stories. Some evaluations take this form, especially those using participatory approaches to obtain qualitative data from a diversity of sources. Even more conventional expert-led evaluations have an element of storytelling to them as they attempt to weave information obtained from various sources, often opportunistically, into a coherent and plausible overall picture of what happened, and what might happen in future.


    Recently I have come across two examples of evaluations that were very much about creating a story about stories. They raised interesting questions about method: how can it be done well? 


    Stories about Culture


    The first evaluation was of a multiplicity of small arts projects in developing countries, funded by DOEN, a Dutch funding agency. Claudia Fontes used the Most Significant Change technique to elicit and analyse 95 stories from a sample of different kinds of participants in these projects. The aim was to identify what DOEN’s cultural intervention meant to the primary stakeholders. What particularly interested me was one part of the MSC process, which can be a useful step when faced with a large number of stories. This involved the participants categorising the stories into different groupings, according to their commonalities. It was from each of these groupings that the participants then went on to select, through intensive discussion, what they saw as the most significant changes of all. In one country five categories of stories were identified: Personal Development and Growth, Professional Development, Exposure, Change Of Perception And Attitude Towards Art And Artists, and Validation Of Self-Expression. Later on, at the report writing stage, Claudia looked at the contents of these groupings, especially the MSC stories within each, and produced an interpretation of how these groups of stories linked together. In other words, a meta-narrative. 


    “For the primary stakeholders in XXXX these categories of change relate to each other in that the personal and professional development of artists and other professionals who support the artists’ work results in a validation of the self-expression of direct (artists) and indirect (public in general) users. This process of affirmation and recovery of ownership of self-expression contributes in turn to a change in society’s perception of art and artists with the potential to make the whole cycle of change sustainable for the sector. Strategies of exposure have a key role in contributing across these changes, and towards the profiling of the sector in general” (italics added) 

In commenting on the report I suggested that in future it might be possible and useful to take a participatory approach to the same task of producing a meta-narrative. Faced with the five groupings (and knowledge of their contents), each participant could be asked to identify expected causal connections between the different groupings, and give some explanation of their views. This can be done through a simple card sorting exercise. The results from multiple participants can then be aggregated, and the result will take the form of a network of relationships between groupings, some stronger than others (stronger in the sense of more participants highlighting a particular causal linkage). This emergent structure can then be visualised using network software. Once visualised in this manner, the structure could be the subject of discussion, and perhaps some revisions. One important virtue of this kind of process is that it will not necessarily produce a single dominant narrative: minority and majority views will be discernible. And using network visualisation software, the potential complexity would be manageable. Network views can be filtered on multiple variables, such as the strength of the causal linkages.
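By way of illustration, here is a sketch of how multiple participants' card-sort judgements could be aggregated into a weighted causal network. The participants and their nominated links are invented; networkx is used here, though NodeXL or similar tools would serve the same purpose.

```python
import networkx as nx

# The five categories of stories identified in the country example above.
categories = [
    "Personal Development and Growth",
    "Professional Development",
    "Exposure",
    "Change of Perception and Attitude Towards Art and Artists",
    "Validation of Self-Expression",
]

# Each (invented) participant nominates the causal links they expect between categories.
participant_links = {
    "P1": [("Exposure", "Professional Development"),
           ("Professional Development", "Validation of Self-Expression")],
    "P2": [("Exposure", "Professional Development"),
           ("Personal Development and Growth", "Validation of Self-Expression")],
    "P3": [("Exposure", "Professional Development"),
           ("Validation of Self-Expression",
            "Change of Perception and Attitude Towards Art and Artists")],
}

graph = nx.DiGraph()
graph.add_nodes_from(categories)
for links in participant_links.values():
    for source, target in links:
        if graph.has_edge(source, target):
            graph[source][target]["weight"] += 1  # stronger link = more participants agree
        else:
            graph.add_edge(source, target, weight=1)

# Links nominated by more participants can be drawn thicker, or the network
# filtered to show only the links supported by a majority of participants.
for source, target, data in graph.edges(data=True):
    print(f"{source} -> {target}: nominated by {data['weight']} participant(s)")
```

The weighted edges preserve both majority and minority views, which is the virtue claimed for this approach above.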


    Stories about Conflict


The second evaluation, by Lundy and McGovern, was of community-based approaches to post-conflict "truth telling" in Northern Ireland. I was sent this and other related papers by Ken Bush, who is exploring methods for evaluating story-telling as a peace building methodology. His draft conceptual framework notes that a "survey of the literature highlighted the lack of an agreed and effective evaluation tool for story-telling in peace-building despite the near universality of the practice and the huge monetary investment by the EU and others in story-telling projects."

Lundy and McGovern's paper is a good read, because it explores the many important complications of storytelling in a conflicted society: not only important issues like the appropriate sampling of story tellers, but also how the storytelling project's intentions were framed, and how its results were presented. The primary product of the project was a publication called "Ardoyne: The Untold Truth", containing testimonies based on 300 interviews. The purpose of Lundy and McGovern's assessment of the project was "to assess the impacts and benefits of community based 'truth telling'". This was done by interviewing 50 people from five different stakeholder groups. The results were then written up in their paper.


    What we have here is daunting in its complexity: (a) There are the “original” stories, as compiled in the book, (b) then the stories of people’s reactions to these stories and how they were collected and disseminated, (c) and then the authors’ own story about how they collected these stories and  their interpretation of them as a whole. And of course behind all this we have the complex (as colloquially used) context of Northern Ireland!

    When reading what might be called Lundy and McGovern’s meta-meta-narrative (i.e. the interpreted results of the interviews) I looked for information on how sources were cited. These are the sorts of phrases I found: “according to respondents”, “many”, ”there was evidence”, “most”, “the vast majority”, “It was felt”, “respondents”, “in the main”, “for many”, “many people”, “there was a very strong opinion”, “it was felt”, “there was a consensus”. “for the majority of participants”, “without exception”, “many interviews”, “overwhelmingly”, “for others”, “some”, “for these respondents”, “one of the most frequently mentioned”, “it was further suggested”, “most respondents”, “the view”, “it was further suggested”, in general respondents were of the view that”, “the experience of those involved…would seem to suggest”, “some respondents”, “the overwhelming majority”, “responses from Union representatives were”, “for some”, “a representative of the community sector”, “that said, others were”, “by another interviewee”. “it was”, and “a significant section of mainly nationalist interviewees”.


I list these here with some hesitation, knowing how often during evaluations I have resorted to using the same vocabulary when faced with making sense of many different comments from different sources in a limited period of time. However, there are important issues here, made even more important by how often we have to deal with situations like this. How people see things, like their reactions to the Ardoyne stories, matters. How many people see things in a given way matters, who those groups of people are matters, and how the views of different groups overlap also matters. In Lundy and McGovern's paper we only get glimpses of this kind of underlying social structure. We sometimes get a sense of majority or minority, occasionally which particular group holds a view, and sometimes that a group sharing one view also thinks that...


How could it be done differently? The views of a set of respondents can be summarised in a "two-mode" matrix, with respondents listed in the rows and descriptions of views listed in the columns, and cell values indicating what is known about a person's view on a listed issue: for example, agreement/disagreement, degree of agreement, or not known. By itself this data is not easy to analyse, other than through frequency counts (e.g. the number of people supporting a given view, or the number of views expressed by a given person). But it is possible to convert this data into two different kinds of one-mode matrix, showing: (a) how different people are connected to each other (by their shared views) and (b) how different views are connected to each other (by the same people holding those views). The network structure of the data in these matrices can be seen and further manipulated using network visualisation software.
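The conversion from two-mode to one-mode matrices is a standard matrix operation. A minimal sketch, with invented respondents and views:

```python
import numpy as np

# Two-mode matrix: rows = respondents, columns = views; 1 = holds the view, 0 = does not.
# All names and values are invented for illustration.
respondents = ["R1", "R2", "R3", "R4"]
views = ["View A", "View B", "View C"]

two_mode = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
])

# (a) Respondent-by-respondent: each cell = number of views two respondents share.
respondent_by_respondent = two_mode @ two_mode.T

# (b) View-by-view: each cell = number of respondents holding both views.
view_by_view = two_mode.T @ two_mode

print("Respondents linked by shared views:\n", respondent_by_respondent)
print("Views linked by shared respondents:\n", view_by_view)
```

Either matrix can then be loaded into network visualisation software (NodeXL, UCINET, Gephi and the like) and filtered, for example to show only views shared by a minimum number of respondents.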

    As in many evaluations, Lundy and McGovern were constrained by a confidentiality commitment. Individuals can be anonymised by being categorized into types of people, but this may have its limits if the number of respondents is small and the identity of participants is known to others (if not their specific views). This means the potential to make use of the first kind of network visualization (i.e. a) may be limited, even if the visualization only showed the relationships between types of respondents. However, the second type (i.e. b) should remain an option. To recap, this would show a network of opinions: some strongly linked to many others because they were often shared by other respondents, others with weaker links because they were shared with few, if any, other respondents. The next step would be the development of a narrative commentary explaining the highlights of the network structure. This would usefully focus on the contents of the different clusters of opinions, and the nature of any bridges between them, especially where the clusters expressed contrasting views.
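    As a rough illustration of what that commentary could be anchored in, the sketch below (with an invented view-by-view network of the kind produced above) uses a simple community detection algorithm to find clusters of opinions, and edge betweenness to flag candidate bridges between them. It is just one possible way of doing this.

```python
# An illustrative sketch: finding clusters of opinions and candidate bridges
# in a hypothetical view-by-view network (edge weight = number of respondents
# holding both views). All names and weights are invented.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

view_net = nx.Graph()
view_net.add_weighted_edges_from([
    ("truth telling helped", "process was inclusive", 3),
    ("process was inclusive", "healing for families", 2),
    ("truth telling helped", "book raised tensions", 1),
    ("book raised tensions", "concern about reprisals", 2),
])

# Clusters of opinions that tend to be held by the same respondents
for i, cluster in enumerate(greedy_modularity_communities(view_net, weight="weight"), 1):
    print(f"Opinion cluster {i}: {sorted(cluster)}")

# Edges lying between clusters score highly on betweenness: candidate 'bridges'
# worth discussing in the narrative commentary
bridges = nx.edge_betweenness_centrality(view_net)
print(sorted(bridges.items(), key=lambda kv: -kv[1])[:2])
```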


    There are two significant hurdles in front of this approach. The first is that typically not all respondents will express views on all topics, and the number who do will vary across topics. One option would be to filter out the views expressed by the fewest respondents. The other, which I have never tried but would be interesting to explore, would be to build a supplementary question into interviews, along the lines of  “…and how many people do you think would feel the same way as you on this issue?”. The answers would be important in themselves, possibly affecting how the same people might act on their own views. But they could also provide a weighting mechanism for views in an otherwise small sub-sample.
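    A small, self-contained sketch of these two options might look like the following, where all the figures (view names, respondent counts and the perceived-share estimates from the hypothetical supplementary question) are invented for illustration.

```python
# A speculative sketch of the two options above. View names, respondent counts,
# and the perceived-share estimates are all invented for illustration.

# Number of respondents who expressed each view
frequency = {
    "truth telling helped": 12,
    "book raised tensions": 3,
    "process was inclusive": 1,
}

# Option 1: filter out views expressed by too few respondents
min_respondents = 2
kept = {view: n for view, n in frequency.items() if n >= min_respondents}

# Option 2: weight the remaining views by the average answer to the hypothetical
# supplementary question, expressed as a proportion of the wider community
perceived_share = {"truth telling helped": 0.8, "book raised tensions": 0.4}
weighted = {view: n * perceived_share.get(view, 1.0) for view, n in kept.items()}

print(weighted)  # {'truth telling helped': 9.6, 'book raised tensions': 1.2}
```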

    The second hurdle is that the network description of the relationships between the participants’ views is a snapshot in time. But an evaluation usually requires comparison with a prior state. This is a problem if the questions asked by Lundy and McGovern were about current opinions, but if they were about changes in people’s views it would not be.

    Let’s return to the layer below: the stories collected in the original “Ardoyne: The Untold Truth” publication. Stories beget stories. The telling of one can prompt the telling of another. If stories can be seen as linked in this way, then as the number of stories recounted grows we could end up with a network of stories. Some stories in that network may be told more often than others, because they are connected to many others in the minds of the storytellers. These stories might be what complexity science people call "attractors". Although storytellers may start off telling various different stories, there is a likelihood that many of them will end up telling these particular stories, because of their connectedness, their position in the network. If these stories are negative, in the sense of provoking antipathy towards others in the same community, then this type of structure may be of concern. Ideally the attractors, the highly connected stories in the network, would be positive stories, encouraging peace and cooperation with others. This network structure of stories could be explored by an evaluator asking questions like "What other stories does this story most remind you of?" or "Which of these stories does that story most remind you of?", or versions thereof. When comparing changes over time, the evaluator's focus would then be on the changing contents of the strongly connected versus weakly connected stories.
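    Purely as an illustration of how the "reminds me of" answers could be analysed, the sketch below treats each answer as a link between two stories and looks for the most highly connected stories as candidate attractors. The stories and links are invented.

```python
# A hypothetical sketch: each "this story reminds me of that story" answer
# becomes a link, and the most connected stories are candidate 'attractors'.
import networkx as nx

reminds_of = [
    ("checkpoint story", "curfew story"),
    ("checkpoint story", "funeral story"),
    ("curfew story", "funeral story"),
    ("funeral story", "reconciliation story"),
]
story_net = nx.Graph(reminds_of)

centrality = nx.degree_centrality(story_net)
ranked = sorted(centrality, key=centrality.get, reverse=True)
print("Most connected (candidate attractor) stories:", ranked[:2])
```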


    In the discussion above I have outlined how a network approach could help us construct various types of aggregated (network) views of multiple stories. Because they are built up out of the views of individuals, it would be possible to see where there were varying degrees of agreement within those structures. They would not be biased towards a single narrative that excludes others, a concern of many people using story-telling approaches, including some of the originators of the term meta-narrative.

    And complex histories


    My final comments relate to another form of story-telling: the grand narrative as done by historians. Yesterday I read with interest Niall Ferguson’s Complexity and Collapse: Empires on the Edge of Chaos (originally in Foreign Affairs). In this article Niall describes the ways some historians have sought to explain the rise and fall of empires in terms of sequences of events taking place over long periods. In his view they suffer from what Nassim Taleb calls "the narrative fallacy": they construct psychologically satisfying stories on the principle of post hoc, ergo propter hoc ("after this, therefore because of this”). That is, the propensity to over-explain major historical events, to create a long and coherent story where in fact there was none. His alternative view is couched in terms of complexity theory ideas. Given the complexity of modern societies, “In reality, the proximate triggers of a crisis are often sufficient to explain the sudden shift from a good equilibrium to a bad mess.” He then qualifies the notion of equilibrium: “a complex economy is characterized by the interaction of dispersed agents, a lack of central control, multiple levels of organization, continual adaptation, incessant creation of new market niches, and the absence of general equilibrium.” Within those systems small changes can have catastrophic (i.e. non-linear) effects, because of the nature of the connectivity involved. Ferguson then goes on to recount examples of the rapidity of decline in some major empires.

    One point which he does not make, but which I think is implicit in his description of how change can happen in complex systems, is that more than one type of small change can trigger the same kind of large-scale change. Consider the assassination of Archduke Franz Ferdinand of Austria in Sarajevo in June 1914. Would World War 1 not have happened if that event had not taken place? Not speaking as a historian, my guess is that there are quite a few other events that could have triggered the start of a war thereafter.


    Niall Ferguson’s complexity-based view is in a sense a technocrat’s objection to grand narratives, but perhaps also another kind of grand narrative in its own right. Nevertheless his view does seem to have practical relevance to the writing of evaluation stories: it highlights the need for caution about excessive internal coherence in any story of change and its causes. A network view of causal relationships between types of events, constructed by participants with differing views, might help mitigate this risk, especially when that view then needs to be reduced to a text description.

    PS: "In recent years, however, advancements in cognitive neuroscience have suggested that memories unfold across multiple areas of the cortex simultaneously, like a richly interconnected network of stories, rather than an archive of static files." in The Fully Immersive Mind of Oliver Sacks

    PS 25 October 2010. Please also see  Networks of self-categorised stories

    Friday, August 20, 2010

    Cynefin Framework versus Stacey Matrix versus network perspectives

      
    Cynefin

    Lots of people seem to like the Cynefin Framework. Jess Dart and Patricia Rogers are among the friends and colleagues of mine who have expressed a liking for it. It was one of the subjects of discussion at the recent Evaluation Revisited conference in Utrecht in May. Why don’t I like it? There are three reasons...

    Usually matrix classifications of possible states are based on the intersection of two dimensions. They can provide good value because combining two dimensions to generate four (or more) possible states is a compact and efficient way of describing things. Matrix classifications have parsimony.

    But whenever I look at descriptions of the Cynefin Framework I can never see, or identify, what the two dimensions are which give the framework its 2 x 2 structure, and from which the four states are generated. If they were more evident I might be able to use them to identify which of the four states best described the particular conditions I was facing at a given time. But up to now I have just had to make a best guess, based on the description of each state. PS: I have been told by someone recently that Dave Snowden says this is not a 2x2 matrix, but if so, why is it presented like one?

    My second concern is the nature of the connection between this fourfold classification and other research on complexity, beyond the field of management studies and consultancy work. IMHO, there is not much in the way of a theoretical or empirical basis for it, especially when Dave’s fifth state of “disorder” is placed in the centre. This may be the reason why the two axes of the matrix I mentioned above have not been specified: because they have not yet been found.

    My third concern is that I don’t think the fourfold classification has much discriminatory power. Most of the situations I face as an evaluator could probably be described as complex. I don’t see many really chaotic ones, like gyrating stock markets or changeable weather patterns, nor do I see many that could be described as simple, or just complicated, except perhaps when dealing with a single person’s task that does not involve interactions with others. Given the prevalence of complex situations, I would prefer to see a matrix that helped me discriminate between different forms of complexity, and their possible consequences.

    Stacey


    This brings me to Stacey's matrix, which does have two identifiable dimensions, shown above: certainty (i.e. the predictability of events) and the degree of agreement over those events. Years before I had heard of "Stacey's matrix" I had found the same kind of 2 x 2 matrix a useful means of describing four different kinds of possible development outcomes, each with different implications for what sort of M&E tools would be most relevant. For example, by definition you cannot use predefined indicators to monitor unpredictable outcomes (regardless of whether we agree or disagree on their significance). However, methods like MSC can be used to monitor these kinds of change. And a good case could be made for more attention to the use of historians' skills, especially to respond to unexpected events whose meaning is disputed. More recently I argued that weighted checklists are probably the most suitable for tracking outcomes that are predictable but where there is not necessarily any agreement about their significance. A quote from Patton could be hijacked and used here: "These distinctions help with situation recognition so that an evaluation approach can be selected that is appropriate to a particular situation and intervention, thereby increasing the likely utility -and actual use- of the evaluation" (page 85, Developmental Evaluation).

    Post script: Here is an example of how I have used it for this kind of purpose, in a posting on MandE NEWS about weighted checklists.


    From what I have read I think Ralph Stacey also produced the following more detailed version of his matrix:


    This has then been simplified by Brenda Zimmerman, as follows


    In this version simple, complicated, complex and anarchy (chaos) are in effect part of a continuum, involving different mixes of agreement and certainty. Interestingly, from my point of view, the category taking up the most space in the matrix is that of complexity, echoing my gut-level feeling expressed above. This feeling was supported when I read Patton's three examples of simple, complicated and complex (page 92, ibid), based on Zimmerman. The simple and complicated examples were both about making materials do what you wanted (cake mix and rocket components), whereas the complex example was about child rearing, i.e. getting people to do what you wanted. More interesting still, the complex example was raising a couple of children in a family, in other words a small group of people. So anything involving more people is probably going to be a whole lot more complex. PS: And interestingly, along the same lines, the difference between simple and complicated was between a physical task involving one person (following a recipe) and one involving large numbers of people (sending a rocket into space).

    Another take on this is given by Chris Rodgers’ comments on Stacey’s views:
    Although the framework, which Stacey had developed in the mid-1990s, regularly crops up in blogs, on websites and during presentations, he no longer sees it as valid and useful.  His comment explains why this is the case, and the implications that this has for his current view of complexity and organizational dynamics.  In essence, he argues that
    • life is complex all the time, not just on those occasions which can be characterized as being “far from certainty” and “far from agreement” …
    • this is because change and stability are inextricably intertwined in the everyday conversational life of the organization …
    • which means that, even in the most ordinary of situations, something unexpected might happen that generates far-reaching and unexpected outcomes …
    • and so, from this perspective, there are no “levels of complexity” …
    • nor levels in human action that might usefully be thought of as a “system”.
    Well, maybe… but this is beginning to sound a bit too much like the utterances of a Zen master to me :-) Like Rodgers, I hope we can still make some kind of useful distinctions re complexity.

    Back to Snowden

    Which brings me back to a more recent statement by Dave Snowden, which to me seems more useful than his earlier Cynefin Framework. In his presentation at the Gurteen Knowledge Cafe, in early 2009, as reported by Conrad Taylor, "Dave presented three system models: ordered, chaotic and complex. By ‘system’ he means networks that have coherence, though that need not imply sharp boundaries. ‘Agents’ are defined as anything which acts within a system. An agent could be an individual person, or a grouping; an idea can also be an agent, for example the myth-structures which largely determine how we make decisions within the communities and societies within which we live."
    • "Ordered systems are ones in which the actions of agents are constrained by the system, making the behavior of the agents predictable. Most management theory is predicated on this view of the organisation."
    • "Chaotic systems are ones in which the agents are unconstrained and independent of each other. This is the domain of statistical analysis and probability. We have tended to assume that markets are chaotic; but this has been a simplistic view."
    • "Complex systems are ones in which the agents are lightly constrained by the system, and through their mutual interactions with each other and with the system environment, the agents also modify the system. As a result, the system and its agents ‘co-evolve’. This, in fact, is a better model for understanding markets, and organisations.”

    This conceptualization is simpler (i.e. has more economy) and seems more connected with prior research on complexity. My favorite relevant source here is Stuart Kauffman’s book At Home in the Universe: The Search for the Laws of Complexity (pp. 86-92), where he describes the behavior of electronic models of networks of actors (with on/off behavior states for each actor) moving from simple to complex to chaotic patterns, depending on the number of connections between them. As I read it, few connections generate ordered (stable) network behavior, many connections generate chaotic (apparently unrepeating) behavior, and medium numbers (where N actors = N connections) generate complex cyclical behavior. (See more on Boolean networks.)
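    For anyone curious to see this behaviour, here is a rough sketch based on my reading of Kauffman (not his code): a random Boolean network of N actors with on/off states, each updated by a random rule applied to K other actors. Increasing K tends to shift the network from short, stable cycles towards very long, apparently unrepeating ones. The parameter values are illustrative only.

```python
# A rough sketch of a random Boolean network, after Kauffman's description:
# N actors with on/off states, each updated by a random Boolean rule applied
# to K randomly chosen other actors. Parameter values are illustrative only.
import random

def cycle_length(n_actors=20, k_inputs=2, max_steps=200, seed=1):
    random.seed(seed)
    inputs = [random.sample(range(n_actors), k_inputs) for _ in range(n_actors)]
    # One random truth table (2**K entries) per actor
    tables = [[random.randint(0, 1) for _ in range(2 ** k_inputs)]
              for _ in range(n_actors)]
    state = tuple(random.randint(0, 1) for _ in range(n_actors))

    seen = {}  # state -> step at which it was first seen
    for step in range(max_steps):
        if state in seen:
            return step - seen[state]          # length of the repeating cycle
        seen[state] = step
        state = tuple(
            tables[i][sum(bit << j
                          for j, bit in enumerate(state[m] for m in inputs[i]))]
            for i in range(n_actors)
        )
    return None  # no repeat found: behaviour looks chaotic at this scale

for k in (1, 2, 5):
    length = cycle_length(k_inputs=k)
    print(f"K={k}: cycle length =", length if length is not None else "none found (chaotic?)")
```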

    This relates back to a conversation I had with Dave Snowden in 2009 about the value of a network perspective on complexity, in which he said (as I remember) that relationships within networks can be seen as constraints. So, as I see it, in order to differentiate forms of complexity we should be looking at the nature of the specific networks in which actors are involved: their number, the structure of their relationships, and perhaps the extent to which the actors have their own individual autonomy, i.e. responses which are not specific to particular relationships (an attribute not granted to “actors” in the electronic model described).

    My feeling is that with this approach it might even be possible to link this kind of analysis back to Stacey’s 2x2 matrix. Predictability might be primarily a function of connectedness, and therefore more problematic in larger networks where the number of possible connections is much higher. The possibility of agreement, Stacey’s second dimension, might be further dependent on the extent to which actors have some individual autonomy within a given network structure.

    To be continued…

    PS 1: Michael Quinn Patton's book on Developmental Evaluation has a whole chapter on "Distinguishing Simple, Complicated, and Complex". However, I was surprised to find that despite the book's focus on complexity, there was not a single reference in the index to "networks". There was one example of a network model (Exhibit 5.3), contrasted with a "Linear Program Logic Model…" (Exhibit 5.2), in the chapter on Systems Thinking and Complexity Concepts. [I will elaborate further here]

    Regarding the simple, complicated and complex, on page 95 Michael describes these as "sensitising concepts, not operational measurements". This worried me a bit, but it is an idea with a history (look here for other views on this idea). But he then says "The purpose of making such distinctions is driven by the utility of situation recognition and responsiveness. For evaluation this means matching the evaluation to the nature of the situation". That makes sense to me, and is how I have tried to use the simple version of the Stacey Matrix (using its dimensions only). However, Michael then goes on to provide, perhaps unintentionally, evidence of how useless these distinctions are in this respect, at least in their current form. He describes working with a group of 20 experienced teachers to design an evaluation of an innovative reading program. "They disagreed intensely about the state of knowledge concerning how children learn to read… Different preferences for evaluation flowed from different definitions of the situation. We ultimately agreed on a mixed methods design that incorporated aspects of both sets of preferences". Further on in the same chapter, Bob Williams is quoted reporting the same kind of result (i.e. conflicting interpretations), in a discussion with health sector workers. PS 25/8/2010 - Perhaps I need to clarify here - in both cases participants could not agree on whether the situation under discussion was simple, complicated or complex, and thus these distinctions could not inform their choices of what to do. As I read it, in the first case the mixed method choice was a compromise, not an informed choice.

    PS 2: I have also just pulled Melanie Mitchell's "Complexity: A Guided Tour" off the shelf, and re-scanned her Chapter 7 on "Defining and Measuring Complexity". She notes that about 40 different measures of complexity have been proposed by different people. Her conclusion, 17 pages later, is that "The diversity of measures that have been proposed indicates that the notions of complexity that we're trying to get at have many different interacting dimensions and probably can't be captured by a single measurement scale". This is not a very helpful conclusion. But I noticed that she does cite earlier three categories of measures that cover many of the 40 or so: 1. how hard the object or process is to describe; 2. how hard it is to create; and 3. what its degree of organisation is.

    PS 3: I have followed up John Caddell's advice to read a blog post by Cynthia Kurtz (a co-author of the IBM Systems Journal paper on Cynefin) recalling some of the early work around the framework. That post included the following version of the Cynefin Framework from the oft-mentioned "The new dynamics of strategy: Sense-making in a complex and complicated world", published in the IBM Systems Journal, Vol 42, No 3, 2003.
    In her explanation of the origins of this version she says it had two axes: "the degree of imposed order" and "the degree of self-organization." This I found interesting because these dimensions have the potential to be measurable. If they are measurable, then the actual behavior of the four identified systems could be compared, and we could then ask "Does their behavior differ in ways that have consequences for managers or evaluators?" I have previously speculated that there might be network measures that could capture these two dimensions: network density and network centrality. Network centrality could be the x axis, low on the left and high on the right, and network density could be the y axis, low at the bottom and high at the top. How well the differences between these four types of network structures might capture our day-to-day notion of complexity is not yet clear to me. As mentioned way above, density does seem to be linked to differences between simple, complex and chaotic behavior. Maybe differences in centrality moderate or magnify the consequences of different levels of network density?
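    As a small illustration of how those two candidate measures could actually be computed, the sketch below uses networkx for density and a Freeman-style degree centralization score as one way of operationalising whole-network "centrality". The toy network, and the mapping onto Kurtz's two axes, are my own conjecture rather than anything from the original paper.

```python
# An illustrative calculation of network density and degree centralization
# for a small invented network. Centralization here follows Freeman's formula:
# 0 = everyone equally connected, 1 = a perfect star around one node.
import networkx as nx

G = nx.Graph([("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")])

density = nx.density(G)

degrees = dict(G.degree())
n, d_max = len(G), max(degrees.values())
centralization = sum(d_max - d for d in degrees.values()) / ((n - 1) * (n - 2))

print(f"density = {density:.2f}, degree centralization = {centralization:.2f}")
```

    Whether these particular measures really do track "imposed order" and "self-organization" is, of course, the open question raised above.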

    PS 4 (April 2015): For more reading on this subject that may be of interest, see Diversity and Complexity by Scott E. Page, Princeton, 2011.

    PS 5 (June 2020): Please view Andy Stirling's videoed take on risk, uncertainty, ambiguity and ignorance, a slightly different take on the two dimensions also present in the Stacey matrix.
