Tuesday, April 07, 2020

Rubrics? Yes, but...

The blog posting is a response to Tom Aston's blog posting: Rubrics as a harness for complexity

I have just reviewed an evaluation of the effectiveness of policy influencing activities of programs funded by HMG as part of the International Carbon Finance Initiative.  In the technical report there are a number of uses of rubrics to explain how various judgements were made.  Here, for example, is one summarising the strength of evidence found during process tracing exercises:
  • Strong support – smoking gun (or DD) tests passed and no hoop tests (nor DDs) failed.
  • Some support – multiple straw in the wind tests passed and no hoop tests (nor DDs) failed; also, no smoking guns nor DDs passed.
  • Mixed – mixture of smoking gun or DD tests passed but some hoop tests (or DDs) failed – this required the CMO to be revised.
  • Failed – some hoop (or DD) tests failed, no double decisive or smoking gun tests passed – this required the theory to be rejected and the CMO abandoned or significantly revised. 

Another rubric described in great detail how three different levels of strength of evidence were differentiated (Convincing Plausible, Tentative).  There was no doubt in my mind that these rubrics contributed significantly to the value of the evaluation report.  Particularly by giving readers confidence in the judgements that were made by the evaluation team.

But… I can't help feel that the enthusiasm for rubrics seems to be out of proportion with their role within an evaluation.  They are a useful measurement device that can make complex judgements more transparent and thus more accountable.  Note the emphasis on the ‘more‘… There are often plenty of not necessarily so transparent judgements present in the explanatory text which is used to annotate each point in a rubric scale.  Take, for example, the first line of text in Tom Aston’s first example here, which reads “Excellent: Clear example of exemplary performance or very good practice in this domain: no weakness”

As noted in Tom’s blog it has been argued that rubrics have a wider value i.e. rubrics are useful when trying to describe and agree what success looks like for tracking changes in complex phenomena”.  This is where I would definitely argue “Buyer beware” because rubrics have serious limitations in respect of this task.

The first problem is that description and valuation are separate cognitive tasks.  Events that take place can be described, they can also be given a particular value by observers (e.g. good or bad).  This dual process is implied in the above definition of how rubrics are useful.  Both of these types of judgements are often present in a rubrics explanatory text e.g. Clear example of exemplary performance or very good practice in this domain: no weakness”

The second problem is that complex events usually have multiple facets, each of which has a descriptive and value aspect.  This is evident in the use of multiple statements linked by colons in the same example rubric I refer to above.

So for any point in a rubric’s scale the explanatory text has quite a big task on its hands.  It has to describe a specific subset of events and give a particular value to each of those.  In addition, each adjacent point on the scale has to do the same in a way that suggests there are only small incremental differences between each of these points judgements. And being a linear scale, this suggests or even requires, that there is only one path from the bottom to the top of the scale. Say goodbye to equifinality!

So, what alternatives are there, for describing and agreeing on what success looks like when trying to track changes in complex phenomena?  One solution which I have argued for, intermittently, over a period of years, is the wider use of weighted checklists.  These are described at length here.  

Their design addresses three problems mentioned above.  Firstly, description and valuation are separated out as two distinct judgements.  Secondly, the events that are described and valued can be quite numerous and yet each can be separately judged on these two criteria.  There is then a mechanism for combining these judgements in an aggregate scale. And there is more than one route from the bottom to the top of this aggregate scale.

“The proof is in the pudding”.  One particular weighted checklist, known as the Basic Necessities Survey, was designed to measure and track changes in household-level poverty.  Changes in poverty levels must surely qualify as ‘complex phenomena ‘.  Since its development in the 1990s, the Basic Necessities Survey has been widely used in Africa and Asia by international environment/conservation organisations.  There is now a bibliography available online describing some of its users and uses. https://www.zotero.org/groups/2440491/basic_necessities_survey/library

 [RD1]Impressive rubric