KeyLIME 86: A Checklist Manifesto

The Key Literature in Medical Education podcast captures a two-decade-old debate (see here for an example: Van der Vleuten, C.P.M., Norman, G.R. & De Graaff, E.D. (1991) Pitfalls in the pursuit of objectivity I: Issues of reliability. Medical Education 25: 110–118.) Is a checklist more accurate than a global rating in assessing performance? In this debate, you are either a “lumper” (there is complexity in assessment that cannot be reduced without losing meaning) or a “splitter” (increasing granularity helps to unpack all of the supporting elements of a more complex ability).

In truth, this is not a binary argument, although the health professions education community likes to set up these straw dog arguments. The article this week tackles the evidence and gives us a nuanced view of the current understanding of this important (controversial?) discussion. For details, check out the abstract below. For even more insight, check out the podcast here.

If you have a recommendation for the KeyLIME podcast, leave us a comment below or send us an email.



KeyLIME Session 86 – Article under review:


Listen to the podcast

View/download the abstract here.

Ilgen JS, Ma IWY, Hatala R, Cook DA. A systematic review of validity evidence for checklists versus global rating scales in simulation-based assessment. Medical Education, Feb 2015; 49(2): 161–73.

Reviewer: Jonathan Sherbino

There is an ongoing debate in the HPE literature that is not explicit but, in my opinion, has widely infiltrated discussions involving assessment of learners. Put one way, should we standardize our approach to assessment of performance, which improves reliability but leads to issues of excessive reductionism and attention to superficial features? Or should we trust the gestalt judgment of “experts,” which attends to the origins of medical education but leads to issues of rater bias? (For more details on the challenges of rater cognition, see KeyLIME episode 78.)

Checklists (CL) provide a scaffold for uninitiated assessors.  Global rating scales (GRS) allow experts to use a more natural reasoning process (e.g., the complexity of diagnosis is neither dichotomous, nor algorithmic).  Research evidence supports both sides of the debate.  Depending upon your bias, you can argue that a CL forces a systematic, and not idiosyncratic, assessment framework. In contrast, GRS are more discriminating at higher levels of expertise, presumably because they allow the input of more complex data points that cannot be easily captured in yes/no checkboxes.

This paper attempts to weigh the evidence and see if there is a unifying story.

“1. What are the inter-rater, inter-item and inter-station reliabilities of global ratings in comparison with checklist scores?

  2. How well do global ratings and checklist scores correlate?
  3. What validity evidence has been reported for global ratings and checklist scores?

The prevailing view has held that the GRS offers greater reliability than the checklist. In the present review, we sought evidence to confirm or refute this opinion.”

Type of paper
Systematic Review

Key Points on the Methods
This study employed exemplary methods that adhere to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. (For a further description of the study protocol, see KeyLIME Ep#44.)

The “general” inclusion criterion was simulation-based (but not standardized patient) health professions research comparing checklists with GRS. The studies included in this paper are part of a larger research program on technology-enhanced simulation. The search strategy was updated to February 2013.

** Potential CoI: JS is a co-author on two publications that used the data set discussed in this paper.

Key Outcomes

  • n = 45 studies; 1819 learners (median 27 learners/study)
    • mean MERSQI 13.3 (i.e. good)
    • technical (n = 44) and non-technical (n = 3) (e.g. communication, leadership) skills assessed
    • clinical areas included anesthesiology, endoscopy, resuscitation, and surgery
    • all studies included physicians/physicians-in-training
    • raters mainly (75%) physicians
  • correlation between CL and GRS moderate
    • r = 0.76 (95% confidence interval [CI] 0.69–0.81)
  • inter-rater reliability was similar between scales
    • GRS 0.78 (95% CI 0.71–0.83), n = 23; checklist 0.81 (95% CI 0.75–0.85), n = 21
    • **Rater training was not mentioned in nearly all of the included studies
  • GRS inter-item reliabilities and inter-station reliabilities were higher than those for CL
    • GRS: inter-item 0.92 (95% CI 0.84–0.95, n = 6); inter-station 0.80 (95% CI 0.73–0.85, n = 10)
    • CL: inter-item 0.66 (95% CI 0–0.84, n = 4); inter-station 0.69 (95% CI 0.56–0.77, n = 10)
    • **task-specific CL typically varied across stations
  • Sensitivity analyses did not show that any particular scale or study design influenced the overall findings

Key Conclusions
The authors conclude…“[CL] inter-rater reliability and trainee discrimination were more favourable than suggested in earlier work, but each task requires a separate [CL]. Compared with the [CL], the GRS has higher average inter-item and inter-station reliability, can be used across multiple tasks, and may better capture nuanced elements of expertise.”

Spare Keys – other take home points for clinician educators
This is a great example of a planned research program leading to multiple wins. The original database developed for this systematic review has led to > 30 publications.

 Access KeyLIME podcast archives here