Assessment in medical education: Finding the signal in the noise

This week at the ICENet blog, we’re trying something different. We’re connecting to the Academic Life in Emergency Medicine (ALiEM) blog. With more than 50,000 visits per month, ALiEM is a major (digital) connector and influencer in academic emergency medicine. This week Brent Thoma (@Brent_Thoma) shares a post with the ICENet community that is also hosted on ALiEM. It was sparked by a recent KeyLIME podcast. (The article discussed below is slated for an upcoming KeyLIME release. Coincidence? Can you just feel the matrix?)

– Jonathan


This past December it was reported in the Harvard Crimson that the median grade at the prestigious university was an A-.[1] A flood of articles followed bemoaning grade inflation at educational institutions, with a former Harvard President noting cheekily that “the most unique honor you could graduate with was none.”[2] This might be alright if well-developed criterion-based instruments were used to grade the students, but given the variability in courses taught at the university and the difficulty of developing such tools, that is unlikely to be the case. And if the median is an A-, one wonders how sub-par performance must be to fail.

Like Harvard University students, medical students and residents are an exceptional bunch who have succeeded in highly competitive application processes and are expected to perform well. However, the problem of grade inflation in medicine has also been acknowledged. For example, a recent survey of US internal medicine clerkship directors found that 78% felt that it was a serious problem, while 38% had passed students on their rotations whom they thought should have failed.[3] This is problematic as accurate and reliable assessment will be necessary for competency-based medical education to have a future.[4] Substantial work has been done on developing and validating assessment instruments. Unfortunately, faculty frequently fail to note deficiencies in trainee performance and their assessments have poor inter-rater reliability.[5] Faculty development efforts designed to improve these skills, despite their substantial costs, do not seem to be very effective.[6] That’s depressing. For a broader exploration of the reasons for poor inter-rater reliability, check out the latest KeyLIME podcast (episode 59) and the related article.[7,8]

Adjusting assessment instruments

These problems have led to the development of various perspectives in the burgeoning field of rater cognition. Some educators focus heavily on qualitative elements while others attempt to improve the consistency of quantitative instruments using criterion-based scales and rater training. Dr. Keith Baker, the Program Director of the Massachusetts General Hospital Anesthesia residency program, presented another approach to these problems at the recent Harvard Macy Institute course: A Systems Approach to Assessment in Health Professions Education. Rather than trying to teach his faculty to provide accurate, consistent assessment, he measured how each of them tended to score and normalized their results accordingly. His project incorporated more than 14,000 assessments over a 2-year period and was published in 2011.

Highlighted article

Baker K. Determining resident clinical performance: getting beyond the noise. Anesthesiology. 2011 Oct;115(4):862-78. PMID: 21795965.

The assessment instrument

An assessment instrument with multiple components was developed. It was sent to each attending anesthesiologist for every resident with whom they worked each week. The program aimed to complete 60% of the assessments and, on average, each resident received >70 assessments from >40 faculty during the study period. The instrument included space for free-text comments along with four quantitative components, including:

  1. Relative performance designations for each ACGME milestone ranging from 1 (distinctly below peer level) to 5 (distinctly above peer level).
  2. Anchored competency designations for each ACGME milestone ranging from 1 (needed significant attending assistance, input or correction) to 7 (expert and able to serve as a resource to fully trained anesthesiologists).
  3. A list of eight increasingly difficult entrustable professional activities (ranging from a skin biopsy in a healthy patient to the repair of a ruptured AAA in a patient with CHF and atrial fibrillation). For each, the attending was asked if they were confident that the learner could perform anesthesia independently and unsupervised.

The quantitative scores for each resident assessment were normalized using a Z-score based on how the evaluator had scored residents in the past. The number of entrustable professional activities that the attending was confident the resident could perform was normalized in the same way.
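To make the idea of rater normalization concrete, here is a minimal sketch of a per-rater Z-score in Python. It is not the code or the exact procedure from the article: the function name, the tuple layout, and the example scores are all hypothetical, and it simply uses every score a rater has given (with the population standard deviation) as that rater's baseline.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_by_rater(assessments):
    """Convert raw scores to Z-scores relative to each rater's own history.

    assessments: list of (rater, resident, raw_score) tuples (hypothetical layout).
    Returns a list of (rater, resident, z_score) tuples in the same order.
    """
    # Collect every score each rater has given.
    scores_by_rater = defaultdict(list)
    for rater, _resident, score in assessments:
        scores_by_rater[rater].append(score)

    # Each rater's baseline: their own mean and (population) standard deviation.
    baseline = {
        rater: (mean(scores), pstdev(scores))
        for rater, scores in scores_by_rater.items()
    }

    normalized = []
    for rater, resident, score in assessments:
        rater_mean, rater_sd = baseline[rater]
        # A rater who always gives the same score carries no information,
        # so fall back to 0.0 rather than dividing by zero.
        z = (score - rater_mean) / rater_sd if rater_sd > 0 else 0.0
        normalized.append((rater, resident, z))
    return normalized


# Made-up example: one lenient rater and one strict rater.
example = [
    ("lenient", "A", 4), ("lenient", "B", 5), ("lenient", "C", 4), ("lenient", "D", 5),
    ("strict", "A", 3), ("strict", "B", 2), ("strict", "C", 3), ("strict", "D", 2),
]
for rater, resident, z in normalize_by_rater(example):
    print(rater, resident, round(z, 2))
```

In this made-up data, resident A's raw 4 from the lenient rater becomes a Z-score of -1.0, while the raw 3 from the strict rater becomes +1.0. The lower raw number is the stronger endorsement once each rater's habits are taken into account.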

Study findings

There is so much data presented in the article that it is impossible to cover it all in detail here. However, some important findings included:

  • Positive bias: Despite well-anchored normative scales (ranging from 1-5 with 3 considered ‘at peer level’), evaluators had a positive bias that increased over the course of the program with average scores of 3.36, 3.51, and 3.68 in years one, two, and three, respectively. The amount of bias varied by the faculty member (e.g. a score of 4 from one was easier to get than a score of 4 from another).
  • Score consistency: When a faculty member scored the same resident twice, the previous assessment explained only 23.1% of the variance in the subsequent score. This reinforces the conclusion that single assessments are inconsistent. However, the average scores of the assessments for each resident were remarkably consistent over time.
  • Performing procedures: The evaluators’ confidence that residents could perform procedures increased throughout the residency, as expected, and did so substantially more than the relative scores, implying that this component was able to measure performance.
  • Predictive power: Several outcome measures were used to demonstrate the predictive ability of the system. Tests of clinical knowledge (in-training exams) correlated mildly to moderately (r=0.3-0.38) with scores on the instrument. Low scores on the instrument predicted referral to the Clinical Competency Committee for remediation (OR = 27).
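The score consistency finding above (noisy single assessments, stable averages) is what basic statistics would predict. As a rough sketch, and assuming each assessment's error is independent with the same spread (an assumption of mine, not something the article demonstrates), the noise in an average shrinks with the square root of the number of assessments. The function name and the spread of 1.0 are hypothetical.

```python
import math


def standard_error_of_mean(single_score_sd, n_assessments):
    """Spread of an average of n independent scores that each have spread single_score_sd."""
    return single_score_sd / math.sqrt(n_assessments)


# With a made-up single-assessment spread of 1.0 scale point:
for n in (1, 10, 70):
    print(n, round(standard_error_of_mean(1.0, n), 2))
```

Under these assumptions, averaging 70 assessments (roughly what each resident received) would cut the noise to about 0.12 of a single assessment's spread.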



Assessment is difficult. The highlights of the system outlined by Dr. Baker are the use of multiple measurement instruments (relative, anchored, and performance-based scales), the inclusion of qualitative assessment with each component, frequent low-stakes assessment based on direct observation by multiple raters, and the removal of inter-rater variability and positive bias using simple mathematical principles. This article is particularly relevant to emergency medicine (EM) educators because our teaching and learning environment is similar to anesthesia’s in several ways. In both fields, attending physicians work directly with specific residents for predefined periods. While I am aware that anchored competency assessment tools have been developed by EM residency programs, I have not read about any that incorporate normalization. The normalization required by this method can be criticized because it assumes that the variability between assessments is a function of consistent rater characteristics. The study did not demonstrate the extent to which this is the case, although a previous study found that 67% of the variance in their online encounter cards was due to the rater.[9]

Check out Eric Holmboe’s (@boedudley) peer review of this post, also on the ALiEM site. Later this week, a video interview between Brent Thoma and Keith Baker on this topic will also be posted.



  1. Clarida MQ & Fandos NP. Substantiating Fears of Grade Inflation, Dean Says Median Grade at Harvard College Is A-, Most Common Grade Is A. Harvard Crimson, December 3, 2013.
  2. Ferdman RA. The Most Commonly Awarded Grade at Harvard Is an A. The Atlantic, December 4, 2013.
  3. Fazio SB, et al. Grade Inflation in the Internal Medicine Clerkship: A National Survey. Teaching and Learning in Medicine. 2013; 25(1): 71-76. PMID: 23330898
  4. Holmboe ES, et al. The role of assessment in competency-based medical education. Medical Teacher. 2010; 32(8): 676-82. PMID: 20662580
  5. Kalet A, et al. How well do faculty evaluate the interviewing skills of medical students? Journal of General Internal Medicine. 1992; 7(5): 499-505. PMID: 1403205
  6. Cook DA, et al. Effect of rater training on reliability and accuracy of mini-CEX scores: a randomized, controlled trial. Journal of General Internal Medicine. 2009; 24(1): 74-79. PMID: 19002533
  7. Sherbino J, Franke J & Snell L. Mechanisms That Contribute to Assessor Differences in Performance Assessments. KeyLIME Podcast. 2014; 59.
  8. Yeates P, O’Neill P, Mann K & Eva K. Seeing the same thing differently: Mechanisms that contribute to assessor differences in directly-observed performance assessments. Advances in Health Sciences Education. 2013; 18(3): 325-41. PMID: 22581567
  9. Sherbino J, et al. The reliability of encounter cards to assess the CanMEDS roles. Advances in Health Sciences Education. 2013; 18(5): 987-996. PMID: 23307097

Image courtesy of Omegatron, via Wikimedia Commons