Quantifying competence using workplace-based assessments in CBME

By: Andrew E. Krumm (@aekrumm)

Effective measurement of clinical performance is the result of a long chain of events. This chain begins by observing a learner and translating what is observed into a number. That number is then used to support a decision as to whether a learner is performing well or poorly, is in need of remediation or acceleration, is competent or not. (It could be argued that this chain of events too often runs in the opposite direction: a rater reaches a decision and subsequently seeks out evidence to support it.) While conducting clinical observations can be obtrusive and time-consuming, workplace-based assessment (WBA) systems that leverage smartphones have emerged as a way of efficiently collecting observational evidence from a clinical setting.1 As critical as these advances have been, synthesizing one or more ratings into a number that communicates what a medical professional can do remains underdeveloped.2

The term “measurement” denotes standards of precision and reproducibility.3 A critical challenge in measuring performance in clinical settings is that irrelevant factors can seep into a score intended to quantify only a trainee’s capability. The potential harms caused by these irrelevant factors increase as stakes are added to the use of the resulting quantification (e.g., will learners be allowed to advance in training based on their score?).4 Unfortunately, as WBAs are used more broadly, many tried-and-true ways of strengthening the chain of events that leads to a precise and reproducible score can break down. For example, it is hard to lengthen the instruments used to collect observational evidence, or to have multiple raters observe the same learner at the same time, while still maintaining the efficiency of a well-designed WBA. In addition, the use of short, practical clinical observations spread out over time implies a different analytical framework than traditional psychometrics typically allows. Take, for example, two WBA ratings collected in the same week on the same procedure for the same trainee: from a traditional psychometric perspective, if we combined these two observations the way we would combine two test items into an overall score, we would violate a foundational assumption that no new learning takes place between assessment items.5
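
To make that last point concrete, here is a minimal, entirely hypothetical sketch (all numbers and the five-week scenario are invented for illustration): a trainee who is improving week over week looks merely average under a pooled mean of their ratings, while a simple trend estimate recovers their current level.

```python
# Hypothetical illustration: a trainee's skill improves across five weekly
# WBA observations. Treating the ratings like interchangeable test items
# (a pooled mean) ignores that learning happened between observations.
from statistics import mean

weeks = [0, 1, 2, 3, 4]              # observation times (weeks)
ratings = [2.0, 2.5, 3.0, 3.5, 4.0]  # e.g., a 1-5 entrustment-style scale

pooled = mean(ratings)  # classical "sum the items" view: 3.0

# A longitudinal view instead estimates a trajectory and reads off the
# current level (ordinary least squares by hand, single predictor).
xbar, ybar = mean(weeks), mean(ratings)
slope = sum((x - xbar) * (y - ybar) for x, y in zip(weeks, ratings)) / \
        sum((x - xbar) ** 2 for x in weeks)
intercept = ybar - slope * xbar
current = intercept + slope * weeks[-1]  # estimated level at week 4: 4.0

print(pooled, round(current, 2))
```

The pooled mean (3.0) describes where this trainee was mid-rotation; the trend-based estimate (4.0) describes where they are now. Which quantity a program should report depends on the decision the score is meant to support.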

While WBAs can be a critical link in an evolving chain of events for the effective measurement of clinical performance, existing links will need to be modified and new ones developed. In particular, new links will need to include ways of combining multiple ratings into a score that communicates what a learner can do in a healthcare setting. Combining multiple ratings can be either simple or complex. A simple approach requires limited statistical machinery, but it depends on clear and intentional constraints being placed (and followed) on when a WBA rating is collected, on what procedure, and by whom. (Constraints improve the precision and reproducibility of eventual scores because they provide a framework for interpreting both the presence and absence of a learner’s actions in a clinical setting.) A more complex approach, on the other hand, statistically addresses sources of irrelevant variation in a trainee’s overall score, but such models can be challenging to develop and may not entirely obviate the need to constrain the tasks from which observations are collected.
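
As an illustration of this contrast, the following sketch uses invented data and a toy rater-leniency adjustment (a stand-in for the kind of rater-effects model a more complex approach would formalize, not a validated method). The simple approach averages a trainee's ratings directly; the complex one first removes each rater's estimated leniency.

```python
# Hypothetical WBA records (all trainee/rater names and scores invented).
from statistics import mean

records = [
    {"trainee": "A", "rater": "R1", "score": 3},
    {"trainee": "A", "rater": "R2", "score": 5},
    {"trainee": "A", "rater": "R2", "score": 5},
    {"trainee": "B", "rater": "R1", "score": 2},
    {"trainee": "B", "rater": "R2", "score": 4},
]

# Simple approach: average a trainee's ratings directly. Any interpretive
# work is done by constraints on which ratings enter the pool.
def simple_score(trainee):
    return mean(r["score"] for r in records if r["trainee"] == trainee)

# Complex approach (toy stand-in for a rater-effects model): estimate each
# rater's leniency as their deviation from the grand mean, then subtract
# it from each rating before averaging.
grand = mean(r["score"] for r in records)
leniency = {
    rater: mean(r["score"] for r in records if r["rater"] == rater) - grand
    for rater in {r["rater"] for r in records}
}

def adjusted_score(trainee):
    return mean(r["score"] - leniency[r["rater"]]
                for r in records if r["trainee"] == trainee)
```

In this toy dataset, trainee A was rated twice by the more lenient rater, so the adjusted score comes out lower than the simple average. A real implementation would use a cross-classified mixed-effects model rather than mean-centering, but the logic is the same: partial out variation attributable to the rater rather than the trainee.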

Quantifying competence using WBAs is a complex process. If and when programs seek to combine multiple WBA ratings into a score that informs a consequential decision, particular care will need to be exercised and more and more of the conversation will need to move toward addressing the weak, strong, and even missing links in the long chain of events that comprise an effective measurement process.

About the author: Andrew E. Krumm PhD, is Assistant Professor of Learning Health Sciences at the University of Michigan Medical School and Assistant Professor of Information at the University of Michigan School of Information.


1. George BC, Bohnen JD, Schuller MC, Fryer JP. Using Smartphones for Trainee Performance Assessment: A SIMPL Case Study. Surgery. 2020; 167(6): 903–6.

2. Williams R, et al. A Proposed Blueprint for Operative Training, Assessment, and Certification. Annals of Surgery. 2021; 273(4): 701–8.

3. Briggs DC. Historical and Conceptual Foundations of Measurement in the Human Sciences: Credos and Controversies. Routledge; 2021.

4. Messick S. The Interplay of Evidence and Consequences in the Validation of Performance Assessments. Educational Researcher. 1994; 23(2): 13–23.

5. De Boeck P, Wilson M, eds. Explanatory Item Response Models: A Generalized Linear and Nonlinear Approach. Springer; 2004.

The views and opinions expressed in this post are those of the author(s) and do not necessarily reflect the official policy or position of The Royal College of Physicians and Surgeons of Canada. For more details on our site disclaimers, please see our ‘About’ page.