(This is the seventh post of our #AppliedMedEdMethods101 series. View the others here: Beyond the RCT; Pre-Post Simulation; Discourse Analysis; Retrospective Cohort Studies; Critical Validity and Phenomenography)
By Walter Tavares (@WalterTava)
Let us assume you are involved in or responsible for the assessment of clinical competence in simulation or workplace settings and want to know (a) if the decisions made are reliable and/or (b) how to improve the process. Another possibility is (c): you read an assessment paper and noticed that the authors used generalizability theory to draw conclusions regarding their assessment process. You ask: what is generalizability theory and what can it do for me?
Performance-based assessment of clinical competence can be a puzzling science. At its core is the intention to draw inferences, that is, conclusions regarding an individual’s abilities related to a particular construct (e.g., communication, crisis resource management or emergency medicine). Some assumptions in creating assessment plans may be that (a) there is a true level of ability within the individual and that these abilities are stable (i.e., will exist across clinical situations and time) and (b) we have the tools or processes to meaningfully differentiate levels of ability across a continuum. In conducting assessments, we recognize that we can never do this perfectly and that any assessment process yields only an estimate of an individual’s skill level, because there is always some degree of “error”. By error, I mean anything that results in differences between the scores we generate (i.e., observed scores) and a “true” score, that is, the long-run hypothetical mean we would get for an individual if we could observe them extensively across multiple contexts, raters, cases and time. That error can come in many forms, and the degree to which our assessment process detects ranges in performance (assuming they exist) is also of interest and concern. If we know this information, that is, the sources of error and the ability to differentiate between levels of performance and candidates, then we have, at least in part, what we need to estimate reliability and make the necessary adjustments to our assessment process. This is what generalizability theory (GT) can do for you.
GT can be a complicated methodology, but it is helpful to have some understanding of the underlying concepts and why you would consider using it. GT is an extension of classical test theory (CTT), which holds that any observed score (O) is a function of the true score (T) plus error (E) (i.e., O = T + E). In other words, CTT is concerned with how well observed scores reflect true scores. GT, by contrast, assumes that any assessment process includes only a sample of all possible observations, and emphasizes how well observed scores allow for generalizations about behaviour from the circumstances used in an assessment process to a “universe” of possible observations and circumstances. [1, 2] The circumstances of any assessment process can become sources of variance (or error), and there can be many. Where CTT lumps all error into a single term, GT allows us to identify the degree to which each “circumstance” (e.g., the stations used, the rating tool, the raters) contributes error, along with error we cannot attribute precisely, referred to as random error. GT helps you understand specifically which part of your assessment process is problematic so that you can target improvements more efficiently and effectively.
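To make the variance-decomposition idea concrete, here is a minimal sketch (using only NumPy) that simulates a fully crossed persons-by-raters dataset with known, invented variance components and then recovers them from the two-way ANOVA mean squares. All numbers are hypothetical and for illustration only; a real G-study would use dedicated software such as the packages referenced by Bloch and Norman.

```python
import numpy as np

rng = np.random.default_rng(0)
n_p, n_r = 50, 4                    # persons x raters, fully crossed design

# Simulated data with known (hypothetical) variance components:
# person ~ 1.0, rater ~ 0.25, residual ~ 0.49 -- illustration only
person = rng.normal(0, 1.0, n_p)
rater = rng.normal(0, 0.5, n_r)
scores = person[:, None] + rater[None, :] + rng.normal(0, 0.7, (n_p, n_r))

# Two-way ANOVA mean squares (one observation per person-rater cell)
grand = scores.mean()
ms_p = n_r * ((scores.mean(axis=1) - grand) ** 2).sum() / (n_p - 1)
ms_r = n_p * ((scores.mean(axis=0) - grand) ** 2).sum() / (n_r - 1)
resid = (scores - scores.mean(axis=1, keepdims=True)
         - scores.mean(axis=0, keepdims=True) + grand)
ms_e = (resid ** 2).sum() / ((n_p - 1) * (n_r - 1))

# Solve the expected-mean-squares equations for the variance components
var_p = (ms_p - ms_e) / n_r   # subject (person) variance -- the "signal"
var_r = (ms_r - ms_e) / n_p   # rater variance -- one identifiable error source
var_e = ms_e                  # residual (undifferentiated) error

print(f"person: {var_p:.2f}, rater: {var_r:.2f}, residual: {var_e:.2f}")
```

The point of the sketch is the last three lines: instead of one lumped error term, we get separate estimates for subject variance, rater variance and residual error, each of which can then be inspected and targeted.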
If we knew what sources of error exist and to what extent, we could think of ways to mitigate error in our assessment plans. For example, perhaps you are limited in the resources you have for your OSCE and you are deciding between using 5 stations with 2 raters per station or 10 stations with 1 rater per station. A G-study could help you identify which is the greater source of error, raters or stations. We know from previous G-studies that “context specificity” (i.e., the finding that performance in one context is a poor predictor of performance in the next), which is a function of the sampling strategy (including the number of stations), tends to be a greater source of error than raters. [3] Therefore, when left with the choice of more stations (or samples) or more raters, the answer is almost always more stations (adding stations divides that error term by the number of stations included, reducing its impact).
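This trade-off can be worked through with the standard relative G-coefficient for a fully crossed person-by-station-by-rater design. The variance components below are hypothetical, chosen only to reflect the typical pattern in which person-by-station variance (context specificity) dwarfs the rater terms:

```python
# Hypothetical variance components (illustrative values, not from a real study)
var_p   = 0.30   # person (subject) variance -- the "signal"
var_ps  = 0.60   # person x station: context specificity, typically large
var_pr  = 0.05   # person x rater: rater disagreement, typically smaller
var_err = 0.40   # residual error

def g_coefficient(n_stations, n_raters):
    """Relative G coefficient for a crossed person x station x rater design."""
    error = (var_ps / n_stations
             + var_pr / n_raters
             + var_err / (n_stations * n_raters))
    return var_p / (var_p + error)

print(g_coefficient(5, 2))    # 5 stations, 2 raters each  -> ~0.62
print(g_coefficient(10, 1))   # 10 stations, 1 rater each -> ~0.67
```

With these made-up components, 10 stations with 1 rater each beats 5 stations with 2 raters each, even though both designs consume the same number of rater-station slots, because spreading observations across more stations divides the dominant person-by-station error term by a larger number.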
You can test your ability to mitigate error in assessment plans with an extension of a G-study, referred to as a decision study (D-study). D-studies allow you to calculate and predict how reliable an assessment might be if you manipulated, for example, the number of stations or the number of raters per station. In other words, we can conduct an OSCE, calculate the sources of error (called variance components), establish their error contributions, and then conduct a D-study to model several possibilities (e.g., adding 2, 3 or 4 stations) for how you might change the next OSCE. This, again, is what GT can do for you. GT allows you to explore sources of variance and to determine whether your stations, raters, tools or several other “circumstances” are contributing more or less error to the process. GT can also be used to determine “subject variance”, or how well your stations, raters and tools are resulting in optimal differentiation (i.e., ranges in performance levels, assuming they exist). With this additional information, you can adjust the difficulty of your stations, for example, or improve on the performance of your raters and tools. GT can do all this for you and much, much more.
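As a sketch of what a D-study does, the snippet below takes hypothetical variance components (as would be estimated from a G-study) and projects the G coefficient as stations are added, one rater per station. The components and the 0.70 target are invented for illustration:

```python
# Hypothetical variance components from a G-study (illustrative values only)
var_p, var_ps, var_pr, var_err = 0.30, 0.60, 0.05, 0.40

def projected_g(n_stations, n_raters=1):
    """D-study projection: relative G coefficient for a candidate design."""
    error = (var_ps / n_stations
             + var_pr / n_raters
             + var_err / (n_stations * n_raters))
    return var_p / (var_p + error)

# Project reliability for next year's OSCE as stations are added
for n in range(5, 16):
    flag = " <- first design meeting a 0.70 target" if projected_g(n) >= 0.70 else ""
    print(f"{n:2d} stations: G = {projected_g(n):.3f}{flag}")
```

No new data collection is needed for this step: the D-study simply re-solves the error formula under candidate designs, which is what makes it so useful for planning the next administration.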
For a more comprehensive survey of GT, I refer you to the resources below. Bloch and Norman (2012) provide a practical introduction to GT with helpful examples and reference to statistical packages that can be used to conduct the analyses. Crossley et al. (2002) provide a simpler yet helpful overview with examples. Eva et al. (2004) and Cook et al. (2010) provide additional examples of both G- and D-studies for a multiple mini-interview (MMI) and the mini clinical evaluation exercise (Mini-CEX). [5, 6]
Take Home Messages:
- Generalizability theory extends classical test theory by specifying and quantifying numerous sources of error in an assessment process simultaneously.
- Sources of error and their contributions as well as subject variance (i.e., how well the assessment process differentiates) can provide valuable information about how an assessment process is functioning.
- The information derived from a G-study can then be used in a D-study to provide information about how changing aspects of an assessment process will impact overall generalizability.
1. Bloch, R., & Norman, G. (2012). Generalizability theory for the perplexed: a practical introduction and guide: AMEE Guide No. 68. Medical teacher, 34(11), 960-992.
Provides a helpful overview of the foundations supporting generalizability theory, including worked examples, statistical operations and steps, and different types of research designs.
2. Crossley, J., Davies, H., Humphris, G., & Jolly, B. (2002). Generalizability: a key to unlock professional assessment. Medical education, 36(10), 972-978.
Provides a more in-depth discussion of how GT extends CTT and a worked example to demonstrate the concepts.
3. Eva, K. W., Rosenfeld, J., Reiter, H. I., & Norman, G. R. (2004). An admissions OSCE: the multiple mini‐interview. Medical education, 38(3), 314-326.
Provides an example of the application of G and D studies.
4. Cook, D. A., Beckman, T. J., Mandrekar, J. N., & Pankratz, V. S. (2010). Internal structure of mini-CEX scores for internal medicine residents: factor analysis and generalizability. Advances in Health Sciences Education, 15(5), 633-645.
Provides another example of the application of G and D studies.
1. Brennan, R. L. (1992). Generalizability theory. Educational Measurement: Issues and Practice, 11(4), 27-34.
2. Bloch, R., & Norman, G. (2012). Generalizability theory for the perplexed: A practical introduction and guide: AMEE Guide No. 68. Medical teacher, 34(11), 960-992.
3. van der Vleuten, C. P. M. (2014). When I say… context specificity. Medical education, 48(3), 234-235.
4. Crossley, J., Davies, H., Humphris, G., & Jolly, B. (2002). Generalizability: a key to unlock professional assessment. Medical education, 36(10), 972-978.
5. Eva, K. W., Rosenfeld, J., Reiter, H. I., & Norman, G. R. (2004). An admissions OSCE: the multiple mini‐interview. Medical education, 38(3), 314-326.
6. Cook, D. A., Beckman, T. J., Mandrekar, J. N., & Pankratz, V. S. (2010). Internal structure of mini-CEX scores for internal medicine residents: factor analysis and generalizability. Advances in Health Sciences Education, 15(5), 633-645.