Education Theory Made Practical – Volume 3, Part 2: Validity

As part of the ALiEM Faculty Incubator program, teams of 2-4 incubator participants authored a primer on a key education theory, linking the abstract to practical scenarios. For the third year, these posts are being serialized on our blog in collaboration with ALiEM. You can view the first e-book here – the second is nearing completion and will soon be released. You can view all the blog posts from series 1 and 2 here.

The ALiEM team loves hearing your feedback prior to publication. No comment is too big or too small, and all comments will be used to refine each primer prior to the eBook publication. (Note: the blog posts themselves will remain unchanged.)

This is the second post of Volume 3. You can find the first post here: Bolman and Deal’s Four-Frame Model.


Authors: Rebecca Shaw; Carly Silvester (@edforbeginners)

Editor:  Dimitrios Papanagnou

Main Authors or Originators: Samuel Messick; Michael Kane

Other important authors or works: David Cook

Part 1:  The Hook

Dr. Carmody was excited. As a junior faculty member, she was attending her first clinical competency committee meeting for her residency program. She had so many ideas that could potentially help improve the program. She knew everyone else on the committee was more experienced than she was, so she hoped her enthusiasm would compensate for her lack of formal training in medical education.

The meeting was not progressing as she had expected. The written evaluations from faculty from earlier in the academic year had failed half of the first-year residents. There was a lot of conversation about the validity and reliability of assessment methods. She wasn’t sure what they meant by program evaluation and scoring inferences. With all this new terminology, Dr. Carmody felt herself contributing less and less to the discussion.

At the end of the meeting, the program director assigned tasks for the next meeting. She heard her name. ‘Dr. Carmody, it would be great if you could present your insights into the validity of our written evaluations as a tool in our assessment program. Could you have that ready for the next meeting?’

Dr. Carmody was stressed. Written evaluations were a cornerstone of resident assessment. Was the whole program invalid? She was confused by the validity argument, and the task ahead of her seemed daunting. Was she reviewing evaluations or scores? Where was she possibly going to start?

Dr. Carmody was most concerned about the implications of the current scores, as several residents would be faced with remediation and extended training time.

Part 2:  The Meat


Assessment is an integral part of medical education, and the validation of an assessment is vital to its use. All assessments aim to facilitate defensible decisions about those being assessed.1 To make these decisions, evidence needs to be evaluated in order to understand the strengths and weaknesses of the assessment in question.

Validity and validation are two separate terms with distinct meanings. Validity refers to a conceptual framework for interpreting evidence, whereas validation is the process of collecting and interpreting evidence to support the decisions made on the basis of an assessment.1

The current Standards for Educational and Psychological Testing define validity as “the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests.”2 It is not the assessment or test itself that is validated, but rather the meaning of the test scores, as well as any implications resulting from them.3 The implications and resulting outcomes are the most important inferences in the validity argument.1

Validation evaluates the fundamental claims, assumptions, and inferences linking assessment scores with their intended interpretations and uses.1 Validity is not a property of the test itself, but rather refers to the use of a test for a particular purpose. Evaluating the suitability of a test for a particular purpose requires multiple sources of evidence.4 Sufficient evidence is also required to defend the use of the test for that purpose.3 The extent to which score meaning holds across population groups, settings, and contexts is an ongoing question, and is the main reason that validation is a continuous process rather than a static, one-time event. Rather than being dichotomous, validity is considered a matter of degree.4

Background about this theory

Validity theory has significantly evolved over time. Initially, validity was divided into three main types:

  1. Content validity, which relates to the creation of the assessment items.
  2. Criterion validity, which refers to how well scores correlate with a reference-standard measure of the same phenomenon.
  3. Construct validity, in which intangible attributes (i.e., constructs) are linked with observable attributes based on a conception or theory of the construct.1
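To make the idea of criterion validity concrete, here is a minimal sketch of how scores from a new tool might be correlated with a reference-standard measure. The scores and the variable names are hypothetical, invented purely for illustration, and Pearson's correlation is only one of several statistics that could be used for this purpose.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list has no variance")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: six residents scored on a new tool and on a
# reference-standard measure of the same phenomenon.
new_tool_scores = [62, 70, 75, 81, 88, 93]
reference_scores = [58, 72, 71, 85, 86, 95]

r = pearson_r(new_tool_scores, reference_scores)
print(f"criterion correlation r = {r:.2f}")
```

A high correlation here would only be as convincing as the reference standard itself, which is exactly the difficulty educators raised with this type of validity.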

Educators, however, recognized that content validity nearly always supports the test, and that identifying and validating a reference standard is very difficult.1 Over 20 years ago, Messick proposed an alternative unified framework in which all validity is considered construct validity and consists of evidence collected from six different aspects.3  These aspects can be briefly described as follows:

  • The content aspect includes evidence of content relevance and evidence that assessment content reflects the construct it is intended to measure.
  • The substantive aspect refers to theoretical and empirical analyses of the observed consistencies in test responses. It evaluates the extent to which responses from examinees or raters align with the intended construct.
  • The structural aspect evaluates how well the internal structure of the assessment, as reflected in the scores, is consistent with the construct domain selected. This can include measures of reliability across assessment items or stations.
  • The generalizability aspect evaluates the extent to which score properties and interpretations generalize to and across population groups, settings, and tasks.
  • The external aspect includes evidence of the statistical associations between assessment scores and another measure with a specified theoretical relationship. This may include criterion relevance, applied utility, and evidence from multi-trait, multi-method comparisons. The relationship may be positive for measures of the same construct or negligible for independent measures.
  • The consequential aspect appraises the actual and potential consequences of test use, including the beneficial or harmful impact of the assessment itself and the decisions that result. It is related to the issues of bias, fairness, and distributive justice.3
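As one illustration of the structural aspect, reliability across items or stations is often summarized with Cronbach's alpha. The sketch below is a minimal example under stated assumptions: the choice of alpha is ours rather than Messick's, and the station scores are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a score matrix.

    item_scores: list of rows, one row per examinee, each row holding that
    examinee's score on every item (or station).
    """
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    columns = list(zip(*item_scores))  # one tuple of scores per item
    item_variance_sum = sum(pvariance(col) for col in columns)
    total_variance = pvariance([sum(row) for row in item_scores])
    if total_variance == 0:
        raise ValueError("alpha is undefined when total scores do not vary")
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical data: five residents rated 1-5 at each of four stations.
station_scores = [
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [4, 5, 4, 5],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
]

alpha = cronbach_alpha(station_scores)
print(f"Cronbach's alpha = {alpha:.2f}")
```

A value like this contributes evidence for only one of the six aspects; it says nothing on its own about content, external relationships, or consequences.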

The current Standards for Educational and Psychological Testing place emphasis on five of the sources of evidence proposed by Messick. These are content, response process, internal structure, relations with other variables, and consequences evidence; the generalizability aspect, however, is not included in the Standards.1

In 2006, Kane proposed an alternative unifying approach, instead identifying four inferences in the validity argument. These are:

  1. Scoring, which refers to translating an observation into one or more scores.
  2. Generalization, which involves using the score as a reflection of performance in a test setting.
  3. Extrapolation, which is using the scores as a reflection of real-world performance.
  4. Implications, which involves applying the scores to inform a decision or action.1

Kane stipulated that the implications and associated consequences are the most important inferences in the validity argument. Evidence is needed to support each inference, and should focus on the most questionable assumptions in the chain of inference. The framework is versatile and can apply to all forms of assessment equally, including quantitative or qualitative assessments, individual tests, and programs of assessment. This framework can help educators identify which are the most important pieces of evidence when planning an evaluation and identifying evidence gaps.1
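One way to picture Kane's chain of inference when planning an evaluation is as a short ordered list in which each inference carries its key assumption and a rough judgment of how well it is supported. The sketch below is only an illustration: the 0 to 3 support rating, the class and function names, and the example entries are hypothetical and are not part of Kane's framework.

```python
from dataclasses import dataclass

# The four inferences, in the order Kane's chain runs.
KANE_INFERENCES = ("scoring", "generalization", "extrapolation", "implications")


@dataclass
class InferenceEvidence:
    inference: str
    key_assumption: str
    support: int  # hypothetical rating: 0 (no evidence) to 3 (strong evidence)


def weakest_links(chain):
    """Return the inferences with the least support, kept in chain order."""
    names = [entry.inference for entry in chain]
    if names != list(KANE_INFERENCES):
        raise ValueError("chain must cover the four inferences in order")
    lowest = min(entry.support for entry in chain)
    return [entry for entry in chain if entry.support == lowest]


# Hypothetical appraisal of a written faculty evaluation form.
chain = [
    InferenceEvidence("scoring", "Ratings reflect behaviors faculty actually observed", 1),
    InferenceEvidence("generalization", "Scores are consistent across raters and rotations", 2),
    InferenceEvidence("extrapolation", "Scores relate to real-world clinical performance", 0),
    InferenceEvidence("implications", "Decisions based on scores benefit residents and patients", 1),
]

for entry in weakest_links(chain):
    print(f"Collect evidence first for: {entry.inference} ({entry.key_assumption})")
```

Laying the chain out this way simply makes the evidence gaps visible; the judgment about which assumptions are most questionable still rests with the educator.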

Modern takes or advances

The unified view of construct validity is widely endorsed, but there is ongoing controversy about the definition of validity. Messick’s definition incorporates both accuracy of score inferences and consequential validity. It has been argued that this definition is too complicated. Cizek proposed that validation of score inferences and justification of test use should be considered as two parallel, but equally valued, endeavors. Evaluation of the technical test score inferences is considered separate from evaluation of the justification of test use, which is a matter of social values.5

Another alternative model, proposed by Lissitz and Samuelsen in 2007, separates test evaluation into internal and external aspects. In this model, validity mainly concerns itself with the internal aspects of a test, which can be studied in relative isolation from other tests. The external aspects are then only validated if necessary. Validity is approached independently of the testing context and purpose of the test. Instead, the main focus is on evaluating test content. This model therefore implies that validity is a property of the test itself, as validation of the construct is separated from validation of the test itself.4

Finally, a broader conceptual framework may be considered, which analyzes assessment procedures with varying degrees of detail along the continuum of micro-validation to macro-validation. Macro-validation is concerned with the overarching validity claim, utilizing a broad, holistic evaluation but providing the least diagnostic information. Micro-validation, on the other hand, is concerned with the underpinning validity claims, utilizing a narrow, targeted evaluation and providing the most diagnostic information. This framework allows a different way of thinking about validation evidence by distinguishing between types of inquiry, which can be outcome-related or process-related. Evidence from multiple sources, rather than just the five sources in the current Standards, is considered legitimate.6

Other examples of where this theory might apply in both the classroom & clinical setting

High-stakes examinations are often the final summative hurdle in professional education. The consequences of a pass or fail result are far-reaching for both the doctor and the public. Attention to consequential validity and to the process used to determine scores is critical in ensuring stakeholders view the results as credible.8

Programmatic and competency-based assessment have increasingly become the focus of medical education. One of the tenets of competency-based medical education is that training programs must use assessment tools that meet minimum standards of quality.7 As a result, validity and validation are increasingly being studied.  A thorough understanding of the principles of validity and its measurement would be of benefit in many domains of medical education.

One of the most validated and studied tools takes the form of direct observation of clinical skills (i.e., the American Board of Internal Medicine Mini-CEX assessment of clinical skills).10 Knowledge of its validity and robustness may lead to weighting its results more heavily in summative assessments.

Another area where validity theory has increasing application is in medical simulation. Simulation provides the opportunity for deliberate practice, and acts as a surrogate for meaningful educational outcomes.9 Inherent in its design is the ability to control a large number of variables, enabling targeting of a specific topic or skill. This control allows the application of a validity framework to ensure the designed instrument measures what it purports to measure.

Assessment of professionalism is often described as one of the more challenging aspects of performance reviews. Consequently, many assessment tools have been developed to assess professionalism. Having a good grasp of the principles of validation frameworks would enable better evaluation of assessment tools11 and more accurate reflections of performance.

Annotated Bibliography of Key Papers

Messick, S. (1995). Standards of Validity and the Validity of Standards in Performance Assessment. Educational Measurement: Issues and Practice, 14(4), pp. 5-8.3

Considered to be the seminal article unifying the concepts of validity into a defined framework.

Kane, M. (2013). The Argument-Based Approach to Validation. School Psychology Review, 42(4), pp. 448-457.12

Kane’s modern framework is explained in this paper. After a brief history of validity theory, he offers his version of a simplified, stepwise template for validation. The defined approach is to state what is being claimed and then evaluate the claims being made.

Cook, D., Brydges, R., Ginsburg, S. and Hatala, R. (2015). A contemporary approach to validity arguments: a practical guide to Kane’s framework. Medical Education, 49(6), pp.560-575.1

This key paper explains the utility of Kane’s framework in medical education, and highlights the use of validation as a continuous process. It provides clarity to the argument that the purpose of validation is to collect evidence that evaluates whether or not a decision and its attendant consequences are useful. The core elements of Kane’s framework (i.e., scoring, generalization, extrapolation, and implications) are explained in practical terms, along with examples of elements of evidence that may be used to test each. Finally, it provides an example of the application of Kane’s framework to familiar testing tools including assessment of procedural skills, and a qualitative (in-training narrative) assessment.

Downing, SM. (2003) Validity: on the meaningful interpretation of assessment data. Medical Education, 37, pp.830-837.8

Downing’s paper is a thorough explanation of construct validity specific to medical education assessment, and closely reviews the five sources of validity evidence outlined by the Standards.2 To enhance the understanding of each source, the paper constructs example assessments and provides validity evidence for each. Viewing validity as closely aligned with the scientific method of theory development, Downing provides a solid argument for validity as a marker for quality.

Limitations of this theory

Though validity is considered essential to quality educational assessment, there is no consistency in terminology or agreed-upon exemplar of good validation practices.13,14 It is theorized that this inconsistency stems from the diverse backgrounds of the practitioners contributing to health education, including psychology, sociology, and education.13

The complexity of Messick’s model has been widely criticized, along with a lack of practical guidance.13 The sense that the task is insurmountable, along with a long list of hypotheses and settings, has the consequence of allowing practitioners to consider even a little bit of evidence of whatever type as sufficient,15 potentially resulting in the development of sub-optimal research validation programs.

Lastly, the emphasis placed on morality and social consequence by Messick (less so by Kane) is considered by some to lead to confusion of fact and personal preference.16

Part 3:  The Denouement

‘Thank goodness for Kane’s framework,’ Dr. Carmody thought at the end of the clinical competency committee. She had managed to explain the concept of validity well, and identified some fixable problems in their current written evaluation assessment tool.

In reviewing resident evaluations and scores, she noted that faculty had provided limited detail about observed behaviors, reflecting poor question construction. She also noted that evaluations were weighted more heavily when completed by non-clinical faculty, and wondered if this parameter introduced a bias.

Dr. Carmody found that written evaluations were incongruent with patient evaluations, but were consistent with in-service training examination scores. She began to wonder what this meant for extrapolation.

Using this framework, she had put forward some clear suggestions to improve the validity of the current written evaluation model within the residency. The meeting progressed with further discussion of how the assessment process for the residents could be improved. She was happy that the resident evaluations were being reconsidered, and felt optimistic that suggested improvements would make the evaluation system a more useful tool in the future for her residents.



1. Cook, D., Brydges, R., Ginsburg, S. and Hatala, R. (2015). A contemporary approach to validity arguments: a practical guide to Kane’s framework. Medical Education, 49(6), pp.560-575.

2. American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC, 2014, p11.

3. Messick, S. (1995). Standards of Validity and the Validity of Standards in Performance Assessment. Educational Measurement: Issues and Practice, 14(4), pp. 5-8.

4. Sireci, S. (2007). On Validity Theory and Test Validation. Educational Researcher, 36(8), pp.477-481.

5. Cizek, G. (2012). Defining and distinguishing validity: Interpretations of score meaning and justifications of test use. Psychological Methods, 17(1), pp.31-43.

6. Newton, Paul, E. (2016). Macro-and Micro Validation: Beyond the “Five Sources” Framework for Classifying Validation Evidence and Analysis. Practical Assessment, Research & Evaluation, 21 (12). Available online:

7. Holmboe ES, Sherbino J, Long DM, Swing SR, Frank JR & for the International CBME Collaborators (2010). The role of assessment in competency-based medical education. Medical Teacher, 32(8), 676-682.

8. Downing, SM. (2003) Validity: on the meaningful interpretation of assessment data. Medical Education, 37, pp.830-837.

9. Cook, D., Hatala R. (2016). Validation of educational assessments: a primer for simulation and beyond. Advances in Simulation. 1(31).

10. Kogan, JR., Holmboe, ES., Hauer KE. (2009) Tools for direct observation and assessment of clinical skills of medical trainees. JAMA. 302(12) pp.1316-1326.

11. Clauser, B.E., Margolis, M.J., Holtman, M.C. (2012) Validity considerations in the assessment of professionalism. Advances in Health Sciences Education. 17(2). 165-181.

12. Kane, M. (2013). The Argument-Based Approach to Validation. School Psychology Review, 42(4), pp. 448-457.

13. St-Onge, C., Young, M., Eva, KW., Hodges B. (2017). Validity: one word with a plurality of meanings. Advances in Health Sciences Education. pp. 853-867.

14. Royal KD. (2017). Four tenets of modern validity theory for medical education assessment and evaluation. Advances in Medical Education and Practice. pp. 567-570.

15. Shepard LA. (1993) Evaluating test validity. Review of research in education. 19. pp. 405-450.

16. Lees-Haley PR. (1996) Alice in Validityland, or the dangerous consequences of consequential validity. American Psychologist, 51(9), pp 981-983.