Twelve tips for developing an OSCE that measures what you want

Vijay John Daniels (Department of Medicine, University of Alberta, Edmonton, Canada) and Debra Pugh (Department of Medicine, University of Ottawa, Ottawa, Canada)

Medical Teacher, 2017. https://doi.org/10.1080/0142159X.2017.1390214

Abstract

The Objective Structured Clinical Examination (OSCE) is used globally for both high- and low-stakes assessment. Despite its extensive use, very few published articles provide a set of best practices for developing an OSCE, and of those that do, none apply a modern understanding of validity. This article provides 12 tips for developing an OSCE, guided by Kane's validity framework, to ensure the OSCE is assessing what it purports to measure. The 12 tips are presented in the order in which they would be operationalized during OSCE development.

Introduction

The Objective Structured Clinical Examination (OSCE) was first introduced in 1975 (Harden et al. 1975) and, since that time, OSCEs have been used extensively (Patrício et al. 2013) for assessing clinical skills, both at local institutions and on national high-stakes examinations. There are multiple review articles that examine the use of OSCEs in health professions education (Walsh et al. 2009; Smith et al. 2012; Patrício et al. 2013; Hastie et al. 2014; Hodges et al. 2014; Setyonugroho et al. 2015; Cömert et al. 2016; Kreptul and Thomas 2016), including psychometric evidence for their use (Brannick et al. 2011); however, few provide a set of best practices for developing an OSCE. Of those that do (Casey et al. 2009; Sturpe 2010; Nulty et al. 2011), none apply a modern understanding of validity to ensure the OSCE is assessing what it purports to measure.

Our understanding of validity has evolved from several separate types of validity (e.g. criterion validity, content validity, etc.) to a unitary concept of construct validity in which various sources of evidence are used to support an argument for validity, first through Messick's framework of five sources of evidence (Messick 1989) and, more recently, through Kane's argument-based approach to validation (Kane 2013). As summarized by Cook et al. (2015), Kane's framework focuses on four key inferences in the chain from observation to decision. The first is the translation of an observed performance into a score (Scoring), ensuring that the score reflects the performance as faithfully as possible. The second is generalizing the score from the specific examination to the test performance environment, i.e. all possible equivalent tests (Generalization). The third is extrapolating performance in the test environment to real life (Extrapolation).
The fourth, and final, inference is the interpretation of this information to make a decision (Implications). The two main threats to validity are construct underrepresentation (too little or inappropriate sampling) and construct-irrelevant variance (anything unrelated to the construct of interest that results in score variability).

Over 25 years ago, Harden published a Twelve Tips paper for organizing an OSCE (Harden 1990). The purpose of that paper was to provide guidance to those developing and administering an OSCE, and it focused mainly on practical concerns. In contrast, the purpose of this paper is to provide 12 tips for developing an OSCE that measures what you want, as viewed through the lens of Kane's validity framework. The 12 tips are presented in the order in which they would be operationalized when developing an OSCE. Key points from each tip are summarized in Table 1, demonstrating how they relate to each of the categories of validity evidence.

Table 1. Categories of validity evidence and examples of how validity evidence can be demonstrated.

Scoring
- Description of how rating instruments were developed and selected
- Training of raters
- Training of standardized patients
- Performance of an item analysis/reliability within each station
- Use of test security measures
- Quality assurance

Generalization
- Use of a blueprint to ensure appropriate sampling of the domain
- Calculation of measures of reliability across stations (e.g. Cronbach alpha, G-study)

Extrapolation
- Comparison of experts to a novice group
- Demonstration of correlations with other measures of the same construct (e.g. communication skills measured by an OSCE and by an in-training evaluation)
- Use of content experts to develop authentic cases
- Relevance to real-life clinical tasks

Implications
- Standard setting process
- Analysis of pass–fail consequences (e.g. remediation opportunities)
- Exploration of how the assessment influences learning
- Exploration of how the assessment influences curriculum

Tip 1
Decide on the intended use of the results from your OSCE

Development of an OSCE should begin with the end: What decisions will I make with the results? Is the OSCE formative or summative? The answers to these questions provide evidence for the Implications stage of Kane's model. Although this stage is last, the answers to these questions frame the rest of OSCE development, which is why they must be asked first. For example, a lower stakes exam might be used to provide feedback to learners and could lead to individual coaching or remediation, whereas a higher stakes end-of-clerkship or national certification examination can result in repeating a clerkship or a year of residency. For these reasons, a lower stakes exam does not require the same level of score reliability as a high stakes examination (Downing 2004), and so a shorter examination is possible.

Another novel design is the sequential OSCE, in which all candidates are required to participate in a relatively short screening examination; only those who perform below a predefined standard are subsequently required to participate in a full-length OSCE to assess their skills. Two different studies used available data to model this approach and demonstrated that it would increase score reliability for borderline candidates and could save money if designed properly (Pell et al. 2013; Currie et al. 2016).
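The decision logic of a sequential design is simple to prototype. Below is a minimal sketch in Python; the screening cut score, safety margin, and candidate scores are invented for illustration and are not the models used in the cited studies.

```python
# Illustrative sketch of a sequential OSCE decision rule (hypothetical
# thresholds and data; not the models of Pell et al. or Currie et al.).

screen_scores = {          # mean percentage score on a short screening OSCE
    "resident_a": 78.2,
    "resident_b": 61.5,
    "resident_c": 54.9,
}

SCREEN_CUT = 60.0          # hypothetical screening standard
SAFETY_MARGIN = 5.0        # candidates near the cut also sit the full OSCE

def needs_full_osce(score: float) -> bool:
    """Pass the screen outright only if comfortably above the cut."""
    return score < SCREEN_CUT + SAFETY_MARGIN

for candidate, score in screen_scores.items():
    status = "full-length OSCE" if needs_full_osce(score) else "screen passed"
    print(f"{candidate}: {score:.1f} -> {status}")
```

In this toy rule, strong performers exit after the screen, while weak and borderline performers proceed to the full examination, which is where the reliability gains for borderline candidates arise.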
Tip 2
Decide what your OSCE should assess

OSCEs cannot be used to assess an entire content domain. Rather, they are used to assess a sample of the knowledge and skills that learners are expected to have mastered. To ensure that an OSCE reflects educational objectives, blueprinting is key. Blueprinting refers to the process by which content experts ensure that the constructs of interest are adequately represented (Coderre et al. 2009); a simple blueprint grid is sketched at the end of this tip. For example, if the goal of the OSCE is to assess clinical skills, such as history-taking and physical examination skills, then the blueprint should include a wide variety of stations that reflect this. This helps to ensure that one can generalize performance on these stations to the learner's ability to perform other histories and physical examinations in an OSCE (Generalization).

The length of each station is usually between five and ten minutes (Khan et al. 2013) but could be longer depending on the task being assessed. There must be enough stations to adequately sample the construct of interest, taking into account the intended use of the exam results (i.e. low versus high stakes). A lower stakes, locally developed exam may have only eight to ten stations, whereas a high stakes OSCE may require 14–18 stations to achieve acceptable reliability (Khan et al. 2013).

Although OSCEs have been used to assess all of the CanMEDS roles (Jefferies et al. 2007; Frank et al. 2015), there are challenges in assessing the intrinsic (i.e. non-Medical Expert) roles authentically (e.g. professionalism, collaboration, etc.), which has an impact on how well test performance extrapolates to real-world performance. The more focused the OSCE blueprint, the better it will provide validity evidence for generalization to other test settings, though at the expense of extrapolation to other skills. A programmatic approach to assessment (Schuwirth and van der Vleuten 2011) would view an OSCE as one part of an overall assessment framework. This leads to two questions that can guide OSCE development: (1) Where else are these skills assessed (or where could they be) in my overall program?; and (2) If I choose to assess this in an OSCE, can I do it authentically?
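At its simplest, a blueprint is a grid crossing clinical presentations with the skills to be assessed, and tallying stations per cell makes under-sampling easy to spot. The sketch below illustrates this; the presentations, skill domains, and station list are invented.

```python
# Minimal OSCE blueprint sketch: clinical presentations crossed with
# skill domains. All entries are invented for illustration.
from collections import Counter

stations = [
    ("chest pain",        "history taking"),
    ("dyspnea",           "physical examination"),
    ("abdominal pain",    "history taking"),
    ("knee injury",       "physical examination"),
    ("headache",          "management/counselling"),
    ("fever",             "history taking"),
    ("syncope",           "physical examination"),
    ("medication review", "management/counselling"),
]

skill_coverage = Counter(skill for _, skill in stations)
print("Stations per skill domain:")
for skill, n in skill_coverage.items():
    print(f"  {skill}: {n}")
# A reviewer can now check that no skill domain is under-sampled
# relative to the intended use of the scores.
```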
Tip 3
Develop the cases

Once you have decided what will be assessed by your OSCE, careful consideration should be given to case development. Cases should be developed to ensure that they authentically represent the clinical problem of interest (Extrapolation). Instructions to candidates should include information related to the presenting problem, a task, and a time-frame for completing the encounter (Pugh and Smee 2013). Cases should undergo review by both content experts and educational experts to ensure that they reflect best practices of OSCE case development (Pugh and Smee 2013). These experts should consider the following questions in their review: (1) Is the task clear? (Kane's Scoring stage); (2) Can the task be completed in the allotted time?; (3) Does the case authentically represent a clinical problem?; and (4) Is the level of difficulty appropriate for the learners? (the last three relate to Kane's Extrapolation stage). Pilot-testing of cases at this stage can help identify and mitigate potential issues.

Tip 4
Decide how your OSCE should assess candidates (the scoring rubric)

The development of scoring rubrics is an area where much of the research on OSCE validity has focused. A description of how rubrics were developed or selected can provide important validity evidence for Scoring in Kane's framework. Rubrics for each OSCE case can involve checklists and/or rating scales (a sketch of one such combination follows this tip).

Checklists are used to assess observable behaviors (e.g. asked about smoking history, identified the JVP, etc.). Checklist items are generally dichotomous (e.g. did or did not do), but they can also be polytomous (e.g. done well, attempted but not done well, not done) (Pugh, Halman, et al. 2016). Checklists should be carefully constructed to avoid rewarding learners who use a rote approach, unless that is the goal (as it might be for very junior medical students). For most learners, there should be an attempt to include items that help to discriminate between learners who understand the subject matter and those who do not (i.e. a key features approach) (Daniels et al. 2014). Long checklists that reward nonspecific thoroughness, as opposed to focusing on key clinically discriminating features in a history or physical examination, will not extrapolate well to what we want in physicians as thoughtful diagnosticians. Although it intuitively makes sense to apply differential weights to checklist items based on their perceived importance, weighting items does not appear to significantly affect overall reliability or pass/fail decisions (Sandilands et al. 2014); thus, the decision regarding weighting checklist items should be based on considerations of the construct of interest and the behaviors you are seeking to reward.

Unlike checklist items, rating scales can capture a wider spectrum of performance and are better suited for skills that exist along a continuum (e.g. communication, rapport, organization, procedural flow, etc.) (Swanson and van der Vleuten 2013). Rating scales allow raters to make judgments about candidate performance, thus capitalizing on their expertise. When developing rating scales, careful consideration should be given to the anchors used to provide guidance to raters. Vague anchors (e.g. inferior, borderline, or excellent) may be less meaningful to raters than behavioral anchors (e.g. scattered, shotgun approach to the problem). Entrustability-aligned scales (e.g. "could perform the procedure with minimal assistance") are emerging as a useful approach to assessment, but are generally reserved for workplace-based assessment (Gofton et al. 2012).
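To make the checklist/rating-scale distinction concrete, the sketch below combines both into a single station score. The items, anchors, and 60/40 weighting are hypothetical design choices for illustration, not a recommended standard.

```python
# Sketch of a station score combining a dichotomous checklist with a
# five-point rating scale. Items, anchors, and the 60/40 weighting are
# hypothetical design choices.

checklist = {                      # 1 = done, 0 = not done
    "asked about smoking history": 1,
    "asked about exertional symptoms": 1,
    "identified the JVP": 0,
    "auscultated in all four areas": 1,
}

rating_scales = {                  # behaviorally anchored, 1 (poor) to 5 (excellent)
    "organization": 4,
    "communication": 3,
}

checklist_pct = 100 * sum(checklist.values()) / len(checklist)
rating_pct = 100 * sum(rating_scales.values()) / (5 * len(rating_scales))

station_score = 0.6 * checklist_pct + 0.4 * rating_pct
print(f"Station score: {station_score:.1f}%")
```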
Tip 5
Train your raters

Further support for Scoring includes evidence demonstrating that raters were trained to ensure they interpreted scoring rubrics as intended. Raters should be provided with an orientation that includes information about the purpose of the OSCE, the level of the learners, and how they should interact with learners (e.g. can they provide prompts or feedback to learners?). They should also be provided with examples of the scoring rubrics, including the operational definition of success for any checklist items and the meaning of each behavioral anchor for rating scales.

A more detailed form of orientation, such as frame-of-reference training, is sometimes provided to raters. This involves creating a shared mental model of the desired performance by defining performance dimensions, providing examples of behaviors for each dimension, and then allowing raters to practice and receive feedback on sample performances (Roch et al. 2012). This method can be time-consuming and is usually reserved for high-stakes examinations, but it can strengthen the validity argument for scoring.

It is important to remember that any undesired variation in rater scoring may introduce construct-irrelevant variance and thus threaten the validity of scoring inferences. Despite training, raters may make mistakes. Although traditionally we think of some raters as excessively harsh or lenient compared to others (i.e. hawks and doves), more recent research demonstrates that rater variability is more complex than this (Govaerts et al. 2013; Gingerich et al. 2014). That said, when systematic errors in raters are evident, there are published approaches to deal with them (Bartman et al. 2013; Fuller et al. 2017) that are beyond the scope of this paper.
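As a generic illustration only (not the published methods of Bartman et al. or Fuller et al.), one simple screen is to flag raters whose mean awarded score sits unusually far from the rest of the rater pool. The data and the flagging threshold below are hypothetical.

```python
# Generic hawk/dove screen: flag raters whose mean awarded score lies far
# from the rater pool. Hypothetical data and threshold; see Bartman et al.
# (2013) and Fuller et al. (2017) for published methods.
import statistics

rater_means = {"rater_01": 71.4, "rater_02": 69.8, "rater_03": 52.1,
               "rater_04": 70.6, "rater_05": 88.9, "rater_06": 68.3}

pool_mean = statistics.mean(rater_means.values())
pool_sd = statistics.stdev(rater_means.values())

for rater, m in rater_means.items():
    z = (m - pool_mean) / pool_sd
    if abs(z) > 1.5:   # hypothetical flagging threshold
        label = "possible hawk" if z < 0 else "possible dove"
        print(f"{rater}: mean {m:.1f} (z = {z:+.1f}) -> {label}")
```

In practice, any such flag should trigger review of the rater's sheets and station assignments, not automatic score adjustment.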
Tip 6
Develop scripts for and train standardized patients

Most OSCEs employ standardized patients (SPs) to allow learners to demonstrate their clinical skills. A rigorous and standardized approach to SP training provides further validity evidence for the integrity of Scoring, as it reduces the variance between SP portrayals. SPs should be provided with a script to guide their portrayal, and basing the script on a real patient adds authenticity. For history stations, the script should be relatively rich in details about: the presenting problem (including a timeline and pertinent positives and negatives); the SP's past medical history (including medication use); and social history (e.g. smoking and alcohol use), as required. At a minimum, there should be a scripted answer for all checklist items, but answers should also be provided for any anticipated questions that learners might ask. For unanticipated questions, SPs can be trained to answer either "no" or "I'm not sure", depending on the context. In contrast, for physical examination stations, fewer details may be required, but SPs can be trained to react to stimuli (e.g. guarding during an abdominal examination, limited range of motion of a joint, etc.). Other details to be included in the script may relate to demographics (e.g. age and gender), the SP's starting position in the room (e.g. sitting vs lying down), appearance (e.g. anxious vs calm), and behavior (e.g. cooperative vs evasive). The script may also include statements or prompts for the SP to ask (e.g. "What do you think is going on with me?") to allow raters to better assess learners' understanding of the problem.

Tip 7
Ensure integrity of data collection processes

Data collection should include some form of quality assurance to ensure data integrity. This provides further evidence that test scores reflect the observations (Kane's Scoring stage). During an OSCE, staff can periodically verify that raters are completing the rating instruments correctly (i.e. not skipping any items) and address any questions they might have. After the OSCE, if scores are manually entered into a computer, a random set of score sheets should be checked to ensure accurate data entry. There are reasonably priced software packages for creating scannable score sheets, which reduces, but does not eliminate, the need for random verification. Some centers may have access to tablets and eOSCE systems, which have the added advantages of reducing the time needed to transcribe comments and the number of missed rating scales, and can increase the quantity and quality of feedback (Daniels et al. 2016; Denison et al. 2016). However, reliable internet access for internet-based systems, and back-up plans for when a tablet or the eOSCE system fails, are imperative.

Decisions must also be made about missing data (e.g. a rating scale that is left blank). For example, scores may be calculated without the missing data, data may be extrapolated, or, in extreme circumstances, the station may need to be deleted if there is insufficient data to render a judgment (Fuller et al. 2017).

Finally, as with any assessment, one must consider the issue of test security. To ensure an accurate measurement of learners' abilities, it is important that all students have equal access to information about the assessment. Unauthorized access to test materials (e.g. through student-created ghost banks) provides learners with an unfair advantage that threatens the validity of the interpretation of scores from the OSCE.

Tip 8
Choose a standard setting approach

The choice of standard-setting method (i.e. how the cut score is determined) also deserves careful attention in order to support the validity of score interpretations, as this impacts the Implications of the assessment. Cut scores that are inappropriately high may result in failing learners who are actually competent, while cut scores that are too low may lead weak learners to be overly confident in their abilities. This is especially important for high-stakes assessments in which pass-fail decisions have important repercussions for learners, educators, and patients.

Although there is no gold standard for setting a cut score, a detailed rationale for the method chosen should be provided. The three most common criterion-referenced methods used for OSCEs are the Angoff, Borderline Group, and Borderline Regression methods. Detailed explanations of each are provided in Yousuf and colleagues' recent OSCE standard setting study (Yousuf et al. 2015). The chosen method is applied at the station level to determine the initial cut score (the borderline regression calculation is sketched after this tip). The next decision is whether the overall pass/fail determination should be based on the overall OSCE score alone, or whether examinees must also pass a minimum number of stations. The latter (conjunctive) approach is favored by some educators to ensure that examinees demonstrate a breadth of knowledge (i.e. so that failing performances on several stations cannot be compensated for by very strong performances on others) (Homer et al. 2017). A conjunctive approach will increase failure rates, so this decision should be based on the intended use of the OSCE and the consequences of failing it.
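Of the three methods, borderline regression lends itself to a compact illustration: station total scores are regressed on examiners' global ratings, and the cut score is the predicted total at the rating labelled "borderline". The data and the five-point scale below are invented.

```python
# Borderline regression sketch: regress station total scores on examiners'
# global ratings and read off the predicted score at the "borderline"
# rating. Scores, ratings, and scale are invented for illustration.

global_ratings = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]   # 2 = "borderline" (hypothetical scale)
total_scores   = [35, 48, 52, 61, 64, 66, 74, 77, 85, 90]

n = len(global_ratings)
mean_x = sum(global_ratings) / n
mean_y = sum(total_scores) / n
slope = (sum((x - mean_x) * (y - mean_y)
             for x, y in zip(global_ratings, total_scores))
         / sum((x - mean_x) ** 2 for x in global_ratings))
intercept = mean_y - slope * mean_x

BORDERLINE = 2                     # scale point labelled "borderline"
cut_score = intercept + slope * BORDERLINE
print(f"Station cut score: {cut_score:.1f}")
```

Because the regression uses every examinee's data rather than only the borderline group, the method is relatively stable in the modest cohort sizes typical of local OSCEs.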
Tip 9
Consider how well the OSCE would generalize to all possible forms

Another important source of validity evidence relates to the Generalizability of the results. Support for this element of the validity argument can be provided by analyzing the psychometric properties of the OSCE. The reliability (i.e. reproducibility) of scores is an important element of validity evidence. Many readers will be familiar with Cronbach's alpha, which is available in common statistical software packages (and is sketched at the end of this tip). Alpha is usually calculated across stations to measure overall reliability and to look for problematic stations. If decisions are made based on performance in a single station (e.g. failing a station leads to remediating that specific station), then alpha can also be used at the station level to evaluate reliability and identify problematic items.

Because OSCEs are inherently multi-faceted (e.g. persons, items, raters, tracks, etc.), generalizability theory (G-theory) is often preferred for calculating reliability as well as for determining the impact of the various sources of error. However, G-theory works best if there are multiple raters per station; otherwise, one cannot tease out the variance due to raters from the variance due to the station. There are freely available packages for running G-studies, such as the syntax-based GENOVA (Crick and Brennan 1983) and the more user-friendly G-String IV (Bloch and Norman 2015). For more on G-theory, one can review the AMEE guide on the topic (Bloch and Norman 2012) but, in brief, G-theory seeks to estimate the various sources of error in measuring the construct of interest.

Regardless of which statistic is used (alpha or a G coefficient), the desired value depends on the purpose and use of the test. If a high stakes decision is based on one OSCE, such as a national certification exam, the desired coefficient is 0.8 or even 0.9 (Downing 2004), whereas for a moderate stakes, locally developed examination, especially one that is part of a program of assessment, lower reliability would be acceptable. Hence, the intended use of the assessment (Tip 1) frames everything.
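For readers who prefer to see the calculation, the sketch below computes Cronbach's alpha across stations from a candidate-by-station score matrix, treating each station total as an "item". The matrix is invented for illustration.

```python
# Cronbach's alpha across OSCE stations (stations treated as items).
# The candidate-by-station score matrix is invented for illustration.
import statistics

# rows = candidates, columns = stations (percentage scores)
scores = [
    [70, 65, 80, 72],
    [55, 60, 58, 62],
    [82, 78, 85, 80],
    [64, 70, 66, 71],
    [75, 72, 77, 74],
]

k = len(scores[0])                                   # number of stations
station_vars = [statistics.variance(col) for col in zip(*scores)]
total_var = statistics.variance([sum(row) for row in scores])
alpha = (k / (k - 1)) * (1 - sum(station_vars) / total_var)
print(f"Cronbach's alpha across {k} stations: {alpha:.2f}")
```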
Tip 10
Review the correlation of your examination with other variables

One of the main reasons for an argument-based approach to validity is the lack of an easy gold standard criterion against which we can compare our assessments in medical education. In medical education, the strongest such evidence would come in the form of patient outcomes. An example of such work is the study by Tamblyn and colleagues (Tamblyn et al. 1998), which demonstrated that lower scores on a licensing examination were associated with lower quality of clinical practice, as measured by patterns in consultations, prescribing, and mammography screening. These data support the Extrapolation stage of the validity argument for that licensing exam.

More commonly, evidence is sought by comparing OSCE scores to other assessments. For example, Pugh and colleagues (Pugh, Bhanji, et al. 2016) demonstrated that performance on a locally developed Internal Medicine OSCE progress test correlated with scores on the high stakes Internal Medicine certification examination and could identify residents at an elevated risk of failure. Not all correlations need to be done with data external to the institution. Local data can be used to correlate OSCE scores with other assessments measuring similar and dissimilar competencies. For example, if OSCE scores correlate better with workplace-based assessments than with an MCQ exam, this supports the validity argument, as both the OSCE and workplace-based assessments measure performance rather than knowledge. Another analysis could examine whether an OSCE discriminates between more senior and more junior learners, as this also provides validity evidence.
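Such correlational analyses require only routine result files. In the sketch below, all scores are invented; the expected pattern (OSCE correlating more strongly with a workplace-based assessment than with an MCQ exam) is the hypothesis being checked, not a finding.

```python
# Correlate OSCE totals with other assessments of the same learners.
# All scores are invented; the OSCE~WBA > OSCE~MCQ pattern is the
# hypothesis under examination, not a result.
from statistics import correlation   # Python 3.10+

osce = [62, 70, 75, 80, 68, 73, 77, 85]
wba  = [60, 72, 74, 82, 65, 75, 79, 88]   # workplace-based assessment
mcq  = [70, 64, 78, 72, 75, 66, 80, 74]   # written knowledge exam

print(f"OSCE vs WBA: r = {correlation(osce, wba):.2f}")
print(f"OSCE vs MCQ: r = {correlation(osce, mcq):.2f}")
```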
Tip 11
Evaluate the effects of the OSCE on learners

Whether formative or summative, we know that assessment drives learning (Kane's Implications stage). However, it is important to recall that assessment can influence learning in both positive and negative ways (Pugh and Regehr 2016), and so one should seek evidence for how an OSCE is promoting or impeding learning. Cook and colleagues (2015), in their review of Kane's model, argue that this is an underutilized but important aspect of the validity argument. Questions to be considered include: How does the OSCE influence learning? What are the outcomes of learners who fail versus those who pass? If remediation is provided to those who fail, is there evidence that performance improves on a repeat assessment? How does the OSCE influence subsequent changes in the curriculum (e.g. if a high number of candidates fail a station) and, conversely, do changes to the curriculum influence OSCE performance? And, finally, how does the OSCE influence patient care?

If the purpose of the OSCE is to drive learning, is there data to show that learners are learning as a result of the OSCE? Follow-up surveys or focus groups of learners can examine the impact of assessment on learning. A recent study involved learners reviewing their OSCE rubrics immediately after the OSCE (tablet scoring was used to facilitate this) and then writing an action plan describing what studying they would do and how they would change their clinical behavior as a result of reviewing their results. A follow-up survey demonstrated that this process did impact future learning, with almost all residents having either reviewed material or changed how they approached the history or physical exam in the workplace as a result of this feedback process (Strand and Daniels 2017).

Tip 12
Review the entire process to look for threats to validity

An argument for validity is an iterative process in which one states the proposed interpretation and use of the assessment, then examines the evidence for validity; if the evidence does not support the intended interpretation or use, one either revises the use or revises the assessment process. This should happen continually to ensure the assessment is meeting its purpose. Too often this ongoing quality assurance is focused solely on psychometrics such as reliability, but all aspects of the development of an OSCE should be reviewed to look for issues related to each of the four stages of Kane's model. For a full guide on OSCE quality assurance strategies, we refer the reader to Pell and colleagues' AMEE guide (Pell et al. 2010). Some OSCE metrics that are often overlooked are the percentage of students who fail overall or fail a specific station (which can serve as program evaluation information), the correlation between a station's total score and its global rating scale (a low correlation raises concern about the score sheet content), and comparisons between groups who encounter the same stations but with differences such as raters or locations (Pell et al. 2010).
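Two of these overlooked metrics can be computed in a few lines, as sketched below with invented data: the station failure rate and the correlation between a station's checklist total and its global rating.

```python
# Routine OSCE quality-assurance metrics in the spirit of Pell et al.
# (2010): station failure rate, and the correlation between a station's
# checklist total and its global rating. All data invented.
from statistics import correlation   # Python 3.10+

station_cut = 58.0                         # hypothetical station cut score
checklist_totals = [45, 52, 66, 71, 59, 80, 62, 49]
global_ratings   = [2, 2, 3, 4, 3, 5, 3, 1]   # 1-5 scale

fail_rate = sum(s < station_cut for s in checklist_totals) / len(checklist_totals)
r = correlation(checklist_totals, global_ratings)

print(f"Station failure rate: {fail_rate:.0%}")
print(f"Checklist total vs global rating: r = {r:.2f}")
# A low correlation here would prompt a review of the score sheet content.
```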
Conclusions

Development of an OSCE is a significant undertaking with several steps involved. However, keeping a validity framework such as Kane's model in mind during development will yield assessment data that can be used for the intended purpose of the assessment, whether it is a high stakes end-of-training exam or a low stakes formative exercise.

Acknowledgements

Dr. Daniels would like to acknowledge the Department of Medicine's Academic Alternative Relationship Plan at the University of Alberta for its financial support. Dr. Pugh would like to acknowledge the Department of Medicine at The Ottawa Hospital for their financial support.

Disclosure statement

The authors report no conflicts of interest. The authors alone are responsible for the content and writing of this article.

Notes on contributors

Vijay John Daniels, MD, MHPE, FRCPC, is an Associate Professor in the Department of Medicine at the University of Alberta. He is a member of the Royal College of Physicians and Surgeons of Canada's Examination Committee, which reviews the quality of all specialty certification examinations.

Debra Pugh, MD, MHPE, FRCPC, is an Associate Professor in the Department of Medicine at the University of Ottawa. She serves as Vice Chair of the Central Examination Committee at the Medical Council of Canada, and Vice Chair of the General Internal Medicine Examination Board at the Royal College of Physicians and Surgeons of Canada.

ORCID

Vijay John Daniels http://orcid.org/0000-0002-6350-3129
Debra Pugh http://orcid.org/0000-0003-4076-9669

References

Bartman I, Smee S, Roy M. 2013. A method for identifying extreme OSCE examiners. Clin Teach. 10:27–31.
Bloch R, Norman G. 2012. Generalizability theory for the perplexed: a practical introduction and guide: AMEE Guide No. 68. Med Teach. 34:960–992.
Bloch R, Norman G. 2015. G-String IV program. http://fhsperd.mcmaster.ca/g_string/download.html.
Brannick MT, Erol-Korkmaz HT, Prewett M. 2011. A systematic review of the reliability of objective structured clinical examination scores. Med Educ. 45:1181–1189.
Casey PM, Goepfert AR, Espey EL, Hammoud MM, Kaczmarczyk JM, Katz NT, Neutens JJ, Nuthalapaty FS, Peskin E; Association of Professors of Gynecology and Obstetrics Undergraduate Medical Education Committee. 2009. To the point: reviews in medical education - the objective structured clinical examination. Am J Obstet Gynecol. 200:25–34.
Coderre S, Woloschuk W, McLaughlin K. 2009. Twelve tips for blueprinting. Med Teach. 31:322–324.
Cömert M, Zill JM, Christalle E, Dirmaier J, Härter M, Scholl I. 2016. Assessing communication skills of medical students in objective structured clinical examinations (OSCE): a systematic review of rating scales. PLoS One. 11:e0152717.
Cook DA, Brydges R, Ginsburg S, Hatala R. 2015. A contemporary approach to validity arguments: a practical guide to Kane's framework. Med Educ. 49:560–575.
Crick JE, Brennan RL. 1983. GENOVA program. https://education.uiowa.edu/centers/center-advanced-studies-measurement-and-assessment/computer-programs.
Currie GP, Sivasubramaniam S, Cleland J. 2016. Sequential testing in a high stakes OSCE: determining number of screening tests. Med Teach. 38:708–714.
Daniels VJ, Bordage G, Gierl MJ, Yudkowsky R. 2014. Effect of clinically discriminating, evidence-based checklist items on the reliability of scores from an Internal Medicine residency OSCE. Adv in Health Sci Educ. 19:497–506.
Daniels VJ, Surgin C, Lai H. 2016. Enhancing formative feedback of an OSCE through tablet scoring. Med Educ. 50(Suppl 1):28. doi:10.1111/medu.13057.
Denison A, Bate E, Thompson J. 2016. Tablet versus paper marking in assessment: feedback matters. Perspect Med Educ. 5:108–113.
Downing SM. 2004. Reliability: on the reproducibility of assessment data. Med Educ. 38:1006–1012.
Frank JR, Snell L, Sherbino J, editors. 2015. CanMEDS 2015 physician competency framework. Ottawa: Royal College of Physicians and Surgeons of Canada.
Fuller R, Homer M, Pell G, Hallam J. 2017. Managing extremes of assessor judgment within the OSCE. Med Teach. 39:58–66.
Gingerich A, van der Vleuten CP, Eva KW, Regehr G. 2014. More consensus than idiosyncrasy: categorizing social judgments to examine variability in Mini-CEX ratings. Acad Med. 89:1510–1519.
Gofton WT, Dudek NL, Wood TJ, Balaa F, Hamstra SJ. 2012. The Ottawa surgical competency operating room evaluation (O-SCORE): a tool to assess surgical competence. Acad Med. 87:1401–1407.
Govaerts MJ, Van de Wiel MW, Schuwirth LW, Van der Vleuten CP, Muijtjens AM. 2013. Workplace-based assessment: raters' performance theories and constructs. Adv in Health Sci Educ. 18:375–396.
Harden RM. 1990. Twelve tips for organizing an objective structured clinical examination (OSCE). Med Teach. 12:259–264.
Harden RM, Stevenson M, Downie WW, Wilson GM. 1975. Assessment of clinical competence using objective structured examination. Br Med J. 1:447–451.
Hastie MJ, Spellman JL, Pagano PP, Hastie J, Egan BJ. 2014. Designing and implementing the objective structured clinical examination in anesthesiology. Anesthesiology. 120:196–203.
Hodges BD, Hollenberg E, McNaughton N, Hanson MD, Regehr G. 2014. The psychiatry OSCE: a 20-year retrospective. Acad Psychiatry. 38:26–34.
Homer M, Pell G, Fuller R. 2017. Problematizing the concept of the "borderline" group in performance assessments. Med Teach. 39:469–475.
Jefferies A, Simmons B, Tabak D, McIlroy JH, Lee KS, Roukema H, Skidmore M. 2007. Using an objective structured clinical examination (OSCE) to assess multiple physician competencies in postgraduate training. Med Teach. 29:183–191.
Kane MT. 2013. Validating the interpretations and uses of test scores. J Educ Meas. 50:1–73.
Khan KZ, Gaunt K, Ramachandran S, Pushkar P. 2013. The objective structured clinical examination (OSCE): AMEE guide no. 81. Part II: organisation and administration. Med Teach. 35:e1447–e1463.
Kreptul D, Thomas RE. 2016. Family medicine resident OSCEs: a systematic review. Educ Prim Care. 27:471–477.
Messick S. 1989. Validity. In: Linn RL, editor. Educational measurement. 3rd ed. New York (NY): American Council on Education and Macmillan. p. 13–103.
Nulty DD, Mitchell ML, Jeffrey CA, Henderson A, Groves M. 2011. Best practice guidelines for use of OSCEs: maximising value for student learning. Nurse Educ Today. 31:145–151.
Patrício MF, Julião M, Fareleira F, Carneiro AV. 2013. Is the OSCE a feasible tool to assess competencies in undergraduate medical education? Med Teach. 35:503–514.
Pell G, Fuller R, Homer M, Roberts T; International Association for Medical Education. 2010. How to measure the quality of the OSCE: a review of metrics - AMEE guide no. 49. Med Teach. 32:802–811.
Pell G, Fuller R, Homer M, Roberts T. 2013. Advancing the objective structured clinical examination: sequential testing in theory and practice. Med Educ. 47:569–577.
Pugh D, Bhanji F, Cole G, Dupre J, Hatala R, Humphrey-Murto S, Touchie C, Wood TJ. 2016. Do OSCE progress test scores predict performance in a national high-stakes examination? Med Educ. 50:351–358.
Pugh D, Halman S, Desjardins I, Humphrey-Murto S, Wood TJ. 2016. Done or almost done? Improving OSCE checklists to better capture performance in progress tests. Teach Learn Med. 28:406–414.
Pugh D, Regehr G. 2016. Taking the sting out of assessment: is there a role for progress testing? Med Educ. 50:721–729.
Pugh D, Smee S. 2013. Guidelines for the development of objective structured clinical examination (OSCE) cases. Ottawa: Medical Council of Canada.
Roch SG, Woehr DJ, Mishra V, Kieszczynska U. 2012. Rater training revisited: an updated meta-analytic review of frame-of-reference training. J Occup Organ Psychol. 85:370–395.
Sandilands DD, Gotzmann A, Roy M, Zumbo BD, De Champlain A. 2014. Weighting checklist items and station components on a large-scale OSCE: is it worth the effort? Med Teach. 36:585–590.
Schuwirth LW, Van der Vleuten CP. 2011. Programmatic assessment: from assessment of learning to assessment for learning. Med Teach. 33:478–485.
Setyonugroho W, Kennedy KM, Kropmans TJ. 2015. Reliability and validity of OSCE checklists used to assess the communication skills of undergraduate medical students: a systematic review. Patient Educ Couns. pii: S0738-3991:00277–00273.
Smith V, Muldoon K, Biesty L. 2012. The objective structured clinical examination (OSCE) as a strategy for assessing clinical competence in midwifery education in Ireland: a critical review. Nurse Educ Pract. 12:242–247.
Strand A, Daniels VJ. 2017. Improving learning outcomes through immediate OSCE score sheet review. Med Educ. 51(Suppl 1):114.
Sturpe DA. 2010. Objective structured clinical examinations in doctor of pharmacy programs in the United States. Am J Pharm Educ. 74:148.
Swanson DB, van der Vleuten CP. 2013. Assessment of clinical skills with standardized patients: state of the art revisited. Teach Learn Med. 25(Suppl 1):S17–S25.
Tamblyn R, Abrahamowicz M, Brailovsky C, Grand'Maison P, Lescop J, Norcini J, Girard N, Haggerty J. 1998. Association between licensing examination scores and resource use and quality of care in primary care practice. JAMA. 280:989–996.
Walsh M, Bailey PH, Koren I. 2009. Objective structured clinical evaluation of clinical competence: an integrative review. J Adv Nurs. 65:1584–1595.
Yousuf N, Violato C, Zuberi RW. 2015. Standard setting methods for pass/fail decisions on high-stakes objective structured clinical examinations: a validity study. Teach Learn Med. 27:280–291.