Exploring the use of videotaped objective structured clinical examination in the assessment of joint examination skills of medical students.
Arthritis & Rheumatism (Arthritis Care & Research)
Vol. 57, No. 5, June 15, 2007, pp 869–876
DOI 10.1002/art.22763
© 2007, American College of Rheumatology

ORIGINAL ARTICLE

Exploring the Use of Videotaped Objective Structured Clinical Examination in the Assessment of Joint Examination Skills of Medical Students

PIRASHANTHIE VIVEKANANDA-SCHMIDT,1 MARTYN LEWIS,1 DAVID COADY,2 CATHERINE MORLEY,2 LESLEY KAY,2 DAVID WALKER,2 AND ANDREW B. HASSELL3

Objective. Objective structured clinical examination (OSCE) is a key part of medical student assessment. Currently, assessment is performed by medical examiners in situ. Our objective was to determine whether assessment by videotaped OSCE is as reliable as live OSCE assessment.

Methods. Participants were 95 undergraduate medical students attending their musculoskeletal week at Freeman Hospital, Newcastle (UK). Student performance on OSCE stations for shoulder or knee examinations was assessed by experienced rheumatologists. The stations were also videotaped and scored by a rheumatologist independently. The examinations consisted of a 14-item checklist and a global rating scale (GRS).

Results. Mean values for the shoulder OSCE checklist were 17.9 by live assessment and 17.4 by video (n = 50), and 20.9 and 20.0 for live and video knee assessment, respectively (n = 45). Intraclass correlation coefficients for shoulder and knee checklists were 0.55 and 0.58, respectively, indicating moderate reliability between live and video scores for the OSCE checklists. GRS scores were less reliable than checklist scores. There was 84% agreement in the classification of examination grades between live and video checklist scores for the shoulder and 87% agreement for the knee (κ = 0.43 and 0.51, respectively; P < 0.001).

Conclusion.
Video OSCE has the potential to be reliable and offers some advantages over live OSCE including more efficient use of examiners’ time, increased fairness, and better monitoring of standards across various schools/sites. However, further work is needed to support our findings and to implement and evaluate the quality assurance issues identified in this work before justifiable recommendations can be made.

KEY WORDS. Videotape; Examination; Assessment; Skills.

Supported by an educational project grant from the Arthritis Research Campaign Education Subcommittee. The Virtual Rheumatology CD was funded by the Arthritis Research Campaign.

1 Pirashanthie Vivekananda-Schmidt, DPhil, Martyn Lewis, PhD: Primary Care Sciences Research Centre, Keele University, Keele, North Staffordshire, UK; 2 David Coady, MRCP, Catherine Morley, MRCP, Lesley Kay, FRCP, David Walker, MD: University of Newcastle upon Tyne, Newcastle upon Tyne, UK; 3 Andrew B. Hassell, MD: Keele University, Keele, North Staffordshire, UK.

Dr. Kay has received consulting fees and/or honoraria (less than $10,000 each) from Wyeth, Schering-Plough, and Abbott.

Address correspondence to Pirashanthie Vivekananda-Schmidt, DPhil, Academic Unit of Medical Education, 85 Wilkinson Street, Sheffield University, S10 2GB, UK. E-mail: email@example.com.

Submitted for publication May 4, 2006; accepted in revised form October 18, 2006.

INTRODUCTION

Objective structured clinical examinations (OSCEs), first introduced by Harden and Gleeson in 1979 (1), are widely used in the assessment of undergraduate and postgraduate medical students and are regarded as offering better validity than traditional long-case final examinations (2). In a long-case examination, the student sees a patient alone for 30–60 minutes, obtains a history, and performs a physical examination. The student is then questioned about the findings, relevant investigations, and further treatment of the patient. The processes of history taking and examination are not observed. Therefore, communication and examination skills may not be adequately assessed (3). The advantages of OSCEs include greater reliability (they provide a consistent challenge for all candidates assessed) and greater face and content validity (5) because the process as well as the outcome can be assessed. In addition, OSCEs allow sampling of a greater range of skills than the long-case examination. They have also been shown to correlate better with consultant rating of the candidate (2) than traditional clinical assessments based on long- and short-case examinations.

However, OSCEs require a great deal of organization, not least the coordination of a large number of clinicians to be in the same place at the same time for a single examination. Not only must these clinicians be in one place, they also should have undergone some training in the assessment to maximize reliability. Furthermore, long OSCE assessment sessions can affect the objectivity of the assessors due to fatigue.

Videotapes have been used for a number of years for a variety of purposes within medical education. They are perceived as effective learning resources in the field of communication skills (6,7) and have been used in the learning of skills for self and tutor assessment (8–11). Lane and Gottlieb (8) found that use of videotaping improved students’ interviewing skills and self assessment and had the advantage of identifying students who overrated themselves. Videos have also been used in the evaluation of educational interventions (12). They have been used for evaluating performance and competency (13) as well as rater bias (14). Videos have been used in the assessment of communication skills (15,16) and in the assessment of general practice trainees’ consultation skills in the UK since the 1980s and have been found to be effective, valid, and reliable (17).
Successful implementation of videotaped OSCEs (VOSCEs) would offer considerable potential advantages to faculty, examiners, and candidates. The first advantage is in terms of quality control. Videotaped OSCE stations offer the potential for establishing consensus between examiners, for investigating interexaminer variability, and even for comparison of standards between medical schools. It is possible to increase the objectivity of assessment by having assessors evaluate examination skills based on agreed, standardized marking criteria. The second advantage is in terms of practicality. Running an OSCE for a group of students is very time consuming and requires expensive clinical expertise and coordination. Videotaping the student performance and marking the performance at a later point means the OSCE can be run with relatively few, if any, clinicians present because stations do not necessarily have to be manned by clinical assessors. Therefore, the examination process may be perceived to be more efficient and reliable: the cost and stress involved in organizing the OSCE might be reduced while improving the consistency and fairness of assessments.

Evidence that VOSCEs are a practical, valid method of implementing OSCEs has not been established in the field of musculoskeletal medicine and is little explored in other fields of physical examination. In this study, we carried out formative OSCE assessments of third-year undergraduate medical students performing shoulder and/or knee examination as part of a larger educational randomized controlled trial (18) and videotaped these OSCEs. We present results of an investigation of the relationship between the live examiner’s assessment and that of a video assessor, and we discuss the practicalities of videotaping musculoskeletal examination OSCE stations.

PARTICIPANTS AND METHODS

Setting.
This study was performed alongside a randomized controlled trial evaluating the educational value of a computer-assisted learning program, Virtual Rheumatology CD (Newfangled Media, Stoke on Kent, UK), in the teaching of musculoskeletal clinical examination skills in undergraduate medical students (18). The study took place at the University of Newcastle upon Tyne, Newcastle, UK.

Participants. Participants were a subgroup of subjects who took part in the randomized controlled trial and included third-year undergraduate medical students attending their musculoskeletal week during a 12-week clinical skills module at Freeman Hospital, Newcastle between January 2002 and June 2003. Prior to the start of placement, these students had attended a 1-week clinical skills block, which included teaching of musculoskeletal examination.

OSCE. The OSCE consisted of a station on knee examination and a station on shoulder examination. Participants in this study were examined on one station only. Each station was 6 minutes long. There was a 14-item checklist for scoring the OSCE. Students did not have access to this checklist prior to the examination. For each item, a score of 0, 1, or 2 was given for “not done,” “done,” and “done well,” respectively. Scores of individual items were summed for each station, resulting in total scores for the OSCE shoulder and knee assessments based on discrete numerical scales ranging from 0 (“not done” recorded on all items) to 28 (“done well” recorded on all items). In addition, we added a global rating scale (GRS) as a supplementary measure (a 10-cm visual analog scale ranging from 0 = poor to 100 = excellent) to the OSCE score sheets. GRS have been shown to be valid measures for the assessment of clinical skills (19).

Video recording of the OSCEs. Digital video cameras were attached to tripods and placed in the room where the examination took place. Student performance was videotaped with consent.
The recording was done on mini digital videotapes and was converted to VHS tapes for ease of scoring.

Assessors. Local rheumatology specialist registrars (SpR; qualified physicians undertaking specialist postgraduate training in rheumatology, equivalent to residents in the US) volunteered for the OSCE assessment of students. It was not feasible to train the registrars especially for this study but all had prior experience in administering and scoring OSCEs. We had 2 main assessors for the knee station and the shoulder station. However, due to clinical commitments, other SpRs had to stand in for our regular assessors. Overall, 4 raters were involved in assessing the knee and/or shoulder stations. A consultant rheumatologist (ABH), who was blind to the live scores of the OSCE assessment, scored the VOSCE for both knee and shoulder stations.

Figure 1. Scatterplots of interrater scores for live versus video scoring of the objective structured clinical examination.

Procedure. On day 4 of the week’s rotation, students were asked to volunteer for a formative OSCE examination. Students were randomly allocated to 1 of 2 OSCE stations and at the end of the assessment were given verbal and written feedback on their performance. To preserve participant confidentiality, each student was assigned an anonymized code, which protected his or her name and identity from the VOSCE assessor. Approval was obtained from respective chairs of the university ethics committee. Written consent was obtained from the students for video recording of the OSCE and for using the data from the OSCE for the research study.

Sample size. A sample size between 40 and 50 participants was required for each reliability analysis in order to calculate confidence intervals to the precision of ±0.2 on either side of the reliability coefficients.

Statistical analysis. The association between live and video OSCE checklist and GRS scores was assessed in a number of ways.
Mean difference and 95% limits of agreement were calculated. Consistency of scoring between the measures was assessed using Pearson’s correlation. Absolute agreement was determined using the intraclass correlation coefficient (ICC) using a 2-way random effects model (ICC2,1). OSCE checklist scores were classified according to the traditional examination grading system of fail (score <14, i.e., <50%), pass (score 14–20, i.e., 50–74%), and honors (score ≥21, i.e., ≥75%), and reliability between live and video grades was evaluated using observed agreements and the chance-corrected weighted kappa statistic (using linear weights). In addition to evaluating total scores, we also looked at the reliability of each of the individual items of the 2 OSCE stations using observed agreements and weighted kappa. Fleiss demonstrated that the ICC was closely related to the weighted kappa (20), and recommended that an ICC value <0.4 was poor, between 0.4 and 0.75 was fair to good, and >0.75 was excellent (21). We adopted the similar and widely accepted classification according to Landis and Koch (22) to provide adjectives to describe the reliability values for the ICC and kappa calculated in this study: 0.01–0.20 indicated slight, 0.21–0.40 indicated fair, 0.41–0.60 indicated moderate, 0.61–0.80 indicated substantial, and 0.81–1.00 indicated almost perfect.

A random subsample of participant videos was rescored after 3 months by the same consultant (ABH). The intrarater agreement of the OSCE checklist scores was evaluated by ICC and by kappa (after classifying the scores into grades [fail, pass, honors] as described above).

Table 1. Mean ± SD scores for live and video objective structured clinical examination assessments

           Shoulder                               Knee
           Composite scale    Global scale        Composite scale    Global scale
           (n = 50)           (n = 38)            (n = 45)           (n = 31)
Live       17.9 ± 3.4         76.0 ± 11.3         20.9 ± 2.5         73.0 ± 11.5
Video      17.4 ± 3.4         59.0 ± 16.1         20.0 ± 2.8         60.5 ± 15.8

Table 2. Reliability of scoring of the objective structured clinical examination for live versus video assessments*

                                        Shoulder                                     Knee
                                        Composite scale      Global scale            Composite scale      Global scale
                                        (n = 50)             (n = 38)                (n = 45)             (n = 31)
Mean absolute difference (range)†       2.26 (0–9)           17.8 (2–45.5)           1.96 (0–5)           15.3 (2–42.5)
Mean difference (±2 SD difference)‡     0.50 (−5.98, 6.98)   17.0 (−7.1, 41.2)       0.93 (−3.71, 5.57)   12.6 (−16.6, 41.7)
ICC2,1 (95% CI)                         0.55 (0.33, 0.72)    0.36 (−0.10, 0.69)      0.58 (0.34, 0.75)    0.32 (−0.05, 0.61)
Pearson’s r (95% CI)                    0.55 (0.32, 0.72)    0.66 (0.43, 0.81)       0.62 (0.38, 0.80)    0.46 (0.13, 0.70)

* ICC = intraclass correlation coefficient; 95% CI = 95% confidence interval.
† Absolute value of difference between live score and video score.
‡ Live score minus video score.

RESULTS

The results are based on 50 matched pairs of observations for the shoulder OSCE and 45 for the knee. Of the 4 assessors, one (CM) scored 39 (78%) participants, one (DC) scored 7 (14%), and another (MF) scored 4 (8%) at the shoulder station; one assessor (DC) scored 26 (58%) participants, one (MB) scored 15 (33%), and one (MF) scored 4 (9%) at the knee station. The subgroup of individuals who were assessed by VOSCE in addition to the OSCE for this study had similar baseline characteristics to individuals who were assessed by OSCE but not VOSCE, e.g., 66% and 69% were women, respectively; mean OSCE shoulder scores were 19.0 and 18.5, respectively; and mean OSCE knee scores were 21.1 and 21.0, respectively.

Paired data for the live and video scores are illustrated in Figure 1. Live and video summary scores on the OSCE checklist were very similar (Table 1). Mean values for the OSCE checklist shoulder scores were 17.9 by live assessment and 17.4 by video assessment, and for the knee scores were 20.9 and 20.0, respectively. By contrast, GRS scores were lower for the video assessment than the live assessment.
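The distinction drawn in the statistical analysis between consistency (Pearson’s correlation) and absolute agreement (ICC2,1) can be made concrete with a short sketch. The function below is one common way to compute ICC(2,1) from the mean squares of a two-way layout; it is an illustration under stated assumptions, not the authors’ analysis code. The function name `icc_2_1` and the six student scores are hypothetical and are not data from the study.

```python
import numpy as np


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is an (n_subjects, k_raters) array with no missing values.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # Sums of squares for subjects (rows), raters (columns), and residual.
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_total = ((x - grand) ** 2).sum()
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Hypothetical checklist scores for 6 students (not data from the study).
live = np.array([20.0, 15.0, 24.0, 18.0, 22.0, 12.0])
video_same = live.copy()        # video rater agrees exactly
video_shifted = live - 5.0      # same ranking, but 5 points lower throughout

icc_same = icc_2_1(np.column_stack([live, video_same]))
icc_shifted = icc_2_1(np.column_stack([live, video_shifted]))
r_shifted = np.corrcoef(live, video_shifted)[0, 1]

print(f"ICC(2,1), identical scores: {icc_same:.2f}")
print(f"ICC(2,1), shifted scores:   {icc_shifted:.2f}")
print(f"Pearson r, shifted scores:  {r_shifted:.2f}")
```

In this made-up example a constant offset between raters leaves Pearson’s r at 1 but lowers ICC(2,1), because the rater (column) variance enters the denominator. The same pattern, a systematic difference in level between two raters, is one way a Pearson coefficient can exceed the corresponding ICC.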
Pearson’s correlation coefficients between the live and videotaped assessments ranged from 0.46 to 0.66 across the OSCE checklist and GRS scores (Table 2). The ICCs indicated moderate reliability between video and live scores, with values of 0.55 and 0.58 for the OSCE checklist of the shoulder and knee, respectively. The reliability was only fair between scores for the global ratings.

Table 3. Agreement between live and video ratings for individual items of the objective structured clinical examination (OSCE) shoulder assessment*

                                                                 Agreement, %†
Shoulder OSCE items                                              Observed   Expected§   Kappa (95% CI)‡        P
Approach to the patient                                          81         81          −0.03 (−0.27, 0.21)
Inspected shoulder from in front and from behind                 96         91          0.57 (0.33, 0.81)      <0.001
Palpated shoulder for tenderness                                 93         87          0.46 (0.20, 0.72)      <0.001
Identifies bony landmarks                                        53         49          0.08 (−0.04, 0.20)
External rotation of the shoulder with the elbows tucked in      88         68          0.63 (0.41, 0.85)      <0.001
Asked patient to put hands behind head and hands behind back     84         50          0.68 (0.44, 0.92)      <0.001
Assess forward flexion                                           85         78          0.32 (0.12, 0.52)      <0.01
Assess extension                                                 84         72          0.41 (0.19, 0.63)      <0.001
Inspects active neck movements                                   95         70          0.83 (0.56, 1.00)      <0.001
Assess for painful arc                                           61         63          −0.04 (−0.17, 0.08)
Assess scapular movement (viewed from behind)                    81         66          0.44 (0.20, 0.64)      <0.001
Assess the acromioclavicular joint                               94         74          0.76 (0.52, 1.00)      <0.001
Performs resisted movement                                       93         52          0.85 (0.59, 1.00)      <0.001
Identifies abnormalities correctly                               71         50          0.42 (0.21, 0.62)      <0.001

* 95% CI = 95% confidence interval.
† Based on linear weights.
‡ Based on linear weights.
§ Agreement expected by chance alone; the kappa coefficient measures the chance-corrected agreement ([observed agreement − expected agreement]/[1 − expected agreement]).

Table 4. Agreement between live and video ratings for individual items of the objective structured clinical examination (OSCE) knee assessment*

                                                                 Agreement, %†
Knee OSCE items                                                  Observed   Expected§   Kappa (95% CI)‡        P
Approach to the patient (including asking about knee pain)       83         84          −0.06 (−0.32, 0.21)
Inspection (including from the end of the bed)                   87         84          0.18 (−0.10, 0.46)
Assessment of temperature                                        88         61          0.68 (0.43, 0.94)      <0.001
Assessment of muscle bulk                                        74         69          0.18 (0.00, 0.37)      <0.05
Palpation of patella                                             73         64          0.27 (0.07, 0.47)      <0.01
Palpate joint line (including the back of the knee)              84         69          0.50 (0.26, 0.74)      <0.001
Patella tap ± cross fluctuation                                  81         66          0.44 (0.21, 0.66)      <0.001
Assess full extension                                            80         58          0.52 (0.32, 0.73)      <0.001
Assess full flexion                                              91         84          0.45 (0.24, 0.66)      <0.001
Collateral ligament assessment at 15 degrees                     81         77          0.19 (−0.07, 0.45)
Undertakes active and passive movements                          81         66          0.44 (0.22, 0.66)      <0.001
Anterior draw test                                               97         97          −0.03 (−0.31, 0.25)
Gets patient to walk                                             38         35          0.05 (−0.06, 0.16)
Identifies normality/abnormalities correctly                     50         55          −0.12 (−0.32, 0.07)

* 95% CI = 95% confidence interval.
† Based on linear weights.
‡ Based on linear weights.
§ Agreement expected by chance alone; the kappa coefficient measures the chance-corrected agreement ([observed agreement − expected agreement]/[1 − expected agreement]).

The video examiner consistently scored candidates lower than did the live examiner on the GRS score, but not on the checklist score (Figure 1). Data comparing the live and video ratings of the individual items of the shoulder assessment and the knee assessment are presented in Tables 3 and 4, respectively. Large variations in reliability were seen across the items in both shoulder and knee OSCE stations.
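The chance-corrected formula given in the footnotes to Tables 3 and 4 can be applied directly to the tabulated percentages. The sketch below does this for three items; the helper name `kappa_from_percent` is hypothetical, and because the published percentages are rounded to whole numbers, the recomputed values can differ from the printed kappas in the second decimal place.

```python
def kappa_from_percent(observed_pct, expected_pct):
    """Chance-corrected agreement from observed and expected agreement (in %)."""
    po = observed_pct / 100.0
    pe = expected_pct / 100.0
    if pe >= 1.0:
        raise ValueError("expected agreement of 100% leaves kappa undefined")
    return (po - pe) / (1.0 - pe)


# (observed %, expected %) pairs as printed in Tables 3 and 4.
items = {
    "Palpated shoulder for tenderness": (93, 87),
    "Assessment of temperature (knee)": (88, 61),
    "Anterior draw test (knee)": (97, 97),
}

kappas = {
    name: round(kappa_from_percent(obs, exp), 2)
    for name, (obs, exp) in items.items()
}

for name, value in kappas.items():
    print(f"{name}: kappa ~ {value:.2f}")
```

For comparison, the printed kappas for these three items are 0.46, 0.68, and −0.03. The anterior draw test shows how an item with 97% observed agreement can still have a kappa near zero when the agreement expected by chance is also 97%.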
Substantial reliability (κ > 0.6) for shoulder OSCE was seen for the items “performs resisted movement,” “inspects active neck movements,” “assesses the acromioclavicular joint,” “asked patient to put hands behind head and hands behind back,” and “external rotation of the shoulder with the elbows tucked in.” Similarly, substantial reliability (κ > 0.6) for knee OSCE was observed for “assesses temperature,” and moderate reliability (κ = 0.41–0.60) was observed for “assesses full extension,” “assesses full flexion,” “palpates joint line,” “patella tap,” and “undertakes active and passive movements.”

Reliability was moderate (κ = 0.41–0.60) for the overall grades of the shoulder and knee OSCE assessments (Table 5). This could be further improved by considering the omission or modification of poorer agreement items (see Tables 3 and 4) within the OSCE checklists.

Table 5. Agreement between live and video ratings for graded classification of the objective structured clinical examination (OSCE) checklist*

                                 Video rater
Live rater               Fail (<14)    Pass (14–20)    Honors (≥21)
Shoulder checklist
  Fail (<14)             3 (6)         2 (4)           0 (0)
  Pass (14–20)           3 (6)         26 (52)         2 (4)
  Honors (≥21)           1 (2)         7 (14)          6 (12)
Knee checklist
  Fail (<14)             1 (2)         0 (0)           0 (0)
  Pass (14–20)           0 (0)         17 (38)         4 (9)
  Honors (≥21)           0 (0)         8 (18)          15 (33)

                         Weighted agreement, %†
                         Observed      Expected        Weighted kappa (95% CI)
Shoulder checklist       85            72              0.43 (0.23, 0.63)‡
Knee checklist           87            73              0.51 (0.24, 0.78)‡

* Values are the number (percentage) unless otherwise indicated. 95% CI = 95% confidence interval.
† Based on linear weights.
‡ P < 0.001.

We rescored 22 video OSCEs to evaluate the intrarater agreement of the VOSCE. The test–retest included 11 shoulder checklists and 11 knee checklists. The scores were pooled so that the reliability analysis was based on 22 pairs of scores. The ICC was 0.98 (95% confidence interval 0.96, 0.99) and the kappa value based on graded classification of scores was 1.00 (100% agreement).
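The linear-weighted kappa reported for the graded classification can be recomputed from the cell counts in Table 5. The sketch below is a minimal illustration, assuming that rows correspond to the live rater and columns to the video rater; because linear weights are symmetric, the result is the same if the table is transposed. The function name `linear_weighted_kappa` is hypothetical and this is not the authors’ code.

```python
import numpy as np


def linear_weighted_kappa(table):
    """Linear-weighted kappa for a square table of ordered categories.

    Returns (observed agreement, expected agreement, kappa) as fractions.
    """
    counts = np.asarray(table, dtype=float)
    k = counts.shape[0]
    p = counts / counts.sum()
    idx = np.arange(k)
    # Weight 1 on the diagonal, falling linearly to 0 at maximal disagreement.
    weights = 1.0 - np.abs(idx[:, None] - idx[None, :]) / (k - 1)
    observed = (weights * p).sum()
    expected = (weights * np.outer(p.sum(axis=1), p.sum(axis=0))).sum()
    kappa = (observed - expected) / (1.0 - expected)
    return observed, expected, kappa


# Cell counts from Table 5; categories ordered fail, pass, honors.
# Assumed orientation: rows = live rater, columns = video rater.
shoulder = [[3, 2, 0], [3, 26, 2], [1, 7, 6]]
knee = [[1, 0, 0], [0, 17, 4], [0, 8, 15]]

sh_obs, sh_exp, sh_kappa = linear_weighted_kappa(shoulder)
kn_obs, kn_exp, kn_kappa = linear_weighted_kappa(knee)

print(f"Shoulder: observed {sh_obs:.1%}, expected {sh_exp:.1%}, kappa {sh_kappa:.2f}")
print(f"Knee:     observed {kn_obs:.1%}, expected {kn_exp:.1%}, kappa {kn_kappa:.2f}")
```

With these counts the sketch gives weighted kappas of about 0.43 (shoulder) and 0.51 (knee), in line with Table 5. The recomputed observed agreement for the shoulder is 84%, which matches the figure quoted in the abstract, whereas Table 5 prints 85.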
DISCUSSION

Our goal was to investigate the relationship between the assessments of live and videotaped OSCE stations. Our results demonstrated moderate interrater reliability between the live scorer(s) and the video scorer for both the knee station and the shoulder station using a checklist scoring approach. The interrater reliability using a GRS was lower: the live (SpR) scorer consistently scored the students higher than did the (consultant) video scorer, indicating examiner bias. Poor interrater reliability for the GRS may reflect different expectations on the part of a strict consultant compared with the lenient SpR. Additionally, it is possible that the live examiner forms more of a relationship with the candidate and therefore tends to give them higher scores.

Reliability between live and video assessments ranged from moderate to almost perfect for 16 of 28 of the individual items of the OSCE checklist. We were not able to distinguish which of the 2 methods of assessment, live or video, was more accurate because there was no gold standard to compare against. The reasons for the poorer agreement of the remaining 12 items may be viewed from clinical and statistical perspectives. The shoulder items “identifies bony landmarks” and “assess for painful arc” had poor reliability. Identification of bony landmarks can be particularly difficult to score on a video if the focus is not close enough and/or the students do not name the landmarks or explain what they are doing. The video assessor scored students as having done well only if students gave a verbal description of what they were palpating during the procedure. This highlights one of the key areas where differences may occur. In the event of any uncertainty regarding any aspect of the student clinical examination, live assessors can seek further clarification from the students. This is not possible via prospective scoring by video, and underlines a limitation of the video method of assessment.
Similarly, the items “inspection” and “assessing muscle bulk” from the knee station are also difficult to score via video unless students describe what they are doing. There was only slight reliability for “collateral ligament assessment at 15 degrees” from the knee checklist, which may be explained in part by the fact that this examination requires complex movements and handling skills. Hoving et al (23) found that movements that were complex and required handling skills had poorer interrater reliability than movements that were simple.

The item “approach to the patient” scored very poorly at both knee and shoulder stations in terms of agreement. The video assessor gave a score of 2 only if the student both introduced themselves and specifically asked the patient about pain prior to examining the patient, whereas the live assessors did not appear to have the same criteria for scoring this question. This finding raises another generic point concerning checklist marking: to maximize reliability, the checklist must make explicit how marks are awarded. The item “gets patient to walk” was also scored very differently between the live and video examiners. Scores by the video examiner were most frequently recorded as 0 (“not done”) whereas the live examiners most frequently scored this item as “done well,” suggesting that there were quite different scoring criteria adopted for the 2 approaches, the criteria for the video scoring being more strict. If a student was instructed or prompted by the video examiner, a full mark was not awarded. The item “identifies normality/abnormalities correctly” from the knee station had only slight reliability, although there was better agreement for this item in relation to the shoulder station. Because this item draws from the other checklist items, the poor agreement for this item can be due to poor concordance within other items.
It should be noted that poor reliability may also be inferred from results based on inadequate statistical measurement. In the context of this study, less than moderate reliability was concluded for some items (specifically “approach to the patient,” “inspection,” “anterior draw test”) when the expected (or chance) agreements of the items were high. As the expected agreement increases, the kappa becomes increasingly limited in its capacity to yield meaningful reliability values (24,25). If for a specified item a certain category has a high likelihood of being scored by all raters, then the expected interrater agreement for that item will be high. For example, both the live rater and the video rater most frequently scored the items “inspection” and “anterior draw test” as having been done well because both items were relatively easy examinations for the students to perform. As a result, the expected agreements of the 2 items were 84% and 97%, respectively (i.e., close to 100%), leaving little room for measuring agreement above that expected by chance alone.

No gold standard exists to establish the content validity of a musculoskeletal examination OSCE station. However, Coady et al (26) have derived a core set of clinical skills relevant to musculoskeletal examination skills in students. Of the 22 core skills relevant to the examination of the shoulder and knee joint from the Regional Examination of Musculoskeletal System (REMS) for undergraduate medical students (26), our OSCEs included 19 skills. The skills not addressed by our tool include assessing leg length when leg length discrepancy is suspected and when appropriate, assessing neurologic and vascular systems during the assessment of a problematic joint, and making a qualitative assessment of movement.

There is published evidence that examiners’ clinical experience has an impact on interexaminer agreement on the palpatory diagnosis in osteopathy (27).
In this study, the level of agreement between live and video examiners might have been stronger had their level of clinical experience been closer.

Unlike the study by Branch and Lipsky (28), which measured the impact of an educational intervention on retention, confidence, and ability of musculoskeletal examination skills of medical students, ours is an exploratory study. There are aspects of this study that could be improved. The key area is the lack of face-to-face preassessment discussion between all the assessors on how to score each of the items. This was not possible for several pragmatic reasons. The video assessor was geographically too far away from the live assessors. Owing to the busy schedule of the clinical placement, the formative OSCE assessments were offered as an optional addition during the lunch hour and the volunteer assessors had little time to prediscuss scoring criteria for assessment.

OSCE assessment via video is a very attractive proposition in the current climate of increasing pressure for clinicians to take on the role of teachers and assessors. It may also provide a higher level of consistency between institutions and pave the way for better quality assurance, such as anonymized marking to increase fairness, ability of all students to go through the stations in the same order, and ability of the facility to monitor standards in assessment across various hospital sites as well as across schools. Moreover, interrater reliability of live scorers has been shown to vary from 0.25 at some stations to 0.77 at others (29), providing evidence that the consistency between live assessors is not much different from the reliability between live and video assessments in this study. To improve the reliability of video or live assessment, it is important to improve the process of assessment (for example, by standardizing methods of evaluation, scoring, and administration).
Our test–retest results, albeit based on a subsample of our original study population, suggest that reproducibility of video scoring is likely to be almost perfect, which further implies that the overall reliability of video scoring by different observers and its reliability against live scoring would probably be increased by standardizing the methods of evaluation and trying to establish scoring consensuses between different assessors. There are a number of other key pragmatic issues, which need to be taken into account when designing an OSCE station that is to be videotaped. It is important that the necessary equipment and expertise are available so that good quality recordings can be obtained. We discarded one videotaped examination as not scorable due to poor positioning of the video camera and therefore poor recording. In this study, we used only 1 video camera to assess the student examining the patient’s joint. An alternative method would be to use 2 cameras simultaneously, where one camera could record the student examining and the other could focus on the joint being inspected. The latter may give better visual information to the video assessor and may improve the reliability of assessment of items that involve visual inspection. However, this would have to be weighed against the probable increase in duration of assessment by video. Although studies in other specialties (largely communication skills) demonstrate that the use of videotaping can be a valuable learning experience for students to improve their skills, not all students may be comfortable with being recorded. We did not explore students’ views in this study. It is also known that student performance may be inﬂuenced by the videotaping process (30). Offering a certain period of adaptation time before the formal assessment phase begins so that the students have a chance to familiarize themselves with the environment may minimize this effect. 
In contrast, it could be argued that students may express different levels of anxiety about performing clinical examinations in front of a live examiner. It also remains to be seen if VOSCEs are suitable for other specialties in medicine.

Further work is needed to establish the potential for the VOSCE in the assessment of clinical examination skills. The reliability of video scoring after standardized scoring methods have been put in place should be established; our work on intraobserver variability suggests reliabilities will be considerably enhanced. There is room for further investigation of how procedures including the set up process and quality of equipment can improve the integrity of scoring videotaped OSCE assessments. We have not yet addressed the views of examiners and students regarding videotaped assessment. Finally, there is considerable opportunity for investigating whether VOSCE assessments are valid across different clinical specialties.

In conclusion, VOSCEs have the potential of improving quality assurance and saving resources. In practice they need to be conducted with care, taking into account practical issues of camera and patient placement as well as the principles of effective assessment, with good examiner training to ensure consistency of scoring. Finally, this study highlights the potential of VOSCE stations in examiner training. We cannot conclude that videotaped scoring is better than live scoring of OSCE assessments, but our findings do suggest that VOSCE may be an efficient and reliable alternative to traditional live scoring.

ACKNOWLEDGMENTS

We thank Dr. Matt Bridges and Dr. Mohammed Farhod for their help in conducting this study.

AUTHOR CONTRIBUTIONS

Dr. Vivekananda-Schmidt had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Study design. Vivekananda-Schmidt, Lewis, Coady, Hassell.
Acquisition of data.
Vivekananda-Schmidt, Coady, Morley, Kay, Walker. Analysis and interpretation of data. Vivekananda-Schmidt, Lewis, Hassell. Manuscript preparation. Vivekananda-Schmidt, Lewis, Coady, Hassell. Statistical analysis. Vivekananda-Schmidt, Lewis. REFERENCES 1. Harden RM, Gleeson FA. Assessment of clinical competence using an objective structured clinical examination (OSCE). Med Educ 1979;13:41–54. 2. Probert CS, Cahill DJ, McCann GL, Ben-Shlomo Y. Traditional ﬁnals and OSCEs in predicting consultant and self-reported clinical skills of PRHOs: a pilot study. Med Educ 2003;37: 597– 602. 3. Sood R. Long case examination: can it be improved? J India Acad Clin Med 2001;2:251–5. 4. Harden RM, Stevenson M, Downie WW, Wilson GM. Assessment of clinical competence using the objective structured examination. BMJ 1975;1:447–51. 5. Newble D. Techniques for measuring clinical competence: objective structured clinical examinations. Med Educ 2004; 38:199 –203. 6. Mir MA, Marshall RJ, Evans RW, Hall R, Duthie HL. Comparison between videotape and personal teaching as methods of 876 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. Vivekananda-Schmidt et al communicating clinical skills to medical students. Br Med J (Clin Res Ed) 1984;289:31– 4. Marita P, Leena L, Tarja K. Nurses’ self-reﬂection via videotaping to improve communication skills in health counselling. Patient Educ Couns 1999;36:3–11. Lane JL, Gottlieb RP. Improving the interviewing and selfassessment skills of medical students: is it time to readopt videotaping as an educational tool? Ambul Pediatr 2004;4: 244 – 8. Thorburn J, Dean M, Finn T, King J, Wilkinson M. Student learning through video assessment [review]. Contemp Nurse 2001;10:39 – 45. Winters J, Hauck B, Riggs CJ, Clawson J, Collins J. Use of videotaping to assess competencies and course outcomes. J Nurs Educ 2003;42:472– 6. Hill R, Hooper C, Wahl S. Look, learn, and be satisﬁed: video playback as a learning strategy to improve clinical skills performance. 
J Nurses Staff Dev 2000;16:232–9. Yudkowsky R, Downing S, Klamen D, Valaski M, Eulenberg B, Popa M. Assessing the head-to-toe physical examination skills of medical students. Med Teach 2004;26:415–9. Ritchie PD, Cameron PA. An evaluation of trauma team leader performance by video recording. Aust N Z J Surg 1999;69: 183– 6. Vogt VY, Givens VM, Keathley CA, Lipscomb GH, Summitt RL Jr. Is a resident’s score on a videotaped objective structured assessment of technical skills affected by revealing the resident’s identity? Am J Obstet Gynecol 2003;189:688 –91. Humphris GM, Kaney S. The Objective Structured Video Exam for assessment of communication skills. Med Educ 2000;34:939 – 45. Smit GN, van der Molen HT. Development and evaluation of a video test for the assessment of interviewing skills. J Cancer Educ 1995;10:195–9. Ram P, Grol R, Rethans JJ, Schouten B, van der Vleuten C, Kester A. Assessment of general practitioners by video observation of communicative and medical performance in daily practice: issues of validity, reliability and feasibility. Med Educ 1999;33:447–54. Vivekananda-Schmidt P, Lewis M, Hassell AB, and the ARC 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. Virtual Rheumatology CAL Research Group. Cluster randomized controlled trial of the impact of a Computer-Assisted Learning package on the learning of musculoskeletal examination skills by undergraduate medical students. Arthritis Rheum 2005;53:764 –71. Hodges B, McIlroy JH. Analytic global OSCE ratings are sensitive to level of training. Med Educ 2003;37:1012– 6. Fleiss JL. Measuring agreement between two judges on the presence or absence of a trait. Biometrics 1975;31:651–9. Fleiss JL. The design and analysis of clinical experiments. New York: John Wiley & Sons; 1986. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159 –74. Hoving JL, Buchbinder R, Green S, Forbes A, Bellamy N, Brand C, et al. 
How reliably do rheumatologists measure shoulder movement? Ann Rheum Dis 2002;61:612– 6. Feinstein AR, Cicchetti DV. High agreement but low kappa. I. The problems of two paradoxes. J Clin Epidemiol 1990;43: 543–9. Hasnain M, Onishi H, Elstein AS. Inter-rater agreement in judging errors in diagnostic reasoning. Med Educ 2004;38: 609 –16. Coady D, Walker D, Kay L. Regional Examination of the Musculoskeletal System (REMS): a core set of clinical skills for medical students. Rheumatology (Oxford) 2004;43: 633–9. Beal MC, Patriquin DA. Interexaminer agreement on palpatory diagnosis and patient self-assessment of disability: a pilot study. J Am Osteopath Assoc 1995;95:97–106. Branch VK, Lipsky PE. Positive impact of an intervention by arthritis educators on retention of information, conﬁdence, and examination skills of medical students. Arthritis Care Res 1998;11:32– 8. Newble DI, Hoare J, Elmslie R. The validity and reliability of a new examination of the clinical competence of medical students. Med Educ 1981;15:46 –52. Wakeﬁeld J. Direct observation. In: Neufeld VR, Norman GR, editors. Assessing clinical competence. New York: Springer Publishing Company; 1985. p. 51–71.