Introduction
A sharp decline in the use of general anesthesia (GA) for cesarean delivery (CD) has led to significant reductions in maternal morbidity and mortality in recent decades, but GA-related complications and failed intubation rates among parturients compared with nonpregnant women are still high (1:500) due to intrinsic maternal, fetal, and situational factors. 1 Physiologic changes of pregnancy (PCP), such as more vascular and edematous airway mucosa, increase the difficulty of intubation. 2 Additionally, GA is usually conducted as an emergency in obstetric units remote from assistance, and with time pressure resulting in poor preparation and performance of technical tasks. 2 Furthermore, the practice shift favoring neuraxial anesthesia over GA has not only led to a decline in airway skills 2 but has created the unintended consequence of decreased clinical exposure of anesthesiology trainees to this scenario, necessitating the development of alternative approaches to teaching the necessary knowledge and skills.3,4
In 2016, our research group developed a novel 3-D serious video game, 5 EmergenCSimTM, designed to teach novice first-year anesthesia (CA-1) residents the knowledge and skills to perform GA for emergency CD. 6 Electronic feedback appears automatically at the end of gameplay explaining the expected actions and the underlying rationale for this clinical scenario. A detailed description of the game design and development has been previously published with a report of a single-blinded longitudinal experiment where novice CA-1 residents were randomized to play EmergenCSimTM or a noncontent specific (sham) game that had no embedded electronic feedback. 6 There was no difference between experimental groups in test scores, but a post hoc exploratory analysis found a slight improvement in male residents’ scores over time, suggesting that gender may impact learning outcomes with serious games.
We developed 2 parallel forms of a criterion-referenced multiple-choice test (Form A and Form B) for use in a randomized experiment to compare knowledge gains after playing EmergenCSimTM. Residents would be randomized to experience, after playing the game, either the game-embedded electronic feedback alone (control group) or electronic feedback plus in-person debriefing (intervention group). Form A was designed to be used as a pretest at baseline and Form B as a posttest after playing EmergenCSimTM and receiving one of these 2 debriefing conditions. Our experimental hypothesis was that the group receiving the additional in-person debrief would demonstrate superior improvement in knowledge and skills after playing the game, evidenced by greater improvement in mean multiple-choice test scores from baseline. Pre- and posttests that use different questions but measure the same content are a more robust approach to measuring learning gains because they avoid measurement error due to subjects memorizing answers or learning from the pretest. 7 Validation of the instruments measuring the study's primary outcome was considered essential for adding rigor to the planned experiment, so that the study's results could be trusted. Here, we describe the multiphase design process for the development, content validation, and empirical validation of scores from the parallel test forms.
Materials and Methods
The Columbia University Institutional Review Board approved this study (Protocol #: AAAR6903) in December 2017. Study activities involving educational tests in which the identity of the human subjects cannot readily be ascertained, directly or through identifiers linked to the subjects, were considered exempt.
Test development comprised 4 phases, according to Chatterji's process model, a framework that takes an intentionally integrated approach to design and validation, situated in, and guided by, the specified test user contexts. 8 The process model draws on long-standing theory on assessment design and validity tied to the intended interpretations and uses of test scores and is broadly consistent with the
(1) Assessment Purpose and Population Specification: The total scores from each form were intended to permit inferences on the absolute proficiency levels of CA-1 residents on the tested domain, justifying their use as outcome measures in the longitudinal experiment. The targeted population is novice CA-1 anesthesia trainees, not previously exposed to obstetric anesthesia.
(2) Specification of the Content Domains and Writing/Selection of Items by one Internal and 2 External Experts: The tested domain was resident knowledge regarding the conduct of GA for emergency CD. The expected learning outcomes for each subconstruct domain were:
Physiologic Changes of Pregnancy (9 items): The physician can describe the normal PCP (eg, airway changes, pulmonary changes) which underlie the differences in management of GA in a pregnant versus a nonpregnant patient.
Pharmacology (PHA, 4 items): The physician can correctly apply understanding of pharmacokinetic and pharmacodynamic changes in pregnancy to appropriately manage medications utilized during GA for CD.
Anesthetic Implications of Pregnancy (7 items): The physician can correctly apply underlying knowledge about the PCP and drug pharmacokinetic/pharmacodynamic changes unique to pregnancy, to make appropriate clinical decisions when performing GA for CD.
Crisis Resource Management (CRM, 6 items): The physician can correctly identify crisis management, communication, and teamwork skills during emergency GA for CD, when given scenarios.
(3) Design of Matched Pairs of Items for New Parallel Test Forms: For designing the parallel items, we created distinct pairs of items that measured the same specific content and cognitive skill area, while balancing the item distribution with respect to 3 levels of cognitive demand: (i) concept recall and understanding, (ii) application, and (iii) higher order thinking. A table of “Test Design Specifications” categorizing the items in each subdomain according to the 3 cognitive levels was created, to achieve equal proportions of multiple-choice items for each cell. 19
(4) Content Validation by Experts: The annotated question bank was content validated by 3 obstetric anesthesia experts (SG, MB, and RM), 2 of whom were external. Seventy-two questions were reviewed; the prior discrimination index values for retained questions and the original versions of revised items were provided.
The essential skills and knowledge were based on Scavone et al's 16 content-validated weighted behavior checklist developed for this scenario. The obstetric anesthesia rotation assigned textbook was the main reference. 17 For each competency identified for the subconstructs, at least 1 item, comprising a stem, 1 correct answer, and 3 distractors, was developed. To start, an initial pool of questions was generated from a previously validated and field-tested instrument. 18 Weaknesses identified were addressed by dropping poorly performing items (5 prior items: #2, #10, #20, #23, and #28) determined to be too easy and/or not aligned with content covered in the game, or by revising highly content-relevant items that failed to discriminate between novices and experts (items with D values <15). Additionally, new questions were written, ensuring there were at least 2 parallel items for each form. Question writers were instructed to aim for a “higher order thinking” cognitive level that tests applied knowledge.
The experts were asked to rate the content relevance of each item on a 4-point anchored scale, where 0 = not relevant, 1 = somewhat relevant, 2 = quite relevant, and 3 = highly relevant, and to provide comments/criticisms about the questions. They were also invited to propose new questions according to the specifications explained.
In each round of evaluation, an item-level content validation index (I-CVI) was calculated based on the number of experts rating relevance of an item as a 2 or 3, divided by the total number of experts (the proportion that were in agreement about relevance). 20 An I-CVI greater than 0.78 is considered excellent regardless of the number of experts. 20 Feedback given was used to revise the items and experts rated the revised items (a total of 3 rounds) until consensus was achieved regarding the design and relevance of all items. Items that performed very poorly and were not considered relevant were removed from the test. This ultimately yielded a total of 26 paired items (52 in all) to be allocated to finalized versions of each parallel test form (copies of the shuffled questions for both forms and answer keys are in Supplemental File 1).
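The I-CVI calculation described above can be sketched in a few lines of Python. This is only an illustration: the function name `item_cvi` and the example ratings are hypothetical and are not study data.

```python
def item_cvi(ratings, relevant_min=2):
    """Item-level content validity index (I-CVI).

    ratings: relevance ratings for one item on the 0-3 anchored scale,
             one rating per expert.
    Returns the proportion of experts rating the item 2 or 3.
    """
    if not ratings:
        raise ValueError("at least one expert rating is required")
    relevant = sum(1 for r in ratings if r >= relevant_min)
    return relevant / len(ratings)


# Hypothetical ratings from 3 experts for two items (illustration only).
example_ratings = {
    "item_1": [3, 2, 3],  # all three experts rate the item as relevant
    "item_2": [3, 1, 2],  # one expert rates the item below 2
}
cvi_by_item = {name: item_cvi(r) for name, r in example_ratings.items()}
```

With 3 experts the index can only take the values 0, 1/3, 2/3, or 1, so an item exceeds the 0.78 threshold only when all 3 experts rate it as a 2 or 3.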
(5) Field Testing and Empirical Validation: When the pool of 52 questions was finalized, shuffled items from both forms were uploaded to an online platform (QualtricsXM). The combined test was distributed, along with instructions, via an email-embedded link to volunteer participants from 3 institutions (Jackson Memorial Hospital/University of Miami, NewYork-Presbyterian Hospital/Columbia University, and Massachusetts General Hospital/Harvard University). The email explained that responses would be collected anonymously, and that completing the test would be taken as agreement to participate in the study. Eligible participants were trainees at multiple levels of experience, with and without prior obstetric anesthesia experience, and obstetric anesthesia fellowship-trained faculty. The only exclusion criterion was refusal to participate. Residency class sizes nationally in the United States are generally small (mean 13; SD 7; range 3-30). 21 Trainees at the 3 participating institutions were invited to participate. An intact cohort of CA-1 anesthesiology residents was successfully recruited from the University of Miami, along with more senior trainees from the 3 institutions. The available pool of fellows (N = 2) and obstetric anesthesia fellowship-trained experts from which volunteer participation was solicited at one of the institutions was also naturally small (total N = 7).
Survey items were included at the end of the test to collect background information from participants on institution, current training status (PGY-1 through fellow, or faculty), age group (<25 years, 26-30 years, 31-35 years, 36-40 years, ≥41 years), self-reported gender, number of prior obstetric anesthesia rotations, and prior experience performing GA for either CD or nonobstetric surgery in pregnant women.
(6) Psychometric Analysis: Item analysis statistics, namely pi (item difficulty) and D (discrimination index) values, were obtained using Classical Test Theory (CTT) techniques. 22 The value of pi, which represents the proportion of examinees who answer an item correctly, may range from 0.00 to 1.00. The discrimination index in this case indicates how well the item discriminates between the novices (juniors who have never completed an obstetric anesthesia rotation) and the experienced (those who have completed at least one obstetric anesthesia rotation). Item D values should be evaluated in concert with item pi values. Multiple-choice item pi values should not be at the chance level. For multiple-choice items with 4 options, the chance level (suggesting random responses) is 0.25 (25% identify the correct response). To enhance validity per CTT, criterion-referenced interpretations of scores on a proficiency test require that items discriminate sufficiently between experienced persons (those with prior exposure to the domain) and novices (those who are unexposed to the same). Guidelines suggest that, assuming items meet pi criteria, an item with a negative D or D < 0.10 should be removed or examined closely, and an item with a positive D ≥ 0.20 is functioning well.22‐24 A D of 0 suggests no discrimination, which is not an ideal result for a criterion-referenced test, suggesting that the item is too difficult or too easy for all levels of examinees.
(7) Reliability of Scores: Internal consistency reliability estimates for subdomain and total scores were determined using KR-20, and the parallel forms reliability estimate was obtained as Pearson's correlation between the total scores on the 2 test forms.
(8) Validity evidence based on expected group differences. 9 We hypothesized that greater experience would be associated with higher scores on the test. The overall sample comprised 49 subjects. All analyses were performed using SPSS statistical software (version 20.0; IBM Corporation, Armonk, NY).
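As a minimal sketch of the item screening rules described under Psychometric Analysis, the Python below computes item difficulty and a group-based discrimination index. It assumes D is the difference in proportion correct between the experienced and novice groups, which is one common criterion-referenced formulation and may differ from the exact computation performed in SPSS. The function names, the "borderline" label for D between 0.10 and 0.20, and the example responses are hypothetical.

```python
def item_difficulty(responses):
    """Proportion of examinees answering the item correctly (responses are 0/1)."""
    if not responses:
        raise ValueError("no responses supplied")
    return sum(responses) / len(responses)


def discrimination_index(experienced, novices):
    """Assumed form of D: proportion correct among experienced minus novices."""
    return item_difficulty(experienced) - item_difficulty(novices)


def flag_item(p, d, chance=0.25):
    """Apply the guideline thresholds quoted in the text to one 4-option item."""
    if p <= chance:
        return "at or below chance"
    if d < 0.10:  # includes negative D
        return "remove or examine closely"
    if d >= 0.20:
        return "functioning well"
    return "borderline"  # 0.10 <= D < 0.20: label is an assumption, not from the text


# Hypothetical 0/1 responses to a single item (illustration only).
experienced = [1, 1, 1, 0, 1]
novices = [1, 0, 0, 1, 0]

p = item_difficulty(experienced + novices)
d = discrimination_index(experienced, novices)
status = flag_item(p, d)
```

Screening on difficulty first and discrimination second mirrors the statement that D values should be read in concert with pi values.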
A detailed description of the development of EmergenCSimTM, and the complete list of items in the electronic feedback checklist, including the weighted scoring for each item, have been previously published. 6 Congruent with the knowledge test, the checklist of expected actions within the game and the weighted scoring system were based on the content-validated weighted behavior checklist developed for this scenario by Scavone et al. 16
For the subsequent experiment, the 10-min in-person semistructured debriefing integrated concepts from the Promoting Excellence and Reflective Learning in Simulation (PEARLS) debriefing framework 25 and was conducted by AL. Subjects were invited to reflect on their actions taken in the game and the aspects of clinical management within the scenario using questions such as “Can you walk me through what you were thinking when you were asked to put this patient to sleep emergently?” and “Were there any aspects of the explanations given that you did not understand or need help clarifying?” Whenever gaps in knowledge or understanding of the concepts being taught were identified, direct teaching was provided. Strategies for scoring better in the game were not discussed.
Results
Field-testing occurred from July to December 2019 on a sample of CA-1 residents (N = 24), CA-2 residents (N = 21), CA-3 residents (N = 1), fellows (N = 2), and faculty anesthesiologists (N = 1) (total N = 49) from the 3 US medical institutions described above. The demographics and background characteristics of participants are described in Tables 1 and 2, respectively. Items were scored with a binary key denoting right (1 point) and wrong (0 points) answers. The 52 items were separated randomly by cells of the Table of Test Specifications into 2 different parallel examinations: “Form A” and “Form B.” Each parallel form yielded 4 subdomain scores and a total score, which were investigated for overall construct validity.
Table 1. Subject demographic variables (N = 49).
a Q53 and Q59 are nominal scale.
Table 2. Subject background characteristics (N = 49).
The 49th subject showed several missing values in the item response data. This subject did not answer 13 out of 52 items from Forms A and B combined. The data on all items for which valid responses had been received were retained, with the missing responses scored as incorrect. Given the relatively small sample size, this treatment allowed the retention of greater information regarding the item responses as well as avoiding the introduction of unacceptable bias into the analysis.
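The scoring rule described above (binary key, with missing responses scored as incorrect) can be illustrated with a short Python sketch. The function name, the use of `None` to represent a skipped item, and the example answers are assumptions made for illustration only.

```python
def score_responses(responses, answer_key):
    """Score each item 1 (right) or 0 (wrong); a missing response (None) scores 0."""
    if len(responses) != len(answer_key):
        raise ValueError("responses and answer key must have the same length")
    return [
        1 if given is not None and given == correct else 0
        for given, correct in zip(responses, answer_key)
    ]


# Hypothetical 4-item example: the second item was skipped.
answer_key = ["A", "B", "D", "B"]
responses = ["A", None, "C", "B"]

item_scores = score_responses(responses, answer_key)
total_score = sum(item_scores)
```

Scoring a skipped item as 0 keeps the examinee in every item-level analysis, at the cost of treating a nonresponse the same as a wrong answer.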
Measurement theory on parallel forms test design assumes that the forms have similar true score and error score distributions. The “error” in measurement refers to the discrepancy between the observed score of the test taker and their “true score,” which is the average score that would be observed from many repeated testings. 26
Consistent with CTT expectations, 22 both test forms yielded similar score distributions that were near-normal (Figure 1A), with medians, ranges, and standard deviations that were less than 2 raw score points apart—the median score (out of maximum 26) for Form A was 14 and for Form B was 13 (Table 3). The KR-20 22 values calculated were well above the minimum of 0.70, with a robust parallel forms reliability of 0.86.
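The parallel forms reliability reported above is a Pearson correlation between examinees' total scores on the 2 forms. A minimal Python sketch of that calculation follows; the function name and the example totals are hypothetical and do not reproduce the study data or the reported coefficient.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list has no variance")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical total scores (out of 26) for the same 5 examinees on each form.
form_a_totals = [14, 10, 18, 12, 16]
form_b_totals = [13, 9, 17, 13, 15]

parallel_forms_reliability = pearson_r(form_a_totals, form_b_totals)
```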

Figure 1. (A) Total score distribution for Forms A and B; medians, ranges, and standard deviations were less than 2 raw score points apart. The median score (out of a maximum of 26) was 14 for Form A and 13 for Form B. (B) Item difficulty distribution for Forms A and B; item p values showed a near-normal distribution on Form A, but a flatter distribution on Form B.
Table 3. Descriptive statistics on the total score of parallel test forms A and B.
Joint evaluation of CTT item difficulty and discrimination values indicated that the majority of items performed per the assumptions of criterion-referenced test design, separating experienced residents from novices.
Item Analysis Statistics
Item p values showed a near normal distribution on Form A, but a flatter distribution with Form B (Figure 1B). The item statistics (see Supplemental Files 2 and 3) were calculated for the total sample and showed one problematic item on Form B (see Supplemental File 3—item #7B). A summary of the item analysis statistics is in Table 4.
Table 4. Summary item analysis statistics for Forms A and B.
a Discrimination is calculated based on adjusted point-biserial correlations.
b Easy items are those with discrimination less than 0.2 and difficulty greater than 0.8.
When the sample was broken down into experienced and novice groups to investigate if items functioned similarly in both groups, a few added items on Form A seemed to function in the reverse direction (negative D values) where novices performed better than experienced residents (see Supplemental File 2—items #16A, #18A, #21A, #23A, #26A).
Validity Evidence Based on Hypothesized Group Differences
Testing for hypothesized group differences on the overall construct domain measured (all 52 items) verified that greater seniority was associated with better performance on the test. Statistically significant differences were established between subgroups of the total sample broken down by year of residency, the number of rotations completed and more experience performing GA for CD (
Subdomain performance (Supplemental File 4): The reliability levels of the subdomain scores on both forms were generally low, at <0.70. This was possibly due to a combination of low homogeneity of content tested and too few items in each subdomain. Hence, only total scores are recommended for use for educational evaluations with either test form.
Discussion
We developed content-validated criterion-referenced parallel test forms, designed to test CA-1 novice residents’ knowledge regarding performance of GA for CD at baseline (pretest, Form A) and after playing a serious video game (posttest, Form B). The tests demonstrated item-level validity, strong internal consistency and parallel forms reliability, and validity based on expected group differences in performance of experienced versus novice residents. Together these results confirmed our hypotheses, suggesting that the scores are sufficiently valid and reliable for the purposes specified and consistent with the underlying construct theory about the measures.22,26
The study in which the tests were designed to be used would require novice first-year anesthesiology residents (CA-1) to take the 26-item pretest and play EmergenCSimTM. They would be randomized to either the control group which experienced the game-embedded electronic feedback alone or the intervention group, which experienced the electronic feedback and an in-person debriefing. All subjects would then play the game a second time, and take the 26-item posttest. The primary outcome was to be the difference between experimental groups in the change in mean score from pretest to posttest.
The favorable empirical performance of the test forms may be attributed to our systematic item development and content validation processes. 8 The process of construct domain analysis ensured that the construct was adequately represented by the pool of items developed. Test development was an iterative process and benefitted from data derived from earlier empirical validation, 18 which prompted dropping or revising items that were performing poorly. Content validation relied on expert ratings of relevance of individual items and computation of an I-CVI, which is an index of inter-expert agreement on item relevance. 20
The construction of a validity argument is based on collecting evidence to support inferences to be made from test scores. 9 Content validity indicates that the relationship between the content tested and thought processes of the test-takers and the intended construct is sound. 27 The advantage of CVI over other computed approaches, such as consistency estimates, consensus estimates, and measurement estimates, is the ease of computation; however, one drawback is a higher risk of chance agreement between experts. 20
Our empirical validation steps focused on estimating a parallel forms reliability coefficient and investigating the internal consistency reliability of each test form with the KR-20 formula. 22 The KR-20 compares the sum of the item score variances in the numerator with the variance of the summated total score on an instrument in the denominator and can therefore be interpreted as another measure of item homogeneity.
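To make the KR-20 description concrete, the sketch below implements the standard formula, KR-20 = (k / (k - 1)) × (1 - Σ p_i q_i / σ²_total), for a matrix of 0/1 item scores. It assumes the population (divide-by-n) variance of the total scores, which matches the p × q form of the item variances; statistical packages may use a different variance convention. The function name and the toy data are hypothetical.

```python
def kr20(score_matrix):
    """Kuder-Richardson 20 for binary item scores.

    score_matrix: one row per examinee, one 0/1 entry per item.
    """
    n = len(score_matrix)
    k = len(score_matrix[0])
    if n < 2 or k < 2:
        raise ValueError("need at least 2 examinees and 2 items")

    # Variance of the summated total scores (denominator of the ratio).
    totals = [sum(row) for row in score_matrix]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    if var_total == 0:
        raise ValueError("KR-20 is undefined when total scores do not vary")

    # Sum of item variances p * (1 - p) (numerator of the ratio).
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in score_matrix) / n
        sum_pq += p * (1 - p)

    return (k / (k - 1)) * (1 - sum_pq / var_total)


# Toy data: 4 examinees, 3 items, with perfectly consistent responses.
toy_scores = [
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
]
toy_kr20 = kr20(toy_scores)
```

In the toy data every item ranks examinees identically, so the item variances account for as little of the total variance as possible and the coefficient reaches its maximum of 1.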
In future, construct validity of scores from the parallel test forms could be further investigated based on evidence of internal structure, gathered by performing exploratory and confirmatory factor analysis or a unidimensionality analysis with item response theory models. 28
The parallel forms reliability of 0.86 is well above the acceptable minimum standard of 0.70. 22 Our finding that seniority of the physicians was associated with better performance on the test provides added credence for the measures.
The test forms were developed for use in a randomized controlled trial exploring the educational utility of a novel serious video game. Given the results, an improvement in test scores from pretest to posttest in the forthcoming study can be interpreted meaningfully against the specified construct domain, and with precision. The 2 parallel forms were developed in order to limit a “testing effect,” a potential threat to the internal validity of the forthcoming experiment whereby test takers become familiar with the items and remember the responses for later testing. 29 Strong validity and reliability of our outcome measures were important in order to trust the outcomes of our research.
A limitation of our study is that we were not able to fully control test-taking conditions of volunteers in disparate locations and to exclude factors which may have exerted nonrandom influence on scores (bias or construct-irrelevant variance) or random measurement error. 27 This, however, would have applied to only a minority of subjects—the largest group of subjects (from University of Miami) took the test in one sitting. Another limitation is that a power analysis was not performed. A pragmatic approach was taken, soliciting volunteers from multiple institutions, given the fixed small residency and fellowship class sizes and small numbers of obstetric anesthesia fellowship-trained faculty. A total sample size of N > 30 was achieved, and a retroactive power analysis indicated that there would be >80% power for a 2-group independent means comparison between test forms.
One weakness of the parallel forms was relatively low reliability of the subdomain scores on both forms (<0.70), likely due to low homogeneity of the content tested and too few items by subdomain. This result suggests that the total scores on the forms should be used rather than the subdomain scores. Another weakness found was that an item on Form A (item 23A, Supplemental File 2) seemed to function in the reverse direction with negative D values and will need to be deleted or revised for future iterations. Particular attention will be paid to items with
Multiple-choice question-based testing is convenient and widely used for assessing knowledge in healthcare education and research. Scores must allow the intended interpretation in order to make the correct conclusions about learner knowledge and skills. Although the psychometric properties of our multiple-choice test forms are sound, this modality may be suboptimal for assessing other important behavioral domains of communication and CRM.
Conclusion
Outside of high-stakes testing situations, few validated parallel test forms exist to assess resident learning in domains related to obstetric anesthesia. Although specifically developed for assessing learning outcomes following a novel video game, we believe our parallel tests are sufficiently robust to be utilized for formative assessment of novice anesthesiology resident knowledge related to performing GA for CD, for which there is, unavoidably, diminishing exposure. Educators could use poor performance on the test to identify knowledge gaps, which could be addressed by assigning further reading, direct teaching or participation in other simulation-based teaching techniques.
Supplemental Material
sj-docx-1-mde-10.1177_23821205241229778 - Supplemental material for Validating Parallel-Forms Tests for Assessing Anesthesia Resident Knowledge
Supplemental material, sj-docx-1-mde-10.1177_23821205241229778 for Validating Parallel-Forms Tests for Assessing Anesthesia Resident Knowledge by Allison J. Lee, Stephanie R. Goodman, Melissa E. B. Bauer, Rebecca D. Minehart, Shawn Banks, Yi Chen, Ruth L. Landau and Madhabi Chatterji in Journal of Medical Education and Curricular Development
Supplemental Material
sj-doc-2-mde-10.1177_23821205241229778 - Supplemental material for Validating Parallel-Forms Tests for Assessing Anesthesia Resident Knowledge
Supplemental material, sj-doc-2-mde-10.1177_23821205241229778 for Validating Parallel-Forms Tests for Assessing Anesthesia Resident Knowledge by Allison J. Lee, Stephanie R. Goodman, Melissa E. B. Bauer, Rebecca D. Minehart, Shawn Banks, Yi Chen, Ruth L. Landau and Madhabi Chatterji in Journal of Medical Education and Curricular Development
