Abstract
Introduction
With dwindling educational resources and increasing student enrollments, multiple-choice testing (MCT) has become a ubiquitous assessment format in higher education. In fact, it is likely that MCT is the most widely used form of assessment in Canadian post-secondary introductory-level courses (DiBattista & Kurzawa, 2011; Slepkov & Godfrey, 2019; Suskie, 2018; Tobias & Raphael, 1996). While the proliferation of MCT has been driven by economic considerations such as the ease and automation of scoring, the technique has remained popular because well-constructed multiple-choice tests prove highly reliable, valid, and fair (Anderson & Biddle, 1975; Little & Bjork, 2015; Scott et al., 2006; Suskie, 2018; Wainer & Thissen, 1993). Much of the (ongoing) developmental research into MCT has been conducted in service of optimizing large high-stakes standardized tests (Haladyna, 2004; Moreno et al., 2006; Rodriguez, 2005), which are now almost exclusively offered in this format. Thus, there has been a large amount of research into ways of improving MCT, including widely available guidelines on best uses and best practices. Many of these guidelines have been adopted by those studying classroom tests and exams. However, the nature of classroom examinations is such that opportunities for maximizing their psychometric attributes (i.e., how well they function as measurement instruments) are limited. This is mostly due to the limited testing times, test lengths, student numbers, and opportunities for iterative test improvements, as compared with high-stakes standardized tests. In addition, there is a lack of information in the MCT development literature that is targeted toward instructor-designed and test-bank-based classroom tests.
There is a growing awareness of the need to improve classroom MCT (DiBattista & Kurzawa, 2011). Whether researchers are interested in assessing the quality of new experimentally designed tests or in reviewing the quality of testing in an academic program, such research must be contextualized with respect to typical classroom test attributes. The nature of examinations is such that tests are more shrouded in secrecy and security concerns than other aspects of course instruction. Thus, there is a scarcity of reports on the psychometrics of “typical” classroom MCTs. Many published examples of individual test attributes can be found, but because those studies are invariably reported by assessment experts working to improve testing in a particular academic program, there is likely a strong selection and publication bias in this literature. Thus, a particularly relevant baseline of classroom MCT use would comprise tests made and deployed by average practitioners. A key study that anchors the current work was conducted by DiBattista and Kurzawa (2011), who reported on the functioning of classroom multiple-choice tests at a typical Canadian university. While their work aimed to establish a representative survey of MCT attributes across their institution, they ultimately reported on the psychometric attributes of only 16 tests. To the best of our knowledge, that report remains the broadest publicly available survey of classroom MCT to date.
In this study, we present a large survey of test-level and item-level attributes of classroom multiple-choice tests offered at a primarily undergraduate Canadian university. In contrast to other studies that have looked at individual or small groups of tests, we report on 182 multiple-choice tests that span all undergraduate levels of education and a wide range of academic disciplines. The research objectives of this article are three-fold.
First, we aim to provide a representative sample of traditional multiple-choice tests in the context of higher-education classroom use. Because the analyzed tests are from a wide array of instructors, courses, instructional levels, and disciplines, the primary aim of this work is to provide a representative and useful baseline of classroom MCT psychometrics for future comparisons by practitioners and researchers.
Second, because classroom tests vary widely in operational attributes such as length and the number of test-takers, the flaws of some commonly used measures of test quality can become exaggerated and lead to conflicting conclusions. Such statistical drawbacks are often inconsequential for large (and optimized) standardized tests. Alternatively, more sophisticated item analysis tools such as item response theory, Rasch models, and G-theory are often employed in the analysis of large high-stakes standardized tests, but most of these are unlikely to be adopted by classroom test creators. Thus, in this report, we aim to shed new light on some drawbacks inherent to commonly reported classical item analysis methods, and to subsequently offer recommendations for simple modifications and best practices that will facilitate future comparisons between classroom tests. In particular, we discuss the advantages of using item-excluded correlations as measures of item discrimination, and of using length-normalized test reliability parameters.
Third, our first two objectives combine to provide updated and empirically driven guidelines for assessing the strengths of classroom MCTs. By establishing a representative distribution of item attributes and test psychometrics, we are able to offer data-driven recommendations for what constitutes average, above average, and exceptional measures of item discrimination and test reliability. Furthermore, our survey clarifies long-standing concerns regarding appropriate levels of item difficulty, the prevalence of various option-number items, as well as some presumptive quirks in classroom test design such as easy “confidence boosters.”
Method
Data Collection
All the analyzed tests were deployed in midterm or final examinations between 2013 and 2019 at Trent University, a small undergraduate-focused institution located in Ontario, Canada, with approximate undergraduate enrolment of 10,000 and faculty complement of 250. The majority of tests were administered and processed with standard Scantron® forms and optical mark recognition software. A few tests were manually entered into a compatible digital format to allow for their inclusion in the study. A wide range of course instructors was solicited to supply raw multiple-choice test data. Potential instructors were initially identified from their use of centralized Scantron tools. Instructors were then contacted by email, requesting authorization of the release of anonymized test data. In an attempt to present the most representative sample of institutional MCT use possible, a handful of tests were solicited from instructors of key disciplines who do not use the university’s centralized Scantron processing office. No data were used from instructors who failed to give permission or who subsequently withdrew from the study. No instructor, test-length, quality, or discipline-based criteria were used to exclude data from the survey. Once received, all tests were checked for student anonymity, and anonymized if needed. The majority of collected tests exclusively comprised multiple-choice items. In the rare cases where the multiple-choice section was only a component of the test, the non-multiple-choice items were removed from the analysis and the remainder was treated as a standalone assessment tool (i.e., a test). For this study, we are only concerned with traditional “single-response” MCT that is scored dichotomously, without partial credit and without penalty for guessing. Some tests included multiple-choice items that fell outside of these criteria, including clear cases of true-false items, multiple-selection items, and items scored with more than one correct option. In such cases, these items were likewise excluded from the analysis. In total, 368 items were excluded from the data of 39 tests. Of these, 222 were true-false items from 12 tests. In total, we survey the functioning of 182 tests from 45 different instructors, spanning 12 academic disciplines. The data include analysis of a total of 11,246 multiple-choice items and 24,885 student-tests. All item analysis and statistical tests reported herein were conducted in R (version 3.6.3; R Core Team, 2019), with custom-written scripts (Van Bussel et al., 2019), an R package (Van Bussel & Fitze, 2019), and a Shiny applet (GUI) for easy practitioner access (Van Bussel & Burr, 2019).
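To make the scoring scheme concrete, the following minimal R sketch (illustrative only; it is not the study's actual processing script, and all names are hypothetical) scores a matrix of raw option selections dichotomously against an answer key, with no partial credit and no penalty for guessing:

```r
# Illustrative sketch only (not the study's scripts; names are hypothetical).
# 'responses' is a student-by-item matrix of selected options ("A"-"E"),
# and 'key' is the vector of correct options, one entry per item.
score_test <- function(responses, key) {
  scored <- sweep(responses, 2, key, FUN = "==") * 1  # TRUE/FALSE -> 1/0
  scored[is.na(scored)] <- 0                          # blanks earn zero credit
  scored
}

# Toy example: 3 students, 4 items
responses <- matrix(c("A", "B", "C", "D",
                      "A", "C", "C", "D",
                      "B", "B", "C", "A"),
                    nrow = 3, byrow = TRUE)
key <- c("A", "B", "C", "D")
X <- score_test(responses, key)
rowSums(X)  # total test scores: 4 3 2
```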
Item Analysis
Item difficulty
Individual units of evaluation on MCTs are referred to as items. The most common measure of a test item’s functionality is the item’s difficulty. The difficulty index, $p$, is defined as the proportion of test-takers who answer the item correctly,

$$p = \frac{n_c}{N},$$

where $n_c$ is the number of examinees who select the keyed (correct) option and $N$ is the total number of examinees. Despite its name, larger values of $p$ therefore correspond to easier items.
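As an illustration (continuing the hypothetical 0/1 score matrix X from the sketch above), item difficulty is simply the column mean of the scored response matrix:

```r
# Item difficulty: proportion of examinees answering each item correctly.
# X is a 0/1 matrix with one row per student and one column per item.
item_difficulty <- function(X) colMeans(X)

item_difficulty(X)  # for the toy matrix above: 0.67 0.67 1.00 0.67
```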
Item discrimination
An important aspect of any reliable classroom test is its ability to differentiate between students with strong knowledge of the subject and students with poor knowledge. Ideally, each item gives a small measure of such distinction, and the combination of a group of items allows the test on the whole to discriminate better between more and less knowledgeable students. Under this framework, the discrimination of an individual item is most commonly quantified by the point-biserial correlation between the dichotomous item score and the total test score,

$$r_{pb} = \frac{\bar{X}_1 - \bar{X}_0}{\sigma_X}\sqrt{p(1-p)},$$

and by the corresponding item-excluded (corrected) correlation,

$$r_{ie} = \mathrm{corr}\left(x_i,\; X - x_i\right),$$

where $x_i$ is the dichotomous score on item $i$, $X$ is the total test score, $\bar{X}_1$ and $\bar{X}_0$ are the mean total scores of examinees who answer the item correctly and incorrectly, respectively, $\sigma_X$ is the standard deviation of the total scores, and $p$ is the item’s difficulty.
As an item-total correlation, the point-biserial is somewhat problematic conceptually. This is because the item score makes up part of the total test score and thus adds spurious weight to the total. Mathematically, this means that $r_{pb}$ is systematically inflated relative to the item-excluded correlation $r_{ie}$, with the inflation becoming more pronounced as the number of items on the test decreases.
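The sketch below (again illustrative rather than the study's own code) computes both the item-included point-biserial and the item-excluded correlation for every item of a 0/1 score matrix; comparing the two columns shows the inflation described above directly. Widely used item-analysis tools report the item-excluded form as a "corrected" item-total correlation (for example, the r.drop statistic returned by the psych package's alpha() function).

```r
# Item-included (point-biserial) vs. item-excluded discrimination indices.
# X is the 0/1 student-by-item score matrix.
item_discrimination <- function(X) {
  total <- rowSums(X)
  k <- ncol(X)
  r_pb <- numeric(k)   # correlation of the item with the full total score
  r_ie <- numeric(k)   # correlation with the total after removing the item
  for (j in seq_len(k)) {
    r_pb[j] <- cor(X[, j], total)
    r_ie[j] <- cor(X[, j], total - X[, j])
  }
  # Items answered identically by every examinee have zero variance and return NA.
  data.frame(item = seq_len(k), r_pb = r_pb, r_ie = r_ie)
}
```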
Test reliability
The consistency with which a test can be used as a tool for accurately ranking student knowledge is known as test-score reliability. The most widely reported measure of internal-consistency reliability is Cronbach’s α,

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_X^2}\right),$$

where $k$ is the number of items on the test, $\sigma_i^2$ is the variance of the scores on item $i$, and $\sigma_X^2$ is the variance of the total test scores.
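As a concrete illustration, α can be computed in a few lines of R directly from the 0/1 score matrix (a sketch with hypothetical names; dedicated packages report the same quantity along with further diagnostics):

```r
# Cronbach's alpha from a 0/1 student-by-item score matrix X
cronbach_alpha <- function(X) {
  k <- ncol(X)                   # number of items
  item_var <- apply(X, 2, var)   # variance of each item's scores
  total_var <- var(rowSums(X))   # variance of the total test scores
  (k / (k - 1)) * (1 - sum(item_var) / total_var)
}
```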
In the “Recommendations” section, we use the distribution of obtained test reliabilities to provide updated recommendations for targeted values of α for classroom multiple-choice tests.
Notation
In presenting the numerical results from our dataset of multiple-choice tests in the following sections, we follow two conventions. First, when presenting summary statistics, we report the statistic along with its standard deviation in parentheses; for example, a mean test length of 62 items with a standard deviation of 22 items is written as 62 (22).
Summary Measures for 182 Classroom Multiple-Choice Tests.
Results and Discussion
The MCT Context
As university enrollments have exploded in recent years, upper-year courses have grown sufficiently large that the use of MCT is now commonplace across all levels of instruction. Accordingly, the plurality (43%) of tests in our survey come from first-year (i.e., Freshman) courses, over half of the tests come from second-year (26%) and third-year (25%) courses (i.e., Sophomore and Junior), and 6% of tests are obtained from fourth-year (i.e., Senior) courses. The abundance of multiple-choice tests at the Sophomore and Junior levels is a reflection of the fact that a larger number of courses are offered at these instructional levels, compared with Freshman courses that are less numerous but are larger and invariably use MCT. Class sizes vary widely across the institution. This fact is reflected in the distribution of the number of students writing each test, shown in Figure 1A.

Distribution of test cohort size and test length. (A) distribution of number of students per test. (B) distribution of the number of multiple-choice items per test.
Surprisingly, we find the practice of deploying multiple versions of the same test to be common at all class sizes larger than 40 students. Thus, the data shown in Figure 1A include multiple versions of some tests analyzed as independent tests. The prevalence of MCT at the Junior and Senior levels is reflected by the abundance of MCTs taken by cohorts of 40 to 120 students. Tests of more than 250 students are invariably from introductory-level courses in this corpus.
Classroom multiple-choice tests also vary in terms of length, as can be seen in Figure 1B. In practice, the number of items generally reflects the length of the test in time, with final examinations that span 2 to 3 hr (often, but not always) employing more items than midterm examinations that span 1 to 2 hr. While surveyed tests with more than 50 items were almost exclusively final examinations, several final examinations comprised fewer than 40 questions. As seen in Figure 1B, test sizes ranged from less than 20 items to more than 100 items. Our data collection protocols are blind to cases where the multiple-choice component is only one part of the total examination. Nonetheless, all surveyed tests with fewer than 50 items were deployed with Scantron cards that were auto-scored by computer. The mean number of multiple-choice items in a given test is 62 (22), with the smallest MCT component being 17 items, and the largest being 106.
Prevalence of 3-, 4-, and 5-Option Items
Establishing the optimal number of options offered within MCT items has been an active area of research (Owen & Froman, 1987; Raymond et al., 2019; Rodriguez, 2005). Traditionally, the use of between 3 and 5 options has been most common. Deciding on the number of options to offer in a classroom examination represents a balance of considerations: the primary motivation for offering more options is a desire to mitigate the effects of student guessing, but in practice this desire is beset by the difficulty of writing large numbers of viable distractors (non-keyed options). The disadvantage of deploying nonfunctional distractors (broadly defined by Raymond et al. [2019] as options that elicit negligible attention and prove non-discriminating) is that they take up both time and cognitive space in the test and risk lowering both test validity and score reliability. Studies of the optimal number of options consistently find that the 3-option format is best, at least from a psychometric standpoint (Rodriguez, 2005). However, despite best-practice recommendations for the exclusive use of 3-option items, 4- and 5-option items are commonly offered by educational publisher test banks, and the distribution of option-number item types in instructor-created classroom tests remains undiscussed in the literature.
Our content-agnostic data collection method precludes an absolute determination of item scoring rules for every item. For example, a single student who selects a non-offered option (such as selecting F on a 5-option A–E item) is sufficient to mis-identify the item as a 6-option item when responses are examined anonymously, as was done here. Nonetheless, we are able to positively code the vast majority of test items in the survey. Among all items, 31% are found to be of the 5-option type, 56% are 4-option, and only 6% are 3-option, with 7% of items remaining uncategorized or miscoded. Furthermore, most instructors appear to opt for heterogeneous mixed-type tests. We define a test as homogeneous (in terms of option-number type) if the modal type represents over 90% of the total. Based on this definition, we find that 40% (71 of 182) of the tests are homogeneous, and 60% (111 of 182) are heterogeneous. Of the homogeneous tests, 65% comprise 4-option items and 35% comprise 5-option items; that is, none use 3-option items. In fact, not only is the 3-option format unpopular in homogeneous tests, it also does not comprise a significant proportion of mixed-type tests. Within heterogeneous tests, 32% of items are 5-option, 50% are 4-option, and only 8% are 3-option (with the remaining 10% being of ambiguous categorization). Thus, there is a clear and persistent gap between best-practice recommendations for high-stakes MCTs and in-practice deployment of classroom tests.
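To make this classification concrete, the sketch below (hypothetical names, not the survey's coding scripts) labels a test as homogeneous when its modal option-number type exceeds the 90% threshold defined above:

```r
# Classify a test as homogeneous or heterogeneous by option-number type.
# 'n_options' gives the number of options offered by each item on the test.
classify_test <- function(n_options, threshold = 0.90) {
  counts <- table(n_options)
  modal_share <- max(counts) / length(n_options)  # share of the modal type
  if (modal_share > threshold) "homogeneous" else "heterogeneous"
}

# Example: 50 four-option and 10 five-option items (modal share 0.83)
classify_test(c(rep(4, 50), rep(5, 10)))  # "heterogeneous"
```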
Item Difficulty and Test Scores
All classroom tests are a blend of questions of varying difficulty. When considering the opportunity for guessing inherent in MCT, the psychometrically optimal item difficulty should be near the midpoint between the expectation for guessing and a perfect score (Allen & Yen, 2001; Doran, 1980; Lord, 1953). Thus, depending on the number of options available for an item (e.g., 3, 4, or 5), the optimal item difficulty should be in the range from 0.60 to 0.67 (chance scores ranging from 0.20 to 0.33; perfect score being 1.0). Across all items, our surveyed mean item difficulty is

Distribution of test item difficulty, $p$, for 11,246 classroom test items.
Most university courses—particularly at the introductory level—are surveys of topics rather than sequences of culminating knowledge. Thus, it may not be expected that item difficulty should increase (i.e., that the difficulty index $p$ should decrease) as a test progresses.
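One simple way to examine such a within-test progression (a sketch of the general approach with illustrative names, not necessarily the exact procedure behind Figure 3) is to pool item difficulties by fractional item position across tests:

```r
# Mean item difficulty as a function of item position, pooled across tests.
# 'tests' is a list of 0/1 score matrices (students x items), one per test.
difficulty_by_position <- function(tests, n_bins = 10) {
  pos <- unlist(lapply(tests, function(X) {
    k <- ncol(X)
    (seq_len(k) - 0.5) / k              # fractional position of each item
  }))
  p <- unlist(lapply(tests, colMeans))  # difficulty of each item
  bins <- cut(pos, breaks = seq(0, 1, length.out = n_bins + 1),
              include.lowest = TRUE)
  tapply(p, bins, mean)                 # mean difficulty within each position bin
}
```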

Progression of mean item difficulty within tests.
Anticipated test scores are a primary consideration in classroom test design. Because item difficulties can span the full range of possibilities, it is likely that many test makers adjust the composition of their tests in an effort to attain a target test score. However, the success of such design hinges on instructors’ ability to know or predict the difficulty level of test items. As seen in the distribution of test scores shown in Figure 4, most of the surveyed tests show a class average in the range from 60% to 70%. The mean test score is 65% (8%), and the most common test score is nominally 65%. The range of test averages is large, spanning 42% to 89%. A “pass” in the Canadian post-secondary system is 50%.

Distribution of average test-scores across a total of 182 classroom multiple-choice tests.
Item Discrimination
Figure 5 displays the distribution of individual item discrimination for our survey, both in terms of the conventional item-included point-biserial correlation, $r_{pb}$, and the item-excluded correlation, $r_{ie}$.

Distribution of individual item discrimination for 11,246 classroom test items.
The mean of item discrimination for each test provides a useful descriptive measure of test quality. The distribution of mean item discrimination for the 182 tests, both in terms of the point-biserial and the item-excluded correlation, is shown in Figure 6.

Distribution of mean item discrimination for 182 classroom tests.

Mean test item discrimination,
Association Between Item Difficulty and Discrimination
In practice, item difficulty and item discrimination are not wholly independent. Theoretically, in the absence of guessing, items with a difficulty near $p = 0.5$ offer the greatest opportunity for discrimination, whereas items that all examinees answer correctly ($p = 1$) or incorrectly ($p = 0$) cannot discriminate at all.
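One straightforward way to examine this association empirically (an illustrative sketch, not necessarily how the figure below was produced) is to average the item-excluded discrimination within bins of item difficulty:

```r
# Mean item-excluded discrimination within bins of item difficulty.
# 'items' is a data frame with one row per item and columns p and r_ie.
discrimination_by_difficulty <- function(items, n_bins = 10) {
  bins <- cut(items$p, breaks = seq(0, 1, length.out = n_bins + 1),
              include.lowest = TRUE)
  tapply(items$r_ie, bins, mean, na.rm = TRUE)
}
```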

Item discrimination and item difficulty.
Test-Score Reliability
Test-score reliability is observed to vary widely among the surveyed tests. The mean value of reliability for our tests is
It is currently fashionable (see, for example, Ebel & Frisbie, 1991; Frisbie, 1988; Tavakol & Dennick, 2011) to gauge the quality of a test by considering various guidelines for test reliability. Several recommendations, for instance, suggest that a value of α ranging from 0.7 to 0.85 constitutes an acceptable-to-good test, while a value greater than 0.90 implies an excellent test. However, α is not meant to represent test quality, but rather is a specific measure of test-score reliability (Frisbie, 1988). It is an absolute and standalone measure that is not particularly useful for comparisons between tests of unequal lengths. Certainly, to be valid, any summative test must have a semblance of reliability. But precisely how reliable a (say) 25-item 1-hr midterm exam needs to be before it is jettisoned as “unreliable” is not psychometrically defined. Most often, when used in the context of classroom tests, reliability is used to compare whole-test quality among experimental variants. To this end, the normalized reliability, in which α is adjusted to a common reference test length, provides a more meaningful basis for comparing tests of different lengths.
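Although the normalization formula is not reproduced here, a standard way to place tests of different lengths on a common footing is to project α to a fixed reference length with the Spearman–Brown relation; the sketch below assumes that approach, and the 50-item reference length is purely illustrative:

```r
# Spearman-Brown projection of an observed alpha to a common reference length.
# 'alpha' is the observed reliability, 'n_items' the actual test length,
# and 'ref_length' the common length used for comparison (illustrative choice).
normalize_alpha <- function(alpha, n_items, ref_length = 50) {
  f <- ref_length / n_items               # lengthening (or shortening) factor
  (f * alpha) / (1 + (f - 1) * alpha)
}

# A 25-item test with alpha = 0.70 projects to roughly 0.82 at 50 items
normalize_alpha(0.70, n_items = 25)
```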

Test-score reliability and mean item discrimination.
Recommendations
Over the years, numerous recommendations and guidelines have been offered for interpreting multiple-choice item and test psychometrics. Most such recommendations are provided with an eye to maximizing the quality of MCT items and are thus beneficial and informative for classroom test designers. However, because most of the recommendations are either given in the context of professional standardized tests or are based on theoretical psychometric considerations, they are not always well suited to classroom test design. On the other hand, our broad survey of classroom MCTs provides a unique opportunity to construct useful guidelines from representative data for the Canadian post-secondary education system; such guidelines may also be of interest in similarly structured education systems elsewhere. Thus, in what follows, we provide recommendations and guidelines for the interpretation of item difficulty, item discrimination, and overall test quality/reliability.
Most recommendations for item difficulty emphasize that the greatest opportunity for discrimination exists for items with a value of $p$ near the midpoint between the chance score and a perfect score, roughly 0.5 to 0.7 depending on the number of options offered.
A common guideline for interpreting item discrimination breaks the range of values into qualitative bands: items with discrimination below 0.20 are considered poor, 0.20 to 0.29 marginal, 0.30 to 0.39 good, and 0.40 or above excellent (see, for example, Ebel & Frisbie, 1991).
Guidelines for the interpretation of Cronbach’s α are somewhat arbitrary. Ultimately, α is a measure of the robustness of the test scores to repeated measurement, but different amounts of score uncertainty can be tolerated depending on the purpose of the test. This is to say that an acceptably reliable classroom midterm exam may prove to be woefully unreliable as a high-stakes standardized test. Although standardized tests aim for α > 0.90, such tests comprise many times more items than a typical classroom test. Classroom test scores are known to span a wide range of reliability, likely averaging below 0.6 (Frisbie, 1988). The standard recommendation for classroom tests is to attain a reliability of at least 0.7 (Downing, 2005). In practice, the reporting of reliability in the classroom test development literature is done more as a proxy for overall test quality, rather than strictly as a measure of internal test-score reliability. In that case, it often makes more sense to compare a whole-test measure of quality that controls for the number of items. To this end, the adjusted reliability, in which α is normalized to a common test length, is the more appropriate quantity to report and compare.
Summary and Outlook
We have presented a broad examination of MCT “on the ground” at a primarily undergraduate Canadian university. Our dataset is the largest of its kind and has allowed for comparisons between typical theoretically driven recommendations for item analysis and empirical findings for as-deployed classroom tests and examinations. This study thus presents an opportunity for establishing a baseline for a wealth of future research into the development, strengthening, and assessment of MCT in the tertiary education setting.
Several Empirical Findings Are of Particular Interest
First, expert recommendations for the preferential use of 3-option items are entirely unheeded by classroom test designers. We find that in practice, 4-option items are most popular, followed by 5-option items. 3-option items comprise less than 10% of all deployed MCT items. It is likely that the preference for 4- and 5-option items stems from a combination of the desire to mitigate successful guessing and a dearth of 3-option items in publisher-made test bank questions.
Second, a number of attributes that are theoretically considered to be deficiencies in test design appear to be of limited impact in practice. In particular, across 11,246 items and 182 tests, item difficulty was largely concentrated above 0.6. Despite this, test averages and overall student performance did not suffer. Thus, it does not seem to be of high importance for instructors to focus on the difficulty of their questions, so long as item difficulties remain within a broad acceptable range (roughly 0.35 to 0.90; see the recommendations below).
Third, the presumptive practice of using confidence boosting “easy” (
Fourth, the use of item-included item-total correlations, such as the point-biserial, as discrimination metrics is demonstrated to inflate individual item discrimination scores, a fact long understood by theoreticians but perhaps not widely appreciated by practitioners. This inflation is large enough to distort test-level assessments of quality based on mean item discrimination, and we strongly recommend against further use of this statistic, especially for tests with fewer than 100 items. Instead, an item-excluded correlation provides a cleaner measure of discrimination that can be used for comparing items between tests of different lengths. Many modern item-analysis computer programs provide the option of calculating item-excluded correlations.
Fifth, this large set of solicited MCTs from a typical Canadian university proves to have surprisingly good performance, with
Finally, when examining the reliability of MCTs in the tertiary education setting, we strongly advocate for the use of a normalized α, such as α adjusted to a common reference test length, rather than raw α when comparing tests of differing lengths.
In conclusion, we leave the reader with a list of quantitative recommendations for evaluating test reliability and efficacy. Classroom test designers and practitioners should aim for the following:
Individual item difficulty ranging from 0.35 to 0.90, with the highest discrimination occurring for difficulties ranging from 0.55 to 0.70.
Minimum individual item discrimination of 0.15, aiming for 0.35 (0.15–0.35 being “good,” more than 0.35 being excellent), when using an item-excluded correlation as the measure of discrimination.
Minimum test-level mean item discrimination,
Score reliability with normalized Cronbach’s alpha of
