Abstract
Keywords
Introduction
If construct-irrelevant artifacts of the test-development process interfere with the validity of inferences made from test scores, then the entire testing enterprise is at risk. However, relatively few empirical studies have examined the impact of item-writing guidelines on test performance.
Despite recent enthusiasm for technology-enhanced items, test construction with multiple-choice (MC) items remains popular due to short administration times, scoring objectivity, and the ability to make sound decisions with test results (Haladyna, Downing, & Rodriguez, 2002; McCoubrie, 2004). However, creating an MC item can be challenging. Previous studies have suggested that forming answer choices is a difficult part of the item-construction process (Haladyna & Downing, 1989; Hansen & Dexter, 1997), because each individual answer choice can potentially influence item quality.
Haladyna and his colleagues (2002) proposed 31 item-writing guidelines for creating high-quality MC items that contribute to overall test reliability and validity. The authors examined three commonly cited guidelines related to constructing answer choices: No.23 “Keep choices homogeneous,” No.29 “Make all distracters plausible,” and No.30 “Use typical errors of students.”
However, application of one guideline can lead to a violation of another. For instance, answer choices based on common student errors may include choices that are not consistent with one another. In other words, constructing an MC item with plausible answer choices may harm their homogeneity in content and grammatical structure. Specifically, answer choices for math items seem homogeneous if similar number types, numbers of digits, or operations are used among answer options. Table 1 shows two examples of math items with plausible but not homogeneous answer choices. Choice D of Item 1 includes words, while the other answer choices contain only equations. Choice A of Item 2 is the only integer, while the other choices are fractions.
Items with Plausible Answer Choices That Are Not Homogeneous.
Writing items under the constraints of these three guidelines requires more effort and time. However, the impact of homogeneity of answer choices on item difficulty (proportion of correct answers) and discrimination (correlation with the total test score) remains uncertain. The current study addresses this question.
Haladyna and Rodriguez (2013) suggested that answer choices that are nonhomogeneous in content and grammar can provide a cue to students about the correct answer. Cues can make items easier and influence item discrimination indices because students take advantage of such cues to choose correct answers.
Answer choice homogeneity is defined in different ways for MC items based on written statements and MC items based on mathematical expressions. For example, language similarity, content, and grammatical consistency can make word-based answer choices homogeneous. On the contrary, answer choices in mathematics items can be made homogeneous by using a consistent number of digits (three vs. two), type of number (integer vs. fraction), and format of answer choices (words vs. numbers). Although these differences may seem small, they may be a source of construct-irrelevant variance, caused, perhaps, by test wiseness or alternatively a student’s suspicion that the item writer is trying to be tricky. This study examines the impact of homogeneity of answer choices only for mathematics MC items.
The few empirical test-development studies that have focused on answer choice similarity have had mixed results. Smith and Smith (1988) tested the impact of the similarity between correct choice (key) and incorrect student response for reading test items using different standard-setting methods, such as Angoff and Nedelsky methods. They found that answer choice similarity makes items more difficult using both methods. However, they did not adequately define how to classify answer choice similarity. Ascalon, Meyers, Davis, and Smits (2007) found parallel results. They focused on only MC items on a driver’s license examination and classified answer choice similarity by comparing all distractors in an item with the correct answer based on distractor content, such as theme, words, and sentence length. On the contrary, Green (1984) found the opposite result using general information items.
Unlike previous studies that explored the impact of answer choice homogeneity on item difficulty for the items with word-based answer choices, the current study examines the impact of answer choice homogeneity on item difficulty and item discrimination of mathematics items with numerical answer choices. More specifically, we examine three research questions:
Method
In this study, we replicated previous studies that compared the difficulty and discrimination of items with homogeneous answer choices and items with nonhomogeneous answer choices, here for mathematics items. We also analyzed the interaction of these factors with item content and answer choice construction strategy.
Test Items
We selected 22 pairs of MC mathematics items from state test forms in the United States developed for Grades 3 to 8 and high school. We selected the pairs by identifying items that were based on the same specific learning standard (an educational objective that students should have mastered at a critical point in a course) within algebra or geometry and that had either homogeneous or nonhomogeneous answer choices. The coding of homogeneity and nonhomogeneity was determined by three judges, all researchers experienced in item writing based on state standards and taxonomy. Each MC item had four answer choices: the key and three distractors. Although the items in each pair were written to the same specific learning standard and the stems were set up similarly, the item content was not identical, as can be seen in the item pair examples in Table 2.
Items with Homogeneous and Nonhomogeneous Answer Choices.
For Item Pair 1, the items on the right and the left side measure the same specific learning standard, which is “dividing a four-digit number by a two-digit number.” The answer choices of the item on the left are homogeneous because all choices are three-digit numbers. Moreover, the final two digits of the choices are either 02 or 27, which allows readers to form two sets of similar choices: (102, 127) and (602, 627). This grouping does not harm homogeneity, because no single choice differs from all the others in a way that test takers could use as a cue during the exam. On the contrary, the answer choices of the item on the right are nonhomogeneous because Choice A contains a different number of digits than the other answer choices. Choice A therefore stands apart from the other choices in a particular way, and students might treat this difference as a cue when selecting an option, which can influence the psychometric properties of the item.
For Item Pair 2, both items measure the same specific learning standard, which is “multiply each term inside the brackets (both algebraic term and number) by fraction outside the brackets.” The answer choices of the item on the left are homogeneous because two of the choices are integers and the other two choices are mixed fractions, so integers and mixed fractions are equally represented in the answer choices. On the contrary, the answer choices of the item on the right side are nonhomogeneous because Choice D is an integer, while the other choices contain fractions.
For Item Pair 3, both items measure the same specific learning standard, which is “calculate the area of a rectangle in real-world problems.” The answer choices of the item on the left are homogeneous because two of the choices contain two digits, whereas the other two choices contain three digits. The answer choices thus form two equal sets of similar choices in terms of number of digits. On the contrary, the answer choices of the item on the right side are nonhomogeneous because Choice D contains three digits, while the other choices contain two digits.
After we selected the item pairs, three judges determined whether answer choices in item pairs were constructed using “the same” or “different” strategies. Considering the item pairs in Table 2, Item Pair 3 was coded as “using the same strategy” because the corresponding answer choices of the two items were constructed using the same strategy: Choice A stems from adding length and width, Choice B from the formula for perimeter, Choice C from area, and Choice D from area times 2.
However, Item Pair 1 was coded as “using different strategy” because not all answer choices of the items in Item Pair 1 were constructed using the same strategy. For example, Choice B of the homogeneous item (the left side) and Choice A of the nonhomogeneous item (the right side) stem from dividing the last three digits of the dividend by the divisor, while Choice C of the homogeneous item and Choice B of the nonhomogeneous item stem from the combination of dividing the first two digits, the third digit, and the last digit of the dividend by the divisor. However, Choice A of the homogeneous item stems from dividing each digit of the dividend by the divisor, while Choice C of the nonhomogeneous item stems from the combination of dividing the first two digits of the dividend by the divisor and the last two digits of the dividend.
Item Pair 2 was also coded as “using different strategy” because not all answer choices of the items in Item Pair 2 were constructed using the same strategy. The items are related to the distributive property of multiplication over addition/subtraction. Only Choice C of the homogeneous item and the nonhomogeneous item was constructed using the same strategy, which stems from multiplying the first term inside the brackets by the fraction outside the brackets and adding/subtracting the second term inside the brackets to/from the fraction. The other answer choices of the homogeneous and nonhomogeneous items were constructed using different strategies. For the homogeneous item, Choice A stems from making a sign error after multiplying each term inside the brackets by the fraction. Choice B of the homogeneous item stems from multiplying the number on the right side of the equation by the fraction rather than multiplying each term inside the brackets by the fraction. For the nonhomogeneous item, Choice B stems from multiplying each term inside the brackets. Choice D stems from equating the terms inside the brackets to the number on the right side of the equation.
After the answer choice construction strategy of all item pairs was examined, we calculated the item difficulty and discrimination for each item within the pairs by using a classical test theory (CTT) approach. Item difficulty is calculated as the proportion of examinees answering the item correctly, while item discrimination refers to the ability of an item to discriminate between students with high scores and low scores (Thorndike, 2005). We used the item-total correlation index, which is one of the most widely used methods (Downing, 2005), to calculate item discrimination for each item in this study.
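The two CTT indices described above can be illustrated with a minimal sketch. The response matrix below is hypothetical, not data from this study, and the function names are ours. The sketch assumes dichotomous (0/1) scoring and an uncorrected item-total correlation (the item is included in the total score); the text does not state whether a corrected index was used.

```python
from statistics import mean, pstdev

def item_difficulty(item_scores):
    """Proportion of examinees answering the item correctly (0/1 scores)."""
    return mean(item_scores)

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    mx, my = mean(x), mean(y)
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (pstdev(x) * pstdev(y))

def item_total_correlation(responses, item_index):
    """Correlation between one item's scores and the total test score.

    Assumption: uncorrected index, i.e. the item itself counts toward the total.
    """
    item = [row[item_index] for row in responses]
    total = [sum(row) for row in responses]
    return pearson(item, total)

# Hypothetical data: rows are examinees, columns are items (1 = correct).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]

p_value = item_difficulty([row[0] for row in responses])
r_it = item_total_correlation(responses, 0)
print(f"difficulty = {p_value:.2f}, discrimination = {r_it:.2f}")
```

A higher difficulty value means an easier item, since the index is the proportion correct.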
Figure 1 shows item difficulty and item discrimination index values for each item pair. A total of seven item pairs had geometry content, whereas 14 item pairs had algebra content. In terms of answer choice construction strategy, five item pairs were coded as “using the same strategy,” while 17 item pairs were coded as “using different strategy” (see Figure 1).

Item difficulty and discrimination for 22 item pairs.
Participants
The items in each pair were administered to randomly equivalent, overlapping, and nonoverlapping samples of students selected from those participating in a state accountability assessment. Field test items were embedded in operational test forms and thus students were not aware that these items were field test items and did not count toward their scores.
To check the similarity of the random samples of students receiving each item within a pair, we compared the demographic characteristics of the samples. For every one of the 22 item pairs, the samples of students differed by no more than .01 in the proportion of students with disabilities, .02 in the proportion of male students, .01 in the proportion of students from any racial group, and .01 in the proportion of students qualifying for free or reduced-price lunch. In addition, the test forms were comparable in difficulty: the mean total scores of the test forms differed by no more than .8, and their standard deviations differed by no more than .06. Table 3 shows item characteristics of each pair.
Item Characteristics of Each Pair.
Meta-Analysis
Meta-analysis is a statistical analysis integrating and summarizing the results of individual quantitative studies on a particular topic (Glass, 1976). Meta-analysis allows researchers to compute effect sizes to combine mean and standard deviation values,
There are two approaches to computing effect sizes in meta-analysis: the fixed-effects model and the random-effects model. Under the fixed-effects model, a single common effect size is assumed for all studies, and each study is weighted according to its number of observations (Borenstein, Hedges, Higgins, & Rothstein, 2012). Under the random-effects model, the effect size varies from study to study because of differences in the samples, such as education level, age, and socioeconomic status (Cooper, 2010).
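The difference between the two models can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure implemented in the software used in this study: it uses inverse-variance weights (studies with more observations have smaller variances and therefore larger weights) and the DerSimonian-Laird estimator of the between-study variance, which is one common choice. The function names and the numbers are hypothetical.

```python
import math

def fixed_effect(effects, variances):
    """Pooled estimate assuming one common true effect (inverse-variance weights)."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return pooled, se

def random_effects_dl(effects, variances):
    """Pooled estimate allowing true effects to vary across studies.

    Assumption: DerSimonian-Laird moment estimator for tau^2.
    """
    w = [1.0 / v for v in variances]
    fixed, _ = fixed_effect(effects, variances)
    # Cochran's Q: weighted squared deviations from the fixed-effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)  # between-study variance, truncated at zero
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2, q

# Hypothetical effect sizes and sampling variances for two "studies".
effects = [0.0, 0.5]
variances = [0.01, 0.01]

fe_pooled, fe_se = fixed_effect(effects, variances)
re_pooled, re_se, tau2, q = random_effects_dl(effects, variances)
print(f"fixed: {fe_pooled:.3f} (SE {fe_se:.3f})")
print(f"random: {re_pooled:.3f} (SE {re_se:.3f}), tau^2 = {tau2:.3f}, Q = {q:.2f}")
```

When the between-study variance is estimated as zero, the two models give the same pooled estimate; otherwise the random-effects standard error is larger because it also reflects variation in true effects.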
Item difficulty
Each item pair comparison was treated as a separate study in a random-effects meta-analysis. Thus, the difference in sample characteristics and item content across item pairs was accounted for by the random-effects error term. The software package Comprehensive Meta-Analysis was used to perform the analyses (Borenstein, Hedges, Higgins, & Rothstein, 2005). Table 4 presents the results of fixed- and random-effects models, which are based on different assumptions. True effect was the same across all item pairs in the fixed-effects model, whereas it varied from one item pair to another in the random-effects model (Borenstein et al., 2012).
Fixed Effect and Random Effect of the Model for Item Difficulty.
The variation between item pairs, heterogeneity, was statistically significant,
We also conducted a mixed effects moderator analysis for the content area of the items (algebra vs. geometry) and answer choice construction strategy, as shown in Table 5.
Moderator Analysis for Content Area and Answer Choice Construction Strategy (Mixed-Effect Analysis).
According to Table 5, algebra items with homogeneous answer choices were easier than algebra items with nonhomogeneous answer choices, with an average effect size of .17, which is statistically significant (
For items in Table 5 whose answer choices were constructed using different strategies, items with homogeneous answer choices were easier than items with nonhomogeneous answer choices, with an average effect size of .12, which is statistically significant (
Item discrimination
After we transformed the item discrimination indices for each of the 44 items from item-total correlations to Fisher
Fixed Effect and Random Effect of the Model for Item Discrimination.
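The transformation of item-total correlations to the Fisher scale can be sketched as follows. This is a minimal illustration, not the study's computation: the variance formula assumes the two items in a pair were answered by independent samples, and the correlations and sample sizes shown are hypothetical.

```python
import math

def fisher_z(r):
    """Fisher transformation of a correlation: z = atanh(r)."""
    return math.atanh(r)

def z_difference(r1, n1, r2, n2):
    """Difference between two Fisher-transformed correlations and its variance.

    Assumption: the two correlations come from independent samples, so the
    variances 1 / (n - 3) simply add.
    """
    diff = fisher_z(r1) - fisher_z(r2)
    var = 1.0 / (n1 - 3) + 1.0 / (n2 - 3)
    return diff, var

# Hypothetical discrimination indices for the two items of one pair.
diff, var = z_difference(0.45, 1003, 0.40, 1003)
print(f"difference in z = {diff:.4f}, SE = {math.sqrt(var):.4f}")
```

The transformation is used because the sampling distribution of z is approximately normal with a variance that depends only on sample size, which makes the values suitable for inverse-variance pooling.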
The value of
Results and Discussion
The purpose of this study was to explore the impact of the homogeneity of answer choices on item difficulty and discrimination of mathematics items. The findings showed that, overall, items with homogeneous answer choices were easier than items with nonhomogeneous answer choices, but the result depended on item content (algebra vs. geometry) and answer choice construction strategy. Algebra items with homogeneous answer choices were easier than algebra items with nonhomogeneous answer choices. However, the difficulty of geometry items with homogeneous answer choices was not statistically different from the difficulty of geometry items with nonhomogeneous answer choices. Taking answer choice construction strategy into consideration, the findings showed that items with homogeneous answer choices were easier than items with nonhomogeneous answer choices when different strategies were used to construct the answer choices of the two items in a pair. On the contrary, when the answer choices of items were constructed using the same construction strategy, the difficulty of items with homogeneous answer choices and nonhomogeneous answer choices was not statistically different. Moreover, the very large
The results of this study provide empirical support to the growing body of test-development studies. Specifically, this study contributes to research on the impact of answer choice homogeneity on item psychometric characteristics. First, unlike past studies that focused on word-based items, the current study examined the impact of homogeneity of answer choices on psychometric characteristics of mathematics items. With regard to the effect on item difficulty, our overall findings are consistent with Green (1984), yet inconsistent with Smith and Smith (1988) and Ascalon et al. (2007), in that on average the items with homogeneous answer choices were easier. However, when answer choices were constructed using the same strategy, the difficulty of items with homogeneous answer choices was not statistically different from the difficulty of items with nonhomogeneous answer choices, which is not consistent with the findings of Green (1984), Smith and Smith (1988), and Ascalon et al. (2007). One hypothesis that might explain these results is that the impact of option homogeneity depends on item content, because the discrepant results came from studies using reading test items, driver’s license items, and general information items. A common characteristic of such word-based items is that the probability of choosing the correct answer by using the answer choices may be far higher than for mathematics items, because examinees may choose the correct answer for a mathematics item only if they know the solution, regardless of the answer choices. Existing literature on the effect of the number of options on item psychometric properties may be considered to support this claim. For example, Rodriguez (2005) conducted a meta-analysis to determine the impact of the number of options on item difficulty and discrimination by examining 56 empirical studies, most of which contained social sciences and language arts items, which are word-based items.
The findings showed that items became easier when the number of options decreased. However, a recent study conducted by Atalmış and Kingston (2017) designed MC mathematics items with four options and three options constructed using the same strategy and found that item difficulty was not statistically different across the items with four options and three options. These findings suggest that item difficulty may be unaffected by the homogeneity of answer choices or the number of options only when the options of mathematics items are constructed using the same strategy.
Another contribution of this study is that it examined the impact of answer choice homogeneity not only on item difficulty but also on item discrimination, which differs from previous studies. The findings showed that item discrimination did not differ statistically between mathematics items with homogeneous answer choices and those with nonhomogeneous answer choices, regardless of item content and answer choice construction strategy. A hypothesis that might explain these results is that the items with homogeneous and nonhomogeneous answer choices were constructed according to the same specific learning standards and with similar stems. Thus, when parallel mathematics item pairs are constructed, learning standards and stem similarity may deserve higher priority than item content and answer choice construction strategy for keeping item discrimination similar between parallel items.
The last contribution of this study is the use of meta-analysis in unconventional contexts. Although meta-analysis is usually used for integrating and summarizing the results of individual quantitative studies on a particular topic, in this study, each item pair comparison was considered as a separate study to examine the impact of answer option homogeneity on item psychometric properties.
The results of this empirical study also apply to item writers and classroom teachers. Writing MC items with plausible distractors based on common student errors is always suggested (Haladyna et al., 2002). However, creating plausible distractors may affect answer choice homogeneity. Although each approach has merits, creating items with both plausible and homogeneous answer choices poses a challenge for item writers and classroom teachers. That is, item writers might need to spend significant extra time to construct a fourth or fifth option that is both plausible and homogeneous, and thus construct fewer items in a given amount of time. The results of the current study inform test creators about the use of such approaches when designing a test.
Given the results of this study combined with the existing literature on item development, we recommend that the use of plausible answer options be given a higher priority than the use of homogeneous answer options in constructing MC mathematics items, but that both guidelines be considered.
There are some limitations of this study. First, there were only 22 pairs of items; thus, the number of items per grade level and the variety of types of nonhomogeneity were limited. Second, this study was conducted for only mathematics items. Most importantly, most of the variation in difficulty differences between the items with homogeneous and nonhomogeneous answer choices was not explained. We recommend additional exploration of item content and answer options construction strategy as moderator variables.
