1. Introduction
Test score equating is a crucial statistical tool that enables the comparison of test scores from different test forms and ensures fairness in assessments (González & Wiberg, 2017). When equating scores from nonequivalent test groups, it is essential to account for differences in both the ability levels of the test groups and the difficulty of the test forms. To make the scores comparable, any differences in ability and difficulty must be separated, so that the scores are only adjusted for differences in difficulty. For this purpose, testing programs generally apply either an assumption of common test-takers or the use of common items. The former assumes that the test groups to be equated are random samples from the same underlying population, whereas the latter views the groups as samples from different populations. In the latter case, a subset of common items is used to adjust for the differences in ability between the test groups. These common items are often referred to as anchor items, and the corresponding data collection design is known as the Nonequivalent Groups With Anchor Test (NEAT) design (von Davier et al., 2004b). However, not all testing programs have common items available but still need to adjust for ability imbalances. Examples of such tests are the Invalsi test (Invalsi, 2013), the Armed Services Vocational Aptitude Battery (Quenette et al., 2006), and, until recently, the Swedish Scholastic Aptitude Test (SweSAT; Stage & Ögren, 2004). If the ability imbalances are ignored, the equated test scores will be biased, which can have severe consequences in high-stakes testing scenarios.
One way of applying a nonequivalent groups design without anchor items is to use background information about the test-takers in the form of measured covariates (Wiberg & Bränberg, 2015). There are several ways that covariates can be utilized within equating. Kolen (1990), Cook et al. (1990), and Wright and Dorans (1993) used covariates to balance the test groups before equating the test forms, Liou et al. (2001) applied covariates in a similar fashion to anchor items, Bränberg and Wiberg (2011) incorporated covariates in linear equating, and Hsu et al. (2002) used covariates within item response theory (IRT) true-score equating. However, as the covariate vector grows, controlling for the covariates quickly becomes very difficult. For example, conditioning on four categorical covariates, each with four categories, yields 256 possible covariate combinations. The matrix of all possible combinations of test scores and covariate realizations would therefore contain a large number of empty cells. To overcome this problem, the test-takers can instead be compared on their propensity score, which is a scalar function of the covariates.
Livingston et al. (1990) were the first to propose the use of covariates within a propensity score for equating. More recently, Moses et al. (2010) explored the use of two anchor tests within a propensity score, Powers (2010) applied chained equating (CE), frequency estimation, IRT true-score, and observed-score equating using propensity scores, Haberman (2015) used propensity scores to create pseudoequivalent groups from nonequivalent groups, and Longford (2015) used them as a tool for matching before equating. Wallin and Wiberg (2019) were the first to propose propensity scores for both a poststratification equating (PSE) and a CE estimator within the kernel equating framework (von Davier et al., 2004b). Their results showed that a level of precision and accuracy similar to that of the NEAT design could be achieved. However, their results were based on the assumption that the propensity score was known. Since this will never be the case in any real testing situation, it is of great importance to assess the sensitivity of the results to violations of this assumption. Thus, the aim is to study the functional form through which the covariates enter the propensity score and to investigate how sensitive the equated scores are to model misspecification of the estimated propensity score, using both real and simulated data.
Propensity score model misspecification has previously been studied within the field of causal inference. Drake (1993) showed that a substantial bias was introduced when estimating the average treatment effect if a confounding covariate was omitted in the propensity score estimation model. Dehejia and Wahba (1999) had similar findings but also noted that causal estimates were not sensitive to the specification of the functional form of the propensity score, once all important covariates had been included. Similar results have been reported in more recent studies: Waernbaum (2010, 2012) showed that the average treatment effect can be estimated without bias using propensity scores even when, for example, the link function is misspecified or higher order terms of the covariates are omitted. There were, furthermore, situations with no efficiency loss, and a key component in obtaining such results was that the true propensity score was a function of the misspecified model.
There are currently no existing studies on propensity score model misspecification in the equating context. This is critical to examine since equating results are often used for decision making at the individual level (e.g., admission decisions to universities) and for educational policy making. The current study therefore investigates the sensitivity of the equating function to model misspecification of the propensity score. Assuming a parametric model for the propensity score, three misspecifications are considered, inspired by the studies of Waernbaum (2010) and Waernbaum (2012): (1) misspecifying the link function, (2) excluding an important (true confounder) covariate, and (3) excluding a higher order moment of a confounding covariate. Each misspecification is evaluated in terms of equating function precision and accuracy to determine how critical it is.
The structure of this article is as follows. The kernel equating framework is introduced in Section 2, followed by an introduction to propensity scores in Section 3. Section 4 includes an empirical illustration, and Section 5 presents a simulation study. This article is concluded with a discussion of the results together with some practical guidelines.
2. Kernel Equating
We denote the new test form by X and the old test form by Y and their respective scores by the random variables $X$ and $Y$, with cumulative distribution functions (CDFs) $F_X$ and $G_Y$ defined on a common target population $T$. Consider the random variable $\varphi(X)$, where $\varphi$ is the equipercentile equating function

$$\varphi(x) = G_Y^{-1}\left(F_X(x)\right). \tag{1}$$

The equipercentile function thus matches all of the moments of the distribution of the equated scores $\varphi(X)$ to those of $Y$ on $T$. However, since most test scores are discrete, their CDFs are not continuous but step functions. Hence, for any value of $x$, the inverse $G_Y^{-1}(F_X(x))$ is not well-defined without first making continuous approximations of the score distributions.
Since kernel equating (Holland & Thayer, 1989; von Davier et al., 2004b) generalizes many of the most common and modern equating approaches, we present our theory in terms of this framework, although the proposed method is applicable to, for example, traditional equipercentile and linear equating as well. This framework consists of five steps: (1) fitting a regression model (typically a log-linear model) to the empirical score distributions, (2) estimating the test score probabilities on the target population based on the estimated model in Step 1 and given the data collection design, (3) making continuous approximations to the estimated discrete score distributions from Step 2, (4) equating the test scores using the equipercentile function, and (5) evaluating the estimated equating function (González & Wiberg, 2017; von Davier et al., 2004b). From Equation 1, it is clear that in order to estimate $\varphi(x)$, the score distributions $F_X$ and $G_Y$ must be estimated.
In the first step, a log-linear model is fit to each empirical score distribution, typically of the form

$$\log P(X = x_j) = \beta_0 + \sum_{m=1}^{M} \beta_m x_j^m,$$

where $M$ is the number of moments of the observed score distribution that the fitted distribution preserves.

The next step is to estimate the score probabilities $r_j = P(X = x_j \mid T)$ and $s_k = P(Y = y_k \mid T)$ on the target population $T$, through the design function implied by the data collection design.

Let the mean and variance of $X$ on $T$ be denoted by $\mu_X$ and $\sigma_X^2$. The continuized score distribution is then defined as

$$F_{h_X}(x) = \sum_j r_j \, \Phi\!\left(\frac{x - a_X x_j - (1 - a_X)\mu_X}{a_X h_X}\right),$$

where $\Phi$ denotes the kernel, $h_X > 0$ is a bandwidth parameter, and $a_X = \sqrt{\sigma_X^2 / (\sigma_X^2 + h_X^2)}$. The random variable with CDF $F_{h_X}$ has the same mean and variance as $X$. In most studies of kernel equating, the function $\Phi$ is the standard Gaussian CDF, and this is the choice made here as well.

The bandwidth $h_X$ is typically chosen by minimizing the penalty function

$$\text{PEN}(h_X) = \sum_j \left(\hat{r}_j - \hat{f}_{h_X}(x_j)\right)^2,$$

where $\hat{f}_{h_X}$ is the density function corresponding to $F_{h_X}$, and an additional smoothness penalty term can be added. The continuized distribution $G_{h_Y}$ of the old form scores is defined analogously.

With the estimated, continuized score distributions $\hat{F}_{h_X}$ and $\hat{G}_{h_Y}$, the test scores are equated through the sample version of Equation 1, $\hat{\varphi}(x) = \hat{G}_{h_Y}^{-1}(\hat{F}_{h_X}(x))$; a minimal code sketch of the continuization step is given below.

The asymptotic distribution of $\hat{\varphi}(x)$ is normal, and the standard error of equating (SEE) follows from the delta method as

$$\text{SEE}(x) = \left\lVert \mathbf{J}_{\varphi}\mathbf{J}_{\text{DF}}\mathbf{C} \right\rVert,$$

where $\mathbf{J}_{\varphi}$ and $\mathbf{J}_{\text{DF}}$ are the Jacobians of the equating function and the design function, respectively, and $\mathbf{C}$ is a matrix such that $\mathbf{C}\mathbf{C}^{\top}$ equals the asymptotic covariance matrix of the estimated score probabilities (von Davier et al., 2004b).
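To make the continuization step concrete, the following is a minimal sketch of the Gaussian kernel continuization formula above. All inputs are toy values chosen for illustration, not the paper's data.

```python
# A minimal sketch of Gaussian kernel continuization (Step 3 of kernel
# equating). All inputs below are toy values, not the paper's data.
import numpy as np
from scipy.stats import norm

def continuize(x, x_points, r_hat, h):
    """Continuized CDF F_h(x) for discrete scores x_points with
    estimated probabilities r_hat and bandwidth h."""
    mu = np.sum(r_hat * x_points)                    # mean of the score distribution
    var = np.sum(r_hat * (x_points - mu) ** 2)       # variance of the score distribution
    a = np.sqrt(var / (var + h ** 2))                # shrinkage factor a_X
    z = (x - a * x_points - (1 - a) * mu) / (a * h)  # standardized kernel arguments
    return np.sum(r_hat * norm.cdf(z))               # weighted sum of Gaussian CDFs

# Toy usage: a 0-20 score scale with uniform probabilities.
x_points = np.arange(21)
r_hat = np.full(21, 1 / 21)
print(continuize(12.0, x_points, r_hat, h=0.6))
```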
3. Nonequivalent Groups With Covariate (NEC) Design
This section will clarify the viewpoint we take on the nonequivalent groups designs in test score equating, and the specific assumptions underlying the NEC design (Wiberg & Bränberg, 2015). The NEC design assumes that the group of test-takers being administered test form X is a random sample from a population $P$, while the group administered test form Y is a random sample from a (possibly different) population $Q$. Instead of anchor items, a vector of covariates $\mathbf{Z}$, correlated with the test scores, is observed for all test-takers and is used to adjust for ability differences between the groups.
The Nonequivalent Groups With Covariate (NEC) Design Summarized
The covariates in $\mathbf{Z}$ serve as a proxy for the latent ability, which requires them to be correlated with the test scores. In Figure 1, the variables in $\mathbf{Z}$ affect both the test form assignment and the test scores, making them confounders of the relationship between the two.

Figure 1. The nonequivalent groups with covariates design.
3.1. Propensity Scores
The basic idea of the NEC design is to replace the anchor test scores with the covariates and then to equate the test scores treating the covariate realizations as if they were in fact anchor scores. When using more than only a few covariates, the number of empty cells in the frequency table will grow large. There is thus a practical problem with the NEC design that is unrelated to the theoretical justification of the method. The curse of dimensionality is a well-known problem far beyond the equating literature, and a well-established way to handle it is to use a dimension-reducing function of the covariates called the propensity score. It reduces the dimension of the covariate vector down to a scalar and is defined as $e(\mathbf{Z}) = P(D = 1 \mid \mathbf{Z})$, where $D$ is an indicator variable taking the value 1 if a test-taker is administered test form X and 0 otherwise.
The propensity score possesses the appealing property of being a balancing score (Rosenbaum & Rubin, 1983). This means that it is sufficient to control for the scalar $e(\mathbf{Z})$, rather than the full covariate vector $\mathbf{Z}$, to achieve covariate balance between the test groups.
As the propensity score is not known, it needs to be estimated. A common method is to use logistic regression, which will be used here. Following Rosenbaum and Rubin (1984) and Wallin and Wiberg (2019), the estimated propensity scores of the test-takers will thereafter be partitioned into strata based on the percentiles. The test-takers in each stratum are treated as homogeneous in terms of the latent ability, meaning that the equivalent groups design assumptions hold true within each stratum.
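As a concrete illustration of this estimation and stratification procedure, the following is a minimal sketch using logistic regression and percentile-based strata. The covariate names, sample size, and number of strata are assumptions made for the example, not values from the paper.

```python
# A minimal sketch of propensity score estimation and percentile-based
# stratification. Covariates, sample size, and strata count are illustrative.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 1000
Z = pd.DataFrame({"grade": rng.uniform(1, 5, n),   # toy covariates
                  "age": rng.integers(18, 30, n)})
t = rng.integers(0, 2, n)                          # toy test form indicator

# Estimate the propensity score e(Z) = P(D = 1 | Z) with logistic regression.
ps = LogisticRegression().fit(Z, t).predict_proba(Z)[:, 1]

# Partition test-takers into strata based on percentiles of the estimated
# propensity score; within each stratum the groups are treated as equivalent.
strata = pd.qcut(ps, q=20, labels=False, duplicates="drop")
```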
3.2. Equating Estimators Based on the Propensity Score
In the following, two propensity score-based equating estimators are derived and presented together with their underlying assumptions, following Wallin and Wiberg (2019). As these estimators were presented without much theoretical justification in the original paper, special attention is given to motivating them in this section.
3.2.1. PSE estimator
To define the PSE estimator, abbreviated PS-PSE, define the elements in the score probability vectors as

$$r_j = \sum_{s=1}^{S} P(X = x_j \mid R = s, T)\, P(R = s \mid T) \tag{8}$$

and

$$s_k = \sum_{s=1}^{S} P(Y = y_k \mid R = s, T)\, P(R = s \mid T), \tag{9}$$

where $R \in \{1, \dots, S\}$ denotes the stratified propensity score. The probabilities are defined on the target population $T$, a weighted mixture of the populations $P$ and $Q$. In Equations 8 and 9, the terms $P(X = x_j \mid R = s, T)$ and $P(Y = y_k \mid R = s, T)$ cannot be estimated directly, since test form X is administered only in population $P$ and test form Y only in population $Q$. To identify them, the following assumption is made:

Assumption 1: $(X, Y) \perp D \mid e(\mathbf{Z})$, and $0 < e(\mathbf{Z}) < 1$ for any realization of $\mathbf{Z}$.

Note, for a dichotomous treatment (i.e., a pair of test forms to be equated), $e(\mathbf{Z})$ fully characterizes the assignment mechanism, since $P(D = 0 \mid \mathbf{Z}) = 1 - e(\mathbf{Z})$.
The first part of Assumption 1 means that the test scores are conditionally independent of the test form assignment by controlling for the propensity score. The test groups would thereby be only randomly different from each other, as in the equivalent groups design. The second part of Assumption 1 ensures that all test-takers have a nonzero probability of being assigned either test form. If the propensity score has been stratified into $S$ strata, Assumption 1 is instead required to hold within each stratum.
Under Assumption 1, we furthermore assume that:

$$P(X = x_j \mid R = s, P) = P(X = x_j \mid R = s, T) \tag{10}$$

and

$$P(Y = y_k \mid R = s, Q) = P(Y = y_k \mid R = s, T). \tag{11}$$

Equation 10 states that the probability of test score $x_j$, given propensity score stratum $s$, is the same in population $P$ as in the target population $T$; Equation 11 makes the analogous statement for test form Y in population $Q$. Together with Assumption 1, Equations 10 and 11 make the score probabilities in Equations 8 and 9 estimable from the observed data (Proposition 1).
The proof of Proposition 1 is found in Online Appendix A.
Lastly, plug the estimated test score probabilities into Steps 3 through 5 of the kernel equating framework described in Section 2 to obtain the PS-PSE estimate of the equating function.
3.2.2. CE estimator
Even though conditioning on the propensity score, as outlined for the PS-PSE estimator, is the traditional way of removing dependencies between the outcome and the treatment, CE methods have a long-standing tradition within test score equating. Several studies have shown that the result of linking, or chaining, the score distributions through an intermediate variable compares well with poststratification approaches. The PS-CE estimator, which chains the scores through the stratified propensity score, is defined as

$$\hat{\varphi}_{\text{CE}}(x) = \hat{G}_Q^{-1}\!\left(\hat{H}_Q\!\left(\hat{H}_P^{-1}\!\left(\hat{F}_P(x)\right)\right)\right),$$

with $\hat{F}_P$ denoting the estimated (continuized) CDF of $X$ in population $P$ and $\hat{G}_Q$ the estimated CDF of $Y$ in population $Q$, where $\hat{H}_P$ and $\hat{H}_Q$ denote the estimated CDFs of the stratified propensity score in populations $P$ and $Q$, respectively.
The PS-CE estimator is dependent on the linking of distributions between populations $P$ and $Q$, which is justified by the following assumption:

Assumption 2: The equipercentile functions linking the test scores to the stratified propensity score are population invariant; that is, they are the same in $P$, $Q$, and the target population $T$.

Assumption 2 is to be understood as a statement regarding population invariance of the equipercentile function linking the test scores to the propensity score: both the link from the X scores to the propensity score and the link from the propensity score to the Y scores are assumed not to depend on the population in which they are computed.
The proof of Proposition 2 is found in Online Appendix B.
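To illustrate the chaining logic, the sketch below composes empirical CDFs and quantile functions in the order $\hat{G}_Q^{-1}(\hat{H}_Q(\hat{H}_P^{-1}(\hat{F}_P(x))))$. It uses raw empirical distributions rather than kernel-continuized ones, so it is a simplification of the actual PS-CE estimator, and all inputs are assumed toy samples.

```python
# A simplified sketch of chained equating through an intermediate variable
# (here, the stratified propensity score). Empirical CDFs replace the
# kernel-continuized CDFs used in the actual estimator.
import numpy as np

def ecdf(sample):
    """Empirical CDF of a sample."""
    s = np.sort(np.asarray(sample))
    return lambda v: np.searchsorted(s, v, side="right") / len(s)

def quantile(sample):
    """Empirical quantile function of a sample."""
    return lambda u: np.quantile(sample, np.clip(u, 0.0, 1.0))

def chain_equate(x, x_scores_p, anchor_p, anchor_q, y_scores_q):
    """Map score x on form X (population P) to the Y scale (population Q)."""
    u = ecdf(x_scores_p)(x)         # F_P(x)
    a = quantile(anchor_p)(u)       # H_P^{-1}(u): anchor value at that percentile
    v = ecdf(anchor_q)(a)           # H_Q(a): percentile of that value in Q
    return quantile(y_scores_q)(v)  # G_Q^{-1}(v): Y score at that percentile
```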
4. A Motivating Example Using Empirical Data
As a motivating example, two test administrations of the SweSAT are analyzed. The SweSAT is used in the selection process for Swedish university programs and consists of a verbal and a quantitative section. The sections consist of 80 items each and are equated separately. Only recently did the SweSAT start to include anchor items; prior to this, covariates were used in a matching procedure when the test forms were equated (Wiberg & Bränberg, 2015). In this empirical study, both the PS-PSE and PS-CE estimators are used to equate the quantitative sections from two SweSAT administrations from the past decade.
4.1. Data and PS Models
The score distributions of the analyzed test forms are shown in Figure 2. As seen, the two score distributions differ noticeably, which may reflect differences in test form difficulty, differences in ability between the test groups, or both.

Figure 2. The score distributions of the two quantitative test forms.
Summary Statistics of the Variables Used in the Empirical Illustration
Since there is no known true propensity score model, a number of candidate models are set up for both the PS-PSE and PS-CE equating estimators. Let $\mathbf{Z}$ denote the vector of observed covariates. The candidate models, parametrized as summarized below, differ in their link function, in which covariates are included, and in whether higher order terms of the covariates are included.
The Parametrization of the Candidate Propensity Score Models
Hence, in total, there will be 26 equating estimators considered, 13 for the PS-PSE estimator and 13 for the PS-CE estimator. The equated scores and the SEEs of each estimator will be analyzed to determine the extent to which they vary with changes in the propensity score model’s parameterization. The difference that matters (Dorans & Feigenbaum, 1994), defined to be larger than half a raw score point, will also be investigated. Goodness-of-fit measures like the Akaike information criterion (Akaike, 1974) or the Bayesian information criterion (BIC; Schwarz, 1978) are not suitable for evaluating the propensity score models, since their parameter estimates are not the priority but rather the achieved covariate balance between the test groups (Augurzky & Schmidt, 2001; Stuart, 2010). The absolute standardized mean difference (ASMD; Austin, 2008) will therefore be used to evaluate the level of achieved covariate balance:
$$\text{ASMD} = \frac{\left|\bar{Z}_P - \bar{Z}_Q\right|}{\sqrt{\left(s_P^2 + s_Q^2\right)/2}},$$

where $\bar{Z}_P$ and $\bar{Z}_Q$ denote the sample means of a covariate in the two test groups and $s_P^2$ and $s_Q^2$ the corresponding sample variances.
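The following is a minimal sketch of this balance measure for a single covariate, using the pooled-variance standardization of Austin (2008); the inputs are toy samples, not the SweSAT data.

```python
# A minimal sketch of the absolute standardized mean difference (ASMD)
# for one covariate measured in two test groups. Toy inputs only.
import numpy as np

def asmd(z_p, z_q):
    """ASMD between groups P and Q, with a pooled-variance denominator."""
    pooled_sd = np.sqrt((np.var(z_p, ddof=1) + np.var(z_q, ddof=1)) / 2)
    return np.abs(np.mean(z_p) - np.mean(z_q)) / pooled_sd

rng = np.random.default_rng(0)
print(asmd(rng.normal(0.2, 1, 500), rng.normal(0.0, 1, 500)))
```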
In Figure 3, the ASMDs between the treatment (test form X) group and the control (test form Y) group are displayed for each covariate, before and after stratification on the estimated propensity score.

Figure 3. The absolute standardized mean difference between the treatment group (test form X) and the control group (test form Y) for each covariate.
In the next stage, bivariate log-linear models are fit to the observed test scores and the stratified propensity scores. A set of candidate models is considered and evaluated in terms of their BIC. In Tables 4 and 5, the estimated coefficients, together with their corresponding standard errors, are presented for the considered models.
Table 4. The Estimated Coefficients, With Standard Errors in Parentheses, of the Four Bivariate Log-Linear Models Considered for the X Scores

Table 5. The Estimated Coefficients, With Standard Errors in Parentheses, of the Four Bivariate Log-Linear Models Considered for the Y Scores
4.2. Results
To illustrate the general trend among the estimators, we display the results of the equating estimators using propensity score models 1–4 in Figure 4. Propensity score model number 3, which does not include the covariate Gender, deviates clearly: for the upper score scale, Model 3 differs from the other estimators by more than the difference that matters. Since gender has been established as an important covariate when analyzing the SweSAT (Bränberg et al., 1990) and is fairly strongly correlated with the test scores, it comes as no surprise that the equated scores are affected when gender is excluded. For this data set, the choice of link function, and whether or not a second-order term is included, matters far less. The SEEs of all estimators, on the other hand, are more or less similar along the whole score scale.

Figure 4. The equated scores and standard error of equating of the PS-PSE estimator, using Models 1–4 for the propensity score estimation.
In Figure 5, the equated scores (upper part) are shown for the four PS-CE estimators, together with the SEE (lower part). The pattern from the PSE estimators is evident here as well, with clear deviations for the model that fails to include gender in the propensity score model and with negligible differences in terms of SEE. We also notice a distinct difference between the equated scores produced by the PSE-based estimators in Figure 4 and the CE-based estimators in Figure 5. In the online supplements, the estimated equating functions resulting from all 13 propensity score models are given.

Figure 5. The equated scores and standard error of equating of the PS-CE estimator, using Models 1–4 for the propensity score estimation.
5. Simulation Study
For the empirical illustration, the results suggested that a critical component when using propensity scores to equate test scores is to include all important covariates in the propensity score estimation model. The equated scores were less sensitive to the choice of link function and to the inclusion of higher order polynomials. Since these results cannot be generalized on their own, the robustness of the PS-PSE and PS-CE estimators to misspecifications of the propensity score model is evaluated in a simulation study. We assume that the propensity score is described by a parametric model and consider two different simulation designs. Both designs are inspired by the simulation study in Wallin and Wiberg (2019), but with propensity score model misspecifications added. The misspecifications considered are (1) using the wrong link function, (2) leaving out a covariate, and (3) leaving out higher order terms. The simulation designs closely follow the studies typically seen in the causal inference literature, where potential outcomes under different treatment regimes are generated and the observed outcomes depend on the realization of the treatment variable, which in turn is a function of a covariate vector. Inspired by this, and to mimic the situation described in Figure 1, we generated covariates that affected both the test form assignment (through the propensity score) and the test scores, making them true confounders. Both potential test scores and observed test scores are generated, as explained in the simulation designs. The presented results are based on repeated replications of each design, for several sample sizes.
5.1. Simulation Design A
For Design A, the data generating process (DGP) is as follows:
1. Generate the covariates independently from a uniform distribution on the interval [1, 5].
2. Generate the test form assignment indicator $D$ from a Bernoulli distribution with success probability given by the true propensity score model in Equation 15, a logit model in the covariates that includes a second-order term.
It follows that the test groups will be of approximately the same size.
3. The potential test scores on test form X are generated as a linear function of the covariates with an added random error term, and the potential test scores on test form Y are generated analogously.
Since the covariates in these expressions represent the ability differences between the groups, the covariates confound the relationship between the test form assignment and the test scores.
4. The observed test score for each test-taker is defined as the potential score on the assigned test form. To generate an observed score, the potential score on test form X is thus retained if $D = 1$, and the potential score on test form Y if $D = 0$.
5. The propensity score is estimated using logistic regression. Based on the percentiles, it is thereafter divided into 20 categories. The number of categories was chosen in an attempt to reach covariate balance between the test groups, as measured by the ASMD. Four candidate models are defined: one that is correctly specified according to Equation 15, one that uses a probit link function instead of the correct logit link, one that leaves out one of the confounding covariates, and one that leaves out the second-order term. A hedged code sketch of this data generating process is given below.
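The sketch below assembles the five steps into code. Since the exact coefficients of Equation 15 are not reproduced here, every numeric value is an illustrative placeholder; only the structure (uniform covariates, logit assignment with a second-order term, linear potential scores, observed score determined by assignment) follows the design.

```python
# A hedged sketch of a Design A-style data generating process.
# All coefficients are placeholders, not the values of Equation 15.
import numpy as np

rng = np.random.default_rng(2023)
n = 2000

# Step 1: covariates, uniform on [1, 5].
z1, z2 = rng.uniform(1, 5, n), rng.uniform(1, 5, n)

# Step 2: test form assignment through a logit propensity model that
# includes a second-order term (placeholder coefficients, centered so the
# two test groups are of approximately the same size).
logit = -2.0 + 0.3 * z1 + 0.2 * z2 + 0.05 * z1 ** 2
t = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Step 3: potential scores as linear functions of the covariates plus noise.
score_x = 5 + 3 * z1 + 2 * z2 + rng.normal(0, 2, n)
score_y = 8 + 3 * z1 + 2 * z2 + rng.normal(0, 2, n)

# Step 4: the observed score is the potential score on the assigned form.
observed = np.where(t == 1, score_x, score_y)
```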
5.2. Simulation Design B
For Design B, the DGP is as follows:
1. Generate the covariates independently from a uniform distribution on the interval [1, 5].
2. Generate the test form assignment indicator $D$ from a Bernoulli distribution with success probability given by the true propensity score model in Equation 16, which has a more intricate functional form of the covariates than the model in Design A.
It follows that the test groups will be of approximately the same size.
3. The potential test scores on test form X are generated as a nonlinear function of the covariates with an added random error term, and the potential test scores on test form Y are generated analogously, where the error terms are independent of the covariates.
4. The observed test score for each test-taker is generated as in Design A: the potential score on the assigned test form is retained as the observed score.
5. As in Design A, the propensity score is estimated using logistic regression and thereafter divided into 20 categories based on its percentiles, with the number of categories again guided by the achieved ASMD. Four candidate models are used: one that is correctly specified according to Equation 16, one that uses a probit link function instead of the correct logit link, one that leaves out one of the confounding covariates, and one that leaves out the second-order term.
5.3. Evaluation Measures
The PS-PSE and PS-CE estimators are evaluated by calculating the bias and the standard error (SE) of the equated scores at each score point:

$$\text{Bias}(x_j) = \frac{1}{R}\sum_{r=1}^{R}\left(\hat{\varphi}_r(x_j) - \varphi(x_j)\right)$$

and

$$\text{SE}(x_j) = \sqrt{\frac{1}{R}\sum_{r=1}^{R}\left(\hat{\varphi}_r(x_j) - \bar{\varphi}(x_j)\right)^2},$$

where $R$ is the number of replications, $\hat{\varphi}_r(x_j)$ is the estimated equated score at score point $x_j$ in replication $r$,

$$\bar{\varphi}(x_j) = \frac{1}{R}\sum_{r=1}^{R}\hat{\varphi}_r(x_j),$$

and $\varphi(x_j)$ is the true equated score, computed from the generated potential scores.
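A minimal sketch of these two measures, assuming an $R \times J$ array of estimated equated scores over replications and a length-$J$ vector of true values:

```python
# A minimal sketch of the evaluation measures: bias and SE per score point
# over R replications. eq_hat has shape (R, J); eq_true has length J.
import numpy as np

def bias_and_se(eq_hat, eq_true):
    bias = eq_hat.mean(axis=0) - eq_true  # mean deviation from the truth
    se = eq_hat.std(axis=0)               # spread across replications (1/R version)
    return bias, se

# Toy usage with R = 100 replications on a 0-20 score scale.
rng = np.random.default_rng(3)
eq_true = np.linspace(0, 20, 21)
eq_hat = eq_true + rng.normal(0, 0.3, (100, 21))
bias, se = bias_and_se(eq_hat, eq_true)
```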
5.4. Simulation Results—Design A
The bias of the PS-PSE and PS-CE estimators is presented in Figure 6. Note that for propensity score models with a misspecified link function and for those that fail to include the second-order term, the bias is very similar. Although not illustrated in the figure, their biases practically coincide with the biases of their correctly specified counterparts (the difference is less than 0.01 for each score point). This turns out to be a pattern that is present for both estimators, for all considered sample sizes, all evaluation measures, and both simulation designs.

Figure 6. The bias of the PS-PSE and PS-CE estimators (Design A).
As the upper part of Figure 6 illustrates, the PS-PSE estimators exhibit only a small bias for all scores, with the exception of the KE estimators with a propensity score model that leaves out a covariate. It is also noteworthy that it does not matter whether or not the covariates have been categorized; the biases for all estimators stay similar regardless. The estimators that misspecify the link function and those that leave out the second-order term show the best performance, with differences between them too small to be discerned in the figure. As these estimators more or less coincide with the estimator using a correctly specified model, the results suggest that the propensity score successfully balances the test groups for the PS-PSE estimator.
The lower part of Figure 6 depicts the bias for the PS-CE estimators. As for the PS-PSE estimators, misspecifying the link function (and leaving out the second-order term) yields small biases across the score range. There is a negligible difference between using categorized and uncategorized covariates in the propensity score model. The bias increases substantially when a covariate is left out, however, and grows particularly large for categorized covariates.
The SEs of the PS-PSE and PS-CE estimators are displayed in Figure 7.

Figure 7. The standard error of the PS-PSE and PS-CE estimators (Design A).
From Design A, we conclude that misspecifying the link function or failing to include a second-order term, for both the original covariates and the categorized versions of them, introduces far less error than failing to include a covariate in the propensity score model.
5.5. Simulation Results—Design B
The results of Design B are presented in Figures 8 and 9.
The bias of the PS-PSE and PS-CE estimators is displayed in Figure 8. The similarity with the biases in Design A is apparent. Once again, failing to include an important covariate leads to severe bias for both estimators; the results are particularly inaccurate for the PS-CE estimator with categorized covariates. The estimators with a misspecified link function and those that fail to include the second-order term remain robust in the presence of model misspecification.

Figure 8. The bias of the PS-PSE and PS-CE estimators (Design B).
The SEs, displayed in Figure 9, show a pattern similar to that of Design A.

Figure 9. The standard error of the PS-PSE and PS-CE estimators (Design B).
Similar to Design A, we conclude from Design B that the estimators with an incorrect link function and those that do not include the second-order term are relatively robust. The PS-CE estimator that fails to include one of the categorized covariates shows the overall worst performance. We also observe that the results of Design B are approximately proportional to those of Design A, possibly because both designs use the same type of covariates (uniformly distributed on the interval [1, 5]). However, Design B has a more intricate relationship between the covariates and the propensity score, as well as between the covariates and the test scores. As a result, the biases in Design B are roughly twice as large as those seen in Design A, and the SEs are inflated in a similar manner.
6. Discussion
The goal of this study was to investigate how sensitive the equated scores are to model misspecification of the propensity score when the propensity score is used to equate nonequivalent test groups. It has already been shown in Wallin and Wiberg (2019) that equating with propensity scores can reach precision and accuracy similar to equating with an anchor, and superior results compared to equating under a false assumption of equivalent groups. But since the results of Wallin and Wiberg (2019) are based on the assumption that the propensity score is known, which it typically is not in practical testing scenarios, it was crucial to study how sensitive these results are to model misspecification. The propensity score is a useful tool in research as it possesses the desirable feature of being a balancing score, which has led to its widespread application across various domains. However, its high degree of flexibility means that there are numerous modeling options available, emphasizing the need for careful scrutiny to determine when the propensity score can effectively balance test-taker groups and when it falls short.
The propensity score methods explored in this study demonstrate potential, as the equated scores remain insensitive to both link function misspecification and the omission of a second-order term in the estimation model. This applies to both linear (Simulation Design A) and nonlinear (Simulation Design B) relationships between covariates and outcomes. Notably, the model misspecifications resulted in a bias and SE (in rounded score terms) similar to the correctly specified models, signifying robustness of the equated scores to such errors in the propensity score model. On the other hand, the equated scores were negatively affected by a propensity score model that omitted a true confounding covariate. These conclusions remained the same for all considered sample sizes and for both simulation designs. The results therefore clearly point to the importance of using all pertinent information related to latent ability when using the propensity score as a proxy variable. This aligns with earlier research on the propensity score, which indicates that omitting a higher order term that exists in the actual model while estimating the propensity score does not result in biased estimates (Dehejia & Wahba, 1999; Drake, 1993; Stuart, 2010; Waernbaum, 2010, 2012). Incorporating all true confounding variables is linked to the unconfoundedness assumption that forms the foundation of the propensity score method for covariate balancing. Consistent with earlier research, it was found that this aspect is crucial in the equating context as well. As in Waernbaum (2010, 2012), we note that as long as the true propensity score is a function of the misspecified model, unbiased estimation of the parameter of interest is possible. We note that for Design B, the standard errors are fairly large but should be seen in relation to previous research that has shown that equating error and variability are even greater when falsely assuming equivalent groups (Wallin & Wiberg, 2019). A misspecification of the propensity score model when the relationship between the test scores and the covariates is nonlinear is thus a delicate scenario. Since reported scores are often used for individual-level decision making, the current results suggest that future research should carefully study nonlinear cases.
We emphasize that the quality of the ability balancing suggested in this article depends strictly on the quality of the auxiliary information. The restrictions that come with the data at hand need to be evaluated with the identifying Assumptions 1 and 2 in mind. Two examples of restrictions in the empirical data analyzed in this study are the limited number of covariates and the fact that the variable Age is only available in a categorized version. Since the proposed method has been shown to perform similarly to anchor test-based equating for this particular data set (Wallin & Wiberg, 2019), there is reason to believe that the current covariate restrictions have not reversed the results. In the case of propensity score-based equating, we advise seeking input from subject-matter experts concerning the testing program and the test groups to be equated. Additionally, we suggest conducting a comprehensive analysis of the associations between the collected covariates and the test scores. Since both the propensity score and the anchor test score are employed as proxies for ability, they can be evaluated using similar methods.
Some limitations of the current study include the following. We only considered two types of covariates, and future studies could expand on this, both by using a propensity score model that is a function of both discrete and continuous covariates and by varying the dependence structure between them. We emphasize, however, that the aim of this article was to study propensity score model misspecification; the misspecifications were thus the main focus, not different types of covariates. We therefore chose to vary the relationship between the treatment variable, the test scores, and the covariates, but not the covariates themselves. On this note, it should be pointed out that Assumptions 1 and 2 are strong, but of similar magnitude to the assumptions underlying NEAT equating. The results in both the original paper by Wallin and Wiberg (2019) and the current article furthermore suggest that there are several realistic test scenarios where propensity score stratification is a viable technique for sufficiently reducing ability imbalance. It would therefore be important to further investigate how sensitive the equating function is to violations of the propensity score assumptions. Studying the omission of a true confounder in the propensity score model can be considered a first step toward such an analysis, since this violates the unconfoundedness assumption in Assumption 1. A diagnostic tool would be of great use for such analyses in the future. In Online Appendix C, further simulation results are presented, considering both missing data and another case of model assumption violation. These results suggest that the PS-PSE estimator in particular is robust against certain missingness, but that bias is introduced when a subset of test-takers has a true propensity score equal to 1 (or, equivalently, equal to 0). Such a scenario could, for example, arise when there is an age restriction on the test in question and certain test-takers were not allowed to take the test in the previous administration. An empirical check of the propensity scores should therefore always be conducted.
It is worth mentioning that the outcomes of Simulation Design B demonstrate a proportional relationship with those of Simulation Design A. This is attributed to the intricate association among the covariates, the treatment variable, and the outcome in Design B, which is more complicated than that in Design A. In addition to these factors, there are testing programs that have access to both covariates and an anchor test. It would therefore be worth investigating if there is any additional gain by using both sources of information to control for ability differences. Incorporating both covariates and anchor test scores has been studied within the NEC design (Albano & Wiberg, 2019; Wiberg & Bränberg, 2015), but never when considering propensity scores. We expect this to improve the results, as demonstrated in the small example in Online Appendix C. Generalizing these results and quantifying the improvement would be a significant contribution to equating nonequivalent groups. Finally, this study has only considered parametric regression models to estimate the propensity score, and other existing methods should be examined in future research.
As a final note, we point out the recent critique that has been raised toward NEAT-based equating in San Martín and González (2022). With data being partially missing by design in the nonequivalent groups designs, the test score distributions, and thus the equating estimator, are not identified. Most methods, including the methods studied in this article, make identifying assumptions to estimate the score distributions. An alternative approach, suggested in San Martín and González (2022), is to use the theory of partial identification (Manski, 2009) to define identification regions for the equating function. This is a new perspective that we believe sheds light on the discussion of whether or not equating has any potential to report fair scores under nonequivalent groups designs; see, for example, Bolsinova and Maris (2016). Their approach could also serve as a useful tool to investigate the sensitivity of the identifying assumptions presented in this article.