Abstract
1. Introduction
Ensuring fairness in educational testing is a critical issue, especially in the context of large-scale assessments where decisions about examinees are often based on their test scores. The concept of fairness in testing typically revolves around comparability: Scores from different forms of the same test should be interchangeable. However, in practice, examinees might take different forms of the test, and ensuring that scores from these forms are comparable is the purpose of test score equating (Kolen & Brennan, 2014). This becomes particularly challenging in the presence of non-equivalent groups, where examinees are not randomly assigned to different forms, leading to systematic differences in the ability distributions of the groups. Such differences must be accounted for in the equating process to ensure that the resulting scores can be meaningfully compared.
In observed score equating, the goal is to determine comparable scores between two test forms. Depending on the data collection design, different equating methods are used. Two commonly applied designs are the equivalent groups (EG) design and the non-equivalent groups with anchor test (NEAT) design.
While the NEAT design is often preferred for equating as it can handle heterogeneous test groups, there are situations where no anchor test is available, particularly in large-scale assessments. For example, the Italian INVALSI test and earlier versions of the Swedish Scholastic Aptitude Test (SweSAT) lacked an anchor test despite having non-equivalent groups (Lyrén & Hambleton, 2011). In such cases, one possible solution is to use a non-equivalent groups with covariates (NEC) design.
Common to all test equating methods is that they aim to fulfill certain equating criteria, such as equity and population invariance. However, many of these methods struggle in real-world scenarios, partly because they have not taken full account of the equating criteria when defining score equivalence (van der Linden & Wiberg, 2010). Lord’s equity requirement is particularly important here, as it states that the equated scores from two test forms should be indistinguishable for examinees with the same latent ability
Using covariates in equating is not a new idea (Kolen, 1990), and several researchers have explored their use in matching or as complementary information (Cook et al., 1990; N. J. Dorans et al., 2008; Hsu et al., 2002; Liou et al., 2001; Longford, 2015; Wright & Dorans, 1993). The propensity score (Rosenbaum & Rubin, 1983), which is a scalar function of the covariates, has also been applied to test equating in a limited capacity. For example, Livingston et al. (1990) were the first to use propensity scores in test equating, while Paek et al. (2006) used propensity score matching to develop a linking relationship between the PSAT and SAT tests. Furthermore, Sungworn (2009) used propensity scores based on collateral information to improve poststratification equating. Other researchers, including Moses et al. (2010) and Powers (2010), examined the potential for propensity scores to reduce equating biases by combining anchor test scores or demographic information. Recently, Wallin and Wiberg (2019) proposed using propensity scores in the NEC design; however, they did not examine the possibility of using them in local equating.
Despite these developments, the use of propensity scores in local equating has not been explored. This paper aims to fill this gap by introducing two new methods: propensity score stratification and inverse probability weighting (IPW) for local equating. Both of these methods have been widely applied in various areas, such as epidemiology (Austin, 2008; Hernán & Robins, 2006), sociology (Pais, 2011; Thoemmes & Kim, 2011), and economics (Huber, 2015; Vikström, 2017), to estimate causal effects from observational data. Their popularity stems from their intuitive appeal and their ability to reduce multidimensional covariates to a single scalar summary. Here, these methods are designed to address non-equivalent groups when no anchor test is available, using propensity scores as a proxy for latent ability differences between the test groups to be equated. Propensity score stratification divides examinees into strata based on similar propensity scores, aiming to make the groups balanced in terms of the covariates within each stratum. IPW assigns weights to examinees inversely proportional to their probability of group membership, adjusting for population differences across all score levels. Both methods rely on propensity scores, which represent the conditional probability of group membership based on observed covariates. By using propensity scores as proxies for the unobserved latent ability, we extend local equating to cases where traditional anchor-based methods cannot be applied.
The first method, propensity score stratification, partitions examinees into strata based on their estimated propensity scores. Within each stratum, examinees are assumed to be comparable in terms of their latent ability, allowing for local equating to take place within these balanced groups. This approach ensures that the equating process accounts for the observed differences between groups, conditional on the covariates used to estimate the propensity scores. Stratification on the propensity score is particularly useful in situations where covariates can capture a significant portion of the variability in the latent ability distributions between groups. By equating within each stratum, this method aims to fulfill Lord’s equity requirement, ensuring that examinees with similar propensity scores receive equitably equated test scores. The second method, IPW, takes a different approach by assigning weights to each examinee based on the inverse of their propensity score. In this method, the propensity score serves as a weight that adjusts for the differential likelihood of group membership across test forms. By reweighting the sample, IPW effectively creates a pseudo-population in which the distributions of covariates are balanced between the test forms. To evaluate the proposed methods, we first present an empirical study that applies them to real test data, illustrating their practical utility. We then conduct an extensive simulation study in which we generate test data under various conditions to evaluate the methods’ performance.
The structure of this paper is as follows: Section 2 introduces local equating and the equating estimators considered. Section 3 reports findings from an empirical illustration, and Section 4 presents the results of the simulation study. Finally, the paper concludes with a discussion of the results and future research directions.
2. Local Equating
2.1. Notation and Background
Let
where
For this criterion to be meaningful, the equating transformation must satisfy certain identification conditions. Specifically, we require:
This condition ensures that the equating transformation preserves the ordering of scores within each ability level, making it a proper monotonic transformation. Together, Lord’s equity criterion and this identification condition imply that the equated scores will be indistinguishable from those on the reference form for examinees at each ability level.
Traditional equating methods use a single transformation across the entire population, which averages across ability levels. The transformation thus represents a compromise at each ability level, which inevitably introduces bias. The resulting equated scores will be influenced by the shape of the ability distribution, leading to population-dependent transformations. Local equating addresses these limitations by focusing on the conditional distributions of the test scores given
mapping the percentiles of
In practice, however,
where
where
By conditioning on proxies for latent ability, local equating improves on traditional methods by reducing bias and population dependence. As demonstrated in previous research (van der Linden & Wiberg, 2010; Wiberg & van der Linden, 2011), traditional methods like chain equating and poststratification equating can exhibit substantial bias due to their reliance on marginal distributions that do not account for examinee differences in ability. In contrast, local equating maintains Lord’s equity criterion by ensuring that the equating transformation is tailored to the specific ability level of each examinee, thereby producing more accurate scores.
In this paper, we assume that the true relationship between the test forms is linear, meaning that the equating process only needs to match the means and variances of the two distributions. Specifically, the family of true equating transformations is given by
where
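Because the assumed true equating family is linear, matching the first two moments of the score distributions suffices. The standard mean-sigma form of this transformation can be sketched as follows; this is a minimal Python illustration (the paper's analyses were run in R), assuming the two score samples are simply given as lists:

```python
from statistics import mean, pstdev

def linear_equate(x, scores_x, scores_y):
    """Map a score x on form X to the scale of form Y by matching the
    first two moments: phi(x) = mu_Y + (sigma_Y / sigma_X) * (x - mu_X)."""
    mu_x, mu_y = mean(scores_x), mean(scores_y)
    sd_x, sd_y = pstdev(scores_x), pstdev(scores_y)
    return mu_y + (sd_y / sd_x) * (x - mu_x)
```

With `scores_y` having twice the mean and spread of `scores_x`, a form-X score one unit above its mean maps to a form-Y score two units above the form-Y mean.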
2.2. Anchor-Based Local Equating
Anchor-based equating is a well-established method in test equating, relying on a set of common items (the anchor test) administered to all examinees regardless of which test form they take. The anchor test score
where
The key assumption underlying anchor-based equating is that, despite this measurement error, the anchor score contains sufficient information about ability differences to enable valid equating. Formally, we assume that conditional on the anchor score
This assumption implies that any dependence between scores on forms
where
The conditional means are estimated by
where
This local approach to calculating standard deviations ensures that the equating transformation adapts to the variability specific to each anchor score, rather than pooling data across all anchors.
This method provides a straightforward approach to equating when a common set of items is available. However, its effectiveness relies heavily on the quality of the anchor test and its relationship to the main test forms. In our study, we will use this method as a benchmark for comparing the performance of the propensity score-based methods.
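The anchor-conditional version of the linear transformation can be sketched as below. This is an illustrative Python sketch, assuming each examinee record is a (total score, anchor score) pair and that conditioning is implemented by exact-match subsetting on the anchor value:

```python
from statistics import mean, pstdev

def local_linear_equate(x, a, data_x, data_y):
    """Equate score x on form X to the form-Y scale, conditioning on
    anchor score a.  data_x and data_y hold (total score, anchor score)
    pairs for the examinees taking forms X and Y, respectively."""
    sub_x = [s for s, anc in data_x if anc == a]  # form-X scores at this anchor level
    sub_y = [s for s, anc in data_y if anc == a]  # form-Y scores at this anchor level
    mu_x, mu_y = mean(sub_x), mean(sub_y)
    sd_x, sd_y = pstdev(sub_x), pstdev(sub_y)
    return mu_y + (sd_y / sd_x) * (x - mu_x)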
3. Propensity Score-Based Equating
When no anchor test is available, but covariates are present, equating may be performed using propensity scores. The propensity score
In this case, the propensity score serves as a proxy for
3.1. Propensity Score Stratification
Propensity score stratification is a widely used method in observational studies to reduce bias (Rosenbaum & Rubin, 1984). This method aims to create subgroups of subjects with similar propensity scores, thereby approximating the balance achieved by randomization in experimental studies. The propensity score serves as a balancing score: when subjects are grouped based on similar propensity scores, the distribution of observed baseline covariates is expected to be similar between treated and untreated subjects within each stratum, mimicking a randomized experiment within these subgroups.
Let
The effectiveness of stratification relies on achieving balance within each stratum. Within each stratum
indicating that, conditional on stratum membership, the probability of test form assignment becomes independent of the covariates. When this balance condition holds, examinees within each stratum have similar distributions of covariates regardless of their test form assignment. The choice of the number of strata
Within each stratum
where
This yields a family of local equating functions, where scores are transformed according to the stratum-specific parameters.
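The stratification step and the stratum-specific linear equating can be sketched as follows. This is a hedged Python illustration: the propensity scores are assumed to have been estimated already (e.g., by logistic regression on the covariates), strata are formed by rank-based quantile cuts, and the within-stratum transformation reuses the mean-sigma form:

```python
from statistics import mean, pstdev

def quantile_strata(ps, k):
    """Assign each estimated propensity score to one of k quantile-based
    strata, labeled 0..k-1, so that stratum sizes are as equal as possible."""
    ranked = sorted(range(len(ps)), key=lambda i: ps[i])
    strata = [0] * len(ps)
    for rank, i in enumerate(ranked):
        strata[i] = min(rank * k // len(ps), k - 1)
    return strata

def stratum_equate(x, stratum, scores_x, strata_x, scores_y, strata_y):
    """Linear equating of score x from form X to form Y, using only the
    examinees that fall in the given stratum."""
    sub_x = [s for s, g in zip(scores_x, strata_x) if g == stratum]
    sub_y = [s for s, g in zip(scores_y, strata_y) if g == stratum]
    return mean(sub_y) + (pstdev(sub_y) / pstdev(sub_x)) * (x - mean(sub_x))
```

The rank-based cut is one common convention; ties and unequal group sizes may be handled differently in an operational implementation.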
3.1.1. Assumptions and Properties
The validity of the stratified propensity score approach relies on several assumptions:
Under these assumptions, the consistency of the stratum-specific estimators can be established. As the sample size increases, the estimated means and standard deviations converge in probability to their population counterparts, and by the continuous mapping theorem, the estimated equating transformation within each stratum converges to the true stratum-specific equating relationship:
where
The stratification approach offers several advantages. First, it allows for heterogeneous equating relationships across different regions of the propensity score distribution. Second, the quantile-based stratification ensures balanced stratum sizes. However, the method’s effectiveness depends critically on proper covariate selection, careful diagnostic assessment of balance within strata, and sufficient sample sizes to support local estimation.
3.2. Inverse Probability Weighting
IPW is a statistical technique used in causal inference and observational studies to adjust for confounding and selection bias. The method was first introduced by Horvitz and Thompson (1952) in the context of survey sampling and has since been widely applied in various fields, including epidemiology, economics, and social sciences. The core idea of IPW is to create a pseudo-population where the distribution of confounding variables is balanced between treatment groups (or, in our case, test forms). This is achieved by assigning weights to individual observations, with the weights being inversely proportional to the probability of receiving the treatment (or taking a particular test form) given the observed covariates.
In a general setting, let
In the context of test equating, we adapt this general framework to address the challenge of comparing scores from different test forms.
3.2.1. IPW for Test Equating
As a second propensity score-based method, we propose a stratified IPW approach for local equating. First, we estimate propensity scores
where
is the proportion of examinees in stratum
where
Within each stratum
and the weighted standard deviations are
The IPW equating transformation is then defined locally within each stratum as
This yields a family of stratum-specific equating functions that transform scores on test form
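The weighted moments and the resulting within-stratum transformation can be sketched as below. This Python sketch uses the textbook IPW weights (1/e(z) for form-X examinees and 1/(1 − e(z)) for form-Y examinees, where e(z) is the estimated propensity of taking form X); the paper's exact weight definition, including any stabilization or truncation, may differ:

```python
import math

def weighted_mean(scores, w):
    return sum(wi * s for wi, s in zip(w, scores)) / sum(w)

def weighted_sd(scores, w):
    m = weighted_mean(scores, w)
    return math.sqrt(sum(wi * (s - m) ** 2 for wi, s in zip(w, scores)) / sum(w))

def ipw_equate(x, scores_x, ps_x, scores_y, ps_y):
    """IPW linear equating within one stratum.  ps_x and ps_y hold each
    examinee's estimated propensity of taking form X, so form-X examinees
    get weight 1/e(z) and form-Y examinees get weight 1/(1 - e(z))."""
    w_x = [1.0 / p for p in ps_x]
    w_y = [1.0 / (1.0 - p) for p in ps_y]
    mu_x, mu_y = weighted_mean(scores_x, w_x), weighted_mean(scores_y, w_y)
    sd_x, sd_y = weighted_sd(scores_x, w_x), weighted_sd(scores_y, w_y)
    return mu_y + (sd_y / sd_x) * (x - mu_x)
```

When all propensity scores equal 0.5, the weights are constant and the transformation reduces to unweighted linear equating, which is a useful sanity check.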
3.2.2. Assumptions and Properties
As with the propensity score stratification method, the stratified IPW approach to local equating relies on the assumptions of positivity, unconfoundedness, and correct model specification. Under these assumptions, the consistency of the stratum-specific estimator can be established. For any stratum
where
The IPW-stratified approach offers several advantages over global weighting. First, it allows for heterogeneous equating relationships across different regions of the propensity score distribution, capturing potential effect modification by covariates. Second, by conducting weight truncation within strata, we can better control the influence of extreme weights while preserving local equating relationships. However, this flexibility comes at the cost of requiring sufficient sample sizes within each stratum to ensure stable estimation of the local equating functions.
4. Model Generalizations
The proposed framework for equating using propensity scores can be extended in several ways to accommodate various test structures and equating requirements. Here, we present some generalizations that broaden the applicability of our method.
4.1. Equipercentile Equating
While our primary focus has been on linear equating, the framework can be extended to equipercentile equating. Equipercentile equating assumes that scores on different forms of a test should have the same percentile ranks in the populations of examinees taking each test form. Let
To implement this within our IPW framework, we estimate
where
If we define the smoothed version of the equipercentile equating transformation as
then
This limit is taken at the population level (hence deterministic); when the population moments are replaced by their sample estimates, the same result holds in probability.
This relationship highlights the flexibility of our framework in accommodating different equating methods.
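The weighted equipercentile mapping can be sketched as follows. This is a deliberately simplified, discrete Python illustration: operational equipercentile equating continuizes the percentile ranks (e.g., with the ±0.5 convention) and smooths the distributions, whereas here the weighted empirical CDFs are used directly:

```python
def wecdf(t, scores, w):
    """Weighted empirical CDF: the weighted proportion of scores <= t."""
    return sum(wi for s, wi in zip(scores, w) if s <= t) / sum(w)

def equipercentile(x, scores_x, w_x, scores_y, w_y):
    """Map x to the smallest form-Y score whose weighted CDF reaches the
    weighted percentile rank of x on form X."""
    p = wecdf(x, scores_x, w_x)
    for y in sorted(set(scores_y)):
        if wecdf(y, scores_y, w_y) >= p:
            return y
    return max(scores_y)
```

With unit weights, this reduces to ordinary discrete equipercentile equating, matching the limiting relationship described above.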
4.2. Mixed-Format Tests
Our framework can be extended to handle mixed-format tests, including not only binary items but also nominal or ordinal items. Let
For ordinal items, we can employ ordinal logistic regression:
where
where
5. Empirical Analysis
To empirically illustrate the proposed local equating methods and compare them with existing methods, two consecutive forms of the SweSAT were used. The SweSAT is a paper-and-pencil college admission test with 160 multiple-choice binary-scored items, consisting of a quantitative section of 80 items and a verbal section of 80 items that are equated separately, and it is given twice a year. Since 2011, an anchor test has been included, but previously, different groups with specific values on their covariates were a major part of the equating process. For details about previously used equating methods for the SweSAT, see Lyrén and Hambleton (2011). Note that although anchor tests are used in the equating, covariates are still important to examine and use in the equating process to adjust for non-equivalent test-taker groups. The job market strongly influences which test-takers take the test and how large the test-taking group is in any given year. If there is an economic recession and the unemployment rate is high, more test-takers (with diverse backgrounds) tend to take the SweSAT than when the unemployment rate is low. Furthermore, Lyrén and Hambleton (2011) found clear signs that the equivalent groups design assumptions were violated for the SweSAT. The empirical study was carried out in R (R Core Team, 2024).
The quantitative new test form

The test score distributions for the analyzed SweSAT data.
In this analysis, we made use of the covariates gender (0 = female, 1 = male), age, and test scores from the verbal section (range: 0–80). A summary of these covariates is given in Table 1. The choice of used covariates depended both on availability of covariates from SweSAT administrations and the fact that these covariates have been used successfully when calculating propensity scores in previous studies, for example, Wallin and Wiberg (2019). To implement the stratification, we evaluated different numbers of strata through covariate balance diagnostics rather than traditional model fit assessments. The absolute standardized mean difference (ASMD) served as our primary balance metric:
with values below 0.1 indicating satisfactory balance between test forms within a stratum. Through iterative evaluation, we determined that 20 strata provided the best balance while maintaining adequate sample sizes within each stratum. The observed ASMDs across strata ranged from 0.008 to 0.276 for verbal ability, 0.001 to 0.282 for age, and 0.002 to 0.386 for gender. Satisfactory balance (ASMD < 0.1) was achieved in 30%, 75%, and 45% of strata for these covariates, respectively. Complete ASMD values for all covariates and strata are provided in Table 2. To ensure the robustness of our results, we conducted sensitivity analyses by varying the number of strata, confirming that slight changes in stratification did not substantially impact the equated scores.
Summary Statistics and Correlations for Test Forms
Values in parentheses correspond to test form
For the Age variable, the statistics presented are median and quartile deviation. Age correlations are Spearman coefficients, and Gender correlations are point-biserial. All other measures are means and standard deviations.
Absolute Standardized Mean Differences by Stratum and Covariate
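The ASMD balance check described above can be sketched as follows. Since the displayed formula is not reproduced here, this Python sketch uses the common pooled-standard-deviation denominator as an assumption; the threshold of 0.1 follows the text:

```python
import math

def asmd(cov_x, cov_y):
    """Absolute standardized mean difference for one covariate between the
    two test-form groups within a stratum, using a pooled-SD denominator
    (an assumed convention; the paper's exact formula may differ)."""
    m_x = sum(cov_x) / len(cov_x)
    m_y = sum(cov_y) / len(cov_y)
    v_x = sum((c - m_x) ** 2 for c in cov_x) / (len(cov_x) - 1)
    v_y = sum((c - m_y) ** 2 for c in cov_y) / (len(cov_y) - 1)
    return abs(m_x - m_y) / math.sqrt((v_x + v_y) / 2)

def balanced(cov_x, cov_y, threshold=0.1):
    """Satisfactory within-stratum balance is declared when ASMD < 0.1."""
    return asmd(cov_x, cov_y) < threshold
```

Running this per covariate and per stratum reproduces the kind of diagnostics summarized in Table 2.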
5.1. Results
5.1.1. Anchor-Based Method
Figure 2 illustrates two plots of the equated scores using the anchor-based method, based on five selected percentiles of the anchor score. The first plot (a) shows the estimated equating functions, and the second plot visualizes the same equating functions but with the raw scores subtracted from each equated value to highlight the differences between the function values.

The estimated equating functions, conditioning on different values of the anchor score. The 10th percentile corresponds to an anchor score of 5, the 30th percentile to 8, the 50th percentile to 11, the 70th percentile to 14, and the 90th percentile to 18. Panel (a) shows the equated scores, and Panel (b) shows the difference between equated scores and raw scores. (a) The estimated equated scores for five different anchor scores. (b) The estimated equated scores, with the unadjusted raw scores subtracted, for five different anchor scores.
The equating functions are relatively close to each other, indicating that the differences between the selected percentiles of the anchor do not result in large differences across most of the score range. This suggests that the examinees from different anchor percentiles follow a relatively consistent pattern of score distribution. In the second plot (b), the equating functions minus the raw scores are displayed. The plot reveals that for lower and higher scores, the equated values exhibit some variation depending on the anchor score percentile. The equated scores conditioning on the 50th and 70th anchor percentile are slightly more similar to each other compared to the other lines, as are the equated scores conditioning on the 10th and 30th anchor percentile. For the mid-range score values, the equated scores for all considered anchor scores are fairly similar.
5.1.2. Propensity Score Stratification Method
In Figure 3, the distribution of propensity scores within the treatment groups and across different strata is illustrated. The first plot (a) shows the distribution of propensity scores for the two test groups. There is a clear overlap between the groups in terms of the propensity score distributions. The second plot (b) illustrates boxplots of propensity scores across 20 strata.

The estimated propensity scores. (a) The estimated density function of the estimated propensity scores for each test group. (b) The estimated and stratified propensity scores across 20 strata.
In Figure 4, the estimated equated scores for the equating method based on propensity score stratification are illustrated for five selected percentiles of the stratified propensity scores. The left-hand side shows the equated scores, and the right-hand side shows the equated scores with the raw scores subtracted. Similar to the anchor-based method, the equated scores differ depending on the value of the propensity score we condition on. From Figure 4b, it is evident that the equated scores conditioning on the 30th, 50th, and 90th percentiles of the stratified propensity score are very similar, and that the equated scores for the 10th and 70th percentiles are more similar to each other, although still clearly different.

The estimated equating functions, conditioning on different values of the estimated and stratified propensity score. The 10th percentile corresponds to stratum 2, the 30th percentile to stratum 6, the 50th percentile to stratum 10, the 70th percentile to stratum 14, and the 90th percentile to stratum 18. Panel (a) shows the equated scores, and Panel (b) shows the difference between equated scores and raw scores. (a) The estimated equated scores for five different estimated and stratified propensity scores. (b) The estimated equated scores, with the unadjusted raw scores subtracted, for five different estimated and stratified propensity scores.
5.1.3. IPW Method
In Figure 5, the distribution of the weights used in the IPW-based estimator is displayed for five selected percentiles. The median for all percentile groups is close to 1, indicating that little correction was necessary. Specifically, weights close to one mean that those examinees have estimated propensity scores that align well with their treatment (i.e., test form) assignment. For the 10th percentile group, the weights cover quite a wide range, with examinees being both up-weighted and down-weighted. Most examinees have weights in the range of approximately 0.75 to 1.25.

The distribution of the weights used in the IPW-based equating method, across five groups defined by the 10th, 30th, 50th, 70th, and 90th percentile of the estimated and stratified propensity score.
In the 30th percentile group, the distribution of weights remains tightly centered around 1, with limited spread. This suggests that for this group, the propensity scores closely match the test form assignment probabilities, resulting in minimal reweighting. The 50th, 70th, and 90th percentile groups show a broader range of values, with some weights considerably higher than 1, indicating more substantial up-weighting of certain examinees. This increased spread reflects greater variability in the propensity scores relative to the form assignment in these groups, and suggests that these groups include examinees whose propensity of form assignment deviates more from the assignment observed in the data.
In Figure 6, the estimated equated scores are illustrated for the IPW-based method. As for the two other methods, the equating transformations are very similar in the mid-range of the score scale but differ clearly at the lower and higher ends. The equated scores, when conditioning on the 30th and 70th percentiles of the propensity score, are close to each other, and the 10th percentile curve has a similar slope, whereas the 50th and 90th percentile lines both have negative slopes (Figure 6).

The estimated IPW-based equating functions, conditioning on different values of the estimated and stratified propensity score. The 10th percentile corresponds to stratum 2, the 30th percentile to stratum 6, the 50th percentile to stratum 10, the 70th percentile to stratum 14, and the 90th percentile to stratum 18. Panel (a) shows the equated scores, and Panel (b) shows the difference between equated scores and raw scores. (a) The estimated equated scores for five different estimated and stratified propensity scores. (b) The estimated equated scores, with the unadjusted raw scores subtracted, for five different estimated and stratified propensity scores.
5.1.4. Comparison Between the Methods
In Figure 7, we compare the anchor-based and propensity score-based equating functions within anchor-defined strata. Specifically, the sample was partitioned into three groups based on anchor scores (Low = 0%–33%, Medium = 33%–67%, High = 67%–100%), and separate anchor-based equating functions were estimated within each stratum. We then evaluated the corresponding propensity score-based equating functions in the stratum whose median estimated propensity score fell within the same anchor tertile, so that both approaches are applied under parallel conditions. The vertical axis represents the equated score minus the anchor-based score; the horizontal axis is the observed test score. In each panel, the line with circle markers depicts the stratified propensity score equating function in that stratum, and the line with triangle markers depicts the corresponding IPW equating function. Horizontal solid lines indicate the Difference That Matters (DTM) threshold for number-correct scoring (N. Dorans & Feigenbaum, 1994).

The difference in equated scores between the propensity score-based methods and the anchor-based method, computed within anchor-defined strata (Low, Medium, High). The horizontal solid lines represent the DTM threshold.
Both propensity score-based estimators yield nearly identical results, with deviations remaining small across the full score range in all three tertiles. In the High anchor tertile, equated scores fall within the DTM bounds for the vast majority of observed scores. By contrast, in the Low and Medium anchor tertiles, differences exceed the DTM threshold across much of the score scale, indicating that the differences between the propensity score-based and anchor-based equating methods are greatest when anchor difficulty is at the lower or intermediate levels. This pattern arises because examinees in the High anchor tertile are more homogeneous in both ability and covariates, so the anchor-based and propensity score-based methods operate under nearly equivalent conditions. At lower anchor levels, greater heterogeneity remains, leading to larger discrepancies between the two approaches.
6. Simulation Study
6.1. Design
To evaluate the performance of the proposed equating methods, we conducted a comprehensive simulation study. We generated item response data using a two-parameter logistic (2PL) model, where the probability of a correct response for examinee
with
We included three covariates, each intended to represent observable characteristics commonly available in empirical testing contexts, such as age group, gender, or scores in other domains. These types of variables are typically reported in ordered categorical form (e.g., age bands, educational attainment levels, grouped scores), and our simulation design aims to reflect this structure. To generate these covariates in a way that captures their potential relationship with latent ability, each was constructed by summing a small number of binary indicators. For example, a covariate with four categories was generated by summing three binary variables, each simulated using a 2PL model conditional on
We note that this strategy differs from using a polytomous item response model (e.g., the graded response model). Instead of simulating a single categorical response, we model a collection of binary variables whose outcomes depend on ability, then aggregate them to obtain the covariate. This method yields observed covariates with an ordinal structure, as seen in practice, introduces a tunable correlation between the covariate and latent ability, and allows the covariates to function as observable proxies for the unmeasured confounding induced by
True propensity scores were generated using a logistic model incorporating both the standardized anchor test score (
where
This design allows us to evaluate the equating methods under various realistic testing scenarios. We emphasize that the true latent abilities
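The data-generating steps above can be sketched as follows. This is a minimal Python illustration (the paper's simulation was run in R), and the item parameters and single-indicator covariate construction shown here are illustrative choices, not the paper's exact settings:

```python
import math
import random

def sim_2pl(theta, a, b, rng):
    """One binary response from a 2PL model:
    P(correct) = 1 / (1 + exp(-a * (theta - b)))."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return 1 if rng.random() < p else 0

def sim_examinee(theta, items, n_indicators, rng):
    """Simulate one examinee: a sum score over 2PL items, plus an ordinal
    covariate built by summing binary 2PL indicators that also depend on
    theta, so the covariate is correlated with latent ability."""
    score = sum(sim_2pl(theta, a, b, rng) for a, b in items)
    covariate = sum(sim_2pl(theta, 1.0, 0.0, rng) for _ in range(n_indicators))
    return score, covariate

rng = random.Random(1)
items = [(1.2, -0.5), (0.8, 0.0), (1.0, 0.7)]  # illustrative (a, b) pairs
score, covariate = sim_examinee(0.3, items, 3, rng)
```

Raising the discrimination of the covariate indicators strengthens the covariate-ability correlation, which is the tunable quantity varied across simulation scenarios.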
Three equating methods were compared: the traditional anchor-based equating method, the propensity score stratification equating method, and the IPW equating method. We conducted 500 replications for each scenario considered. The simulation study was carried out in R (R Core Team, 2024), and the code can be obtained upon request from the corresponding author.
6.2. Evaluation Measures
For each simulation replication, we sample examinees from both populations, extracting their total scores, anchor scores, covariate scores, and latent abilities. Based on the sampled data, we estimate the equating transformations for the three methods: anchor-based equating, propensity score stratification, and IPW. To evaluate bias, we partition the sample of examinees from test form
The true equated score for an examinee in bin
which transforms the observed score
where the expectation is taken over the distribution of the observed score
To estimate the bias in practice, we approximate
where
In addition to bias, we evaluate the RMSE of the equating methods. The RMSE for method
where the expectation is again taken over the distribution of
To estimate the RMSE in practice, we approximate the expectation using the average squared differences over multiple Monte Carlo replications. The RMSE for each method is obtained by computing the square root of the mean squared differences between the true equated scores and the estimated equated scores over all replications and examinees within each
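The Monte Carlo bias and RMSE estimates can be sketched as below; this Python sketch assumes the true and estimated equated scores for one theta bin have already been pooled across replications and examinees into paired lists:

```python
import math

def bias_rmse(true_eq, est_eq):
    """Monte Carlo bias and RMSE for one theta bin, given paired lists of
    true and estimated equated scores pooled over replications/examinees."""
    diffs = [e - t for t, e in zip(true_eq, est_eq)]
    bias = sum(diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return bias, rmse
```

For example, estimates of 11 and 9 against true values of 10 and 10 give zero bias but an RMSE of 1, illustrating why both measures are reported.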
6.3. Results
Figure 8 shows the bias and RMSE under the weak correlation scenario, for selected

The performance in terms of average bias and RMSE of anchor-based, propensity score-based, and IPW-based equating methods across sum scores for selected theta values for
In Figure 8, the three methods yield very similar results across most of the score range, especially the propensity score stratification and IPW methods, which are nearly indistinguishable in both bias and RMSE. The average absolute difference between these two methods is only 0.08 for both metrics, with no substantial divergence at any particular score level. The anchor method also performs similarly in many regions. Differences emerge for certain scores, especially at the lower end of the score range where the anchor method exhibits higher bias and RMSE. Unlike the propensity score-based methods, which directly account for covariate information, the anchor method relies solely on the assumption that examinees with the same anchor scores have similar abilities across forms. When the correlation between covariates and ability weakens, the conditional distribution of abilities given anchor scores may differ more substantially across the two populations, particularly at the extremes of the score distribution. This explains why the anchor method’s performance is slightly worse at the lower end of the score range, where sample sizes are typically smaller and estimation is inherently more challenging. Interestingly, despite not directly incorporating covariates in its equating procedure, the anchor method’s performance is indirectly affected by covariate correlation changes through the form assignment mechanism. When covariates have weaker correlations with ability, the propensity model for form assignment becomes more heavily influenced by the anchor score alone, potentially creating more systematic differences in ability distributions between forms that are inadequately captured by the anchor items. This highlights an important characteristic of non-equivalent group equating designs: The anchor method’s effectiveness depends not only on the anchor items themselves but also on the underlying mechanisms determining sample selection into different forms.
In Figure 9, panel (a) shows the bias for the anchor, propensity score stratification, and IPW equating methods under the medium correlation scenario. The results are similar to those in the weak correlation setting; however, both bias and RMSE are slightly smaller, which is to be expected.

Figure 9: Average bias and RMSE of the anchor-based, propensity score stratification, and IPW-based equating methods across sum scores for selected theta values in the medium correlation scenario.
7. Discussion
The objective of this study was to explore the use of covariates in local test score equating, particularly when no anchor test is available. We introduced two propensity score-based methods—propensity score stratification and IPW—as alternatives to traditional anchor-based equating methods. Through both empirical and simulation studies, we aimed to assess the effectiveness of these methods in producing fair and comparable test scores across different test forms administered to non-equivalent groups.
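To make the IPW approach concrete, the sketch below illustrates the general idea under simplifying assumptions: examinees are reweighted by the inverse of their estimated propensity of form assignment (targeting the combined population, a conventional choice), and the weighted means and standard deviations then define a linear equating transformation. The function names are ours and the snippet omits the propensity model estimation step; it is a minimal illustration, not the operational procedure of the study.

```python
import numpy as np

def ipw_weights(ps, took_x):
    """Inverse-probability weights targeting the combined population.

    ps     : estimated propensity of taking form X
    took_x : boolean indicator, True if the examinee took form X
    """
    return np.where(took_x, 1.0 / ps, 1.0 / (1.0 - ps))

def weighted_moments(scores, w):
    """Weighted mean and standard deviation of a score vector."""
    w = w / w.sum()
    mu = np.sum(w * scores)
    sd = np.sqrt(np.sum(w * (scores - mu) ** 2))
    return mu, sd

def ipw_linear_equate(x, scores_x, w_x, scores_y, w_y):
    """Linear equating of a form-X score x onto the form-Y scale,
    using IPW-adjusted first and second moments."""
    mu_x, sd_x = weighted_moments(scores_x, w_x)
    mu_y, sd_y = weighted_moments(scores_y, w_y)
    return mu_y + (sd_y / sd_x) * (x - mu_x)
```

Note that with equivalent groups (a constant propensity of 0.5) the weights are uniform, and the transformation reduces to ordinary linear equating.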
From the empirical analysis using data from the SweSAT, we observed that both propensity score-based methods produced equated scores that varied depending on the percentile of the propensity score conditioned upon. This result is in line with previous local equating studies in which other proxies were used to capture differences between groups (van der Linden & Wiberg, 2010; Wiberg & van der Linden, 2011; Wiberg et al., 2014). This variation was particularly noticeable at the lower and higher ends of the score scale, while the equated scores were relatively similar in the mid-range. The anchor-based method exhibited a similar pattern, suggesting that conditioning on either an anchor score or estimated propensity scores can capture variations in examinee abilities across different score levels. The propensity score stratification method showed equated scores that were closely aligned for the 30th, 50th, and 90th percentiles, indicating consistency in the mid to higher ability ranges. The IPW method also demonstrated stability in the mid-range but showed differences at the extremes of the score distribution.
In the simulation study, we evaluated the performance of the proposed methods under various conditions, varying factors such as the strength of the correlation between the covariates and the latent ability, and the sample size. Note that different levels of correlation have been examined neither in previous local equating studies (van der Linden & Wiberg, 2010; Wiberg & van der Linden, 2011) nor when equating with propensity scores (Wallin & Wiberg, 2019). The results indicated that when the correlation between the covariates and the test scores was weak, the IPW method consistently exhibited smaller bias than the anchor-based and propensity score stratification methods across most of the score range. The bias and RMSE for the IPW method were relatively stable, suggesting robustness in scenarios where covariates are not strongly predictive of the latent ability. When the correlation between the covariates and the test scores was medium, the performance of the anchor-based method improved, showing decreased bias at higher levels of latent ability. The propensity score stratification method exhibited increasing bias and RMSE across the score scale for a given ability level, but still performed better than the IPW method. Surprisingly, the IPW method performed slightly worse in the medium correlation scenario than in the weak correlation scenario, although it still maintained relatively low bias and RMSE.
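The design factor of covariate–ability correlation can be mimicked with a simple bivariate-normal draw, as sketched below. The function name, seed, and unit-variance parameterization are ours for illustration; they are not the study's actual data-generating model.

```python
import numpy as np

def simulate_ability_covariate(rho, n, seed=0):
    """Draw n pairs (theta, z): latent ability theta and a single
    covariate z with population correlation rho (both standard normal)."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho],
                    [rho, 1.0]])
    draws = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    return draws[:, 0], draws[:, 1]
```

Repeating such draws over a grid of correlation values and sample sizes, and then assigning forms through a propensity model on z, reproduces the general structure of simulation conditions of this kind.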
The findings suggest that both the propensity score stratification and IPW methods may offer viable alternatives to anchor-based equating in situations where an anchor test is unavailable but covariates are available. The stability of the propensity score-based methods across different levels of correlation indicates their potential for adjusting for group differences when covariates do not perfectly capture the variations in ability between groups.
The results align with previous research that has explored the use of covariates and propensity scores in equating. Wiberg and Bränberg (2015) demonstrated that covariate-adjusted equating methods could effectively account for group differences in the NEC design. Our study extends this work by applying propensity score methods within the local equating framework, addressing the challenge of fulfilling Lord’s equity requirement in the absence of an anchor test. Additionally, our findings corroborate those of Wallin and Wiberg (2019), who proposed the use of propensity scores in the NEC design and highlighted the potential of these methods to reduce equating bias.
Although local equating procedures may produce different conversion functions for different subgroups, the idea of stratified reporting is not new in testing programs, as several of them already report scores conditional on subgroup membership (e.g., by grade level, test version, or test-taking language). Also, our proposed local equating methods can be implemented to obtain conversion functions for a manageable number of strata, rather than a separate function for every individual. The conversion tables can be pre-computed and embedded in scoring software, making operational implementation feasible. Finally, because local equating can reduce bias and improve score interpretation in the presence of group differences, these benefits can outweigh its added complexity.
Despite the promising results in this study, several limitations should be acknowledged. First, the success of propensity score methods relies heavily on the quality and relevance of the covariates used. In practice, covariates must be related to the latent ability to effectively adjust for group differences. If important covariates are omitted or the relationship between covariates and ability is too weak, the equating transformation may be biased. Additionally, the assumption of unconfoundedness—that all relevant covariates have been included and correctly measured—is a strong one and may not hold in all testing scenarios. Thus, it is important to perform robustness studies similar to those of Wallin and Wiberg (2023).
It is also important to note that relying solely on covariates in the equating procedure carries certain risks. While covariate-based methods can provide a useful alternative when anchor tests are unavailable, they should be considered as a supplementary approach rather than a replacement for anchor-based methods. Covariates may not capture all aspects of the latent ability, and their effectiveness depends on how well they correlate with the constructs being measured. Therefore, testing programs should aim to include anchor items in their tests, especially when test groups tend to be non-equivalent. Incorporating anchor items provides a direct measure to link test forms and can enhance the accuracy and fairness of the equating process.
Our results and other studies considering propensity scores in equating indicate that under certain conditions, covariate-based methods can perform almost as well as, and sometimes even better than, anchor-based methods. This suggests that in situations where anchor items are not available, covariate-based methods serve as a “better than nothing” alternative. Moreover, for testing programs that are beginning to incorporate anchor items, there is potential value in combining information from both covariates and anchor items when performing equating. Such an approach could leverage the strengths of both methods, potentially improving the accuracy of the equating transformation. However, methodologies for integrating both covariates and anchor items in the equating process are yet to be fully developed, and we leave this as a direction for future research.
Future research could also explore optimal stratification strategies or the use of alternative methods, such as propensity score matching, to enhance the balance between groups. Additionally, investigating the impact of different types and numbers of covariates on the performance of propensity score-based methods would provide valuable insights. Assessing the generalizability of these methods across various testing contexts, including different test formats and populations, would further contribute to understanding their applicability.
Moreover, while our study focused on linear equating, there is potential for extending the proposed methods to equipercentile equating, as discussed in Section 4.1. Equipercentile equating may provide a more flexible approach, particularly for tests with non-linear score distributions or when higher moments need to be matched.
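The core of the equipercentile idea is the composition eq(x) = F_Y^{-1}(F_X(x)): a form-X score is carried through the form-X percentile-rank function and back through the inverse form-Y distribution function. A rough empirical sketch is shown below; the helper name is ours, and it deliberately omits the continuization and presmoothing steps used in operational equipercentile equating.

```python
import numpy as np

def equipercentile(x, scores_x, scores_y):
    """Map a form-X score x to the form-Y scale by matching empirical
    percentile ranks (no continuization or smoothing)."""
    p = np.mean(scores_x <= x)       # empirical percentile rank on form X
    return np.quantile(scores_y, p)  # corresponding form-Y quantile
```

When the two score distributions differ only by a shift, this mapping recovers that shift; with heavier distributional differences it matches the full shape rather than only the first two moments, which is what distinguishes it from linear equating.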
In conclusion, this study contributes two propensity score-based methods for local equating when no anchor test is available. The findings suggest that both methods hold promise as alternatives to traditional anchor-based methods. While limitations exist, the exploration of covariate-based equating methods expands the toolkit available to assessment professionals, facilitating the development of fair testing practices. Future research should continue to further develop these methods, address their limitations, and explore their applicability across a broader range of testing scenarios, including the integration of both covariates and anchor items in the equating process.
