Abstract
Keywords
Introduction
Most educational and psychological testing programs are organized as a coherent battery of subtests. For example, for programs of diagnostic testing for instructional purposes, admission or vocational guidance decisions, and large-scale assessments of educational progress, reporting profiles of scores to stakeholders is much more informative than single scores summarizing an instructional need, a recommended decision, or progress made in a school district or country. A recent example from neurocognitive assessment is reported in Moore et al. (2023).
The use of test batteries also offers valuable psychometric benefits. Obviously, a unidimensional response model is much more likely to fit each of a set of homogeneous subdomains of test items than the more heterogeneous pool resulting from their aggregation. In addition, as scores inferred from the subtests of a battery typically correlate highly, when dealing with one of the subtests, we have the opportunity to import the data collected for the other subtests as powerful collateral information. The test batteries in this article are assumed to be such batteries of homogeneous, unidimensional tests. As our focus is on adaptive testing, the assumption implies an item pool with the items for each subtest selected from its own subset of the pool.
Test batteries are typically administered in fixed time slots, which impose a difficult dilemma with respect to the number of subtests and their lengths. Generally, the greater the number of subtests, the richer the score profile. At the same time, though, a greater number of subtests means fewer items per subtest and/or a greater degree of speededness for some of them, the former implying lower accuracy of the score profile, the latter introducing bias with respect to its intended underlying abilities. No wonder the earliest applications of adaptive testing were attempts to use its efficiency to improve the design of test batteries. Since Lord’s (1980, sect. 10.7) discovery of an adaptive version of the
The previous rule refers to the typical adaptive test starting with the selection of an item somewhere in the middle of the ability scale. However, the dilemma between the number of subtests and their score accuracy can be further relaxed if, instead, we initialize each next subtest using collateral information from the other subtests in the battery. As already noted, test batteries tend to have high intercorrelations between their subtests, so we could do so with high precision. In fact, we could go even one step further and select the subtests for each test taker
The additional level of adaptation can be used to develop large batteries of short diagnostic subtests (each of 5–7 items, say), for instance, to monitor learning progress in education. As their improved initialization immediately brings the selected items close to the true abilities of the test takers, only a few more items are required to finish each subtest. The current research was motivated by the desire to replace current diagnostic testing, with its typical subscoring of long, heterogeneous fixed tests, by such batteries of a large number of short adaptive tests.
The idea of a two-level adaptive test battery was already proposed in van der Linden (2010). The proposal consisted of a battery run on a regular response model for each of the subtests, extended with a second level for the joint distribution of their abilities. However, its statistical treatment was rather ad hoc. All item parameters and second-level ability parameters were assumed to be equal to point estimates obtained during earlier item pool calibration. But as known from traditional one-level adaptive testing, the treatment of point estimates as known parameters results in overoptimistic estimates of the test taker’s ability parameter together with less than optimal item selection due to capitalization on item parameter error (e.g., Cheng et al., 2015; Patton et al., 2013; van der Linden & Glas, 2000). In the context of two-level adaptive testing, not only is the within-subtest level of item selection impacted by item parameter error, but the same can be expected to happen during the transition from one subtest to the next. The overestimation of the information about the first-level ability parameters from the preceding subtests and the second-level parameters for the ability structure is then likely to result in less than optimal selection of the next subtest as well.
Another necessary improvement is with respect to the final subtest scores reported in van der Linden (2010). As each next subtest profited from a greater number of responses already collected from the test taker and thus began with more information about the next ability parameter, the gain in score accuracy for a subtest was higher the later it was administered to the test taker.
Finally, the earlier proposal was based on an ad hoc procedure to compute all necessary integrals using a single random sample of ability values drawn from the multivariate ability distribution with its means and (co)variances set equal to point estimates collected prior to the test. As the integrals were with respect to the conditional distributions of the current ability parameter given all possible combinations of the ability parameters for the preceding subtests, the necessary sample size to produce accurate estimates of these continuous distributions quickly becomes prohibitive for test batteries with larger numbers of subtests.
The current proposal is based on a fully Bayesian approach. All first- and second-level parameters are assumed to be known only through their posterior distributions. The subpools and items are selected based on posterior distributions of the ability parameters permanently updated during testing, avoiding the danger of capitalization on parameter error inherent in adaptive testing based on point estimates. Also, computationally, rather than relying on a large single sample from the second-level ability distribution, the updates are obtained locally from a rapidly converging Gibbs sampler. In addition, the new approach is extended with an adjustment that removes the imbalance between the scores on the earlier and later subtests in the battery, making each final score for an earlier subtest as informative as the score for the final subtest. Finally, as discussed at the end of this article, the combination of a two-level adaptive testing model with the fully Bayesian approach allows for several extensions and generalizations of adaptive testing, including such practical options as continued updating of the model parameters during operational testing or even continuous field testing and calibration of new items.
In the next sections, we first review the two-level response model for the adaptive test battery and then introduce our Bayesian approach to ability parameter updating, item selection, and the transition from one subpool to the next. Using the output from the proposed Gibbs sampler, the computational expressions for the optimization of each of these steps are presented. The practical feasibility of the approach is demonstrated in an extensive study with simulated test takers using the item pool of a real-world adaptive test battery.
Two-Level Model
Each of the subpools of items
The lower level equations define the well-known three-parameter logistic (3PL) response functions:
where
At the higher level, the ability structure in the population of the test takers is supposed to follow a multivariate normal density
with mean vector
and
where
The items in each subpool are assumed to have been calibrated prior to operational adaptive testing. The details of the Bayesian procedure used by the authors in their empirical example are provided in the following section. For applications unlikely to have the multivariate normal structure in Equation 2, it is possible to transform abilities deviating from normality and back-transform the final scores. Another straightforward option, seamlessly fitting the approach below but less attractive from a statistical point of view, is discussed at the end of this article.
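To make the two levels concrete, the following sketch assumes the standard parameterization of the 3PL response function and purely hypothetical values for the second-level mean vector and covariance matrix (the real values are given in the empirical example):

```python
import math
import numpy as np

def p_3pl(theta, a, b, c):
    """Standard 3PL response probability: c + (1 - c) / (1 + exp(-a(theta - b)))."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# First level: probability of a correct response for an average test taker
# on a hypothetical item.
p = p_3pl(theta=0.0, a=1.2, b=-0.3, c=0.2)

# Second level: one ability profile drawn from the multivariate normal in
# Equation 2 (hypothetical mean vector and covariance matrix with the high
# intercorrelations typical of test batteries).
rng = np.random.default_rng(1)
mu = np.array([-0.5, -0.3, -0.4, -0.2])
Sigma = 0.2 * np.eye(4) + 0.6
profile = rng.multivariate_normal(mu, Sigma)
```

The profile vector plays the role of the test taker's abilities on the four subtests; each subtest's responses then follow the first-level model given its component.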
Our goal for the test battery is optimal estimation of the score profiles (
Bayesian Approach
The approach is fully Bayesian, accounting for the uncertainty about all parameters in Equations 1 and 2. At each step in the approach, all prior distributions are empirical distributions.
The initial prior distributions for the ability parameters used for the selection of the first subpool are the marginal distributions of Equation 2 saved from item calibration. The distributions are updated in a sequential fashion during testing using a generalization of the Gibbs sampler introduced in van der Linden (2018) and further optimized by van der Linden and Ren (2020), Ren et al. (2020), and Niu and Choi (2022). The prior distributions for the ability parameters used for the selection of the later subpools are the posterior predictive distributions immediately available upon termination of the preceding subtest, whereas the prior distributions for the item and second-level model parameters are the posterior distributions obtained from their calibration prior to operational testing. As demonstrated by our examples below, this Bayesian approach, despite its full account of parameter uncertainty, is computationally no more intensive than traditional one-level adaptive testing with point estimation of all parameters.
The proposed Gibbs sampler is introduced first, allowing its output to be used in the presentation of the necessary computational expressions for subpool and item selection.
Gibbs Sampler
The algorithm for the updates of the ability parameters is based on the following ideas. Rather than the usual point estimates of the item parameters, ability parameters, and means and covariances in Equations 1 and 2, short vectors of random draws from their last posterior distributions are stored in the testing system. At each posterior update of an ability parameter, a Gibbs sampler is used that cycles between resampling the vectors of draws saved for the item, mean, and covariance parameters and a Metropolis–Hastings (MH) step to sample the test taker’s ability parameter. The MH steps capitalize on the sequential nature of adaptive testing in the following way: the proposal distribution is a normal centered at the value drawn at the preceding iteration with, as variance, a rescaled version of the posterior variance saved from the previous update of the ability parameter, while the prior distribution is a normal with both the mean and variance saved from the previous update. Upon termination of the MH steps, the existing vector of draws for the ability parameter in the system is overwritten by an appropriate selection from the current draws.
For the rescaling in Step 3a, Niu and Choi (2022) found a factor of 2.4 times the posterior standard deviation to be most effective. As the posterior distributions of the item and second-level ability parameters do not depend on the data used to update the test taker’s ability parameter, resampling them replaces the sampling from the complete conditional distributions generally required for a Gibbs sampler. In fact, because of this posterior independence, the sampler reduces to what is known as an independence sampler (Gilks, Richardson, & Spiegelhalter, 1996). Also, as the posterior distributions for these parameters are narrow and already converged, the Markov chain needs to converge for one ability parameter only, which occurs almost immediately.
The last posterior mean and variance of the ability parameter are always our best summary of the information about the parameter collected so far. Their use as the prior distribution at each next update thus keeps the prior empirical. Besides, the proposal distribution used in the MH steps adapts itself during testing, with a mean automatically converging to the true parameter value and a variance converging to zero. As the proposal distribution is symmetric, the acceptance probability for the candidate value drawn from it reduces to the simple calculation of the product of the prior density and the model probability of the test taker’s response given the current proposed value of the ability parameter. No other computational steps are required.
More formally, the MH step for the update of the posterior distribution of
1. Using the mean and variance saved from the previous update to set the prior and the proposal distribution.
2. Drawing a value from the posterior samples of the item parameters.
3. (a) Drawing a candidate value from the proposal distribution; (b) calculating the probability of acceptance; (c) accepting or rejecting the candidate value.
4. Returning to Step 1.
As already hinted at, the calculation of Equation 5 requires only the product of the current prior density and the probability of the last observed response at candidate value
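A minimal sketch of the MH step under these simplifications, assuming a single hypothetical item whose posterior draws are stored as a short vector of (a, b, c) tuples, and with the last posterior mean and variance as prior:

```python
import math
import random

def p_3pl(theta, a, b, c):
    """3PL response probability."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def mh_update(u, item_draws, prior_mean, prior_var, n_iter=200, scale=2.4, seed=7):
    """One posterior update of theta after response u (1 = correct, 0 = incorrect).

    item_draws: saved posterior draws (a, b, c) for the item, resampled each
    iteration in place of a full conditional (independence-sampler step).
    """
    rng = random.Random(seed)
    sd = scale * math.sqrt(prior_var)  # rescaled proposal sd (Niu & Choi, 2022)

    def log_post(theta, a, b, c):
        # Prior density times probability of the last response only:
        # earlier responses are absorbed into the empirical prior.
        prob = p_3pl(theta, a, b, c)
        loglik = math.log(prob) if u == 1 else math.log(1.0 - prob)
        return loglik - 0.5 * (theta - prior_mean) ** 2 / prior_var

    theta, draws = prior_mean, []
    for _ in range(n_iter):
        a, b, c = rng.choice(item_draws)            # resample item parameters
        cand = rng.gauss(theta, sd)                 # symmetric normal proposal
        if math.log(rng.random()) < log_post(cand, a, b, c) - log_post(theta, a, b, c):
            theta = cand                            # accept the candidate
        draws.append(theta)
    return draws

item_draws = [(1.1, 0.0, 0.20), (0.9, 0.1, 0.18)]   # hypothetical posterior draws
draws = mh_update(u=1, item_draws=item_draws, prior_mean=0.0, prior_var=1.0)
```

Because the proposal is symmetric, its density cancels in the acceptance ratio, leaving exactly the prior-times-response-probability computation described in the text.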
Fisher’s Information
A crucial quantity in adaptive testing is Fisher’s information in response
where
The criterion for item selection is maximum posterior expected information, which, for a given response vector
where
We use superscripts
The use of equal numbers of draws is for notational convenience only. In case of unequal sample sizes, the average is efficiently calculated by recycling the smaller samples against the larger.
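The Monte Carlo average can be sketched as follows; the Fisher information formula for the 3PL model is standard, and the recycling of the smaller sample against the larger follows the remark above:

```python
import math

def p_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def info_3pl(theta, a, b, c):
    """Fisher information of a 3PL item at theta:
    a^2 * (1 - P) / P * ((P - c) / (1 - c))^2."""
    p = p_3pl(theta, a, b, c)
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def posterior_expected_info(theta_draws, item_param_draws):
    """Monte Carlo estimate of the posterior expected information:
    the information averaged over paired posterior draws of theta and the
    item parameters, recycling the smaller sample against the larger."""
    n = max(len(theta_draws), len(item_param_draws))
    total = 0.0
    for k in range(n):
        theta = theta_draws[k % len(theta_draws)]
        a, b, c = item_param_draws[k % len(item_param_draws)]
        total += info_3pl(theta, a, b, c)
    return total / n
```

Averaging over the item-parameter draws as well as the ability draws is what prevents the capitalization on item parameter error discussed in the Introduction.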
Selection of First Subpool
It may seem advantageous to administer the first test from the subpool with the item that has the greatest value of Equation 10 across all subpools, where the draws for
A solution that avoids this pitfall of capitalization on a few good items is to select the best full-size subtest from each subpool and pick the subpool with the best result among all of them. Let
among all possible sets
As indicated earlier, this article assumes a larger battery of short subtests, each from a homogeneous subpool of items. But if the same methodology is applied to a battery with more heterogeneous subpools, a second pitfall is possible. It may then be necessary to impose constraints on the selection of the items to balance the content of the subtest across all test takers. If so, just beginning a subtest with the items that are statistically best for the test taker is likely to lead to later suboptimal selection because of the necessity to satisfy each of the constraints at the end of the test.
A solution that efficiently avoids both pitfalls is to use a shadow-test approach (STA) both for the selection of subpools and items. The first subtest is then the one with the best solution for the shadow-test model with Equation 11 as objective and a constraint set that controls both the length of the subtest and the content distribution of the items. For a brief review of the approach, see the Appendix.
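A sketch of subpool selection by best full-size subtest (content constraints omitted here; in the STA they would enter as constraints of the shadow-test model):

```python
def best_subtest_value(item_infos, n):
    """Value of the best unconstrained n-item subtest from one subpool:
    the sum of the n largest posterior expected information values."""
    return sum(sorted(item_infos, reverse=True)[:n])

def select_subpool(pool_infos, n):
    """Pick the subpool whose best full-size subtest has the highest value,
    rather than the subpool that merely contains the single best item."""
    values = [best_subtest_value(infos, n) for infos in pool_infos]
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical posterior expected information values per item, per subpool.
pool_infos = [[0.9, 0.05, 0.05], [0.5, 0.45, 0.4]]
```

With these hypothetical values, the single best item sits in the first subpool, but for a three-item subtest the second subpool wins, illustrating the pitfall of capitalization on a few good items that the full-size criterion avoids.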
Selection of Items From First Subpool
We relabel the subpools assigning
After each new response
Selection of Second Subpool
The second subtest is administered from one of the subpools
that is, the version of Equation 9 with
which defines
Analogous to Equation 10, Equation 13 is calculated as
where
Thus, the draws required for Equation 15 are obtained from
with the second-level parameters resampled from their posterior distributions saved from the calibration of the item pool.
The second item pool is selected according to the criterion in Equation 11 with the conditional draws
Selection of Items From Second Subpool
We now use
Selection of Subsequent Subpools and Items
The same procedure is continued to select subsequent subpools and items. The only necessary change for the selection of the next subpool is the extension of Equations 13 through 17 with an additional conditioning ability parameter representing the last subtest administered. To illustrate one more step, it is easy to verify from the general result in Equations 3 and 4 that, using correlations rather than covariances for notational convenience, for the selection of the third subpool,
and
respectively. Thus, analogous to Equation 17, we combine the draws
For larger numbers of subpools, the use of analytic expressions for the conditional means and variances derived directly from Equations 3 and 4 becomes less convenient. A more practical approach is then to use the fact that conditional variance
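Numerically, the required conditional means and variances follow from the standard multivariate normal conditioning formulas; a sketch:

```python
import numpy as np

def conditional_normal(mu, Sigma, j, obs_idx, obs_vals):
    """Conditional mean and variance of component j of a multivariate normal
    given that the components in obs_idx take the values obs_vals."""
    mu = np.asarray(mu, float)
    Sigma = np.asarray(Sigma, float)
    obs_idx = list(obs_idx)
    S_oo = Sigma[np.ix_(obs_idx, obs_idx)]        # covariance of observed block
    S_jo = Sigma[j, obs_idx]                      # cross-covariances with component j
    w = np.linalg.solve(S_oo, np.asarray(obs_vals, float) - mu[obs_idx])
    cond_mean = mu[j] + S_jo @ w
    cond_var = Sigma[j, j] - S_jo @ np.linalg.solve(S_oo, S_jo)
    return cond_mean, cond_var
```

Applied to draws of the ability parameters for the preceding subtests, the routine yields the draws for the conditional prior of the next ability parameter regardless of the number of conditioning subtests, avoiding the bookkeeping of subpool-specific analytic expressions.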
Final Subtest Scores
The procedure presented so far suffers from a serious imbalance: earlier subtests necessarily profit from a smaller number of preceding subtests and hence produce less accurate final scores than later subtests. The proposed correction for the imbalance is to recalculate the final scores for the subtests from the posterior distribution of each ability parameter given the test taker’s complete set of responses to all subtests; that is,
A straightforward way to sample Equation 20 for each
An alternative approach is to redesign the Gibbs sampler to update the posterior distribution of each
The means and
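A sketch of the componentwise MH-within-Gibbs idea, simplified by plugging in point values for the item and second-level parameters (the article instead resamples their posterior draws):

```python
import math
import random
import numpy as np

def p_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def loglik(theta, items, resp):
    """Log-likelihood of one subtest's responses at theta."""
    return sum(math.log(p_3pl(theta, a, b, c) if u == 1 else 1.0 - p_3pl(theta, a, b, c))
               for (a, b, c), u in zip(items, resp))

def cond_normal(mu, Sigma, j, theta):
    """Conditional mean/variance of theta_j given the other components."""
    idx = [k for k in range(len(mu)) if k != j]
    S_oo = Sigma[np.ix_(idx, idx)]
    S_jo = Sigma[j, idx]
    w = np.linalg.solve(S_oo, theta[idx] - mu[idx])
    return mu[j] + S_jo @ w, Sigma[j, j] - S_jo @ np.linalg.solve(S_oo, S_jo)

def final_scores(mu, Sigma, items_by_sub, resp_by_sub, n_iter=500, seed=3):
    """Posterior means of each ability given the responses to ALL subtests,
    removing the imbalance between earlier and later subtests."""
    rng = random.Random(seed)
    mu, Sigma = np.asarray(mu, float), np.asarray(Sigma, float)
    theta = mu.copy()
    draws = [[] for _ in range(len(mu))]
    for _ in range(n_iter):
        for j in range(len(mu)):
            m, v = cond_normal(mu, Sigma, j, theta)
            cand = rng.gauss(theta[j], 2.4 * math.sqrt(v))
            lp_c = loglik(cand, items_by_sub[j], resp_by_sub[j]) - 0.5 * (cand - m) ** 2 / v
            lp_o = loglik(theta[j], items_by_sub[j], resp_by_sub[j]) - 0.5 * (theta[j] - m) ** 2 / v
            if math.log(rng.random()) < lp_c - lp_o:
                theta[j] = cand
            draws[j].append(theta[j])
    return [float(np.mean(d[n_iter // 2:])) for d in draws]
```

Each sweep updates one ability parameter at a time, with its conditional normal given the current values of the others as prior and its own subtest's responses in the likelihood.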
Empirical Example
The goal of the empirical example was to demonstrate the practical feasibility of the current approach to two-level adaptive testing and give an impression of the gain in relative efficiency created by the introduction of the second level of adaptation under operational conditions.
Item Pool Calibration
The real-world test battery used in the example had four different subtests, labeled here as Subtests 1 through 4. The item pool consisted of 150 items per subtest, randomly sampled from an inventory of retired operational items. The items had been extensively pretested and shown to have satisfactory fit to the 3PL model in Equation 1. Also, the ability parameters for the four subpools had been estimated to have empirical mean vector
and covariance matrix
for a typical population of test takers. The estimates of all these parameters were used as their true values for the generation of response data, both for item pool calibration and during simulated adaptive testing.
The items had been calibrated previously using one of the standard computer programs from the maximum-likelihood tradition. But as samples from the posterior distributions of all parameters in the two-level model in Equations 1 and 2 were needed to simulate the adaptive test administrations, it was decided to re-estimate all parameters in a fully Bayesian fashion. (An alternative would have been to take the maximum-likelihood estimates (MLEs) together with their estimated standard errors and sample the parameters assuming asymptotic normality. But Bayesian estimation was preferred because of the small-sample validity of its posterior distributions.) All parameters were estimated jointly in a Bayesian fashion from the response data of
Figure 1 shows the scatterplots of the posterior means (expected a posteriori or EAP estimates) against the true values of the item parameters for the four subpools. The average root mean squared errors (RMSEs) for the item parameters are shown in Table 1. For second-level parameters

Scatterplots of the posterior means of the
Average Root Mean Squared Errors for the
Simulation Conditions
The main conditions in the simulation study were subtest length (5 versus 10 items) and type of adaptation (two-level versus one-level adaptive testing).
Each of the four combinations of conditions was simulated for a total of 5,000 test takers with ability parameters

Average bias functions for the estimated ability parameters for the four simulated conditions.
For the condition of two-level adaptive testing, subpool and item selection were entirely according to the Bayesian approach presented in this article. The first subpool was selected averaging the marginal distributions of each
As already indicated, the sampler was set to have a burn-in of 5,000 iterations and thinning by a factor of 500. For both conditions, resampling of the parameters was from 500 independent posterior draws saved in the system from the calibration. For security reasons, the authors had access only to the parameter estimates for the items in the pool, not to their content or any of their other attributes. The simulations were therefore run without any content constraints on the selection of the items. At the end of the simulations, the final scores for the simulated test takers were recalculated using the posterior distributions given their complete set of responses in Equation 20.
Results
Figures 2 and 3 show the average bias and RMSE functions for the final scores for each of the four combinations in the simulation. The two-level battery clearly outperformed the one-level version in terms of both bias and RMSE. Obviously, the same was true for the longer relative to the shorter version of the subtests.
The relatively larger bias and RMSE for the one-level approach at the upper end of the ability scale for the case of four 5-item subtests are due to local scarcity of items in the pool. As demonstrated by the mean vector in Equation 21, the population had a relatively low ability distribution for each of the subtests, and the item pool matched the population. The 10-item subtests suffered less from this local scarcity, though.
The most significant result, however, was the performance of the two-level battery with the subtest length of five items relative to the one-level battery with the length of 10. The bias and RMSE functions for the two cases were relatively close to each other, especially at the lower end of the scale. The introduction of the second level of adaptation in the battery thus had the same general effect on the bias and accuracy of the scores as lengthening the subtests by a factor close to two. Alternatively, returning to the dilemma discussed in the Introduction to this article, the results can be taken to support the option of increasing the number of subtests in the battery by the same factor without any noticeable loss of quality of scoring while keeping the total testing time constant.

Average root mean squared error functions for the estimated ability parameters for the four simulated conditions.
It is also informative to check the different paths through the test battery followed by the test takers. Table 2 shows the counts for each of these paths for the two different subtest lengths collected during the simulation. The first path follows the fixed order in which the subtests of the battery had been administered during operational testing. However, for both lengths of the subtests, nearly every test taker profited from the presence of an alternative, more informative path thanks to the second level of adaptation added to the test battery.
Counts of the Paths Through the Two-Level Battery by the Test Takers
Runtimes
The simulations were run on a computer with a Hexa-core CPU (Intel I7-8700) and 16-GB RAM. The simulation was programmed using
The runtimes for the simulated test takers to update their ability parameters and select the next item ranged from 0.030 to 0.040 seconds per item. For the selection of the next subtest, the range was 0.214 to 0.243 seconds per subtest. For the computation of the final scores for each of the subtests according to the procedure discussed directly below Equation 20, the runtimes ranged from 0.821 to 0.892 and from 1.096 to 1.170 seconds for the 5-item and 10-item subtests, respectively. These times are small enough for real-world application of the two-level type of adaptive testing proposed in this article.
Discussion
As already hinted at, the combination of a two-level adaptive testing model with the proposed Bayesian algorithm for all parameter updates enables several practical extensions and generalizations of current adaptive testing. One of the options is online calibration of new items. The only thing required is the insertion of an adaptively selected item from a section of field-test items added to the pool toward the end of the subtest. To update the posterior distributions of the field-test parameters, the same Gibbs sampler can be used, but now with an MH step for the parameters of the item and resampling of the current posterior distribution of the test taker’s ability parameter. In addition to the advantages of calibration under the actual conditions of testing with optimized sample sizes, the items are immediately available for testing with their parameters directly on the operational scales along with the samples from their final posterior distributions required for the version of adaptive testing proposed in this article. Examples of these options have already been illustrated for traditional one-level adaptive testing by Ren et al. (2017), van der Linden (2018), and van der Linden and Jiang (2020).

Another extension is the introduction of response times as a source of collateral information on the test takers’ abilities. The Bayesian way of doing so is to replace the second-level ability distribution in the model with the joint distribution of the test takers’ ability and speed. For a more traditional approach to the additional use of response times with point estimates for all parameters, see van der Linden (2008).

A third option is continued updating of the posterior distributions for the first-level item and second-level parameters for the ability distribution during operational testing. This choice would have both advantages and disadvantages.
The greatest advantage would be use of information about these parameters from operational data that so far has been ignored. The disadvantage would be loss of the current posterior independence between the test taker’s ability parameter and the other model parameters given the data, which allowed the Gibbs sampler to just resample the posterior distributions of all model parameters other than the one for the test taker’s ability. Further research on the option is necessary to see whether the additional complexity due to loss of posterior independence is worth the effort. Along the same lines, once the fit of a field-test item has been shown to be satisfactory, the response to the item can be used to update the test taker’s ability parameter as well. The necessity to distinguish between operational and field-test items then disappears completely. The only thing that counts would be the distinction between items with more and less informative priors for their parameters, something a Bayesian approach automatically deals with.
If the assumption of a second-level multivariate normal distribution for the ability parameters in their original metric appears to be untenable and temporary transformation to normality does not work, an alternative is to sample an empirical multivariate distribution of the ability parameter estimates for the population of test takers, for example, a distribution collected during initial use of the traditional one-level version of the test battery. The use of an empirical distribution has the advantage of avoiding any assumption about the shape of the ability distribution for the battery. However, distributions of estimated ability parameters generally have larger variability than the distribution of their true values. In addition, extremely large samples of test takers are required to stabilize empirical distributions for larger test batteries, especially the conditional distributions required as the sequence of subtests progresses. More generally, the dilemma faced when choosing between a modeled and an empirical second-level ability distribution is between possible bias in the former and inaccuracy inherent in the latter. However, for the current application, bias as a consequence of a misfitting distribution manifests itself only in the form of a less than optimal order of the subtests for the test takers
Appendix
Shadow-Test Approach
The STA treats adaptive testing as a sequence of full-size tests assembled to be optimal at each new update of the ability parameter while satisfying the complete set of constraints in force for the adaptive test. Each item administered is the best free item in the next shadow test; the rest of the free items are returned to the pool. As both optimality and constraint satisfaction hold for each of the shadow tests, the same automatically holds for the completed adaptive test.
In the current context, the approach not only supports within-subtest item selection but also the adaptive transition from one subpool to the next. The criterion for the transition is the selection of the next subpool as the one whose first shadow test has the best value of the objective function among all remaining subpools. The criterion automatically avoids the pitfall of being forced to violate any of the constraints later during testing.
The approach is possible through the use of mixed integer programming (MIP) for the assembly of the shadow tests. The application of the MIP methodology includes the introduction of binary decision variables for the selection of the items, modeling of the objective function and constraints to be imposed on the selection in terms of these variables, and a call to software with a standard mathematical solver to calculate the solution to the model prior to the selection of the next item. Let
The core of the shadow-test model for the selection of the
subject to
where denotes the choice of a (strict) inequality. In addition to the categorical and content constraints in Equations 26 and 27, the constraints in Equations 24 and 25 are necessary to control the length of the test and guarantee the presence of the set of items
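The model can be illustrated on a toy pool by brute-force enumeration in place of a MIP solver (the objective and constraints are the same; the information values and content labels are hypothetical):

```python
from itertools import combinations

def shadow_test(info, n, administered, content, bounds):
    """Brute-force solution of the shadow-test model for a toy pool:
    maximize the total information of an n-item test that includes all items
    already administered and meets the content bounds. For pools of realistic
    size a MIP solver is used instead, but the model is the same.

    info: information value per item; content: category label per item;
    bounds: {category: (min, max)} bounds on the selected counts.
    """
    best, best_val = None, float("-inf")
    for test in combinations(range(len(info)), n):   # length constraint
        if not set(administered) <= set(test):
            continue                                 # keep administered items
        counts = {}
        for i in test:
            counts[content[i]] = counts.get(content[i], 0) + 1
        if any(not lo <= counts.get(cat, 0) <= hi for cat, (lo, hi) in bounds.items()):
            continue                                 # content constraints
        val = sum(info[i] for i in test)             # objective function
        if val > best_val:
            best, best_val = set(test), val
    return best, best_val
```

For instance, with items 0 and 1 in category A and items 2 and 3 in category B, a 2-item shadow test required to contain one item of each category picks the best A item and the best B item, even when the two best items overall share a category.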
