Abstract
1. Introduction
Surveys have long faced the dual threats of rising data collection costs and declining response rates (e.g., Curtin et al. 2005; de Leeuw and de Heer 2002; Williams and Brick 2018). In response to these challenges, Groves and Heeringa (2006) introduced the notion of responsive survey design (RSD). RSD is an active survey data management technique that uses incoming data to make design decisions during the field period.
The adoption of computerized survey data collection gives survey managers access to timely information about the progress of data collection. A key element of RSD is the collection and analysis of data describing the data collection process, better known as paradata (e.g., Couper 1998; Couper and Lyberg 2005; Olson 2013). In RSD, paradata are used to compute predictions for informing design decisions and to document the effects of changes (“interventions”) on key process indicators (e.g., variation in the response rates among subgroups). For example, Groves et al. (2009) monitor four levels of paradata daily to guide active interventions, Kirgis and Lepkowski (2013) use paradata to guide interviewers toward sample members in a specific subgroup, and Wagner et al. (2012) use paradata to systematically guide interviewer behaviors during the field period.
The term stopping rule is borrowed from clinical trials (Rao et al. 2008), where data are periodically reviewed to decide whether to stop a clinical trial before its planned completion. Recently, several stopping rules have been proposed for surveys, but their goal is different from those used in a randomized controlled trial. Stopping rules in surveys aim to balance survey costs and errors. For example, stopping rules can be used to determine whether to stop data collection or to initiate a new phase of data collection (Lewis 2017, 2019; Rao et al. 2008; Wagner and Raghunathan 2010). Potential design changes in response to stopping rules include switching modes, changing the amount of incentives, increasing interviewer effort, or discontinuing nonresponse follow-up. These stopping rules are triggered by detecting phase capacity, which refers to the stable condition of an estimate or several estimates in a specific design phase.
Rao et al. (2008) proposed three stopping rules that are based on testing whether an estimated proportion changes substantially following the completion of wave
Lewis (2019) introduced two multivariate stopping rules to address the situations where independent stopping rules produce conflicting results when detecting phase capacity for multiple estimates. The first multivariate stopping rule is based on the Wald chi-squared test of changes in estimates at the completion of waves
The second multivariate stopping rule is based on the non-zero trajectory method. The relative percent changes in estimates are modeled as a linear function of the data collection wave. When all estimated regression coefficients in the model are statistically indistinguishable from zero, one can declare that the overall phase capacity is detected. However, the non-zero trajectory method requires additional waves of follow-up to avoid fitting a saturated model. For example, at least four waves are required to detect overall phase capacity for three estimates.
Existing stopping rules that are based on detecting phase capacity have three critical limitations. First, phase capacity is a proxy measure of data quality (Lewis 2017, 2019). Phase capacity suggests that an estimate reaches a stable condition in a design phase, but it does not necessarily indicate the estimate is free of nonresponse error. Therefore, there is a need for a stopping rule that considers direct measures of data quality.
Second, most of these stopping rules are retrospective in nature (Lewis 2017, 2019; Rao et al. 2008; Wagner and Raghunathan 2010). Testing for phase capacity requires another wave of data collection, which demands more time and resources. These stopping rules are feasible when there are sufficient funds for future waves of data collection to detect phase capacity. However, the survey manager need not wait until phase capacity is reached to stop some cases if stopping them early can substantially reduce data collection costs while maintaining an acceptable level of data quality.
Lastly, most of these stopping rules are univariate in nature (Lewis 2017; Rao et al. 2008; Wagner and Raghunathan 2010; Wagner et al. 2023). Two multivariate stopping rules proposed by Lewis (2019) are exceptions, but these two rules do not explicitly account for survey costs. In multipurpose surveys, there may be data quality objectives that must be met for certain estimates with a constraint on costs. A decision to stop effort on a subset of cases can hardly be acceptable if the decision undermines the quality of other key estimates.
Two-phase sampling for nonresponse follow-up (Hansen and Hurwitz 1946) is another type of stopping rule. Instead of discontinuing nonresponse follow-up for all nonrespondents, two-phase sampling stops effort on only a subset of nonrespondents. Since the selection procedure in two-phase sampling is random, it implicitly assumes that the cost is constant across all nonrespondents. If one has predicted measures of data quality and the cost varies across nonrespondents, the selection procedure for stopping a subset of cases can be improved. For example, Wagner et al. (2023) presented a stopping rule aimed at optimizing the cost-error tradeoff. Their stopping rule relies on predictions of data collection costs and data quality as inputs, and is used to stop a subset of unresolved cases during data collection while the other unresolved cases are followed up. However, their stopping rule is univariate in nature. Again, a univariate decision rule for survey data collection overlooks the quality of other key estimates, and stopping effort on a case before phase capacity is declared could have a negative ripple effect on other key estimates. While comparisons to the existing rules go beyond the scope of the current study, the pros and cons of these existing stopping rules are summarized in Table 1.
Pros and Cons of Existing Stopping Rules.
We aim to develop a multivariate stopping rule for survey data collection that accounts for the cost of data collection and the quality of multiple estimates. This study extends the stopping rule proposed by Wagner et al. (2023). In their study with multiple survey variables, Wagner et al. applied the stopping rule separately for each survey variable and stopped effort on cases that were flagged for stopping on any one of the survey variables. Our proposed stopping rule, however, simultaneously considers the data quality of multiple survey variables when determining which cases to stop. To achieve this, we use the weighted mean squared errors of multiple sample means, with each estimate receiving a prespecified weight. We illustrate via simulation how the stopping rule performs empirically.
2. A Multivariate Stopping Rule
The proposed stopping rule is implemented in the following setting. In the data collection process, a survey manager aims to stop effort on a subset of unresolved cases in a way that maximizes the tradeoff between the cost and data quality of multiple estimates—a fairly large amount of cost savings relative to the loss in data quality—at the end of data collection. Other unresolved cases where effort is not stopped will continue to be followed up. To fix notation, we let
Many cases are resolved by a series of attempts. At a specific point in time, the value of the sunk cost for case
Before implementing the stopping rule, the estimated total cost would be
To summarize, the inputs of the proposed stopping rule include: (a) the estimated data collection costs after stopping
which mimics the function for optimal allocation in stratified sampling (Cochran 1977, 97). The estimated cost
Because of
However, finding the optimal set of cases to stop that has the exact minimal value for Equation (1) is computationally expensive (or even computationally prohibitive) when the number of unresolved cases,
Before implementing the stopping rule, the value of
The objective function for stopping effort on case
where
We propose to calculate
Next, we take the remaining
where
Then, we propose to select a case from the remaining unresolved cases, resulting in the smallest value of
where
The set
To summarize, the implementation of the stopping rule includes five main steps.
Estimate the cost model, and predict a future cost
Select key variables based on the data quality objectives. For example, key variables can include those that are frequently used by data users, are representative of other variables, or are considered to have a high risk of bias.
Estimate the survey variable models for all key variables of interest, predict survey variables
Assign weights to each variable.
Use the proposed algorithm mentioned above to identify the set of cases that minimize the following objective function:
The SAS code for implementing the proposed multivariate stopping rule is available at https://github.com/xyzhangxinyu/StoppingRules.
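As a language-agnostic illustration, the greedy search described above can be sketched in Python. This is a minimal sketch, not the published SAS implementation: the objective below, the product of the remaining predicted cost and the weighted squared deviation of the predicted means, is a hypothetical stand-in for Equation (1), and all inputs are placeholders for the model outputs described in Section 2.

```python
import numpy as np

def greedy_stop(costs, preds, weights, benchmark):
    """Greedily pick unresolved cases to stop (illustrative only).

    costs:     (n,) predicted remaining data collection cost per case
    preds:     (n, k) predicted values of the k key survey variables
    weights:   (k,) prespecified estimate-level weights
    benchmark: (k,) current estimates of the k means

    The objective is a hypothetical stand-in for Equation (1): the
    remaining cost times the weighted squared deviation of the means
    computed without the stopped cases.
    """
    active = list(range(len(costs)))
    stopped = []

    def objective(idx):
        idx = np.asarray(idx)
        remaining_cost = costs[idx].sum()
        means = preds[idx].mean(axis=0)
        wmse = float(weights @ (means - benchmark) ** 2)
        return remaining_cost * wmse

    current = objective(active)
    improved = True
    while improved and len(active) > 1:
        improved = False
        best_val, best_i = current, None
        for i in active:               # try stopping each case in turn
            val = objective([j for j in active if j != i])
            if val < best_val:
                best_val, best_i = val, i
        if best_i is not None:         # stop the case that helps most
            active.remove(best_i)
            stopped.append(best_i)
            current, improved = best_val, True
    return stopped
```

At each pass, the case whose removal most reduces the objective is stopped; the search ends when no single removal improves it, mirroring the stepwise selection described above.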
3. Simulation Study
3.1. Data
We used real data from the 2018 wave of the telephone component of the Health and Retirement Study (HRS) to simulate the stopping effects of the proposed rule. The HRS is a longitudinal study of the U.S. population over age 50. Among 7,415 sampled cases in the 2018 wave of the telephone component of the HRS, 5,462 eventually responded to the survey. The field work took 416 days to complete.
Three types of HRS data were used to model survey design quantities. First, we used the 2016 and incoming 2018 HRS timesheet data to estimate the interviewer hours required for each type of call attempt (e.g., interview, contact, and no contact) in the telephone mode. We treat interviewer hours as a proxy for survey costs. Since an interviewer might contact cases both in person and by telephone on the same day, timesheet data for the face-to-face (FTF) mode were also included. The timesheet data contained one outcome variable and seven predictors recorded at the interviewer-day level. The outcome variable is the number of hours each interviewer worked each day. The predictors include the number of FTF noncontacts, the number of FTF contacts (without an interview), the number of FTF interviews, the number of telephone noncontacts, the number of telephone contacts (without an interview), the number of telephone interviews, and an indicator of any FTF attempt (accounting for travel time in FTF).
Second, we used the 2016 and incoming 2018 call record data for the telephone mode, as well as the 2016 survey data to model propensity scores at the call attempt level. The call record data contain paradata, such as call attempt number, outcome of each call attempt, and mode of call attempt. To align with the timesheet data, we recoded call attempt outcomes into one of the three categories: interview, contact without an interview, and no contact.
Third, to predict survey variables, we used the 2016 survey data and the 2018 survey data available at the time the stopping rule was implemented (e.g., on a specific data collection day). In practice, survey managers should select key survey variables for the stopping rule. As an illustration, we selected three survey variables: self-rated health (SRH), functional limitations (FLs), and impairments that limit work (ILW). For SRH, respondents were asked, “Would you say your health is excellent, very good, good, fair, or poor?”; responses were coded as a binary variable (1 if the respondent reported excellent, very good, or good, and 0 if the respondent reported fair or poor). FLs is a summed score of 23 binary indicators: ten mobility tasks, six activities of daily living (ADL), and seven instrumental activities of daily living (IADL). A higher FLs score indicates more physical limitations. For ILW, respondents were asked, “Do you have any impairment or health problem that limits the kind or amount of paid work you can do?”; responses were coded as a binary variable (1 = yes or too old; 0 = no).
3.2. Inputs to the Stopping Rule
We implemented the stopping rule on data collection day 28 for illustration purposes. Had we implemented the stopping rule on data collection day 84, when half of the cases had already received more than three call attempts, the cost savings would have been negligible because only a few cases would have been stopped in this study. Readers should be cautious about implementing the stopping rule on day 28 in practice and should assess whether other days would make more operational sense for a given survey. We leave the topic of the optimal timing for implementing the stopping rule for future research.
The number of call attempts is another time measure commonly used for implementing stopping rules. Zhang (2023) tested several numbers of call attempts using data from the telephone component of the 2018 HRS to identify the optimal single point in time to implement a stopping rule that maximizes data quality for a given budget. The cost estimation was focused on the number of call attempts to finalization. In that context, the best timing was found to be after eight to ten call attempts, yielding the same data quality with the fewest call attempts per interview. The choice between the data collection day and the number of call attempts as a time measure depends on operational constraints and the quality of predictions.
The inputs to the stopping rule are shown in the following subsections.
a. Cumulative effort/cost. We do not have access to fixed costs in the simulation study, so we consider only variable costs. We modeled the number of hours each interviewer worked each day by fitting a multilevel model with a random intercept for each interviewer,
where
The seven timesheet variables (see Subsection 3.1) were used as predictors in the multilevel model. Specifically, the coefficients for the number of telephone noncontacts, the number of telephone contacts (without an interview), and the number of telephone interviews each represent the hours spent on a telephone call attempt that results in the corresponding outcome. The estimated hours for an interview, a contact without an interview, and a noncontact in the telephone mode were 1.58, 0.18, and 0.07 hours, respectively. These estimated hours per call attempt outcome are used to predict the case-level cost.
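As a rough illustration of how per-attempt hours can be recovered from interviewer-day records, the following sketch simulates timesheet data and estimates the coefficients by least squares. It omits the random interviewer intercept of the actual multilevel model, and the simulated daily attempt counts are invented; only the per-attempt hours (0.07, 0.18, and 1.58) are taken from the estimates reported above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 400
# Hypothetical interviewer-day records; columns are counts of telephone
# noncontacts, contacts (without an interview), and interviews.
X = rng.poisson(lam=[6.0, 2.0, 1.0], size=(n_days, 3)).astype(float)
true_hours = np.array([0.07, 0.18, 1.58])   # point estimates reported above
y = X @ true_hours + rng.normal(0.0, 0.05, n_days)  # hours worked per day

# Fixed-effects-only least squares stand-in for the multilevel model;
# coef[i] estimates the hours per attempt of each outcome type.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With enough interviewer-days, the per-attempt coefficients are recovered closely; the real model additionally absorbs interviewer-level variation through the random intercept.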
To predict propensity scores for future call attempts, we fit a multinomial logistic model to the call record data. The call number
where
We used the LASSO (least absolute shrinkage and selection operator) for variable selection. The variable list is shown in the appendix (see Table A1). The discrete time hazard model was estimated in a Bayesian fashion, eliciting priors from the 2016 call-level data to protect against biased estimates when using the early 2018 data (Schouten et al. 2018). For the regression parameters, we used normal priors with parameters elicited by fitting an identical regression model to the previous data collection wave (the 2016 HRS) and using the resulting point estimates and associated variances as the means and variances of the priors (see Table A2 in the appendix). The posterior distributions of the parameters were generated using a Markov chain Monte Carlo (MCMC) algorithm. The initial 500 MCMC iterations served as burn-in to ensure that the chain had converged to the target distribution. Then, every tenth iteration of the MCMC chain was retained until a total of 500 draws were obtained. We used the posterior means to estimate the parameters of the discrete time hazard model.
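The sampling scheme described above (normal priors elicited from the previous wave, 500 burn-in iterations, and thinning to every tenth draw until 500 draws remain) can be sketched for a single parameter with a random-walk Metropolis sampler. Everything here is illustrative: the outcome is synthetic, the model is reduced to an intercept-only logistic regression, and the prior mean and variance are hypothetical stand-ins for the 2016-elicited values.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.binomial(1, 0.3, size=200)      # synthetic 0/1 call outcomes
prior_mean, prior_var = -0.8, 0.25      # hypothetical prior from the earlier wave

def log_post(b):
    # intercept-only logistic likelihood plus a normal log-prior
    p = 1.0 / (1.0 + np.exp(-b))
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return loglik - 0.5 * (b - prior_mean) ** 2 / prior_var

b, draws = prior_mean, []
for it in range(500 + 10 * 500):        # 500 burn-in + 10 x 500 post-burn-in
    prop = b + rng.normal(0.0, 0.2)     # random-walk proposal
    if np.log(rng.uniform()) < log_post(prop) - log_post(b):
        b = prop                        # accept the proposal
    if it >= 500 and (it - 500) % 10 == 9:
        draws.append(b)                 # keep every tenth post-burn-in draw

posterior_mean = float(np.mean(draws))  # 500 retained draws
```

The posterior mean plays the role of the parameter estimate used in the discrete time hazard model; in practice the full multinomial model is sampled jointly rather than one coefficient at a time.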
Since an interview for an unresolved case may be achieved by multiple call attempts, we extend the horizon for predicting interviewer hours out to twenty-one call attempts (around 80% of active HRS cases were finalized within twenty-one call attempts in 2016). Let
The estimated interviewer hours of
where
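One way to assemble the case-level cost prediction is to accumulate expected hours over the twenty-one-attempt horizon, discounting each attempt by the probability that the case is still unresolved. The sketch below assumes the hazard model supplies per-attempt outcome probabilities and, as a simplification, treats an interview as the only event that resolves a case (the paper also handles other resolutions).

```python
import numpy as np

def expected_future_hours(p_outcome, hours, horizon=21):
    """Expected remaining interviewer hours for one unresolved case.

    p_outcome: (horizon, 3) per-attempt probabilities of
               (interview, contact without interview, no contact),
               conditional on the case still being active; a
               hypothetical output of the discrete time hazard model.
    hours:     (3,) estimated hours per call attempt outcome.
    """
    p_active = 1.0  # probability the case is still unresolved
    total = 0.0
    for t in range(horizon):
        # expected hours for attempt t, weighted by survival to attempt t
        total += p_active * float(p_outcome[t] @ hours)
        p_active *= 1.0 - p_outcome[t, 0]  # an interview resolves the case
    return total
```

Summing these case-level predictions over the unresolved cases gives the estimated remaining variable cost that enters the stopping rule.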
b. Prediction of survey variables. Survey variables in 2016 and demographic variables are treated as predictors:
• FL (range = 0–21),
• SRH (1 = excellent/very good/good; 2 = fair/poor),
• ILW (1 = yes; 2 = no),
• Number of private health insurance plans (range = 0–10),
• Medicaid coverage (1 = yes; 2 = no),
• Currently working for pay (1 = yes; 2 = no),
• Regular use of web for email (1 = yes; 2 = no),
• Diabetes status (1 = yes; 2 = no),
• Age in 2016 (range = 24–101; some spouses can be under age 50),
• Race/Ethnicity (1 = Hispanic; 2 = non-Hispanic Black; 3 = non-Hispanic White; 4 = other),
• Gender (1 = female; 2 = male), and
• Education (1 = less than high school; 2 = high school or equivalent; 3 = some college; 4 = college graduate; 5 = graduate degree).
The categorical variables and binary variables are dummy coded in the prediction model. Three survey variables, SRH (binary), ILW (binary), and FLs (continuous) in the 2018 HRS are selected as key variables of interest for illustration purposes. We used regression models for predicting values for each case. A generalized linear regression model can be expressed as
where
Specifically, logistic regression is used to predict values of SRH and ILW, and normal linear regression is used to predict values of FLs. All of these models are estimated using a frequentist approach. We did not use a Bayesian approach, since it would require an additional wave of HRS survey data (e.g., the 2014 wave) to elicit prior information. Predicted values of the survey variables are obtained at the case level. See the appendix for an assessment of the quality of these predictive models.
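Given estimated coefficient vectors, the case-level prediction step reduces to applying the appropriate link function. The helper below is a sketch with hypothetical inputs: `X` holds the predictors listed above (with an intercept column), and the three coefficient vectors are assumed to come from the fitted logistic and linear models.

```python
import numpy as np

def predict_survey_vars(X, beta_srh, beta_ilw, beta_fl):
    """Case-level predictions for the three key variables.

    X:        (n, p) predictor matrix with an intercept column.
    beta_*:   (p,) coefficient vectors, assumed already estimated
              from respondents (hypothetical values).
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    return {
        "SRH": sigmoid(X @ beta_srh),  # P(good or better health)
        "ILW": sigmoid(X @ beta_ilw),  # P(impairment limits work)
        "FLs": X @ beta_fl,            # expected functional-limitation score
    }
```

These case-level predictions feed the weighted mean squared error component of the stopping rule alongside the predicted costs.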
3.3. Data Structure of the Study
We implemented the stopping rule only once, on data collection day 28. The performance of the stopping rule is evaluated at the end of data collection using all observed data. Since the benchmark estimates are themselves subject to nonresponse, and published estimates may be nonresponse-adjusted, we applied multiple imputation to account for the 1,953 nonrespondents, as well as some item-missing data; multiple imputation may be more flexible than weighting methods for addressing a general missing data pattern. We used CART models to multiply impute missing data, since CART models are robust against outliers and flexible enough to capture interactions, nonlinear relationships, and complex distributions (e.g., Burgette and Reiter 2010). Figure 1 shows the data structure used for implementing the stopping rule and evaluating the simulated stopping effects.

The data structure for implementing and evaluating the multivariate stopping rule.
We created ten possible scenarios for the configuration of prespecified weights for SRH (unweighted mean = 0.70), FLs (unweighted mean = 3.1), and ILW (unweighted mean = 0.35). The unweighted correlation between ILW and SRH is −0.40, between ILW and FLs is 0.59, and between FLs and SRH is −0.48. A strong correlation between two survey variables is not required to implement the stopping rule.
Table 2 shows a few possible configurations of estimate-level weights. Exploring the impacts of estimate-level weights is the first step toward applying the stopping rule in practice. The first three scenarios in Table 2 can also be treated as univariate stopping rules since the quality of only one estimate is considered. Scenarios 4 to 6 can also be treated as bivariate stopping rules since the set of cases to stop is determined by the mean squared errors of two estimates. Scenarios 7 to 10 consider the mean squared errors of all three estimates in the stopping rule. If all three variables are equally important, we would assign equal estimate-level weights to them (i.e., Scenario 10). In other situations, we would assign a higher weight to a variable that is more important than the other two equally important variables (e.g., Scenarios 7–9).
Configuration of the Estimate-Level Weights.
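To make the role of the estimate-level weights concrete, the snippet below evaluates a weighted error measure under a few configurations. The per-estimate squared errors are illustrative values only, and the univariate scenario shown is a hypothetical ordering; Scenarios 4 and 10 follow the definitions in the text. Note that FLs is a count score on a much wider scale than the two binary variables, so its squared error dominates an equally weighted sum unless the weights (or errors) are rescaled.

```python
import numpy as np

# Illustrative per-estimate squared errors for (SRH, FLs, ILW); FLs is a
# summed count score, so its squared error sits on a much larger scale.
mse = np.array([0.0004, 0.0900, 0.0006])

scenarios = {
    "S1 (SRH only, hypothetical)": np.array([1.0, 0.0, 0.0]),
    "S4 (SRH and FLs, equal)":     np.array([0.5, 0.5, 0.0]),
    "S10 (all three, equal)":      np.array([1/3, 1/3, 1/3]),
}
# Weighted error measure that drives which cases the rule prefers to stop
weighted = {name: float(w @ mse) for name, w in scenarios.items()}
```

Under the illustrative values, any scenario that puts weight on FLs is dominated by the FLs error, which echoes the discussion of unbalanced results under equal weights.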
3.4. Evaluation of the Performance of the Stopping Rule
The performance of the stopping rule is evaluated at the end of data collection. All three variables are evaluated separately on their original scales. For each scenario, we let
The multiple-imputation variance of
For the original data collection, we let
The multiple-imputation variance of
We treat
The following statistics are used for evaluation.
1. Percent relative bias (
2. Percent relative root mean squared error (
where
3. Percent relative estimated saved interviewer hours (%
where
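Under one common set of definitions (squared bias plus variance for the mean squared error, all expressed relative to the benchmark), the three evaluation statistics can be sketched as follows; the paper's exact variance terms come from the multiple-imputation combining rules, which are omitted here.

```python
import numpy as np

def pct_relbias(est, benchmark):
    """Percent relative bias of an estimate against the benchmark."""
    return 100.0 * (est - benchmark) / benchmark

def pct_relrmse(est, var_est, benchmark):
    """Percent relative root MSE: squared bias plus variance, relative
    to the benchmark (one common definition; the variance term would
    come from the multiple-imputation combining rules)."""
    return 100.0 * np.sqrt((est - benchmark) ** 2 + var_est) / benchmark

def pct_saved_hours(hours_rule, hours_original):
    """Percent relative estimated saved interviewer hours."""
    return 100.0 * (hours_original - hours_rule) / hours_original
```

For example, a design that needs 880 interviewer hours instead of 1,000 saves 12 percent; the hours here are invented for illustration, not the HRS results.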
4. Results
4.1. Estimated Cost of Data Collection
Table 3 shows several effort-related measures for each scenario. These include the number of stopped cases, percent of cases stopped, number of interviews, response rate, estimated total interviewer hours, and percent
Effort-Related Measures at the End of Data Collection by Scenario.
4.2. Estimated Data Quality
Table 4 presents the percent
Percent Relbiases of Three Estimates at the End of Data Collection by Scenario.
Table 5 shows the percent
Percent RelRMSEs of Three Estimates at the End of Data Collection by Scenario.
5. Discussion
This study proposed a multivariate stopping rule that optimizes the tradeoff between the cost of data collection and the weighted mean squared errors of multiple sample means. We illustrated the use of the proposed stopping rule using the 2018 wave of the telephone component of the HRS. We found that assigning equal weights to the three survey variables of interest (Scenario 10) did not yield balanced mean squared error at the end of data collection. We also found that assigning equal weights to both SRH and FLs and a weight of 0 to ILW (Scenario 4) had the best cost-error tradeoff, reducing the cost by 12% while maintaining the same level of data quality.
The estimate-level weights in the multivariate stopping rule determine how much influence each estimate has. Our results suggest that assigning a higher weight to a survey variable does not always lead to better data quality for that variable. One plausible reason is that the variance of the predictions might also affect the final results (see Figures A2–A4 in the appendix for model performance). For example, Scenario 10, which did not have the best overall percent
Scenario 4 had the best cost-error tradeoff, which may be due to both chance and the modest correlations among these three variables. Selecting appropriate estimate-level weights extends beyond merely considering the relative importance of key estimates. Future research is needed to better understand the role of prediction errors in the configuration of estimate-level weights. To improve the multivariate stopping rule, one could also revise the optimization procedure by accounting for prediction errors.
This study has several limitations. First, the three survey variables consist of a combination of a continuous variable and two binary variables, but all of them are from the health domain. In practice, key survey variables might be selected from multiple domains to meet the data quality objectives of a multipurpose survey.
Second, the current study only focuses on improving the efficiency of data collection. Equation (1) is capable of judging the tradeoff between the cost and the data quality of stopping alternative sets of cases. However, this equation overlooks cost constraints and data quality requirements. The primary objective of survey data collection is to maximize data quality for a given budget or to minimize survey costs for a given level of data quality. To address this issue, future research might add a cost constraint or some data quality requirements to Equation (1).
Third, the “cutoff point” for model performance required for the implementation of the stopping rule remains unknown. Estimating inputs to the stopping rule is essentially a prediction problem. We illustrate the stopping rule in a panel study that can use a rich dataset (e.g., survey variables from the previous waves) to make predictions. It can also be applied to cross-sectional surveys as long as survey costs and survey variables are predictable. In cross-sectional surveys, auxiliary data (e.g., the sampling frame, planning database, and commercial data) can be used for making predictions.
Fourth, although the cost predictions were positively correlated with observed costs, the correlation was weak (see Figure A1 in the appendix). One issue with the prediction model is that we only used time-invariant variables in the discrete time hazard model. The cost predictions could be improved by incorporating time-varying paradata (e.g., the outcomes of previous call attempts) into the cost prediction model, as time-varying paradata may contain useful information for prediction (e.g., Durrant and D’Arrigo 2014). While we only consider cost predictions in the telephone mode, the choice of survey mode also affects the cost structure of data collection. For example, a cost model for face-to-face interviewing should also account for travel costs. Wagner et al. (2023) argued that the cost of face-to-face interviewing should be evaluated at the secondary sampling unit (e.g., neighborhood) level, which would make travel costs easier to assess. We leave the topic of cost estimation in other modes for future research.
The quality component in the stopping rule can be modified to match specific data quality objectives. We focused only on the mean squared errors of sample means, but the stopping rule can easily be extended to other statistics (e.g., a correlation between two key survey variables or the distribution of a survey variable) with simple modifications. In addition, the variance estimators in this study were derived under the assumption of simple random sampling, so extending the stopping rule to complex sample designs is also needed. A replication variance estimator seems to be a straightforward way to handle stratification and clustering. However, the stopping rule must also account for practical considerations in complex sample designs, such as a requirement for a minimum number of interviews in each stratum.
Stopping a subset of cases during data collection essentially reallocates effort from stopped cases to other unresolved cases. More experimental studies in different contexts are needed to further test the performance of the stopping rule. An ideal situation is to stop cases that have low impacts on data quality and are also less likely to respond at the early stages of data collection.
6. Conclusion
The proposed multivariate stopping rule considers the potential consequences of stopping cases during data collection, namely, cost reduction and impacts on multiple survey estimates. There are five key steps in implementing the stopping rule: (1) estimate the cost model and predict future costs; (2) select key variables; (3) estimate the survey variable models and compute predictions; (4) assign weights to each variable; (5) run the stopping rule algorithm.
The selection of key survey variables should align with data quality objectives. For example, key variables can be those that are frequently used by data users, are representative of other variables, or are considered to have a high risk of bias. Strong relationships among survey variables are not required for the multivariate stopping rule, but the effect of stopping on the estimated data quality would be reinforced when two selected variables have a higher correlation. The estimate-level weights are also an important input to the stopping rule. A more detailed discussion of the configuration of estimate-level weights is provided in the Discussion section.
Different variables may react differently to the stopping rule. In our study, SRH generally had better percent
