Abstract
1. Introduction
Surveys have long faced the dual threats of rising data collection costs and declining response rates (e.g., Curtin et al. 2005; de Leeuw and de Heer 2002; Williams and Brick 2018). In response to these challenges, Groves and Heeringa (2006) introduced the notion of responsive survey design (RSD). RSD is an active survey data management technique that uses incoming data to make design decisions during the field period.
The adoption of computerized survey data collection gives survey managers access to timely information about the progress of data collection. A key element of RSD is the collection and analysis of data describing the data collection process, better known as paradata (e.g., Couper 1998; Couper and Lyberg 2005; Olson 2013). In RSD, paradata are used to compute predictions for informing design decisions and to document the effects of changes (“interventions”) on key process indicators (e.g., variation in the response rates among subgroups). For example, Groves et al. (2009) monitor four levels of paradata daily to guide active interventions, Kirgis and Lepkowski (2013) use paradata to guide interviewers toward sample members in a specific subgroup, and Wagner et al. (2012) use paradata to systematically guide interviewer behaviors during the field period.
The term stopping rule is borrowed from clinical trials (Rao et al. 2008), where data are periodically reviewed to decide whether to stop a clinical trial before its planned completion. Recently, several stopping rules have been proposed for surveys, but their goal is different from those used in a randomized controlled trial. Stopping rules in surveys aim to balance survey costs and errors. For example, stopping rules can be used to determine whether to stop data collection or to initiate a new phase of data collection (Lewis 2017, 2019; Rao et al. 2008; Wagner and Raghunathan 2010). Potential design changes in response to stopping rules include switching modes, changing the amount of incentives, increasing interviewer effort, or discontinuing nonresponse follow-up. These stopping rules are triggered by detecting phase capacity, which refers to the stable condition of an estimate or several estimates in a specific design phase.
Rao et al. (2008) proposed three stopping rules that are based on testing whether an estimated proportion changes substantially following the completion of wave
Lewis (2019) introduced two multivariate stopping rules to address the situations where independent stopping rules produce conflicting results when detecting phase capacity for multiple estimates. The first multivariate stopping rule is based on the Wald chi-squared test of changes in estimates at the completion of waves
The second multivariate stopping rule is based on the non-zero trajectory method. The relative percent changes in estimates are modeled as a linear function of the data collection wave. When all estimated regression coefficients in the model are statistically indistinguishable from zero, one can declare that the overall phase capacity is detected. However, the non-zero trajectory method requires additional waves of follow-up to avoid fitting a saturated model. For example, at least four waves are required to detect overall phase capacity for three estimates.
Existing stopping rules that are based on detecting phase capacity have three critical limitations. First, phase capacity is a proxy measure of data quality (Lewis 2017, 2019). Phase capacity suggests that an estimate reaches a stable condition in a design phase, but it does not necessarily indicate the estimate is free of nonresponse error. Therefore, there is a need for a stopping rule that considers direct measures of data quality.
Second, most of these stopping rules are retrospective in nature (Lewis 2017, 2019; Rao et al. 2008; Wagner and Raghunathan 2010). Testing for phase capacity requires another wave of data collection, which demands more time and resources. These stopping rules are feasible when there are sufficient funds for future waves of data collection to detect phase capacity. However, the survey manager need not wait until phase capacity is reached to stop some cases if stopping them early can substantially reduce data collection costs while maintaining an acceptable level of data quality.
Lastly, most of these stopping rules are univariate in nature (Lewis 2017; Rao et al. 2008; Wagner and Raghunathan 2010; Wagner et al. 2023). Two multivariate stopping rules proposed by Lewis (2019) are exceptions, but these two rules do not explicitly account for survey costs. In multipurpose surveys, there may be data quality objectives that must be met for certain estimates with a constraint on costs. A decision to stop effort on a subset of cases can hardly be acceptable if the decision undermines the quality of other key estimates.
Two-phase sampling for nonresponse follow-up (Hansen and Hurwitz 1946) is another type of stopping rule. Instead of discontinuing nonresponse follow-up for all nonrespondents, two-phase sampling stops effort on only a subset of nonrespondents. Since the selection procedure in two-phase sampling is random, it implicitly assumes that the cost is constant across all nonrespondents. If one has predicted measures of data quality and the cost varies across nonrespondents, the selection procedure for stopping a subset of cases can be improved. For example, Wagner et al. (2023) presented a stopping rule aimed at optimizing the cost-error tradeoff. Their stopping rule relies on predictions of data collection costs and data quality as inputs, and is used to stop a subset of unresolved cases during data collection while the other unresolved cases are followed up. However, their stopping rule is univariate in nature. Again, a univariate decision rule for survey data collection overlooks the quality of other key estimates, and stopping effort on a case before phase capacity is declared could have a negative ripple effect on other key estimates. While comparisons to the existing rules go beyond the scope of the current study, the pros and cons of these existing stopping rules are summarized in Table 1.
Pros and Cons of Existing Stopping Rules.
We aim to develop a multivariate stopping rule for survey data collection that accounts for the cost of data collection and the quality of multiple estimates. This study extends the stopping rule proposed by Wagner et al. (2023). In their study with multiple survey variables, Wagner et al. applied the stopping rule separately for each survey variable and stopped effort on cases that were flagged for stopping on any one of the survey variables. Our proposed stopping rule, however, simultaneously considers the data quality of multiple survey variables when determining which cases to stop. To achieve this, we use the weighted mean squared errors of multiple sample means, with each estimate receiving a prespecified weight. We illustrate via simulation how the stopping rule performs empirically.
2. A Multivariate Stopping Rule
The proposed stopping rule is implemented in the following setting. In the data collection process, a survey manager aims to stop effort on a subset of unresolved cases in a way that maximizes the tradeoff between the cost and data quality of multiple estimates—a fairly large amount of cost savings relative to the loss in data quality—at the end of data collection. Other unresolved cases where effort is not stopped will continue to be followed up. To fix notation, we let
Many cases are resolved by a series of attempts. At a specific point in time, the value of the sunk cost for case
Before implementing the stopping rule, the estimated total cost would be
To summarize, the inputs of the proposed stopping rule include: (a) the estimated data collection costs after stopping
which mimics the function for optimal allocation in stratified sampling (Cochran 1977, 97). The estimated cost
Because of
However, finding the optimal set of cases to stop that has the exact minimal value for Equation (1) is computationally expensive (or even computationally prohibitive) when the number of unresolved cases,
Before implementing the stopping rule, the value of
The objective function for stopping effort on case
where
We propose to calculate
Next, we take the remaining
where
Then, we propose to select a case from the remaining unresolved cases, resulting in the smallest value of
where
The set
To summarize, the implementation of the stopping rule includes five main steps.
Estimate the cost model, and predict a future cost
Select key variables based on the data quality objectives. For example, key variables can include those that are frequently used by data users, are representative of other variables, or are considered to have a high risk of bias.
Estimate the survey variable models for all key variables of interest, predict survey variables
Assign weights to each variable.
Use the proposed algorithm mentioned above to identify the set of cases that minimize the following objective function:
The SAS code for implementing the proposed multivariate stopping rule is available at https://github.com/xyzhangxinyu/StoppingRules.
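As a language-agnostic illustration, the greedy search described above can be sketched in Python. This is a minimal sketch, not the published SAS implementation: the objective below, the product of the remaining predicted cost and the weighted squared deviation of the predicted means, is a hypothetical stand-in for Equation (1), and all inputs are placeholders for the model outputs described in Section 2.

```python
import numpy as np

def greedy_stop(costs, preds, weights, benchmark):
    """Greedily pick unresolved cases to stop (illustrative only).

    costs:     (n,) predicted remaining data collection cost per case
    preds:     (n, k) predicted values of the k key survey variables
    weights:   (k,) prespecified estimate-level weights
    benchmark: (k,) current estimates of the k means

    The objective is a hypothetical stand-in for Equation (1): the
    remaining cost times the weighted squared deviation of the means
    computed without the stopped cases.
    """
    active = list(range(len(costs)))
    stopped = []

    def objective(idx):
        idx = np.asarray(idx)
        remaining_cost = costs[idx].sum()
        means = preds[idx].mean(axis=0)
        wmse = float(weights @ (means - benchmark) ** 2)
        return remaining_cost * wmse

    current = objective(active)
    improved = True
    while improved and len(active) > 1:
        improved = False
        best_val, best_i = current, None
        for i in active:               # try stopping each case in turn
            val = objective([j for j in active if j != i])
            if val < best_val:
                best_val, best_i = val, i
        if best_i is not None:         # stop the case that helps most
            active.remove(best_i)
            stopped.append(best_i)
            current, improved = best_val, True
    return stopped
```

At each pass, the case whose removal most reduces the objective is stopped; the search ends when no single removal improves it, mirroring the stepwise selection described above.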
3. Simulation Study
3.1. Data
We used real data from the 2018 wave of the telephone component of the Health and Retirement Study (HRS) to simulate the stopping effects of the proposed rule. The HRS is a longitudinal study of the U.S. population over age 50. Among 7,415 sampled cases in the 2018 wave of the telephone component of the HRS, 5,462 eventually responded to the survey. The field work took 416 days to complete.
Three types of HRS data were used to model survey design quantities. First, we used the 2016 and incoming 2018 HRS timesheet data to estimate the interviewer hours required for each type of call attempt (e.g., interview, contact, and no contact) in the telephone mode. We treat interviewer hours as a proxy for survey costs. Since an interviewer might contact cases both in person and by telephone on the same day, timesheet data for the face-to-face (FTF) mode were also included. The timesheet data contained one outcome variable and seven predictors recorded at the interviewer-day level. The outcome variable is the number of hours each interviewer worked each day. The predictors include the number of FTF noncontacts, the number of FTF contacts (without an interview), the number of FTF interviews, the number of telephone noncontacts, the number of telephone contacts (without an interview), the number of telephone interviews, and an indicator of any FTF attempt (accounting for travel time in FTF).
Second, we used the 2016 and incoming 2018 call record data for the telephone mode, as well as the 2016 survey data to model propensity scores at the call attempt level. The call record data contain paradata, such as call attempt number, outcome of each call attempt, and mode of call attempt. To align with the timesheet data, we recoded call attempt outcomes into one of the three categories: interview, contact without an interview, and no contact.
Third, to predict survey variables, we used the 2016 survey data and the 2018 survey data available at the time the stopping rule was implemented (e.g., on a specific data collection day). In practice, survey managers should select key survey variables for the stopping rule. As an illustration, we selected three survey variables: self-rated health (SRH), functional limitations (FLs), and impairments that limit work (ILW). For SRH, respondents were asked, “Would you say your health is excellent, very good, good, fair, or poor?”; responses were coded as a binary variable (1 if the respondent reported excellent, very good, or good, and 0 if the respondent reported fair or poor). FLs is a summed score of 23 binary indicators: ten mobility tasks, six activities of daily living (ADL), and seven instrumental activities of daily living (IADL). A higher FLs score indicates more physical limitations. For ILW, respondents were asked, “Do you have any impairment or health problem that limits the kind or amount of paid work you can do?”; responses were coded as a binary variable (1 = yes or too old; 0 = no).
3.2. Inputs to the Stopping Rule
We implemented the stopping rule on data collection day 28 for illustration purposes. Had we implemented the stopping rule on data collection day 84, when half of the cases had already received more than three call attempts, the cost savings would have been negligible because only a few cases would have been stopped in this study. Readers should be cautious about implementing the stopping rule on day 28 in practice and should assess whether other days would make more operational sense for a given survey. We leave the topic of the optimal timing for implementing the stopping rule for future research.
The number of call attempts is another time measure commonly used for implementing stopping rules. Zhang (2023) tested several numbers of call attempts using data from the telephone component of the 2018 HRS to identify the optimal single point in time to implement a stopping rule that maximizes data quality for a given budget. The cost estimation was focused on the number of call attempts to finalization. In that context, the best timing was found to be after eight to ten call attempts, yielding the same data quality with the fewest call attempts per interview. The choice between the data collection day and the number of call attempts as a time measure depends on operational constraints and the quality of predictions.
The inputs to the stopping rule are shown in the following subsections.
a. Cumulative effort/cost. We do not have access to fixed costs in the simulation study, so we consider only variable costs. We modeled the number of hours each interviewer worked each day by fitting a multilevel model with a random intercept for each interviewer,
where
The seven timesheet variables (see Subsection 3.1) were used as predictors in the multilevel model. Specifically, the coefficients for the number of telephone noncontacts, the number of telephone contacts (without an interview), and the number of telephone interviews each represent the hours spent on a telephone call attempt that results in the corresponding outcome. The estimated hours for an interview, a contact without an interview, and a noncontact in the telephone mode were 1.58, 0.18, and 0.07 hours, respectively. These estimated hours per call attempt outcome are used to predict the case-level cost.
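As a rough illustration of how per-attempt hours can be recovered from interviewer-day records, the following sketch simulates timesheet data and estimates the coefficients by least squares. It omits the random interviewer intercept of the actual multilevel model, and the simulated daily attempt counts are invented; only the per-attempt hours (0.07, 0.18, and 1.58) are taken from the estimates reported above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 400
# Hypothetical interviewer-day records; columns are counts of telephone
# noncontacts, contacts (without an interview), and interviews.
X = rng.poisson(lam=[6.0, 2.0, 1.0], size=(n_days, 3)).astype(float)
true_hours = np.array([0.07, 0.18, 1.58])   # point estimates reported above
y = X @ true_hours + rng.normal(0.0, 0.05, n_days)  # hours worked per day

# Fixed-effects-only least squares stand-in for the multilevel model;
# coef[i] estimates the hours per attempt of each outcome type.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With enough interviewer-days, the per-attempt coefficients are recovered closely; the real model additionally absorbs interviewer-level variation through the random intercept.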
To predict propensity scores for future call attempts, we fit a multinomial logistic model to the call record data. The call number
where
We used the LASSO (least absolute shrinkage and selection operator) for variable selection. The variable list is shown in the appendix (see Table A1). The discrete time hazard model was estimated in a Bayesian fashion, eliciting priors from the 2016 call-level data to protect against biased estimates when using the early 2018 data (Schouten et al. 2018). For the regression parameters, we used normal priors with parameters elicited by fitting an identical regression model to the previous data collection wave (the 2016 HRS) and using the resulting point estimates and associated variances as the means and variances of the priors (see Table A2 in the appendix). The posterior distributions of the parameters were generated using a Markov chain Monte Carlo (MCMC) algorithm. The initial 500 MCMC iterations served as burn-in to ensure that the chain had converged to the target distribution. Then, every tenth iteration of the MCMC chain was retained until a total of 500 draws were obtained. We used the posterior means to estimate the parameters of the discrete time hazard model.
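The sampling scheme described above (normal priors elicited from the previous wave, 500 burn-in iterations, and thinning to every tenth draw until 500 draws remain) can be sketched for a single parameter with a random-walk Metropolis sampler. Everything here is illustrative: the outcome is synthetic, the model is reduced to an intercept-only logistic regression, and the prior mean and variance are hypothetical stand-ins for the 2016-elicited values.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.binomial(1, 0.3, size=200)      # synthetic 0/1 call outcomes
prior_mean, prior_var = -0.8, 0.25      # hypothetical prior from the earlier wave

def log_post(b):
    # intercept-only logistic likelihood plus a normal log-prior
    p = 1.0 / (1.0 + np.exp(-b))
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return loglik - 0.5 * (b - prior_mean) ** 2 / prior_var

b, draws = prior_mean, []
for it in range(500 + 10 * 500):        # 500 burn-in + 10 x 500 post-burn-in
    prop = b + rng.normal(0.0, 0.2)     # random-walk proposal
    if np.log(rng.uniform()) < log_post(prop) - log_post(b):
        b = prop                        # accept the proposal
    if it >= 500 and (it - 500) % 10 == 9:
        draws.append(b)                 # keep every tenth post-burn-in draw

posterior_mean = float(np.mean(draws))  # 500 retained draws
```

The posterior mean plays the role of the parameter estimate used in the discrete time hazard model; in practice the full multinomial model is sampled jointly rather than one coefficient at a time.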
Since an interview for an unresolved case may be achieved by multiple call attempts, we extend the horizon for predicting interviewer hours out to twenty-one call attempts (around 80% of active HRS cases were finalized within twenty-one call attempts in 2016). Let
The estimated interviewer hours of
where
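One way to assemble the case-level cost prediction is to accumulate expected hours over the twenty-one-attempt horizon, discounting each attempt by the probability that the case is still unresolved. The sketch below assumes the hazard model supplies per-attempt outcome probabilities and, as a simplification, treats an interview as the only event that resolves a case (the paper also handles other resolutions).

```python
import numpy as np

def expected_future_hours(p_outcome, hours, horizon=21):
    """Expected remaining interviewer hours for one unresolved case.

    p_outcome: (horizon, 3) per-attempt probabilities of
               (interview, contact without interview, no contact),
               conditional on the case still being active; a
               hypothetical output of the discrete time hazard model.
    hours:     (3,) estimated hours per call attempt outcome.
    """
    p_active = 1.0  # probability the case is still unresolved
    total = 0.0
    for t in range(horizon):
        # expected hours for attempt t, weighted by survival to attempt t
        total += p_active * float(p_outcome[t] @ hours)
        p_active *= 1.0 - p_outcome[t, 0]  # an interview resolves the case
    return total
```

Summing these case-level predictions over the unresolved cases gives the estimated remaining variable cost that enters the stopping rule.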
b. Prediction of survey variables. Survey variables in 2016 and demographic variables are treated as predictors:
• FL (range = 0–21),
• SRH (1 = excellent/very good/good; 2 = fair/poor),
• ILW (1 = yes; 2 = no),
• Number of private health insurance plans (range = 0–10),
• Medicaid coverage (1 = yes; 2 = no),
• Currently working for pay (1 = yes; 2 = no),
• Regular use of web for email (1 = yes; 2 = no),
• Diabetes status (1 = yes; 2 = no),
• Age in 2016 (range = 24–101; some spouses can be under age 50),
• Race/Ethnicity (1 = Hispanic; 2 = non-Hispanic Black; 3 = non-Hispanic White; 4 = other),
• Gender (1 = female; 2 = male), and
• Education (1 = less than high school; 2 = high school or equivalent; 3 = some college; 4 = college graduate; 5 = graduate degree).
The categorical variables and binary variables are dummy coded in the prediction model. Three survey variables, SRH (binary), ILW (binary), and FLs (continuous) in the 2018 HRS are selected as key variables of interest for illustration purposes. We used regression models for predicting values for each case. A generalized linear regression model can be expressed as
where
Specifically, logistic regression is used to predict values of SRH and ILW, and normal linear regression is used to predict values of FLs. All of these models are estimated using a frequentist approach. We did not use a Bayesian approach, since it would require an additional wave of HRS survey data (e.g., the 2014 wave) to elicit prior information. Predicted values of the survey variables are obtained at the case level. See the appendix for an assessment of the quality of these predictive models.
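Given estimated coefficient vectors, the case-level prediction step reduces to applying the appropriate link function. The helper below is a sketch with hypothetical inputs: `X` holds the predictors listed above (with an intercept column), and the three coefficient vectors are assumed to come from the fitted logistic and linear models.

```python
import numpy as np

def predict_survey_vars(X, beta_srh, beta_ilw, beta_fl):
    """Case-level predictions for the three key variables.

    X:        (n, p) predictor matrix with an intercept column.
    beta_*:   (p,) coefficient vectors, assumed already estimated
              from respondents (hypothetical values).
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    return {
        "SRH": sigmoid(X @ beta_srh),  # P(good or better health)
        "ILW": sigmoid(X @ beta_ilw),  # P(impairment limits work)
        "FLs": X @ beta_fl,            # expected functional-limitation score
    }
```

These case-level predictions feed the weighted mean squared error component of the stopping rule alongside the predicted costs.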
3.3. Data Structure of the Study
We implemented the stopping rule only once, on data collection day 28. The performance of the stopping rule is evaluated at the end of data collection using all observed data. Since the benchmark estimates are themselves subject to nonresponse, and published estimates may be nonresponse-adjusted, we applied multiple imputation to account for the 1,953 nonrespondents, as well as some item-missing data; multiple imputation may be more flexible than weighting methods for addressing a general missing data pattern. We used CART models to multiply impute missing data, since CART models are robust against outliers and flexible enough to capture interactions, nonlinear relationships, and complex distributions (e.g., Burgette and Reiter 2010). Figure 1 shows the data structure used for implementing the stopping rule and evaluating the simulated stopping effects.

The data structure for implementing and evaluating the multivariate stopping rule.
We created ten possible scenarios for the configuration of prespecified weights for SRH (unweighted mean = 0.70), FLs (unweighted mean = 3.1), and ILW (unweighted mean = 0.35). The unweighted correlation between ILW and SRH is −0.40, between ILW and FLs is 0.59, and between FLs and SRH is −0.48. A strong correlation between two survey variables is not required to implement the stopping rule.
Table 2 shows a few possible configurations of estimate-level weights. Exploring the impacts of estimate-level weights is the first step toward applying the stopping rule in practice. The first three scenarios in Table 2 can also be treated as univariate stopping rules since the quality of only one estimate is considered. Scenarios 4 to 6 can also be treated as bivariate stopping rules since the set of cases to stop is determined by the mean squared errors of two estimates. Scenarios 7 to 10 consider the mean squared errors of all three estimates in the stopping rule. If all three variables are equally important, we would assign equal estimate-level weights to them (i.e., Scenario 10). In other situations, we would assign a higher weight to a variable that is more important than the other two equally important variables (e.g., Scenarios 7–9).
Configuration of the Estimate-Level Weights.
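To make the role of the estimate-level weights concrete, the snippet below evaluates a weighted error measure under a few configurations. The per-estimate squared errors are illustrative values only, and the univariate scenario shown is a hypothetical ordering; Scenarios 4 and 10 follow the definitions in the text. Note that FLs is a count score on a much wider scale than the two binary variables, so its squared error dominates an equally weighted sum unless the weights (or errors) are rescaled.

```python
import numpy as np

# Illustrative per-estimate squared errors for (SRH, FLs, ILW); FLs is a
# summed count score, so its squared error sits on a much larger scale.
mse = np.array([0.0004, 0.0900, 0.0006])

scenarios = {
    "S1 (SRH only, hypothetical)": np.array([1.0, 0.0, 0.0]),
    "S4 (SRH and FLs, equal)":     np.array([0.5, 0.5, 0.0]),
    "S10 (all three, equal)":      np.array([1/3, 1/3, 1/3]),
}
# Weighted error measure that drives which cases the rule prefers to stop
weighted = {name: float(w @ mse) for name, w in scenarios.items()}
```

Under the illustrative values, any scenario that puts weight on FLs is dominated by the FLs error, which echoes the discussion of unbalanced results under equal weights.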
3.4. Evaluation of the Performance of the Stopping Rule
The performance of the stopping rule is evaluated at the end of data collection. All three variables are evaluated separately on their original scales. For each scenario, we let
The multiple-imputation variance of
For the original data collection, we let
The multiple-imputation variance of
We treat
The following statistics are used for evaluation.
1. Percent relative bias (
2. Percent relative root mean squared error (
where
3. Percent relative estimated saved interviewer hours (%
where
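Under one common set of definitions (squared bias plus variance for the mean squared error, all expressed relative to the benchmark), the three evaluation statistics can be sketched as follows; the paper's exact variance terms come from the multiple-imputation combining rules, which are omitted here.

```python
import numpy as np

def pct_relbias(est, benchmark):
    """Percent relative bias of an estimate against the benchmark."""
    return 100.0 * (est - benchmark) / benchmark

def pct_relrmse(est, var_est, benchmark):
    """Percent relative root MSE: squared bias plus variance, relative
    to the benchmark (one common definition; the variance term would
    come from the multiple-imputation combining rules)."""
    return 100.0 * np.sqrt((est - benchmark) ** 2 + var_est) / benchmark

def pct_saved_hours(hours_rule, hours_original):
    """Percent relative estimated saved interviewer hours."""
    return 100.0 * (hours_original - hours_rule) / hours_original
```

For example, a design that needs 880 interviewer hours instead of 1,000 saves 12 percent; the hours here are invented for illustration, not the HRS results.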
4. Results
4.1. Estimated Cost of Data Collection
Table 3 shows several effort-related measures for each scenario. These include the number of stopped cases, percent of cases stopped, number of interviews, response rate, estimated total interviewer hours, and percent
Effort-Related Measures at the End of Data Collection by Scenario.
4.2. Estimated Data Quality
Table 4 presents the percent
Percent Relbiases of Three Estimates at the End of Data Collection by Scenario.
Table 5 shows the percent
Percent RelRMSEs of Three Estimates at the End of Data Collection by Scenario.
5. Discussion
This study proposed a multivariate stopping rule that optimizes the tradeoff between the cost of data collection and the weighted mean squared errors of multiple sample means. We illustrated the use of the proposed stopping rule using the 2018 wave of the telephone component of the HRS. We found that assigning equal weights to the three survey variables of interest (Scenario 10) did not yield balanced mean squared error at the end of data collection. We also found that assigning equal weights to both SRH and FLs and a weight of 0 to ILW (Scenario 4) had the best cost-error tradeoff, reducing the cost by 12% while maintaining the same level of data quality.
The estimate-level weights in the multivariate stopping rule determine how much influence each estimate has. Our results suggest that assigning a higher weight to a survey variable does not always lead to better data quality for that variable. One plausible reason is that the variance of the predictions might also affect the final results (see Figures A2–A4 in the appendix for model performance). For example, Scenario 10, which did not have the best overall percent
Scenario 4 had the best cost-error tradeoff, which may be due to both chance and the modest correlations among these three variables. Selecting appropriate estimate-level weights extends beyond merely considering the relative importance of key estimates. Future research is needed to better understand the role of prediction errors in the configuration of estimate-level weights. To improve the multivariate stopping rule, one could also revise the optimization procedure by accounting for prediction errors.
This study has several limitations. First, the three survey variables consist of a combination of a continuous variable and two binary variables, but all of them are from the health domain. In practice, key survey variables might be selected from multiple domains to meet the data quality objectives of a multipurpose survey.
Second, the current study only focuses on improving the efficiency of data collection. Equation (1) is capable of judging the tradeoff between the cost and the data quality of stopping alternative sets of cases. However, this equation overlooks cost constraints and data quality requirements. The primary objective of survey data collection is to maximize data quality for a given budget or to minimize survey costs for a given level of data quality. To address this issue, future research might add a cost constraint or some data quality requirements to Equation (1).
Third, the “cutoff point” for model performance required for the implementation of the stopping rule remains unknown. Estimating inputs to the stopping rule is essentially a prediction problem. We illustrate the stopping rule in a panel study that can use a rich dataset (e.g., survey variables from the previous waves) to make predictions. It can also be applied to cross-sectional surveys as long as survey costs and survey variables are predictable. In cross-sectional surveys, auxiliary data (e.g., the sampling frame, planning database, and commercial data) can be used for making predictions.
Fourth, although the cost predictions were positively correlated with observed costs, the correlation was weak (see Figure A1 in the appendix). One issue with the prediction model is that we only used time-invariant variables in the discrete time hazard model. The cost predictions could be improved by incorporating time-varying paradata (e.g., the outcomes of previous call attempts) into the cost prediction model, as time-varying paradata may contain useful information for prediction (e.g., Durrant and D’Arrigo 2014). While we only consider cost predictions in the telephone mode, the choice of survey mode also affects the cost structure of data collection. For example, a cost model for face-to-face interviewing should also account for travel costs. Wagner et al. (2023) argued that the cost of face-to-face interviewing should be evaluated at the secondary sampling unit (e.g., neighborhood) level, which would make travel costs easier to assess. We leave the topic of cost estimation in other modes for future research.
The quality component in the stopping rule can be modified to match specific data quality objectives. We focused only on the mean squared errors of sample means, but the stopping rule can easily be extended to other statistics (e.g., a correlation between two key survey variables or the distribution of a survey variable) with simple modifications. In addition, the variance estimators in this study were derived under the assumption of simple random sampling, so extending the stopping rule to complex sample designs is also needed. A replication variance estimator seems to be a straightforward way to handle stratification and clustering. However, the stopping rule must also account for practical considerations in complex sample designs, such as a requirement for a minimum number of interviews in each stratum.
Stopping a subset of cases during data collection essentially reallocates effort from stopped cases to other unresolved cases. More experimental studies in different contexts are needed to further test the performance of the stopping rule. An ideal situation is to stop cases that have low impacts on data quality and are also less likely to respond at the early stages of data collection.
6. Conclusion
The proposed multivariate stopping rule considers the potential consequences of stopping cases during data collection, namely, cost reduction and impacts on multiple survey estimates. There are five key steps in implementing the stopping rule: (1) estimate the cost model and predict future costs; (2) select key variables; (3) estimate the survey variable models and compute predictions; (4) assign weights to each variable; (5) run the stopping rule algorithm.
The selection of key survey variables should align with data quality objectives. For example, key variables can be those that are frequently used by data users, are representative of other variables, or are considered to have a high risk of bias. Strong relationships among survey variables are not required for the multivariate stopping rule, but the effect of stopping on the estimated data quality would be reinforced when two selected variables have a higher correlation. The estimate-level weights are also an important input to the stopping rule. A more detailed discussion of the configuration of estimate-level weights is provided in the Discussion section.
Different variables may react differently to the stopping rule. In our study, SRH generally had better percent
