Sage Journals: Discover world-class research

Abstract

In this research, we propose a longitudinal adjusted design effect model that integrates both the design effect and time effect, offering reliable estimators of variance. We investigate the asymptotic properties of certain types of estimators. Through simulation studies, we show the exceptional performance of the proposed method, demonstrating its superiority in producing accurate standard error estimators compared to existing approaches.

Keywords

CPS, design effect, generalized variance function, longitudinal generalized variance function, longitudinal adjusted design effect model, simulations

1. Introduction

To calculate variance estimates in survey samples, commonly used methods include Taylor Series Linearization (TSL), jackknife, and Balanced Repeated Replication (BRR; interested readers can refer to Cohen (1979), Burt and Cohen (1984), Rao and Wu (1988), Rao (1988) and Wolter (2007), etc.). However, in many large-scale sample surveys such as the Current Population Survey (CPS) and the Canadian Labour Force Survey (CLFS), calculating thousands of variance estimates using these standard methods requires a significant amount of labor.

The Generalized Variance Functions (GVFs) have been used by CPS since 1947 (U.S. Census Bureau 2006). GVF develops regression estimates using a set of variance estimates obtained through BRR or TSL methods, along with their associated survey statistics. It allows for the direct calculation of standard errors for a large volume of potential survey statistics through a fitted regression model. Valliant (1987) demonstrated that the GVF model produces consistent estimates of the variance for a certain class of superpopulation models. However, while many current surveys follow the same households at regular time intervals, GVF does not take advantage of the longitudinal feature of the data.

Zhang et al. (2019) proposed a longitudinal generalized variance function (LGVF) by incorporating the time effect into the modeling of longitudinal survey data. The proposed LGVF maintains the property of GVF, where the estimated variances are often more stable than the direct estimates as they smooth out some of the variability from variable to variable when the design effects (determined by comparing the variance of a variable under a particular survey design with the variance under a simple random sampling [SRS] design with the same size) for the variables are similar to each other. However, a challenge faced by both GVF and LGVF is selecting a group of variables with similar design effects (deffs) to build a regression line, which is often practically difficult. Johnson and King (1986) studied GVF estimators using a national survey of reading ability among young adults and found that one way to markedly improve upon the GVF model is to use prior information about the design effect (deff) of individual estimators.

Building upon the groundwork laid by Zhang et al. (2019) and Johnson and King (1986), this research introduces the Longitudinal Adjusted Design Effect model (LADE). The LADE integrates both design and time effects using a model-based approach. The structure of the paper is as follows: Section 2 provides a comprehensive review of Generalized Variance Function (GVF) and Longitudinal Generalized Variance Function (LGVF) models; Section 3 establishes the theoretical framework, introduces the LADEs, and outlines the asymptotic properties of these models; Section 4 presents results from simulation studies conducted with CPS data. The research concludes in Section 5.

2. Generalized Variance Functions (GVFs) and Longitudinal Generalized Variance Functions (LGVFs)

In this section, we briefly review the GVF and LGVF models. More detailed descriptions of GVF can be found in the textbooks by Wolter (2007) and Lohr (2021). A detailed description of LGVF can be found in Zhang et al. (2019).

2.1. Generalized Variance Functions (GVF)

Let ${\hat{T}}_{υ}$ represent a survey statistic for variable $υ$ , such as the estimated number of persons employed. Define ${\hat{p}}_{υ}$ as the estimated proportion of employment, calculated as ${\hat{p}}_{υ} = {\hat{T}}_{υ} / M$ , where $M$ is the population size. Consider $d_{υ}$ as the design effect of ${\hat{p}}_{υ}$ and $m$ as the sample size. We denote the sampling variance of ${\hat{p}}_{υ}$ based on a given design as $var ({\hat{p}}_{υ}) = d_{υ} \times p_{υ} (1 - p_{υ}) / m .$ Introduce the relative variance (relvar) of ${\hat{p}}_{υ}$ as

$relvar ({\hat{p}}_{υ}) = \frac{var ({\hat{p}}_{υ})}{{[E ({\hat{p}}_{υ})]}^{2}} = a_{υ} + \frac{b_{υ}}{E ({\hat{T}}_{υ})},$ (1)

where $a_{υ} = - d_{υ} / m$ and $b_{υ} = M d_{υ} / m$. The relative variance is the square of the coefficient of variation (CV). Let $V$ denote the number of variables under consideration. The GVF approach fits a regression model linking a set of relative variances $relvar ({\hat{p}}_{υ})$ to $E ({\hat{T}}_{υ})$ as follows:

$relvar ({\hat{p}}_{υ}) = a + b / E ({\hat{T}}_{υ}) \quad \text{for } υ = 1, 2, \dots, V,$

assuming that the $d_{υ}$ values across these variables are the same, that is, $a_{υ} = a$ and $b_{υ} = b$. Let $\hat{a}$ and $\hat{b}$ be the regression estimators of $a$ and $b$. The GVF relative variance is predicted by $\hat{a} + \hat{b} / {\hat{T}}_{υ}$. A GVF estimate for $var ({\hat{T}}_{υ})$ is given by the following function,

$\hat{var} ({\hat{T}}_{υ}) = \hat{a} {\hat{T}}_{υ}^{2} + \hat{b} {\hat{T}}_{υ} .$ (2)
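As a rough illustration of how Equations (1) and (2) fit together, the following minimal sketch fits the GVF regression by ordinary least squares and then evaluates Equation (2). The function names `fit_gvf` and `gvf_variance` and the toy values of $M$, $m$, and $d$ are assumptions made for illustration only; they are not taken from the CPS.

```python
def fit_gvf(totals, relvars):
    """OLS fit of relvar = a + b / T; returns (a_hat, b_hat)."""
    x = [1.0 / t for t in totals]
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(relvars) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, relvars))
    b_hat = sxy / sxx
    a_hat = y_bar - b_hat * x_bar
    return a_hat, b_hat


def gvf_variance(total, a_hat, b_hat):
    """Equation (2): var(T) = a * T^2 + b * T."""
    return a_hat * total ** 2 + b_hat * total


# Toy inputs (hypothetical): a common design effect d for all variables,
# so that a = -d/m and b = M*d/m hold exactly as in Equation (1).
M, m, d = 1_000_000, 1000, 2.0
totals = [50_000.0, 100_000.0, 200_000.0, 400_000.0]
relvars = [-d / m + (M * d / m) / t for t in totals]

a_hat, b_hat = fit_gvf(totals, relvars)
var_hat = gvf_variance(100_000.0, a_hat, b_hat)
print(a_hat, b_hat, var_hat)
```

Because the toy relative variances lie exactly on the line, the fit recovers $a = -d/m$ and $b = Md/m$; with real direct estimates the points scatter around the line and the fitted values act as smoothed variance estimates.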

2.2. LGVF

Zhang et al. (2019) introduced the Longitudinal Generalized Variance Function (LGVF), which incorporates longitudinal information and accounts for time effects. Let $V$ denote the number of variables in the grouping, and $τ$ represent the number of time periods considered, resulting in $V \cdot τ$ observations for LGVF. The subsequent discussion pertains to $t = 1, 2, \dots, τ$ and $υ = 1, 2, \dots, V$. To capture the time effect, we define

$e_{t} = M_{t} / \bar{M}, \quad \text{where } \bar{M} = (M_{1} + M_{2} + \dots + M_{τ}) / τ,$ (3)

where $M_{t}$ denotes the population size at time $t$ .
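A small sketch of Equation (3), using the three population totals reported for the simulation in Section 4. The function name `time_effects` is an illustrative assumption.

```python
def time_effects(pop_sizes):
    """Equation (3): e_t = M_t / M_bar, where M_bar is the mean of the M_t."""
    m_bar = sum(pop_sizes) / len(pop_sizes)
    return [m_t / m_bar for m_t in pop_sizes]


# Population totals M_1, M_2, M_3 as listed in Section 4.
pop_sizes = [1_978_390, 1_977_807, 1_946_264]
e = time_effects(pop_sizes)
print([round(e_t, 4) for e_t in e])
```

By construction the $e_{t}$ average to one, so they rescale each period relative to the mean population size rather than changing the overall level.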

Let ${\hat{T}}_{t υ}$ and ${\hat{p}}_{t υ}$ denote the totals and proportions of interest for variable $υ$ at time $t$. Let $var ({\hat{p}}_{t υ})$ and $ϑ_{t υ}$ denote the variance and the direct estimator of the relative variance of ${\hat{p}}_{t υ}$ based on the sampling design. Let $d_{t υ}$ be the deff of variable $υ$ at time $t$, $a_{t υ} = - d_{t υ} / m$, $b_{t υ} = \bar{M} d_{t υ} / m$, and assume $a_{t υ} = a_{t} = a$ and $b_{t υ} = b_{t} = b$. For estimating the relative variance, Zhang et al. (2019) proposed the LGVF model,

$E (ϑ_{t υ}) = a + b \cdot \frac{e_{t}}{T_{t υ}} .$

The relative variance predicted by LGVF is

${\hat{ϑ}}_{t υ} = \hat{a} + \hat{b} \cdot \frac{e_{t}}{{\hat{T}}_{t υ}} .$ (4)

Under certain conditions, the ratios of relative variances and predicted relative variances from LGVF converge in probability to 1. The study by Zhang et al. (2019) demonstrated that LGVF was more efficient in reducing mean squared prediction errors (MSPE) compared to GVF.

3. The Longitudinal Adjusted Design Effects Models

In this section, we begin by introducing notation that closely adheres to conventions set by Valliant (1987) and Zhang et al. (2019). Subsequently, we present the Longitudinal Adjusted Design Effect models (LADEs), seamlessly integrating both the design effect and time effect. Finally, we explore the properties exhibited by specific types of estimators within the framework of LADE.

3.1. Notation

In a stratified two-stage cluster sampling design, various indices are employed to represent distinct levels within the sampling process. Specifically, $h$ serves as the stratum index, $i$ as the primary sampling unit (PSU) index, and $j$ as the secondary sampling unit (SSU) index, representing elements within the PSU. For instance, consider neighborhoods within a county as strata. In the first stage, households (PSUs) within the neighborhood are randomly selected, and in the second stage, some of the individuals (SSUs) residing in the chosen households are randomly interviewed. This approach facilitates a more manageable data collection process and ensures an accurate representation of diverse areas within the broader population.

At the PSU level, we denote $N_{t}$ as the total number of PSUs in the population at time $t$, and $N_{th}$ as the number of PSUs in stratum $h$ at time $t$. Hence, $N_{t} = \sum_{h = 1}^{H} N_{th}$. At the SSU level, we use $M_{thi}$ to represent the number of SSUs in PSU $i$ within stratum $h$ at time $t$. Therefore, the total number of elements in stratum $h$ at time $t$ is $M_{th} = \sum_{i = 1}^{N_{th}} M_{thi}$, and the total number of SSUs in the population at time $t$ is $M_{t} = \sum_{h = 1}^{H} M_{th}$.

Moving to the sample level, we denote $n_{th}$ as the number of PSUs in the sample within stratum $h$ , and $n_{t}$ as the number of PSUs in the sample at time $t$ . Hence, the total number of PSUs in the sample at time $t$ is given by $n_{t} = \sum_{h = 1}^{H} n_{th}$ . We assume that the sample size remains constant over time, that is, $n_{t} = n$ for $t = 1, 2, \dots, τ$ . The assumption of a constant sample size over time is a convenient simplification for theoretical considerations. In practice, researchers should be aware of the real-world complexities introduced by non-response and attrition. Furthermore, we use $m_{thi}$ to represent the number of elements in the sample from the ith PSU within stratum $h$ . Therefore, the total number of elements in the sample within stratum $h$ at time $t$ is denoted by $m_{th} = \sum_{i = 1}^{n_{th}} m_{thi}$ . Consequently, the total number of elements in the sample over all strata at time $t$ is denoted by $m_{t} = \sum_{h = 1}^{H} m_{th}$ .

At time $t$, we define $S_{th}$ as the set of sampled PSUs in stratum $h$, $R_{th}$ as the set of nonsampled PSUs in stratum $h$, $S_{thi}$ as the set of sampled elements within PSU $i$ in stratum $h$, and $R_{thi}$ as the set of nonsampled elements within PSU $i$ in stratum $h$. Assuming that a random variable $y_{t υ hij}$ is associated with the $j$th unit within PSU $i$ in stratum $h$ at time $t$ for variable $υ$, we can define a general estimator of $T_{t υ}$ as follows:

${\hat{T}}_{t υ} = \sum_{h} \sum_{i \in S_{th}} γ_{thi} {\hat{T}}_{t υ hi} .$ (5)

Here, $γ_{thi}$ is a coefficient that depends on the specific survey design. Define ${\bar{y}}_{t υ hi} = \sum_{j \in S_{thi}} y_{t υ hij} / m_{thi}$ , so that ${\hat{T}}_{t υ hi} = M_{thi} {\bar{y}}_{t υ hi}$ .

For example, the Horvitz-Thompson estimator, which is commonly used when PSUs are selected with probabilities proportional to $M_{thi}$ and an equal probability sample is selected within each sampled PSU at time $t$ , can be written as:

${\hat{T}}_{t υ, HT} = \sum_{h} \sum_{i \in S_{th}} [M_{th} {\hat{T}}_{t υ hi} {(n_{th} M_{thi})}^{- 1}],$ (6)

where $γ_{thi} = M_{th} (n_{th} M_{thi})^{- 1}$ .
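The estimator in Equation (6) can be sketched for a single time point and variable as follows. The nested-dictionary layout, the function name `ht_total`, and the toy sample are hypothetical choices made for illustration; only the formula itself comes from the text.

```python
def ht_total(strata):
    """Equation (6): sum over strata and sampled PSUs of gamma_hi * T_hi,
    with gamma_hi = M_h / (n_h * M_hi) and T_hi = M_hi * ybar_hi."""
    total = 0.0
    for stratum in strata:
        M_h = stratum["M_h"]          # number of elements in the stratum
        psus = stratum["psus"]        # sampled PSUs in the stratum
        n_h = len(psus)
        for psu in psus:
            y_bar = sum(psu["y"]) / len(psu["y"])   # sample mean within the PSU
            T_hi = psu["M_hi"] * y_bar              # estimated PSU total
            gamma = M_h / (n_h * psu["M_hi"])
            total += gamma * T_hi
    return total


# Toy sample (hypothetical): two strata, two sampled PSUs each, binary y.
strata = [
    {"M_h": 1000, "psus": [{"M_hi": 10, "y": [1, 0, 0, 1]},
                           {"M_hi": 20, "y": [0, 0, 1, 0]}]},
    {"M_h": 500, "psus": [{"M_hi": 8, "y": [1, 1, 0, 1]},
                          {"M_hi": 12, "y": [0, 0, 0, 0]}]},
]
T_hat = ht_total(strata)
print(T_hat)
```

With this choice of $γ_{thi}$ the PSU sizes cancel, so each stratum contributes $M_{th}$ times the average of its sampled PSU means.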

For prediction purposes, the following model assumptions apply:

$E (y_{t υ hij}) = μ_{t υ h}, \quad cov (y_{t υ hij}, y_{t υ h' i' j'}) = \begin{cases} σ_{t υ hi}^{2} & \text{if } h = h', i = i', j = j' \\ ρ_{t υ hi} σ_{t υ hi}^{2} & \text{if } h = h', i = i', j \neq j' \\ 0 & \text{otherwise,} \end{cases}$ (7)

where, within PSU $i$ in stratum $h$ at time $t$, $σ_{t υ hi}^{2}$ is the variance of the measurements and $ρ_{t υ hi}$ is the intra-class correlation coefficient (ICC) for variable $υ$. The ICC measures the degree of correlation among observations within a cluster, indicating the level of homogeneity within clusters. A higher intra-class correlation implies greater similarity among observations within a cluster.

Similar formulations can be found in the works of Scott and Smith (1969), Royall (1976, 1986), and Burdick and Sielken (1979). A general direct variance estimator for $var ({\hat{T}}_{t υ})$ used in this research is provided by Royall (1986):

$s_{{\hat{T}}_{t υ}}^{2} = \sum_{h} n_{th} (n_{th} - 1)^{- 1} \sum_{S_{th}} γ_{thi}^{2} r_{t υ hi}^{2},$ (8)

where $r_{t υ hi} = {\hat{T}}_{t υ hi} - (\sum_{j \in S_{th}} γ_{thj} {\hat{T}}_{t υ hj} / M_{th}) M_{thi}$, and $γ_{thi}$ is defined in Equation (6). The direct estimator of the relative variance is $ϑ_{t υ} = s_{{\hat{T}}_{t υ}}^{2} / {\hat{T}}_{t υ}^{2}$.
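A minimal sketch of the direct variance estimator in Equation (8) and the resulting direct relative variance, for one time point and one variable. It assumes the Horvitz-Thompson coefficients $γ_{thi} = M_{th} / (n_{th} M_{thi})$; the data layout, the function name `direct_relvar`, and the toy sample are illustrative assumptions.

```python
def direct_relvar(strata):
    """Return (T_hat, s2, relvar) using Equations (6) and (8)."""
    T_hat = 0.0
    s2 = 0.0
    for stratum in strata:
        M_h = stratum["M_h"]
        psus = stratum["psus"]
        n_h = len(psus)
        gammas = [M_h / (n_h * p["M_hi"]) for p in psus]
        T_hi = [p["M_hi"] * sum(p["y"]) / len(p["y"]) for p in psus]
        stratum_total = sum(g * t for g, t in zip(gammas, T_hi))
        # Residual r_hi: PSU total minus (estimated stratum mean) * PSU size.
        resid = [t - (stratum_total / M_h) * p["M_hi"]
                 for t, p in zip(T_hi, psus)]
        s2 += n_h / (n_h - 1) * sum((g * r) ** 2 for g, r in zip(gammas, resid))
        T_hat += stratum_total
    return T_hat, s2, s2 / T_hat ** 2


# Toy sample (hypothetical): two strata, two sampled PSUs each, binary y.
strata = [
    {"M_h": 1000, "psus": [{"M_hi": 10, "y": [1, 0, 0, 1]},
                           {"M_hi": 20, "y": [0, 0, 1, 0]}]},
    {"M_h": 500, "psus": [{"M_hi": 8, "y": [1, 1, 0, 1]},
                          {"M_hi": 12, "y": [0, 0, 0, 0]}]},
]
T_hat, s2, relvar = direct_relvar(strata)
print(T_hat, s2, relvar)
```

Each stratum needs at least two sampled PSUs, since the factor $n_{th} (n_{th} - 1)^{- 1}$ is undefined otherwise.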

Let $k_{t υ hi} = [1 + (m_{thi} - 1) ρ_{t υ hi}] / m_{thi}$, and let $α_{1 t υ h}$ and $α_{2 t υ h}$ be constants. Assume that

$σ_{t υ hi}^{2} = σ_{t υ h}^{2} = α_{1 t υ h} μ_{t υ h} + α_{2 t υ h} μ_{t υ h}^{2} .$ (9)

For example, for a binary variable, $α_{1 h} = 1$ and $α_{2 h} = - 1$ since $σ_{h}^{2} = μ_{h} (1 - μ_{h}) = μ_{h} - μ_{h}^{2}$. According to the definition, $relvar ({\hat{T}}_{t υ} - T_{t υ}) = var ({\hat{T}}_{t υ} - T_{t υ}) / [E ({\hat{T}}_{t υ})]^{2}$, where $var ({\hat{T}}_{t υ}) = var (\sum_{h} \sum_{i \in S_{th}} γ_{thi} {\hat{T}}_{t υ hi}) = var (\sum_{h} \sum_{i \in S_{th}} γ_{thi} M_{thi} {\bar{y}}_{t υ hi})$. By expanding the variance and utilizing the assumptions in Equation (7) and Equation (9), we can derive:

$\begin{matrix} relvar ({\hat{T}}_{t υ} - T_{t υ}) \approx \sum_{h} π_{t υ h}^{2} α_{2 t υ h} M_{th}^{- 2} \sum_{S_{th}} γ_{thi}^{2} k_{t υ hi} M_{thi}^{2} \\ + [\sum_{h} π_{t υ h} α_{1 t υ h} M_{th}^{- 2} \sum_{S_{th}} γ_{thi}^{2} k_{t υ hi} M_{thi}^{2}] / E ({\hat{T}}_{t υ}) \\ = a_{t} + b_{t} / E ({\hat{T}}_{t υ}), \end{matrix}$ (10)

where $π_{t υ h} = E ({\hat{T}}_{t υ h}) / E ({\hat{T}}_{t υ})$ , and ${\hat{T}}_{t υ h} = \sum_{i = 1}^{N_{h}} {\hat{T}}_{t υ hi}$ .
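The factor $k_{t υ hi}$ appearing in Equation (10) is what the covariance structure in Equation (7) implies for the variance of a PSU sample mean: $var ({\bar{y}}_{t υ hi}) = σ_{t υ hi}^{2} k_{t υ hi}$. The short check below builds the within-PSU covariance matrix explicitly and compares the two sides; the function names and the numeric values are illustrative assumptions.

```python
def var_of_psu_mean(sigma2, rho, m):
    """Variance of the mean of m elements in one PSU, computed from the
    full covariance matrix of Equation (7)."""
    cov = [[sigma2 if j == l else rho * sigma2 for l in range(m)]
           for j in range(m)]
    return sum(sum(row) for row in cov) / m ** 2


def k_factor(rho, m):
    """k = [1 + (m - 1) * rho] / m."""
    return (1 + (m - 1) * rho) / m


# Hypothetical values: a binary variable with mean 0.3 has variance
# 0.3 - 0.3^2 = 0.21, in line with Equation (9) when alpha_1 = 1, alpha_2 = -1.
sigma2, rho, m = 0.21, 0.05, 4
lhs = var_of_psu_mean(sigma2, rho, m)
rhs = sigma2 * k_factor(rho, m)
print(lhs, rhs)
```

When $ρ = 0$ the factor reduces to $1 / m$, the usual variance reduction for independent observations; when $ρ = 1$ it equals one, so additional elements from the same PSU add no information.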

3.2. The LADE Models

Recall from Equation (1) that $relvar ({\hat{p}}_{υ}) = - d_{υ} / m + M d_{υ} / [m E ({\hat{T}}_{υ})]$. If we divide both sides by $d_{υ}$, the ratio $relvar ({\hat{p}}_{υ}) / d_{υ}$ exhibits a perfect linear relationship with $1 / E ({\hat{T}}_{υ})$. Figure 1 depicts the plots of the direct estimators of the relative variance $ϑ_{2010 υ}$ (plot (a)) and the adjusted relative variance $ϑ_{2010 υ} / {\hat{d}}_{2010 υ}$ (plot (b)) for eighteen variables against the reciprocal of the estimated total from a single simulation run using March 2010 CPS Annual Social and Economic Supplement (ASEC) data. For detailed information regarding the simulation setup, please consult Section 4. Notably, plot (b) clearly illustrates a perfect linear relationship between $ϑ_{2010 υ} / {\hat{d}}_{2010 υ}$ and $1 / {\hat{T}}_{2010 υ}$, as expected.
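The linearity can be checked directly from the definitions in Section 2.1. As a worked step, assuming $E ({\hat{p}}_{υ}) = p_{υ}$ and $E ({\hat{T}}_{υ}) = M p_{υ}$,

$\frac{relvar ({\hat{p}}_{υ})}{d_{υ}} = \frac{p_{υ} (1 - p_{υ})}{m p_{υ}^{2}} = \frac{1 - p_{υ}}{m p_{υ}} = - \frac{1}{m} + \frac{1}{m p_{υ}} = - \frac{1}{m} + \frac{M}{m} \cdot \frac{1}{E ({\hat{T}}_{υ})} ,$

so after division by the design effect the intercept $- 1 / m$ and the slope $M / m$ no longer depend on the variable.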

Figure 1.

Direct estimates of relative variance (plot (a)) and adjusted direct estimates of relative variance (plot (b)) versus reciprocal of estimated total.

Motivated by this observation, we propose the Longitudinal Adjusted Design Effect (LADE) model, which adjusts the relative variance (relvar) by the design effect (deff) and incorporates time effect. To get an overview of the deffs of different variables, Figure 2 displays the design effects of the eighteen variables obtained from the mean of the five hundred simulation runs using the 2008 to 2010 ASEC data. It is observed that the design effects for the variables remain relatively stable over the three-year period. However, it is worth noting that the magnitudes of the design effects vary across the different variables.

Figure 2.

Design effects of the eighteen variables for the 2008 to 2010 ASEC data from five hundred simulation runs. The x-axis is the index of the variables and the y-axis is the design effect of each variable from the five hundred simulation runs. For detailed information regarding the simulation setup, please consult Section 4.

To incorporate the design effects into the analysis, we adjust the relative variance by taking the mean of the design effects for each year across the eighteen variables. Let $θ = (a, b)'$ be the regression parameters. Let $a_{t υ} = - d_{t υ} / m_{t}$, $b_{t υ} = \bar{M} d_{t υ} / m_{t}$, and $e_{t}$ be defined as in Equation (3). Using Equation (2), we can express the estimated variance of ${\hat{T}}_{t υ}$ as:

$\begin{matrix} \hat{var} ({\hat{T}}_{t υ}) = \frac{- {\hat{d}}_{t υ}}{m_{t}} {\hat{T}}_{t υ}^{2} + \frac{M_{t} {\hat{d}}_{t υ}}{m_{t}} {\hat{T}}_{t υ} \\ = \frac{- {\hat{d}}_{t υ}}{{\hat{\bar{d}}}_{t}} \frac{{\hat{\bar{d}}}_{t}}{m_{t}} {\hat{T}}_{t υ}^{2} + \frac{M_{t}}{\bar{M}} \frac{{\hat{d}}_{t υ}}{{\hat{\bar{d}}}_{t}} \frac{\bar{M} {\hat{\bar{d}}}_{t}}{m_{t}} {\hat{T}}_{t υ} \\ = {\hat{f}}_{t υ} {\hat{a}}_{t υ}^{*} {\hat{T}}_{t υ}^{2} + e_{t} {\hat{f}}_{t υ} {\hat{b}}_{t υ}^{*} {\hat{T}}_{t υ} \end{matrix}$

where $f_{t υ} = d_{t υ} / {\bar{d}}_{t}$, ${\bar{d}}_{t} = \sum_{υ = 1}^{V} d_{t υ} / V$, $a_{t υ}^{*} = - {\bar{d}}_{t} / m_{t}$, and $b_{t υ}^{*} = \bar{M} {\bar{d}}_{t} / m_{t}$; ${\hat{f}}_{t υ}$, ${\hat{\bar{d}}}_{t}$, ${\hat{a}}_{t υ}^{*}$, and ${\hat{b}}_{t υ}^{*}$ are estimators of $f_{t υ}$, ${\bar{d}}_{t}$, $a_{t υ}^{*}$, and $b_{t υ}^{*}$, respectively. Therefore, $\hat{var} ({\hat{T}}_{t υ}) / ({\hat{T}}_{t υ}^{2} {\hat{f}}_{t υ}) = {\hat{a}}_{t υ}^{*} + {\hat{b}}_{t υ}^{*} e_{t} / {\hat{T}}_{t υ}$. Define a set of adjusted estimators of relvars $ϑ_{t υ}^{*}$ as $ϑ_{t υ}^{*} = ϑ_{t υ} / {\hat{f}}_{t υ}$ for $t = 1, 2, \dots, τ$ and $υ = 1, 2, \dots, V$, such that

$E (ϑ_{t υ}^{*}) = a_{t υ}^{*} + b_{t υ}^{*} \cdot \frac{e_{t}}{T_{t υ}} .$ (11)

Recall that $ϑ_{t υ}$ denotes the direct estimator of the relative variance of ${\hat{p}}_{t υ}$ based on the sampling design, and that under the LGVF model $E (ϑ_{t υ}) = a_{0} + b_{0} e_{t} / T_{t υ}$.

Let $ϑ_{t}^{*} = (ϑ_{t 1}^{*}, ϑ_{t 2}^{*}, \dots, ϑ_{tV}^{*})'$, $ϑ^{*} = (ϑ_{1}^{*'}, ϑ_{2}^{*'}, \dots, ϑ_{τ}^{*'})'$, $ϵ_{t} = (ϵ_{t 1}, ϵ_{t 2}, \dots, ϵ_{tV})'$, and $ϵ = (ϵ_{1}', ϵ_{2}', \dots, ϵ_{τ}')'$. Define $X_{t}$ as the $V \times 2$ design matrix for time $t$, with the first column consisting of ones and the second column being $(e_{t} / {\hat{T}}_{t 1}, e_{t} / {\hat{T}}_{t 2}, \dots, e_{t} / {\hat{T}}_{tV})'$. Let $X$ be the design matrix defined as $X = (X'_{1}, \dots, X'_{τ})'$ and let $θ = (a, b)'$. Under the assumption that $a_{t υ}^{*} = a_{t}^{*} = a$ and $b_{t υ}^{*} = b_{t}^{*} = b$, the LADE model in Equation (11) can be written in matrix form as:

$ϑ^{*} = X θ + ϵ .$ (12)

The weighted least squares (WLS) estimators of $θ$ are given by $\hat{θ} = (X' WX)^{- 1} X' W ϑ^{*}$, where $w_{t υ}$ is a weight associated with variable $υ$ at time $t$, and $W$ is a $V τ \times V τ$ diagonal matrix with diagonal elements $w_{t υ}$. Typically, $w_{t υ}$ is chosen as the reciprocal of the variance of $ϑ_{t υ}^{*}$ when known. Alternatively, if the variances are unknown, we can approximate the weights by the reciprocal of the squared $ϑ_{t υ}^{*}$. Following the work of Zhang et al. (2019), the estimators of $a$ and $b$ are given by,

$\hat{b} = \frac{\sum_{t = 1}^{τ} \sum_{υ = 1}^{V} ϑ_{t υ}^{*} [e_{t} {\hat{T}}_{t υ}^{- 1} - {\bar{T}}_{-}] / w_{t υ}}{\sum_{t = 1}^{τ} \sum_{υ = 1}^{V} {[e_{t} {\hat{T}}_{t υ}^{- 1} - {\bar{T}}_{-}]}^{2} / w_{t υ}} = {\hat{S}}_{1} / {\hat{S}}_{2}$ (13)

and

$\hat{a} = {\bar{ϑ}}^{*} - \hat{b} {\bar{T}}_{-}$ (14)

where ${\bar{T}}_{-} = \sum_{t, υ} (e_{t}^{- 1} {\hat{T}}_{t υ} w_{t υ})^{- 1} / \sum_{t, υ} w_{t υ}^{- 1}$ , ${\bar{ϑ}}^{*} = \sum_{t, υ} ϑ_{t υ}^{*} w_{t υ}^{- 1} / \sum_{t, υ} w_{t υ}^{- 1}$ , ${\hat{S}}_{1} = \sum_{t = 1}^{τ} \sum_{υ = 1}^{V} ϑ_{t υ}^{*} [e_{t} {\hat{T}}_{t υ}^{- 1} - {\bar{T}}_{-}] / w_{t υ}$ , and ${\hat{S}}_{2} = \sum_{t = 1}^{τ} \sum_{υ = 1}^{V} [e_{t} {\hat{T}}_{t υ}^{- 1} - {\bar{T}}_{-}]^{2} / w_{t υ}$ . The predicted adjusted relvar of ${\hat{T}}_{t υ}$ based on the estimated LADE model is

${\hat{ϑ}}_{t υ}^{*} = {\bar{ϑ}}^{*} + \hat{b} [e_{t} {\hat{T}}_{t υ}^{- 1} - {\bar{T}}_{-}], \quad \text{and} \quad {\hat{ϑ}}_{t υ} = {\hat{ϑ}}_{t υ}^{*} {\hat{f}}_{t υ} .$ (15)
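The LADE steps can be sketched end to end: compute the deff adjustment ${\hat{f}}_{t υ}$, divide the direct relvars by it, fit the line in Equation (12), and multiply the fitted values back by ${\hat{f}}_{t υ}$ as in Equation (15). In this sketch the weights multiply the squared residuals, which corresponds to the matrix form $\hat{θ} = (X' WX)^{- 1} X' W ϑ^{*}$ with diagonal $W$; unit weights give the OLS fit. The function name `fit_lade`, the list-of-lists layout (time by variable), and the toy inputs are assumptions made for illustration.

```python
def fit_lade(relvar, deff, totals, e, weights=None):
    """Fit the adjusted model and return (a_hat, b_hat, fitted relvars)."""
    n_t, n_v = len(relvar), len(relvar[0])
    # Deff adjustment f = d / d_bar_t, with d_bar_t the mean deff at time t.
    d_bar = [sum(deff[t]) / n_v for t in range(n_t)]
    f = [[deff[t][v] / d_bar[t] for v in range(n_v)] for t in range(n_t)]

    x, y, w = [], [], []
    for t in range(n_t):
        for v in range(n_v):
            x.append(e[t] / totals[t][v])        # regressor e_t / T_hat
            y.append(relvar[t][v] / f[t][v])     # adjusted relvar
            w.append(1.0 if weights is None else weights[t][v])

    # Closed-form weighted least squares for a straight line.
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    b_hat = sxy / sxx
    a_hat = y_bar - b_hat * x_bar

    # Equation (15): fitted adjusted relvar times the deff adjustment.
    fitted = [[(a_hat + b_hat * e[t] / totals[t][v]) * f[t][v]
               for v in range(n_v)] for t in range(n_t)]
    return a_hat, b_hat, fitted


# Toy inputs (hypothetical): two periods, three variables with different deffs.
m, M_bar = 400, 2_000_000
e = [1.01, 0.99]
deff = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
totals = [[100_000.0, 300_000.0, 600_000.0], [120_000.0, 280_000.0, 650_000.0]]
relvar = [[deff[t][v] * (-1 / m + (M_bar / m) * e[t] / totals[t][v])
           for v in range(3)] for t in range(2)]

a_hat, b_hat, fitted = fit_lade(relvar, deff, totals, e)
print(a_hat, b_hat)
```

In this toy setup the unadjusted relvars do not share a common line because the design effects differ, but the adjusted relvars do, so the fit recovers $a = - {\bar{d}}_{t} / m$ and $b = \bar{M} {\bar{d}}_{t} / m$ and the back-transformed fitted values reproduce the inputs.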

3.3. Properties of Proposed Estimators

In this section, we examine a specific class of estimators defined by $γ_{thi} = g_{1 th} g_{2 thi}$, the product of a nonrandom factor that depends solely on the stratum $(g_{1 th})$ and another nonrandom factor that depends on both the stratum and the cluster $(g_{2 thi})$. For instance, when PSUs are selected with probabilities proportional to PSU size $M_{thi}$, and an equal probability sample is chosen within each sampled PSU at time $t$, we have $g_{1 th} = M_{th} / n_{th}$ and $g_{2 thi} = 1 / M_{thi}$.

Given the assumptions in Equation (7) and the structure of estimators $γ_{thi} = g_{1 th} g_{2 thi}$ , we investigate the asymptotic properties of ${\hat{T}}_{t υ}$ , $s_{{\hat{T}}_{t υ}}^{2}$ , and ${\hat{ϑ}}_{t υ}$ when the number of PSUs in each stratum is large. Lemmas 1 to 3 (see Appendix) extend the work by Royall (1986). Under certain conditions, Theorem 1 demonstrates that the ratios of relative variances and predicted relative variances from the proposed LADEs converge in probability to 1. The proof of Theorem 1 is provided in the Appendix. The asymptotic normality then follows.

Theorem 1. Under the model defined by Equation (7) with variance specification in Equation (9), the assumptions (i) to (xiv) listed in the Appendix, and $μ_{4 t υ hi} = E [{\hat{T}}_{t υ hi} - E ({\hat{T}}_{t υ hi})]^{4} < \infty,$ $a_{t υ}^{*} = a_{t}^{*} = a$ and $b_{t υ}^{*} = b_{t}^{*} = b$ for $t = 1, 2, \dots, τ$ , $υ = 1, 2, \dots, V$ , as $N_{th}, n_{th} \to \infty$ ,

$\frac{relvar ({\hat{T}}_{t υ} - T_{t υ})}{{\hat{ϑ}}_{t υ}} \to^{p} 1 .$

Proof. The proof is given in the Appendix. □

Theorem 2. Under the model defined by Equation (7) with variance specification in Equation (9), the assumptions (i) to (xiv) listed in the Appendix, and $μ_{4 t υ hi} = E [{\hat{T}}_{t υ hi} - E ({\hat{T}}_{t υ hi})]^{4} < \infty,$ $a_{t υ}^{*} = a_{t}^{*} = a$ and $b_{t υ}^{*} = b_{t}^{*} = b$ for $t = 1, 2, \dots, τ$ , $υ = 1, 2, \dots, V$ , as $N_{th}, n_{th} \to \infty$ ,

$\frac{{\hat{T}}_{t υ} - T_{t υ}}{{\hat{T}}_{t υ} {({\hat{ϑ}}_{t υ})}^{1 / 2}} \to^{d} N (0, 1) .$

Proof. The proof is a straightforward extension of the work by Royall (1986). □

Theorem 2 enables the construction of confidence intervals using the normal distribution.
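A minimal sketch of such an interval, assuming the normal approximation of Theorem 2 and an estimated standard error of ${\hat{T}}_{t υ} ({\hat{ϑ}}_{t υ})^{1 / 2}$. The function name `lade_confidence_interval` and the input values are hypothetical.

```python
from statistics import NormalDist


def lade_confidence_interval(t_hat, relvar_hat, level=0.95):
    """Normal-theory interval: T_hat +/- z * T_hat * sqrt(relvar_hat)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * t_hat * relvar_hat ** 0.5
    return t_hat - half_width, t_hat + half_width


# Hypothetical inputs: an estimated total of 50,000 with predicted relvar 0.01,
# that is, a coefficient of variation of 10 percent.
lower, upper = lade_confidence_interval(50_000.0, 0.01)
print(lower, upper)
```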

4. Simulation Studies

This section presents a simulation study that compares the performance of two types of estimators: LGVFs and LADEs. The study utilizes data from the CPS Annual Social and Economic Supplement (ASEC), specifically focusing on the data restricted to New Mexico for the years 2008 to 2011.

The analysis considers eighteen binary variables derived from the “Source of Income” section of the ASEC data. These variables capture characteristics such as self-employment, unemployment compensation, and more. In this context, a binary variable takes a value of 1 if a person possesses a specific characteristic and 0 if they do not possess that characteristic. The available ASEC data includes 2,059 observations for 2008, 2,188 observations for 2009, 2,108 observations for 2010, and 1,975 observations for 2011. Each household in the data is associated with an ultimate sampling unit (USU) defined by the CPS. However, the specific USU information is not publicly released.

To replicate the sampling design, the households within each year were sorted in ascending order based on their sequence numbers. Then, four consecutive households were combined to form a PSU while maintaining the original order. As a result, there were 205, 208, 193, and 184 PSUs created for the years 2008, 2009, 2010, and 2011, respectively. On average, each of these PSUs accommodates about ten individuals across the four households. Consequently, for example, the total number of individuals in 2008 can be calculated as the number of PSUs (205) multiplied by the average number of individuals residing in the four households, resulting in approximately 2,059 observations. The simulation study is conducted using data from March 2008 to March 2010 for bias square and empirical mean squared error (EMSE) comparisons, and data from March 2011 for mean squared prediction error (MSPE) comparisons. The simulation steps are as follows:

(a) Within each year (2008, 2009, and 2010), a sample of $n = 100$ PSUs is selected using probabilities proportional to size (PPS). Within each selected PSU, $m_{i} = 4$ individuals are randomly selected without replacement.

(b) Estimates ${\hat{T}}_{t υ}$ for each sample year $t$ and variable $υ$ are calculated using the Horvitz-Thompson estimator (Equation (6)), and the estimated variance $s_{{\hat{T}}_{t υ}}^{2}$ is calculated using Equation (8). The direct estimator of relative variance is then calculated as $ϑ_{t υ} = s_{{\hat{T}}_{t υ}}^{2} / ({\hat{T}}_{t υ})^{2}$ .

(c) The time adjustment $e_{t}$ is applied to the estimates ${\hat{T}}_{t υ}$ by dividing $e_{t}$ by each estimate, that is, by forming $e_{t} / {\hat{T}}_{t υ}$.

Here, $e_{1} = M_{1} / \bar{M} = 1{,}978{,}390 / 1{,}967{,}487 = 1.0056$ (for year 2009); $e_{2} = M_{2} / \bar{M} = 1{,}977{,}807 / 1{,}967{,}487 = 1.0052$ (for year 2010); and $e_{3} = M_{3} / \bar{M} = 1{,}946{,}264 / 1{,}967{,}487 = 0.9892$ (for year 2008).

(d) Deff adjustment ${\hat{f}}_{t υ} = {\hat{d}}_{t υ} / \hat{\bar{d_{t}}}$ is applied to the relvar estimates $ϑ_{t υ}$ , so that $ϑ_{t υ}^{*} = ϑ_{t υ} / {\hat{f}}_{t υ} .$ Here, ${\hat{d}}_{t υ}$ is the design effect of each variable and sample, and $\hat{\bar{d_{t}}}$ is the average design effect calculated across all variables and simulation runs at time $t$ .

(e) Two regression models are fitted. Model (16) is used to fit LGVFs, and Model (17) is used to fit LADEs.

$ϑ_{t υ} = a_{0} + b_{0} \cdot \frac{e_{t}}{{\hat{T}}_{t υ}},$ (16)

and

$ϑ_{t υ}^{*} = a + b \cdot \frac{e_{t}}{{\hat{T}}_{t υ}} .$ (17)

Fitting methods include ordinary least squares (OLS) for LGVF1 and LADE1, and weighted least squares (WLS) with weights $w_{t υ} = 1 / (ϑ_{t υ})^{2}$ for LGVF2 and $w_{t υ} = 1 / (ϑ_{t υ}^{*})^{2}$ for LADE2.

(f) The direct estimates of relvar calculated by Equations (6) and (8) are recorded, as well as the fitted relvar values from LGVF1 and LGVF2 and the fitted values from LADE1 and LADE2 multiplied by the deff adjustment ${\hat{f}}_{t υ}$.

(g) The simulation process is repeated $R = 500$ times, following steps (a) to (f). During each iteration, the recorded relvars and fitted relvar values are collected. After the five hundred iterations, the average values of these recorded relvars and fitted relvar values are calculated. Subsequently, bias square, EMSE, and MSPE are computed using the provided formulas.

${Bias}^{2} = \sum_{t = 1}^{3} \sum_{υ = 1}^{18} ({Bias}_{t υ})^{2} / 54, \quad \text{with} \quad {Bias}_{t υ} = \frac{\sum_{r = 1}^{500} {\hat{ϑ}}_{t υ}^{(r)}}{R} - ϑ_{t υ},$ (18)

$EMSE = \sum_{t = 1}^{3} \sum_{υ = 1}^{18} {EMSE}_{t υ} / 54, \quad \text{with} \quad {EMSE}_{t υ} = \frac{\sum_{r = 1}^{500} {({\hat{ϑ}}_{t υ}^{(r)} - ϑ_{t υ})}^{2}}{R},$ (19)

and

$MSPE = \sum_{r = 1}^{500} {MSPE}^{(r)} / R, \quad \text{with} \quad {MSPE}^{(r)} = \frac{\sum_{υ = 1}^{18} {({\hat{ϑ}}_{(2011) υ}^{(r)} - ϑ_{(2011) υ})}^{2}}{18},$ (20)

where $(r)$ denotes the estimate from the rth simulation run.
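The three summary measures in Equations (18) to (20) can be sketched as follows. The array layouts (`fitted[r][t][v]` for run, time, and variable; `predicted[r][v]` for the held-out year), the function names, and the toy numbers are assumptions for illustration and are not the study's data.

```python
def squared_bias_and_emse(fitted, truth):
    """Equations (18) and (19), averaged over all (time, variable) cells."""
    R = len(fitted)
    cells = [(t, v) for t in range(len(truth)) for v in range(len(truth[0]))]
    bias2 = 0.0
    emse = 0.0
    for t, v in cells:
        mean_fit = sum(fitted[r][t][v] for r in range(R)) / R
        bias2 += (mean_fit - truth[t][v]) ** 2
        emse += sum((fitted[r][t][v] - truth[t][v]) ** 2 for r in range(R)) / R
    return bias2 / len(cells), emse / len(cells)


def mspe(predicted, truth_new):
    """Equation (20): per-run mean squared prediction error, averaged over runs."""
    R, V = len(predicted), len(truth_new)
    per_run = [sum((predicted[r][v] - truth_new[v]) ** 2 for v in range(V)) / V
               for r in range(R)]
    return sum(per_run) / R


# Toy example (hypothetical): two runs, one time point, two variables.
truth = [[0.10, 0.20]]
fitted = [[[0.12, 0.18]], [[0.08, 0.26]]]
bias2, emse = squared_bias_and_emse(fitted, truth)

predicted = [[0.10, 0.30], [0.20, 0.20]]
truth_new = [0.10, 0.20]
mspe_value = mspe(predicted, truth_new)
print(bias2, emse, mspe_value)
```

In the toy example the first variable is unbiased on average but still contributes to EMSE through its run-to-run variability, which is why the two measures are reported separately.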

The simulation results are summarized in Table 1, which compares the squared bias, EMSE, and MSPE values for the estimators LGVF1, LGVF2, LADE1, and LADE2. The results indicate that LADEs consistently outperform LGVFs across all metrics, showing lower squared bias, EMSE, and MSPE values. Notably, there is no discernible advantage in using the WLS estimators LGVF2 and LADE2 over the OLS estimators LGVF1 and LADE1. This absence of advantage can be attributed to the overall equal variance assumption of the regression model not being violated.

Table 1.

Simulation Results of a Comparison Between LGVFs and LADEs with $n_{t} = 100$ . LGVF1 and LADE1 Are with Ordinary Linear Regression Fit, and LGVF2 and LADE2 Are with Weighted Least Square Fit.

Metric	LGVF1	LGVF2	LADE1	LADE2
Bias$^{2}$	0.003250381	0.003314496	0.001336026	0.001165163
EMSE	0.012627135	0.017065547	0.007332843	0.008725467
MSPE	0.03385114	0.03887468	0.01813410	0.01679681

Figure 3 provides a graphical representation of the estimated relvars plotted against the logs of population totals. The solid line represents the direct estimates of relvar values calculated by Equations (6) and (8). The dashed and dotted lines represent the estimates from LGVF1 and LGVF2, while the dot-dashed line and the thickest line represent the estimates from LADE1 and LADE2. It is observed that the WLS estimators outperform the OLS estimators in estimating the relvars, especially when the population totals of the variables are large. Figure 3 suggests that heteroscedasticity is more pronounced in the range of large values. WLS is then able to better handle the heteroscedasticity, leading to more efficient and less biased estimates of the relative variance compared to OLS.

Figure 3.

Logs of estimates of relvar plotted versus logs of population totals.

Although LADE2 exhibits a slight improvement over LGVF2, the distinction between LGVFs and LADEs is not as evident in Figure 3 as it is in Table 1. This discrepancy can be attributed to the different perspectives and scales from which Table 1 and Figure 3 approach the data. Table 1 summarizes squared bias and mean squared errors of the relative variance, whereas Figure 3 plots the logarithm of the relative variance against log totals.

5. Conclusions

In this study, we extended the Longitudinal Generalized Variance Functions (LGVFs) to a novel model known as Longitudinal Adjusted Design Effect models (LADEs), incorporating both design effects and time effects into the modeling process. We demonstrated that, under certain conditions, the ratios of relative variances and predicted relative variances obtained from LADEs converge to 1. A comparative analysis between LGVFs and LADEs revealed that the latter outperforms in reducing bias, empirical mean squared error, and mean squared prediction errors. While our simulation study supports the superiority of LADE over LGVF, we acknowledge its limitations and encourage users to carefully select the model based on their specific context.

Our research has not specifically addressed survey nonresponse, but techniques such as multiple imputation and weighted adjustments for regression methods can be integrated into our framework. Notably, our analysis is currently confined to estimators derived from the LADE model without auxiliary variables in the regression model.

For future investigations, exploring mixed models and nonparametric smoothing methods for fitting regression models could be valuable. Additionally, a detailed examination of estimators in cases where variances within one PSU differ from another PSU could offer insightful perspectives.

Footnotes

The authors express sincere gratitude to the Editor, Associate Editor, and Referees for their meticulous, constructive, and insightful comments and recommendations. Their valuable feedback has undoubtedly contributed to enhancing the readability and overall quality of the paper.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Guoyi Zhang

Received: June 2023

Accepted: May 2024

References

Burdick, R. K., and R. L. Sielken Jr. 1979. “Variance Estimation Based on a Superpopulation Model in Two-Stage Sampling.” Journal of the American Statistical Association 74: 438–40. DOI: https://doi.org/10.1080/01621459.1979.10482533.

Burt, and S. B. Cohen. 1984. “A Comparison of Methods to Approximate Standard Errors for Complex Survey Data.” Review of Public Data Use 12: 159–68.

Cohen, S. B. 1979. “An Assessment of Curve Smoothing Strategies Which Yield Variance Estimates from Complex Survey.” In Proceedings of the Survey Research Methods Section of the American Statistical Association, Washington, DC.

Johnson, E. G., and B. F. King. 1986. “Generalized Variance Functions for a Complex Sample Survey.” Technical Report No. 87-72, Program Statistics Research.

Lohr. 2021. Sampling: Design and Analysis. 3rd ed. New York, NY: Chapman and Hall/CRC.

Rao, J. N. K. 1988. “Variance Estimation in Sample Surveys.” In Handbook of Statistics, Vol. 6, edited by P. R. Krishnaiah and C. R. Rao, 427–47. Amsterdam: Elsevier.

Rao, J. N. K., and C. F. J. Wu. 1988. “Resampling Inference with Complex Survey Data.” Journal of the American Statistical Association 83: 231–41. DOI: https://doi.org/10.1080/01621459.1988.10478591.

Royall, R. M. 1976. “The Linear Least Squares Prediction Approach to Two-Stage Sampling.” Journal of the American Statistical Association 71: 657–64. DOI: https://doi.org/10.1080/01621459.1976.10481542.

Royall, R. M. 1986. “The Prediction Approach to Robust Variance Estimation in Two-Stage Cluster Sampling.” Journal of the American Statistical Association 81: 119–23. DOI: https://doi.org/10.1080/01621459.1986.10478247.

Scott, A. J., and T. M. F. Smith. 1969. “Estimation in Multi-Stage Surveys.” Journal of the American Statistical Association 64: 830–40. DOI: https://doi.org/10.1080/01621459.1969.10501015.

U.S. Census Bureau. 2006. “Design and Methodology, Current Population Survey.” Technical Paper, U.S. Census Bureau, 66.

Valliant, R. L. 1987. “Generalized Variance Functions in Stratified Two-Stage Sampling.” Journal of the American Statistical Association 82: 499–508. DOI: https://doi.org/10.1080/01621459.1987.10478454.

Wolter, K. M. 2007. Introduction to Variance Estimation. 2nd ed. New York, NY: Springer-Verlag.

Zhang, and Cheng. 2019. “Generalized Variance Functions for Longitudinal Survey Data.” Statistical Theory and Related Fields 3: 50–157. DOI: https://doi.org/10.1080/24754269.2019.1664372