
Abstract

Propensity score matching is commonly used in observational studies to control for confounding and estimate the causal effects of a treatment or exposure. Frequently, in observational studies data are clustered, which adds to the complexity of using propensity score techniques. In this article, we give an overview of propensity score matching methods for clustered data, and highlight how propensity score matching can be used to account for not just measured confounders, but also unmeasured cluster level confounders. We also consider using machine learning methods such as generalized boosted models to estimate the propensity score and show that accounting for clustering when using these methods can greatly reduce the performance, particularly when there are a large number of clusters and a small number of subjects per cluster. In order to get around this we highlight scenarios where it may be possible to control for measured covariates using propensity score matching, while using fixed effects regression in the outcome model to control for cluster level covariates. Using simulation studies we compare the performance of different propensity score matching methods for clustered data across a number of different settings. Finally, as an illustrative example we apply propensity score matching methods for clustered data to study the causal effect of aspirin on hearing deterioration using data from the conservation of hearing study.

Keywords

Average treatment among the treated, clustered data, confounding, propensity score matching, propensity score estimation

1 Introduction

When data come from observational studies, estimating causal effects can be challenging due to potential confounding. In this context, statistical methods to control for potential sources of confounding are necessary in order to get consistent estimates of causal effects. One tool that is frequently used across a range of these statistical methods is the propensity score (PS).¹ In the case of a binary treatment or exposure, $Z$, and a set of covariates, $X$, the PS is defined as $P (Z = 1 | X)$. The PS is useful because of its property as a balancing score.¹ A balancing score, $b (X)$, is defined as any score such that $X ⊥ ⊥ Z | b (X)$, where $⊥ ⊥$ indicates that two random variables are independent of each other. Because of the PS’s balancing property, when the PS is consistently estimated, methods such as PS weighting or matching can be used to control for potential confounding due to $X$ and get consistent estimates of causal parameters. In this article, we will primarily focus on using PS matching for clustered data. This is motivated by a study on the causal effect of aspirin use on hearing deterioration using data from the Nurses Health Study II (NHS II) Conservation of Hearing Study (CHEARS), in which we expect data to be correlated for participants from the same testing site, as well as between the left and right ears for the same subject.

In this article, even though the observations are clustered, we will assume that both treatment and outcome occur at the individual level. This is not the same as cluster randomized trials, where treatment is assigned at the cluster level.² This means that the treatment effect will be measured at the individual, rather than cluster, level. In this setting the cluster can be viewed as a potential confounder which needs to be considered for matching or weighting. However, a cluster variable can often present specific challenges, which we will discuss in this article. Some of the challenges are due to the cluster variable typically being a categorical variable with a large number of categories and sometimes relatively few observations per category. The techniques in this article can also be useful for data sets that would not typically be thought of as having a hierarchical or clustered structure, if there is a confounder that is a categorical variable with a large number of categories and a potentially low number of observations per category.

Typically, PS matching matches a single treated subject to a single untreated subject with the closest possible PS, otherwise known as pair matching. Pair matching is used to estimate the average treatment effect (ATE) among the treated (ATT). Another matching method, known as full matching, allows for different numbers of treated and untreated subjects in each matched grouping as long as there is at least one treated and one untreated subject in each group, and works to minimize the overall difference in PS between matched subjects.^3,4 Full matching can be used to estimate either the ATT or the ATE.⁵ In this article, we focus on estimating the ATT.

One advantage to PS matching methods is that they are robust to certain mis-specifications of the PS model. There are certain mis-specifications of the PS model that will still lead to balancing scores.^6,7 These include omitting polynomial terms for univariate PS models or using the incorrect link function for any PS model that has a generalized linear model form. The balancing scores that result from these model mis-specifications can be used for matching in the same way as the PS. In this article, we show that the bias of ATT estimates based on PS matching when the PS model is mis-specified depends on the true outcome model as well as the true PS model. For instance, failing to include non-linear terms, such as polynomial or spline terms, or interaction terms in multivariate PS models may not lead to large bias if the true outcome model is a linear model without any non-linear or interaction terms.

Although parametric PS methods are robust to certain types of model mis-specification, recent non-parametric methods such as generalized boosted models (GBMs)^8–10 have been developed which make even fewer assumptions about the true PS model. These non-parametric methods generally assume that the data are independent. In this article, we investigate a number of ways to extend these non-parametric methods to clustered data including adding covariates for cluster membership in the non-parametric propensity model as in fixed effects (FEs) regression. Another option is to ignore clustering in PS estimation, and control for clustering in the outcome regression model; however, this requires additional assumptions which we will discuss in Section 2. There are matching methods that do not use the PS, including Mahalanobis distance matching (MDM) or coarsened exact matching (CEM). One advantage to these methods is that they do not require estimation of the PS model.¹¹ However, these methods may not be suitable for clustered data, or any data set where there is a categorical confounder with a large number of categories. This is because distance measures such as Mahalanobis distance cannot be meaningfully defined for categorical variables. Similarly, CEM with categorical variables may not be feasible when the number of observations per category is too small.

There have been a number of papers exploring PS methods, both matching and weighting, for clustered data.^12–16 Many of these papers look into using either FE or random effects (RE) to estimate the PS. The papers that study PS matching also investigate requiring or preferring each matched group to come from the same cluster.^15,16 One advantage to these methods is that they can control for confounding due to unmeasured cluster level covariates. In this article, we consider the scenario where there are cluster level unmeasured confounders. We investigate the methods mentioned above, while also considering the method where we ignore clustering in the PS estimation and account for it in the outcome model. We discuss the assumptions necessary for each of these PS matching methods to give consistent estimates of the ATT in Section 2.

Additionally, we consider a more complex set up in which there is a multi-level clustering structure for the outcome. This is inspired by the CHEARS data in which each individual may act as a cluster for their left and right ears, nested within a larger cluster of testing site.

The rest of the article is structured as follows: Section 2 describes PS matching methods for clustered data, Section 3 provides a comparison of new and existing methods using simulation studies, Section 4 provides an example of PS matching and regression techniques using data from the CHEARS study, and Section 5 concludes.

2 Estimation of causal effects for clustered data

This article focuses on identifying causal effects for a clustered or hierarchical data structure. We start with the case where each individual belongs to a single cluster. We denote the treatment/exposure for the $j$ th subject in the $i$ th cluster as $Z_{i j}$ . We denote the continuous outcome for the same subject as $Y_{i j}$ . In order to define the causal effects of interest we use the counterfactual framework with potential outcomes. The potential outcome for $Y_{i j}$ given $Z_{i j} = z$ is denoted as $Y_{i j}^{z}$ . Under the assumption of Stable Unit Treatment Value Assumption (SUTVA),¹⁷ at the individual level, $Y_{i j}^{1} - Y_{i j}^{0}$ is the causal effect of $Z_{i j}$ on $Y_{i j}$ . SUTVA includes the assumptions of no interference and no hidden treatments. No interference means that the outcome for each subject depends only on their own treatment, and not the treatment of any other subjects. No hidden treatments means that the potential outcome for a given treatment is the same regardless of how the treatment is administered. For certain types of clustered data the assumption of no interference may not be met, because we may expect that subjects will be affected by the treatment status of others within the same cluster. An example of this would be a study of a treatment aimed at giving children in an elementary school the tools to improve their math scores. If intervention is at the individual level, but students are clustered by school, then we might expect students who received the treatment to discuss some of what they learned with the students from the same school who did not receive the treatment. This would be a violation of the assumption of no interference. The appropriate approach to deal with potential interference for clustered data will depend on the specifics of each study and data set. 
Because it is only possible to observe either $Y_{i j}^{1}$ or $Y_{i j}^{0}$ for each subject, it is necessary to focus on estimating causal effects at a population or group level. At the population level we will focus on estimating the ATT, which is defined as $E (Y_{i j}^{1} - Y_{i j}^{0} | Z_{i j} = 1)$ . For full matching it is also possible to estimate the ATE which is defined as $E (Y_{i j}^{1} - Y_{i j}^{0})$ ; however, in simulations and data analysis we focus on the ATT. We consider potential confounding due to measured individual or cluster level variables denoted as $X_{i j}$ and unmeasured cluster level variables denoted as $U_{i}$ .
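The distinction between the ATT and the ATE can be made concrete with a small sketch. It simulates both potential outcomes for every subject, which is only possible in a simulation, and compares the two estimands with the naive difference in observed means. The confounder, the coefficients, and the sample size below are illustrative assumptions and are not taken from this article.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical single confounder and treatment assignment (illustrative values).
x = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-x))            # P(Z = 1 | X)
z = rng.binomial(1, p)

# Both potential outcomes; the individual effect Y^1 - Y^0 = 1 + 0.5 * x varies with x.
eps = rng.normal(size=n)
y0 = x + eps
y1 = y0 + 1.0 + 0.5 * x
y_obs = np.where(z == 1, y1, y0)        # consistency: Y = Y^z when Z = z

ate = np.mean(y1 - y0)                  # E(Y^1 - Y^0)
att = np.mean((y1 - y0)[z == 1])        # E(Y^1 - Y^0 | Z = 1)
naive = y_obs[z == 1].mean() - y_obs[z == 0].mean()  # confounded contrast

print(f"ATE = {ate:.3f}, ATT = {att:.3f}, naive difference = {naive:.3f}")
```

Because treated subjects tend to have larger values of the confounder in this sketch, the ATT exceeds the ATE, and the unadjusted difference in means overstates both.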

The methods we consider need additional assumptions in order to be able to estimate either the ATT or ATE. These are: consistency, $Y_{i j} = Y_{i j}^{z}$ when $Z_{i j} = z$; exchangeability, $\{Y_{i j}^{1}, Y_{i j}^{0}\} ⊥ ⊥ Z_{i j} | X_{i j}, U_{i}$; and positivity, $P (Z_{i j} = z | X_{i j}, U_{i}) > 0$ for $z = 0, 1$. For the rest of this article we assume that the causal relationships between $Y$, $Z$, $X$, and $U$ can be explained by one of the two directed acyclic graphs (DAGs) in Figure 1. In both DAG A and DAG B, $X$ represents measured confounders, which can be at the individual or cluster level, and $U$ represents unmeasured confounders at the cluster level. The one difference between DAG A and DAG B is that the direct relationship between $X$ and $U$ in DAG B is not in DAG A. The association between $X$ and $U$ will lead to bias for certain methods. It is important to note that for the methods discussed in this article the direction of the arrow between $X$ and $U$ does not matter.

Figure 1.

Two DAGs indicating potential causal relationships between outcome $(Y)$ , treatment $(Z)$ , measured confounders $(X)$ , and unmeasured cluster level confounders $(U)$ .

2.1 PS models for clustered data

2.1.1 FEs and REs for PS models

The most common method for estimating the PS is logistic regression, and when we have clustered data it is possible to include a cluster level FE or RE to account for any unmeasured cluster level covariates. The FE model can be defined as $P (Z_{i j} = 1 | X_{i j}, U_{i}) = expit (X_{i j}^{T} β + \sum_{n = 1}^{N} ξ_{n} I (i = n))$, where $N$ is the number of clusters, $ξ_{n}$ represents the cluster effect for the $n$th cluster, and superscript $T$ represents the transpose of a vector or matrix. Likewise, the RE model can be defined as $P (Z_{i j} = 1 | X_{i j}, U_{i}) = expit (β_{0} + X_{i j}^{T} β + η_{i})$, where $η_{i} \sim N (0, σ_{η}^{2})$, and $η_{i}$ can be used to account for any confounding due to unmeasured cluster level confounders, $U_{i}$. If the PS is estimated using an FE or RE method that accounts for clustering, the ATT can be estimated using standard PS matching techniques. For pair matching, this can be as simple as a paired t-test or including both treatment and matched grouping as covariates in a linear model. There has been discussion about the contexts in which FE or RE are preferred.¹² The added assumption of the RE having a normal distribution can lead to more precise estimates compared to FE models if this assumption is correct. In addition to assuming that the RE have a normal distribution, it is generally assumed that the RE, $η_{i}$, are independent of any covariates, $X_{i j}$, included in the model. If we assume the true causal structure is DAG B, rather than DAG A, this assumption will not be met. We investigate how violation of this assumption can lead to bias through simulations in Section 3. Unlike RE, logistic FE regression does not make distributional assumptions about cluster level effects and is not biased if $X$ is correlated with $U$. 
In most data sets it may be expected for cluster level covariates to be correlated with individual level covariates, which means the FE may be preferred to RE, especially if the correlation is strong. However, FE may not perform well in data sets with a large number of small clusters due to the large number of parameters included in the model.^12,18 We also investigate how well PS matching based on the PS estimated using the FE models perform in simulations in Section 3.
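A minimal sketch of the logistic FE model for the PS is given below: the design matrix holds the individual level covariates plus one indicator column per cluster, with no separate intercept. The Newton-Raphson fitter, the small ridge term (added only for numerical stability), the function names, and the toy data are assumptions made for illustration and are not the implementation used in this article.

```python
import numpy as np

def fit_logistic(design, z, ridge=1e-4, n_iter=50):
    """Newton-Raphson logistic regression. The tiny ridge penalty is an
    assumption added for numerical stability; it is not part of the FE model."""
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(design @ beta)))
        grad = design.T @ (z - p) - ridge * beta
        hess = (design * (p * (1.0 - p))[:, None]).T @ design
        hess += ridge * np.eye(design.shape[1])
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-8:
            break
    return beta

def fe_propensity(x, cluster, z):
    """PS from expit(x^T beta + sum_n xi_n I(i = n)): one dummy per cluster."""
    labels, idx = np.unique(cluster, return_inverse=True)
    dummies = np.eye(len(labels))[idx]
    design = np.column_stack([x, dummies])
    coef = fit_logistic(design, z)
    ps = 1.0 / (1.0 + np.exp(-(design @ coef)))
    return ps, coef[: x.shape[1]]

# Toy data (hypothetical): 20 clusters of 200 subjects, X correlated with U as in DAG B.
rng = np.random.default_rng(0)
n_clusters, m = 20, 200
cluster = np.repeat(np.arange(n_clusters), m)
u = rng.normal(size=n_clusters)[cluster]
x = (rng.normal(size=cluster.size) + u)[:, None]
true_lin = -0.5 * x[:, 0] + u
z = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_lin))).astype(float)

ps_hat, beta_hat = fe_propensity(x, cluster, z)
print("estimated coefficient on X:", beta_hat[0])
```

With many small clusters the number of indicator columns grows relative to the sample size, which is the setting where the text above cautions that FE estimation may perform poorly.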

While both FE and RE models are able to control for confounding due to unmeasured cluster level variables, a difference between the two methods is how they handle measured cluster level covariates. In the case of an FE model it is not possible to estimate coefficients for any measured cluster level variables, as any cluster level covariates would be perfectly collinear with the FE. Alternatively, for the RE model, coefficients can be estimated for cluster level covariates just as they are for individual level covariates.

One limitation of the logistic FE and RE models is the parametric assumptions they make about the form of the PS. This includes the link function as well as the component within the link function. Because of the potential for mis-specification of the PS model, some researchers suggest using matching methods that do not use the PS.¹¹ For clustered data, these methods, including MDM or CEM, can often present larger challenges than PS matching. This is because a cluster variable is typically categorical with no ordering, so the Mahalanobis distance cannot be defined for MDM, and CEM would require all matches to be from the same cluster, which may not be reasonable if there are a small number of subjects per cluster. Because of this, despite its challenges, PS matching may be preferred for certain clustered data sets, or in cases where there is a categorical confounder with a potentially small number of subjects per category. In addition, certain PS model mis-specifications may still lead to the consistent estimation of a balancing score that is not the PS.^6,7 In our set up, a balancing score, $b (X, U)$ , is any score such that $X, U ⊥ ⊥ Z | b (X, U)$ . The PS is always a balancing score, and certain mis-specifications of the PS will still result in balancing scores. Such mis-specifications include using the incorrect link function.⁶ These balancing scores can still be used for matching in the same way as the PS. This means that ensuring that the link function is properly specified is not a primary concern when performing PS matching. We investigate the performance of other types of mis-specifications of the PS, such as not including non-linear terms, with simulations in Section 3.

2.1.2 PS matching and regression

Non-parametric methods including GBM have previously been used to estimate the PS.^8–10 GBM is a method for combining many weak classifiers into a single strong classifier. This is done by fitting a series of regression trees, in which each tree gives the best fit for the residuals of the previous tree. At the final step all trees are combined to create a piecewise constant function. A full description, including methods to reduce potential overfitting, can be found in the work of Friedman.^19,20 Other non-parametric methods such as random forest²¹ can also be used for PS estimation. For PS estimation with a binary treatment GBM can be thought of as a binary classification problem. However, the performance of non-parametric methods such as GBM can suffer when trying to account for clustered data by including a FE for each cluster, even for relatively large cluster sizes. We investigate this through simulations in Section 3. Alternatively, semi-parametric methods combining machine learning algorithms with RE have been proposed and could be used to estimate PS.^22–24 However these methods do not work if there is correlation between $X$ and $U$ as in DAG B.
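The boosting idea described above can be sketched with single-split trees (stumps): each stump is fit to the residuals of the current fit, and the stumps are summed on the logit scale. This is a deliberately simplified stand-in for GBM, not the algorithm or software used in this article; it omits subsampling, deeper trees, and stopping rules, and the function names, learning rate, number of trees, and toy data are all assumptions.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_stump(x, r, n_thresholds=9):
    """Best single-split regression tree for residuals r, by squared error."""
    best = None
    for j in range(x.shape[1]):
        for thr in np.quantile(x[:, j], np.linspace(0.1, 0.9, n_thresholds)):
            left = x[:, j] <= thr
            if left.all() or not left.any():
                continue
            lo, hi = r[left].mean(), r[~left].mean()
            sse = np.sum((r - np.where(left, lo, hi)) ** 2)
            if best is None or sse < best[0]:
                best = (sse, j, thr, lo, hi)
    return best[1:]

def boosted_ps(x, z, n_trees=200, learning_rate=0.5):
    """Each stump fits the residuals z - p of the current fit; the final
    score is a piecewise constant function on the logit scale."""
    score = np.full(len(z), np.log(z.mean() / (1.0 - z.mean())))
    for _ in range(n_trees):
        resid = z - expit(score)
        j, thr, lo, hi = fit_stump(x, resid)
        score += learning_rate * np.where(x[:, j] <= thr, lo, hi)
    return expit(score)

# Toy data (hypothetical): a PS that is non-linear in a single covariate.
rng = np.random.default_rng(2)
x = rng.uniform(-1.5, 1.5, size=(4000, 1))
true_ps = expit(2.0 * x[:, 0] ** 2 - 1.0)
z = rng.binomial(1, true_ps).astype(float)

ps_hat = boosted_ps(x, z)
corr = np.corrcoef(ps_hat, true_ps)[0, 1]
print("correlation between estimated and true PS:", corr)
```

Adding one indicator column per cluster to `x` would correspond to the GBM FE approach discussed above, at the cost of a much higher-dimensional split search.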

For the reason mentioned above, we consider only including individual level measured variables in non-parametric PS methods and ignoring clustering of the data. However, in order to get an unbiased estimate of the ATT it may still be necessary to control for confounding due to unmeasured cluster level variables. We propose to do this by including a FE in the outcome model after matching based on PS estimates using non-parametric methods that only include individual level variables. Including variables in the outcome regression that are not included in the PS model can lead to bias.²⁵ However, Remark 1 can be used to define a set of assumptions under which the proposed method will lead to unbiased estimates of the ATT.

Remark 1.
Assume that the PS follows the form $P (Z = 1 | X, U) = w (g (X) + h (U))$ where $h ()$ and $g ()$ are real-valued functions and $w ()$ is a monotonic and differentiable link function. If $Y^{1}, Y^{0} ⊥ ⊥ Z | X, U$ and $0 < P (Z = 1 | X, U) < 1$, then $Y^{1}, Y^{0} ⊥ ⊥ Z | g^{*} (X), U$ and $0 < P (Z = 1 | g^{*} (X), U) < 1$ for any function $g^{*} (X)$ such that $g (X) = f (g^{*} (X))$, where $f ()$ is an arbitrary real-valued function.

To show that Remark 1 is true, note that there exists a function $v$ such that $v (g^{*} (X), U) = w (g (X) + h (U))$. Therefore $\{g^{*} (X), U\}$ is a multivariate balancing score.¹ When the PS follows the form from Remark 1 and it is possible to get a consistent estimate of a function $g^{*} (X)$, it is possible to ignore clustering in the PS model and only control for it in the outcome model. This is equivalent to conditioning on $g^{*} (X)$ and $U$. Specifically, we condition on $g^{*} (X)$ through matching and on $U$ by including a cluster level FE in the outcome model. The form of the PS in Remark 1 implies that there is no interaction between $X$ and $U$ in the true PS model. We can think of $g (X)$ as the part of the PS corresponding to $X$, while $g^{*} (X)$ is a balancing score corresponding to $X$. Just as with standard PS matching, certain model mis-specifications such as incorrect link functions can still lead to consistent estimates of a balancing score, $g^{*} (X)$, even if they do not result in consistent estimates of $g (X)$.⁶ For example, if $g (X) = X^{T} β$ and $X ⊥ ⊥ U$, then logistic regression with $X^{T} β$ as the component within the link function and $h (U)$ omitted can still be used to get a consistent estimate of a balancing score $g^{*} (X)$ under mild regularity conditions.^26,6 However, if we cannot assume that $X ⊥ ⊥ U$, then the resulting estimates will not necessarily be consistent estimates of a balancing score, $g^{*} (X)$. In general, if applying parametric PS methods such as logistic regression it is preferable to include a FE term for cluster level effects, because it allows for association between $X$ and $U$.¹² Table 1 gives an overview of which PS estimation methods are valid depending on whether $X$ and $U$ are correlated.
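As a worked illustration of Remark 1 (an example constructed here, not one taken from the simulations), suppose the link $w$ is the expit function and $g (X) = X^{T} β$. Any affine rescaling of the linear predictor with non-zero slope then qualifies as $g^{*} (X)$, where the constants $c$ and $d$ are hypothetical:

```latex
\begin{align*}
g^{*}(X) &= c\,X^{T}\beta + d, \qquad c \neq 0, \\
f(t) &= \frac{t - d}{c}
  \quad\Longrightarrow\quad
  f\bigl(g^{*}(X)\bigr) = X^{T}\beta = g(X), \\
v\bigl(g^{*}(X), U\bigr)
  &= \operatorname{expit}\!\left(\frac{g^{*}(X) - d}{c} + h(U)\right)
   = P(Z = 1 \mid X, U).
\end{align*}
```

The PS depends on $(X, U)$ only through the pair $\{g^{*} (X), U\}$, so conditioning on that pair balances $X$ and $U$ across treatment groups even though $c$ and $d$ are never identified.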

Table 1.
Overview of whether PS matching using each PS estimation method is valid for controlling for confounding due to $X$ and $U$ depending on correlation between $X$ and $U$ .

PS estimation method	$X ⊥ ⊥ U$	$X$ and $U$ correlated
Logistic	Valid with FE outcome regression	Biased
Logistic FE	Valid	Valid
Logistic RE	Valid	Biased
GBM	Valid with FE outcome regression	Biased

PS: propensity score; FE: fixed effect; RE: random effect; GBM: generalized boosted model.

2.2 Matching methods

In Section 2.1 we discussed methods for estimating the PS when data are clustered. After PS estimation, there are a number of ways to create a matched data set. These include 1:1 or pair matching, 1:k matching, optimal matching, full matching, matching with and without replacement, and matching with and without a caliper. There are a number of papers that discuss these methods and their potential advantages or disadvantages.^{5,27,28,3,4,29–31} Most of these methods were not developed with clustered data in mind, but in many cases they can be extended to data sets with a hierarchical or clustered structure.

For clustered data, there are certain trade-offs to matching methods that are not an issue when observations are all independent. As we noted in Section 2.1, under certain assumptions, it is possible to combine non-parametric PS estimation that only includes individual covariates with FE outcome regression. However, since the full matching method from Hansen⁴ requires the estimation of separate treatment effects for each matched group, including FE for cluster in the outcome model can decrease the precision of estimates by adding a large number of parameters to estimate. Another method for controlling confounding due to cluster level variables is requiring or preferring matches to be from within the same cluster.^15,16,32 In the within-cluster matching method, each treated subject is matched to the untreated subject within their cluster that has the closest PS. If a caliper is used and there is no untreated match within the given caliper in the same cluster, then the treated subject is not included in the analysis. In the within-cluster preferred matching method, if a treated subject has at least one untreated subject in the same cluster with a PS within a given caliper, they will be matched to the untreated subject within their cluster that has the closest PS; otherwise they will be matched to the untreated subject with the closest PS outside their cluster. In this within-cluster preferred matching method, a treated subject is only discarded if there are no untreated subjects within the given caliper across all clusters. For the within-cluster method, matching without replacement can lead to a large number of subjects being excluded from the analysis. When using matching without replacement it is not trivial to define the within-cluster preferred method. Because of this we use matching with replacement for these methods. This makes it necessary to account for subjects appearing in multiple matched pairs when estimating confidence intervals (CIs).
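The two within-cluster rules described above can be sketched as a single function with a switch between them. This is an illustrative sketch of pair matching with replacement under a user-supplied caliper on the PS scale; the function name, argument names, and toy data are assumptions, not the authors' code.

```python
import numpy as np

def cluster_match(ps, z, cluster, caliper, preferred=False):
    """Pair matching with replacement; returns (treated, control) index pairs.

    preferred=False: within-cluster matching. A treated subject is dropped when
    no same-cluster untreated subject lies within the caliper.
    preferred=True: within-cluster preferred matching. Falls back to the closest
    untreated subject in any cluster, and drops the treated subject only when
    no untreated subject lies within the caliper at all.
    """
    controls = np.flatnonzero(z == 0)
    pairs = []
    for t in np.flatnonzero(z == 1):
        dist = np.abs(ps[controls] - ps[t])
        ok = dist <= caliper
        same = ok & (cluster[controls] == cluster[t])
        if same.any():
            pool = same
        elif preferred and ok.any():
            pool = ok
        else:
            continue  # treated subject is excluded from the analysis
        candidates = controls[pool]
        pairs.append((int(t), int(candidates[np.argmin(dist[pool])])))
    return pairs

# Toy example (hypothetical values): six subjects in two clusters.
ps = np.array([0.30, 0.32, 0.60, 0.61, 0.80, 0.50])
z = np.array([1, 0, 1, 0, 1, 0])
cluster = np.array([0, 0, 0, 1, 1, 1])

pairs_within = cluster_match(ps, z, cluster, caliper=0.05)
pairs_pref = cluster_match(ps, z, cluster, caliper=0.05, preferred=True)
print(pairs_within)  # [(0, 1)]
print(pairs_pref)    # [(0, 1), (2, 3)]
```

In the toy example, subject 2 has no untreated subject within the caliper in its own cluster, so it is dropped under within-cluster matching but matched across clusters under the preferred rule; subject 4 has no untreated subject within the caliper anywhere and is dropped under both.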

In this article, we will focus on estimating the ATT using optimal 1:1 matching without replacement, and compare this to the within-cluster and within-cluster preferred matching methods which both do matching with replacement.

2.3 Outcome models

When the PS estimation does not account for clustering, it is necessary to control for confounding due to cluster level covariates in the outcome model using FE regression. Even when PS estimates account for clustering, using a FE model for the outcome may still improve efficiency relative to linear outcome models. For pair matching, this can be written as
$E (Y_{i j} | Z_{i j}, G_{i j}) = α Z_{i j} + \sum_{h = 1}^{H} λ_{h} I (G_{i j} = h) + \sum_{n = 1}^{N} ξ_{n} I (i = n)$ (1)
assuming that $H$ is the number of different matched groups, and letting $G_{i j}$ be an integer value between one and $H$ indicating the matched group the $j$th subject in the $i$th cluster belongs to. The estimate of $α$ in equation (1) corresponds to the estimate of the ATT for the effect of $Z_{i j}$ on $Y_{i j}$. Note that for the within-cluster matching method, equation (1) should not include $\sum_{n = 1}^{N} ξ_{n} I (i = n)$, as it is not possible to estimate FE for each cluster. It is still possible to include FE for the within-cluster preferred matching method described in Section 2.2 as long as some of the matched pairs come from different clusters. Using FE for cluster membership in the outcome model is preferred to RE because it allows $U$ to be associated with $Z$. One drawback for the FE outcome model is that for small cluster sizes it greatly increases the number of parameters needed to estimate relative to sample size. When matching is done with replacement, which we do for the within-cluster and within-cluster preferred matching, it can be useful to include a subject level random intercept in the outcome model to ensure proper CI coverage.
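A minimal sketch of fitting equation (1) by ordinary least squares after pair matching without replacement is shown below. The design matrix contains the treatment indicator, one dummy per matched group, and one dummy per cluster; it is rank deficient by construction, so a least-squares solver that tolerates this is used, and only the treatment coefficient is read off. The function name and the toy matched data are assumptions, and the sketch returns a point estimate only, without standard errors.

```python
import numpy as np

def att_from_matched(y, z, group, cluster):
    """OLS fit of equation (1); returns the coefficient on Z (the ATT estimate)."""
    g = np.unique(group, return_inverse=True)[1]
    c = np.unique(cluster, return_inverse=True)[1]
    design = np.column_stack([
        z,                      # alpha * Z
        np.eye(g.max() + 1)[g],  # matched-group effects lambda_h
        np.eye(c.max() + 1)[c],  # cluster FEs xi_n
    ])
    # lstsq handles the collinearity between the two sets of dummies; alpha is
    # the same in every least-squares solution as long as Z is not in their span.
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0]

# Toy matched data (hypothetical): 500 pairs, 10 clusters, true effect 0.75.
rng = np.random.default_rng(1)
H, N = 500, 10
group = np.repeat(np.arange(H), 2)
z = np.tile([1.0, 0.0], H)
cluster = rng.integers(0, N, size=2 * H)
lam = rng.normal(size=H)
xi = rng.normal(size=N)
y = 0.75 * z + lam[group] + xi[cluster] + rng.normal(scale=0.5, size=2 * H)

alpha_hat = att_from_matched(y, z, group, cluster)
print("estimated ATT:", alpha_hat)
```

For within-cluster matching every pair shares a cluster, so the cluster dummies are absorbed by the matched-group dummies, which is why the text above drops the cluster term in that case.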

2.4 Multilevel clustering for outcome

In the preceding subsections, we considered only a single level of clustering. Here we consider multiple levels of clustering for the outcome, but not for the treatment. This is to replicate the situation where subjects are clustered by treatment site, the outcome is measured for each of a subject’s ears, and treatment is at the subject level. In this case we denote $Y_{i j s}$ as the outcome for the $s$th ear for the $j$th subject in the $i$th cluster. Because the treatment, $Z_{i j}$, is at the individual level, matching is still done at the individual level using the methods described in Section 2.2. We further assume that any covariates related to the second cluster level (left and right ears for individuals) do not affect treatment status, and therefore do not need to be included in PS estimation. Therefore we will still consider the PS models discussed in Section 2.1, and in order to get valid point and interval estimates we recommend including a RE for the second level of clustering in the outcome model. With pair matching, the outcome model is then
$E (Y_{i j s} | Z_{i j}, G_{i j}, ϕ_{i j}) = α Z_{i j} + \sum_{h = 1}^{H} λ_{h} I (G_{i j} = h) + ϕ_{i j}$ (2)
where $ϕ_{i j} \sim N (0, σ_{ϕ}^{2})$ is the RE for the second level of clustering, or
$E (Y_{i j s} | Z_{i j}, G_{i j}, ϕ_{i j}) = α Z_{i j} + \sum_{h = 1}^{H} λ_{h} I (G_{i j} = h) + \sum_{n = 1}^{N} ξ_{n} I (i = n) + ϕ_{i j}$ (3)
if we wish to include a FE for the first level of clustering as in equation (1).

3 Simulations

3.1 Simulations for clustered data

In this section, we test the performance of the PS matching methods described in Section 2. We consider PS estimation using logistic regression without accounting for clustering, logistic FE regression, logistic RE regression, GBM, and GBM FE. We discuss the selection of GBM hyperparameters in the supplementary materials. We use optimal pair matching without replacement with a caliper of 0.2 times the standard deviation (SD) of the PS. We use this matching method because it allows for simple methods to estimate the ATT after matching, and using the same matching method allows us to compare the performance of different PS estimation techniques. In addition to these methods, we consider the within-cluster matching and within-cluster preferred matching methods,^15,16,32 which use logistic regression to estimate the PS and then either require or prefer treated subjects to be matched to untreated subjects within the same cluster, using pair matching with replacement. This is discussed in more detail in Section 2.2. Although we focus on matching without replacement for other methods for simplicity of analysis, it is not straightforward to use matching without replacement when performing within-cluster or within-cluster preferred matching. Because of this we include a subject level random intercept in the outcome model to account for subjects who are in multiple matched pairs. In the supplementary materials, we also consider FE or RE regression for the outcome model, as well as two IPW techniques from Li et al.¹² using PS estimated from logistic FE regression and GBM.

In the first set of simulations we consider the setting where the individual level covariates, $X$, and the unmeasured cluster level covariates, $U$, are correlated. In each simulation there are 70 total clusters and a sample size in each cluster ranging between 25 and 55 with a discrete uniform distribution. In each simulated data set we have three individual level covariates and a cluster level covariate which we treat as an unmeasured covariate. In order to simulate the individual level covariates we first simulate ${\tilde{X}}_{i j}^{(1)} \sim Bernoulli (0.5)$, ${\tilde{X}}_{i j}^{(2)} \sim Unif (- 1, 1)$, and ${\tilde{X}}_{i j}^{(3)} \sim N (1, 2)$. The cluster level unmeasured covariate is $U_{i} \sim N (0, 1)$. In the supplementary materials we also consider $U_{i}$ having a lognormal distribution. In order to introduce association between $U_{i}$ and $X_{i j}$ we then define $X_{i j}^{(1)} = {\tilde{X}}_{i j}^{(1)}$, $X_{i j}^{(2)} = {\tilde{X}}_{i j}^{(2)} + γ \cdot U_{i}$, and $X_{i j}^{(3)} = {\tilde{X}}_{i j}^{(3)} + γ \cdot U_{i}$, and set $γ = 1$. The exposure, $Z_{i j}$, is simulated as $Z_{i j} | (X_{i j}^{(1)}, X_{i j}^{(2)}, X_{i j}^{(3)}, U_{i}) \sim Bernoulli (p_{i j})$, where
$p_{i j} = \frac{\exp (- 0.5 \cdot X_{i j}^{(1)} - 0.7 \cdot X_{i j}^{(2)} - 0.5 \cdot X_{i j}^{(3)} + U_{i})}{1 + \exp (- 0.5 \cdot X_{i j}^{(1)} - 0.7 \cdot X_{i j}^{(2)} - 0.5 \cdot X_{i j}^{(3)} + U_{i})}$ (4)
This results in an exposure prevalence of approximately 37%, which is close to the proportion of respondents taking aspirin in the CHEARS data set, the exposure from the example in Section 4. The average resulting overlap coefficient between $p_{i j}$ for $Z_{i j} = 0$ and $Z_{i j} = 1$ is 0.62 when $γ = 1$ and 0.53 when $γ = 0$. The overlap coefficient measures the intersection between two distributions, with zero indicating no agreement and one indicating perfect agreement.³³ On average, fewer than 1% of simulated subjects have an extreme PS of greater than 0.99 or less than 0.01. The continuous outcome is then generated as
$Y_{i j} = X_{i j}^{(1)} + 0.8 \cdot X_{i j}^{(2)} + 0.5 \cdot X_{i j}^{(3)} + 0.75 \cdot Z_{i j} + U_{i} + ϵ_{i j}$ (5)
where $ϵ_{i j} \sim N (0, 2)$ is an individual level independent error term. Based on this set up, the true ATT for $Z_{i j}$ is 0.75.
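The data-generating process in equations (4) and (5) can be sketched as follows. One point is an assumption rather than something stated in the text: the second argument of $N (1, 2)$ and $N (0, 2)$ is read here as a variance, so the SD passed to the generator is $\sqrt{2}$. The function name and the returned dictionary layout are also illustrative.

```python
import numpy as np

def simulate(gamma=1.0, n_clusters=70, seed=0):
    """One simulated data set following equations (4) and (5)."""
    rng = np.random.default_rng(seed)
    sizes = rng.integers(25, 56, size=n_clusters)      # discrete uniform on 25..55
    cluster = np.repeat(np.arange(n_clusters), sizes)
    n = cluster.size
    u = rng.normal(0.0, 1.0, size=n_clusters)[cluster]  # unmeasured cluster covariate
    x1 = rng.binomial(1, 0.5, size=n).astype(float)
    x2 = rng.uniform(-1.0, 1.0, size=n) + gamma * u
    # Assumption: N(1, 2) means mean 1 and variance 2.
    x3 = rng.normal(1.0, np.sqrt(2.0), size=n) + gamma * u
    lin = -0.5 * x1 - 0.7 * x2 - 0.5 * x3 + u
    p = 1.0 / (1.0 + np.exp(-lin))                      # equation (4)
    z = rng.binomial(1, p)
    eps = rng.normal(0.0, np.sqrt(2.0), size=n)         # same variance assumption
    y = x1 + 0.8 * x2 + 0.5 * x3 + 0.75 * z + u + eps   # equation (5)
    return {"cluster": cluster, "x": np.column_stack([x1, x2, x3]),
            "z": z, "y": y, "p": p}

data = simulate(gamma=1.0)
sizes = np.bincount(data["cluster"])
print("subjects:", data["z"].size, "exposure prevalence:", data["z"].mean())
```

Setting `gamma=0.0` gives the scenario in which $X ⊥ ⊥ U$. Because the treatment coefficient in equation (5) is constant, the true ATT is 0.75 under either setting.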

In addition to the PS and outcome models defined by equations (4) and (5) we consider PS and outcome models with quadratic terms. Specifically, the PS model is
$p_{i j} = \frac{\exp (- 0.5 \cdot X_{i j}^{(1)} - 0.7 \cdot X_{i j}^{(2)} - 0.5 \cdot X_{i j}^{(3)} + 3.5 \cdot (X_{i j}^{(2)})^{2} - 0.5 \cdot (X_{i j}^{(3)})^{2} + U_{i})}{1 + \exp (- 0.5 \cdot X_{i j}^{(1)} - 0.7 \cdot X_{i j}^{(2)} - 0.5 \cdot X_{i j}^{(3)} + 3.5 \cdot (X_{i j}^{(2)})^{2} - 0.5 \cdot (X_{i j}^{(3)})^{2} + U_{i})}$ (6)
and the outcome model is
$Y_{i j} = X_{i j}^{(1)} + 0.8 \cdot X_{i j}^{(2)} + 0.5 \cdot X_{i j}^{(3)} + 0.5 \cdot (X_{i j}^{(2)})^{2} + 0.5 \cdot (X_{i j}^{(3)})^{2} + 0.75 \cdot Z_{i j} + U_{i} + ϵ_{i j}$ (7)
When the true PS model is equation (6), the exposure percentage is close to 36%, and the average overlap coefficient between $p_{i j}$ for $Z_{i j} = 1$ and $Z_{i j} = 0$ is 0.25 when $γ = 1$ and 0.35 when $γ = 0$. This decrease in the overlap coefficient is largely due to an increase in extreme PS values, with close to 20% of simulated subjects having a PS less than 0.01 when $γ = 0$ and 35% of simulated subjects having a PS less than 0.01 or greater than 0.99 when $γ = 1$. When simulating data using equations (6) or (7), we do not include the quadratic terms when estimating either the PS model or outcome model, causing all parametric methods to be mis-specified.

Table 2 reports the bias, SD, and CI coverage rate for each method based on 1000 simulation replicates under the scenario where $γ = 1$ and $X$ and $U$ are correlated, as well as when $γ = 0$ and $X ⊥ ⊥ U$. The variance estimates for the CIs are based on the outcome regression models and ignore any variance due to PS estimation. When $X$ and $U$ are correlated, the conditions in Remark 1 are not met, which means it is necessary to include cluster effects in the estimation of the PS model, and not just the outcome model. For this reason, matching methods that do not account for clustering in the estimation of the PS are biased. This is true for pair matching, within cluster matching, and within cluster preferred matching. Additionally, the logistic RE PS model assumes that $X$ is uncorrelated with the REs; when this assumption is not met, estimates using the logistic RE PS model are also biased. The GBM FE PS model suffers from poor performance due to the high dimension of the FEs. The logistic FE PS model is the only one that has low bias across all settings when $X$ and $U$ are correlated.
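To make the logistic FE PS model concrete, the sketch below fits a logistic regression with one indicator per cluster by Newton-Raphson using only NumPy. It is a minimal illustration and not the software used for the article: the function name `fit_logistic_fe` is hypothetical, and the small `ridge` penalty is an added assumption that keeps the linear solve stable (it is not part of the model described above).

```python
import numpy as np

def fit_logistic_fe(x, cluster, z, n_iter=25, ridge=1e-6):
    """Estimate propensity scores with covariates plus a dummy per cluster."""
    z = np.asarray(z, dtype=float)
    labels, idx = np.unique(cluster, return_inverse=True)
    dummies = np.zeros((z.size, labels.size))
    dummies[np.arange(z.size), idx] = 1.0
    # No global intercept: the cluster dummies already span it.
    design = np.column_stack([x, dummies])
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        eta = np.clip(design @ beta, -30.0, 30.0)
        p = 1.0 / (1.0 + np.exp(-eta))
        w = p * (1.0 - p)
        # Newton step for the (lightly penalized) log-likelihood.
        grad = design.T @ (z - p) - ridge * beta
        hess = (design * w[:, None]).T @ design + ridge * np.eye(beta.size)
        beta = beta + np.linalg.solve(hess, grad)
    eta = np.clip(design @ beta, -30.0, 30.0)
    return 1.0 / (1.0 + np.exp(-eta)), beta

# Toy data: 20 clusters of 50 subjects, covariate correlated with the cluster effect.
rng = np.random.default_rng(0)
cl = np.repeat(np.arange(20), 50)
u = 0.5 * rng.normal(size=20)[cl]
x = rng.normal(size=cl.size) + u
z = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-0.5 * x + u)))).astype(float)
ps, beta = fit_logistic_fe(x[:, None], cl, z)
```

Because each cluster has its own parameter, the fitted PSs sum to the number of exposed subjects within every cluster, which is how the cluster indicators absorb $U$. A cluster in which all subjects share the same exposure has no finite maximum likelihood estimate for its indicator, so such clusters need separate handling.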

Table 2.

Empirical bias (SD) and 95% confidence interval coverage rate (CR) for estimation of the ATT in four different simulation scenarios based on 1000 simulation replicates for five PM methods and two WC matching methods.

				Missing non-linear terms in outcome model		Missing non-linear terms in PS model
		$X, U$ correlated	$X ⊥ ⊥ U$	$X, U$ correlated	$X ⊥ ⊥ U$	$X, U$ correlated	$X ⊥ ⊥ U$
Matching method	Outcome model	Bias (SD,CR)	Bias (SD,CR)	Bias (SD,CR)	Bias (SD,CR)	Bias (SD,CR)	Bias (SD,CR)
Logistic FE, PM	FE	0.00 (0.11,0.95)	0.00 (0.11, 0.96)	0.00 (0.17,0.94)	$-$ 0.04 (0.17,0.94)	$-$ 0.01 (0.10,0.96)	$-$ 0.03 (0.11,0.95)
Logistic RE, PM	FE	$-$ 0.11 (0.11,0.83)	0.00 (0.12,0.95)	$-$ 0.03 (0.14,0.93)	0.00 (0.17,0.94)	$-$ 0.03 (0.10,0.94)	$-$ 0.02 (0.11,0.95)
Logistic, PM	FE	$-$ 0.17 (0.11,0.66)	$-$ 0.01 (0.12,0.95)	$-$ 0.12 (0.13,0.84)	0.00 (0.14,0.96)	$-$ 0.16 (0.13,0.69)	$-$ 0.03 (0.11,0.95)
GBM, PM	FE	$-$ 0.12 (0.12,0.83)	0.03 (0.12,0.94)	$-$ 0.06 (0.15,0.92)	0.06 (0.15,0.94)	$-$ 0.28 (0.17,0.65)	0.02 (0.14,0.95)
GBM FE, PM	FE	0.09 (0.43,0.48)	0.06 (0.47,0.49)	0.27 (0.62,0.56)	0.15 (0.81,0.47)	$-$ 0.07 (0.43,0.73)	$-$ 0.01 (0.32,0.77)
Logistic, WC	LM	$-$ 0.18 (0.12,0.68)	$-$ 0.01 (0.12,0.94)	$-$ 0.17 (0.14,0.76)	$-$ 0.01 (0.15,0.95)	$-$ 0.13 (0.14,0.79)	0.00 (0.12,0.96)
Logistic, preferred WC	FE	$-$ 0.18 (0.11,0.66)	$-$ 0.01 (0.12,0.95)	$-$ 0.16 (0.14,0.76)	$-$ 0.01 (0.15,0.94)	$-$ 0.13 (0.13,0.74)	$-$ 0.01 (0.12,0.95)

True ATT is 0.75, which is the coefficient for treatment in the outcome model. Logistic and GBM PS models contain $X^{(1)}$, $X^{(2)}$, $X^{(3)}$ as covariates; logistic FE and GBM FE PS models contain $X^{(1)}$, $X^{(2)}$, $X^{(3)}$ as covariates as well as a fixed effect for cluster; the logistic RE PS model contains $X^{(1)}$, $X^{(2)}$, $X^{(3)}$ as covariates as well as a random intercept for cluster. FE: fixed effect; RE: random effect; GBM: generalized boosted model; CR: coverage rate; ATT: average treatment effect among the treated; PM: pair matching; WC: within-cluster matching.

It is interesting to note that the logistic FE PS model performs well, even in the case when the PS model is mis-specified. This is because it is able to sufficiently balance all the relevant confounders, even though it is mis-specified, as we will see in Section 3.2. Even though $X^{(1) 2}$, $X^{(2) 2}$, and $X^{(3) 2}$ are not included in the PS model and may not be adequately balanced, if they are not part of the outcome model, and therefore not confounders, this imbalance may not lead to biased causal estimates. This shows that PS matching can lead to satisfactory results even if the PS model is mis-specified, as long as the matching adequately balances all the relevant confounders. This also shows that PS matching can perform well even when there is a higher percentage of extreme PS values and a lower overlap coefficient between the PSs for exposed and unexposed subjects.

This highlights an advantage of PS matching, which is that even certain mis-specified PS estimates may act as balancing scores. We also see this in the setting where $X ⊥ ⊥ U$, in which the conditions in Remark 1 are met. Here all the PS matching methods except for the GBM FE PS model have bias with an absolute value less than or equal to 0.06. The GBM FE PS model still suffers from instability due to the high dimension of the FEs. All other methods that did not perform well when $X$ and $U$ are correlated show much better performance when $X ⊥ ⊥ U$, including when the PS model is mis-specified. When the PS model is properly specified but the outcome model includes non-linear terms, PS matching methods still perform well.

Across all settings the CIs for the GBM FE PS matching method show undercoverage even when the bias is small. This is likely due to the high standard deviation and instability of the estimates. The CIs for the other PS matching methods show close to the desired 0.95 coverage when the bias is low. For the within-cluster and within-cluster preferred matching it is necessary to include a subject level random intercept in the outcome model to ensure the desired CI coverage due to matching with replacement.
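The three quantities in Table 2 (bias, empirical SD, and coverage rate) are simple functions of the per-replicate estimates and their model-based standard errors. The sketch below shows one way to compute them. The function name `summarize_replicates` is hypothetical, and the use of normal-based Wald intervals with critical value 1.96 is an assumption; the article only states that the variance estimates come from the outcome regression models.

```python
import numpy as np

def summarize_replicates(estimates, std_errors, truth=0.75, crit=1.96):
    """Return (bias, empirical SD, CI coverage rate) across replicates."""
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    bias = est.mean() - truth
    sd = est.std(ddof=1)
    # A replicate "covers" when the true ATT lies inside its Wald interval.
    covered = (est - crit * se <= truth) & (truth <= est + crit * se)
    return bias, sd, covered.mean()

# Toy example with four replicates and a common standard error of 0.1.
bias, sd, coverage = summarize_replicates([0.75, 0.85, 0.65, 1.75], [0.1] * 4)
```

In this toy example the last replicate is far from 0.75 while its standard error is small, so it fails to cover. The same mechanism produces undercoverage whenever the model-based standard errors understate the true variability of the estimates.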

Results for additional methods that do not use matching, including regression and PS weighting, can be found in Table S1 of the Supplemental materials.

In Section 1.3 of the Supplemental materials (Table S2) we report the results for PS matching using full rather than pair matching. For full matching, the logistic FE PS model performs the best, although unlike pair matching it is biased when the estimated PS model does not include non-linear terms that are in the true PS model. Section 1.4 (Table S3) of the Supplemental materials reports the same results as Table 2, with multilevel clustering for the outcome, but not the treatment. This is done to mimic the hearing loss data in the CHEARS data set. The results are similar to those without multilevel clustering for the outcome in terms of bias and CI coverage. Additionally, Section 1.5 of the Supplemental materials (Table S5) reports results when $U$ has a lognormal, rather than normal, distribution. In this case, when $X$ and $U$ are correlated, the logistic FE PS model results in larger bias if the estimated PS model does not include non-linear terms that are in the true PS model.

3.2 Evaluating balance in matched data sets

Table 2 gives a helpful overview of which methods perform best in each simulation setting. Next, we consider balance for each of the covariates in the matched data sets. Balance measures are often thought of as a way to see if matching worked well. Specifically, if each of the potential confounders is balanced across the matched groups, then it is thought that the matching was successful and therefore PS matching can be used to control for confounding. In this section, we highlight some instances where balance measures must be used with caution in determining how well PS matching performs at controlling for confounding in the clustered data setups. As a measure for balance, we used the standardized mean difference (SMD). The SMD for a variable is a weighted difference between the mean of this variable for treated and untreated subjects divided by the pooled SD. The weighting is proportional to the harmonic mean of the number of treated and untreated subjects in each matched grouping.³⁴,³⁵ In addition to the three observed covariates, $X^{(1)}$, $X^{(2)}$, and $X^{(3)}$, we also consider the balance for $U$, which is known in the simulation setting.
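For pair matching every matched grouping contains one treated and one untreated subject, so the harmonic-mean weights are equal across groupings and the SMD reduces to a plain difference in means over a pooled SD. The sketch below covers only that special case. The function name `smd` is hypothetical, and computing the pooled SD from the two group variances of the sample passed in is an assumption, since the section does not say which sample the pooled SD is taken from.

```python
import numpy as np

def smd(x, z):
    """Standardized mean difference of x between treated (z=1) and untreated (z=0)."""
    x = np.asarray(x, dtype=float)
    treated = np.asarray(z).astype(bool)
    diff = x[treated].mean() - x[~treated].mean()
    # Pooled SD: square root of the average of the two group variances.
    pooled_sd = np.sqrt((x[treated].var(ddof=1) + x[~treated].var(ddof=1)) / 2.0)
    return diff / pooled_sd

# Toy example: treated mean 1, untreated mean 2, both group variances equal to 2.
value = smd([0.0, 2.0, 1.0, 3.0], [1, 1, 0, 0])
```

Applying the same function to a squared covariate, for example `smd(x2 ** 2, z)`, gives the balance of the quadratic terms.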

Table 3 presents the SMD averaged across 1000 simulation replicates with data generated using equation (6) as the true PS model with $γ = 0$. When the true PS model is equation (6) the parametric PS models are mis-specified. Table 3 shows relatively large SMD values across the simulations for $X^{(1)}$, $X^{(2)}$ and $X^{(3)}$ in the unmatched data set, with an average value of $-$ 0.20 for $X^{(2)}$ and $-$ 0.81 for $X^{(3)}$. After pair matching using PSs estimated using logistic, logistic FE, or logistic RE regression, the average SMD values for $X^{(1)}$, $X^{(2)}$ and $X^{(3)}$ all reduce greatly, with a maximum absolute value of 0.12. However, for $X^{(2) 2}$ and $X^{(3) 2}$, the average SMD is much farther away from zero for the parametric PS methods than for GBM. For the results presented in Section 3.1, having imbalance in $X^{(2) 2}$ and $X^{(3) 2}$ does not cause bias since these variables are not confounders. This is because we only include the non-linear terms in the true PS model or the outcome model, but not both at the same time. However, in practice failing to balance non-linear terms can cause bias if they act as confounders and are in both the true PS and outcome models. This shows the importance of ensuring a well specified model when using parametric PS models. This may include checking for inclusion of non-linear or interaction terms using goodness of fit measures. We present the results for balance measures for different simulation setups in Section 1.5 (Tables S5 to S7) in the Supplemental materials.

Table 3.
Mean (SD) standardized mean difference across 1000 simulations for all covariates and their quadratic terms for matched data sets using pair matching (PM). No quadratic term included in estimation of PS for all methods.

	$X ⊥ ⊥ U$, missing non-linear term in PS model
PS, matching method	$X^{(1)}$	$X^{(2)}$	$X^{(3)}$	$X^{(1) 2}$	$X^{(2) 2}$	$X^{(3) 2}$	$U$
No matching	$-$ 0.13 (0.04)	$-$ 0.20 (0.04)	$-$ 0.81 (0.04)	$-$ 0.13 (0.04)	0.58 (0.05)	$-$ 0.89 (0.03)	0.53 (0.06)
Logistic FE, PM	0.00 (0.03)	$-$ 0.02 (0.02)	0.12 (0.03)	0.00 (0.03)	0.77 (0.05)	$-$ 0.37 (0.06)	0.18 (0.05)
Logistic RE, PM	$-$ 0.01 (0.03)	$-$ 0.04 (0.02)	0.04 (0.03)	$-$ 0.01 (0.03)	0.76 (0.05)	$-$ 0.47 (0.05)	0.04 (0.04)
Logistic, PM	$-$ 0.01 (0.03)	$-$ 0.04 (0.03)	0.01 (0.01)	$-$ 0.01 (0.03)	0.77 (0.05)	$-$ 0.14 (0.04)	0.71 (0.08)
GBM, PM	$-$ 0.01 (0.03)	0.01 (0.03)	0.02 (0.02)	$-$ 0.01 (0.03)	$-$ 0.01 (0.03)	0.01 (0.01)	0.88 (0.09)
GBM FE, PM	$-$ 0.08 (0.11)	$-$ 0.01 (0.11)	0.02 (0.20)	$-$ 0.08 (0.11)	0.16 (0.30)	$-$ 0.08 (0.18)	0.00 (0.27)

Logistic and GBM PS models contain $X^{(1)}$, $X^{(2)}$, $X^{(3)}$ as covariates; logistic FE and GBM FE PS models contain $X^{(1)}$, $X^{(2)}$, $X^{(3)}$ as covariates as well as a fixed effect for cluster; the logistic RE PS model contains $X^{(1)}$, $X^{(2)}$, $X^{(3)}$ as covariates as well as a random intercept for cluster. PM: pair matching; PS: propensity score; FE: fixed effect; RE: random effect; GBM: generalized boosted model.

We also note that the GBM FE PS model has an SMD value close to zero for many of the covariates, including quadratic terms and the unmeasured cluster level covariate, $U$. However, the empirical SD of the SMD for these covariates is in some cases an order of magnitude larger than for other methods, which indicates why PS matching using GBM FE performs poorly in the finite sample simulations. Although a GBM FE model may work with sufficient sample size, for many data sets in which clustering must be accounted for in PS estimation it may be necessary to fit a well specified parametric model.

Additionally, we can see that while PS matching using FE or logistic RE regression will reduce the SMD for $U$ , PS estimation techniques that do not account for clustering can sometimes increase the SMD for $U$ relative to an unmatched data set. It is important to note that in real data sets this lack of balance in $U$ cannot be observed. Additionally, balance measures cannot help to determine whether there is correlation between measured confounders and potentially unmeasured cluster level confounders. It is advisable to consider existing subject area knowledge to determine if $X ⊥ ⊥ U$ is a reasonable assumption, or if it is necessary to account for clustering in PS estimation.

4 Analysis of conservation of hearing data

As an illustrative example we apply PS matching methods to analyze the hearing data from the NHS II CHEARS. Specifically, we focus on the causal effect of aspirin use on hearing deterioration. CHEARS examines risk factors for hearing loss among participants in several large ongoing cohorts, including the NHS II, an ongoing cohort study of 116,430 female registered nurses in the United States, aged 25–42 years at enrollment in 1989. In a sub-cohort of these NHS II participants, the CHEARS Audiology Assessment Arm (AAA), longitudinal changes in pure-tone air and bone conduction audiometric hearing thresholds were assessed. The methods for the CHEARS AAA are described elsewhere.³⁶⁻³⁸ A priori, participants who reported excellent or good hearing and no history of otologic disease were invited to participate to examine early threshold changes. The study population includes 3136 participants in the ongoing NHS II cohort study who completed hearing assessments at both baseline (2012–2015) and 3-year follow-up (2015–2018). Hearing thresholds were measured using pure tone audiometry at the frequencies 0.5, 1, 2, 3, 4, 6, and 8 kHz in the left and right ears.

The frequency-specific outcome for all analyses is the hearing threshold at 3-year follow-up, with a larger value indicating worse hearing. We analyze both the hearing threshold averaged across the two ears and the threshold for each ear individually. The exposure is at least weekly aspirin use (including baby aspirin) as measured on the 2011 NHS II questionnaire, which is the latest questionnaire completed entirely before the baseline time window.

At baseline, each subject was measured at one of 34 sites, while at year three each subject was measured at one of 33 sites. Cluster membership is determined by testing sites at both baseline and 3-year follow-up, and subjects had to be tested at the same site for both timepoints to be in the same cluster. As an example, subjects who were tested at site A for baseline and site B for 3-year follow-up would be considered to be in a cluster which we can denote as AB. Subjects who were tested at site B for baseline and site A for 3-year follow-up would be considered to be in a cluster which we can denote as BA. In total there were 48 different combinations of baseline and 3-year follow-up sites that had at least five subjects, with a maximum of 201 subjects per cluster. The list of potential measured individual level confounders we considered included age at baseline, BMI in 2011 (categorized as <25, 25–29, 30–34, 35–39, 40+), total physical activity in 2011 (quintiles), smoking status in 2011, hypertension by baseline, diabetes by baseline, Dietary Approaches to Stop Hypertension (DASH) diet score in 2011 (quintiles), total caloric energy intake in 2011, and the hearing threshold for the same frequency at baseline. The mean and standard deviation for age and caloric intake, as well as the proportion in each category for all other confounders, broken out by aspirin use, are included in Section 2 of the Supplemental materials.
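The cluster definition above (an ordered pair of baseline and follow-up sites) can be expressed directly in code. The sketch below is illustrative only: the function name `assign_clusters` is hypothetical, and treating subjects in combinations with fewer than five members as unassigned (`None`) is an assumption, since the text reports the number of combinations with at least five subjects but does not state how smaller combinations were handled.

```python
from collections import Counter

def assign_clusters(baseline_sites, followup_sites, min_size=5):
    """Label each subject by the ordered (baseline, follow-up) site pair."""
    pairs = list(zip(baseline_sites, followup_sites))
    counts = Counter(pairs)
    # ASSUMPTION: combinations with fewer than min_size subjects get no cluster.
    return [pair if counts[pair] >= min_size else None for pair in pairs]

# Toy example: AB and BA are distinct clusters; the single AA subject is unassigned.
baseline = ["A"] * 5 + ["B"] * 5 + ["A"]
followup = ["B"] * 5 + ["A"] * 5 + ["A"]
clusters = assign_clusters(baseline, followup)
```

Using a tuple rather than a concatenated string such as "AB" avoids ambiguity when site identifiers have more than one character.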

A test for association between each of the individual level covariates and testing site indicates association between site and age at baseline as well as DASH score. Therefore, it is important to use methods that are robust to associations between testing site and individual level covariates. In our primary analysis we use a logistic FE regression model to estimate the PS combined with a FE outcome model, which also includes an indicator for matched grouping and baseline hearing threshold averaged between the left and right ears. The PS of interest is the probability of taking aspirin in 2011 given the covariates listed above. In all cases we used optimal pair matching without replacement to estimate the ATT. We consider the hearing thresholds for left and right ears as correlated measurements, so we also include an individual level random intercept to account for correlation between left and right ears. This is not necessary when we consider the average hearing threshold for left and right ears as the outcome. In addition to using logistic FE regression to estimate the PS, we also consider using GBM. This can be used as a robustness check for potential violations of linearity in the logistic FE model, but it may introduce bias due to the possible association between individual level covariates and testing site.
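As a rough illustration of pair matching without replacement on an estimated PS, the sketch below uses a greedy nearest-neighbor rule. This is a deliberately simpler technique than the optimal pair matching used in the analysis, which minimizes the total within-pair distance rather than matching one treated subject at a time. The function name `greedy_pair_match` and the choice to process treated subjects from highest to lowest PS are assumptions made for the example.

```python
import numpy as np

def greedy_pair_match(ps, z):
    """Greedy 1:1 nearest-neighbor matching on the PS, without replacement."""
    ps = np.asarray(ps, dtype=float)
    is_treated = np.asarray(z).astype(bool)
    treated = np.flatnonzero(is_treated)
    controls = list(np.flatnonzero(~is_treated))
    pairs = []
    # Treated subjects with the highest PS are usually hardest to match, so go first.
    for t in treated[np.argsort(-ps[treated])]:
        if not controls:
            break  # no untreated subjects left to match
        dist = np.abs(ps[controls] - ps[t])
        j = int(np.argmin(dist))
        # pop() removes the chosen control so it cannot be reused.
        pairs.append((int(t), int(controls.pop(j))))
    return pairs

# Toy example: subjects 0 and 1 are treated, subjects 2 to 4 are untreated.
pairs = greedy_pair_match([0.9, 0.2, 0.85, 0.25, 0.5], [1, 1, 0, 0, 0])
```

Each returned tuple holds the index of a treated subject and the index of its matched untreated subject. The pair index can then enter the outcome model as the matched grouping indicator described above.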

Table 4 presents the estimates for the ATT of aspirin use on hearing threshold at 3-year follow-up measured in dB, including both unadjusted and pair matching estimates. After matching, the effect of treatment was estimated in a regression model that controlled for cluster using FE as well as the hearing threshold at baseline. The estimates after PS matching are attenuated relative to the unadjusted estimates for most frequencies. The estimated ATT of 0.48 (95% CI: 0.06–0.90) for the averaged ear analysis and 0.48 (95% CI: 0.05–0.91) for the both ear analysis indicates that hearing deterioration at 500 Hz is greater among those who take aspirin at least weekly. The results using GBM are similar to those using logistic FE regression for 500 Hz. There are larger differences between the point estimates using logistic FE and GBM at higher frequencies; however, the CIs include zero for both methods at all other frequencies besides 1000 Hz, where the CI for the averaged ear analysis using GBM is completely above zero.

Table 4.
ATT estimates for the effect of aspirin use on hearing threshold at 3-year follow-up based on data from Nurses Health Study II Conservation of Hearing study.

	Unadjusted		Pair matching FE		Pair matching GBM	
Frequency, Hz	Estimate	CI	Estimate	CI	Estimate	CI
Averaged left and right ears
500	0.53	(0.17, 0.88)	0.48	(0.06, 0.90)	0.44	(0.00, 0.88)
1000	0.22	( $-$ 0.09, 0.53)	0.15	( $-$ 0.25, 0.54)	0.42	(0.02, 0.81)
2000	0.22	( $-$ 0.11, 0.55)	0.13	( $-$ 0.27, 0.53)	0.21	( $-$ 0.20, 0.61)
3000	0.08	( $-$ 0.27, 0.44)	$-$ 0.07	( $-$ 0.50, 0.36)	0.16	( $-$ 0.28, 0.61)
4000	0.17	( $-$ 0.26, 0.60)	0.00	( $-$ 0.52, 0.52)	$-$ 0.04	( $-$ 0.57, 0.50)
6000	0.54	(0.00, 1.08)	0.19	( $-$ 0.49, 0.87)	0.41	( $-$ 0.28, 1.10)
8000	0.60	(0.01, 1.18)	0.11	( $-$ 0.58, 0.81)	0.48	( $-$ 0.22, 1.19)
Separate left and right ears
500	0.57	(0.27, 0.87)	0.48	(0.05, 0.91)	0.53	(0.08, 0.98)
1000	0.27	(0.01, 0.54)	0.12	( $-$ 0.28, 0.53)	0.36	( $-$ 0.05, 0.77)
2000	0.28	(0.00, 0.56)	0.08	( $-$ 0.33, 0.49)	0.20	( $-$ 0.21, 0.62)
3000	0.17	( $-$ 0.14, 0.48)	$-$ 0.11	( $-$ 0.55, 0.34)	0.16	( $-$ 0.31, 0.62)
4000	0.27	( $-$ 0.10, 0.63)	$-$ 0.03	( $-$ 0.56, 0.51)	$-$ 0.07	( $-$ 0.62, 0.47)
6000	0.66	(0.20, 1.13)	0.22	( $-$ 0.48, 0.92)	0.33	( $-$ 0.38, 1.04)
8000	0.72	(0.22, 1.22)	0.01	( $-$ 0.71, 0.72)	0.39	( $-$ 0.35, 1.13)

PS is estimated using logistic FE regression or generalized boosted models. For averaged left/right ears, the outcome model is a FE linear model including baseline hearing threshold and FE for cluster. For separate left/right ears, the outcome model is a RE linear model including baseline hearing threshold, FE for cluster and a random intercept for individual. CI: confidence interval; PS: propensity score; FE: fixed effect; RE: random effect; ATT: average treatment effect among the treated.

In order to check the balance of the matched data sets we calculated the SMD for each of the individual level covariates included in the PS models. We do this for the matched data sets created using pair matching with the PS estimated using logistic FE regression, as well as the full unmatched data set. Tables S9 to S15 report the balance measures for the unmatched data set as well as the matched data set for each frequency for both PS matching using logistic FE regression and GBM. In the original data set, a number of the individual level covariates are unbalanced between the aspirin groups. After pair matching using logistic FE regression we see much better balance across the covariates (SMD $\leq$ 0.06 for all covariates). This indicates that pair matching was able to balance the individual level variables we include in the FE model. The balance is also improved when using pair matching with GBM, although, as we have discussed, because this method does not account for clustering in the PS matching, the results may still be biased. We also present a visual representation of covariate balance in Figure 2, which includes the balance plots for each of the confounders before and after our main analysis of pair matching using logistic FE regression. This includes the baseline measurements for 500 Hz, with other frequencies having similar balance measures after matching. In particular, the balance for age, hypertension, and BMI is improved after PS matching. As noted in Section 3, matching based on logistic FE regression can still lead to biased estimates if the PS is mis-specified. Therefore, we investigate the inclusion of non-linear terms for the continuous covariates. We consider binned residual plots, plotting the residuals against each of the continuous covariates,³⁹ presented in Figure S1 in the Supplemental materials. For all three covariates these binned residual plots show a consistent scatter around zero and no strong indication of any non-linearity that would indicate the need to include non-linear transformations of the covariates.
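The quantities behind a binned residual plot are straightforward to compute: order subjects by a continuous covariate, split them into bins, and average the PS residuals within each bin. The sketch below does this with equal-count bins. The function name `binned_residuals`, the default of ten bins, and the use of raw residuals (observed exposure minus fitted PS) are assumptions made for illustration; the binning used for Figure S1 is not described here.

```python
import numpy as np

def binned_residuals(x, z, ps, n_bins=10):
    """Average PS residuals (z - ps) within equal-count bins of covariate x."""
    x = np.asarray(x, dtype=float)
    resid = np.asarray(z, dtype=float) - np.asarray(ps, dtype=float)
    order = np.argsort(x)
    centers, means = [], []
    for chunk in np.array_split(order, n_bins):
        centers.append(x[chunk].mean())   # horizontal position of the bin
        means.append(resid[chunk].mean())  # average residual in the bin
    return np.array(centers), np.array(means)

# Toy example: constant residual of 0.5 in every bin.
centers, means = binned_residuals(np.arange(10.0), np.ones(10), np.full(10, 0.5), n_bins=5)
```

Plotting `means` against `centers` gives the binned residual plot. A curved pattern would suggest a missing non-linear term for that covariate, while scatter around zero is consistent with the linear specification.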

Figure 2.

Overlap plots comparing distributional overlap before and after PS pair matching using FE logistic regression. PS: propensity score; FE: fixed effect.

5 Discussion

In this article, we study the use of PS matching to estimate causal effects for clustered observational data. These methods make it possible to control not just for measured variables, but also unmeasured cluster level variables through the inclusion of FE or RE in the PS or outcome model. The methods we focus on are for studies where treatment or exposure is assigned at the individual, rather than cluster, level. The considerations for matching in studies where treatment or exposure is assigned at the cluster level will be different. One of the advantages of PS matching is that it can lead to unbiased estimates of the ATT even for certain mis-specifications of the PS model, including using the incorrect link function.⁶ We highlight that the types of mis-specifications of the PS model that can still lead to unbiased causal estimates in PS matching depend on the true outcome model as well as the true PS model. When using pair matching without replacement, the estimated CIs for the coefficient of the exposure from the outcome model perform well in simulations despite ignoring potential variance due to estimating the PS. However, the CIs under pair matching with replacement must account for correlation between pairs that include the same untreated subject in order to get the desired coverage. We are able to do this by including a subject level random intercept. Currently the within-cluster and within-cluster preferred matching algorithms are not easily defined when trying to do matching without replacement, and defining these methods for matching without replacement is a potential area of future research.

Even though PS matching is robust to certain mis-specifications of the PS model, non-parametric methods can still offer a more robust version of PS estimation. Methods such as GBM make minimal assumptions about the form of the PS. However, these methods suffer from poor performance when trying to include FE for clustering. For this reason, it can be useful to instead control for any unmeasured cluster level variables using FE in the outcome model. We discuss the assumptions necessary to use this method in Section 2.1.2. We show that this method can get consistent estimates if (i) the individual level covariates included in the PS model are independent of the unmeasured cluster level covariates and (ii) the true PS model is an additive function of the measured covariates included in the PS model and any unmeasured cluster level covariates. However, if these conditions are not met it is necessary to account for clustering in the PS estimation and not just in the outcome regression model. This is related to the potential issues of controlling for covariates not included in the PS model.²⁵ When the necessary assumptions are met we show that using GBM along with FE regression can be used, and is robust to potential model mis-specification. Alternatively, when these assumptions are not met, particularly when $X$ is associated with $U$ , it is necessary to control for unmeasured cluster level variables in PS estimation. In this case, performance of non-parametric methods such as GBM can suffer greatly, and parametric methods are preferred. This highlights that including cluster as a potential confounder can complicate PS matching methods, because the cluster variable is typically a categorical covariate with a large number of categories. Many matching methods struggle with this type of confounder, so it is important to select appropriate methods. 
Other methods for matching clustered data include combining cardinality matching for continuous covariates with fine or nearly fine matching for categorical variables such as cluster.⁴⁰ Cardinality matching finds the largest possible set of pair matches subject to a given balance constraint,⁴¹ while fine or nearly-fine matching ensures that the marginal distributions of a categorical variable are the same or nearly the same across all matched treated and control subjects.⁴² Cardinality matching does not model the PS; however, the balance constraints chosen in cardinality matching implicitly make assumptions about the true outcome model, and fine matching may be impossible depending on the number of subjects per cluster.

Identifying non-parametric methods for clustered data that can be used to better estimate PS is an important area of future research. This would allow for PS matching to be much more robust to potential model mis-specification, while still allowing for correlation between observed individual level variables and unmeasured cluster level variables. Current methods that can account for these either have to make assumptions about the independence between measured and unmeasured variables, or make assumptions about the form of the PS. Finally, this article primarily focuses on continuous outcomes. It would also be of interest to investigate the performance of PS matching methods for clustered data for binary or categorical outcomes.

Supplemental Material

sj-pdf-1-smm-10.1177_09622802221133556 - Supplemental material for An overview of propensity score matching methods for clustered data

Supplemental material, sj-pdf-1-smm-10.1177_09622802221133556 for An overview of propensity score matching methods for clustered data by Benjamin Langworthy, Yujie Wu and Molin Wang in Statistical Methods in Medical Research

Footnotes

Acknowledgements

The authors thank the reviewers for their insightful comments that have improved this article.

Data availability statement

The data for the Audiology Assessment Arm are not publicly available.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was partially supported by the National Institutes of Health grants R01 DC017717, U01 CA176726 (NHS II), and U01 HL145386 (NHS II).

Supplemental material

Supplemental material for this article is available online.

References

Rosenbaum

Rubin

. The central role of the propensity score in observational studies for causal effects. Biometrika 1983; 70: 41–55.

Forastiere

Mealli

VanderWeele

. Identification and estimation of causal mechanisms in clustered encouragement designs: disentangling bed nets using Bayesian principal stratification. J Am Stat Assoc 2016; 111: 510–525.

Rosenbaum

. A characterization of optimal designs for observational studies. J R Stat Soc: Ser B (Methodological) 1991; 53: 597–610.

Hansen

. Full matching in an observational study of coaching for the SAT. J Am Stat Assoc 2004; 99: 609–618.

Stuart

. Matching methods for causal inference: a review and a look forward. Stat Sci: Rev J Inst Math Stat 2010; 25: 1.

Waernbaum

. Propensity score model specification for estimation of average treatment effects. J Stat Plan Inference 2010; 140: 1948–1956.

Waernbaum

. Model misspecification and robustness in causal inference: comparing matching with doubly robust estimation. Stat Med 2012; 31: 1572–1581.

McCaffrey

Ridgeway

Morral

. Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychol Methods 2004; 9: 403.

McCaffrey

Griffin

Almirall

, et al. A tutorial on propensity score estimation for multiple treatments using generalized boosted models. Stat Med 2013; 32: 3388–3414.

10.

Abdia

Kulasekera

Datta

, et al. Propensity scores based methods for estimating average treatment effect and average treatment effect among treated: a comparative study. Biom J 2017; 59: 967–985.

11.

King

Nielsen

. Why propensity scores should not be used for matching. Polit Anal 2019; 27: 435–454.

12.

Zaslavsky

Landrum

. Propensity score weighting with multilevel data. Stat Med 2013; 32: 3373–3387.

13.

Yang

Imbens

Cui

, et al. Propensity score matching and subclassification in observational studies with multi-level treatments. Biometrics 2016; 72: 1055–1065.

14.

Yang

. Propensity score weighting for causal inference with clustered data. J Causal Inference 2018; 6. Article Number: 20170027.

15. Arpino, Cannas. Comparing different approaches for propensity score matching with clustered data: a simulation study. 2015.

16. Arpino, Cannas. Propensity score matching with clustered data. An application to the estimation of the impact of caesarean section on the Apgar score. Stat Med 2016; 35: 2074–2091.

17. Rubin. Causal inference using potential outcomes: design, modeling, decisions. J Am Stat Assoc 2005; 100: 322–331.

18. Neyman, Scott. Consistent estimates based on partially consistent observations. Economet: J Economet Soc 1948; 16: 1–32.

19. Friedman. Greedy function approximation: a gradient boosting machine. Ann Stat 2001; 29: 1189–1232.

20. Friedman. Stochastic gradient boosting. Comput Stat Data Anal 2002; 38: 367–378.

21. Random decision forests. Proc 3rd Int Conference Doc Anal Recogn 1995; 1: 278–282.

22. Hajjem, Bellavance, Larocque. Mixed-effects random forest for clustered data. J Stat Comput Simul 2014; 84: 1313–1328.

23. Ngufor, Van Houten, Caffo, et al. Mixed effect machine learning: a framework for predicting longitudinal change in hemoglobin A1c. J Biomed Inform 2019; 89: 56–67.

24. Capitaine, Genuer, Thiébaut. Random forests for high-dimensional longitudinal data. arXiv preprint arXiv:1901.11279 2019.

25. Shinozaki, Nojima. Misuse of regression adjustment for additional confounders following insufficient propensity score balancing. Epidemiology 2019; 30: 541–548.

26. Cramer. Omitted variables and misspecified disturbances in the logit model. Technical report, Tinbergen Institute Discussion Paper, 2005.

27. Austin. Some methods of propensity-score matching had superior performance to others: results of an empirical investigation and Monte Carlo simulations. Biometrical J: J Math Method Biosci 2009; 51: 171–184.

28. Hill, Reiter. Interval estimation for treatment effects using propensity score matching. Stat Med 2006; 25: 2230–2256.

29. Hansen, Klopfer. Optimal full matching and related designs via network flows. J Comput Graph Stat 2006; 15: 609–627.

30. Austin, Stuart. The performance of inverse probability of treatment weighting and full matching on the propensity score in the presence of model misspecification when estimating the effect of treatment on survival outcomes. Stat Methods Med Res 2017; 26: 1654–1670.

31. Stuart, Green. Using full matching to estimate causal effects in nonexperimental studies: examining the relationship between adolescent marijuana use and adult outcomes. Dev Psychol 2008; 44: 395.

32. Cannas, Arpino. Matching with clustered data: the CMatching package in R. R J 2019; 11: 7.

33. Clemons, Bradley EL. A nonparametric measure of the overlapping coefficient. Comput Stat Data Anal 2000; 34: 51–61.

34. Hansen, Bowers. Covariate balance in simple, stratified and clustered comparative studies. Stat Sci 2008; 23: 219–236.

35. Bowers, Fredrickson, Hansen. RItools: Randomization Inference Tools. 2019. R package version 0.1-17.

36. Curhan, Halpin, Wang, et al. Prospective study of dietary patterns and hearing threshold elevation. Am J Epidemiol 2020; 189: 204–214.

37. Curhan, Stankovic, Halpin, et al. Osteoporosis, bisphosphonate use, and risk of moderate or worse hearing loss in women. J Am Geriatr Soc 2021; 69: 3103–3113.

38. Curhan, Halpin, Wang, et al. Tinnitus and 3-year change in audiometric hearing thresholds. Ear Hear 2021; 42: 886–895.

39. Gelman, Hill. Data analysis using regression and multilevel/hierarchical models. Cambridge, UK: Cambridge University Press, 2006.

40. Zubizarreta, Keele. Optimal multilevel matching in clustered observational studies: a case study of the effectiveness of private schools under a large-scale voucher system. J Am Stat Assoc 2017; 112: 547–560.

41. Zubizarreta, Paredes, Rosenbaum. Matching for balance, pairing for heterogeneity in an observational study of the effectiveness of for-profit and not-for-profit high schools in Chile. Ann Appl Stat 2014; 8: 204–231.

42. Rosenbaum, Ross, Silber. Minimum distance matched sampling with fine balance in an observational study of treatment for ovarian cancer. J Am Stat Assoc 2007; 102: 75–83.


	$X ⊥ ⊥ U$, missing non-linear term in PS model
PS, matching method	$X^{(1)}$	$X^{(2)}$	$X^{(3)}$	$X^{(1)2}$	$X^{(2)2}$	$X^{(3)2}$	$U$
No matching	$-0.13$ (0.04)	$-0.20$ (0.04)	$-0.81$ (0.04)	$-0.13$ (0.04)	0.58 (0.05)	$-0.89$ (0.03)	0.53 (0.06)
Logistic FE, PM	0.00 (0.03)	$-0.02$ (0.02)	0.12 (0.03)	0.00 (0.03)	0.77 (0.05)	$-0.37$ (0.06)	0.18 (0.05)
Logistic RE, PM	$-0.01$ (0.03)	$-0.04$ (0.02)	0.04 (0.03)	$-0.01$ (0.03)	0.76 (0.05)	$-0.47$ (0.05)	0.04 (0.04)
Logistic, PM	$-0.01$ (0.03)	$-0.04$ (0.03)	0.01 (0.01)	$-0.01$ (0.03)	0.77 (0.05)	$-0.14$ (0.04)	0.71 (0.08)
GBM, PM	$-0.01$ (0.03)	0.01 (0.03)	0.02 (0.02)	$-0.01$ (0.03)	$-0.01$ (0.03)	0.01 (0.01)	0.88 (0.09)
GBM FE, PM	$-0.08$ (0.11)	$-0.01$ (0.11)	0.02 (0.20)	$-0.08$ (0.11)	0.16 (0.30)	$-0.08$ (0.18)	0.00 (0.27)

	Unadjusted		Pair matching FE		Pair matching GBM
Frequency, Hz	Estimate	CI	Estimate	CI	Estimate	CI
Averaged left and right ears
500	0.53	(0.17, 0.88)	0.48	(0.06, 0.90)	0.44	(0.00, 0.88)
1000	0.22	($-0.09$, 0.53)	0.15	($-0.25$, 0.54)	0.42	(0.02, 0.81)
2000	0.22	($-0.11$, 0.55)	0.13	($-0.27$, 0.53)	0.21	($-0.20$, 0.61)
3000	0.08	($-0.27$, 0.44)	$-0.07$	($-0.50$, 0.36)	0.16	($-0.28$, 0.61)
4000	0.17	($-0.26$, 0.60)	0.00	($-0.52$, 0.52)	$-0.04$	($-0.57$, 0.50)
6000	0.54	(0.00, 1.08)	0.19	($-0.49$, 0.87)	0.41	($-0.28$, 1.10)
8000	0.60	(0.01, 1.18)	0.11	($-0.58$, 0.81)	0.48	($-0.22$, 1.19)
Separate left and right ears
500	0.57	(0.27, 0.87)	0.48	(0.05, 0.91)	0.53	(0.08, 0.98)
1000	0.27	(0.01, 0.54)	0.12	($-0.28$, 0.53)	0.36	($-0.05$, 0.77)
2000	0.28	(0.00, 0.56)	0.08	($-0.33$, 0.49)	0.20	($-0.21$, 0.62)
3000	0.17	($-0.14$, 0.48)	$-0.11$	($-0.55$, 0.34)	0.16	($-0.31$, 0.62)
4000	0.27	($-0.10$, 0.63)	$-0.03$	($-0.56$, 0.51)	$-0.07$	($-0.62$, 0.47)
6000	0.66	(0.20, 1.13)	0.22	($-0.48$, 0.92)	0.33	($-0.38$, 1.04)
8000	0.72	(0.22, 1.22)	0.01	($-0.71$, 0.72)	0.39	($-0.35$, 1.13)