Abstract
Keywords
Introduction
When data come from observational studies estimating causal effects can be challenging due to potential confounding. In this context, statistical methods to control for potential sources of confounding are necessary in order to get consistent estimates of causal effects. One tool that is frequently used across a range of these statistical methods is the propensity score (PS).
1
In the case of a binary treatment or exposure,
In this article, even though the observations are clustered, we will assume that both treatment and outcome occur at the individual level. This is not the same as cluster randomized trials where treatment is assigned at the cluster level. 2 This means that the treatment effect will be measured at the individual, rather than cluster level. In this setting the cluster can be viewed as a potential confounder which needs to be considered for matching or weighting. However a cluster variable can often present specific challenges which we will discuss in this article. Some of the challenges are due to the ‘cluster variable typically being a categorical variable with a large number of categories and sometimes relatively few observations per category. The techniques in this article can be useful for data sets that would not typically be thought of as having a hierarchical or clustered structure, if there is a confounder that is a categorical variable with a large number of categories, and a potentially low number of observations per category.
Typically, PS matching matches a single treated subject to a single untreated subject with the closest possible PS, otherwise known as pair matching. Pair matching is used to estimate the average treatment effect (ATE) among the treated (ATT). Another matching method, known as full matching, allows for different numbers of treated and untreated subjects in each matched grouping as long as there is at least one treated and one untreated subject in each group, and works to minimize the overall difference in PS between matched subjects.3,4 Full matching can be used to estimate either the ATT or the ATE. 5 In this article, we focus on estimating the ATT.
One advantage to PS matching methods is that they are robust to certain mis-specifications of the PS model. There are certain mis-specifications of the PS model that will still lead to balancing scores.6,7 These include omitting polynomial terms for univariate PS models or using the incorrect link function for any PS model that has a generalized linear model form. The balancing scores that result from these model mis-specifications can be used for matching in the same way as the PS. In this article, we show that the bias of ATT estimates based on PS matching when the PS model is mis-specified depends on the true outcome model as well as the true PS model. For instance, failing to include non-linear terms, such as polynomial or spline terms, or interaction terms in multivariate PS models may not lead to large bias if the true outcome model is a linear model without any non-linear or interaction terms.
Although parametric PS methods are robust to certain types of model mis-specification, recent non-parametric methods such as generalized boosted models (GBMs)8–10 have been developed which make even fewer assumptions about the true PS model. These non-parametric methods generally assume that the data are independent. In this article, we investigate a number of ways to extend these non-parametric methods to clustered data including adding covariates for cluster membership in the non-parametric propensity model as in fixed effects (FEs) regression. Another option is to ignore clustering in PS estimation, and control for clustering in the outcome regression model; however, this requires additional assumptions which we will discuss in Section 2. There are matching methods that do not use the PS, including Mahalanobis distance matching (MDM) or coarsened exact matching (CEM). One advantage to these methods is that they do not require estimation of the PS model. 11 However, these methods may not be suitable for clustered data, or any data set where there is a categorical confounder with a large number of categories. This is because distance measures such as Mahalanobis distance cannot be meaningfully defined for categorical variables. Similarly, CEM with categorical variables may not be feasible when the number of observations per category is too small.
There have been a number of papers exploring PS methods, both matching and weighting, for clustered data.12–16 Many of these papers look into using either FE or random effects (RE) to estimate the PS. The papers that study PS matching also investigate requiring or preferring each matched group to come from the same cluster.15,16 One advantage to these methods is that they can control for confounding due to unmeasured cluster level covariates. In this article, we consider the scenario where there are cluster level unmeasured confounders. We investigate the methods mentioned above, while also considering the method where we ignore clustering in the PS estimation and account for it in the outcome model. We discuss the assumptions necessary for each of these PS matching methods to give consistent estimates of the ATT in Section 2.
Additionally, we consider a more complex set up in which there is a multi-level clustering structure for the outcome. This is inspired by the CHEARS data in which each individual may act as a cluster for their left and right ears, nested within a larger cluster of testing site.
The rest of the article is structured as follows: Section 2 describes PS matching methods for clustered data, Section 3 provides a comparison of new and existing methods using simulation studies, Section 4 provides an example of PS matching and regression techniques using data from the CHEARS study, and Section 5 concludes.
Estimation of causal effects for clustered data
This article focuses on identifying causal effects for a clustered or hierarchical data structure. We start with the case where each individual belongs to a single cluster. We denote the treatment/exposure for the
The methods we consider need additional assumptions in order to be able to estimate either the ATT or ATE. These are: consistency —

Two DAGs indicating potential causal relationships between outcome
Fixed and REs for PS models
The most common method for estimating the PS is logistic regression, and when we have clustered data it is possible to include a cluster level FE or RE to account for any unmeasured cluster level covariates. The FE model can be defined as
While both FE and RE models are able to control for confounding due to unmeasured cluster level variables, a difference between the two methods is how they handle measured cluster level covariates. In the case of an FE model it is not possible to estimate coefficients for any measured cluster level variables, as any cluster level covariates would be perfectly collinear with the FE. Alternatively, for the RE model, coefficients can be estimated for cluster level covariates just as they are for individual level covariates.
One limitation of the logistic FE and RE models is the parametric assumptions they make about the form of the PS. This includes the link function as well as the component within the link function. Because of the potential for mis-specification of the PS model, some researchers suggest using matching methods that do not use the PS.
11
For clustered data, these methods, including MDM or CEM, can often present larger challenges than PS matching. This is because a cluster variable is typically categorical with no ordering, so the Mahalanobis distance cannot be defined for MDM, and CEM would require all matches to be from the same cluster, which may not be reasonable if there are a small number of subjects per cluster. Because of this, despite its challenges, PS matching may be preferred for certain clustered data sets, or in cases where there is a categorical confounder with a potentially small number of subjects per category. In addition, certain PS model mis-specifications may still lead to the consistent estimation of a balancing score that is not the PS.6,7 In our set up, a balancing score,
PS matching and regression
Non-parametric methods including GBM have previously been used to estimate the PS.8–10 GBM is a method for combining many weak classifiers into a single strong classifier. This is done by fitting a series of regression trees, in which each tree gives the best fit for the residuals of the previous tree. At the final step all trees are combined to create a piecewise constant function. A full description, including methods to reduce potential overfitting, can be found in the work of Friedman.19,20 Other non-parametric methods such as random forest
21
can also be used for PS estimation. For PS estimation with a binary treatment GBM can be thought of as a binary classification problem. However, the performance of non-parametric methods such as GBM can suffer when trying to account for clustered data by including a FE for each cluster, even for relatively large cluster sizes. We investigate this through simulations in Section 3. Alternatively, semi-parametric methods combining machine learning algorithms with RE have been proposed and could be used to estimate PS.22–24 However these methods do not work if there is correlation between
For the reason mentioned above, we consider only including individual level measured variables in non-parametric PS methods and ignoring clustering of the data. However, in order to get an unbiased estimate of the ATT it may still be necessary to control for confounding due to unmeasured cluster level variables. We propose to do this by including a FE in the outcome model after matching based on PS estimates using non-parametric methods that only include individual level variables. Including variables in the outcome regression that are not included in the PS model can lead to bias. 25 However, Remark 1 can be used to define a set of assumptions under which the proposed method will lead to unbiased estimates of the ATT.
Assume that the PS follows the form
To show that Remark 1 is true, note that there exists a function
Overview of whether PS matching using each PS estimation method is valid for controlling for confounding due to
PS: propensity score; FE: fixed effect; RE: random effect; GBM: generalized boosted model.
In the Section 2.1 we discuss methods for estimating the PS when data are clustered. After PS estimation, there are a number of ways to create a matched data set. These include 1:1 or pair matching, 1:
For clustered data, there are certain trade-offs to matching methods that are not an issue when observations are all independent. As we noted in the Section 2.1, under certain assumptions, it is possible to combine non-parametric PS estimation that only includes individual covariates with FE outcome regression. However, since the full matching method from Hansen 4 requires the estimation of separate treatment effects for each matched group, including FE for cluster in the outcome model can decrease the precision of estimates by including a large number of parameters to estimate. Another method for controlling confounding due to cluster level variables is requiring or preferring matches to be from within the same cluster.15,16,32 In the within-cluster matching method, all treated subjects are matched to the untreated subject within their cluster that has the closest PS. If a caliper is used and there is no untreated match within the given caliper in the same cluster, then the treated subject is not included in analysis. In the within-cluster preferred matching method, if a treated subject has at least one untreated subject in the same cluster with a PS within a given caliper, they will be matched to the untreated within its cluster that has the closest PS; otherwise they will be matched to the untreated with the closest PS outside its cluster. In this within-cluster preferred matching method, a treated subject is only discarded if there are no untreated subjects within the given caliper across all clusters. For the within-cluster method, matching without replacement can lead to a large number of subjects being excluded from the analysis. When using matching without replacement it is not trivial to define the within-cluster preferred method. Because of this we use matching with replacement for these methods. This makes it necessary to account for having subjects being in multiple matched pairs when estimating confidence intervals (CIs).
In this article, we will focus on estimating the ATT using optimal 1:1 matching without replacement, and compare this to the within-cluster and within-cluster preferred matching methods which both do matching with replacement.
Outcome models
When the PS estimation does not account for clustering, it is necessary to control for confounding due to cluster level covariates in the outcome model using FE regression. Even when PS estimates account for clustering, using a FE model for the outcome may still improve efficiency relative to linear outcome models. For pair matching, this can be written as
In the preceding subsections, we considered only a single level of clustering. Here we consider multiple levels of clustering for the outcome, but not for the treatment. This is to replicate the situation where subjects are clustered by treatment site, and the outcome is for each subject’s ears, and treatment is at the subject level. In this case we denote
Simulations for clustered data
In this section, we test the performance of the PS matching methods described in Section 2. We consider PS estimation using logistic regression without accounting for clustering, logistic FE regression, logistic RE regression, GBM, and GBM FE. We discuss the selection of GBM hyperparameters in the supplementary materials. We use optimal pair matching without replacement with a caliper of 0.2 times the standard deviation (SD) of the PS. We use this matching method because it allows for simple methods to estimate the ATT after matching, and using the same matching method allows us to compare the performance of different PS estimation techniques. In addition to these methods, we consider the within-cluster matching and within-cluster preferred matching methods,15,16,32 which use logistic regression to estimate the PS and then either requires or prefers treated subjects to be matched to untreated subjects within the same cluster, using pair matching with replacement. This is discussed more in Section 2.2. Although we focus on matching without replacement for other methods for simplicity of analysis, it is not straightforward to use matching without replacement when performing within-cluster or within-cluster preferred matching. Because of this we include a subject level random intercept in the outcome model to account for subjects who are in multiple paired matches. In the Supplemental material, we also consider FE or RE regression for the outcome model, as well as two IPW techniques from Li et al. 12 using PS estimated from logistic FE regression and GBM.
In the first set of simulations we consider the setting where the individual level covariates,
In addition to the PS and outcome models defined by equations (4) and (5) we consider PS and outcome models with quadratic terms. Specifically, the PS model is
Table 2 reports the bias, SD, and CI coverage rate for each method based on 1000 simulation replicates under the scenario where
Empirical bias (SD) and 95% confidence interval CR for estimation of ATT in four different simulation scenarios based on 1000 simulation replicates for five PM methods, and two WC matching methods.
True ATT is 0.75, which is the coefficient for treatment in the outcome model. Logistic and GBM PS models contain
It is interesting to note that the logistic FE PS model performs well, even in the case when the PS model is mis-specified. This is because it is able to sufficiently balance all the relevant confounders, even though it is mis-specified, as we will see in Section 3.2. Even though
This highlights an advantage of PS matching, which is that even certain mis-specified PS estimates may act as balancing scores. We also see this in the setting where
Across all settings the CI’s for the GBM FE PS matching method shows undercoverage even when the bias is small. This is likely due to the high standard deviation and instability of the estimates. The CIs for the other PS matching methods show close to the desired 0.95 coverage when the bias is low. For the within-cluster and within-cluster preferred matching it is necessary to include a subject level random intercept in the outcome model to ensure the desired CI coverage due to matching with replacement.
Additional methods which do not use matching, including regression and PS weighting can be found in Table S1 of the Supplemental materials.
In Section 1.3 of the Supplemental materials (Table S2) we report the results for PS matching using full rather than pair matching. For full matching, the logistic FE PS model performs the best, although unlike pair matching it is biased when the estimated PS model does not include non-linear terms that are in the true PS model. Section 1.4 (Table S3) of the Supplemental materials report the same results as Table 2, with multilevel clustering for the outcome, but not the treatment. This is done to mimic the hearing loss data in the CHEARS data set. The results are similar to those without multilevel clustering for the outcome in terms of bias and CI coverage. Additionally, Section 1.5 of the Supplemental materials (Table S5) reports results when
Table 2 gives a helpful overview of which methods perform best in each simulation setting. Next, we consider balance for each of the covariates in the matched data sets. Balance measures are often thought of as a way to see if matching worked well. Specifically, if each of the potential confounders are balanced across the matched groups, then it is thought that the matching was successful and therefore PS matching can be used to control for confounding. In this section, we highlight some instances where balance measures must be used with caution in determining how well PS matching performs at controlling for confounding in the clustered data set ups. As a measure for balance, we used the standardized mean difference (SMD). The SMD for a variable is a weighted difference between the mean of this variable for treated and untreated subjects divided by the pooled SD. The weighting is proportional to the harmonic mean of the number of treated and untreated subjects in each matched grouping.34,35 In addition to the three observed covariates,
Table 3 presents the SMD averaged across 1000 simulation replicates with data generated using equation (6) as the true PS model with
Mean (SD) standardized mean difference across 1000 simulations for all covariates and their quadratic terms for matched data sets using pair matching (PM). No quadratic term included in estimation of PS for all methods.
Logistic and GBM PS models contain
,
,
as covariates; logistic FE and GBM FE PS models contain
,
,
as covariates as well as fixed effect for cluster; logistic RE PS model contain
,
,
as covariates as well as random intercept for cluster. PM: pair matching; PS: propensity score; FE: fixed effect; RE: random effect; GBM: generalized boosted model.
Mean (SD) standardized mean difference across 1000 simulations for all covariates and their quadratic terms for matched data sets using pair matching (PM). No quadratic term included in estimation of PS for all methods.
Logistic and GBM PS models contain
We also note that the GBM FE PS model has an SMD value close to zero for many of the covariates including quadratic terms and the unmeasured cluster level covariate,
Additionally, we can see that while PS matching using FE or logistic RE regression will reduce the SMD for
As an illustrative example we apply PS matching methods to analyze the hearing data from the NHS II CHEARS. Specifically, we focus on the causal effect of aspirin use on hearing deterioration. CHEARS examines risk factors for hearing loss among participants in several large ongoing cohorts, including the NHS II, an ongoing cohort study of 116,430 female registered nurses in the United States, aged 25–42 years at enrollment in 1989. In a sub-cohort of these NHS II participants, the CHEARS Audiology Assessment Arm (AAA), longitudinal changes in pure-tone air and bone conduction audiometric hearing thresholds were assessed. The methods for the CHEARS AAA are described elsewhere.36–38 A priori, participants who reported excellent or good hearing and no history of otologic disease were invited to participate to examine early threshold changes. The study population includes 3136 participants in the ongoing NHS II cohort study who completed hearing assessments at both baseline (2012–2015) and 3-year follow-up (2015–2018). Hearing thresholds were measured using pure tone audiometry at the frequencies 0.5, 1, 2, 3, 4, 6, and 8 kHz in the left and right ears.
The frequency-specific outcome for all analyses is the hearing threshold at 3-year follow-up, with a larger value indicating worse hearing. We use both average hearing threshold between both ears as well as each ear individually. The exposure is at least weekly aspirin use (including baby aspirin) as measured at the 2011 NHS II questionnaire, which is latest questionnaire that is completed completely before the baseline time window.
At baseline, each subject was measured at one of 34 sites, while at year three each subject was measured at one of 33 sites. Cluster membership is determined by testing sites at both baseline and 3-year follow-up, and subjects had to be tested at the same site for both timepoints to be in the same cluster. As an example, subjects who were tested at site ‘A for baseline and site ‘B for 3-year follow-up would be considered to be in a cluster which we can denote as ‘AB. Subjects who were tested at site ‘B for baseline and site ‘A for 3-year follow-up would be considered to be in a cluster which we can denote as ‘BA. In total there were 48 different combinations of baseline and 3-year follow-up sites that had at least five subjects, with a maximum of 201 subjects per cluster. The list of potential measured individual level confounders we considered included age at baseline, BMI in 2011 (categorized as <25, 25–29, 30–34, 35–39, 40+), total physical activity in 2011 (quintiles), smoking status in 2011, hypertension by baseline, diabetes by baseline, Dietary Approaches to Stop Hypertension (DASH) diet score in 2011 (quintiles), total caloric energy intake in 2011, and the hearing threshold for the same frequency at baseline. The mean and standard deviation for age and caloric intake, as well as the proportion in each category for all other confounders broken out by aspirin use is included in Section 2 of the Supplemental materials.
A test for association between each of the individual level covariates and testing site indicates association between site and age at baseline as well as DASH score. Therefore, it is important to use methods that are robust to associations between testing site and individual level covariates. In our primary analysis we use a logistic FE regression model to estimate the PS combined with a FE outcome model, which also includes an indicator for matched grouping and baseline hearing threshold averaged between the left and right ears. The PS of interest is the probability of taking aspiring in 2011 given the covariates listed above. In all cases we used optimal pair matching without replacement to estimate the ATT. We consider the hearing threshold for left and right ears as correlated measurements, so we also include an individual level random intercept to account for correlation between left and right ears. This is not necessary when we consider the average hearing threshold for left and right ears as the outcome. In addition to using logistic FE regression to estimate the PS, we also consider using GBM. This can be used as a robustness check for potential violations of linearity in the logistic FE model, but it may introduce bias due to the possible association between individual level covariates and testing site.
Table 4 presents the estimates for the ATT of aspirin use on hearing threshold at 3-year follow-up measured in dB, including both unadjusted and pair matching estimates. After matching the effect of treatment was estimated in a regression model that controlled for cluster using FE as well as the hearing threshold at baseline. The estimates after PS matching are attenuated relative to the unadjusted estimates for most frequencies. The estimated ATT of 0.48 (95% CI: 0.06–0.90) for averaged ear analysis and 0.48 (95% CI: 0.05–0.91) for both ear analysis indicates that hearing deterioration at 500 Hz is greater among those who take aspirin at least weekly. The results using GBM are similar to those using logistic FE regression for 500 Hz. There are larger differences between the point estimates using logistic FE and GBM at higher frequencies, however the CIs include zero for both methods at all other frequencies besides 1000 Hz, where the CI for the averaged ear analysis using GBM is completely above zero.
ATT estimates for the effect of aspirin use on hearing threshold at 3-year follow-up based on data from Nurses Health Study II Conservation of Hearing study.
PS is estimated using logistic FE regression or generalized boosted models. For averaged left/right ears, outcome model is FE linear model including baseline hearing threshold and FE for cluster. For separate left/right ears, outcome model is RE linear model including baseline hearing threshold, FE for cluster and random intercept for individual. CI: confidence interval; PS: propensity score; FE: fixed effect; RE: random effect; ATT: among the treated.
ATT estimates for the effect of aspirin use on hearing threshold at 3-year follow-up based on data from Nurses Health Study II Conservation of Hearing study.
PS is estimated using logistic FE regression or generalized boosted models. For averaged left/right ears, outcome model is FE linear model including baseline hearing threshold and FE for cluster. For separate left/right ears, outcome model is RE linear model including baseline hearing threshold, FE for cluster and random intercept for individual. CI: confidence interval; PS: propensity score; FE: fixed effect; RE: random effect; ATT: among the treated.
In order to check the balance of the matched data sets we calculated the SMD for each of the individual level covariates included in the PS models. We do this for the matched data sets created using pair matching with the PS estimated using logistic FE regression, as well as the full unmatched data set. Tables S9 to S15 report the balance measures for the unmatched data set as well as the matched data set for each frequency for both PS matching using logistic FE regression and GBM. In the original data set, a number of the individual level covariates are unbalanced between the aspirin groups. After pair matching using logistic FE regression we see much better balance across the covariates (SMD

Overlap plots comparing distributional overlap before and after PS pair matching using FE logistic regression. PS: propensity score; FE: fixed effect.
In this article, we study the use of PS matching to estimate causal effects for clustered observational data. These methods make it possible to control not just for measured variables, but also unmeasured cluster level variables through the inclusion of FE or RE in the PS or outcome model. The methods we focus on are for studies where treatment or exposure is assigned at the individual, rather than cluster level. The considerations for matching in studies where treatment or exposure is assigned at the cluster level will be different. One of the advantages to PS matching is that it can lead to unbiased estimates of the ATT even for certain mis-specifications of the PS model, including using the incorrect link function. 6 We highlight that the types of mis-specifications of the PS model that can still lead to unbiased causal estimates in PS matching depends on the true outcome model as well as the true PS model. When using pair matching without replacement, the estimated CIs for the coefficient of the exposure from the outcome model perform well in simulations despite ignoring potential variance due to estimating the PS. However, the CIs under pair matching with replacement must account for correlation between pairs that include the same untreated subject in order to get the desired coverage. We are able to do this by including a subject level random intercept. Currently the within-cluster and within-cluster preferred matching algorithms are not easily defined when trying to do matching without replacement, and defining these methods for matching without replacement is a potential area of future research.
Even though PS matching is robust to certain mis-specifications of the PS model, non-parametric methods can still offer a more robust version of PS estimation. Methods such as GBM make minimal assumptions about the form of the PS. However, these methods suffer from poor performance when trying to include FE for clustering. For this reason, it can be useful to instead control for any unmeasured cluster level variables using FE in the outcome model. We discuss the assumptions necessary to use this method in Section 2.1.2. We show that this method can get consistent estimates if (i) the individual level covariates included in the PS model are independent of the unmeasured cluster level covariates and (ii) the true PS model is an additive function of the measured covariates included in the PS model and any unmeasured cluster level covariates. However, if these conditions are not met it is necessary to account for clustering in the PS estimation and not just in the outcome regression model. This is related to the potential issues of controlling for covariates not included in the PS model.
25
When the necessary assumptions are met we show that using GBM along with FE regression can be used, and is robust to potential model mis-specification. Alternatively, when these assumptions are not met, particularly when
Identifying non-parametric methods for clustered data that can be used to better estimate PS is an important area of future research. This would allow for PS matching to be much more robust to potential model mis-specification, while still allowing for correlation between observed individual level variables and unmeasured cluster level variables. Current methods that can account for these either have to make assumptions about the independence between measured and unmeasured variables, or make assumptions about the form of the PS. Finally, this article primarily focuses on continuous outcomes. It would also be of interest to investigate the performance of PS matching methods for clustered data for binary or categorical outcomes.
Supplemental Material
sj-pdf-1-smm-10.1177_09622802221133556 - Supplemental material for An overview of propensity score matching methods for clustered data
Supplemental material, sj-pdf-1-smm-10.1177_09622802221133556 for An overview of propensity score matching methods for clustered data by Benjamin Langworthy, Yujie Wu and Molin Wang in Statistical Methods in Medical Research
Footnotes
Acknowledgements
Data availability statement
Declaration of conflicting interests
Funding
Supplemental material
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
