Abstract
Introduction
The effectiveness of educational interventions often is evaluated using clustered randomized trials (CRTs) in which random assignment occurs at the group rather than individual level (Hedges & Hedberg, 2007; Raudenbush, 1997). Given the natural groupings of students within classrooms and schools as well as the practical and political challenges of individual-level random assignment in educational settings, CRTs are a common tool for drawing causal inferences about educational policies, practices, and innovations.
When an intervention is randomly assigned, differences in outcomes between treated and untreated groups can be causally attributed to the intervention. However, randomized trials, even with group-level assignment, are not always feasible. In such cases, researchers must turn to observational analyses. One alternative to a CRT is a clustered observational study (COS). In a COS, treatment assignment occurs at the group level but through some uncontrolled process.
Although the literature on observational studies for deriving causal inferences when treatment selection occurs at the individual level is well developed (Rubin, 2007, 2008), the same is not true for COSs. For COSs, the literature remains underdeveloped, with no consensus on best practices. This is surprising in the context of educational research, where treatment selection often occurs at the group level. In this article, we outline the key considerations and steps for designing and conducting a COS. We highlight differences between COSs and observational studies with individual-level treatment selection, and we propose a framework for the design and analysis of COSs.
Our framework employs the counterfactual model for causal inference. We begin by advocating that investigators design COSs following the principle of target trial emulation—that is, according to the cluster randomized trial that they ideally would have conducted. Although the concept of target trial emulation is not new, we highlight considerations unique to the COS context and associated hierarchical data.
Next, we discuss the importance of understanding the process through which sites were selected into treatment. We discuss why cluster-level, rather than individual-level, treatment assignment is often preferable for deriving causal inferences, as it can protect against selection bias even with non-random selection into treatment. We introduce notation, articulate assumptions necessary for causal inference in the context of a COS, and argue for the central role of analyses that probe the sensitivity of conclusions to an unobserved confounder. Next, we highlight possible approaches to statistical analysis. In this section, we discuss a new form of matching designed specifically for COSs. In sum, our article aims to make a methodological contribution with respect to design rather than analysis. That is, we are not introducing a new statistical model; instead, we are highlighting critical design aspects of COSs.
We illustrate the study design process and related concepts with an evaluation of myON, a summer reading intervention in Wake County, North Carolina. In particular, the COS design process requires gathering information on how the treatment was assigned and structuring the analysis to reflect this assignment process. We find no evidence that access to the myON tool improved student outcomes, but this does not diminish the value of the example. As with CRTs, when conducting a COS, all of the key elements of study design should occur prior to examining impacts. Furthermore, careful study design should lead policy makers to place more stock in results, even if estimated effects are not educationally meaningful.
Research Design Principles for Clustered Observational Studies
Here, we outline key considerations for designing COSs. We begin by discussing target trial emulation.
Target Trial Emulation
Target trial emulation calls for applying design principles from randomized trials to the analysis of observational data (Hernán & Robins, 2016). Under the target-trial approach, the investigator ties the design and analysis of the observational study to the experimental trial it emulates, and causal estimands of interest are derived from the hypothetical target trial. Whether the causal effect from this target trial can be estimated consistently using observational data depends on certain assumptions, known as identification assumptions. In observational studies, investigators typically assume that any differences between treated and control groups are observable—that no unobserved differences exist—and that covariate adjustment can handle observable differences. We discuss identification assumptions and covariate adjustment further below.
The purpose of target trial emulation is to improve the quality of observational studies through the application of trial design principles. In an experimental study, the sample and study design are clearly delineated to enable randomization. In contrast, observational studies, particularly those formulated after program implementation, often require investigation to inform decisions about, and the articulation of, sample construction and study design. Imagining the hypothetical experiment that would generate the observational data under study (Cochran & Rubin, 1973; Rubin, 2008) initially seems simple. However, it can be challenging in practice, since we might conceive of several different hypothetical experiments that generate a given dataset. Here, we outline two CRT study designs common in educational interventions, corresponding to situations where (1) whole groups are assigned to a given treatment and (2) subsets of whole groups are assigned to a given treatment according to qualifying criteria.
Design 1: Clustered Treatment Assignment
Design 1 handles cases where complete clusters (e.g., whole classrooms, schools) are selected for treatment, and all units within a cluster either do or do not receive treatment. Under Design 1, we seek to mimic a CRT in which treatment assignment occurs at the cluster level, and all units within selected clusters receive (or are intended to receive) treatment. Under the COS analogue, cluster-level covariates are critical, given that the assignment occurs at this level and is presumed to have been made based on cluster-level characteristics alone. This design would be appropriate for assessing the impact of school-wide reform efforts, such as Success for All (Borman et al., 2007).
Design 2: Clustered Treatment Assignment for Student Subsets
CRTs often assume that the data include all units in each cluster or a random sub-sample of all units, such that the selected units are representative of the cluster as a whole (Donner & Klar, 2004; Torgerson, 2001). However, educational interventions are often allocated in a purposeful, targeted (e.g., non-random) fashion within clusters. Under Design 2, the target trial is a CRT with nonrandom, student-level selection into the treatment within clusters; clusters are assigned to treatment, but within selected clusters only some units receive treatment. This might occur, for example, if an intervention targeted students who are struggling academically. In such cases, the causal estimand is a group-level contrast for the subset of students within their schools who are at risk for treatment.
The critical distinction from Design 1 is that under Design 2, final treatment assignment of an individual depends on school- and student-level characteristics. Selection of units for treatment within a cluster is analogous to nonrandom attrition. In a CRT, the investigator would need to correct for this selection bias. The same is true in a COS. That is, if the treatment is applied only to a subset of students within selected clusters, the analyst may need to model a second selection mechanism. This implies that in a COS analogue to Design 2, covariate adjustment must account for data at both the school and student levels. Next, we introduce our motivating example and consider the target CRT with which it aligns.
Motivating Application: A Summer School Reading Intervention
In summer 2013, the Wake County Public School System (hereafter, WCPSS) selected myON, a computer-aided instruction program, for implementation at selected summer school sites with the goal of boosting summer school attendees’ reading comprehension. myON is a web-based software product that provides students with access to books and suggests titles based on their preferences and reading ability. Students at selected sites used the program for up to thirty minutes during the daily summer-school literacy block and could continue using it at home with a device and internet connection. At the time of its launch in WCPSS, the developers claimed that students using myON would improve comprehension through access to digital books that include “multimedia supports, real-time reporting and assessments and embedded close reading tools” (Capstone Digital, 2015). Given their prevalence and cost, rigorous, independent assessment of such curricular supplements is critical to sound investment decisions by educational agencies.
The study sample includes 3,434 summer school students from 49 different WCPSS elementary schools who attended summer school at one of 19 sites. Due to technical constraints, only some summer school sites used myON. As such, all students in a school were exposed to the software if they attended summer school at a selected site. In a COS designed to study the effects of myON, Design 2 is the relevant target trial, because myON was assigned to schools but only students required to attend summer school were exposed to the treatment. Therefore, we are interested in contrasting outcomes for groups of summer school students who were and were not exposed to myON.
Our key outcome is student-level reading performance, measured using Curriculum Associates’ i-Ready Reading Assessments. Students sat for assessments in reading and mathematics at the beginning and end of summer school and results were reported in scale scores with a possible range of 0 to 800 (Curriculum Associates, 2015). Students also received a reading Lexile score used for selecting an initial bundle of digital books within myON (MetaMetrics, 2012).
Notation
Target trial emulation applies to study design, broadly, as well as analytic notation, specifically. Here, we introduce notation applicable to CRTs and COSs. A defining feature of a clustered study is that individual units (e.g., students) are organized within clusters (e.g., schools) and assigned to a treatment or control condition at the cluster level. Generally, for applications with students nested within schools, each school j = 1, …, J enrolls students i = 1, …, n_j, and the indicator Z_j equals 1 if school j receives the treatment and 0 otherwise.
We define causal effects using the potential outcomes framework (Neyman, 1923, 1990; Rubin, 1974). Prior to treatment, each student has two potential responses: Y_ij(1), the outcome student i in school j would exhibit if school j received the treatment, and Y_ij(0), the outcome that would be observed if school j did not.
With potential outcomes defined, we can define the causal estimand, the target counterfactual quantity of interest. In a COS in an educational setting, one reasonable estimand is the following student-level contrast: E[Y_ij(1) − Y_ij(0) | Z_j = 1], the average effect of the treatment on students in treated schools.
Assumptions
The first key assumption is the Stable Unit Treatment Value Assumption (SUTVA; Rubin, 1986). The notation above implies SUTVA. Here, we elaborate on what SUTVA indicates in a COS. SUTVA includes two components: (1) the treatment levels of each cluster are well defined, so that there are no hidden versions of the treatment; and (2) a student’s potential outcomes are unaffected by the treatment assignments of other students.
SUTVA’s second component assumes that the treatment for one student does not spill over to any control student. A benefit of clustered (rather than individual) treatment assignment is that it increases the plausibility of this component of SUTVA. In the COS (or CRT) context, spillover violating SUTVA would need to occur across treated and control schools, for example, if a treatment school student gave her myON account information to a control school student who subsequently used it. Although possible, this seems unlikely to be prevalent. Generally, judging the plausibility of the no-spillover assumption requires qualitative implementation information. In the myON context, we assume that SUTVA holds.
Are SUTVA violations a concern under Design 2? Since only a subset of students within schools are treated, spillovers might occur from treated to untreated students within treated schools. Although possible, such spillover does not threaten SUTVA here. Why? The causal effect of interest contrasts treated and control schools. Therefore, the relevant spillover is between treated and control schools, even if only some students are treated within a school. When Design 2 is the target trial, one might be substantively interested in such within-school spillover, but this is a different causal question. The analysis of treatment effects under interference is the focus of recent methodological work, for example, Aronow and Samii (2017) and Basse and Feller (2018). In the myON context, such within-school spillover is unlikely, as untreated students in treated schools do not attend summer school.
The next key assumption is the “selection-on-observables” assumption, which pertains to the treatment assignment process. This assumption has two parts. First, we assume that there is some set of covariates such that treatment assignment is random conditional on these covariates (Barnow et al., 1980). Formally, letting X_j denote the observed baseline covariates for cluster j, we assume (Y_ij(1), Y_ij(0)) ⊥ Z_j | X_j. That is, after conditioning on observed characteristics, treatment assignment is independent of the potential outcomes, as it would be in a CRT.
The second part of the selection-on-observables assumption pertains to “common support.” Formally, we assume that all clusters have some probability of being treated or untreated, such that 0 < Pr(Z_j = 1 | X_j) < 1 for every cluster j. When treated clusters exist for which no comparable control clusters are available (or vice versa), one common remedy is to trim the sample to the region of covariate overlap.
Trimming is not without consequence, however, as it changes the causal estimand. After trimming, the causal estimand describes the causal effect for the population for which the effect of treatment is marginal: units that may or may not receive the treatment. Changing the estimand in this way may be unproblematic if the data do not represent a well-defined population (Rosenbaum, 2012).
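The trimming step can be sketched as follows. This is a minimal sketch in Python, assuming cluster-level treatment probabilities (propensity scores) have already been estimated; the scores and treatment labels below are hypothetical, not from the myON study.

```python
import numpy as np

def trim_to_common_support(pscores, treated):
    """Keep only clusters whose estimated propensity score lies in the
    overlap region shared by treated and control clusters.

    pscores : array of estimated treatment probabilities, one per cluster
    treated : boolean array, True for treated clusters
    Returns a boolean mask of clusters to retain.
    """
    pscores = np.asarray(pscores, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    lo = max(pscores[treated].min(), pscores[~treated].min())
    hi = min(pscores[treated].max(), pscores[~treated].max())
    return (pscores >= lo) & (pscores <= hi)

# Hypothetical propensity scores for 8 clusters (illustrative only):
ps = np.array([0.10, 0.35, 0.50, 0.65, 0.30, 0.45, 0.60, 0.95])
z = np.array([False, False, False, False, True, True, True, True])
keep = trim_to_common_support(ps, z)
# The control cluster at 0.10 and the treated cluster at 0.95 fall
# outside the overlap region [0.30, 0.65] and are trimmed.
```

As the comment notes, the clusters with extreme scores drop out, which is precisely why the estimand becomes more local after trimming.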
Under these assumptions, we use one or more statistical adjustment methods to estimate treatment effects, as we discuss below.
Explicating the Assignment Process
The modern literature on observational studies emphasizes the role of the treatment-assignment mechanism (Rubin, 2008). Indeed, the assignment mechanism is critical to COS design. Since an observational study’s key assumption pertains to whether treatment assignment is based on observed data, understanding how treatment assignment operates is critical. Next, we review important aspects of clustered treatment assignment, including the advantages that cluster-level treatment assignment affords in observational studies.
In any observational study, the investigator should understand and explicate the treatment assignment process. We recommend the following steps. First, understand whether the assignment process can be described as a “natural experiment.” Nonexperimental, but haphazard or arbitrary assignment is often characterized as a natural experiment—the hope being that natural circumstances give rise to an arguably random assignment process (Murnane & Willett, 2010). Although haphazard treatment assignment can require considerable judgment and contextual knowledge to justify, the goal is to reduce the bias associated with treatment self-selection. For many natural experiments, analysts still rely on covariate adjustment. When covariate controls are introduced, the analyst is still relying, at least in part, on the selection-on-observables assumption. In this way, observational studies and natural experiments are related. In fact, the principles we outline for COSs apply to natural experiments.
Second, if a study is not a natural experiment, the investigator should identify the decision-makers responsible for treatment allocation; any factors used to determine assignment (Rubin, 2008); and whether the assignment process is such that decision makers controlled treatment allocation for others within some defined population. In education applications with grouped treatments, this third feature is common and preferable, the advantage being that the selection process is more likely to be based on observed information. Although self-selected treatment assignment may reflect observed factors, it is more likely influenced, at least partially, by unobserved factors, such as a child’s motivation or a family’s expectations regarding the treatment’s benefits.
For COSs in educational settings, outside decision makers who are not directly exposed to the treatment often control treatment assignment. For example, district officials often make decisions about selecting schools for treatments. This selection structure offers a key advantage. In general, it should be possible to identify who participated in the treatment-assignment process and the factors used in decision making. Qualitative information typically is critical in this process. For example, WCPSS centrally allocated myON to selected summer school sites based on factors including internet bandwidth, computer access, and regional distribution. 1 As a result, all summer school students who attended an elementary school close to a selected summer school site used myON. Schools had no input into program allocation. Treatment assignment therefore was primarily a function of school-level data available to district administrators, rather than, for example, teachers’ interest in myON. Thus, the selection-on-observables assumption appears reasonable in this context.
Third, understanding the assignment process allows the investigator to identify the study’s target trial analogue. In our application, beyond the selection of schools, a secondary, student-level selection process occurred, whereby students were identified for summer school based on their standardized test performance, per state policy. Thus, we must consider this second assignment mechanism. Because student-level selection was governed by state guidelines, student populations should not differ systematically across treatment and comparison schools. Taken together, we should expect imbalances in school-level but not necessarily student-level covariates when we compare baseline characteristics between treatment and comparison schools. As illustrated below, our data follow this pattern.
Next, Hansen et al. (2014) demonstrate the advantages of cluster-level treatment assignment in observational studies. Specifically, group-level treatment assignment can reduce the potential for selection bias. For technical details, we refer readers to Hansen et al. (2014) but here convey the intuition. myON is a commercial product; one might imagine a salesperson motivated to bias evidence in favor of the product. The most effective way to do so would be to form a treatment group of individually-selected, higher-performing students who would exhibit stronger reading performance regardless of whether they used myON. If the salesperson is required to select entire schools for myON, however, the mix of students within schools will make it more difficult to guarantee better outcomes under myON. By selecting intact groups, the salesperson is less able to target high performers who would bias results in myON’s favor. Therefore, group-level assignment helps to limit bias from purposeful treatment selection. A limitation is that the analyst cannot quantify how much bias is eliminated.
Statistical Analysis
Whatever the advantages of COSs, they remain observational studies. Thus, treated and control groups commonly will differ on baseline covariates, and the analyst will need to use a method of statistical adjustment to remove overt bias and increase comparability. Here, we highlight conventional and more modern approaches to statistical adjustment for COSs.
Statistical Adjustment Methods
In education, random-effects regression models are frequently used for statistical adjustment. In a COS that relies on the selection-on-observables assumption, covariates are added to the model to remove overt biases—observable differences between the treated and untreated clusters. A limitation of relying on regression-based strategies in a COS is that they can obscure a lack of actual overlap in covariate distributions between treatment and comparison schools. Areas outside of common support are particularly problematic, since they require extrapolation; in turn, results may suffer from model dependence. That is, conclusions may depend on the regression model’s functional form.
This is not to say that regression-based analysis is useless for COSs. Rather than turning directly to covariate-adjusted regressions to assess treatment effects, however, we advocate first taking steps to ensure balance and common support. Then, having obtained an analytic sample in which balance and common support hold, regression can be used for treatment effect estimation. We discuss this further below.
Propensity score methods are one alternative to regression modeling. In a COS, the statistical adjustment strategy needs to account for the data’s multilevel structure. With propensity score methods, this is done by estimating the propensity score using, for example, a random effects logistic regression model (Arpino & Mealli, 2011; Hong & Raudenbush, 2006; Li et al., 2013). However, multilevel models often fail to converge when used to estimate propensity scores (Zubizarreta & Keele, 2016). Therefore, although propensity score methods are a reasonable strategy when the treatment is allocated at the individual level, the same is not always true with cluster-level assignment. When model convergence issues hamper fitting propensity score models with hierarchical data, little can be done.
Matching
Matching provides another adjustment method designed to mimic a randomized trial by constructing a set of treated and control units that are comparable on observed, pretreatment characteristics. Matching methods primarily have been developed to handle individual-level treatment assignment, and a large literature has articulated best practice in this context (Rosenbaum, 2020). Matching studies have been used to evaluate socially-relevant interventions (Stuart, 2010), and methodological research has investigated the extent to which matching yields impact estimates similar to those achieved through experimental design (Cook et al., 2008; Dehejia & Wahba, 1999).
Just as we can use individual-level matching to mimic an individual-level RCT, we can conceive of matching to mimic a CRT by creating comparable treatment and comparison clusters. Although the COS is a natural observational analogue to the CRT, strategies for matching with grouped treatments are less well developed. Extant work has focused on multilevel data structures, but in applications where clusters matter in some way other than grouped treatment assignment. For example, Steiner et al. (2013) consider matching with multilevel data but assume individual-level assignment. Stuart (2007) discusses group-level matching using group-level data only. Stuart and Rubin (2008) also focus on matching with multilevel data but advocate building a comparison group from multiple sources when a single comparison site is not a sufficient match for a given treated group. This approach considers matching only on student-level characteristics, rendering it less relevant to COSs, in which school-level covariates are critical.
Recently, Zubizarreta and Keele (2016) and Pimentel et al. (2018) have developed matching methods specifically for COSs. The resulting matching method mimics a CRT by creating comparable treated and comparison clusters and units within clusters to remove overt bias at the individual and group levels.
In the context of COSs, we endorse matching methods for several reasons. First, matching tends to be more robust to a variety of data configurations—especially when treated and control covariate distributions do not have good overlap (Imbens, 2015). Second, matching methods allow for covariate prioritization to increase treatment-control comparability on covariates of critical importance from a scientific standpoint. For example, an investigator can opt to balance baseline test scores more closely than other covariates such as school size. Third, the investigator can trim the sample to yield the set of observations with the highest levels of comparability.
Our primary goal is to consider COS design. Nevertheless, here, we briefly summarize the mechanics of multilevel matching, as we illustrate an application in the case study below. We contrast this process with strategies for implementing standard, single-level matches. In a standard match, the user selects covariates on which to match, and these covariates are used in one of two ways. One option is for the analyst to first estimate a propensity score model and then match units on estimated propensity scores. Alternatively, the covariates may be used to generate a distance matrix—typically based on a Mahalanobis distance—which captures the multidimensional distance between each possible treatment-comparison matched pair. In a basic pair match, treated and comparison units are matched to minimize these distances. Either of these matching variants can be applied to COS data if only school-level covariates are used. However, with multilevel data available, we seek to incorporate student-level information into the school match, even when our goal is to match at the school level only.
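The distance-matrix variant described above can be sketched as follows, under the simplifying assumption that only school-level covariates are used. The covariates are simulated, and SciPy’s optimal assignment routine stands in for a full optimal matching implementation; dedicated software (e.g., the R package matchMulti) implements the multilevel match itself.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mahalanobis_pair_match(X_t, X_c):
    """Pair each treated cluster with a distinct comparison cluster so that
    the total Mahalanobis distance across matched pairs is minimized.

    X_t : (n_t, p) covariate matrix for treated clusters
    X_c : (n_c, p) covariate matrix for comparison clusters (n_c >= n_t)
    Returns an array where entry i is the comparison row matched to
    treated row i.
    """
    X = np.vstack([X_t, X_c])
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))   # pooled covariance
    diff = X_t[:, None, :] - X_c[None, :, :]         # all pairwise differences
    dist = np.sqrt(np.einsum("ijk,kl,ijl->ij", diff, S_inv, diff))
    rows, cols = linear_sum_assignment(dist)         # optimal pair assignment
    return cols[np.argsort(rows)]

rng = np.random.default_rng(0)
X_t = rng.normal(size=(4, 2))    # 4 hypothetical treated clusters, 2 covariates
X_c = rng.normal(size=(10, 2))   # 10 hypothetical comparison clusters
pairs = mahalanobis_pair_match(X_t, X_c)
```

Because the assignment is solved jointly rather than greedily, no comparison cluster is used twice, and clusters outside the pool of best matches are simply left unmatched, which is one way the trimming discussed earlier arises in practice.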
To implement a multilevel match, the analyst must specify several parameters. First, the analyst must identify the student- and school-level covariates to be used in the matching algorithm. The analyst also must specify the design. Here, we select between Design 1 (matching only at the school level) and Design 2 (matching at the school and student levels). Next, the analyst can specify the balance priority for the school-level covariates such that the algorithm will seek to balance higher priority covariates before lower priority covariates. Based on this information, the software computes all possible treated-to-comparison cluster pairings and selects the match that optimizes covariate balance subject to these priorities.
Although random-effects regression models alone are not our preferred method for COS analysis, they can be fruitfully combined with matching. After matching, the analyst can regress the outcome on the treatment indicator using a random-intercepts model. Regression modeling is also useful in that post-matching covariate adjustment with regression can handle imbalances that remain after matching. That is, any covariates that are not fully balanced can also be included in the post-match regression model to further reduce bias (Imbens, 2015). As such, regression models are a useful analytic tool once matching is complete.
Overlap
As discussed above, a key assumption for a COS is common support or overlap of baseline covariate distributions. When little overlap exists between treated and control covariate distributions, trimming units via matching is one method to enforce overlap. Care should be taken, however, as after trimming, the causal estimand is more local; it applies only to a subset of all treated units. In a COS, trimming even a small number of treated schools may mean losing a large percentage of treated units. In other words, trimming even a small number of clusters may make the treatment effect estimate very local. In such cases, no simple remedy exists, since we should not estimate treatment effects using treated and control observations that are not comparable.
Inference
A key principle of inference for COSs is that the analyst must correct estimates of statistical uncertainty to account for clustering. Failure to do so will result in standard errors that are, at times, grossly underestimated given that the correlation among students in the same cluster has not been accounted for (Angrist & Pischke, 2009; Hayes & Moulton, 2009). Generally, the investigator should account for clustering at the level at which the treatment has been assigned (Abadie et al., 2017). Standard errors can be corrected using a generalization of clustered standard errors developed by Liang and Zeger (1986) or via random-effects regression modeling.
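The Liang–Zeger correction can be sketched in a few lines. This is an illustrative implementation with simulated data, not the study’s analysis: the “sandwich” form below sums score contributions within clusters before combining them, which is what allows for arbitrary correlation among students in the same school.

```python
import numpy as np

def ols_clustered_se(y, X, cluster):
    """OLS point estimates with Liang-Zeger cluster-robust standard errors.

    y : (n,) outcome; X : (n, p) design matrix (include an intercept column);
    cluster : (n,) cluster labels. Returns (beta_hat, robust_se).
    """
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # "Meat" of the sandwich: sum over clusters of X_g' e_g e_g' X_g
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(cluster):
        idx = cluster == g
        s = X[idx].T @ resid[idx]
        meat += np.outer(s, s)
    V = XtX_inv @ meat @ XtX_inv
    return beta, np.sqrt(np.diag(V))

# Simulate students nested in 30 schools with a school-level treatment
rng = np.random.default_rng(1)
schools = np.repeat(np.arange(30), 40)              # 40 students per school
z = (schools < 15).astype(float)                    # school-level treatment
school_effect = rng.normal(0, 1.0, 30)[schools]     # induces within-school correlation
y = 0.2 * z + school_effect + rng.normal(0, 1.0, 1200)
X = np.column_stack([np.ones_like(z), z])
beta, se = ols_clustered_se(y, X, schools)
```

Comparing `se` with the naive i.i.d. standard error on these data shows the underestimation the text warns about: ignoring the within-school correlation shrinks the standard error of the treatment coefficient several-fold.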
When matching methods are used, regression-based corrections can account for clustering. After matching, the analyst should include a clustering correction for schools and the paired school clusters (Abadie & Spiess, 2019), accounting for clustering both within schools and within matched school pairs. This method requires a sufficiently large number of clusters for valid inferences. To account for clustering while avoiding the large sample assumptions on which regression relies, one can alternatively use randomization inference methods (Hansen et al., 2014). For example, within matched pairs, the analyst randomly reassigns treatment status and estimates a treatment effect. Doing this repeatedly allows the construction of a null distribution of treatment effects against which to evaluate the treatment effect estimated for actual assignment. The resulting inferences are valid for any sample size. However, randomization inference methods test the sharp null hypothesis which asserts a zero treatment effect for all schools and students. This differs from the more usual null hypothesis asserting an average effect of zero. In general, when sample sizes are small (e.g., 20-30 total clusters), randomization inference is useful for understanding whether inferences depend on the assumption of large sample sizes.
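The randomization inference procedure for matched pairs can be sketched as follows, assuming the analyst has already computed one treated-minus-control difference in mean outcomes per matched pair of schools; the pair differences below are hypothetical.

```python
import numpy as np

def pair_randomization_test(d, reps=10000, seed=0):
    """Randomization inference for matched pairs of clusters.

    d : (S,) observed treated-minus-control differences in mean outcomes,
        one per matched pair of schools.
    Under the sharp null of no effect for any school or student, each
    pair's difference is equally likely to be +d or -d, so the null
    distribution is built by randomly flipping the signs of the observed
    differences. Returns a two-sided p-value for the observed mean.
    """
    d = np.asarray(d, dtype=float)
    rng = np.random.default_rng(seed)
    obs = abs(d.mean())
    signs = rng.choice([-1.0, 1.0], size=(reps, d.size))
    null = (signs * d).mean(axis=1)        # mean difference under reassignment
    return float((np.abs(null) >= obs).mean())

# Hypothetical treated-minus-control differences for 10 matched school pairs
d_pairs = np.array([0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.3, -0.3, 0.2, 0.1])
p = pair_randomization_test(d_pairs)
```

With only 10 pairs, the null distribution has just 2^10 distinct sign patterns, yet the resulting p-value is exact in the sense described in the text: it does not lean on any large-sample approximation.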
Sensitivity Analysis
All observational studies should include a sensitivity analysis. Sensitivity analyses often are based on a partial identification strategy, where bounds are placed on quantities of interest while a key assumption is relaxed. A sensitivity analysis is designed to quantify how much hidden bias from an unobserved confounder would need to be present to alter the study’s conclusions.
To begin, recall that under selection on observables, we assume that any two matched clusters have the same underlying probability of treatment. That is, the coin flip is fair within this pair. Of course, this assumption is strong, and matched clusters may still differ on an unobserved confounder, u, that influences both treatment assignment and the outcome.
For example, we might hypothesize that, despite matching, an unobserved covariate renders selection probabilities unequal. If that hypothesized inequality (which Rosenbaum denotes by Γ, the ratio of the odds of treatment within a matched pair) is large enough, the study’s conclusions would change.
Generally, one can vary the value of Γ to identify the point at which the study’s conclusions change; the largest Γ at which the conclusions persist summarizes the study’s robustness to hidden bias.
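A minimal sketch of this idea uses the sign-test version of Rosenbaum’s bounds; the published methods typically use more powerful statistics, such as Wilcoxon’s signed-rank, and the pair counts below are hypothetical. Within a matched pair, hidden bias of magnitude Γ means the worst-case probability that the pair favors treatment is Γ/(1 + Γ), which yields a binomial upper bound on the p-value.

```python
import math

def sign_test_sensitivity(n_pos, n_pairs, gamma):
    """Upper bound on the one-sided p-value of a matched-pair sign test
    when hidden bias may make within-pair treatment odds differ by gamma.

    n_pos   : number of pairs in which the treated school did better
    n_pairs : number of matched pairs (ties discarded)
    gamma   : Rosenbaum's sensitivity parameter (gamma = 1 means no bias)
    """
    p_plus = gamma / (1.0 + gamma)  # worst-case chance a pair favors treatment
    return sum(
        math.comb(n_pairs, k) * p_plus**k * (1 - p_plus) ** (n_pairs - k)
        for k in range(n_pos, n_pairs + 1)
    )

# Suppose the treated school outperformed its match in 15 of 18 pairs
bounds = [sign_test_sensitivity(15, 18, g) for g in (1.0, 2.0, 3.0)]
# The bound grows with gamma; the largest gamma at which it stays below a
# conventional threshold summarizes robustness to hidden bias.
```

At Γ = 1 the bound reproduces the usual sign-test p-value; as Γ grows, the bound rises toward insignificance, locating the point at which the study’s conclusion would change.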
Although our discussion has focused on the level of
Case Study
Here, we demonstrate concepts with the myON application. Our data contain 3,434 summer school students from 49 elementary schools. These 49 schools were grouped into 19 different summer school sites, eight of which received myON. Our first analytic decision relates to whether we define clusters as elementary schools or summer school sites.
For several reasons, we treat intact elementary schools as our clusters. First, we can reasonably infer that although summer school sites were selected for myON, this process effectively assigned schools to treatment or control. Second, defining clusters at the school level yields a larger number of clusters and improves statistical power. Finally, our statistical adjustment strategy employs optimal matching methods designed for COSs by Zubizarreta and Keele (2016) and Pimentel et al. (2018). We benefit from having a greater number of treatment and comparison clusters, as this increases the likelihood of obtaining good cluster-level matches. Thus, our treatment-comparison contrast is between assigning groups of students to summer-school sites that do or do not use myON, under the assumption that schools were otherwise comparable; 1,371 summer-school students from 20 schools used myON.
Next, we consider the appropriate target trial. While entire schools were selected for treatment, the intervention applied only to students required to attend summer school. Although control schools were not selected for myON, the summer school selection process was identical across all schools. In theory, summer school students should be similar across treated and control schools. Nevertheless, the student-level selection process points to Design 2 as the relevant target trial.
Given this, we investigate balance at the school and student levels. Table 1 contains means for the treated and control groups and standardized differences before any statistical adjustments. From Table 1, the imbalances on student-level covariates, including pretreatment test scores, are small, indicating that the summer school selection process is uniform across treated and control schools.
Balance on Student- and School-Level Covariates Before Matching
Table 1 also contains balance statistics for school-level covariates. All school-level measures were calculated by the school district and thus are based on all enrollees from the previous school year—not just the students who attended summer school. For school-level covariates, clear differences are evident between treated and control schools. Treated schools, on average, have higher test scores, lower staff turnover, and a lower percentage of teachers who are non-White. Treated schools also have a higher share of teachers who are novices (i.e., 3 or fewer years of experience).
When comparing the student- and school-level covariates in Table 1, mean differences of similar magnitude translate to very different standardized mean differences at the two levels. This is partly because the standard deviations used to scale mean differences are larger at the student level, as there is more variation within schools than across them. In Table 1, for example, because there is little variation in the share of English language learners (ELLs) across schools, a mean difference of 2 percentage points translates to a standardized mean difference of −0.29 at the school level. The pattern also reflects that selection operated at the school level, whereas student-level selection into summer school operated similarly across schools.
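To make the scaling point concrete, the following Python sketch (with illustrative numbers, not the study's data) computes standardized mean differences for a student-level indicator and a school-level aggregate that share the same 2-percentage-point raw gap:

```python
import numpy as np

def smd(x_treated, x_control):
    """Standardized mean difference: raw mean gap over the pooled SD."""
    diff = x_treated.mean() - x_control.mean()
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2)
    return diff / pooled_sd

# Student-level binary indicator (e.g., ELL status): a 2-point gap
# against a large binary SD (about 0.3) yields a small SMD.
students_t = np.repeat([1.0, 0.0], [200, 1800])   # 10% ELL
students_c = np.repeat([1.0, 0.0], [240, 1760])   # 12% ELL

# School-level shares vary far less across schools (SD about 0.02),
# so the same 2-point gap produces a much larger SMD.
schools_t = np.linspace(0.07, 0.13, 20)           # mean 10%
schools_c = np.linspace(0.09, 0.15, 20)           # mean 12%

print(round(smd(students_t, students_c), 2))      # → -0.06
print(round(smd(schools_t, schools_c), 2))        # → -1.07
```

The same raw difference thus reads as negligible at the student level but substantial at the school level, which is why Table 1 must be interpreted level by level.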
Next, we use matching to address baseline imbalances. The matching process is typically iterative; the analyst performs a match, assesses the resulting balance, and then fine-tunes the matching procedure until balance is deemed acceptable. Just as outcome measures are not available at the time of randomization in an experimental study, the analyst should not examine outcomes when implementing matching. The CRT analogue to this process is conducting a randomization, assessing balance on baseline measures, and rerandomizing if baseline equivalence is not satisfied (Morgan & Rubin, 2012).
Instead of presenting results from the match with the best balance, we present a series to illustrate the iterative nature of the matching process. In doing so, we highlight additional tools for improving balance: balance prioritization, calipers, and subsetting. We refer readers to Keele et al. (2020) for further discussion of these tools. For matching, we use the R package matchMulti built specifically for matching with COS designs (Keele & Pimentel, 2016).
Our first match uses the matching algorithm's defaults. Under the defaults, no covariate is given priority, and no treated schools are dropped. The resulting sample includes 40 schools, with each of the 20 treatment schools pair-matched to a control school without replacement. Table 2 (column 2) reports on this match; some but not all of the standardized differences improve. In Match 2, we add covariate prioritization, selecting sets of covariates whose balance the algorithm should privilege. Such prioritization is useful because science and context may justify closer balance on certain measures.
Balance on School-Level Covariates for Four Different Sets of Match Parameters
For covariate prioritization, we define two covariate sets. Set 1 includes the school-level test score measures. Set 2 includes the proportion of ELLs and the proportion of non-White teachers. Under balance prioritization, the matching algorithm works to balance the set 1 covariates first, followed by the set 2 covariates. The remaining covariates receive lowest priority for balance. In Match 2, balance on the test score measures is improved; however, improvements for the set 2 covariates are minimal.
Next, we applied a school-level caliper. The matchMulti package includes a function that calculates a school-level propensity score, which is the estimated probability of treatment selection based on baseline measures. We can impose a caliper on this estimated propensity score as another tool to improve balance. We set the caliper to 0.20, which forbids school-level matches differing by more than 0.20 of a standard deviation on the estimated propensity score. We also add a third covariate balance prioritization set, which includes the proportion of novice teachers and the staff turnover rate. The resulting match (Match 3) is generally better balanced, although balance on the proportion of novice teachers is worse. Note that this match discarded some treated schools because, for these schools, the caliper constraint could not be satisfied. When a caliper discards schools, optimal subsetting is a better tool for match refinement: with optimal subsetting, one can often achieve similarly good balance without losing as many treatment sites as might be lost with a caliper strategy.
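A small Python sketch can illustrate why a caliper discards schools. This is a greedy toy, not the optimal matching that matchMulti performs, and the propensity scores are invented; it shows only how the caliper constraint forces treated units without a close control to go unmatched:

```python
import numpy as np

def caliper_pair_match(ps_treated, ps_control, caliper_sd=0.20):
    """Greedy 1:1 match on the propensity score.  Pairs farther apart than
    `caliper_sd` pooled-SD units are forbidden, so some treated units may
    go unmatched (and would be discarded, as in Match 3)."""
    pooled_sd = np.concatenate([ps_treated, ps_control]).std(ddof=1)
    caliper = caliper_sd * pooled_sd
    available = list(range(len(ps_control)))
    pairs = {}
    for t, ps_t in enumerate(ps_treated):
        if not available:
            break
        best = min(available, key=lambda c: abs(ps_t - ps_control[c]))
        if abs(ps_t - ps_control[best]) <= caliper:
            pairs[t] = best          # treated index -> control index
            available.remove(best)
    return pairs

ps_t = np.array([0.30, 0.45, 0.90])  # third school has no close control
ps_c = np.array([0.32, 0.44, 0.50, 0.55])
matched = caliper_pair_match(ps_t, ps_c)
print(matched)  # treated school 2 falls outside the caliper and is dropped
```

In the real analysis, the match is solved optimally over all schools at once rather than greedily, but the caliper plays the same role: it forbids poor pairings rather than allowing them.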
With multilevel data, optimal subsetting can be used to trim clusters, units, or both. Given sample sizes, however, trimming is typically necessary only at the school level. In applying optimal subsetting, the analyst specifies a minimum number of treated clusters (or units) that must be retained. The analyst then lowers this number iteratively, dropping treated schools one by one until balance improves. For example, if there are 20 treated schools and the optimal subset number is 19, the algorithm will discard the treated school with the poorest match. In general, we recommend dropping schools one by one until balance is acceptable.
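The one-by-one logic can be sketched in a few lines of Python. Note that this toy drops pairs greedily by a single discrepancy score, whereas true optimal subsetting (as in matchMulti) re-solves the full match after each reduction:

```python
def trim_until_balanced(distances, max_ok=0.10):
    """Toy version of iterative subsetting.  `distances` maps each matched
    treated school to its pair's covariate discrepancy (in standardized
    units).  Drop the worst pair one at a time until the largest remaining
    discrepancy falls below the threshold."""
    kept = dict(distances)
    while kept and max(kept.values()) > max_ok:
        worst = max(kept, key=kept.get)  # treated school with poorest match
        del kept[worst]
    return kept

pairs = {"A": 0.05, "B": 0.31, "C": 0.08, "D": 0.17}
print(sorted(trim_until_balanced(pairs)))  # → ['A', 'C']
```

Here schools B and D are discarded in turn, after which the worst remaining discrepancy (0.08) satisfies the 0.10 threshold.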
We improve balance on the proportion of novice teachers by dropping four treated schools via optimal subsetting and rematching. The resulting match (Match 4) excludes the four treated schools with the largest covariate imbalances. For this match, the causal estimand differs; treatment effect estimates will apply to a subset of treated schools—not the entire treated population. We might ask whether the estimand is still of substantive interest once these schools are excluded.
Next, we plot the distribution of covariate standardized differences for each match (Figure 1) and observe clear patterns. First, the default settings (Match 1) do improve balance overall, but a few covariates remain highly imbalanced. Second, Match 3 is well balanced with the exception of one covariate, the proportion of novice teachers. This tells us that the subsetting used for Match 4 removed schools with a larger proportion of novice teachers and that the schools retained in Match 4 differ from the overall treated population mostly with respect to this covariate.

Boxplots of the distribution of absolute standardized differences for school-level covariates.
Finally, balance in student-level covariates remains roughly the same across the matches (Table 3). Taken together, there is little evidence that treatment selection was a function of student-level characteristics for summer school participants.
Balance on Student-Level Covariates
Outcome Estimates
Next, we assess the effectiveness of myON for improving performance on the end-of-summer i-Ready reading assessment in our matched sample. With matching complete, we can estimate treatment effects. We use multilevel models with random intercepts at the school and matched school-pair levels, regressing the outcome on the treatment indicator.
One advantage of regression-based estimation is that the analyst can add baseline covariates to the model. It is useful to include covariates that did not balance sufficiently in the match. For example, in Match 3, we were unable to reduce the standardized differences below 0.10 for the percentage of novice teachers and the staff turnover rate. To more completely remove bias from the imbalance in these covariates, we include them in the treatment effect model.
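As an illustration of this estimation strategy, the following Python sketch fits such a model with statsmodels on simulated matched-pair data. The variable names, sample sizes, and effect sizes are invented for the example; the study itself would use the actual matched sample:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
rows = []
for pair in range(16):                     # 16 matched school pairs
    pair_re = rng.normal(0, 0.3)           # matched-pair random intercept
    for treat in (0, 1):
        school_re = rng.normal(0, 0.2)     # school random intercept
        novice = rng.uniform(0.1, 0.4)     # residually imbalanced covariate
        for _ in range(30):                # 30 summer-school students
            y = 0.1 * treat - 0.5 * novice + pair_re + school_re + rng.normal()
            rows.append(dict(pair=pair, school=2 * pair + treat,
                             treat=treat, novice=novice, y=y))
df = pd.DataFrame(rows)

# Random intercepts for matched pairs (groups) and for schools nested
# within pairs (variance component); the imbalanced covariate enters as
# a regression adjustment alongside the treatment indicator.
model = smf.mixedlm("y ~ treat + novice", df, groups="pair",
                    vc_formula={"school": "0 + C(school)"})
fit = model.fit()
print(round(fit.params["treat"], 2))
```

The coefficient on `treat` is the covariate-adjusted treatment effect estimate; its standard error reflects both the school-level and matched-pair-level variance components.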
Table 4 reports unadjusted estimates as well as those produced from regression adjustment alone, matching alone (for Matches 3 and 4), and matching in combination with regression adjustment (again with Matches 3 and 4). One pattern is clear from the results: little difference exists between the unadjusted and adjusted estimates. This suggests either that little self-selection is present or that, if selection bias is present, it operates through unobserved factors that our adjustments cannot remove.
Outcome Estimates for the Treatment Effect of the myON Reading Program
In this example, the treatment effect estimates do not vary across statistical adjustment strategies. Is this evidence that these choices are inconsequential? No. Design choices—including the type of match—should be made without reference to outcomes, and such choices may well be consequential in other applications. In general, if we use regression alone, we cannot be sure that inferences do not depend heavily on the model to extrapolate between treatment and control sites with poor overlap. The inferences we derive, and our confidence in them, have more to do with the strength of our design process than with how results change across the different strategies. Sensitivity analyses can increase our confidence further.
Sensitivity Analysis
We can use sensitivity analyses to determine whether it would take a weak or strong unobserved confounder to render a significant treatment effect no longer significant. In the myON example, however, treatment effect estimates are small, and confidence intervals include zero, so we fail to reject the null hypothesis of no treatment effect. Given the null results, one might conclude that sensitivity analyses are not needed. Here, we illustrate how to explore the possibility that bias from a hidden confounder is masking a true treatment effect.
Because we did not discuss it above, we first review tests of equivalence. Under a test of equivalence, the null hypothesis asserts that the absolute value of the treatment effect is greater than some substantively meaningful threshold; rejecting this null supports the conclusion that any effect is smaller than that threshold.
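A minimal Python sketch of the two one-sided tests (TOST) version of equivalence testing, using invented matched-pair differences and an assumed equivalence threshold of δ = 0.10 (the threshold choice is an assumption for illustration):

```python
import numpy as np
from scipy import stats

def tost(diffs, delta):
    """Two one-sided tests: the null is |effect| >= delta.  Rejecting
    both one-sided nulls supports equivalence, i.e., |effect| < delta."""
    n = len(diffs)
    se = diffs.std(ddof=1) / np.sqrt(n)
    t_lower = (diffs.mean() + delta) / se          # H0: effect <= -delta
    t_upper = (diffs.mean() - delta) / se          # H0: effect >= +delta
    p_lower = 1 - stats.t.cdf(t_lower, df=n - 1)
    p_upper = stats.t.cdf(t_upper, df=n - 1)
    return max(p_lower, p_upper)  # reject the equivalence null if < alpha

rng = np.random.default_rng(2)
pair_diffs = rng.normal(0.0, 0.05, size=16)  # matched-pair outcome differences
print(tost(pair_diffs, delta=0.10) < 0.05)   # equivalence supported at delta=0.10
```

With a small observed effect and a threshold of 0.10, both one-sided tests reject, supporting the conclusion that any effect lies inside (−0.10, 0.10); with a very small threshold the test cannot reject, since no data can rule out effects that tiny.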
Next, we implement a test of equivalence for the myON analysis using Match 4, first assuming no unobserved confounding. We test whether we can rule out treatment effects large enough to be substantively meaningful.
Were our study a CRT, we could be confident that the results were not due to unobserved treatment-control group differences. In a COS, however, we may reject the null hypothesis of equivalence due to hidden confounding. The test above is conducted under the assumption of no hidden bias (i.e., Γ = 1), so we next ask how the conclusion changes once hidden bias is allowed.
To do so, we repeat the test of equivalence but use values of Γ greater than 1, allowing the odds of treatment for matched schools to differ by a factor of Γ.
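To illustrate how Γ enters such a calculation, the following Python sketch computes Rosenbaum-style bounds for a simple paired sign test. This is a simplification of the actual analysis, and the pair counts are invented; it shows only the mechanics of how increasing Γ widens the range of possible p-values:

```python
from scipy import stats

def sign_test_pvalue_bounds(n_pairs, n_treated_higher, gamma):
    """Rosenbaum-style bounds for a paired sign test.  With sensitivity
    parameter gamma, the per-pair chance (under the null) that the treated
    cluster has the higher outcome lies in [1/(1+gamma), gamma/(1+gamma)].
    Returns (lower, upper) bounds on the one-sided p-value."""
    p_lo, p_hi = 1 / (1 + gamma), gamma / (1 + gamma)
    lower = stats.binom.sf(n_treated_higher - 1, n_pairs, p_lo)
    upper = stats.binom.sf(n_treated_higher - 1, n_pairs, p_hi)
    return lower, upper

# At gamma = 1 (no hidden bias) the two bounds coincide.
lo, hi = sign_test_pvalue_bounds(16, 12, gamma=1.0)
print(round(hi, 3))                                        # → 0.038
# Allowing gamma = 1.5, the worst-case p-value grows.
print(round(sign_test_pvalue_bounds(16, 12, gamma=1.5)[1], 3))  # → 0.167
```

Here a result that is significant under Γ = 1 is no longer significant in the worst case once Γ reaches 1.5; the analogous logic applies when the test in question is a test of equivalence rather than a test of no effect.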
Is this a large or small amount of hidden bias? Judging whether a confounder of this magnitude is plausible requires subject-matter knowledge of how sites were selected for myON.
Discussion
Although randomized trials are considered the “gold standard” for conducting educational effectiveness research, they are not always feasible. Furthermore, an investigator may have questions about the efficacy of an intervention after it has been implemented in a nonrandom manner. In educational contexts, such nonrandom allocation often occurs at the cluster (e.g., school or classroom) rather than individual level.
In such instances, thoughtfully designed COSs together with sensitivity analyses are an important tool in the education analyst’s arsenal. Thoughtful design is key to conducting a high-quality COS. We outline principles for COS design and advocate designing COSs with their CRT analogue as a guide. Analysts should focus on the assignment mechanism and identify the factors that guided treatment allocation. We further advocate multilevel matching strategies for achieving treatment-control balance and common support prior to the application of regression or other strategies to estimate treatment effects.
The weakness of a COS, of course, is that even with thoughtful design and analysis, one can never definitively know whether a critical unobserved confounder is driving an impact estimate or whether such an unobserved measure is masking true effects. Nevertheless, sensitivity analyses allow the analyst to consider how large such confounders would need to be to operate in either of these ways, and whether a confounder of such a magnitude is reasonable within the context under consideration. In sum, there is much to be learned from thoughtfully designed and implemented COSs.
