Abstract
Introduction
Natural experiments offer a unique opportunity to explore some of the most elusive questions about the political world, characterized by circumstances that allow researchers to assume as-if random or haphazard treatment conditions even though treatment allocation is not defined by a random device (Rosenbaum, 2010; Dunning, 2012; Keele, 2015). One of the most popular types of natural experiments uses geographic or administrative boundaries to construct treated and control groups by exploiting certain geographical features that generate as-if random variation in the treatment assignment (Keele and Titiunik, 2016). Such geographic natural experiments (GNEs) have been used to study topics such as ethnic relations in Zambia and Malawi (Posner, 2004), political polarization in the United States (Nall, 2018), electoral choices after disasters in Chile (Visconti, 2022), and support for authoritarian regimes in East Germany (Kern and Hainmueller, 2009) that might otherwise escape attempts to establish causal inference.
However, as with any methodological approach, there are important limitations to (geographic) natural experiments. For instance, because randomization is not guaranteed, researchers must provide a compelling justification for the as-if random assumption (Dunning, 2012; Sekhon and Titiunik, 2012). Even with empirical and theoretical justification, though, “the strong possibility that unobserved differences across groups may account for difference in average outcomes is always omnipresent in observational studies” (Dunning, 2008; 289). This concern may obscure important relationships and even undermine the validity of causal claims.
The local geographic ignorability design (LGID) therefore emerges as an attractive empirical approach to limit potential unobservable factors from biasing results. Under the assumption that the treatment was as-if randomly assigned to units that are especially close to a given geographic or administrative boundary, there can be greater confidence in the assumed independence of potential outcomes (Keele and Titiunik, 2016). For example, when studying the effects of a policy intervention, an LGID would examine differences between residents living within a small buffer area from the border. Presumably, these residents would be more similar to each other than to those living far away from the administrative boundary. The LGID approach, however, might still require adjustment for pretreatment covariates. One solution to this problem is to enhance geographic designs by using matching as a flexible form of statistical adjustment (Keele et al., 2015).
While the latter is an undoubtedly powerful research design, it is also accompanied by an important limitation: its inherent locality. Although matched treated and control groups may, in fact, be quite similar to each other, they could also be markedly distinct from a larger population of interest (e.g., a city or state). Given an unrepresentative sample, “the estimate of a causal effect may fail to characterize how effects operate in the population of interest” (Aronow and Samii, 2016; 250). Such external validity concerns are often of particular interest for political scientists (McDermott, 2002), as it may be difficult to determine whether any causal effects identified must be restricted to only areas within the narrowly defined boundary or if they can be generalized across cases to answer fundamental questions about broader political phenomena.
We present an approach that addresses this problem, inspired by the idea of template matching (Silber et al., 2014) as well as by recent advances in optimal matching and the construction of representative matched samples (Visconti and Zubizarreta, 2018; Bennett et al., 2019). Using a target population as a template to implement the matching, such as a city, state, or country, matched treated and control groups will not only be similar to each other but also similar to the population of interest. This can increase the generalizability of causal evidence from GNEs, providing a kind of external validity check. By implementing this method, researchers would not have to rely only on collecting multiple studies conducted in diverse contexts to learn about the generalizability of an effect, since template matching reveals the hidden studies that resemble other populations within the original study. We see this strategy as a second step to be implemented after the main analysis to explore whether results are consistent across samples that look like the populations of interest. In the following sections, we describe the assumptions and the methodology for this approach and provide an empirical illustration.
Notation and assumptions
When using a sample to draw causal inference, the evidence can be generalized to a target population only when that sample was randomly selected from the target population of interest. In the case of geographic natural experiments, the sample (e.g., the buffer from either side of the administrative boundary) is not constructed by randomly selecting people from the target population (e.g., the city). As a consequence, generalizability efforts must rely on an observational data analysis assumption (Stuart et al., 2018).
In randomized experiments, the most common quantity of interest is the average treatment effect (ATE). Let
In this paper, we instead focus on a different estimand: the target average treatment effect on the treated (TATT), which will inform us about how the treatment effects operate on the target population of interest. In a sample of
We propose a design based on template matching to extend beyond the local effects estimated when using local geographic ignorability designs and to recover the target average treatment effect on the treated (TATT). Template matching was developed by Silber et al. (2014) to make standardized comparisons based on observed characteristics. Their study randomly selected 300 patients (i.e., the template) and used them to match 300 patients at 217 hospitals, constructing a sample that resembled the template used to implement the multivariate matching.
Two assumptions are needed to claim that the matched sample resembles the population of interest and to provide causal evidence after adjusting on observables. The first, the
A key question is how to define the appropriate template or target population. Recent research has advocated for a stronger connection between theory and causal identification. Scholars point to the advantages of theory-driven endeavors, which can help to better recognize undefined potential outcomes (Slough, 2022), to improve covariate balance (Resa and Zubizarreta, 2016), and to generalize a causal effect to other contexts (Gailmard et al., 2021). While the nature of causal identification strategies may require a narrow focus, the theories researchers wish to test may be far more extensive. When constructing a generalizable geographic natural experiment, we argue that researchers should ask not only what identification strategy is best to recover causal effects, but also what template or population they wish to mimic that would best test their broader theory.
For example, Posner (2004) takes advantage of the border between Zambia and Malawi to study the political salience of a cultural cleavage. Chewa and Tumbuka people live on both sides of the border. While their cultural differences are identical on both sides of the border, their political differences are more salient in Malawi than Zambia. The rationale behind exploiting this distinction is that Chewas and Tumbukas are large groups relative to the country as a whole in Malawi and, therefore, can be used as a base for coalition-building. Meanwhile, in Zambia, Chewas and Tumbukas are small relative to the country as a whole, creating little incentive to rely on them for coalition-building.
As a result, Posner's (2004) theory directly connects with a population of interest (i.e., the entire Chewa and Tumbuka people in Malawi and Zambia) rather than just the four villages along the border used in the study. If people from these villages have different distributions of observed characteristics than in the entire country, using a traditional geographical experiment might generate estimates that do not speak to the theory. Thus, we would advocate implementing a generalizable geographic natural experiment to improve the connection between theory and causal identification.
The utility of using template matching is also evidenced in more recent implementations of geographic natural experiments. Keele and Titiunik (2018) aim to uncover the effects of all-mail voting on turnout. To do so, they rely on data from two counties, one that used only in-person voting and one that used all-mail voting. While the resulting estimates can tell us about turnout effects at a local scale, they may not extend to the true populations of interest: Colorado and even the United States as a whole. Employing template matching in this case would provide a kind of external validity check on how well the theory underlying the paper connects with the analysis and results of the causal identification strategy.
It is important to note that we do not equate external validity and representativeness. Our goal is to show that a treatment effect can be generalized across different populations of interest (i.e., external validity). We use template matching to construct representative matched samples that are similar to the population of interest (i.e., representativeness). Using template matching to build representative matched samples can improve the limited external validity of studies that have an especially local nature, often a result of researchers’ efforts to reduce heterogeneity and decrease sensitivity to hidden biases (Rosenbaum, 2005). In observational studies, reducing heterogeneity often means decreasing the sample size to improve comparability between units (Keele, 2015). Therefore, we could end up with a treated and control group that allows us to make credible inferences but that might be substantially different from the target population.
Method
To implement template matching, we use mean balance constraints, with the goal of reducing the standardized differences (the differences in means in standard deviation units) between the treated and control groups, though stricter balance constraints, such as fine balance, can also be used. In this case, we use matching to restrict the standardized differences (i) between the treated group and our target population and (ii) between the control group and our target population to be no larger than 0.05 pooled standard deviations. This ensures that the standardized differences between the matched treated and control groups cannot be larger than 0.1, a traditional threshold used in the literature to demonstrate covariate balance (see Zubizarreta (2012) and Pimentel et al. (2015), for example).
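The balance rule above can be illustrated with a minimal sketch. The covariate values, the function name `standardized_difference`, and the choice of the template's standard deviation as a single common scale are all assumptions made for illustration; they are not taken from the study. With one common scale, the 0.1 bound between treated and control groups follows from the triangle inequality.

```python
from statistics import mean, stdev

def standardized_difference(group, reference, scale):
    """Difference in means between group and reference, in units of scale."""
    return (mean(group) - mean(reference)) / scale

# Hypothetical covariate values (e.g., age); not taken from the paper.
template = [22, 35, 41, 48, 53, 60, 67, 74]   # target population
treated = [23, 36, 40, 49, 52, 61, 68, 73]    # matched treated units
control = [21, 34, 42, 47, 54, 59, 66, 76]    # matched control units

# Assumption: one common scale (the template SD) for every comparison.
scale = stdev(template)

d_treated = standardized_difference(treated, template, scale)
d_control = standardized_difference(control, template, scale)
d_between = standardized_difference(treated, control, scale)

# Each group is within 0.05 of the template ...
within_template = abs(d_treated) <= 0.05 and abs(d_control) <= 0.05
# ... so the two groups are within 0.1 of each other (triangle inequality).
within_each_other = abs(d_between) <= 0.1

print(within_template, within_each_other)
```

If group-specific pooled standard deviations are used as denominators instead of a common scale, the same check can be run directly on `d_between` rather than relying on the bound.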
To generate covariate balance, we use cardinality matching, which allows for different types of balance, such as aggregate balance of low-dimensional joint distributions, marginal distributions, and moments such as the means, among other forms (Visconti and Zubizarreta, 2018). Even though we recommend cardinality matching because it maximizes the size of the matched sample based on flexible constraints on covariate balance, we acknowledge that template matching can also be implemented using other matching techniques such as genetic matching (Diamond and Sekhon, 2013) or matching frontier (King et al., 2017), or by using weighting approaches such as entropy balance (Hainmueller, 2012) or minimal weights (Wang and Zubizarreta, 2020).
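The core idea of cardinality matching, keeping as many units as possible subject to balance constraints, can be sketched on toy data. The example below is a brute-force illustration for a single group and a single covariate, not the integer-programming approach used in practice; the data, the function name `largest_balanced_subset`, and the scale are hypothetical.

```python
from itertools import combinations
from statistics import mean

def largest_balanced_subset(values, target_mean, scale, tolerance=0.05):
    """Return the largest subset whose mean lies within `tolerance`
    standardized units of `target_mean` (brute force; toy sizes only)."""
    for size in range(len(values), 0, -1):          # try the largest size first
        for subset in combinations(values, size):
            if abs(mean(subset) - target_mean) / scale <= tolerance:
                return subset
    return ()

# Hypothetical ages of treated units near the boundary.
treated_ages = [25, 30, 35, 40, 45, 80]
# Hypothetical template (target population) mean and common scale.
template_mean, scale = 36.0, 10.0

best = largest_balanced_subset(treated_ages, template_mean, scale)
print(len(best), best)
```

In a full design, the same search would be carried out jointly for the treated and control groups, with each group constrained to resemble the template on every covariate.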
As an illustration of the structure of the design, the first panel of Figure 1 depicts a state that contains a group of hypothetical people living to the east and another group to the west of a geographic boundary within the city of Chicago. If we are interested in covariate
Figure 1. Different types of matching.
The second panel uses a regular matching approach to decrease imbalances in the observed covariate
We further illustrate our approach, using Illinois as a template, in the third panel. We choose how the standardized differences between the treated group and the state should be restricted, making them no larger than 0.05 pooled standard deviation units; the same restriction applies to the difference between the control group and the state. Therefore, by construction, the treatment and control groups cannot have imbalances greater than 0.1 pooled standard deviation units.
As we have seen, template matching provides a flexible approach for making multiple estimations based on the target population of interest. The standard matching approach, in contrast, achieves balance between the matched treated and control groups, but cannot necessarily be used to make inferences outside of a given administrative or geographic boundary.
Example
We provide a more concrete illustration of our approach by extending the study by Keele et al. (2015) on the role of ballot initiatives in voter turnout. The authors draw on a natural experiment in the city of Milwaukee, Wisconsin, wherein a ballot initiative was established in 2008 but not implemented in the seventeen surrounding areas. This draws on a local geographic ignorability design, where units “within a narrow band around the border are assumed to be good counterfactuals for each other” (Keele and Titiunik, 2016; 3), and offers a unique opportunity to understand whether such initiatives do, in fact, foster greater political participation. Keele et al. (2015) do not find enough evidence to claim that the ballot initiative increased turnout.
Before matching.
After regular matching.
As expected, matching generates balance for all of the covariates. This provides a compelling way to improve causal inferences by combining a natural experiment based on geography and matching to improve covariate balance. However, the results might be highly local and not necessarily reflect the effects of initiatives on voter turnout for an average American citizen or even a typical resident of Milwaukee, Wisconsin. In fact, considering the averages for the city, state, and country shown in Table 2, the matched treated and control groups do not look, on average, very similar to these potential populations of interest.
To address this concern, we combine a geographic natural experiment with template matching. Template matching is critical for achieving covariate balance not only between the treated and control groups but also with a given target population. We use as a template the city (Milwaukee), the state (Wisconsin), and the country (United States) to match with the treated and control groups (see Appendix C for more information on the city-, state-, and country-level characteristics we use).
After template matching.
We use a permutational t-test in matched pairs with an embedded sensitivity analysis (Rosenbaum, 2015).
The parameter Γ represents the odds of differential assignment to the treatment due to an unobserved factor
Permutational t-test and sensitivity analysis.
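The logic of a matched-pair permutational t-test with a sensitivity parameter Γ can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the analysis: it uses the sum of pair differences as the test statistic, a one-sided alternative (treatment increases the outcome), exact enumeration of all sign patterns (feasible only for a handful of pairs), and hypothetical data. For a given Γ, the upper bound on the p-value is obtained by letting each pair difference be positive with probability Γ/(1 + Γ); Γ = 1 recovers the usual randomization test.

```python
from itertools import product

def upper_bound_p_value(pair_diffs, gamma=1.0):
    """Upper bound on the one-sided p-value of the matched-pair
    permutational t-test when hidden bias is at most `gamma`."""
    abs_diffs = [abs(d) for d in pair_diffs]
    observed = sum(pair_diffs)
    p_plus = gamma / (1.0 + gamma)      # worst-case chance of a positive sign
    p_value = 0.0
    for signs in product((1, -1), repeat=len(abs_diffs)):
        statistic = sum(s * a for s, a in zip(signs, abs_diffs))
        if statistic >= observed - 1e-12:   # as extreme as observed
            prob = 1.0
            for s in signs:
                prob *= p_plus if s == 1 else 1.0 - p_plus
            p_value += prob
    return p_value

# Hypothetical treated-minus-control differences for eight matched pairs.
diffs = [3.1, 1.2, 2.4, 0.8, 1.9, 2.7, -0.5, 1.5]

p_no_bias = upper_bound_p_value(diffs, gamma=1.0)   # as-if random assignment
p_gamma_2 = upper_bound_p_value(diffs, gamma=2.0)   # odds may differ by up to 2
print(p_no_bias, p_gamma_2)
```

Reporting the bound for a range of Γ values shows how large an unobserved bias would need to be before the conclusion of the test changes.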
Conclusion
Natural experiments can be powerful designs for exploring causal relationships that might otherwise be confined to correlations. Local geographic ignorability designs, for example, allow researchers to focus on differences between treatment and control groups that are in close proximity to one another (Keele and Titiunik, 2016), and recent methodological advances have blended designs based on geographic distance and covariate balance (Keele et al., 2015). Although this approach has increased internal validity because the treatment has a justifiably as-if random or haphazard nature, it is highly local by design and may raise concerns about external validity. We therefore suggest a generalizable natural experiment approach, using template matching as a solution to ensure that the matched sample is similar to the population of interest. This combines the strong internal validity of LGIDs with an external validity check to provide insight into how generalizable the results may be to other contexts. Additionally, we recommend that theory or previous knowledge be used to define the appropriate template or target population.
While we focus on LGIDs here, it is important to note that this approach could be extended to other designs, such as more standard natural experiments that do not rely on buffer zones, or could be implemented using other adjustment techniques, such as weighting methods, to obtain covariate balance. We believe that the main implication of this design is to allow for credible inferences while characterizing how the results can operate in other populations of interest.
Supplemental Material
Supplemental Material - Constructing generalizable geographic natural experiments
Supplemental Material for Constructing generalizable geographic natural experiments by Owura Kuffuor, Giancarlo Visconti, and Kayla Young in Research & Politics
Correction (June 2025):