Abstract
Keywords
Drawing causal inferences and quantifying them is a cornerstone of psychological research. Ever since the random assignment of individuals into different conditions was introduced in the social sciences in the late 19th century, it has been considered the “gold standard” for this purpose (Jamison, 2019). Therefore, psychologists are often reluctant to accept findings from nonrandomized studies that are explicitly presented as causal (Grosz et al., 2020). However, there are causal research questions for which randomization is unfeasible or unethical (Deaton & Cartwright, 2018). For these questions, nonexperimental (i.e., observational) data can still be informative, but only if they are combined with appropriate methods for causal inference. Unfortunately, these methods are rarely taught in psychology curricula (D’Onofrio et al., 2020), which leaves a knowledge gap.
Several articles from the last decade have aimed to fill this gap, including general introductions (e.g., Foster, 2010; Rohrer, 2018; Wysocki et al., 2022), work focusing on aspects such as mediation analysis (e.g., Nguyen et al., 2020; Rohrer et al., 2022) or longitudinal data modeling (e.g., Lucas, 2023; Rohrer & Murayama, 2023), and articles trying to build bridges between frameworks (e.g., Deffner et al., 2022; West & Thoemmes, 2010). With some exceptions (e.g., Schafer & Kang, 2008; Thoemmes & Ong, 2016; Thoemmes & West, 2011), much of this work focuses on causal identification.
Causal identification concerns whether and under which conditions it is possible to calculate a particular causal effect of interest from observational data (Elwert, 2013). 1 This involves determining whether researchers can find a set of control variables that allows them to correctly estimate the effect of interest. The question of causal identification usually worries psychologists most when it comes to observational data, and rightfully so, because strong assumptions (e.g., no unobserved confounders) are necessary to conclude that a causal effect can be estimated. The step following causal identification is causal estimation, in which whatever data are available are used to compute an association that reflects the causal effect if all assumptions are met (Elwert, 2013).
In many ways, causal estimation is familiar terrain for many psychologists because the standard curriculum usually covers methods that can be used to estimate causal effects in some scenarios, such as analysis of variance (ANOVA) and analysis of covariance (ANCOVA), linear and nonlinear regression analysis, multilevel models, and structural equation models. With the present article, we intend to expand this toolbox by introducing readers to modern estimators that are often applicable under more general circumstances and that have been explicitly developed with the aim of causal estimation. We cover methods based on propensity scores, g-computation, and their combination (so-called doubly robust estimators; Kang & Schafer, 2007). Propensity scores have already been used in the social sciences since the last decade (Thoemmes & Kim, 2011), and several tutorials have been published in the psychological literature (e.g., Austin, 2011; Harder et al., 2010; Lanza et al., 2013). Nonetheless, we include them here because we want to discuss different usages with their respective strengths and weaknesses and because they are one ingredient used in doubly robust estimators.
An advantage of these modern estimators is that they are designed to recover clearly defined estimands. In psychology, it is common to talk about “
What Is a “Causal Effect”?
An example: Alcoholics Anonymous attendance and abstinence
Consider the following example: We are interested in whether Alcoholics Anonymous (AA) attendance successfully leads to abstinence 1 year after starting (Ye & Kaskutas, 2009). Imagine we had access to a large number of individuals whom we made follow the AA program. At the 12-month mark, we observe that 70% abstain from alcohol. Now, we take a time machine and ensure that the same people do not follow the AA program. At the 12-month mark, we observe that only 40% are abstinent. This would be the best possible proof that the intervention has a causal effect, increasing the prevalence of abstinence by 30 percentage points.
Unfortunately, this time machine does not exist. Quantifying the causal effect, therefore, requires two groups, one that does attend AA and another one that does not. But apart from that, the groups should be as similar as possible so that they emulate the time-machine study design as well as possible. Suppose the two groups we pick are “people who attend AA” and “people who do not attend AA.” This is not a good way to emulate the time machine: AA attendees may differ from nonattendees in many ways, including age, gender, religiosity, motivation, and so forth. Thus, any direct comparison of attendees and nonattendees will allow us to discern only whether AA attendance and abstinence are associated. To move from association to causation, something else is needed.
Randomized trials are considered the “gold standard” because random allocation (e.g., by the flip of a coin) leads to two more suitable groups. The only systematic difference between those groups will be whether they received the action (e.g., whether they attended AA). Of course, especially with small sample sizes, random allocation may still result in groups that are noticeably unbalanced on some characteristics—for example, more motivated individuals may end up in the action group by chance alone. However, conventional statistical analysis already takes this into account and correctly reflects the resulting imprecision with larger standard errors and wider confidence intervals for smaller samples (Senn, 2013). Other factors can induce bias after random assignment—inevitably, some will drop out of the study before reaching the 1-year mark (loss to follow-up); maybe not everybody randomly assigned into the action group shows up to the meetings (partial adherence). 2 Nonetheless, randomization greatly reduces the sources of bias that researchers need to worry about.
Counterfactuality
The potential-outcomes framework was initially developed in the context of agricultural experiments by the statistician Jerzy Neyman (1923). It was later expanded to observational settings (Rubin, 1974) and then also to longitudinal data (Robins, 1986). The powerful notation has been widely adopted and thus provides “more or less the lingua franca for thinking about and expressing causal statements” (Cunningham, 2021, p. 85). West and Thoemmes (2010) provided an accessible introduction for psychologists. We consider a scenario in which both the action and the outcome are binary to simplify the explanations, but the potential-outcomes framework applies much more broadly.
Let A be the action, AA attendance, which takes the value 1 in the case of attendance and 0 otherwise. Let Y be the outcome, alcohol abstinence, which takes the value 1 in the case of abstinence and 0 otherwise. The potential-outcomes framework is based on the following question: What if the individual experiences A = 1 rather than A = 0? Each individual has a pair of potential outcomes (Table 1). One of them, YA=1 (condensed to Y1), reflects the observed outcome if the individual experiences A = 1. The other one, YA=0 (or Y0), reflects the observed outcome if the individual experiences A = 0. Only one potential outcome can actually be observed; the other one remains counterfactual (Holland, 1986). For example, consider the first row in Table 1: This is a person who did attend AA (A = 1) and who was abstinent a year later (Y = Y1 = 1). We do not know whether the person would have been abstinent if the person had not attended (Y0 = ?).
Toy Data Set Illustrating the Potential Outcomes
Estimands
We can use this notation to define a targeted causal effect, also referred to as a “theoretical (or causal) estimand.” Two components define such a theoretical estimand (Lundberg et al., 2021): a unit-specific quantity, such as a specific contrast between the potential outcomes (e.g., their difference or their ratio), and a target population over which we want to aggregate (e.g., the adult population of a particular country or all people with alcohol use disorders). Considering our AA example, the unit-specific quantity may be the difference in abstinence (Y1 – Y0). The target population may be people with alcohol use disorder who meet any additional study eligibility criteria (i.e., the entire study population). The resulting estimand is the so-called average treatment effect (ATE) on the entire population, E[Y1 – Y0], the most common estimand, which answers the question, “How would abstinence differ, on average, if all participants attended AA meetings versus if no participants attended AA meetings?”
Other common estimands include the ATE on the treated (ATT; E[Y1 – Y0|A = 1]) and the ATE on the untreated (ATU; E[Y1 – Y0|A = 0]). The ATT targets a population made up of the treated individuals, as defined by the study’s eligibility criteria; for example, the average effect of AA attendance among people who did attend AA. For these people, we observe their outcome under treatment (Y1) but need to infer their outcome without treatment (Y0). How would AA attendees’ abstinence differ, on average, had they (counter to fact) not attended the meetings? This gives us the effect of withholding the treatment from those individuals who would otherwise experience it (with the sign reversed). The ATU is the flip side of this. How would AA nonattendees’ abstinence differ, on average, had they attended the AA meetings? The ATU is the effect of expanding treatment to those individuals who would otherwise not experience it. When who attends AA has not been randomly assigned, the ATT and the ATU may plausibly differ. For example, maybe individuals who are most likely to benefit from AA are also the most likely to attend meetings (e.g., because the social component is particularly motivating to them); in such a scenario, the ATT would be larger than the ATU. Or maybe the people who are most likely to benefit from AA (e.g., individuals who suffer from social isolation) are actually the least likely to attend meetings, rendering the ATU larger than the ATT. The ATE averages over both the treated and the untreated and can thus be considered a weighted average of the two (for a thoughtful discussion of these estimands, see Greifer & Stuart, 2021).
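The relationship between these estimands can be made concrete with a toy table in which, counter to reality, both potential outcomes are known for every person. The following Python sketch uses made-up values (the `people` list and the helper `average_effect` are hypothetical and are not taken from Table 1) to compute the ATE, ATT, and ATU and to check that the ATE is the weighted average of the other two.

```python
# Hypothetical toy data: (a, y1, y0) = action received, potential outcome
# with AA attendance, potential outcome without attendance. In real data
# only one of y1/y0 is observed per person; both appear here for illustration.
people = [
    (1, 1, 0), (1, 1, 0), (1, 1, 1), (1, 0, 0),                          # attendees
    (0, 1, 0), (0, 0, 0), (0, 1, 1), (0, 0, 0), (0, 0, 0), (0, 1, 1),    # nonattendees
]

def average_effect(rows):
    """Mean of the unit-level contrast y1 - y0 over the given rows."""
    return sum(y1 - y0 for _, y1, y0 in rows) / len(rows)

treated = [row for row in people if row[0] == 1]
untreated = [row for row in people if row[0] == 0]

ate = average_effect(people)      # target population: everyone
att = average_effect(treated)     # target population: those who attended
atu = average_effect(untreated)   # target population: those who did not

# The ATE is the ATT and ATU weighted by the shares of treated and untreated.
p_treated = len(treated) / len(people)
ate_from_parts = p_treated * att + (1 - p_treated) * atu

# What observed data alone would give: a naive comparison of group means.
naive = (sum(y1 for _, y1, _ in treated) / len(treated)
         - sum(y0 for _, _, y0 in untreated) / len(untreated))

print(f"ATE={ate:.3f} ATT={att:.3f} ATU={atu:.3f} naive={naive:.3f}")
```

In this made-up table, the ATT is larger than the ATU, and the naive comparison of observed group means coincides with none of the three estimands.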
ATE, ATT, and ATU are so-called marginal effects because they aim at certain populations, thus averaging (“marginalizing”) across people who may vary on other features that can also matter for the magnitude of the causal effect. For example, women may profit more from AA than men. The resulting ATE is a weighted average over this heterogeneity and thus also depends on the gender ratio in the population. The notion of marginal effects is often conflated with the notion of causal effects, maybe because a randomized experiment will yield a marginal causal effect. However, causal effects are not limited to marginal effects; they can also be so-called conditional effects (Box 1). We focus on the marginal causal effects in the rest of the article.
Regression Coefficients, Conditional Causal Effects, and Collapsibility
The causal estimand is theoretical: researchers cannot estimate it directly because they observe only one potential outcome per individual, with the other half of the potential outcomes remaining unobserved (i.e., counterfactual). In contrast, a statistical estimand (also called an “empirical estimand”), such as the difference in mean observed outcomes between the two exposure groups (E[Y|A = 1] – E[Y|A = 0]), can be estimated. A causal estimand is identifiable if it maps onto a statistical estimand. In such a scenario, an observable quantity allows one to make statements about an unobservable quantity.
The distinction between the causal estimand and the statistical estimand allows one to define two different families of bias: identification bias and estimation bias (Díaz, 2020). Identification bias occurs when one of the assumptions necessary for identification is not met and the statistical estimand thus no longer maps onto the causal estimand. This type of bias is common to all causal methods and requires expert knowledge (Hernán et al., 2019); sometimes, it may even require one to target a different theoretical estimand altogether. Estimation bias occurs when there are modeling issues. It is method-specific and can require additional assumptions (e.g., correct model specification in ordinary least squares regression). We discuss identifiability assumptions and the resulting potential biases in the next section and turn to potential estimation bias when discussing the respective estimators.
Identifiability
Four central assumptions are necessary to map the causal estimand to a statistical estimand. 3 Exchangeability is usually the most contentious of these, and it is likely the one that psychologists are most aware of. It implies that individuals experiencing the action and individuals not experiencing the action are essentially “the same”: They had the same average risk of the outcome before experiencing the action. The two groups are thus exchangeable; the same effect estimate would have been obtained if one had swapped the action group and the control group. This assumption requires the absence of confounding and selection biases. If sources of confounding or selection bias exist, one may “control” for them by adjusting for control variables (see Box 2); this results in the modified assumption of “conditional exchangeability” (also known as “no unmeasured confounding”), which is much more relevant in practice. Exchangeability would be violated if, for example, AA attendees are, on average, more motivated to change their behavior than nonattendees and, thus, more likely to be abstinent 1 year later. Controls may also include risk factors (as defined in Box 2); these do not contribute to exchangeability but can reduce variance (Chatton et al., 2020).
Association Versus Causation

Directed acyclic graphs illustrating (A) the possible components and (B) the backdoor (dotted lines) and frontdoor (solid line) paths.
Positivity essentially means that individuals can theoretically experience all levels of the action. Structural violations of positivity occur if researchers include individuals whose probability of receiving a particular action level is zero. For example, for some people in rural areas, there may simply exist no AA-meeting opportunity. Positivity is needed for all variables required to achieve conditional exchangeability and also for any additional control variables included (e.g., to reduce variance; Chatton et al., 2020), but not beyond that (Westreich, 2020, p. 53). To use an example from Hernán and Robins (2020, p. 30), researchers do not have to ask themselves whether the probability of attending AA meetings is greater than zero for individuals with blue eyes because “having blue eyes” is (very likely) not necessary to achieve conditional exchangeability. Note that positivity can also be violated by chance, especially in small samples. For example, it may happen that in our particular sample, none of the men of a particular age group attend AA. Such violations do not threaten identifiability; however, they can result in estimation issues—we may end up with unstable estimates or may have to extrapolate in missing subgroups. Sometimes, such random violations are referred to as “sparsity,” with the term “positivity violation” exclusively used for structural violations.
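A simple cross-tabulation can flag strata of the controls in which one action level never occurs. The sketch below uses hypothetical records and variable names. It only detects empty cells in the sample at hand; deciding whether a flagged stratum reflects a structural violation or mere sparsity still requires domain knowledge.

```python
from collections import Counter

# Hypothetical records: (age_group, gender, attended_aa)
records = [
    ("18-39", "female", 1), ("18-39", "female", 0),
    ("18-39", "male", 1), ("18-39", "male", 0),
    ("40-59", "female", 1), ("40-59", "female", 0),
    ("40-59", "male", 0), ("40-59", "male", 0),   # no attendee in this stratum
    ("60+", "female", 1), ("60+", "female", 0),
]

cell_counts = Counter(records)  # counts per (age_group, gender, attended_aa)
strata = sorted({(age, gender) for age, gender, _ in records})

# A stratum is flagged if either action level is absent from it.
flagged = [s for s in strata
           if cell_counts[s + (1,)] == 0 or cell_counts[s + (0,)] == 0]
print(flagged)
```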
Consistency implies that the observed outcomes actually match (are consistent with) the potential outcomes of interest. In practice, this means that the different action levels must be well defined and be manipulable in principle. For example, our current definition of AA attendance is underspecified and may yield nonconsistency: Attending one meeting in 12 months will not lead to the same potential outcome as attending one meeting per week during the same period, but both may count as “AA attendance” unless we clarify our criteria. How specific we need to be to ensure consistency ultimately is a judgment call based on domain expertise (Hernán, 2016)—for example, we may assume that which brand of coffee is served at the AA group does not matter and thus does not need to be specified; however, at least in principle, future research could prove this assumption wrong.
Noninterference means that the outcome of an individual is not affected by the intervention assignment or the outcome of other individuals. For example, noninterference may be violated if our study includes several people living together: In such a scenario, an attendee may counsel a nonattendee, thus leading to a “spillover” of the action. Although such spillover is a nuisance when estimating action effects, it may be of interest in its own right because it leads to other (causal) research questions (Loh & Ren, 2022). Consistency and noninterference are often jointly summarized as the stable unit treatment value assumption (Rubin, 1974).
Causal Estimation as Cake Baking
The causal estimation workflow is a bit like baking a challenging cake (Fig. 2). Imagine you want to bake a cake resembling a character from a popular (noncopyrighted) children’s TV series—that is the causal estimand, the abstract goal of your efforts. On the Internet, you find a cake (the statistical estimand) that is close enough to what you imagined (identifiability). This cake comes with a recipe (the estimator), which you use to create your cake (the estimate). This is an ambitious project that will require a lot of experience and/or collaboration with a baking expert (a statistician). During the baking process (causal estimation), you may strictly follow the recipe provided, or you may adapt it to the ingredients (the data) available to you. You may also want to make other changes, such as adjusting the cooking time or temperature (varying the assumptions of the estimator). There are no guarantees that the cake you will end up with will resemble the cake you imagined, but you can still try your best.

A cooking metaphor for the causal estimation workflow.
Causal Estimators
We present two families of causal estimators that can be distinguished by their nuisance function. Although the nuisance function is not of direct interest to us, we use it to estimate the causal effect. To sustain the cake comparison, the nuisance function may be an essential part of the cake (e.g., the cake base) that must be prepared according to its own recipe.
The first family of estimators is based on propensity scores (Rosenbaum & Rubin, 1983). The corresponding nuisance function is typically denoted with e(C). This function takes as ingredient C, the set of controls—which should include all variables needed to achieve conditional exchangeability (and may include more to improve the precision of the estimate). 4 Because we model a binary action (recall that this may also be referred to as the “treatment,” the “intervention,” or the “exposure”), the function returns an individual’s propensity (i.e., probability) to experience it (A = 1), according to their values on the control variables C; this can be written as P(A = 1|C). For a brief discussion of nonbinary actions, see Box 3. For example, e(C) could be a model for AA attendance from controls (e.g., age, gender, family history of alcohol use disorders). In the next step, propensity-score-based methods use e(C) to emulate a randomized controlled trial.
Causal Estimators for Nonbinary Actions
The second family of estimators is based on g-computation. The corresponding nuisance function can be denoted with Q(A,C). The function takes both the action A and the set of controls C as ingredients and returns the probability of the outcome itself, P(Y = 1|A, C). For example, Q(A, C) could be a model predicting abstinence from both AA attendance and controls. At this point, one may wonder whether the estimation process is already finished. After all, Q(A, C) is the type of regression model from which researchers routinely take the coefficients and interpret them as causal effects. However, for most models, this works in only the simplest (linear, additive) case (see also Box 1). Here, g-computation, which (unlike standard regression) has been specifically developed for the task of causal inference (Robins, 1986), provides a much more general solution. It also works with nonlinear models and does not require coming up with clever coding schemes but instead requires an additional analysis step. In this additional step, Q(A, C) is used to estimate individuals’ outcomes in two (or more) hypothetical worlds—one in which they experience the action and one in which they do not. The contrast between their outcomes in those worlds then informs us about the causal effect.
Which of these two families is preferable: propensity-score-based methods, in which the action-allocation process (e.g., AA attendance) is modeled, or g-computation, in which the outcome (e.g., abstinence) is modeled? A first rule of thumb is to prefer propensity-score-based methods when the outcome is scarce (i.e., when almost no one in the sample is abstinent after 1 year) and g-computation when the action allocation is unbalanced (e.g., an allocation ratio of 1 attendee to 5 nonattendees); in both cases, the aim is to avoid fitting the nuisance model to a variable with too few observations in one of its categories. In general, g-computation is asymptotically more accurate (Tan, 2007). Nevertheless, each approach has its own strengths and pitfalls, which we discuss below.
Some sophisticated cakes require two different bases, and in that vein, a third family of estimation approaches combines both nuisance functions, which results in so-called doubly robust estimators. Here, we first model the action-allocation process and then make use of the resulting propensity scores when modeling the outcome. Such estimators have the desirable property that they result in a consistent estimate (i.e., theoretically unbiased in infinite samples) as long as one of the two nuisance models is correct; however, as we show below, this comes at a cost.
Propensity-score-based estimators and inverse probability weighting
The propensity score (introduced by Rosenbaum & Rubin, 1983) summarizes all observed controls into a single variable. It is a balancing score: Conditional on the correctly specified propensity score, the distribution of controls included in it is similar for individuals experiencing the action and individuals not experiencing the action. Therefore, it allows the emulation of a pseudorandomization situation to draw causal inferences. Once estimated, e(C) can be used in four different ways.
First, adjustment means that the propensity score is included as a covariate—in the very same way one would usually include individual control variables as covariates. This approach relies on strong modeling assumptions (Vansteelandt & Daniel, 2014).
Second, for stratification, the sample is divided into subgroups (strata) based on their propensity score; in the next step, the action’s effect is estimated in every single subgroup, and those estimates are combined into an overall effect estimate. This can be done only with a finite number of subgroups, and thus, people with different scores will usually end up in the same stratum, which leads to residual confounding (Lunceford & Davidian, 2004).
Third, in matching, for each individual in the action group, we pick an individual not experiencing the action who has a similar propensity score and include that individual in the control group. Simply comparing these two groups then yields an estimate of the action’s effect. Some authors have argued against the usage of matching for reasons such as covariate imbalance, inefficiency, model dependence, and bias (King & Nielsen, 2019). However, matching remains a popular approach, with the central advantage that it results in a situation comparable with a randomized experiment with exchangeable groups. There are already excellent sources introducing psychologists to matching (Chan et al., 2022; Stuart, 2010), which is why we do not cover the topic in more depth.
This leaves us with, fourth, weighting; more specifically, inverse probability weighting (IPW; Robins et al., 2000). This approach appears to be less biased and more precise than matching according to simulation studies (Chatton et al., 2020; and references therein). In IPW, the idea is to generate a pseudosample in which the groups are exchangeable. Rather than actually picking individuals to be included in the groups (as is done in matching), here, one assigns weights to each individual, which determines how much they “contribute” to the analysis. Box 4 summarizes the IPW recipe.
Inverse Probability Weighting Recipe
The individual weights are determined as a function of e(C), the propensity score. Weights can be calculated in different ways, which allows us to estimate effects for different target populations, including the entire population (ATE), the treated population (ATT), or the untreated population (ATU). Table 2 displays some of the weighting schemes, and the companion R notebook illustrates how they work in practice. These different weighting schemes render IPW the most flexible propensity-score-based approach. Once the weights have been computed, they can be stabilized. Stabilized weights preserve the sample size of the observed sample and avoid some estimation issues (e.g., variance inflation, Xu et al., 2010; or random violations of positivity, Robins et al., 2000). This is especially true for continuous actions (as described in Box 3). To evaluate the precision of the results, it might be helpful to calculate the so-called effective sample size, which is the size of an unweighted sample yielding the same precision as the weighted pseudosample (McCaffrey et al., 2004). In other words, it estimates the number of comparable individuals between the groups.
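To illustrate how weights follow from the propensity score, the Python sketch below implements commonly used formulas for ATE, ATT, ATU, and overlap weights, stabilizes the ATE weights, and computes an effective sample size. The formulas are the standard textbook versions and are assumed, not confirmed, to match the article's weighting-scheme table. The data, the function names, and the use of Kish's approximation for the effective sample size are likewise illustrative assumptions.

```python
# Hypothetical pseudo-data: (a, c, e) = action, one binary control, and a
# propensity score e = P(A = 1 | C) that is assumed to be already estimated.
rows = ([(1, 1, 0.8)] * 8 + [(0, 1, 0.8)] * 2
        + [(1, 0, 0.2)] * 2 + [(0, 0, 0.2)] * 8)

def ipw_weight(a, e, target="ATE"):
    """Standard weighting formulas (assumed, see lead-in)."""
    if target == "ATE":
        return a / e + (1 - a) / (1 - e)
    if target == "ATT":
        return a + (1 - a) * e / (1 - e)
    if target == "ATU":
        return a * (1 - e) / e + (1 - a)
    if target == "overlap":
        return a * (1 - e) + (1 - a) * e
    raise ValueError(f"unknown target population: {target}")

def effective_sample_size(weights):
    """Kish's approximation: (sum of weights)^2 / sum of squared weights."""
    return sum(weights) ** 2 / sum(w * w for w in weights)

def weighted_mean_control(rows, weights, arm):
    """Weighted mean of the control c within one action group."""
    num = sum(w * c for (a, c, _), w in zip(rows, weights) if a == arm)
    den = sum(w for (a, _, _), w in zip(rows, weights) if a == arm)
    return num / den

ate_w = [ipw_weight(a, e, "ATE") for a, _, e in rows]

# Stabilized ATE weights: multiply by the marginal probability of the action
# level actually received, so that the weights sum to the sample size.
p_a1 = sum(a for a, _, _ in rows) / len(rows)
stab_w = [w * (p_a1 if a == 1 else 1 - p_a1)
          for w, (a, _, _) in zip(ate_w, rows)]

print(sum(ate_w), sum(stab_w), effective_sample_size(ate_w))
print(weighted_mean_control(rows, ate_w, 1), weighted_mean_control(rows, ate_w, 0))
```

In this toy example, the control is strongly imbalanced before weighting (80% vs. 20% with C = 1 in the two groups), whereas the weighted means of the control coincide in the pseudosample.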
Examples of Weighting Schemes and Their Targeted Population
Note: A = action, treatment; ATE = average treatment effect on the entire population; ATT = average treatment effect on the treated; ATU = average treatment effect on the untreated.
Overlap weights were suggested by F. Li et al. (2019) as a solution to extreme propensity scores, which we discuss in the section Iffy Identifiability.
Next, to obtain an estimate of the causal effect, a weighted regression (called “marginal structural model” [MSM]) modeling the outcome of interest is fitted. If the identifiability assumptions are met, this is in fact a model of the potential outcomes. The coefficient of the action in the MSM corresponds to a specific contrast of the potential outcomes, defined by the type of regression (Schnitzer et al., 2020). For example, if we run a linear regression for a binary outcome, 5 the coefficient will give us the risk difference (“Attending AA increases the risk of abstinence by 30 percentage points”), a log-linear regression will give us the risk ratio (“Attending AA increases the risk of abstinence by a factor of 1.75”), and a logistic regression gives us an odds ratio (“Attending AA increases the odds of abstinence by a factor of 3.5”; all of these numbers reflect valid causal effects defined by different causal contrasts). To quantify the uncertainty of this estimate (e.g., to compute the standard error), one can use a so-called robust sandwich-type matrix or a bootstrap approach (for an introduction to bootstrapping, see Rousselet et al., 2021). According to recent simulation studies, bootstrapping seems more accurate (Austin, 2016, 2022) and yields valid inferences by considering both uncertainties in the propensity score and in the MSM (Berk et al., 2013). A Bayesian approach is also possible (Spertus & Normand, 2018).
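The three contrasts can be sketched in a few lines of Python. With a single binary action as the only predictor, the weighted regression is saturated, so its coefficient equals the corresponding contrast of the weighted group means; the sketch therefore computes these means directly instead of calling a regression routine. The cell counts and function names are hypothetical, and the counts were chosen so that the weighted risks come out at 70% and 40%, as in the running example.

```python
# Hypothetical cell counts: (attended, control, n_people, n_abstinent).
cells = [
    (1, 1, 80, 64),
    (1, 0, 20, 12),
    (0, 1, 20, 10),
    (0, 0, 80, 24),
]

def propensity(c):
    """P(A = 1 | C = c), estimated as the share of attendees in stratum c."""
    n_attend = sum(n for a, cc, n, _ in cells if cc == c and a == 1)
    n_all = sum(n for _, cc, n, _ in cells if cc == c)
    return n_attend / n_all

def weighted_risk(arm):
    """ATE-weighted proportion abstinent in one action group of the pseudosample."""
    num = den = 0.0
    for a, c, n, k in cells:
        if a != arm:
            continue
        e = propensity(c)
        w = 1 / e if a == 1 else 1 / (1 - e)
        num += w * k
        den += w * n
    return num / den

risk1, risk0 = weighted_risk(1), weighted_risk(0)

risk_difference = risk1 - risk0                               # linear model
risk_ratio = risk1 / risk0                                    # log-linear model
odds_ratio = (risk1 / (1 - risk1)) / (risk0 / (1 - risk0))    # logistic model

# Unweighted comparison of the observed groups, for reference.
naive_difference = (64 + 12) / 100 - (10 + 24) / 100

print(round(risk_difference, 3), round(risk_ratio, 3), round(odds_ratio, 3))
```

The sketch yields point estimates only; standard errors would still have to come from a sandwich-type estimator or a bootstrap that repeats both the propensity-score step and the weighting step.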
The goal of the weighting procedure is to balance the controls between the two groups (e.g., to make AA attendees and nonattendees comparable on the relevant third variables; West et al., 2014); whether such a balance has been achieved can be checked. Franklin et al. (2014) suggested 10 metrics for checking the balance of the pseudosample. Among them, the standardized mean difference has been reported as the most accurate (Ali et al., 2014); a value lower than or equal to 10% is considered acceptable (Ali et al., 2015; for the formulas, see Austin & Stuart, 2015). The other metrics can also be used, but they need at least 1,000 individuals, according to Ali et al. (2014). There is no point in running statistical tests to compare individual control variables between the two action groups because the resulting
G-computation
G-computation has its roots in so-called stratification and standardization, the process of splitting up the sample into subgroups (strata), calculating the metric of interest in each group, and then reweighting the group-specific metrics to match, for example, the general population. This used to be a common approach to control for confounders in observational studies, dating back as far as the mid-19th century (Neison, 1844), before computationally more demanding methods took hold (for a historical perspective, see Keiding & Clayton, 2014). Robins (1986) extended the logic of standardization to the so-called g(eneral)-formula for estimating causal effects, which allows for incorporating time-dependent confounding within the potential-outcomes framework. It is thus suitable for longitudinal data, but here we consider the time-fixed setting to simplify explanations.
The idea behind the g-formula is to estimate the probability of the outcome (e.g., abstinence) under a hypothetical action (e.g., AA attendance or nonattendance)—in other words, we are trying to estimate the probability of the potential outcomes:
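In its standard time-fixed form, the g-formula for a binary outcome can be written as follows. The notation is reconstructed here from the worked example in this section and should be read as an assumption: a denotes the action level of interest and c a combination of control values.

```latex
P(Y^{a} = 1) = \sum_{c} P(Y = 1 \mid A = a, C = c) \, P(C = c)
```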
In words, the probability of the (potential) Outcome 1 under the action
Consider a simple scenario with two controls with two levels: family history of alcohol abuse (yes/no) and gender (female/male). Cross-tabulating the outcome (abstinence) for these two controls for (a) the whole sample and separately for (b) AA attendees and (c) nonattendees (Table 3) gives us all the information we need to apply the g-formula. To obtain the probability of abstinence under attendance, for each of the four subgroups, we simply multiply the fraction abstinent among the attendees (middle part of Table 3) by the fraction that the subgroup makes up in the whole sample (left part of Table 3) and then add up the numbers: P(Y1 = 1) = .80 × .15 + .625 × .35 + .80 × .15 + .686 × .35 ≈ .70. Repeating the same steps with the fractions abstinent among the nonattendees (right part of Table 3) gives us the probability of abstinence under nonattendance, P(Y0 = 1) = .60 × .15 + .267 × .35 + .50 × .15 + .40 × .35 ≈ .40. Thus, under the identifiability assumptions spelled out above, attendance increases the probability of abstinence by 30 percentage points: from 40% to 70%.
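The arithmetic of this worked example can be reproduced with a short Python sketch. The numbers are copied from the calculation above; the variable names and the helper `g_formula` are hypothetical, and the four subgroups are kept as unlabeled positions because the mapping to family history and gender is not restated in the text.

```python
# Share of each of the four subgroups in the whole sample.
share = [0.15, 0.35, 0.15, 0.35]

# Fraction abstinent per subgroup, among attendees and among nonattendees.
p_abstinent_attendees = [0.80, 0.625, 0.80, 0.686]
p_abstinent_nonattendees = [0.60, 0.267, 0.50, 0.40]

def g_formula(p_outcome_by_stratum, stratum_share):
    """Standardize stratum-specific outcome fractions to the whole sample."""
    return sum(p * s for p, s in zip(p_outcome_by_stratum, stratum_share))

p_y1 = g_formula(p_abstinent_attendees, share)      # everybody attends
p_y0 = g_formula(p_abstinent_nonattendees, share)   # nobody attends
risk_difference = p_y1 - p_y0

print(round(p_y1, 2), round(p_y0, 2), round(risk_difference, 2))
```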
Hypothetical Data for the Example of Alcoholics Anonymous Attendance (AA) and Abstinence, Including Two Controls (Family History of Alcohol Abuse, Gender),
The g-formula essentially allows us to place ourselves in counterfactual worlds in which everybody or nobody attended AA. 7 It is nonparametric because it does not assume any functional form for the relationships between variables. In our simple scenario, this works well because we have only two controls with two levels, resulting in four subgroups. But things quickly get out of hand if we add more (categorical) controls—leading to an exponential increase in the number of subgroups (so-called curse of dimensionality)—and/or if we add continuous controls. Thus, for realistic scenarios in which more than just a few controls are necessary to achieve conditional exchangeability, we need g-computation, a model-based extension of the g-formula proposed by Robins (1986). 8
G-computation (Box 5) is an attempt to emulate the time machine described at the beginning of this article. It aims to model two counterfactual worlds, one in which everybody who meets our inclusion criteria attends AA and one in which nobody does, and predict each individual’s outcome in these worlds. The first step consists of fitting the nuisance function Q(A,C) with AA attendance and all controls needed to achieve conditional exchangeability. Here, we use everybody’s observed characteristics, including their observed AA attendance. For example, Q(A, C) could be a logistic regression, in which case this first step yields the coefficients of the predictor AA attendance and of the controls. In the second step, we create two hypothetical worlds: one in which everybody attends AA and one in which nobody attends AA. To do so, we simply copy the data twice and set the action variable to 1 (world of attendance) or 0 (world of nonattendance) for everybody, keeping their controls at the originally observed levels. We then use the coefficients from the first step to predict the (potential) outcomes in the two worlds. For each individual, we now have the individual’s outcome probability for the scenario in which the individual attends and for the scenario in which the individual does not attend.
We can take the difference between these probabilities to compute the individual-level causal effects, and we can calculate the ATE by averaging over individuals. Alternatively, we can first average the potential outcome probabilities and then compute a wider range of causal effects (e.g., the odds ratio). Other causal estimands are also easily computable from the predicted potential outcomes. Again, we usually do not want only a point estimate but also some way to quantify its uncertainty (e.g., to compute the standard error). Here, bootstrap approaches and the so-called delta method are frequently used (although a specific variance estimator also exists; Zou, 2009). A Bayesian approach is also possible (e.g., Keil, Daza, et al., 2018; for an applied example close to psychology, see also Rohrer et al., 2021), in which case, the posterior distribution of the parameter of interest provides for a straightforward quantification of uncertainty.
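The steps described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation from the companion notebook: the data are simulated, Q(A,C) is a main-effects logistic regression fitted with a small hand-written Newton-Raphson routine (`fit_logistic` and `solve` are hypothetical helpers introduced here), and uncertainty quantification is omitted.

```python
import math
import random

random.seed(2)

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def solve(A, b):
    """Solve the small linear system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [v - f * w for v, w in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_logistic(X, y, n_iter=15):
    """Maximum-likelihood logistic regression via Newton-Raphson."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for xi, yi in zip(X, y):
            mu = sigmoid(sum(b * x for b, x in zip(beta, xi)))
            for j in range(p):
                grad[j] += (yi - mu) * xi[j]
                for k in range(p):
                    hess[j][k] += mu * (1.0 - mu) * xi[j] * xi[k]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta

def predict(beta, X):
    return [sigmoid(sum(b * x for b, x in zip(beta, xi))) for xi in X]

# Hypothetical data: one binary control, one continuous control, binary action and outcome.
data = []
for _ in range(1500):
    c1 = float(random.random() < 0.5)
    c2 = random.gauss(0.0, 1.0)
    a = float(random.random() < sigmoid(-0.5 + 0.8 * c1 + 0.5 * c2))
    y = float(random.random() < sigmoid(-1.0 + 1.0 * a + 0.7 * c1 + 0.5 * c2))
    data.append((c1, c2, a, y))
n = len(data)

# Step 1: fit Q(A,C) on the observed data (intercept, action, controls).
beta = fit_logistic([[1.0, a, c1, c2] for c1, c2, a, _ in data],
                    [y for _, _, _, y in data])

# Step 2: copy the data twice, setting the action to 1 or 0 for everybody.
world1 = [[1.0, 1.0, c1, c2] for c1, c2, _, _ in data]
world0 = [[1.0, 0.0, c1, c2] for c1, c2, _, _ in data]

# Step 3: predict the potential outcome probabilities in both worlds.
p1 = predict(beta, world1)
p0 = predict(beta, world0)

# Step 4: average individual-level differences (ATE), or average first
# and then form another contrast such as the marginal odds ratio.
ate = sum(u - v for u, v in zip(p1, p0)) / n
m1, m0 = sum(p1) / n, sum(p0) / n
odds_ratio = (m1 / (1.0 - m1)) / (m0 / (1.0 - m0))

print(f"ATE: {ate:.3f}; marginal odds ratio: {odds_ratio:.2f}")
```

For a bootstrap interval, the four steps would be repeated on resampled copies of the data; the whole procedure, including the fit of Q(A,C), has to be rerun within each resample.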
The recipe presented in Box 5 can be varied at multiple points. For example, instead of predicting both counterfactual worlds for all individuals in Step 3, we may instead predict only the unobserved outcome (e.g., the outcome without action for those individuals who did in fact receive the action) and keep the observed outcomes untouched to improve accuracy (Westreich et al., 2015). In Step 1, we can also fit one nuisance model per action group, Q(A = 1,C) and Q(A = 0,C), for predicting the counterfactual outcome (Künzel et al., 2019). Fitting two nuisance models means that we do not have to explicitly model interactions between the action and controls; however, this approach is sensitive to data-set shift when predicting the potential outcomes: A nuisance model fitted only on one action group may have poor predictive performance when applied to another action group because they differ too much (Finlayson et al., 2021). In Step 4, to estimate the causal effect, we can also regress the counterfactual predictions on the action in an MSM (Snowden et al., 2011).
G-Computation Recipe
In contrast to propensity-score-based methods, g-computation does not require the assumption of balance between groups because it holds “by design” between the two counterfactual worlds. However, the flip side of this is that we can demonstrate balance on measured controls only when we use propensity-score-based methods. Such a demonstration can, in turn, convince both researchers and readers that bias due to measured controls has been removed. A similar trade-off arises for positivity. G-computation may be able to simply extrapolate over missing strata; propensity-score-based methods, in contrast, allow us to check for extreme propensity scores and thus notice positivity violations (or a lack thereof).
Doubly robust standardization
Both propensity-score-based methods and g-computation require the correct specification of their respective nuisance model, e(C) and Q(A,C). This means that the models have to approximate the true data-generating process (either of the assignment of the action, e(C), or of the outcome, Q(A,C)) as closely as possible to result in valid inferences. However, because of the complexity of the real world, a correct specification is unlikely, and as a result, estimates can be biased (van der Laan & Rose, 2011, p. 9). Doubly robust estimators provide a partial solution to this problem by combining both nuisance models; they give us two shots to get things right: As long as one of the nuisance models is correctly specified, the resulting estimate does not suffer from misspecification bias. 9 However, this doubly robust property comes at a cost: Although (systematic) bias may be reduced, variance increases in comparison with g-computation (Tan, 2007); thus, we face a bias-variance trade-off (Pargent et al., 2023).
There are different ways to combine the nuisance models e(C) and Q(A,C), resulting in various doubly robust estimators. Here, we focus on the one that we consider most intuitive: doubly robust standardization (DRS; Robins et al., 2007). Recall that IPW aims to balance the AA attenders and nonattenders on the controls so that the analysis emulates a randomized trial. If e(C) is misspecified, this emulation fails, and some residual confounding remains. DRS tackles this residual confounding by adding a g-computation step after the IPW (Box 6). An alternative way to think about DRS is to consider that it is easier to model the counterfactual worlds with g-computation from a randomized trial (even if it is imperfectly emulated) rather than from scratch because some confounding has already been removed. All variations of IPW and g-computation described above can be applied to DRS. Again, bootstrapping (for the whole process, i.e., for both the IPW and g-computation steps) or the delta method can be used to quantify the uncertainty in the resulting point estimate. When both nuisance models are misspecified, some doubly robust estimators are actually more biased than either IPW or g-computation (Kang & Schafer, 2007); fortunately, DRS is not affected by this bias-amplification phenomenon (Chatton et al., 2022).
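One way to read the description above is sketched below in Python: a propensity model e(C) yields inverse probability weights, and the outcome model Q(A,C) is then fitted in the weighted pseudo-population before both counterfactual worlds are predicted. The simulated data, the unstabilized weights, the main-effects logistic models, and the helper functions (`fit_logistic`, `solve`, `predict`) are all assumptions made for this illustration; it is not a reproduction of the recipe in the box or of the companion notebook.

```python
import math
import random

random.seed(3)

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def solve(A, b):
    """Solve the small linear system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [v - f * w for v, w in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_logistic(X, y, w=None, n_iter=15):
    """Newton-Raphson logistic regression with optional observation weights."""
    w = w if w is not None else [1.0] * len(y)
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for xi, yi, wi in zip(X, y, w):
            mu = sigmoid(sum(b * x for b, x in zip(beta, xi)))
            for j in range(p):
                grad[j] += wi * (yi - mu) * xi[j]
                for k in range(p):
                    hess[j][k] += wi * mu * (1.0 - mu) * xi[j] * xi[k]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta

def predict(beta, X):
    return [sigmoid(sum(b * x for b, x in zip(beta, xi))) for xi in X]

# Hypothetical data: two controls, binary action, binary outcome.
data = []
for _ in range(1500):
    c1 = float(random.random() < 0.5)
    c2 = random.gauss(0.0, 1.0)
    a = float(random.random() < sigmoid(-0.5 + 0.8 * c1 + 0.5 * c2))
    y = float(random.random() < sigmoid(-1.0 + 1.0 * a + 0.7 * c1 + 0.5 * c2))
    data.append((c1, c2, a, y))
n = len(data)
actions = [a for _, _, a, _ in data]
outcomes = [y for _, _, _, y in data]

# IPW step: fit e(C) and derive (unstabilized) inverse probability weights.
controls = [[1.0, c1, c2] for c1, c2, _, _ in data]
e_hat = predict(fit_logistic(controls, actions), controls)
weights = [1.0 / e if a == 1.0 else 1.0 / (1.0 - e) for e, a in zip(e_hat, actions)]

# G-computation step: fit Q(A,C) in the weighted pseudo-population,
# then predict both counterfactual worlds for everybody.
q_beta = fit_logistic([[1.0, a, c1, c2] for c1, c2, a, _ in data], outcomes, w=weights)
p1 = predict(q_beta, [[1.0, 1.0, c1, c2] for c1, c2, _, _ in data])
p0 = predict(q_beta, [[1.0, 0.0, c1, c2] for c1, c2, _, _ in data])
ate_drs = sum(u - v for u, v in zip(p1, p0)) / n

print(f"DRS estimate of the ATE: {ate_drs:.3f}; largest weight: {max(weights):.1f}")
```

Printing the largest weight alongside the estimate is a cheap reminder that extreme propensity scores remain a concern for this estimator.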
Doubly Robust Standardization Recipe
What Could Possibly Go Wrong?
Iffy identifiability
Any causal-estimation effort can succeed only if the statistical estimand actually corresponds to the causal estimand, and as explained earlier, this requires assumptions: exchangeability (an absence of confounding and collider bias, see Box 2), positivity (nonextreme probabilities of ending up in either action group) with respect to the controls included to achieve exchangeability, consistency (potential outcomes correspond to the observed outcomes), and noninterference (outcome of an individual is not affected by the outcome or action of another individual). Any of these assumptions can fail, leading to biased estimates. Conversely, inferences can be strengthened by trying to render these assumptions more plausible.
Recent reviews in social sciences suggest that the inclusion of controls to achieve exchangeability is often insufficiently justified (Bernerth & Aguinis, 2016; Kohler et al., 2023), leaving a lot of room for improvement. Wysocki et al. (2022) suggested spelling out several plausible causal structures and selecting the controls as the minimal set blocking all backdoor paths. The selection of controls can also be achieved by data-driven procedures (for such an approach introduced in psychology, see Loh & Ren, 2023a)—however, such procedures in themselves are unaware of the underlying causal structure, and they thus need to be combined with existing domain knowledge to achieve exchangeability. As spelled out before, doubly robust estimators may offer advantages here because even if a confounder is missing in one nuisance model, groups remain conditionally exchangeable if it is present in the other nuisance model (Chatton et al., 2022). However, the (erroneous) inclusion of a mediator as a control in either Q(A,C) or e(C) withdraws the doubly robust property of DRS and increases the resulting bias compared with IPW or g-computation (Keil, Mooney, et al., 2018). Regardless of the estimation approach used, concerns regarding exchangeability call for robustness checks. So-called sensitivity analyses try to assess to which extent estimates may be biased because of unobserved confounding. Although these analyses provide no guarantees, they help gauge how worried one should be about the robustness of the results. X. Zhang et al. (2020) provided a review of modern statistical methods and suggested a specific order of steps to evaluate the impact of potential unmeasured confounders.
Although exchangeability always requires a leap of faith—one can never be completely certain that there is no unobserved confounding—positivity can be checked empirically. The classic approach here involves checking whether the estimated propensity scores include extreme values. We recommend using PoRT, a tree-based algorithm recently developed by Danelian et al. (2023), because it can be used with all estimators, does not require assumptions about the data-generating process, and clearly identifies the target population. If a violation of positivity is structural, the target population must be redefined—one cannot estimate the effect of the action in the subgroup that would never experience the action.
Random violations of positivity result in estimation issues that are especially harmful when using IPW because they result in extreme weights and, thus, outsized influence of individual observations on the results. They can be addressed in various manners. Several authors have proposed to trim propensity scores (i.e., to remove observations with extreme values) or to truncate them (i.e., to set all values exceeding a certain threshold to a fixed value). A recent simulation study suggests that a threshold of
As previously discussed, g-computation can be less sensitive to random violations of positivity because it allows for extrapolation over the missing strata while still targeting the initial estimand (Léger et al., 2022). However, if not combined with a diagnostic tool such as PoRT, positivity violations can remain unnoticed when doing g-computation. DRS can also extrapolate in the missing strata, although extreme propensity scores remain an issue. And for any such extrapolation to succeed, Q(A,C) needs to be specified correctly (Robins et al., 2007).
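To make the distinction between trimming and truncating propensity scores concrete, the sketch below applies both to a handful of made-up scores. The bounds of 0.05 and 0.95 are arbitrary placeholders chosen only for illustration (they are not a recommended threshold), and the function names are hypothetical.

```python
def ipw_weights(actions, scores):
    """Unstabilized inverse probability weights from propensity scores."""
    return [1.0 / e if a == 1 else 1.0 / (1.0 - e) for a, e in zip(actions, scores)]

def truncate(scores, lower, upper):
    """Truncation: set scores beyond the bounds to the bounds; nobody is dropped."""
    return [min(max(e, lower), upper) for e in scores]

def trim(actions, scores, lower, upper):
    """Trimming: remove observations whose scores fall outside the bounds."""
    keep = [i for i, e in enumerate(scores) if lower <= e <= upper]
    return [actions[i] for i in keep], [scores[i] for i in keep]

# Made-up actions and estimated propensity scores, including extreme values.
actions = [1, 0, 1, 0, 1, 0]
scores = [0.01, 0.30, 0.60, 0.99, 0.45, 0.02]
LOWER, UPPER = 0.05, 0.95   # arbitrary illustrative bounds

raw_w = ipw_weights(actions, scores)
trunc_w = ipw_weights(actions, truncate(scores, LOWER, UPPER))
trim_actions, trim_scores = trim(actions, scores, LOWER, UPPER)
trim_w = ipw_weights(trim_actions, trim_scores)

print(f"largest raw weight: {max(raw_w):.1f}")
print(f"largest weight after truncation: {max(trunc_w):.1f}")
print(f"observations kept after trimming: {len(trim_w)} of {len(actions)}")
```

Truncation keeps every observation but caps its influence, whereas trimming discards observations and thereby changes who is included in the analyzed sample.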
Finally, as already mentioned in the beginning, a lack of consistency suggests that the research question of interest must be redefined. And a lack of noninterference—in other words, interference—leads to its own estimands and methods. For example, Tchetgen Tchetgen et al. (2021) provided an extension of g-computation for causal effects on networks of connected units; for a discussion of the use of propensity scores in the presence of interference, see B. Zhang, Hudgens, & Halloran (2023).
Here in the real world
Model misspecification and machine learning
Beyond nonidentifiability, applied researchers must deal with other sources of bias. For example, to avoid estimation bias, nuisance models must be specified correctly. This means that relevant interactions between predictors must be included, and functional forms need to be correct. This is again a step for which expert knowledge is helpful, but here, machine-learning approaches also hold some promise (Le Borgne et al., 2021; Pirracchio et al., 2015). There is no theoretical proof that one can simply combine machine learning with bootstrapping to arrive at valid inferences; therefore, some doubly robust estimators have been specifically designed to incorporate machine-learning approaches. Among these are augmented IPW (Glynn & Quinn, 2010), the targeted maximum likelihood estimator (also known as the targeted minimum loss-based estimator; van der Laan & Rose, 2011), the collaborative targeted maximum likelihood estimator (van der Laan & Gruber, 2010), and double/debiased machine learning (Chernozhukov et al., 2018). These estimators should be viewed as complete frameworks for causal inference and are built specifically for the estimation problem at hand. Their implementation is far more complex than that of, for example, DRS, and requires knowledge of semiparametric estimation theory (Díaz, 2020). Returning to our cake comparison somewhat belatedly, these estimators belong to the realm of haute cuisine.
Missing data
One of the most common issues in practice is missing data. Multiple introductions to the different types of missing data and how to deal with them can be found in the literature (e.g., Hayes & Enders, 2023; for graphical representation of missing data problems, which highlight the causal nature of the resulting inferential problems, see Thoemmes & Mohan, 2015), so here we only briefly touch on the topic with a special focus on the estimators introduced earlier. The most “convenient” approach to missing data involves simply tossing away any incomplete observations, which results in so-called complete case analysis. In some types of models, this can result in unbiased results as long as the chance of being a complete case does not depend on the outcome after taking covariates into consideration (Hughes et al., 2019). But it is still generally discouraged because first, it inadvertently targets a complete-cases population that differs from the intended target population (decreased external validity), and second, even for this complete-cases population, effect estimates can be biased because the missingness can induce new noncausal associations (decreased internal validity). Mathur (2023) provided sensitivity analyses to gauge how sensitive estimates from complete-cases analyses are in different situations.
Another way to handle missing values is multiple imputation, which uses observed variables to create multiple plausible imputations of the missing values. These imputed data sets are then analyzed, and the results are pooled across them. When the missingness of the outcome can be explained by controls that have been measured, multiple imputation can result in unbiased estimates. However, there are also scenarios in which multiple imputation may yield biased results, for example, in the presence of effect modification for propensity-score-based methods (Choi et al., 2019), and there may even be scenarios in which it performs worse than complete case analysis with adequate controls (Hughes et al., 2019). There is also a lack of literature on how to combine multiple imputation with g-computation, and multiple imputation can become particularly time-consuming when combined with bootstrapping or machine-learning approaches. Alternatively, one can fit Q(A,C) on the complete cases only but then use it to predict the potential outcomes for all individuals (Breger et al., 2020; Westreich et al., 2015; for an extension in longitudinal settings, see Bartlett et al., 2023). Another approach involves weighting—here, weights are created that are inverse to the probability of missingness, and these are applied much in the same ways as IPW weights (for an overview of the alternative use of IPW, see Box 7). For DRS, all the approaches mentioned here can be employed or even combined.
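The weighting approach to missing outcomes can be illustrated with a small simulation. In this sketch, each complete case is weighted by the inverse of its estimated probability of being observed, and that probability is assumed to depend only on a single measured binary control; the data, variable names, and numeric values are hypothetical.

```python
import random

random.seed(4)

# Hypothetical data: binary control c, binary outcome y, and an indicator of
# whether y was observed. Missingness depends on c only (an assumption).
data = []
for _ in range(4000):
    c = int(random.random() < 0.5)
    y = int(random.random() < 0.3 + 0.4 * c)
    observed = random.random() < (0.5 if c == 1 else 0.9)
    data.append((c, y, observed))

# Estimate the probability of being observed within each level of c.
p_obs = {}
for level in (0, 1):
    group = [obs for c, _, obs in data if c == level]
    p_obs[level] = sum(group) / len(group)

# Weight each complete case by the inverse of that probability.
complete = [(c, y) for c, y, obs in data if obs]
weights = [1.0 / p_obs[c] for c, _ in complete]

full_mean = sum(y for _, y, _ in data) / len(data)       # unavailable in practice
cc_mean = sum(y for _, y in complete) / len(complete)    # complete-case analysis
ipw_mean = sum(w * y for w, (_, y) in zip(weights, complete)) / sum(weights)

print(f"full data: {full_mean:.3f}; complete cases: {cc_mean:.3f}; weighted: {ipw_mean:.3f}")
```

In this setup, the complete-case mean is pulled toward the group whose outcomes are more often observed, and the weights restore the original composition of the sample. The same logic fails if missingness depends on unmeasured variables or on the outcome itself.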
Alternative Forms of Inverse Probability Weighting
For missing values on the controls, it is possible to add a missingness indicator among the controls or to apply specific schemes of multiple imputation (Blake et al., 2020; Leyrat et al., 2019; J. Zhang, Dashti, et al., 2023). Furthermore, some machine-learning approaches (e.g., random forest; Strobl et al., 2009) have built-in ways to handle missing controls.
Measurement error
Another source of bias is measurement error. How measurement error affects results depends on the underlying causal net, that is, on what causes the deviation between the true value and the observed value of a variable (Hernán & Cole, 2009; van Bork et al., 2022). But in almost all scenarios, measurement error will introduce bias. Thus, high-quality data are crucial to minimize the risk of such biases upfront. This is a particular concern for psychological constructs because reliability may often be modest; for example, failing to account for measurement error in confounding constructs can lead to high rates of mistaken conclusions (Westfall & Yarkoni, 2016).
Causal approaches to correct measurement errors have mainly been developed for propensity-score-based methods. First, considering measurement error in controls, we found that in the epidemiological literature, Rudolph and Stuart (2018) reviewed three ways to deal with an error-prone control for various measurement-error structures using existing sensitivity analyses; in the psychometrics literature, Hong et al. (2017) suggested a Bayesian approach. Blackwell et al. (2017) suggested a multiple-imputation-like approach, called “multiple overputation,” to handle multiple error-prone controls. In general, if the true values of the mismeasured controls are strongly correlated, this will reduce the bias of the estimated effects. However, if the measurement errors of the controls are correlated, this will actually render the bias worse (Hong et al., 2019). Second, considering measurement error in the action, this can be handled through an instrumental-variable procedure (Gustafson, 2007), a regression-calibration-based adjustment (Wu et al., 2019), or a two-step estimation process relying on validation data (Braun et al., 2017). Third, Shu and Yi (2019) discussed causal estimation with an error-prone continuous or binary outcome. In contrast, the literature on g-computation with measurement errors is much more scarce (Blette, 2021); Shu and Yi (2019) proposed a doubly robust estimator.
In psychology, latent variable modeling is the predominant approach to take into account measurement error, and there have been various efforts to explicitly apply it to causal inference. Structural equation modeling (SEM) in particular was originally developed for causal inference (Pearl, 2012), and there have been newer efforts to use SEM to estimate conditional and average effects, taking into account both latent controls and latent outcomes (Mayer et al., 2016). But other ways to combine latent variable modeling and causal inference have also been explored; for example, Lanza et al. (2016) combined IPW with latent class modeling to estimate the effects of depression on substance use (conceived as a latent class). Note that whether or not latent variable modeling “solves” the problem of measurement error crucially depends on whether the assumed measurement model is correct—for example, if a common factor model is mistakenly assumed, the bias that is introduced may sometimes be worse than the measurement bias that is supposed to be removed (Rhemtulla et al., 2020). There have been fairly recent efforts to think about measurement from the viewpoint of causality, both within psychology (van Bork et al., 2022) and within epidemiology (Hernán & Cole, 2009; VanderWeele, 2022), highlighting how this is an area of active conceptual development.
Quantifying the magnitude of our errors
Data and models are, of course, never perfect, and thus, some bias is inevitable. Here, the epidemiological framework of quantitative bias analysis is helpful (for an introduction and best practices, see Lash et al., 2014), which tries to gauge the direction and magnitude of one’s errors. Another helpful framework is the so-called target-trial framework, which spells out an idealized experiment that, in turn, can guide both study planning and data analysis. Bulbulia (2023) provided an illustration of this framework in psychology, trying to answer the question of whether religious service attendance reduces anxiety. Many other sources of bias and ways to avoid them are summarized in Wulff et al. (2023).
Outlook: Other Cakes to Bake
The estimators we have introduced are quite versatile and can be extended in various ways. For example, our focus has been on internal validity (correctness of the results for the targeted population); however, one could also be interested in questions of external validity. This may involve questions about generalizability (is the effect estimate valid for a broader population?) and transportability (can we draw conclusions about the causal effect in different settings or for different populations?; see also Deffner et al., 2022). All the estimators we presented here can be used to address such research questions if they are applied within a transportability framework (Lesko et al., 2017).
Furthermore, here we were interested in the occurrence of the outcome at a specified time point, such as abstinence after 1 year. But we may also be interested in the outcome’s occurrence in time; for example, we may ask whether AA attendance has an effect on the timing of a relapse. Chatton et al. (2022) discussed the particularity of the estimands in this context and proposed an extension of the estimators presented here.
Our focus was on marginal effects that average over groups of people. But sometimes other estimands may be more relevant—for example, one may be interested in heterogeneous causal effects (Bryan et al., 2021) or may want to disentangle indirect and direct effects in the context of mediation analysis. Pósch (2021) illustrated the use of g-computation in this context.
Finally, going beyond cross-sectional data, in a longitudinal setting, both the action and confounders may vary over time. One common issue in this context is confounder-action feedback. For example, imagine we had monthly data spanning 1 year and were interested in the effect of attending AA every month, as opposed to never, on abstinence at the end of the year. Attending AA in a given month may affect subsequent social isolation, and social isolation may, in turn, affect both subsequent AA attendance and abstinence. Thus, social isolation is both an outcome of the action and a confounder, which leaves us in a bad spot: If we statistically adjust for it, we may accidentally induce collider bias; if we do not statistically adjust for it, we are stuck with confounding bias. Traditional methods fail to handle such confounder-action feedback, and so we need the longitudinal extension of the estimators presented here (Hernán & Robins, 2020; for an introduction to g-computation in a longitudinal context for psychologists, see Loh & Ren, 2023c). Two recent articles targeting the psychological community introduced g-estimation (another estimator from epidemiological literature), which is another valid approach for this specific setting (Loh & Ren, 2023b, 2023d).
Conclusion
In this article, we have provided recipes for causal estimators in the presence of time-fixed confounding. A companion R notebook illustrating the implementation of these estimators is available at github.com/ArthurChatton/CausalCookbook. We focused on estimators commonly used in epidemiological literature rather than in psychology to bridge the gap between these disciplines and to broaden psychologists’ causal-inference toolbox. Epidemiology is, of course, not the only field with a strong focus on causal inference. For example, methods from economics can be another valuable addition; in particular, those estimators that do not rely on conditional exchangeability but make other assumptions about the underlying causal net that may sometimes be more palatable (Grosz et al., 2024; Kim & Steiner, 2016).
All estimators, like cake recipes, require good ingredients—no statistical method can overcome poor data. And a lot of effort may be wasted if one sets out to bake the wrong cake—no statistical method can overcome poor research questions. Finally, in causal inference (and elsewhere), there is no free cake: Different approaches make different trade-offs with respect to bias and variance but also with respect to the underlying assumptions. The availability of a large set of estimators—based on different assumptions but targeting the same or at least related estimands—is crucial to improve evidence from observational studies through triangulation (Munafò & Davey Smith, 2018).
