Sage Journals: Discover world-class research

Abstract

Individualized treatment rules leverage patient-level information to tailor treatments for individuals. Estimating these rules, with the goal of optimizing expected patient outcomes, typically relies on individual-level data to identify the variability in treatment effects across patient subgroups defined by different covariate combinations. To increase the statistical power for detecting treatment–covariate interactions and the generalizability of the findings, data from multisite studies are often used. However, sharing sensitive patient-level health data is sometimes restricted. Additionally, due to funding or time constraints, only a subset of available treatments can be included at each site, but an individualized treatment rule considering all treatments is desired. In this work, we adopt a two-stage Bayesian network meta-analysis approach to estimate individualized treatment rules for multiple treatments using multisite data without disclosing individual-level data beyond the sites. Simulation results demonstrate that our approach can provide consistent estimates of the parameters that fully characterize the optimal individualized treatment rule. We illustrate the method’s application through an analysis of data from the Sequenced Treatment Alternatives to Relieve Depression study, the Establishing Moderators and Biosignatures of Antidepressant Response for Clinical Care study, and the Research Evaluating the Value of Augmenting Medication with Psychotherapy study.

Keywords

Bayesian meta-analysis, individualized treatment rules, multisite studies, network meta-analysis, personalized medicine

1. Introduction

It is widely recognized in medical research that treatment responses often exhibit variability among different patient subgroups. Personalized medicine leverages this heterogeneity in treatment effect to enhance healthcare service quality by delivering tailored treatments to individual patients.^1–3 An individualized treatment rule (ITR) is a decision rule that utilizes patient-level information, such as demographics, genetic makeup, or disease history, to customize treatment plans at a single decision point. An optimal ITR guides treatment selection for individual patients with the goal of optimizing patient outcomes. Estimating the optimal ITR is essential for the practice of personalized medicine and has attracted significant research focus.

Regression-based approaches are commonly employed to indirectly estimate the optimal ITR. These methods model the expected patient outcome as a function of treatment, covariates, and treatment–covariate interactions. Then, the optimal treatment is determined as the one that leads to the best estimated outcome for any given covariate profile. Q-learning,^4,5 G-estimation,⁶ and dynamic weighted ordinary least squares (dWOLS)⁷ are three popular regression-based approaches. Estimating treatment–covariate interactions based on individual-level data is essential in these approaches. With advancements in technologies, the availability of large collections of health data from multiple data sites has facilitated the identification of factors that contribute to differential treatment responses, provided a higher statistical power of treatment–covariate interaction estimation that cannot be offered by a single site,⁸ and improved the generalizability of the findings. However, patient-level health data are typically highly sensitive, and their disclosure could cause a violation of data-sharing agreements or policies, presenting a challenge for ITR estimation with multisite data. Therefore, valid approaches to ITR estimation without releasing patient-level information are desired.

Several approaches have been proposed to avoid individual-level data sharing for ITR estimation. Spicker et al.⁹ investigate differential privacy¹⁰ in the context of dynamic treatment regimes, an extension of ITRs to multiple treatment decision points. Instead of regression-based approaches, they focus on an outcome-weighted learning method,¹¹ which frames the estimation of ITRs as a classification loss minimization problem and identifies the optimal treatment through support vector machine classifiers. Danieli and Moodie¹² study the use of data pooling¹³ and distributed regression¹⁴ to protect individual-level data from release in multisite studies in the context of ITR estimation with generalized dWOLS for continuous outcomes. In their approach, estimators characterizing the optimal ITR are computed using data summaries (e.g. pooled data or matrix products) shared by each site, rather than individual-level data. Moodie et al.¹⁵ also explore distributed regression in dynamic weighted survival modeling, a generalization of dWOLS to survival outcomes.¹⁶ One limitation of both approaches is that they typically assume that the parameters of interest are fixed and common to all sites. To overcome this limitation, our recent work¹⁷ adapted a two-stage Bayesian meta-analysis approach, which requires only site-specific analyses of individual-level data within each site and the sharing of site-specific estimates, as summary data, to construct a common optimal ITR in settings where all sites are assigned the same treatment options. Conventional meta-analysis approaches typically assume that the treatment is binary and that each site involves the same treatment comparison, limiting their applicability to the wide range of diseases where the treatment landscape can be quite heterogeneous, as is the case with conditions such as depression.
In such cases, due to funding or time constraints, only a subset of available treatment options can be delivered in each site, and yet establishing an optimal ITR that considers all treatments is often desired. Analogously, in our motivating example, we wish to draw inferences using randomized trial data from trials whose randomization groups are overlapping but not identical. In this article, we consider ITR estimation in multisite studies without sharing individual-level data, when more than two treatments are available, and each site may encompass different sets of treatment assignment options.

An extension of classic meta-analysis to multiple treatments is network meta-analysis.^18,19 Network meta-analysis compares multiple treatments within a network of studies, involving the simultaneous analysis of direct evidence obtained from head-to-head trials and indirect evidence from studies including the treatments of interest and one or more common comparator treatments, when comparing any two treatments in the network. This drives the extension of the two-stage Bayesian meta-analysis approach proposed in our previous work¹⁷ to the current setting where treatments are not common across all sites or studies, which is the objective of this work.

The remainder of this paper is organized as follows: Section 2 describes the proposed method, including the notations and assumptions. A simulation study is presented in Section 3 to explore the performance of ITR estimation using the proposed method. Section 4 demonstrates the application of the proposed method via an analysis of real data from three randomized clinical trials for the treatment of depression. The paper concludes with a discussion in Section 5.

2. Methods

2.1. Preliminaries

Consider the data $(X, A, Y)$, where $X$ denotes the pretreatment covariates and $A \in Λ = \{d_{1}, \dots, d_{G}\}$ represents the treatment received by an individual patient, with $G$ unique options. Without loss of generality, we assume $d_{1}$ is the reference treatment, and $\tilde{A} = (I (A = d_{2}), \dots, I (A = d_{G}))^{⊤}$ codes the treatment assignment as a vector of dummy variables. We denote by $Y$ the continuous outcome of interest, with larger values preferred. We use uppercase, lowercase, and bold letters to denote random variables, their observed values, and vectors, respectively.

We make the following assumptions: (i) the stable unit treatment value assumption: a patient’s outcome is not influenced by other patients’ treatment²⁰; (ii) no unmeasured confounding²¹; (iii) positivity: there is a positive probability of receiving every possible treatment for every combination of covariate values that occur among individuals in the population.²²

Define a treatment-free function $f (x) = E (Y | A = d_{1}, X = x)$ , which represents the expected outcome at the reference treatment $d_{1}$ for patients with covariates $X = x$ . A blip function $γ (a, x)$ ⁶ is defined such that $γ (d_{h}, x) = E (Y | A = d_{h}, X = x) - E (Y | A = d_{1}, X = x)$ for $h \neq 1$ , and $γ (d_{1}, x) = 0$ . Therefore, $γ (d_{h}, x)$ is the expected difference in the outcomes between receiving treatment $d_{h}$ and the reference treatment $d_{1}$ for patients with covariates $X = x$ . For example, it can be the main effect of $d_{h}$ and interaction effects between $d_{h}$ and covariates $x$ . With $f$ and $γ$ , the outcome can be decomposed: $E (Y | A = a, X = x) = f (x) + γ (a, x) .$ We aim to identify the optimal ITR, that is, a decision rule that, given individual characteristics, outputs a tailored treatment which can maximize the expected outcome. The treatment-free function $f$ is not related to any terms of treatments $d_{2}, \dots, d_{G}$ . Therefore, the optimal ITR $d^{opt} (x)$ only depends on $γ$ , that is, $d^{opt} (x) = \arg max_{a \in {d_{1}, \dots, d_{G}}} γ (a, x)$ . However, estimation of the optimal ITR requires model specifications for both $f$ and $γ$ . For example, we can posit functional forms: $f (x) = β w (x)$ and $γ (a, x) = z (\tilde{a}) ψ l (x)$ , where $w$ , $z$ , and $l$ are multivariate functions specified by analysts, with $z (\tilde{a}) = 0$ for $a = d_{1}$ to ensure the condition $γ (d_{1}, x) = 0$ is met for every possible $x$ . The dimensions of $w$ , $z$ , $l$ , and parameters $β$ and $ψ$ should be compatible to guarantee both $f$ and $γ$ output scalar values. In this article, for illustration purposes, we assume $w (x) = x^{(β)}$ , $z (\tilde{a}) = \tilde{a}$ , and $l (x) = x^{(ψ)}$ . We use $x^{(β)}$ and $x^{(ψ)}$ to indicate that not all collected variables in $x$ , but those related to patient outcomes or treatment selection, are included in $f$ and $γ$ . 
A total of $p$ covariates that contribute to the outcome (predictive variables) are included in $x^{(β)}$, among which $q$ have tailoring effects on treatment assignment (prescriptive variables) and are included in $x^{(ψ)}$. Both $x^{(β)}$ and $x^{(ψ)}$ are augmented with an intercept term and are subvectors of $x$, and $x^{(ψ)}$ is also contained in $x^{(β)}$, that is, $x^{(ψ)} = (1, x_{1}, \dots, x_{q})^{⊤}$ and $x^{(β)} = (1, x_{1}, \dots, x_{q}, x_{q + 1}, \dots, x_{p})^{⊤}$ with $q \leq p$. Alternative parametric choices, such as nonlinear models, can also be considered for $w$ and $l$. Given a linear specification of $w$, $z$, and $l$, for example, the outcome model becomes $E (Y | A = a, X = x) = \underbrace{β^{⊤} x^{(β)}}_{\text{treatment-free model}} + \underbrace{{\tilde{a}}^{⊤} ψ x^{(ψ)}}_{\text{blip model}},$ (1) where the treatment-free model $f$ and blip model $γ$ are parameterized by a $(p + 1)$-dimensional vector $β$ and a $(G - 1) \times (q + 1)$ matrix $ψ$, that is, $ψ = \begin{pmatrix} ψ_{20} & ψ_{21} & \dots & ψ_{2 q} \\ ψ_{30} & ψ_{31} & \dots & ψ_{3 q} \\ \vdots & \vdots & \ddots & \vdots \\ ψ_{G 0} & ψ_{G 1} & \dots & ψ_{G q} \end{pmatrix},$ respectively, and $ψ_{h t}$ is the main effect of treatment $d_{h}$ ($t = 0$) or the interaction effect between $x_{t}$ and $I (a = d_{h})$ ($t = 1, \dots, q$). We use $ψ_{\cdot t} = (ψ_{2 t}, \dots, ψ_{G t})^{⊤}$ and $ψ_{h \cdot} = (ψ_{h 0}, \dots, ψ_{h q})^{⊤}$ to represent the $(t + 1)$th column of $ψ$ and the row of $ψ$ indexed by $h$ (its $(h - 1)$th row), containing all blip parameters related to a given covariate $x_{t}$ and a given treatment $d_{h}$, respectively. Then, with $x_{0} = 1$ denoting the intercept term, the outcome model can also be written as $\begin{aligned} E (Y | A = a, X = x) & = β^{⊤} x^{(β)} + \sum_{t = 0}^{q} ψ_{\cdot t}^{⊤} \tilde{a} x_{t}, \qquad (2) \\ & = β^{⊤} x^{(β)} + \sum_{h = 2}^{G} I (a = d_{h}) ψ_{h \cdot}^{⊤} x^{(ψ)} . \qquad (3) \end{aligned}$
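As a minimal sketch of how the linear outcome model (1) could be fitted within a single site, the Python snippet below builds the design matrix from $x^{(β)}$, $x^{(ψ)}$ and the treatment dummies, and estimates $β$ and $ψ$ by ordinary least squares. The function names `design_matrix` and `fit_blip`, the toy sample size, and the parameter values are assumptions made for illustration only; they are not part of the original analysis.

```python
import numpy as np

def design_matrix(x_beta, x_psi, a, n_treat):
    """Columns: x^(beta), then I(a = d_h) * x^(psi) for h = 2..G (ordered by h, then t)."""
    n = x_beta.shape[0]
    # Dummy coding with d_1 (coded as 1) as the reference treatment.
    a_tilde = np.stack([(a == h).astype(float) for h in range(2, n_treat + 1)], axis=1)
    # Each treatment dummy multiplies every prescriptive covariate (including the intercept).
    inter = (a_tilde[:, :, None] * x_psi[:, None, :]).reshape(n, -1)
    return np.hstack([x_beta, inter])

def fit_blip(x_beta, x_psi, a, y, n_treat):
    """Return (beta_hat, psi_hat), with psi_hat of shape (G - 1, q + 1)."""
    design = design_matrix(x_beta, x_psi, a, n_treat)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    p1 = x_beta.shape[1]
    return coef[:p1], coef[p1:].reshape(n_treat - 1, x_psi.shape[1])

# Toy data (hypothetical): G = 3, prescriptive covariate x1, extra predictive covariate x2.
rng = np.random.default_rng(0)
n = 600
x1 = rng.uniform(2, 8, n)
x2 = rng.binomial(1, 0.5, n).astype(float)
a = rng.integers(1, 4, n)                        # treatments coded 1, 2, 3
x_beta = np.column_stack([np.ones(n), x1, x2])   # (1, x1, x2)
x_psi = np.column_stack([np.ones(n), x1])        # (1, x1)
beta_true = np.array([4.0, 1.0, 1.0])
psi_true = np.array([[5.0, -0.9], [8.0, -1.6]])  # rows: d_2, d_3
mean = design_matrix(x_beta, x_psi, a, 3) @ np.concatenate([beta_true, psi_true.ravel()])
y = mean + rng.normal(0, 0.5, n)
beta_hat, psi_hat = fit_blip(x_beta, x_psi, a, y, 3)
```

Here `psi_hat[h - 2, t]` plays the role of $\hat{ψ}_{h t}$. A doubly robust fit (e.g. dWOLS) would additionally weight the rows, which this sketch does not attempt.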

Since the treatment assignment only influences the outcome through the blip model $γ$, the parameter $ψ$ solely determines the optimal ITR. Given the parameter $ψ$ in model (3), the optimal ITR $d^{opt} (x) = \arg\max_{a \in \{d_{1}, \dots, d_{G}\}} γ (a, x)$ can be written as $d^{opt} (x) = \begin{cases} d_{h_{0}}, & h_{0} = \arg\max_{h \in \{2, \dots, G\}} ψ_{h \cdot}^{⊤} x^{(ψ)} \text{ and } ψ_{h_{0} \cdot}^{⊤} x^{(ψ)} > 0, \\ d_{1}, & ψ_{h \cdot}^{⊤} x^{(ψ)} \leq 0 \text{ for all } h = 2, \dots, G . \end{cases}$ The parameter $ψ$ can be estimated by different approaches. In the absence of model misspecification, consistent and unbiased estimators of $ψ$ can be obtained from a Q-learning method,² which, in our setting, reduces to a standard linear regression for model (1). Doubly robust alternatives such as dWOLS or G-estimation can also be employed. Bayesian approaches have also been proposed for ITR estimation, such as Bayesian G-computation,²³ Bayesian additive regression trees,²⁴ and Bayesian causal forest.²⁵
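The decision rule above can be sketched in a few lines of Python. The function name `optimal_itr` and the numerical values of `psi` are hypothetical and serve only to illustrate how the rule picks a treatment once $ψ$ is known or estimated.

```python
import numpy as np

def optimal_itr(psi, x_psi):
    """Return the index h (1-based) of the optimal treatment d_h.

    psi   : (G - 1, q + 1) array whose rows are psi_{h.} for h = 2, ..., G
    x_psi : (q + 1,) array (1, x_1, ..., x_q)
    """
    blips = psi @ x_psi              # gamma(d_h, x) for h = 2, ..., G
    best = int(np.argmax(blips))
    # d_1 is optimal when no treatment improves on the reference (all blips <= 0).
    return best + 2 if blips[best] > 0 else 1

# Illustrative blip parameters for G = 3 with one tailoring covariate x_1.
psi = np.array([[5.0, -0.9],
                [8.0, -1.6]])
recommended = {x1: optimal_itr(psi, np.array([1.0, x1])) for x1 in (3.0, 5.0, 6.0)}
```

With these illustrative values the rule recommends $d_{3}$ at $x_{1} = 3$, $d_{2}$ at $x_{1} = 5$, and the reference $d_{1}$ at $x_{1} = 6$, where both blips are negative.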

2.2. Two-stage Bayesian network meta-analysis

In this section, we describe the use of a two-stage Bayesian network meta-analysis approach to avoid disclosing individual-level data when estimating the optimal ITR for multiple treatments using multisite data. We first describe the model when all $G$ treatments are present in all sites and then explain the extension to varying sets of treatments across sites. Suppose we have $K$ sites. For site $i \in \{1, \dots, K\}$, the outcome model can be expressed as $E (Y_{i j} | A = a_{i j}, X = x_{i j}) = β_{i}^{⊤} x_{i j}^{(β)} + \sum_{t = 0}^{q} ψ_{i \cdot t}^{⊤} {\tilde{a}}_{i j} x_{i j t},$ where $j \in \{1, \dots, n_{i}\}$ indexes individual patients within each site and $n_{i}$ is the number of patients in site $i$. We include index $i$ in $β_{i}$ and $ψ_{i \cdot t}$ to indicate that these parameters are site-specific and can vary across sites. The varying site-specific blip parameters $ψ_{i \cdot t}$ are assumed to be exchangeable, that is, $ψ_{i \cdot t} = (ψ_{i 2 t}, \dots, ψ_{i G t})^{⊤} \sim \mathrm{MVN} (ψ_{\cdot t}, Σ_{t}),$ (4) where $\mathrm{MVN} (ψ_{\cdot t}, Σ_{t})$ represents a multivariate normal distribution with mean $ψ_{\cdot t}$ and variance–covariance matrix $Σ_{t}$. The common parameters $ψ_{\cdot t}$, $t = 0, \dots, q$, fully characterize a common optimal ITR applicable to a broader population, encompassing subpopulations from various sites in the dataset, and potentially for future patients at comparable sites. Estimation of $ψ_{h t}$, $h = 2, \dots, G$, $t = 0, \dots, q$, is of primary interest.
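To make the exchangeability assumption (4) concrete, the following sketch draws site-specific blip parameters around a common mean. The number of sites, the common mean `psi_common`, and the covariance `cov` are illustrative assumptions, not values prescribed by the method.

```python
import numpy as np

rng = np.random.default_rng(2024)
n_sites = 5                                          # K (illustrative)
psi_common = np.array([5.0, 8.0])                    # psi_{.t} for G = 3 (illustrative)
sigma2 = 0.04                                        # assumed between-site variance
cov = sigma2 * np.array([[1.0, 0.5], [0.5, 1.0]])    # one possible Sigma_t

# One row per site: psi_{i.t} ~ MVN(psi_{.t}, Sigma_t).
psi_sites = rng.multivariate_normal(psi_common, cov, size=n_sites)
```

Each row of `psi_sites` is one site's vector $(ψ_{i 2 t}, ψ_{i 3 t})$. The rows scatter around the common mean, which is the quantity the second-stage model aims to recover.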

In the two-stage Bayesian meta-analysis approach, we first obtain a set of estimates for the site-specific parameters by conducting analyses on data from each single site, and then combine these estimates in a Bayesian hierarchical model to obtain estimates of the common parameters $ψ_{\cdot t}$ and thus a common optimal ITR. That is, in the first stage, the estimates ${\hat{ψ}}_{i \cdot t}$ of $ψ_{i \cdot t}$ and the corresponding $(G - 1) \times (G - 1)$ variance–covariance matrices $\hat{Σ} ({\hat{ψ}}_{i \cdot t})$ can be acquired from the approaches mentioned in Section 2.1, based solely on site-specific data. In the second stage, these site-level estimates, rather than individual records, are shared with a central analysis site and combined in a Bayesian hierarchical model: $\begin{aligned} {\hat{ψ}}_{i \cdot t} & \sim \mathrm{MVN} (ψ_{i \cdot t}, \hat{Σ} ({\hat{ψ}}_{i \cdot t})), \\ ψ_{i \cdot t} & \sim \mathrm{MVN} (ψ_{\cdot t}, Σ_{t}), \\ ψ_{h t} & \sim p_{ψ_{h t}} (ψ_{h t}), \\ Σ_{t} & \sim p_{Σ_{t}} (Σ_{t}) . \end{aligned}$ (5) Here, prior distributions $p_{ψ_{h t}} (ψ_{h t})$ and $p_{Σ_{t}} (Σ_{t})$ are assigned to the unknown parameters $ψ_{h t}$ and $Σ_{t}$. A popular prior choice for $ψ_{h t}$ is a normal prior with large variance.²⁶ The between-site heterogeneity matrix $Σ_{t}$ could be structured under the assumption that the between-site heterogeneity is the same across different treatment comparisons.^27,28 In this case, a common specification in the network meta-analysis literature^27,28 is that $Σ_{t}$ has diagonal elements $σ_{t}^{2}$ and off-diagonal elements $0.5 σ_{t}^{2}$ (i.e. the correlation between any two treatment contrasts is 0.5), where $σ_{t}^{2}$ is the between-site variance associated with $ψ_{h t}$ for all $h = 2, \dots, G$. 
By assuming the same variance for all $ψ_{h t}$ with a given $t$ and fixing the correlation between $ψ_{h_{1} t}$ and $ψ_{h_{2} t}$, $h_{1} \neq h_{2}$, $h_{1}, h_{2} = 2, \dots, G$, at 0.5, the variance of the contrast $ψ_{h_{1} t} - ψ_{h_{2} t}$ is $\mathrm{var} (ψ_{h_{1} t} - ψ_{h_{2} t}) = \mathrm{var} (ψ_{h_{1} t}) + \mathrm{var} (ψ_{h_{2} t}) - 2 \mathrm{cov} (ψ_{h_{1} t}, ψ_{h_{2} t}) = σ_{t}^{2} + σ_{t}^{2} - 2 (0.5 σ_{t}^{2}) = σ_{t}^{2}$. This is referred to as the common between-site heterogeneity assumption, and the contrast $ψ_{h_{1} t} - ψ_{h_{2} t}$ will be needed in the consistency equation (7) described later. This specification also reduces the number of parameters to be estimated in $Σ_{t}$ and can improve model convergence. With the structured heterogeneity, a prior is needed only for $σ_{t}$ or $σ_{t}^{2}$. In this article, we use a half-Cauchy prior for $σ_{t}$; however, alternatives may also be employed.²⁹ If the variance–covariance matrix $Σ_{t}$ is deemed to be unstructured, that is, a separate between-site heterogeneity is to be estimated for each of the different treatment comparisons, priors can be directly assigned to $Σ_{t}$. A common choice in this scenario is the inverse-Wishart prior.²⁶ Alternatively, a separation strategy has been proposed.³⁰ The variance–covariance matrix $Σ_{t}$ can be decomposed as $Σ_{t} = U V U$, where $U$ is a diagonal matrix of between-site standard deviations and $V$ is an unknown correlation matrix. Then, priors can be assigned to the between-site standard deviations and the correlation matrix $V$. The separation strategy offers more flexibility than the usual inverse-Wishart prior for the variance–covariance matrix, and it has been shown to outperform modeling the variance–covariance matrix as a whole.³¹ Therefore, in this article, we adopt the separation strategy. We still use a half-Cauchy prior for the standard deviation parameters. 
Possible prior choices for the correlation matrix include restricted inverse-Wishart (RIW) prior,³¹ Lewandowski–Kurowicka–Joe (LKJ) prior,³² and an equal correlation (EQ) prior where the pairwise correlations in the correlation matrix are all assumed to be equal, and an appropriate prior (e.g. a uniform prior with upper and lower bounds specified to make the correlation matrix positive-definite) can be assigned for the common correlation parameter. We consider both heterogeneity structures in the simulation studies. For the separation strategy under the unstructured heterogeneity, we use an LKJ prior for the correlation matrix $V$ as recommended by the Stan Development Team.³³ However, we also explore the use of RIW and EQ priors through simulations in a specific scenario that mimics the network structure in the real-data application. The Bayesian hierarchical model is implemented in RStan.^33,34
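The two heterogeneity structures discussed above can be illustrated numerically. The sketch below constructs $Σ_{t}$ under the common between-site heterogeneity assumption, checks that any contrast then has variance $σ_{t}^{2}$, and assembles an unstructured $Σ_{t} = U V U$ from standard deviations and a correlation matrix. The helper names and all numerical values are assumptions chosen for illustration; no priors are involved here.

```python
import numpy as np

def common_heterogeneity(n_contrasts, sigma):
    """Sigma_t with sigma^2 on the diagonal and 0.5 * sigma^2 elsewhere."""
    corr = np.full((n_contrasts, n_contrasts), 0.5)
    np.fill_diagonal(corr, 1.0)
    return sigma**2 * corr

def separation(sds, corr):
    """Sigma_t = U V U with U = diag(sds) and V a correlation matrix."""
    u = np.diag(sds)
    return u @ corr @ u

sigma = 0.4
cov = common_heterogeneity(3, sigma)
# Variance of the contrast psi_{h1 t} - psi_{h2 t}: c' Sigma c with c = (1, -1, 0).
c = np.array([1.0, -1.0, 0.0])
contrast_var = c @ cov @ c          # equals sigma^2 under this structure

# Unstructured case: separate standard deviations and a valid correlation matrix.
sds = np.array([0.3, 0.4, 0.5])
v = np.array([[1.0, 0.2, 0.1],
              [0.2, 1.0, 0.3],
              [0.1, 0.3, 1.0]])
cov_u = separation(sds, v)
```

In a Bayesian fit, `sigma`, `sds`, and `v` would be unknown and receive priors (for example half-Cauchy and LKJ, as described above); here they are fixed only to show the algebra.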

Model (5) requires all $G$ candidate treatments under consideration to be observed at all sites. However, in reality, some treatments are not administered in specific sites, possibly due to the insufficient sample size and funding to implement a large number of treatments. When the set of treatments differs across sites, we can still implement a two-stage approach. However, in this setting, not all $ψ_{i h t}$ in the site-specific outcome models are estimable, and modifications based on the network meta-analysis approach are made to (5).

To proceed, the treatment set in site $i$ is denoted by $Λ_{i} = \{d_{a_{i}^{(1)}}, \dots, d_{a_{i}^{(ν_{i})}}\}$, where $ν_{i}$ is the number of treatments in site $i$, $ν_{i} < G$ and $1 \leq a_{i}^{(1)} < a_{i}^{(2)} < \dots < a_{i}^{(ν_{i})} \leq G$. Without loss of generality, we assume $d_{a_{i}^{(1)}}$ is the reference treatment for site $i$. When $d_{1}$ is available in site $i$, we have $d_{a_{i}^{(1)}} = d_{1}$ (i.e. the site-specific reference treatment is the common reference treatment). Otherwise, $d_{a_{i}^{(1)}} \neq d_{1}$. Then, with treatment set $Λ_{i}$, we can fit a site-specific outcome model $E (Y_{i j} | A = a_{i j}, X = x_{i j}) = β_{i}^{⊤} x_{i j}^{(β)} + \sum_{t = 0}^{q} {\tilde{ψ}}_{i \cdot t}^{⊤} {\tilde{a}}_{i j}^{(2)} x_{i j t},$ where ${\tilde{a}}_{i j}^{(2)} = (I (a = d_{a_{i}^{(2)}}), \dots, I (a = d_{a_{i}^{(ν_{i})}}))^{⊤}$ is a subvector of ${\tilde{a}}_{i j}$ containing the dummy variables for the non-reference treatments in site $i$, and ${\tilde{ψ}}_{i \cdot t} = ({\tilde{ψ}}_{i, a_{i}^{(2)} a_{i}^{(1)}, t}, \dots, {\tilde{ψ}}_{i, a_{i}^{(ν_{i})} a_{i}^{(1)}, t})^{⊤}$ includes the estimable site-specific blip parameters, such that ${\tilde{ψ}}_{i, a_{i}^{(\tilde{h})} a_{i}^{(1)}, t}$, $\tilde{h} = 2, \dots, ν_{i}$, are the main effect $(t = 0)$ or the treatment–covariate interaction $(t = 1, \dots, q)$ of treatment $d_{a_{i}^{(\tilde{h})}}$ relative to the site-specific reference treatment $d_{a_{i}^{(1)}}$. The estimates and common means of ${\tilde{ψ}}_{i, a_{i}^{(\tilde{h})} a_{i}^{(1)}, t}$ are denoted by ${\hat{\tilde{ψ}}}_{i, a_{i}^{(\tilde{h})} a_{i}^{(1)}, t}$ and ${\tilde{ψ}}_{a_{i}^{(\tilde{h})} a_{i}^{(1)}, t}$, respectively. However, the parameters of interest are still $ψ_{h t}$, $h = 2, \dots, G$, $t = 0, \dots, q$, which characterize a common optimal ITR.

With the site-specific estimates ${\hat{\tilde{ψ}}}_{i \cdot t} = ({\hat{\tilde{ψ}}}_{i, a_{i}^{(2)} a_{i}^{(1)}, t}, \dots, {\hat{\tilde{ψ}}}_{i, a_{i}^{(ν_{i})} a_{i}^{(1)}, t})^{⊤}$ and the associated $(ν_{i} - 1) \times (ν_{i} - 1)$ variance–covariance matrix $\hat{Σ} ({\hat{\tilde{ψ}}}_{i \cdot t})$, the first two levels of model (5) are modified to $\begin{aligned} {\hat{\tilde{ψ}}}_{i \cdot t} & \sim \mathrm{MVN} ({\tilde{ψ}}_{i \cdot t}, \hat{Σ} ({\hat{\tilde{ψ}}}_{i \cdot t})), \\ {\tilde{ψ}}_{i \cdot t} & \sim \mathrm{MVN} ({\tilde{ψ}}_{i \cdot t}^{(2)}, {\tilde{Σ}}_{i t}), \end{aligned}$ (6) where ${\tilde{ψ}}_{i \cdot t}^{(2)} = ({\tilde{ψ}}_{a_{i}^{(2)} a_{i}^{(1)}, t}, \dots, {\tilde{ψ}}_{a_{i}^{(ν_{i})} a_{i}^{(1)}, t})^{⊤}$ is a vector of length $ν_{i} - 1$, and ${\tilde{Σ}}_{i t}$ is a $(ν_{i} - 1) \times (ν_{i} - 1)$ variance–covariance matrix reflecting between-site heterogeneity. We include index $i$ in both ${\tilde{ψ}}_{i \cdot t}^{(2)}$ and ${\tilde{Σ}}_{i t}$ to indicate their dependence on the treatment set $Λ_{i}$; the vector ${\tilde{ψ}}_{i \cdot t}^{(2)}$ includes common rather than site-specific blip parameters.

When $d_{a_{i}^{(1)}} = d_{1}$, ${\tilde{ψ}}_{a_{i}^{(\tilde{h})} a_{i}^{(1)}, t} = ψ_{a_{i}^{(\tilde{h})} t}$ and ${\tilde{ψ}}_{i \cdot t}^{(2)}$ is a subvector of $ψ_{\cdot t}$: ${\tilde{ψ}}_{i \cdot t}^{(2)} = ({\tilde{ψ}}_{a_{i}^{(2)} a_{i}^{(1)}, t}, \dots, {\tilde{ψ}}_{a_{i}^{(ν_{i})} a_{i}^{(1)}, t})^{⊤} = (ψ_{a_{i}^{(2)} t}, \dots, ψ_{a_{i}^{(ν_{i})} t})^{⊤}$. Therefore, $ψ_{h t}$ can still be estimated through borrowing information across sites but with lower precision, as not all ${\hat{\tilde{ψ}}}_{i \cdot t}$ include information relevant to estimating $ψ_{h t}$. When $d_{a_{i}^{(1)}} \neq d_{1}$, none of the $ψ_{i h t}$ are estimable in site $i$ and ${\tilde{ψ}}_{a_{i}^{(\tilde{h})} a_{i}^{(1)}, t} \neq ψ_{a_{i}^{(\tilde{h})} t}$. To handle this case, we adopt the consistency assumption from the network meta-analysis literature. In network meta-analysis, two treatments $d_{1}$ and $d_{2}$ can be either (1) directly compared in head-to-head studies, referred to as direct evidence, or (2) indirectly compared via studies comparing $d_{1}$ or $d_{2}$ with one or more common comparator treatments (i.e. indirect evidence). The consistency assumption states that the indirect and direct estimates are in agreement.¹⁸ In our setting, the consistency assumption ensures that we can link ${\tilde{ψ}}_{a_{i}^{(\tilde{h})} a_{i}^{(1)}, t}$, $\tilde{h} = 2, \dots, ν_{i}$, to $ψ_{h t}$, $h = 2, \dots, G$, through the equation ${\tilde{ψ}}_{a_{i}^{(\tilde{h})} a_{i}^{(1)}, t} = ψ_{a_{i}^{(\tilde{h})} t} - ψ_{a_{i}^{(1)} t} .$ (7) This equation applies to both main treatment effects and treatment–covariate interactions, such that consistency is assumed for treatment effects at every covariate combination. Then, priors can be assigned to all between-site variance and common mean parameters as in model (5).
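The consistency equation (7) is linear in the common parameters, so each site's estimable contrasts can be written as $C_{i} ψ_{\cdot t}$ for a contrast matrix $C_{i}$ (taking $ψ_{1 t} = 0$). The Python sketch below builds $C_{i}$ and, purely for illustration, pools site estimates by generalised least squares with a known $Σ_{t}$. This is a simplification and not the RStan hierarchical model with priors used in this article: the function names are hypothetical, and setting ${\tilde{Σ}}_{i t} = C_{i} Σ_{t} C_{i}^{⊤}$ is an assumption of the sketch.

```python
import numpy as np

def contrast_matrix(site_treatments, n_treat):
    """Map (psi_2t, ..., psi_Gt) to a site's contrasts against its own reference.

    site_treatments : sorted treatment indices at the site, e.g. [2, 3, 5];
                      the first entry is the site-specific reference.
    The row for treatment h is e_h - e_ref, with psi_1t = 0 by definition.
    """
    ref, others = site_treatments[0], site_treatments[1:]
    c = np.zeros((len(others), n_treat - 1))
    for row, h in enumerate(others):
        c[row, h - 2] = 1.0
        if ref != 1:
            c[row, ref - 2] = -1.0
    return c

def pool_known_sigma(estimates, variances, sites, sigma_full):
    """GLS estimate of psi_.t given a KNOWN Sigma_t (illustration only)."""
    g1 = sigma_full.shape[0]
    precision, weighted = np.zeros((g1, g1)), np.zeros(g1)
    for est, var, site in zip(estimates, variances, sites):
        c = contrast_matrix(site, g1 + 1)
        # Marginal precision of the site estimate: within-site plus between-site variance.
        w = np.linalg.inv(var + c @ sigma_full @ c.T)
        precision += c.T @ w @ c
        weighted += c.T @ w @ est
    return np.linalg.solve(precision, weighted)

psi_common = np.array([5.0, 8.0])                  # (psi_2t, psi_3t), illustrative
sites = [[1, 2], [1, 3], [2, 3]]                   # a three-treatment loop
sigma_full = 0.04 * np.array([[1.0, 0.5], [0.5, 1.0]])
estimates = [contrast_matrix(s, 3) @ psi_common for s in sites]   # noise-free estimates
variances = [np.array([[0.01]]) for _ in sites]
pooled = pool_known_sigma(estimates, variances, sites, sigma_full)

# A site with reference d_2 in a five-treatment network under common heterogeneity.
sigma5 = 0.04 * (0.5 * np.eye(4) + 0.5)
c = contrast_matrix([2, 3, 5], 5)
site_cov = c @ sigma5 @ c.T
```

With noise-free inputs, `pooled` recovers `psi_common`, including information from the site that compares only $d_{2}$ and $d_{3}$. The matrix `site_cov` again has equal diagonals and correlation 0.5, which suggests why the common-heterogeneity structure is convenient when the reference treatment changes from site to site.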

3. Simulation studies

The simulation study is reported following the aims, data-generating mechanisms, estimands, methods, and performance measures (ADEMP) scheme proposed in Morris et al.³⁵

3.1. Aims

We aim to evaluate ITR estimation for a continuous outcome and multiple treatments when individual-level data are protected from disclosure via a two-stage Bayesian network meta-analysis approach, under assumptions regarding (1) network sizes, (2) network shapes, (3) the true between-site heterogeneity, and (4) the assumed between-site heterogeneity in the Bayesian hierarchical model. Points (1)–(3) concern the data-generating mechanisms, while (4) concerns the analysis model. In previous work,¹⁷ we explored ITR estimation with the two-stage Bayesian pairwise meta-analysis under different confounding scenarios, between-site heterogeneity levels, prior choices, and sample sizes. Although that work did not include settings where the treatments offered varied by site, we expect similar results when we have a network of studies and thus do not consider those particular features in the simulations here.

3.2. Data-generating mechanisms

The network structures considered in the simulation are shown in Figure 1. Sites included in each network structure with their site-specific treatment set $Λ_{i}$ are summarized in Table 1. A network can comprise sites with different treatment arms. Both networks (a) and (b) depicted in Figure 1 include three treatments $d_{1}$, $d_{2}$, and $d_{3}$. However, in network (a), each site only includes two treatments: either comparing $d_{1}$ and $d_{2}$ or comparing $d_{1}$ and $d_{3}$, while in network (b), we also have a third site only including $d_{2}$ and $d_{3}$, forming a loop. Similarly, for networks (c) and (d), a larger treatment set is considered, and the networks may or may not include loops. Network (e) reflects the network structure for the real-data application to three trials, considering six treatments of depression described in Section 4. With a given site-specific treatment set $Λ_{i}$, the number of sites could be 1 or 3. That is, for any particular site-specific treatment set, either 1 or 3 sites offer that set of treatment options. For each site, the sample size is fixed at 300. For site $i$, we first generate a random number $s_{i}$ uniformly from $\{0, 1, 2\}$. Then, a continuous covariate $X_{1}$ and a binary covariate $X_{2}$ are generated from the following distributions: $X_{1} \sim \begin{cases} N (5, 1), & s_{i} = 0, \\ 6 \, \mathrm{Beta} (4, 4) + 2, & s_{i} = 1, \\ U [2, 8], & s_{i} = 2, \end{cases} \qquad X_{2} \sim \begin{cases} \mathrm{Bernoulli} (0.5), & s_{i} = 0, \\ \mathrm{Bernoulli} (0.3), & s_{i} = 1, \\ \mathrm{Bernoulli} (0.7), & s_{i} = 2. \end{cases}$ Given the site-specific treatment set $Λ_{i} = \{d_{a_{i}^{(1)}}, \dots, d_{a_{i}^{(ν_{i})}}\}$, the treatment assignment $A$ follows a multinomial distribution with probabilities determined by $P (A = d_{h} | x_{1}, x_{2}, Λ_{i}) = \frac{\exp (α_{h 0} + α_{h 1} x_{1} + α_{h 2} x_{2})}{\sum_{d_{h'} \in Λ_{i}} \exp (α_{h' 0} + α_{h' 1} x_{1} + α_{h' 2} x_{2})}$ for $d_{h} \in Λ_{i}$, where the coefficients $α_{h 0}, α_{h 1}$, and $α_{h 2}$ are shown in Table 2. That is, although our real-data analysis focuses on randomized trial data, we perform our simulations under a more general setting where treatment allocation may depend on covariates.
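The covariate and treatment-assignment mechanism described above can be sketched in Python as follows. The function name `simulate_site` and the inverse-CDF sampling of the multinomial draw are implementation choices of this sketch, not details taken from the original simulation code. The coefficient matrix reproduces Table 2, and $N (5, 1)$ is read as having unit variance.

```python
import numpy as np

# Table 2 coefficients (alpha_h0, alpha_h1, alpha_h2) for d_1, ..., d_7.
ALPHA = np.array([
    [0.00, 0.00, 0.00],
    [0.00, 0.03, 0.08],
    [0.01, 0.09, 0.03],
    [0.05, 0.02, 0.09],
    [0.08, 0.02, 0.04],
    [0.01, 0.01, 0.08],
    [0.08, 0.09, 0.03],
])

def simulate_site(treatment_set, n=300, rng=None):
    """Draw (x1, x2, a) for one site offering the treatments in `treatment_set`."""
    rng = np.random.default_rng() if rng is None else rng
    s = rng.integers(0, 3)                       # s_i drawn uniformly from {0, 1, 2}
    if s == 0:
        x1, p2 = rng.normal(5.0, 1.0, n), 0.5
    elif s == 1:
        x1, p2 = 6.0 * rng.beta(4.0, 4.0, n) + 2.0, 0.3
    else:
        x1, p2 = rng.uniform(2.0, 8.0, n), 0.7
    x2 = rng.binomial(1, p2, n)
    idx = np.asarray(treatment_set) - 1
    lin = ALPHA[idx, 0][None, :] + np.outer(x1, ALPHA[idx, 1]) + np.outer(x2, ALPHA[idx, 2])
    prob = np.exp(lin)
    prob /= prob.sum(axis=1, keepdims=True)      # normalise over treatments in Lambda_i only
    # Inverse-CDF draw of one treatment per patient.
    u = rng.random(n)
    choice = (u[:, None] > np.cumsum(prob, axis=1)).sum(axis=1)
    choice = np.minimum(choice, len(idx) - 1)    # guard against rounding in the cumulative sum
    a = np.asarray(treatment_set)[choice]
    return x1, x2, a, prob

x1, x2, a, prob = simulate_site([1, 2], rng=np.random.default_rng(0))
```

Because the Table 2 coefficients are small, the assignment probabilities stay close to uniform over the treatments offered at the site, so the simulated confounding is mild.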

Figure 1.

Graphics of simulated networks (a)–(e). Network (e) reflects the network structure for the real-data application to three trials described in Section 4. Connecting lines indicate the two treatments can be directly compared. The treatment $d_{1}$ is considered as the common reference treatment in each network.

Table 1.

Sites included in each network.

Network	Treatment set $Λ$	Number of arms per site	Site-specific treatment set $Λ_{i}$
(a)	${d_{1}, d_{2}, d_{3}}$	2	{ $d_{1}$ , $d_{2}$ }, { $d_{1}$ , $d_{3}$ }
(b)	${d_{1}, d_{2}, d_{3}}$	2	{ $d_{1}$ , $d_{2}$ }, { $d_{1}$ , $d_{3}$ }, { $d_{2}$ , $d_{3}$ }
(c)	${d_{1}, d_{2}, d_{3}, d_{4}, d_{5}}$	2	{ $d_{1}$ , $d_{2}$ }, { $d_{1}$ , $d_{3}$ }, { $d_{1}$ , $d_{4}$ }, { $d_{1}$ , $d_{5}$ }
		3	{ $d_{2}$ , $d_{3}$ , $d_{5}$ }, { $d_{3}$ , $d_{4}$ , $d_{5}$ }
(d)	${d_{1}, d_{2}, d_{3}, d_{4}, d_{5}, d_{6}, d_{7}}$	2	{ $d_{1}$ , $d_{2}$ }, { $d_{1}$ , $d_{3}$ }, { $d_{1}$ , $d_{4}$ }, { $d_{1}$ , $d_{5}$ },
			{ $d_{3}$ , $d_{7}$ }, { $d_{5}$ , $d_{6}$ }
(e)	${d_{1}, d_{2}, d_{3}, d_{4}, d_{5}, d_{6}}$	2	{ $d_{1}$ , $d_{2}$ }
		5	{ $d_{1}$ , $d_{2}$ , $d_{3}$ , $d_{4}$ , $d_{5}$ }
		4	{ $d_{1}$ , $d_{2}$ , $d_{3}$ , $d_{6}$ }

Table 2.

Coefficients in multinomial probabilities for treatment assignment.

Treatment	$α_{h 0}$	$α_{h 1}$	$α_{h 2}$
$d_{1}$	0	0	0
$d_{2}$	0	0.03	0.08
$d_{3}$	0.01	0.09	0.03
$d_{4}$	0.05	0.02	0.09
$d_{5}$	0.08	0.02	0.04
$d_{6}$	0.01	0.01	0.08
$d_{7}$	0.08	0.09	0.03

Suppressing the individual-specific subscript, the continuous outcome for an individual at site $i$ is generated by $Y_{i} = β_{i 0} + β_{i 1} x_{1} + β_{i 2} x_{2} + (ψ_{i \cdot 0}^{⊤} + ψ_{i \cdot 1}^{⊤} x_{1}) \tilde{a} + ϵ,$ where $\tilde{a} = (I (a = d_{2}), \dots, I (a = d_{7}))^{⊤}$, $ψ_{i \cdot t} = (ψ_{i 2 t}, \dots, ψ_{i 7 t})^{⊤}$, $t = 0, 1$, $β_{i 0} + β_{i 1} x_{1} + β_{i 2} x_{2}$ is the site-specific treatment-free function, and $(ψ_{i \cdot 0}^{⊤} + ψ_{i \cdot 1}^{⊤} x_{1}) \tilde{a}$ is the site-specific blip function. The random error $ϵ$ follows a normal distribution with mean zero and residual variance $σ_{ϵ}^{2} = 0.25$. We note that in the above outcome generation model, seven treatments are assumed, whereas in all scenarios shown in Table 1, all sites employ fewer than seven treatments. This common form of data generation is nevertheless appropriate: when a treatment $d_{h}$ is not present in a given site $i$, the blip parameters related to $d_{h}$, that is, $ψ_{i h 0}$ and $ψ_{i h 1}$, do not contribute to the outcome because $I (a = d_{h}) = 0$. This outcome generation model is different from the outcome model we fit in the first stage: $E (Y_{i} | a, x_{1}, x_{2}) = β_{i 0} + β_{i 1} x_{1} + β_{i 2} x_{2} + ({\tilde{ψ}}_{i \cdot 0}^{⊤} + {\tilde{ψ}}_{i \cdot 1}^{⊤} x_{1}) {\tilde{a}}^{(2)},$ where general definitions of ${\tilde{a}}^{(2)}$ and ${\tilde{ψ}}_{i \cdot t}$ have been provided in Section 2.2, and the notation should be adapted accordingly here.
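The difference between the generating model and the fitted first-stage model can be checked numerically. In the sketch below, a hypothetical site offers only $d_{2}$ and $d_{3}$, outcomes are generated from the seven-treatment model with the residual variance set to zero (a choice made here so that the algebra is exact), and the first-stage model is fitted with $d_{2}$ as the site-specific reference. The function name and the parameter values used for this site are illustrative assumptions.

```python
import numpy as np

def generate_outcome(x1, x2, a, beta, psi0, psi1, rng, resid_var=0.25):
    """Y = beta0 + beta1*x1 + beta2*x2 + (psi0 + psi1*x1)' a_tilde + eps; treatments coded 1..7."""
    blip0 = np.concatenate([[0.0], psi0])[a - 1]   # psi_{i h 0}, with 0 for the reference d_1
    blip1 = np.concatenate([[0.0], psi1])[a - 1]   # psi_{i h 1}, with 0 for the reference d_1
    eps = rng.normal(0.0, np.sqrt(resid_var), len(a))
    return beta[0] + beta[1] * x1 + beta[2] * x2 + blip0 + blip1 * x1 + eps

rng = np.random.default_rng(1)
n = 300
x1 = rng.uniform(2.0, 8.0, n)
x2 = rng.binomial(1, 0.5, n).astype(float)
a = rng.choice([2, 3], size=n)                     # a site offering only d_2 and d_3
beta = np.array([4.0, 1.0, 1.0])                   # illustrative site-specific values
psi0 = np.array([5.0, 8.0, 4.0, 6.0, 2.0, 3.0])
psi1 = np.array([-0.9, -1.6, -1.3, -1.5, -0.8, -1.1])
y = generate_outcome(x1, x2, a, beta, psi0, psi1, rng, resid_var=0.0)

# First-stage model with d_2 as the site-specific reference treatment.
d3 = (a == 3).astype(float)
design = np.column_stack([np.ones(n), x1, x2, d3, d3 * x1])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
```

The fitted intercept and $x_{1}$ slope absorb the blip of the site reference ($4 + 5$ and $1 - 0.9$), while the treatment terms estimate the contrasts $ψ_{i 3 0} - ψ_{i 2 0} = 3$ and $ψ_{i 3 1} - ψ_{i 2 1} = -0.7$ rather than $ψ_{i 3 0}$ and $ψ_{i 3 1}$ themselves. The blip parameters of treatments absent from the site play no role.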

The site-specific parameters $θ_{i} = (β_{i 0}, β_{i 1}, β_{i 2}, ψ_{i \cdot 0}, ψ_{i \cdot 1})$ in the outcome generation model are simulated by: $β_{i s} \sim N (β_{s}, σ_{B}^{2})$, $s = 0, 1, 2$, and $ψ_{i \cdot t} \sim MVN (ψ_{\cdot t}, Σ_{t})$, $t = 0, 1$. For the $6 \times 6$ variance–covariance matrix $Σ_{t}$, we consider two scenarios:

–
common between-site heterogeneity: $Σ_{t}$ has diagonals $σ_{B}^{2}$ and off-diagonals $0.5 σ_{B}^{2}$;
–
varying between-site heterogeneity: $Σ_{t}$ has diagonals $(0.7, 1, 1.3, 0.7, 1, 1.3) σ_{B}^{2}$ and off-diagonals $0.5 σ_{B}^{2}$,

where the between-study variance $σ_{B}^{2}$ is derived from heterogeneity level $I^{2} = σ_{B}^{2} / (σ_{B}^{2} + σ_{ϵ}^{2}) = 0.1$. We note that for each distinct $t$, the variance–covariance matrix $Σ_{t}$ can be different. Fundamentally, under the common between-site heterogeneity mechanism, for a given $t$, the between-site variance for $ψ_{h t}$ is the same regardless of $h$ (i.e. the diagonals of $Σ_{t}$ are equal with a given $t$), but the between-site variance in different $Σ_{t}$ can be different. However, the between-site variance is not our primary interest. Therefore, for simplicity, we assume a single between-site variance parameter $σ_{B}^{2}$ in all $Σ_{t}$, as well as for $β_{s}$. The common treatment-free function parameters are $β_{0} = 4$, $β_{1} = 1$, $β_{2} = 1$, and the common blip function parameters are $ψ_{\cdot 0} = (5, 8, 4, 6, 2, 3)$, and $ψ_{\cdot 1} = (- 0.9, - 1.6, - 1.3, - 1.5, - 0.8, - 1.1)$. Let $ω_{d_{h}} (x) = ψ_{h 0} + ψ_{h 1} x_{1}$ for $d_{h} \in Λ \setminus \{d_{1}\}$, and $ω_{d_{1}} (x) = 0$. The common optimal ITR is given by $d^{opt} (x) = \arg \max_{d_{h} \in Λ} ω_{d_{h}} (x)$, which, in all networks, can be reduced to

$d^{opt} (x) = \begin{cases} d_{1} & x_{1} > \frac{50}{9}, \\ d_{2} & \frac{30}{7} < x_{1} < \frac{50}{9}, \\ d_{3} & x_{1} < \frac{30}{7} . \end{cases}$
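The data-generating mechanism above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the simulation code used for the reported results: the function names (`make_sigma`, `draw_site_params`, `simulate_outcome`, `optimal_treatment`) are hypothetical, the covariates $x_{1}$ and $x_{2}$ are taken as inputs because their distributions are specified elsewhere, and the rule is evaluated over all seven treatments.

```python
import numpy as np

SIGMA_EPS2 = 0.25                          # residual variance from the text
I2 = 0.1                                   # heterogeneity level from the text
# Solve I^2 = s / (s + sigma_eps^2) for the between-site variance s.
SIGMA_B2 = I2 * SIGMA_EPS2 / (1.0 - I2)

BETA = np.array([4.0, 1.0, 1.0])                        # beta_0, beta_1, beta_2
PSI0 = np.array([5.0, 8.0, 4.0, 6.0, 2.0, 3.0])         # psi_{.0} for d_2..d_7
PSI1 = np.array([-0.9, -1.6, -1.3, -1.5, -0.8, -1.1])   # psi_{.1} for d_2..d_7


def make_sigma(varying):
    """6 x 6 between-site covariance with off-diagonals 0.5 * sigma_B^2."""
    scale = np.array([0.7, 1.0, 1.3, 0.7, 1.0, 1.3]) if varying else np.ones(6)
    sigma = np.full((6, 6), 0.5 * SIGMA_B2)
    np.fill_diagonal(sigma, scale * SIGMA_B2)
    return sigma


def draw_site_params(rng, varying=False):
    """Draw one site's treatment-free and blip parameters."""
    sigma = make_sigma(varying)
    beta_i = rng.normal(BETA, np.sqrt(SIGMA_B2))
    psi0_i = rng.multivariate_normal(PSI0, sigma)
    psi1_i = rng.multivariate_normal(PSI1, sigma)
    return beta_i, psi0_i, psi1_i


def simulate_outcome(rng, a, x1, x2, beta_i, psi0_i, psi1_i):
    """a holds treatment indices h in 1..7 (d_1 is the reference)."""
    a = np.asarray(a)
    psi0_full = np.concatenate([[0.0], psi0_i])   # blip of d_1 is zero
    psi1_full = np.concatenate([[0.0], psi1_i])
    treatment_free = beta_i[0] + beta_i[1] * x1 + beta_i[2] * x2
    blip = psi0_full[a - 1] + psi1_full[a - 1] * x1
    eps = rng.normal(0.0, np.sqrt(SIGMA_EPS2), size=a.shape)
    return treatment_free + blip + eps


def optimal_treatment(x1):
    """Index h of the treatment d_h maximizing the common blip at x1."""
    omega = np.concatenate([[0.0], PSI0 + PSI1 * x1])
    return int(np.argmax(omega)) + 1
```

Evaluating `optimal_treatment` on either side of the thresholds $30/7$ and $50/9$ reproduces the piecewise rule: treatments $d_{4}$ to $d_{7}$ are never optimal under these parameter values.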

3.3. Estimands, methods, and performance metrics

The estimands of interest are the common blip function parameters $ψ_{h t}$ , $d_{h} \in Λ$ , $t = 0, 1$ , which fully characterize the optimal ITR in each network. We implement a two-stage Bayesian network meta-analysis approach, using linear regression in the first stage and a Bayesian hierarchical model for the second stage. For the mean parameters, we use a normal prior with mean 0 and variance 10,000. Regarding variance–covariance matrix ${\tilde{Σ}}_{i t}$ in (6), we consider two scenarios:

When only a single site exists for each different site-specific treatment set $Λ_{i}$ in the network, we lack sufficient data to estimate the between-site heterogeneity, and thus ${\tilde{Σ}}_{i t} = 0$ .

When we have three sites for each unique site-specific treatment set, priors will be assigned under different modeling assumptions:

–
Under common between-site heterogeneity assumption, ${\tilde{Σ}}_{i t}$ has diagonal entries $σ_{t}^{2}$ and off-diagonal entries $0.5 σ_{t}^{2}$ , and a half-Cauchy (0,1) prior is assigned to $σ_{t}$ .
–
For an unstructured ${\tilde{Σ}}_{t}$ under the varying between-site heterogeneity assumption, we use the decomposition ${\tilde{Σ}}_{t} = U V U$ , where $U$ is a diagonal matrix with diagonals $σ_{a_{i}^{(\tilde{h})}, a_{i}^{(1)}, t}$ and $V$ is an unknown correlation matrix. Then, a half-Cauchy (0,1) prior and an LKJ (1) prior are assumed for $σ_{a_{i}^{(\tilde{h})}, a_{i}^{(1)}, t}$ in $U$ and $V$ , respectively. In particular, for scenario (e), in addition to the LKJ(1) prior, an RIW prior with 6 degrees of freedom and the identity as the scale matrix, and an EQ prior with a uniform prior on $(- 0.25, 1)$ for the common correlation are also considered for $V$ to explore the impact of different prior choices on the estimated optimal ITRs.
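The two covariance structures can be made concrete with a short NumPy sketch. It only builds the matrices; it does not sample from the half-Cauchy, LKJ, or RIW priors, and all function names are hypothetical. A $k \times k$ equal-correlation matrix is positive definite only for $-1/(k-1) < ρ < 1$, so the lower bound $-0.25$ of the uniform prior corresponds to $k = 5$; connecting that dimension to scenario (e) is our reading rather than something stated in the text.

```python
import numpy as np


def compound_symmetric(sigma_t, k):
    """Common heterogeneity: sigma_t^2 on the diagonal, 0.5 * sigma_t^2 elsewhere."""
    cov = np.full((k, k), 0.5 * sigma_t**2)
    np.fill_diagonal(cov, sigma_t**2)
    return cov


def separation(sds, corr):
    """Unstructured form Sigma = U V U with U = diag(sds) and V a correlation matrix."""
    u = np.diag(np.asarray(sds, dtype=float))
    return u @ np.asarray(corr, dtype=float) @ u


def equal_correlation(rho, k):
    """EQ structure: a single common correlation rho off the diagonal."""
    v = np.full((k, k), float(rho))
    np.fill_diagonal(v, 1.0)
    return v


def is_positive_definite(m):
    """True when every eigenvalue of the symmetric matrix m is positive."""
    return bool(np.all(np.linalg.eigvalsh(m) > 0))
```

The common-heterogeneity structure is the special case of the separation form with equal standard deviations and a common correlation of 0.5, which is why it needs only one variance parameter per $t$.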

We are unaware of any existing methods that can estimate common ITRs across multiple studies with differing treatment sets at different sites without individual-level data being shared. Thus, no competing methods are included, and our analyses consist of comparing different model specifications within our Bayesian network meta-analysis approach. In previous work,¹⁷ we compared the two-stage approach with a one-stage analysis where all individual-level data are combined and analyzed in both simulations and a real-data application, under scenarios where the same binary or continuous treatments were available across all sites. We found that both approaches provided similar results, and we expect this conclusion to extend to settings where multiple treatments and varying treatment sets are available across sites. Therefore, we do not perform a one-stage analysis in the current simulation. However, both the one-stage and two-stage methods are implemented for the real-data application in Section 4. We assess: (i) the relative bias of blip parameter estimators, calculated as the difference between the mean of the estimates and the true value, divided by the true value, (ii) the standard deviation of the estimates, (iii) the difference in the value function (dVF) under the true and estimated optimal ITR, where the value function with respect to an ITR is approximated by the expected outcome if all patients in a new cohort of size 100,000 were treated according to the ITR, and (iv) the empirical standard deviation of the dVF when the estimated treatment rule was applied to the same population.
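Metrics (i) and (iii) can be sketched as follows. This is an illustration under stated assumptions: the uniform distribution for $x_{1}$ in the new cohort is hypothetical, all seven treatments are treated as available, and the function names are ours. Because the treatment-free function is the same under any rule, it cancels in the dVF, so only the true blip values are needed.

```python
import numpy as np

# True common blip parameters from Section 3.2, with the reference d_1 fixed at 0.
PSI0 = np.array([0.0, 5.0, 8.0, 4.0, 6.0, 2.0, 3.0])
PSI1 = np.array([0.0, -0.9, -1.6, -1.3, -1.5, -0.8, -1.1])


def relative_bias_pct(estimates, truth):
    """Relative bias in percent: (mean of estimates - truth) / truth."""
    return 100.0 * (np.mean(estimates) - truth) / truth


def assign(psi0, psi1, x1):
    """0-based index of the treatment with the largest blip value per patient."""
    return np.argmax(psi0[:, None] + psi1[:, None] * x1[None, :], axis=0)


def dvf(psi0_hat, psi1_hat, x1):
    """Value under the true optimal rule minus value under the estimated rule."""
    omega = PSI0[:, None] + PSI1[:, None] * x1[None, :]   # true blip values
    cols = np.arange(x1.size)
    best = omega[assign(PSI0, PSI1, x1), cols]
    got = omega[assign(np.asarray(psi0_hat), np.asarray(psi1_hat), x1), cols]
    return float(np.mean(best - got))


rng = np.random.default_rng(1)
x1_new = rng.uniform(0.0, 10.0, size=100_000)   # hypothetical covariate distribution
```

By construction the dVF is nonnegative and equals zero when the estimated rule agrees with the true optimal rule for every patient in the cohort.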

3.4. Results

In this section, for the sake of space, only simulation results for ${\hat{ψ}}_{20}$ , ${\hat{ψ}}_{21}$ , and dVF are presented. All results related to other blip parameters are presented in the Supplemental Materials. Tables 3 and 4 show simulation results when the true site-specific blip parameters are generated under varying and common between-site heterogeneity assumptions, respectively. Across all scenarios, the relative bias remains below 1% in absolute value. When multiple sites exist for each $Λ_{i}$ , in the Bayesian hierarchical model, we can assume either common between-site heterogeneity or an unstructured form for ${\tilde{Σ}}_{t}$ . Irrespective of the true generation mechanisms of site-specific blip parameters, the relative bias of blip parameter estimators is similar in the two specifications. However, the estimators have greater variability when varying between-site heterogeneity is assumed, which is reasonable due to the increased number of parameters to be estimated. The ITRs estimated under the common between-site heterogeneity assumption also have slightly higher values (i.e. a smaller dVF). For scenario (e), the estimators have comparable bias and variability across different prior choices for the correlation matrix, and the resulting ITRs perform similarly. Overall, in the explored scenarios, the results are insensitive to the heterogeneity assumptions in the model, but assuming the common between-site heterogeneity will result in fewer parameters and a more practically feasible model.

Table 3.

Simulation results when the data are generated under an unstructured between-site heterogeneity model.

			${\hat{ψ}}_{20}$	${\hat{ψ}}_{21}$
Network	Number of sites	Heterogeneity	RB (SD)	RB (SD)	dVF (SD)
a	1	0	−0.062 (0.417)	0.311 (0.182)	0.162 (0.125)
	3	Common	0.007 (0.325)	−0.004 (0.204)	0.087 (0.075)
		Varying	0.016 (0.471)	−0.017 (0.331)	0.089 (0.076)
b	1	0	0.023 (0.330)	0.341 (0.158)	0.126 (0.101)
	3	Common	−0.103 (0.235)	0.062 (0.143)	0.058 (0.052)
		Varying	−0.087 (0.276)	0.137 (0.169)	0.063 (0.056)
c	1	0	−0.124 (0.315)	−0.236 (0.138)	0.107 (0.083)
	3	Common	0.042 (0.197)	0.177 (0.112)	0.051 (0.046)
		Varying	0.061 (0.238)	0.219 (0.135)	0.058 (0.052)
d	1	0	0.134 (0.411)	−0.759 (0.183)	0.219 (0.299)
	3	Common	−0.089 (0.268)	−0.221 (0.155)	0.091 (0.079)
		Varying	−0.088 (0.470)	−0.267 (0.326)	0.095 (0.094)
e	1	0	0.107 (0.290)	0.125 (0.124)	0.109 (0.111)
	3	Common	0.039 (0.180)	−0.205 (0.092)	0.042 (0.044)
		Varying—LKJ(1)	0.042 (0.188)	−0.190 (0.097)	0.044 (0.060)
		Varying—RIW	0.043 (0.190)	−0.192 (0.100)	0.043 (0.046)
		Varying—EQ	0.043 (0.194)	−0.195 (0.102)	0.045 (0.061)

RB: relative bias; SD: standard deviation; dVF: difference in value function; LKJ: Lewandowski–Kurowicka–Joe; RIW: restricted inverse-Wishart; EQ: equal correlation; ITR: individualized treatment rule. RB ( $%$ ) and SDs of ${\hat{ψ}}_{20}$ and ${\hat{ψ}}_{21}$ , and dVF between true and estimated ITR and its standard deviation are reported across different networks, numbers of sites and heterogeneity assumptions in the Bayesian hierarchical model based on 2000 iterations.

Table 4.

Simulation results when the data are generated under the assumption of common between-site heterogeneity.

			${\hat{ψ}}_{20}$	${\hat{ψ}}_{21}$
Network	Number of sites	Heterogeneity	RB (SD)	RB (SD)	dVF (SD)
a	1	0	−0.064 (0.416)	0.315 (0.182)	0.148 (0.109)
	3	Common	0.017 (0.322)	0.023 (0.195)	0.081 (0.068)
		Varying	−0.004 (0.461)	0.032 (0.327)	0.082 (0.068)
b	1	0	0.024 (0.327)	0.321 (0.152)	0.113 (0.088)
	3	Common	−0.104 (0.230)	0.061 (0.134)	0.052 (0.047)
		Varying	−0.089 (0.272)	0.149 (0.162)	0.056 (0.051)
c	1	0	−0.123 (0.315)	−0.232 (0.135)	0.099 (0.074)
	3	Common	0.042 (0.196)	0.177 (0.110)	0.047 (0.044)
		Varying	0.057 (0.237)	0.187 (0.133)	0.054 (0.049)
d	1	0	0.135 (0.411)	−0.760 (0.183)	0.194 (0.259)
	3	Common	−0.090 (0.267)	−0.215 (0.153)	0.084 (0.069)
		Varying	−0.082 (0.463)	−0.244 (0.330)	0.089 (0.086)
e	1	0	0.083 (0.299)	0.168 (0.145)	0.190 (0.278)
	3	Common	−0.030 (0.197)	0.284 (0.115)	0.065 (0.064)
		Varying—LKJ(1)	−0.041 (0.206)	0.297 (0.116)	0.071 (0.126)
		Varying—RIW	0.045 (0.190)	−0.193 (0.100)	0.038 (0.039)
		Varying—EQ	0.043 (0.194)	−0.191 (0.102)	0.039 (0.063)

RB: relative bias; SD: standard deviation; dVF: difference in value function; LKJ: Lewandowski–Kurowicka–Joe; RIW: restricted inverse-Wishart; EQ: equal correlation; ITR: individualized treatment rule. RB ( $%$ ) and SDs of ${\hat{ψ}}_{20}$ and ${\hat{ψ}}_{21}$ , and dVF between true and estimated ITR and its standard deviation are reported across different networks, numbers of sites and heterogeneity assumptions in the Bayesian hierarchical model based on 2000 iterations.

With more sites contributing to the common ITR estimation, we have more precise blip parameter estimators and a smaller dVF, corresponding to a better ITR estimation. When there is only one site for each $Λ_{i}$ , due to the limited information, the variability is higher even without considering any between-site heterogeneity, and dVF is also larger. Networks (a) and (b) only differ in the additional sites for $d_{2}$ and $d_{3}$ , which will not provide direct information for the parameters of interest (the main effects or treatment–covariate interactions relative to the common reference treatment $d_{1}$ ). However, we still have more precise parameter estimates and ITRs with higher values in network (b). This suggests indirect evidence can also help with the common ITR estimation. Networks (c), (d), and (e) present more complex structures with relatively limited data information. No obvious difference in results is observed, indicating that the complexity of the network structure may not significantly impact the ITR estimation as long as the model is adapted accordingly based on the consistency equation (7).

4. Application: Estimating individualized depression treatment

In this section, we apply the proposed method to estimate an ITR for patients with major depressive disorder (MDD) using data from three studies: The Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study,³⁶ Establishing Moderators and Biosignatures of Antidepressant Response for Clinical Care (EMBARC) study,³⁷ and Research Evaluating the Value of Augmenting Medication with Psychotherapy (REVAMP) study.³⁸

All three studies are multistage randomized trials, with details of their designs described elsewhere.^36–38 STAR*D includes four stages. Due to single treatment assignment in the first stage and limited sample size in stages 3 and 4, we use data from stage 2, where patients without a satisfactory clinical outcome to citalopram (CIT) in the first stage were randomized to seven treatments. Among these, we focus on medications only: venlafaxine (VEN), sertraline (SER), bupropion (BUP), and CIT augmented with BUP (CIT + BUP) or with buspirone (CIT + BUS). In the case of EMBARC, we focus on SER and BUP in the second stage, as patients received only one active treatment (SER) or a placebo in the first stage. For REVAMP, data from the first stage are used, where a medication algorithm was implemented for treatment assignment, and SER, BUP, VEN, and escitalopram (ESCIT) are included. Therefore, in total, six treatments are identified: $Λ = \{SER, BUP, VEN, CIT + BUP, CIT + BUS, ESCIT\}$ , forming a network structure as shown in Figure 2, which corresponds to scenario (e) (Figure 1(e)) in the simulation. The common reference treatment is SER, as it was included in all three studies and is often considered the front-line treatment of MDD.³⁹ In this case, the study-specific and common reference treatments are the same.

Figure 2.

Network structure of analysis of STAR*D, EMBARC, and REVAMP data. The size of each node (in red) is proportional to the total sample size in the corresponding treatment group, and the width of the connecting line (in gray) between any two treatments is proportional to the number of studies that directly compared the two treatments. STAR*D: Sequenced Treatment Alternatives to Relieve Depression; EMBARC: Establishing Moderators and Biosignatures of Antidepressant Response for Clinical Care; REVAMP: Research Evaluating the Value of Augmenting Medication with Psychotherapy.

Depression severity is measured by the 17-item Hamilton Depression Rating Scale (HDRS-17), with a larger value corresponding to more severe symptoms. In our analysis, we consider the negative of HDRS-17 as the outcome. We choose covariates based on meta-reviews of antidepressant treatment outcome predictors and modifiers.^40,41 The following covariates that are common in the three studies and potentially related to differential treatment effects were identified and included in the model at the first stage: (1) sociodemographic variables: age (in years), sex (male, female), race (White, non-White), marital status (single, married, divorced/widowed), number of years in formal education, employment status (employed, unemployed), number of people in household; (2) clinical variables: age at onset of first MDD (in years), number of depressive episodes, chronicity of current episode, and baseline HDRS-17 before receiving the treatments. Among these variables, race was only used as an adjustment variable rather than a tailoring variable for treatment assignment, as basing treatment decisions on racial or ethnic groups can lead to healthcare disparities and inequities.⁴² Additionally, while the number of depressive episodes was small for many patients, there were also several large values (e.g. 120), making it unsuitable to include this variable as a continuous linear term in the model. Therefore, the number of depressive episodes was dichotomized using a cutoff point of four; that is, a binary variable was created based on whether the number of episodes is $\geq 4$ . The threshold value of four was chosen based on the data to ensure sufficient sample sizes for estimating the parameters associated with the dichotomized variable. 
Since the inclusion criteria differ among the three studies, patients who had their first MDD episode after the age of 30 in the STAR*D and REVAMP studies were excluded from the analysis to make the study populations more similar, thereby making the positivity assumption more plausible. Information on these variables is collectively denoted by the vector $x$ . Records with missing values are removed. Finally, our analysis dataset contains 408, 87, and 308 patients from the STAR*D, EMBARC, and REVAMP studies, respectively. The distributions of covariates are summarized in Table 5.

Table 5.

Patient characteristics for STAR*D, EMBARC, and REVAMP studies.

	STAR*D	EMBARC	REVAMP
Variables	( $n = 408$ )	( $n = 87$ )	( $n = 308$ )
Age	38.46 (12.59)	38.70 (13.13)	40.06 (12.29)
Sex
Female	251 (61.5)	53 (60.9)	191 (62.0)
Male	157 (38.5)	34 (39.1)	117 (38.0)
Race
White	349 (85.5)	60 (69.0)	259 (84.1)
Non-White	59 (14.5)	27 (31.0)	49 (15.9)
Marital status
Married	165 (40.4)	23 (26.4)	111 (36.0)
Single	136 (33.3)	50 (57.5)	113 (36.7)
Divorced/widowed	107 (26.2)	14 (16.1)	84 (27.3)
Years of education	13.79 (2.82)	15.24 (2.39)	14.72 (2.62)
Employment status
Employed	243 (59.6)	44 (50.6)	195 (63.3)
Unemployed/retired	165 (40.4)	43 (49.4)	113 (36.7)
Number of people in household	2.67 (1.48)	2.62 (2.31)	2.53 (1.57)
Age of first MDD	16.81 (6.11)	16.51 (5.41)	17.47 (5.85)
Number of episodes
$< 4$	195 (47.8)	39 (44.8)	245 (79.5)
$\geq 4$	213 (52.2)	48 (55.2)	63 (20.5)
Chronicity
Chronic	96 (23.5)	37 (42.5)	294 (95.5)
Non-chronic	312 (76.5)	50 (57.5)	14 (4.5)
Baseline HDRS-17	16.91 (7.11)	15.69 (5.47)	20.85 (4.30)

STAR*D: Sequenced Treatment Alternatives to Relieve Depression; EMBARC: Establishing Moderators and Biosignatures of Antidepressant Response for Clinical Care; REVAMP: Research Evaluating the Value of Augmenting Medication with Psychotherapy; MDD: major depressive disorder; HDRS-17: 17-item Hamilton Depression Rating Scale.

Linear regression models with the above-mentioned covariates and their interactions with treatments were used to obtain site-specific blip parameter estimates and the corresponding variance–covariance matrix. Since for most treatment comparisons only one or two study-specific estimates are available, a Bayesian hierarchical model with ${\tilde{Σ}}_{i t} = 0$ was used to obtain the common blip parameter estimates. The estimated ITR is $d^{opt} (x) = \arg \max_{d_{h} \in Λ} {\hat{ω}}_{d_{h}} (x)$ , where ${\hat{ω}}_{SER} (x) = 0$ , and for $d_{h} \neq SER$ , $\begin{aligned} {\hat{ω}}_{d_{h}} (x) = {} & {\hat{ψ}}_{h 0} + {\hat{ψ}}_{h 1} \, \text{Age} + {\hat{ψ}}_{h 2} \, \text{Male} + {\hat{ψ}}_{h 3} \, \text{Single} + {\hat{ψ}}_{h 4} \, \text{Divorced/Widowed} \\ & + {\hat{ψ}}_{h 5} \, \text{Years of Education} + {\hat{ψ}}_{h 6} \, \text{Unemployed} + {\hat{ψ}}_{h 7} \, \text{Number of People in Household} \\ & + {\hat{ψ}}_{h 8} \, \text{Age of First MDD} + {\hat{ψ}}_{h 9} \, I (\text{Number of Episodes} \geq 4) + {\hat{ψ}}_{h, 10} \, \text{Non-chronic} \\ & + {\hat{ψ}}_{h, 11} \, \text{Baseline HDRS-17}, \end{aligned}$ with parameter estimates ${\hat{ψ}}_{h t}$ and the corresponding 95% posterior credible intervals shown in Table 6.

Table 6.

Blip parameter estimates (posterior medians) and the 95% posterior credible intervals for the real-data application.

$ψ_{h t}$	BUP	VEN	CIT + BUP	CIT + BUS	ESCIT
Main treatment effect ( ${\hat{ψ}}_{h 0}$ )	$5.05 (- 10.01, 20.32)$	$- 1.14 (- 22.41, 20.17)$	$- 3.27 (- 25.2, 18.91)$	$18.16 (- 0.08, 36.47)$	$- 12.73 (- 40, 14.5)$
Age ( ${\hat{ψ}}_{h 1}$ )	$0.05 (- 0.12, 0.23)$	$- 0.13 (- 0.34, 0.08)$	$- 0.08 (- 0.29, 0.13)$	$- 0.16 (- 0.36, 0.04)$	$0.06 (- 0.22, 0.33)$
Male ( ${\hat{ψ}}_{h 2}$ )	$- 0.61 (- 4.5, 3.33)$	$- 1 (- 5.98, 3.99)$	$1.04 (- 3.52, 5.62)$	$- 0.44 (- 4.63, 3.73)$	$1.5 (- 3.89, 6.93)$
Single ( ${\hat{ψ}}_{h 3}$ )	$- 0.45 (- 5.67, 4.75)$	$- 0.01 (- 6.77, 6.78)$	$2.02 (- 4.17, 8.22)$	$- 4.42 (- 10.36, 1.51)$	$2.11 (- 6.35, 10.64)$
Divorced/widowed ( ${\hat{ψ}}_{h 4}$ )	$0.37 (- 4.56, 5.27)$	$5.51 (- 0.32, 11.37)$	$2.98 (- 2.66, 8.63)$	$1.9 (- 3.59, 7.36)$	$- 0.47 (- 8.91, 7.99)$
Years of education ( ${\hat{ψ}}_{h 5}$ )	$- 0.35 (- 1.07, 0.37)$	$- 0.01 (- 0.89, 0.86)$	$- 0.3 (- 1.21, 0.61)$	$- 0.71 (- 1.49, 0.07)$	$- 0.12 (- 0.94, 0.7)$
Unemployed/retired ( ${\hat{ψ}}_{h 6}$ )	$3.41 (- 0.21, 7.01)$	$4.65 (- 0.4, 9.67)$	$2.6 (- 1.96, 7.15)$	$3.67 (- 0.71, 8.05)$	$- 2.67 (- 8.45, 3.09)$
Number of people in household ( ${\hat{ψ}}_{h 7}$ )	$- 0.44 (- 1.59, 0.72)$	$0.37 (- 1.72, 2.47)$	$0.08 (- 1.65, 1.82)$	$- 1.54 (- 3.07, - 0.01)$	$- 0.03 (- 2.32, 2.24)$
Age of first MDD ( ${\hat{ψ}}_{h 8}$ )	$- 0.03 (- 0.32, 0.27)$	$- 0.06 (- 0.45, 0.33)$	$0.18 (- 0.23, 0.59)$	$0 (- 0.36, 0.35)$	$0.22 (- 0.36, 0.82)$
Number of episodes $\geq 4$ ( ${\hat{ψ}}_{h 9}$ )	$- 0.26 (- 4.35, 3.85)$	$0.43 (- 4.67, 5.56)$	$1.67 (- 3.04, 6.38)$	$1.44 (- 2.99, 5.9)$	$- 1.02 (- 7.31, 5.29)$
Nonchronic ( ${\hat{ψ}}_{h, 10}$ )	$0.53 (- 3.8, 4.81)$	$2.86 (- 2.86, 8.56)$	$1.58 (- 3.64, 6.83)$	$- 0.36 (- 5.2, 4.43)$	$1.91 (- 8.62, 12.49)$
Baseline HDRS-17 ( ${\hat{ψ}}_{h, 11}$ )	$0.02 (- 0.27, 0.31)$	$0.26 (- 0.11, 0.62)$	$0.32 (- 0.03, 0.67)$	$0.16 (- 0.16, 0.49)$	$0.37 (- 0.36, 1.11)$

BUP: bupropion; VEN: venlafaxine; CIT: citalopram; BUS: buspirone; ESCIT: escitalopram; MDD: major depressive disorder; HDRS-17: 17-item Hamilton Depression Rating Scale.
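To illustrate how the estimated rule would be evaluated for a given patient, the sketch below plugs the posterior medians from Table 6 into the blip functions and returns the treatment with the largest value (the outcome is the negative of HDRS-17, so larger is better). The covariate coding (0/1 indicators, ordering of the vector) and the names `COVARIATES`, `PSI_HAT`, `blip_scores`, and `recommend` are assumptions made for this example. Given the wide credible intervals, a recommendation from point estimates alone is illustrative only.

```python
import numpy as np

# Assumed coding: indicators are 0/1; continuous variables are on their raw scale.
COVARIATES = ["age", "male", "single", "divorced_widowed", "education_years",
              "unemployed_retired", "household_size", "age_first_mdd",
              "episodes_ge4", "nonchronic", "baseline_hdrs17"]

# Posterior medians from Table 6: main effect first, then COVARIATES in order.
PSI_HAT = {
    "BUP": [5.05, 0.05, -0.61, -0.45, 0.37, -0.35, 3.41, -0.44, -0.03, -0.26, 0.53, 0.02],
    "VEN": [-1.14, -0.13, -1.0, -0.01, 5.51, -0.01, 4.65, 0.37, -0.06, 0.43, 2.86, 0.26],
    "CIT + BUP": [-3.27, -0.08, 1.04, 2.02, 2.98, -0.3, 2.6, 0.08, 0.18, 1.67, 1.58, 0.32],
    "CIT + BUS": [18.16, -0.16, -0.44, -4.42, 1.9, -0.71, 3.67, -1.54, 0.0, 1.44, -0.36, 0.16],
    "ESCIT": [-12.73, 0.06, 1.5, 2.11, -0.47, -0.12, -2.67, -0.03, 0.22, -1.02, 1.91, 0.37],
}


def blip_scores(x):
    """Estimated blip value of each treatment relative to SER for covariates x."""
    x = np.asarray(x, dtype=float)
    scores = {"SER": 0.0}                     # reference treatment
    for trt, psi in PSI_HAT.items():
        scores[trt] = psi[0] + float(np.dot(psi[1:], x))
    return scores


def recommend(x):
    """Treatment with the largest estimated blip value."""
    scores = blip_scores(x)
    return max(scores, key=scores.get)


# Hypothetical patient, listed in the order of COVARIATES.
patient = [40, 1, 0, 0, 14, 0, 2, 18, 0, 1, 20]
print(recommend(patient))
```

Keeping SER in the score dictionary with value zero makes the reference treatment a candidate in the same argmax as the other five treatments.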

As BUP was also available in all three studies, in addition to the main analysis where SER was chosen as the common reference treatment, we conducted another two-stage analysis with BUP as the common reference treatment to assess how the choice of the common reference treatment would influence the estimation results. A one-stage analysis where all individual-level data are combined and analyzed was also performed for comparison. Figure 3 shows the posterior densities of the main treatment effect parameters, $ψ_{h 0}$ , $h = 2, \dots, 6$ , in the three analyses. Here, the main treatment effects are defined with respect to SER. When BUP is the reference treatment, the posterior distributions of these parameters are obtained through the transformation of the target parameters that are defined with respect to BUP. The posterior distributions of the common treatment–covariate interactions are presented in Supplemental Appendix S3. We observe that changing the reference treatment in our case did not result in much variation in the results: the posterior density curves for the two two-stage analyses largely overlap. The posterior density exhibits a more concentrated peak in the one-stage analysis, suggesting more precise estimates compared to the two-stage analyses. Additionally, while the posterior density curves from the one-stage analysis largely overlap with those from the two-stage analysis for most common blip parameters, noticeable shifts can be observed for some parameters (e.g. ${\hat{ψ}}_{2, 10}$ in Figure S10 of Supplemental Appendix S3). This suggests that the one-stage analysis based on full individual-level data may yield different results from the two-stage analysis in some cases. However, in all three analyses, the estimated effects, including the main treatment effects, have wide credible intervals that almost always include zero; in Table 6, the only interval that excludes zero is the one for the interaction between CIT + BUS and the number of people in the household, $(- 3.07, - 0.01)$ , and it does so only narrowly. This is not surprising, given the limited number of studies available for this analysis. 
Additionally, most variables in the analysis are binary, providing less information than continuous variables.
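The transformation between reference treatments mentioned above follows from the consistency relation: an effect of treatment $h$ relative to SER equals its effect relative to BUP minus the effect of SER relative to BUP. A minimal sketch, assuming posterior draws are stored per treatment as arrays of equal length (the function name and data layout are hypothetical):

```python
import numpy as np


def rereference(draws, old_ref, new_ref):
    """Re-express posterior draws relative to a new reference treatment.

    draws maps each non-reference treatment to an array of draws of its effect
    relative to old_ref; old_ref itself is implicit (effect zero).
    """
    shift = np.asarray(draws[new_ref], dtype=float)
    out = {trt: np.asarray(d, dtype=float) - shift
           for trt, d in draws.items() if trt != new_ref}
    out[old_ref] = -shift                    # old reference relative to new one
    return out


# Toy draws relative to BUP, for illustration only (not results from the paper).
draws_vs_bup = {"SER": np.array([1.0, 2.0]), "VEN": np.array([3.0, 5.0])}
draws_vs_ser = rereference(draws_vs_bup, old_ref="BUP", new_ref="SER")
```

Applying the transformation draw by draw, rather than to posterior summaries, preserves the posterior correlation between the parameters being differenced.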

Figure 3.

Posterior densities of the main treatment effect $ψ_{h 0}$ , $h = 2, \dots, 6$ , in the three methods: a two-stage analysis with SER as the common reference treatment, a two-stage analysis with BUP as the common reference treatment, and a one-stage analysis where all individual-level data across the three studies are pooled and analyzed together. SER: sertraline; BUP: bupropion.

5. Discussion

An optimal ITR can be estimated in a regression-based approach by including predefined treatment–covariate interactions. To increase the power for detecting differential treatment effects by covariates, large collections of datasets from multiple sites or studies are often needed. Different sites or studies may have varying treatment sets; however, an ITR analysis of all available treatments is desired. This presents a methodological gap which, to our knowledge, has not previously been considered. To address this gap, we adopt a two-stage Bayesian meta-analysis approach: at the first stage, study-specific analyses are conducted on the single study data only; at the second stage, summary measures, including blip parameter estimates and variance or covariance estimates, are shared and combined in a Bayesian hierarchical model to estimate a common optimal ITR.

The conventional pairwise meta-analysis approach focuses on binary treatments and assumes that all studies consist of the same treatment comparisons. In this article, we consider multiple treatments, and different studies may encompass different sets of treatments. With different treatment sets across studies, the estimated ITRs using study-specific data will only include a subset of the available treatments, and treatments that are not present in the same study cannot be simultaneously considered in an estimated ITR. To address this issue and construct an ITR for all treatments, we employ a network meta-analysis approach and construct the Bayesian hierarchical model at the second stage based on the consistency equation (7), which is assumed for both the main treatment effect and the treatment–covariate interactions. The consistency assumption relates our parameters of interest to the parameters that can be estimated with direct evidence, and is essential to ensure the validity of the Bayesian hierarchical model at the second stage. It builds on the transitivity assumption that the studies comparing different sets of treatments are sufficiently similar with respect to all important factors other than the treatments being compared.¹⁹ For example, the distributions of treatment effect modifiers should be similar across studies. The consistency and transitivity assumptions can be violated in certain cases. For example, a treatment in a given site may be inappropriate for patients in another site due to the heterogeneity in population. In this case, in addition to these two assumptions, the positivity assumption may also be violated if specific covariate combinations preclude receipt of particular treatments. In such settings, the proposed method will be inappropriate due to the violation of the assumptions in both the network meta-analysis and causal inference aspects of the analysis.
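To make the role of the consistency equation (7) in the second stage concrete, the sketch below combines site-level contrast estimates for a single blip parameter when the between-site heterogeneity is set to zero. It is not the MCMC implementation used in this article: it swaps in the closed-form conjugate normal posterior, assuming the first-stage estimates are normally distributed with known covariance and each common parameter has a $N(0, 10{,}000)$ prior. The function names, the convention that the first listed treatment is the site reference, and the toy numbers are all hypothetical.

```python
import numpy as np


def design_matrix(site_treatments, all_treatments):
    """Rows encode the consistency equation for one site.

    site_treatments[0] is taken as the site-specific reference; all_treatments[0]
    is the common reference, whose parameter is fixed at zero (no column).
    """
    cols = {trt: j for j, trt in enumerate(all_treatments[1:])}
    ref = site_treatments[0]
    x = np.zeros((len(site_treatments) - 1, len(cols)))
    for row, trt in enumerate(site_treatments[1:]):
        x[row, cols[trt]] = 1.0           # + psi_{trt}
        if ref in cols:
            x[row, cols[ref]] = -1.0      # - psi_{site reference}
    return x


def combine(sites, all_treatments, prior_var=10_000.0):
    """sites: list of (site_treatments, estimates, covariance) tuples."""
    k = len(all_treatments) - 1
    precision = np.eye(k) / prior_var     # N(0, prior_var) prior on each parameter
    rhs = np.zeros(k)
    for trts, est, cov in sites:
        x = design_matrix(trts, all_treatments)
        w = np.linalg.inv(np.atleast_2d(cov))
        precision += x.T @ w @ x
        rhs += x.T @ w @ np.atleast_1d(est)
    post_cov = np.linalg.inv(precision)
    return post_cov @ rhs, post_cov


treatments = ["d1", "d2", "d3"]
sites = [
    (["d1", "d2"], [5.0], [[0.04]]),      # direct estimate of psi_2
    (["d2", "d3"], [3.0], [[0.04]]),      # estimate of psi_3 - psi_2
    (["d1", "d3"], [8.0], [[0.04]]),      # direct estimate of psi_3
]
post_mean, post_cov = combine(sites, treatments)
```

In this toy network the second site never observes the common reference $d_{1}$, yet through the row $(-1, 1)$ of its design matrix it still contributes to both common parameters.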

While assessment of the transitivity assumption can only be made qualitatively, the consistency assumption can be statistically assessed when both direct and indirect evidence are available. A discussion of these approaches can be found elsewhere.^27,43–47 We can reduce the possibility of inconsistency in both the design and analysis stages. When designing a multisite trial, it is recommended to standardize the study protocol to guarantee the same or similar populations, treatment delivery, and assessment of the outcomes and covariates.^48,49 Before the two-stage analysis of the data, we may also prescreen participants based on some analyst-defined harmonized eligibility criteria to ensure the samples in the analysis are similar across sites. If the consistency assumption is deemed implausible, incorporation of inconsistency may be considered.⁴³ For example, an inconsistency factor $δ_{a_{i}^{(\tilde{h})}, a_{i}^{(1)}, t}$ may be added to the consistency equation (7), that is, ${\tilde{ψ}}_{a_{i}^{(\tilde{h})} a_{i}^{(1)}, t} = ψ_{a_{i}^{(\tilde{h})} t} - ψ_{a_{i}^{(1)} t} + δ_{a_{i}^{(\tilde{h})}, a_{i}^{(1)}, t}$ , and a model or a distribution can be posited for $δ_{a_{i}^{(\tilde{h})}, a_{i}^{(1)}, t}$ . However, this adaptation in our setting requires further investigation, and the implications for positivity violations should also be considered; this is beyond the scope of this article.

In the consistency equation (7), to identify parameters $ψ_{a_{i}^{(\tilde{h})} t}$ and $ψ_{a_{i}^{(1)} t}$ from ${\tilde{ψ}}_{a_{i}^{(\tilde{h})} a_{i}^{(1)}, t}$ , data information should be available for at least one of the two parameters $ψ_{a_{i}^{(\tilde{h})} t}$ and $ψ_{a_{i}^{(1)} t}$ . This requires a connected network, that is, every two treatments in the network can be either directly or indirectly compared. A disconnected network complicates the model.^50–53 It would be of interest to explore whether proposed methods for meta-analysis in a disconnected network setting can be adapted in our context. For example, in a random baseline model,⁵⁴ a disconnected network will be connected by assuming the baseline effects are exchangeable across studies. In our context, this could be translated into the exchangeability of the site-specific treatment-free parameters and including their estimates in the Bayesian hierarchical model. One criticism of the random baseline model is that it breaks the randomization by assuming randomization not only within studies but also across studies. This limitation might be less of a concern in our case. Although we only assume a common distribution for the blip parameters in (4), in reality, it is also likely that the treatment-free parameters have a common distribution, as the populations in each site should be similar or be a subset of a larger target population. If the population (or the site-specific ITR) is totally different or unrelated across sites, estimating a common optimal ITR is not meaningful.
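The connectedness requirement discussed above is easy to check before fitting the second-stage model. The sketch below treats all treatments available at the same site as directly compared and runs a depth-first search over the resulting graph; the function name is hypothetical, and the treatment sets are those described for the three studies in Section 4.

```python
def is_connected(site_treatment_sets):
    """True if every pair of treatments is linked directly or indirectly."""
    treatments = set().union(*site_treatment_sets)
    if not treatments:
        return True
    # Treatments sharing a site are directly compared, hence adjacent.
    adjacent = {trt: set() for trt in treatments}
    for site in site_treatment_sets:
        for trt in site:
            adjacent[trt] |= set(site) - {trt}
    start = next(iter(treatments))
    seen, stack = {start}, [start]
    while stack:
        for nxt in adjacent[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return seen == treatments


network = [
    {"SER", "BUP", "VEN", "CIT + BUP", "CIT + BUS"},   # STAR*D
    {"SER", "BUP"},                                    # EMBARC
    {"SER", "BUP", "VEN", "ESCIT"},                    # REVAMP
]
connected = is_connected(network)
```

For the depression network every treatment is linked to SER, so the check passes; two sites with no shared treatment and no linking site would fail it.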

In this article, we focus on a continuous outcome, and a linear regression model is used at the first stage. This implementation assumes linearity and thus the results are sensitive to model specification. In practice, the linear relationship may not fully capture the true dynamics among covariates, treatments, treatment–covariate interactions, and the outcome. When the outcome model is misspecified in the Q-learning approach, the estimator in the first stage loses its consistency and unbiasedness,⁵⁵ leading to biased estimation of the common blip parameters. However, linear models offer the advantage of interpretability, which is crucial in treatment decision-making. Alternatives such as dWOLS and G-estimation, which provide both double robustness and interpretability, can also be considered at the first stage. In addition to the correctly specified blip model, these methods only require the correct specification of either the treatment-free model or the treatment assignment model, and thus are particularly attractive in the context of randomized trials, where the treatment allocation is known by design and greater flexibility in specifying the treatment-free model is allowed. Extension to other outcome types requires further investigation. A naïve way to adapt the proposed method for different outcome types is to use existing ITR estimation methods for a specific outcome type at the first stage, and then combine the resulting parameter estimates across sites using a Bayesian hierarchical model at the second stage in the same way to produce common blip parameter estimators as well as the common optimal ITR. 
For example, for a binary outcome, Q-learning can be implemented as logistic regression at the first stage.⁵⁶ Variants of Q-learning, G-estimation, and dWOLS have been proposed for survival outcomes.^57,58 However, with a binary outcome or a survival outcome, whether the two-stage approach yields estimates similar to those from an analysis based on the full individual-level data remains an open question due to non-collapsibility.

We evaluate the ITR estimation through simulations. In all scenarios explored, the proposed method yielded consistent estimation. Additionally, simulation results support the feasibility of assuming common between-site heterogeneity when specifying the structure of variance–covariance matrix ${\tilde{Σ}}_{t}$ , regardless of the true underlying structure. In our simulations, we lack sufficient data to estimate between-site heterogeneity when only a single study is available for each treatment set. Therefore, we assume the between-site heterogeneity to be zero in those cases. When more than one study is available for a particular treatment set, it is technically possible to estimate this heterogeneity. However, the decision to assume a zero or nonzero between-site heterogeneity in the model depends on the specific context. If the number of studies is limited, estimating a nonzero between-site heterogeneity may suffer from low precision. Even with a sufficient number of sites, whether to assume a zero or nonzero between-site heterogeneity depends on how confident we are about the homogeneity of the site-specific parameters. In our simulations, we fit the model with a nonzero between-site heterogeneity when the number of sites is three for each treatment set, as the data are known to be generated under heterogeneity by design. A model allowing for between-site heterogeneity also offers more flexibility, as the between-site heterogeneity can be estimated and assessed. In practice, without knowing the true data-generating mechanism, it is recommended to fit the model both with zero and nonzero between-site heterogeneity using various structures to assess the sensitivity of the results to different modeling strategies as well as the feasibility of assuming a nonzero between-site heterogeneity. When assuming an unstructured nonzero between-site heterogeneity, we adopt a separation strategy for the prior specification of the variance–covariance matrix. 
We assign a half-Cauchy prior for the standard deviation parameters and an LKJ prior for the correlation matrix. They may not be the best choice in certain cases. For example, when the number of sites is small, assuming an EQ in the correlation matrix will reduce the number of parameters to be estimated, possibly giving more precise estimates. However, in our simulations, different prior choices have little impact on the estimation results.
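The separation strategy can be written out as follows. This is a sketch in illustrative notation: the symbols $\tau$ (vector of standard deviations), $\Omega$ (correlation matrix), the scale $s$, and the shape $\eta$ are assumptions introduced here, and no specific hyperparameter values are implied.

$$
\tilde{\Sigma}_{t} = \operatorname{diag}(\tau)\,\Omega\,\operatorname{diag}(\tau), \qquad \tau_{k} \sim \text{half-Cauchy}(0, s), \qquad \Omega \sim \text{LKJ}(\eta),
$$

with densities

$$
p(\tau_{k}) \propto \left(1 + \frac{\tau_{k}^{2}}{s^{2}}\right)^{-1} \ \text{for } \tau_{k} > 0, \qquad p(\Omega \mid \eta) \propto \det(\Omega)^{\eta - 1}.
$$

Under this decomposition, $\eta = 1$ gives a uniform density over correlation matrices, while larger $\eta$ concentrates mass near the identity matrix, that is, toward weaker correlations between parameters.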

The application of the proposed method is illustrated through an analysis of real data from the STAR*D, EMBARC, and REVAMP studies. Covariates common to the three studies that are considered in the literature to be related to the depression outcome or treatment response were selected and included in the model. Although statistical associations between these covariates and the depression outcome have been established in the literature, in clinical settings, some variables, such as marital status and the number of people in the household, are seldom considered by physicians when prescribing depression medications. We considered these covariates in the analysis as a proxy for social support, but whether and how to deploy them in clinical contexts requires careful consideration. We dichotomized the number of depressive episodes, which leads to information loss. The cutoff point was determined solely based on the data and thus lacks a clinical interpretation. In clinical settings, patients with two or more episodes are considered to have recurrent depressive disorder. However, most patients in the EMBARC study had two or more episodes, making a threshold value of two less meaningful in terms of distinguishing depression severity or chronicity. There would be little information available to learn about treatment tailoring by number of episodes if the standard threshold were used. Moreover, only patients who had their first MDD before the age of 30 could be included in the EMBARC study. Patients not satisfying this condition in the STAR*D and REVAMP studies were excluded from the analysis. While this reduced the sample sizes in individual studies, such a restriction can be recommended in practice when populations from different studies differ substantially, as a means of ensuring that the positivity assumption is met. We assume a zero between-site variance–covariance matrix, as there is only a single site for each unique treatment set. However, this is not the only feasible choice. We note that the SER and BUP comparison is present in all three studies, and the SER and VEN comparison is available in both the STAR*D and REVAMP studies. It is possible to assume a nonzero between-study heterogeneity for these comparisons, but a common effect for others (see Supplemental Appendix S2).

In addition to the main analysis, where SER was chosen as the common reference treatment, we also performed a two-stage analysis using BUP as the common reference treatment, as well as a one-stage analysis where all individual-level data are combined and analyzed together. Changing the common reference treatment from SER to BUP results in only minimal variation in the parameter estimates. In general, the choice of the common reference treatment should be based on clinical relevance or the presence of the treatments in the network. When no standard treatment is available across all sites, we can choose the treatment that has the most direct comparisons with others as the common reference treatment. This avoids relying on the consistency assumption where it is unnecessary and thus can reduce the uncertainty incurred by indirect comparisons. If multiple treatments are available across all sites, we can choose the treatment that is most clinically meaningful (e.g. a standard treatment) or the most established treatment, especially when novel treatments are included in the analysis. In all three analyses, all estimated effects are relatively small in magnitude compared to their wide credible intervals, indicating a lack of strong evidence for the need to tailor treatment assignments based on the covariates considered in the analysis. This aligns with findings in the literature that baseline anxiety level and common sociodemographic variables, such as age, marital/employment status, or education level, do not contribute to differential treatment effects.^59,60 However, Noma et al.⁶¹ found through a meta-analysis that several variables, including age, age at onset, and HDRS subscales, could be potential effect modifiers for response to depression treatments. We also observed that the main effects of CIT + BUP and ESCIT are negative. However, combination therapy is generally expected to be superior to monotherapy⁶² and ESCIT is known to be more effective than, or at least similarly effective to, a range of antidepressants including CIT, SER, VEN, and BUP.^63,64 The discrepancy could arise from the limited evidence available in our analysis, as both CIT + BUP and ESCIT were each represented in only a single study. Additionally, only 35 patients in the REVAMP study received ESCIT, resulting in estimates with low precision.
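The heuristic of choosing, as the common reference, the treatment with the most direct comparisons can be sketched as below. The site names, treatment labels, tie-breaking rule, and the helper `pick_reference` are all hypothetical; they do not reproduce the treatment sets of the three depression studies, and clinical relevance may override the count in practice.

```python
from collections import defaultdict
from itertools import combinations


def pick_reference(site_treatments: dict[str, set[str]]) -> str:
    """Return the treatment with the most distinct direct comparators.

    Ties are broken by the number of sites offering the treatment,
    then alphabetically (an arbitrary, illustrative rule).
    """
    comparators = defaultdict(set)  # treatment -> treatments it is directly compared with
    n_sites = defaultdict(int)      # treatment -> number of sites including it
    for arms in site_treatments.values():
        for t in arms:
            n_sites[t] += 1
        # Every pair of treatments within a site is a direct comparison
        for a, b in combinations(sorted(arms), 2):
            comparators[a].add(b)
            comparators[b].add(a)
    return min(n_sites, key=lambda t: (-len(comparators[t]), -n_sites[t], t))


# Toy network: treatment A is compared directly with B, C, and D
sites = {
    "site_1": {"A", "B", "C"},
    "site_2": {"A", "B"},
    "site_3": {"A", "D"},
}
reference = pick_reference(sites)
print(reference)
```

With treatment A as reference in this toy network, every contrast against the reference is informed by at least one direct within-site comparison, so no contrast relies on indirect evidence alone.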

No modeling assumptions were found to be heavily violated for the linear regressions at the first stage, but the $R^{2}$ was relatively low (about 40% to 50%). The covariates included in the analysis reflect only sociodemographic and symptom information. Some important features that are predictive of the outcome, such as those related to genetic information or comorbidities,⁴⁰ are missing. These features could also help refine the personalization of depression treatments. Our analysis, therefore, provides an important proof of concept, but further refinements would be needed before deploying findings from such an analysis in a clinical setting. Nevertheless, the results showcase a promising approach to leveraging multiple data sources to learn how important patient characteristics modify the effects of a potentially large number of treatment options, and to using those covariates to better allocate treatment for improved patient care.

Supplemental Material

sj-pdf-1-smm-10.1177_09622802251387430 - Supplemental material for Two-stage Bayesian network meta-analysis of individualized treatment rules for multiple treatments with siloed data

Supplemental material, sj-pdf-1-smm-10.1177_09622802251387430 for Two-stage Bayesian network meta-analysis of individualized treatment rules for multiple treatments with siloed data by Junwei Shen, Erica EM Moodie and Shirin Golchi in Statistical Methods in Medical Research

Footnotes

Acknowledgements

This work is supported by an award from the Canadian Institutes of Health Research CIHR FDN-16726. EEMM is a Canada Research Chair (Tier 1) in Statistical Methods for Precision Medicine and acknowledges the support of a chercheur de mérite career award from the Fonds de recherche du Québec – Santé. SG acknowledges support from the Natural Sciences and Engineering Research Council of Canada (NSERC), Canadian Statistical Sciences Institute (CANSSI) and Fonds de recherche du Québec – Santé (FRQS).

Data availability

The data used in Section 4 were obtained from the National Institute of Mental Health (NIMH) Data Archive (NDA). Researchers can request access at

ORCID iDs

Junwei Shen

Shirin Golchi

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by an award from the Canadian Institutes of Health Research CIHR FDN-16726. EEMM is a Canada Research Chair (Tier 1) in Statistical Methods for Precision Medicine and received the support of a chercheur de mérite career award from the Fonds de recherche du Québec – Santé. SG acknowledges support from the Natural Sciences and Engineering Research Council of Canada (NSERC), Canadian Statistical Sciences Institute (CANSSI) and Fonds de recherche du Québec – Santé (FRQS).

Declaration of conflicting interest

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Supplemental material

Supplemental material for this article is available online.

References

1. Kravitz, Duan, Braslow. Evidence-based medicine, heterogeneity of treatment effects, and the trouble with averages. Milbank Q 2004; 82: 661–687.

2. Chakraborty, Moodie EEM. Statistical methods for dynamic treatment regimes: Reinforcement learning, causal inference, and personalized medicine. New York: Springer, 2013.

3. Kosorok, Laber. Precision medicine. Annu Rev Stat Appl 2019; 6: 263–286.

4. Watkins CJCH. Learning from delayed rewards. PhD Thesis, King’s College, Cambridge, UK, 1989.

5. Sutton, Barto. Reinforcement learning: An introduction. Cambridge: MIT Press, 2018.

6. Robins. Optimal structural nested models for optimal sequential decisions. In: Lin DY and Heagerty PJ (eds) Proceedings of the second Seattle symposium in biostatistics. New York: Springer, 2004, pp. 189–326.

7. Wallace, Moodie EEM. Doubly-robust dynamic treatment regimen estimation via weighted least squares. Biometrics 2015; 71: 636–644.

8. Greenland. Tests for interaction in epidemiologic studies: a review and a study of power. Stat Med 1983; 2: 243–251.

9. Spicker, Moodie EEM, Shortreed. Differentially private outcome-weighted learning for optimal dynamic treatment regime estimation. Stat 2024; 13: e641.

10. Dwork. Differential privacy. In: Bugliesi M, Preneel B, Sassone V, Wegener I (eds) Automata, Languages and Programming. ICALP 2006. Lecture Notes in Computer Science, vol 4052. Berlin, Heidelberg: Springer.

11. Zhao, Zeng, Rush, et al. Estimating individualized treatment rules using outcome weighted learning. J Am Stat Assoc 2012; 107: 1106–1118.

12. Danieli, Moodie EEM. Preserving data privacy when using multi-site data to estimate individualized treatment rules. Stat Med 2022; 41: 1627–1643.

13. Saha-Chaudhuri, Weinberg. Addressing data privacy in matched studies via virtual pooling. BMC Med Res Methodol 2017; 17: 1–10.

14. Rassen, Moran, Toh, et al. Evaluating strategies for data sharing and analyses in distributed data settings. Technical report, Mini-Sentinel, 2013.

15. Moodie EEM, Coulombe, Danieli, et al. Privacy-preserving estimation of an optimal individualized treatment rule: a case study in maximizing time to severe depression-related outcomes. Lifetime Data Anal 2022; 28: 1–31.

16. Schulz, Moodie EEM. Doubly robust estimation of optimal dosing strategies. J Am Stat Assoc 2021; 116: 256–268.

17. Shen, Moodie EEM, Golchi. Sparse 2-stage Bayesian meta-analysis for individualized treatments. Biometrics 2025; 81: ujaf082.

18. Salanti. Indirect and mixed-treatment comparison, network, or multiple-treatments meta-analysis: many names, many benefits, many concerns for the next generation evidence synthesis tool. Res Synth Methods 2012; 3: 80–97.

19. Cipriani, Higgins, Geddes, et al. Conceptual and technical challenges in network meta-analysis. Ann Intern Med 2013; 159: 130–137.

20. Rubin. Discussion of “Randomization analysis of experimental data in the Fisher randomization test” by D. Basu. J Am Stat Assoc 1980; 75: 591–593.

21. Robins. Causal inference from complex longitudinal data. In: Berkane M (ed.) Latent variable modeling and applications to causality. New York: Springer, 1997, pp. 69–117.

22. Cole, Hernán. Constructing inverse probability weights for marginal structural models. Am J Epidemiol 2008; 168: 656–664.

23. Arjas, Saarela. Optimal dynamic regimes: presenting a case for predictive inference. Int J Biostat 2010; 6: 10.

24. Logan, Sparapani, McCulloch, et al. Decision making and uncertainty quantification for individualized treatments using Bayesian additive regression trees. Stat Methods Med Res 2019; 28: 1079–1093.

25. Hahn, Murray, Carvalho. Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects (with discussion). Bayes Anal 2020; 15: 965–1056.

26. Gelman, Carlin, Stern, et al. Bayesian data analysis. New York: CRC Press, 2013.

27. White, Barrett, Jackson, et al. Consistency and inconsistency in network meta-analysis: model estimation using multivariate meta-regression. Res Synth Methods 2012; 3: 111–125.

28. Riley, Fisher. Individual participant data meta-analysis: A handbook for healthcare research. Oxford: John Wiley & Sons, 2021.

29. Gelman. Prior distributions for variance parameters in hierarchical models. Bayes Anal 2006; 1: 515–533.

30. Barnard, McCulloch, Meng. Modeling covariance matrices in terms of standard deviations and correlations, with application to shrinkage. Stat Sin 2000; 10: 1281–1311.

31. Wang, Lin, Hodges, et al. The impact of covariance priors on arm-based Bayesian network meta-analyses with binary outcomes. Stat Med 2020; 39: 2883–2900.

32. Lewandowski, Kurowicka, Joe. Generating random correlation matrices based on vines and extended onion method. J Multivar Anal 2009; 100: 1989–2001.

33. Stan Development Team. Stan modeling language users’ guide and reference manual (version 2.28), 2021. https://mc-stan.org/.

34. Stan Development Team. RStan: the R interface to Stan (R package version 2.21.2), 2020. http://mc-stan.org/

35. Morris, White, Crowther. Using simulation studies to evaluate statistical methods. Stat Med 2019; 38: 2074–2102.

36. Rush, Fava, Wisniewski, et al. Sequenced treatment alternatives to relieve depression (STAR*D): rationale and design. Control Clin Trials 2004; 25: 119–142.

37. Trivedi, McGrath, Fava, et al. Establishing moderators and biosignatures of antidepressant response in clinical care (EMBARC): rationale and design. J Psychiatr Res 2016; 78: 11–23.

38. Trivedi, Kocsis, Thase, et al. REVAMP – Research evaluating the value of augmenting medication with psychotherapy: rationale and design. Psychopharmacol Bull 2008; 41: 5–33.

39. Cipriani, Furukawa, Geddes, et al. Does randomized evidence support sertraline as first-line antidepressant for adults with acute major depression? A systematic review and meta-analysis. J Clin Psychiatry 2008; 69: 1732–1742.

40. Perlman, Benrimoh, Israel, et al. A systematic meta-review of predictors of antidepressant treatment outcome in major depressive disorder. J Affect Disord 2019; 243: 503–515.

41. Kessler, van Loo, Wardenaar, et al. Using patient self-reports to study heterogeneity of treatment effects in major depressive disorder. Epidemiol Psychiatr Sci 2017; 26: 22–36.

42. Vyas, Eisenstein, Jones. Hidden in plain sight – reconsidering the use of race correction in clinical algorithms. New Engl J Med 2020; 383: 874–882.

43. Ades. Assessing evidence inconsistency in mixed treatment comparisons. J Am Stat Assoc 2006; 101: 447–459.

44. Donegan, Williamson, D’Alessandro, et al. Assessing key assumptions of network meta-analysis: a review of methods. Res Synth Methods 2013; 4: 291–323.

45. Dias, Welton, Ades. Study designs to detect sponsorship and other biases in systematic reviews. J Clin Epidemiol 2010; 63: 587–588.

46. Dias, Welton, Caldwell, et al. Checking consistency in mixed treatment comparison meta-analysis. Stat Med 2010; 29: 932–944.

47. Piepho, Williams, Madden. The use of two-way linear mixed models in multitreatment meta-analysis. Biometrics 2012; 68: 1269–1277.

48. Weinberger, Oddone, Henderson, et al. Multisite randomized controlled trials in health services research: scientific challenges and operational issues. Med Care 2001; 39: 627–634.

49. Noda, Kraemer, Taylor, et al. Strategies to reduce site differences in multisite studies: a case study of Alzheimer disease progression. Am J Geriatr Psychiatry 2006; 14: 931–938.

50. Béliveau, Gustafson. A theoretical investigation of how evidence flows in Bayesian network meta-analysis of disconnected networks. Bayes Anal 2021; 16: 803–823.

51. Stevens, Fletcher, Downey, et al. A review of methods for comparing treatments evaluated in studies that form disconnected networks of evidence. Res Synth Methods 2018; 9: 148–162.

52. Schmitz, Maguire, Morris, et al. The use of single armed observational data to closing the gap in otherwise disconnected evidence networks: a network meta-analysis in multiple myeloma. BMC Med Res Methodol 2018; 18: 1–18.

53. Goring, Gustafson, Liu, et al. Disconnected by design: analytic approach in treatment networks having no common comparator. Res Synth Methods 2016; 7: 420–432.

54. Béliveau, Goring, Platt, et al. Network meta-analysis of disconnected networks: how dangerous are random baseline treatment effects? Res Synth Methods 2017; 8: 465–474.

55. Schulte, Tsiatis, Laber, et al. Q- and A-learning methods for estimating optimal dynamic treatment regimes. Stat Sci 2014; 29: 640–661.

56. Moodie EEM, Dean, Sun. Q-learning: flexible learning about useful utilities. Stat Biosci 2014; 6: 223–243.

57. Goldberg, Kosorok. Q-learning with censored data. Ann Stat 2012; 40: 529–560.

58. Simoneau, Moodie EEM, Nijjar, et al. Estimating optimal dynamic treatment regimes with survival outcomes. J Am Stat Assoc 2020; 115: 1531–1539.

59. Rush, Batey, Donahue, et al. Does pretreatment anxiety predict response to either bupropion SR or sertraline? J Affect Disord 2001; 64: 81–87.

60. Archer, Kessler, Lewis, et al. What predicts response to sertraline for people with depression in primary care? A secondary data analysis of moderators in the PANDA trial. PLoS ONE 2024; 19: 1–12.

61. Noma, Furukawa, Maruo, et al. Exploratory analyses of effect modifiers in the antidepressant treatment of major depression: individual-participant data meta-analysis of 2803 participants in seven placebo-controlled randomized trials. J Affect Disord 2019; 250: 419–424.

62. Henssler, Bschor, Baethge. Combining antidepressants in acute treatment of depression: a meta-analysis of 38 studies including 4511 patients. Can J Psychiatry 2016; 61: 29–43.

63. Sidney, Kennedy HFA, Thase. Escitalopram in the treatment of major depressive disorder: a meta-analysis. Curr Med Res Opin 2009; 25: 161–175.

64. Kirino. Escitalopram for the management of major depressive disorder: a review of its efficacy, safety, and patient acceptability. Patient Prefer Adherence 2012; 6: 853–861.
