Abstract
Keywords
Introduction
It is widely recognized in medical research that treatment responses often exhibit variability among different patient subgroups. Personalized medicine leverages this heterogeneity in treatment effect to enhance healthcare service quality by delivering tailored treatments to individual patients.1–3 An individualized treatment rule (ITR) is a decision rule that utilizes patient-level information, such as demographics, genetic makeup, or disease history, to customize treatment plans at a single decision point. An optimal ITR guides treatment selection for individual patients with the goal of optimizing patient outcomes. Estimating the optimal ITR is essential for the practice of personalized medicine and has attracted significant research focus.
Regression-based approaches are commonly employed to indirectly estimate the optimal ITR. These methods model the expected patient outcome as a function of treatment, covariates, and treatment–covariate interactions. Then, the optimal treatment is determined as the one that leads to the best estimated outcome for any given covariate profile. Q-learning,4,5 G-estimation, 6 and dynamic weighted ordinary least squares (dWOLS) 7 are three popular regression-based approaches. Estimating treatment–covariate interactions based on individual-level data is essential in these approaches. With advancements in technologies, the availability of large collections of health data from multiple data sites has facilitated the identification of factors that contribute to differential treatment responses, provided a higher statistical power of treatment–covariate interaction estimation that cannot be offered by a single site, 8 and improved the generalizability of the findings. However, patient-level health data are typically highly sensitive, and their disclosure could cause a violation of data-sharing agreements or policies, presenting a challenge for ITR estimation with multisite data. Therefore, valid approaches to ITR estimation without releasing patient-level information are desired.
Several approaches have been proposed to avoid individual-level data sharing for ITR estimation. Spicker et al. 9 investigate differential privacy 10 in the context of dynamic treatment regimes, an extension of ITRs to multiple treatment decision points. Instead of regression-based approaches, they focus on an outcome-weighted learning method, 11 which frames the estimation of ITRs as a classification loss minimization problem and identifies the optimal treatment through support vector machine classifiers. Danieli and Moodie 12 study the use of data pooling 13 and distributed regression 14 to protect individual-level data from release in multisite studies in the context of ITR estimation with generalized dWOLS for continuous outcomes. In their approach, estimators characterizing the optimal ITR are computed using data summaries (e.g. pooled data or matrix products) shared by each single site, rather than individual-level data. Moodie et al. 15 also explore distributed regression in a dynamic weighted survival modeling (a generalization of dWOLS to survival outcomes 16 ). One limitation of both approaches is that they typically assume parameters of interest are fixed and common to all sites. To overcome this limitation, our recent work 17 adapted a two-stage Bayesian meta-analysis approach, which requires only site-specific analyses of individual-level data within each site and sharing site-specific estimates, as summary data, to construct a common optimal ITR in settings where all sites are assigned the same treatment options. Conventional meta-analysis approaches typically assume that the treatment is binary and that each site consists of the same treatment comparison, limiting their applicability in a wide range of diseases where the treatment landscape can be quite heterogeneous, as is the case with conditions such as depression. In such cases, due to funding or time constraints, only a subset of available treatment options can be delivered in each site, and yet establishing an optimal ITR that considers all treatments is often desired. Analogously, in our motivating example, we wish to draw inferences using randomized trial data from trials whose randomization groups are overlapping but not identical. In this article, we consider ITR estimation in multisite studies without sharing individual-level data, when more than two treatments are available, and each site may encompass different sets of treatment assignment options.
An extension of classic meta-analysis to multiple treatments is network meta-analysis.18,19 Network meta-analysis compares multiple treatments within a network of studies, involving the simultaneous analysis of direct evidence obtained from head-to-head trials and indirect evidence from studies including the treatments of interest and one or more common comparator treatments, when comparing any two treatments in the network. This drives the extension of the two-stage Bayesian meta-analysis approach proposed in our previous work 17 to the current setting where treatments are not common across all sites or studies, which is the objective of this work.
The remainder of this paper is organized as follows: Section 2 describes the proposed method, including the notations and assumptions. A simulation study is presented in Section 3 to explore the performance of ITR estimation using the proposed method. Section 4 demonstrates the application of the proposed method via an analysis of real data from three randomized clinical trials for the treatment of depression. The paper concludes with a discussion in Section 5.
Methods
Preliminaries
Consider the data
We make the following assumptions: (i) the stable unit treatment value assumption: a patient’s outcome is not influenced by other patients’ treatment 20 ; (ii) no unmeasured confounding 21 ; (iii) positivity: there is a positive probability of receiving every possible treatment for every combination of covariate values that occur among individuals in the population. 22
Define a treatment-free function
Since the treatment assignment only influences the outcome through the blip model
In this section, we describe the use of a two-stage Bayesian network meta-analysis approach to avoid disclosing individual-level data, when estimating the optimal ITR for multiple treatments using multisite data. We first describe the model when all
In the two-stage Bayesian meta-analysis approach, we first obtain a set of estimates for the site-specific parameters by conducting analyses on data from each single site, and then combine these estimates in a Bayesian hierarchical model to obtain estimates of the common parameters
Model (5) requires all
To proceed, the treatment set in site
With the site-specific estimates
When
The simulation study is reported following the aims, data-generating mechanisms, estimands, methods, and performance measures (ADEMP) scheme proposed in Morris et al. 35
Aims
We aim to evaluate ITR estimation for a continuous outcome and multiple treatments when individual-level data are protected from disclosure via a two-stage Bayesian network meta-analysis approach, under assumptions regarding (1) network sizes, (2) network shapes, (3) the true between-site heterogeneity, and (4) the assumed between-site heterogeneity in the Bayesian hierarchical model. Points (1)–(3) concern the data-generating mechanisms, while (4) concerns the analysis model. We have explored ITR estimation with the two-stage Bayesian pairwise meta-analysis under different confounding scenarios, between-site heterogeneity levels, prior choices, and sample sizes in the previous work 17 ; note that in this previous work, we did not consider settings where the assigned treatments varied across sites. While that previous work did not include settings where treatments offered varied by site, we expect similar results can be obtained when we have a network of studies and thus do not consider those particular features in the simulations here.
Data-generating mechanisms
The network structures considered in the simulation are shown in Figure 1. Sites included in each network structure with their site-specific treatment set

Graphics of simulated networks (a)–(e). Network (e) reflects the network structure for the real-data application to three trials described in Section 4. Connecting lines indicate the two treatments can be directly compared. The treatment
Sites included in each network.
Coefficients in multinomial probabilities for treatment assignment.
Suppressing the individual-specific subscript, the continuous outcome for an individual at site
The site-specific parameters common between-site heterogeneity: varying between-site heterogeneity:
where the between-study variance
The estimands of interest are the common blip function parameters When only a single site exists for each different site-specific treatment set When we have three sites for each unique sites-specific treatment set, priors will be assigned under different modeling assumptions: Under common between-site heterogeneity assumption, For unstructured
We are unaware of any existing methods that can estimate common ITRs across multiple studies with differing treatment values at different sites without individual-level data being shared. Thus, our analyses consist of comparing different model specifications within our Bayesian network meta-analysis approach. As no alternatives were found in the literature, no competing methods were included. In previous work, 17 we compared the two-stage approach with a one-stage analysis where all individual-level data are combined and analyzed in both simulations and a real-data application, under scenarios where the same binary or continuous treatments were available across all sites. We found that both approaches provided similar results, and we expect this conclusion to extend to settings where multiple treatments and varying treatment sets are available across sites. Therefore, we do not perform a one-stage analysis in the current simulation. However, both the one-stage and two-stage methods are implemented for the real-data application in Section 4. We assess: (i) the relative bias of blip parameter estimators, which can be calculated by the difference between the mean of the estimates and the true value, divided by the latter, (ii) the standard deviation of the estimates, (iii) the difference in the value function (dVF) under the true and estimated optimal ITR, where the value function with respect to an ITR is approximated by the expected outcome if all patients in a new cohort of size 100,000 were treated according to the ITR, and (iv) the empirical standard deviation of the dVF when the estimated treatment rule was applied to the same population.
In this section, for the sake of space, only simulation results for
Simulation results when the data are generated under an unstructured between-site heterogeneity model.
RB: relative bias; SD: standard deviation; dVF: difference in value function; LKJ: Lewandowski–Kurowicka–Joe; RIW: restricted inverse-Wishart; EQ: equal correlation; ITR: individualized treatment rule. RB (
) and SDs of
and
, and dVF between true and estimated ITR and its standard deviation are reported across different networks, numbers of sites and heterogeneity assumptions in the Bayesian hierarchical model based on 2000 iterations.
Simulation results when the data are generated under an unstructured between-site heterogeneity model.
RB: relative bias; SD: standard deviation; dVF: difference in value function; LKJ: Lewandowski–Kurowicka–Joe; RIW: restricted inverse-Wishart; EQ: equal correlation; ITR: individualized treatment rule. RB (
Simulation results when the data are generated under the assumption of common between-site heterogeneity.
RB: relative bias; SD: standard deviation; dVF: difference in value function; LKJ: Lewandowski–Kurowicka–Joe; RIW: restricted inverse-Wishart; EQ: equal correlation; ITR: individualized treatment rule. RB (
With more sites contributing to the common ITR estimation, we have more precise blip parameter estimators and a smaller dVF, corresponding to a better ITR estimation. When there is only one site for each
In this section, we apply the proposed method to estimate an ITR for patients with major depressive disorder (MDD) using data from three studies: The Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study, 36 Establishing Moderators and Biosignatures of Antidepressant Response for Clinical Care (EMBARC) study, 37 and Research Evaluating the Value of Augmenting Medication with Psychotherapy (REVAMP) study. 38
All three studies are multistage randomized trials, with details of their designs described elsewhere.36–38 STAR*D include four stages. Due to single treatment assignment in the first stage and limited sample size in stages 3 and 4, we use data from stage 2, where patients without a satisfactory clinical outcome to citalopram (CIT) in the first stage were randomized to seven treatments. Among these, we focus on medications only: venlafaxine (VEN), sertraline (SER), bupropion (BUP), CIT augmented with BUP (CIT + BUP), or buspirone (BUS). In the case of EMBARC, we focus on SER and BUP in the second stage, as patients received only one active treatment SER and a placebo in the first stage. For REVAMP, data from the first stage is used, where a medication algorithm was implemented for treatment assignment, and SER, BUP, VEN, and escitalopram (ESCIT) are included. Therefore, in total, six treatments are identified:

Network structure of analysis of STAR*D, EMBARC, and REVAMP data. The size of each node (in red) is proportional to the total sample size in the corresponding treatment group, and the width of the connecting line (in gray) between any two treatments is proportional to the number of studies that directly compared the two treatments. STAR*D: Sequenced Treatment Alternatives to Relieve Depression; EMBARC: Establishing Moderators and Biosignatures of Antidepressant Response for Clinical Care; REVAMP: Research Evaluating the Value of Augmenting Medication with Psychotherapy.
Depression severity is measured by the 17-item Hamilton Depression Rating Scale (HDRS-17), with a larger value corresponding to more severe symptoms. In our analysis, we consider the negative of HDRS-17 as the outcome. We choose covariates based on meta-reviews of antidepressant treatment outcome predictors and modifiers.40,41 The following covariates that are common in the three studies and potentially related to differential treatment effects were identified and included in the model at the first stage: (1) sociodemographic variables: age (in years), sex (male, female), race (White, non-White), marital status (single, married, divorced/widowed), number of years in formal education, employment status (employed, unemployed), number of people in household; (2) clinical variables: age at onset of first MDD (in years), number of depressive episodes, chronicity of current episode, and baseline HDRS-17 before receiving the treatments. Among these variables, race was only used as an adjustment variable rather than a tailoring variable for treatment assignment, as basing treatment decisions on racial or ethnic groups can lead to healthcare disparities and inequities.
42
Additionally, while the number of depressive episodes was small for many patients, there were also several large values (e.g. 120), making it unsuitable to include this variable as a continuous linear term in the model. Therefore, the number of depressive episodes was dichotomized using a cutoff point of four; that is, a binary variable was created based on whether the number of episodes is
Patient characteristics for STAR*D, EMBARC, and REVAMP studies.
STAR*D: Sequenced Treatment Alternatives to Relieve Depression; EMBARC: Establishing Moderators and Biosignatures of Antidepressant Response for Clinical Care; REVAMP: Research Evaluating the Value of Augmenting Medication with Psychotherapy; MDD: major depressive disorder; HDRS-17: 17-item Hamilton Depression Rating Scale.
Linear regression models with above mentioned covariates and their interactions with treatments were used to obtain site-specific blip parameter estimates and the corresponding variance–covariance matrix. Since for most treatment comparisons only one or two study-specific estimates are available, a Bayesian hierarchical model with
Blip parameter estimates (posterior medians) and the 95% posterior credible intervals for the real-data application.
BUP: bupropion; VEN: venlafaxine; CIT: citalopram; BUS: buspirone; ESCIT: escitalopram; MDD: major depressive disorder; HDRS-17: 17-item Hamilton Depression Rating Scale.
As BUP was also available in all three studies, in addition to the main analysis where SER was chosen as the common reference treatment, we conducted another two-stage analysis with BUP as the common reference treatment to assess how the choice of the common reference treatment would influence the estimation results. A one-stage analysis where all individual-level data are combined and analyzed was also performed for comparison. Figure 3 shows the posterior densities of the main treatment effect parameters,

Posterior densities of the main treatment effect
An optimal ITR can be estimated in a regression-based approach by including predefined treatment–covariate interactions. To increase the power for detecting differential treatment effects by covariates, large collections of datasets from multiple sites or studies are often needed. Different sites or studies may have varying treatment sets; however, an ITR analysis of all available treatments is desired. This presents a methodological gap which, to our knowledge, has not previously been considered. To address this gap, we adopt a two-stage Bayesian meta-analysis approach: at the first stage, study-specific analyses are conducted on the single study data only; at the second stage, summary measures, including blip parameter estimates and variance or covariance estimates, are shared and combined in a Bayesian hierarchical model to estimate a common optimal ITR.
The conventional pairwise meta-analysis approach focuses on binary treatments and assumes that all studies consist of the same treatment comparisons. In this article, we consider multiple treatments, and different studies may encompass different sets of treatments. With different treatment sets across studies, the estimated ITRs using study-specific data will only include a subset of the available treatments, and treatments that are not present in the same study cannot be simultaneously considered in an estimated ITR. To address this issue and construct an ITR for all treatments, we employ a network meta-analysis approach and construct the Bayesian hierarchical model at the second stage based on the consistency equation (7), which is assumed for both the main treatment effect as well as treatment–covariate interactions. The consistency assumption relates our parameters of interest and parameters that can be estimated with direct evidence, and is essential to ensure the validity of the Bayesian hierarchical model at the second stage. It builds on the transitivity assumption that the studies comparing different sets are sufficiently similar with respect to all important factors other than the treatments being compared. 19 For example, the distributions of treatment effect modifiers should be similar across studies. The consistency and transitivity assumptions can be violated in certain cases. For example, a treatment in a given site may be inappropriate for patients in another site due to the heterogeneity in population. In this case, in addition to these two assumptions, the positivity assumption may also be violated if specific covariate combinations preclude receipt of particular treatments. Therefore, the proposed method will be inappropriate due to the violation of the assumptions in both the network meta-analysis and causal inference aspects of the analysis.
While assessment of the transitivity assumption can only be made qualitatively, the consistency assumption can be statistically assessed when both direct and indirect evidence are available. A discussion of these approaches can be found elsewhere.27,43–47 We can reduce the possibility of inconsistency in both the design and analysis stages. When designing a multisite trial, it is recommended to standardize the study protocol to guarantee the same or similar populations, treatment delivery, and assessment of the outcomes and covariates.48,49 Before the two-stage analysis of the data, we may also prescreen participants based on some analyst-defined harmonized eligibility criteria to ensure the samples in the analysis are similar across sites. If the consistency assumption is deemed unfeasible, incorporation of inconsistency may be considered.
43
For example, an inconsistency factor
In the consistency equation (7), to identify parameters
In this article, we focus on a continuous outcome, and a linear regression model is used at the first stage. This implementation assumes linearity and thus the results are sensitive to model specification. In practice, the linear relationship may not fully capture the true dynamics among covariates, treatments, treatment–covariate interactions, and the outcome. When the outcome model is misspecified in the Q-learning approach, the estimator in the first stage loses its consistency and unbiasedness, 55 leading to biased estimation of the common blip parameters. However, linear models offer the advantage of interpretability, which is crucial in treatment decision-making. Alternatives such as dWOLS and G-estimation, which provide both double robustness and interpretability, can also be considered at the first stage. In addition to the correctly specified blip model, these methods only require the correct specification of either the treatment-free model or the treatment assignment model, and thus are particularly attractive in the context of randomized trials, where the treatment allocation is known by design and greater flexibility in specifying the treatment-free model is allowed. Extension to other outcome types requires further investigation. A naïve way to adapt the proposed method for different outcome types is to use existing ITR estimation methods for a specific outcome type at the first stage, and then combine the resulting parameter estimates across sites using a Bayesian hierarchical model at the second stage in the same way to produce common blip parameter estimators as well as the common optimal ITR. For example, for a binary outcome, Q-learning can be implemented as logistic regression at the first stage. 56 Variants of Q-learning, G-estimation, and dWOLS have been proposed for survival outcomes.57,58 However, with a binary outcome or a survival outcome, whether the two-stage approach yields estimates similar to those from an analysis based on the full individual-level data remains an open question due to non-collapsibility.
We evaluate the ITR estimation through simulations. In all scenarios explored, the proposed method yielded consistent estimation. Additionally, simulation results support the feasibility of assuming common between-site heterogeneity when specifying the structure of variance–covariance matrix
The application of the proposed method is illustrated through an analysis of real data from the STAR*D, EMBARC, and REVAMP studies. Common covariates in the three studies that are considered to be related to the depression outcome or treatment response in the literature were selected and included in the model. Although statistical associations between these covariates and the depression outcome have been established in the literature, in clinical settings, some variables, such as marital status and the number of people in the household, are seldom considered by physicians when prescribing depression medications. We considered these covariates in the analysis as a proxy for social support, but whether and how to deploy these in clinical contexts requires careful consideration. We dichotomized the number of depressive episodes, which will lead to information loss. The cutoff point was determined solely based on the data and thus lacks clinical interpretations. In clinical settings, patients with two or more episodes are considered to have recurrent depressive disorder. However, most patients in the EMBARC study had two or more episodes, making a threshold value of two less meaningful in terms of distinguishing depression severity or chronicity. There would be little information available to learn about treatment tailoring by episodes if the standard threshold is used. Moreover, only patients who had their first MDD before the age of 30 could be included in the EMBARC study. Patients not satisfying this condition in the STAR*D and REVAMP studies were excluded from the analysis. While this reduced the sample sizes in individual studies, it could be practically recommended when populations from different studies are quite different as a means of ensuring the positivity assumption is met. We assume a zero between-site variance–covariance matrix, as there is only a single site for each unique treatment set. However, this is not the only feasible choice. We note that the SER and BUP comparison is present in all three studies, and the SER and VEN comparison is available in both the STAR*D and REVAMP studies. It is possible to assume a nonzero between-study heterogeneity for these comparisons, but a common effect for others (see Supplemental Appendix S2).
In addition to the main analysis, where SER was chosen as the common reference treatment, we also performed a two-stage analysis using BUP as the common reference treatment, as well as a one-stage analysis where all individual-level data are combined and analyzed together. Changing the common reference treatment from SER to BUP only results in minimal variations in parameter estimates. In general, the choice of the common reference treatment should be based on clinical relevance or the presence of the treatments in the network. When no standard treatment is available across all sites, we can choose the treatment that has the most direct comparisons with others as the common reference treatment. This helps avoid the requirement of the consistency assumption when unnecessary and thus can reduce the uncertainty incurred by indirect comparisons. If multiple treatments are available across all sites, we can choose the treatment that is most clinically meaningful (e.g. a standard treatment) or the most established treatment, especially when novel treatments are included in the analysis. In all three analyses, all estimated effects are relatively small in magnitude compared to their wide credible intervals, indicating a lack of strong evidence for the need to tailor treatment assignments based on the covariates considered in the analysis. This aligns with findings in the literature that baseline anxiety level and common sociodemographic variables, such as age, marital/employment status, or education level, do not contribute to the differential treatment effects.59,60 However, Noma et al. 61 found that several variables, including age, age at onset, and HDRS subscales, could be potential effect modifiers for response to depression treatments through a meta-analysis. We also observed that the main effects of CIT + BUP and ESCIT are negative. However, combination therapy is generally expected to be superior to monotherapy 62 and ESCIT is known to be more or at least similarly effective compared to a range of antidepressants including CIT, SER, VEN, and BUP.63,64 The discrepancy could arise from the limited evidence available in our analysis, as both CIT + BUP and ESCIT were only represented in a single study. Additionally, only 35 patients in the REVAMP study received ESCIT, resulting in estimates with low precision.
No modeling assumptions were found to be heavily violated for the linear regressions at the first stage, but the
Supplemental Material
sj-pdf-1-smm-10.1177_09622802251387430 - Supplemental material for Two-stage Bayesian network meta-analysis of individualized treatment rules for multiple treatments with siloed data
Supplemental material, sj-pdf-1-smm-10.1177_09622802251387430 for Two-stage Bayesian network meta-analysis of individualized treatment rules for multiple treatments with siloed data by Junwei Shen, Erica EM Moodie and Shirin Golchi in Statistical Methods in Medical Research
Footnotes
Acknowledgements
Data availability
Funding
Declaration of conflicting interest
Supplemental material
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
