Adaptive enrichment allows pre-defined patient subgroups of interest to be investigated throughout the course of a clinical trial. These designs have gained attention in recent years because of their potential to shorten a trial's duration and to identify effective therapies tailored to specific patient groups. We describe enrichment trials which consider long-term time-to-event outcomes but also incorporate additional short-term information from routinely collected longitudinal biomarkers. These methods are suitable in settings where the trajectory of the biomarker may differ between subgroups and it is believed that the long-term endpoint is influenced by treatment, subgroup and biomarker. The methods are most promising when the majority of patients have biomarker measurements for at least two time points. We implement joint modelling of longitudinal and time-to-event data to define subgroup selection and stopping criteria, and we show that the familywise error rate is protected in the strong sense. To assess the results, we perform a simulation study and find that, compared to a study in which longitudinal biomarker observations are ignored, incorporating biomarker information leads to increased power and a higher probability that the (sub)population which truly benefits from the experimental treatment is enriched at the interim analysis. The investigations are motivated by a trial for the treatment of metastatic breast cancer, and the parameter values for the simulation study are informed by real-world data in which repeated circulating tumour DNA measurements and HER2 statuses are available for each patient; these are used as our longitudinal data and subgroup identifiers, respectively.
In current oncology practice and cancer clinical trials, it is crucial to focus testing of novel therapies on the patient subgroups most likely to benefit. Too many patients receive treatments that either do not work particularly well, are toxic, or sometimes both. Adaptive enrichment clinical trials enable the efficient testing of an experimental intervention on specific patient subgroups of interest.1,2 At an interim analysis, if a particular subgroup of patients is identified as responding particularly well to treatment, then we can focus resources and inferences by recruiting additional patients from the subgroup which benefits.
Simon and Simon3 showed the benefits of enrichment trials, in particular that patients who do not appear to benefit are removed from an experimental treatment with potentially harmful side effects. If the treatment is futile for all patients, the trial can be terminated at an interim analysis.4 Further, if patients respond overwhelmingly well to treatment, then there is potential to stop the trial early for efficacy, demonstrating that the experimental treatment is superior to control in this subgroup, and the usual benefits of group sequential tests apply.5 To combine the data from multiple stages and ensure that type 1 error rates are controlled, either a combination function approach6 or conditional error rate approaches7,8 were originally proposed. In recent years, the computation of such designs has been streamlined and optimised for different purposes.9,10,4,11 Extending Simon and Simon,3 more complex designs which allow for more generalised data structures and targeted selection rules have been proposed.12–14 A further advance upon enrichment designs is the adaptive signature trial,15 which simultaneously identifies and validates subgroup structures within a single trial protocol. These designs are based on cross-validation techniques; they suffer from inefficiencies in the way the data are analysed and are subject to bias. More recently, designs have been proposed16 which consider subgroup identification using a continuous biomarker. Such designs are based on an a priori assumption of a nested structure among subgroups.
In recent years, there has been increased uptake in enrichment trials which consider a long-term time-to-event (TTE) endpoint, such as overall survival (OS), but this is still low compared to continuous endpoints.17 In such trials, it is common for investigators to also collect repeated measures on biomarkers. Recent research proposes methods which use the short-term endpoint data for subgroup selection rules then focus on the primary endpoint data for hypothesis testing.18,19 Our aim is to leverage the additional biomarker information to improve interim decision making, early stopping rules and hypothesis testing.
We present a joint model for longitudinal and TTE data and base an enrichment trial design on the treatment effect in the joint model. There has been significant interest in joint modelling of longitudinal and TTE data20,21 with a focus on prediction and personalised medicine. However, the uses of joint modelling have yet to be established in clinical trial designs. We show that by incorporating the longitudinal data into the analysis via joint modelling, this results in the subgroup which benefits being selected more frequently and higher power (using the same number of patients) as the equivalent trial which ignores the biomarker observations. Our simulation results are based on data from a study which measured OS and plasma circulating tumour DNA (ctDNA) levels.22 To define subgroups, we hypothesise that patients who are HER2 negative will benefit from the experimental treatment more than patients who are HER2 positive.
Similarly to Magnusson and Turnbull,23 we use the ‘threshold selection’ rule combined with an error spending test to clearly predefine the subgroup selection and stopping rules before the trial commences. We present a method where, in the setting of TTE data and joint modelling, the relationship between number of observed events and information levels can be exploited to design an efficient clinical trial. The novel feature of this work is an enrichment trial which uses a modern joint model to make both interim decisions and perform hypothesis testing.
Motivating example
Fragments of ctDNA are detected in the blood of cancer patients and are routinely measured in many cancer clinical trials. These measurements, which we shall often refer to as ‘biomarker measurements’ or ‘longitudinal data’, are useful prognostic factors that can improve the precision of OS estimates. Throughout this article, we shall base our analyses on data from a study which compared different biomarkers and their accuracy in monitoring tumour burden among women with metastatic breast cancer.22 The study concluded that ctDNA was successfully detected and was highly correlated with OS.
Another important factor in breast cancer studies is the presence or absence of the HER2 protein. Patients who are HER2 positive may be resistant to conventional therapies, and treatments that specifically target the HER2 protein are very effective.24 Not only is OS influenced by HER2 status, but it is also expected that ctDNA measurements are similar across HER2 statuses upon entry to the trial and that HER2− patients’ ctDNA trajectories will increase more rapidly than those of HER2+ patients. Adaptive enrichment trials are therefore highly efficient in breast cancer settings because the eligibility criteria based on HER2 status can be updated during the trial, restricting entry to patients likely to benefit.
Joint modelling of ctDNA and OS in defined subgroups
Subgroup set-up and notation
For adaptive enrichment trials, a key assumption is that subgroup identification is known prior to commencement. For the metastatic breast cancer example of Section 2, let denote the HER2 negative subgroup and let denote the HER2 positive subgroup. Then, let denote the full population. Extensions to more subgroups can be made following the same logic. Further, we denote as the total number of analyses in the adaptive trial and for our metastatic breast cancer example, we shall use
The aim of a clinical trial is to assess how a new experimental treatment performs compared to an existing standard-of-care drug or placebo. We make statistical inferences based on a treatment effect which is defined at the design stage. For a trial with multiple subgroups, let be the treatment effect in subgroup . A mathematical consequence is that if the prevalence of in is given by , then
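As a small illustration of this prevalence-weighted relationship, the following sketch uses illustrative names; the paper's own symbols are not reproduced here.

```python
# Hypothetical illustration: the full-population treatment effect as a
# prevalence-weighted average of the subgroup effects.
def full_population_effect(theta_1, theta_2, prevalence_1):
    # lambda * theta_1 + (1 - lambda) * theta_2, with lambda the
    # prevalence of subgroup 1 in the full population
    return prevalence_1 * theta_1 + (1.0 - prevalence_1) * theta_2

effect = full_population_effect(-0.5, 0.0, 0.4)
```

Here an effect of −0.5 in the first subgroup with prevalence 0.4, together with no effect in the second, gives a full-population effect of −0.2.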
Throughout, it is assumed that is known and fixed, however methods are available that account for uncertainty and allow estimation of at each analysis.25 We aim to test the hypotheses
The joint model
The joint model that we consider is based on equation (2) of Tsiatis and Davidian26 (referred to as ‘TD’ for short). There are two processes in this model which represent the survival and longitudinal parts, and these processes are linked using random effects. The difference between our joint model and that of TD is that we have chosen to model the longitudinal data trajectory as linear in time whereas in TD, the parametric form for the biomarker is not specified. This appears appropriate for the example dataset of Section 2 as we have seen ctDNA display this property. The methods can easily be extended to incorporate more complex trajectories for the longitudinal data.
Let the times of the measurements of the longitudinal data for patient in subgroup be denoted by , then is the true value of the biomarker at time and is the observed value of the biomarker. Suppose that is a vector of patient-specific random effects and that is the measurement error. We make the assumptions that and and are independent for . For the survival endpoint, we shall assume a Cox proportional hazards model. Let be the indicator function that patient in subgroup receives the experimental treatment and let and be scalar coefficients. Then the hazard function for subgroup is denoted and the joint model takes the form
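As a concrete illustration, data of this form can be simulated under stated assumptions: a linear latent trajectory observed with independent normal noise, and a hazard depending on the current true biomarker value and the treatment arm. All parameter values and variable names below are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_patient(treat, h0=0.1, gamma=0.2, theta=-0.4, sigma=0.5):
    # Patient-specific random effects: intercept and slope of the trajectory
    b0, b1 = rng.normal(2.0, 0.5), rng.normal(0.3, 0.1)
    # Event time by inverse transform on the cumulative hazard (numerical grid)
    grid = np.linspace(0.0, 20.0, 2001)
    hazard = h0 * np.exp(gamma * (b0 + b1 * grid) + theta * treat)
    cum_hazard = np.concatenate([[0.0], np.cumsum(hazard[:-1] * np.diff(grid))])
    u = rng.exponential()
    idx = min(int(np.searchsorted(cum_hazard, u)), len(grid) - 1)
    event_time = grid[idx]
    # Scheduled visits up to the event, with measurement error on the biomarker
    visit_times = np.arange(0.0, event_time, 1.0)
    biomarker = b0 + b1 * visit_times + rng.normal(0.0, sigma, visit_times.size)
    return event_time, visit_times, biomarker

event_time, visit_times, biomarker = simulate_patient(treat=1)
```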
Equation (3) defines the joint model and serves as the working model from which we shall perform the simulation studies in Section 6. Parameter estimates in the joint model can then be found by fitting both the longitudinal and survival outcomes to the joint model simultaneously; we shall describe this process in Section 3.3.
We note here that there is no treatment effect included in the biomarker trajectory. The motivation for this follows the models presented in the literature by TD. For a more general model including a treatment effect in the longitudinal data, we refer the reader to Section A of the Supplemental Material, where we discuss the use of the restricted mean survival time (RMST) endpoint, which can account for multiple treatment effect parameters. The RMST methodology requires additional modelling assumptions and performs poorly under model misspecification, and for this reason we do not consider it further. Another method which can account for a treatment effect in the longitudinal data is the p-value combination approach,19 where treatment selection is based solely on longitudinal data and confirmatory decisions assess survival outcomes. In Section A of the Supplemental Material, we compare the joint modelling method and the p-value combination approach. The joint modelling method makes full use of all the information at each analysis, whereas the p-value combination method neglects useful information at each stage: it ignores available survival outcomes at the interim and ignores biomarker observations at the final analysis.
Conditional score
To perform the adaptive enrichment trial, we must find treatment effect estimates and their distributions at analyses and subgroups. To do so, we shall use a modified version of the conditional score method of TD, which is a method for fitting the joint model to the data. We present multi-stage adaptations of some functions presented in TD. Let be the observed event time and let be the observed censoring indicator for patient in subgroup at analysis . This includes administrative censoring of patients who remain in the study but have not yet experienced the event at analysis . We denote the maximum follow-up time at analysis by . To be included in the at-risk set at time , a patient must have at least two longitudinal observations, so that the regression model can be fitted. At analysis , we define the at-risk process, , counting process, and function for the joint model.
The conditional score methodology is motivated by the work of Stefanski and Carroll27 who find efficient score functions for nonlinear models by conditioning on sufficient statistics. The authors first present a functional likelihood for a given statistical model which is shown to reduce to the ratio of measurement-error variance to equation-error variance. In turn, the sufficient statistic is often a function of the variance of the nuisance parameters which are being conditioned out, in our case, the random effects of the longitudinal data model. For patient in subgroup , let be the ordinary least squares estimate of based on the set of measurements taken at times . That is, let be the mean biomarker observation and let be the mean measurement time. Then the OLS estimate is given by where
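The per-patient ordinary least squares fit described above can be sketched as follows (names illustrative):

```python
import numpy as np

def ols_trajectory(times, values):
    # Least squares intercept and slope for one patient's biomarker series
    t, y = np.asarray(times, float), np.asarray(values, float)
    t_bar, y_bar = t.mean(), y.mean()
    slope = ((t - t_bar) @ (y - y_bar)) / ((t - t_bar) @ (t - t_bar))
    return y_bar - slope * t_bar, slope

intercept, slope = ols_trajectory([0, 1, 2, 3], [2.0, 2.5, 3.0, 3.5])
```

For these exactly linear measurements the fit recovers intercept 2.0 and slope 0.5.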
Suppose that is the variance of . TD define the sufficient statistic to be the function
which is defined for all for patient in subgroup . The multi-stage version of the scalar of TD, dependent on subgroup , is given by
and the multi-stage version of the quotient function in equation (6) by TD, dependent on subgroup , is the vector given by
Then, the conditional score function at analysis for subgroup , also a vector of dimension , is given by
Estimates for the treatment effects and their distributions
The aim is now to find treatment effect estimates for and analyses. We define these estimates as the roots of the conditional score. These estimates turn out to be asymptotically normally distributed, and we derive their variance below.
Burdon et al.28 showed that for each and . Therefore, the conditional score function at analysis and subgroup is an estimating function, and setting it equal to zero defines an estimating equation. Hence, asymptotically normal parameter estimates for and can be found as the root of the estimating equation. As in TD equation (13), define the pooled estimate where is the residual sum of squares for the least squares fit to all observations for patient in subgroup available at analysis . Then, let be the values of and , respectively, such that
We also need to know the distribution of these estimates and this requires knowledge of the variance of . We shall use the sandwich estimator, as in Section 2.6 by Wakefield,29 to calculate a robust estimate for the variance of the parameter estimates. Firstly, define matrices
Burdon et al.28 presented analytical forms for each of these matrices, including a detailed calculation for the derivative matrix. In practice, this matrix can be calculated numerically, and is found by considering the conditional score as a sum over patients. Further, these matrices are estimated by substituting the estimates and for and , respectively. Then the information for the treatment effect estimate is given by the following equation:
for and The subscript represents that we are interested in the second parameter in the vector
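A generic sketch of the sandwich construction, with the ‘bread’ matrix standing in for the derivative of the score and the ‘meat’ built from per-patient score contributions (all inputs illustrative):

```python
import numpy as np

def sandwich_variance(bread, per_patient_scores):
    # Robust variance A^{-1} B (A^{-1})^T, with B the sum of the
    # outer products of the per-patient score contributions
    meat = sum(np.outer(s, s) for s in per_patient_scores)
    bread_inv = np.linalg.inv(bread)
    return bread_inv @ meat @ bread_inv.T

A = np.array([[4.0, 0.0], [0.0, 2.0]])
scores = [np.array([1.0, 0.5]), np.array([-1.0, -0.5])]
V = sandwich_variance(A, scores)
```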
In accordance with equation (1), the treatment effect estimate and corresponding information level in the full population at analysis are given by the following equation:
Finally, the standardised -statistic is given by the following equation:
For simplicity in notation and exposition, we now return to the example of Section 2 in which In order for subsequent results to hold, we require to have the ‘canonical joint distribution’ (CJD) given in Section 3.1 of Jennison and Turnbull5 for each The CJD of the standardised statistics across analyses is such that
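For reference, the CJD can be written out as follows, in our own notation with $Z_k$ the standardised statistic, $\mathcal{I}_k$ the information at analysis $k$ and $\theta$ the treatment effect:

```latex
Z_k \sim N\!\left(\theta\sqrt{\mathcal{I}_k},\, 1\right), \qquad
\operatorname{Cov}\!\left(Z_{k_1}, Z_{k_2}\right) = \sqrt{\mathcal{I}_{k_1}/\mathcal{I}_{k_2}}
\quad \text{for } 1 \le k_1 \le k_2 \le K.
```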
Burdon et al.28 showed that the -statistics calculated using the conditional score methodology have approximately, but not exactly, the CJD. The authors show that proceeding with a group sequential test under the assumption that the CJD holds is sensible, since type 1 error rates are conservative and diverge minimally from the planned significance level. We give simulation evidence that this is also true for an adaptive enrichment trial in Section 6.
The proposed methods make certain assumptions that are needed to validate the CJD in equation (5). In Section C of the Supplemental Material, sensitivity analyses are performed where some of these assumptions are verified. In particular, we find that the conditional score estimator is robust to the assumption that residual errors in the longitudinal data are independent and asymptotic properties hold under small sample sizes. The results of the sensitivity analyses suggest that a minimum of 20 events per subgroup are required at the interim analysis to ensure control of type 1 error rates.
Adaptive enrichment schemes for clinical trials with subgroup selection
The threshold selection rule
An adaptive enrichment scheme consists of two decisions: first, a decision on which subgroup, if any, to continue the trial with at the interim analysis; and second, a decision on whether or not to reject the null hypothesis at the final analysis. There is a collection of rules which can be used for subgroup selection, for example, the maximum test statistic12 and a Bayes optimal rule.4
Similarly to Magnusson and Turnbull,23 we shall use the threshold selection rule, defined as follows: for some constant , select all subgroups such that (Figure 1). If and , then the trial continues in the full population. It should be noted that this is a stronger condition than , since in the latter case overwhelming benefit in one subgroup combined with a poor effect in the other could still lead to selection of the full population. Finally, if and , then the trial stops at the interim analysis, declaring the treatment inefficacious in all subgroups. This ensures that only subgroups which have a large enough treatment effect are followed to the second analysis. The threshold selection rule leads to an efficient enrichment trial design because we can find analytical forms for the type 1 and type 2 error rates and are, therefore, able to maximise power. As well as providing the generic design framework for any test statistic, a novel aspect of this work is the application of this rule in the joint modelling setting.
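A minimal sketch of the rule as a decision function; the subgroup labels, the constant c and the inputs are illustrative:

```python
def threshold_select(z1, z2, c):
    # Select every subgroup whose interim statistic exceeds c; both above c
    # means continuing in the full population, neither means stopping.
    if z1 > c and z2 > c:
        return "full"
    if z1 > c:
        return "subgroup 1"
    if z2 > c:
        return "subgroup 2"
    return "stop for futility"

decision = threshold_select(1.2, 0.3, 1.0)
```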
Flowchart for enrichment trial design which uses the threshold rule for subgroup selection at the interim analysis. Hypothesis testing is based on an error spending design with -spending for the efficacy boundary and -spending for the futility boundary including the opportunity for early stopping. The flowchart describes when the interim analysis should be performed based on the pre-planned number of events in subgroup at the interim and the total number of observed events in the selected subgroup at the final analysis.
To begin, we describe the probability distribution of the population index. At the interim analysis, let be the random variable which represents the decision about which subgroup has been selected. Let be the realisation of ; this can take any value in the set . The notation indicates that it is possible to stop the trial for futility at the interim analysis without selecting a subgroup. Given the threshold selection rule and a configuration of parameters , we have
In order for the proposed methods to apply and to ensure control of type 1 error rates, must be specified in advance of the trial. To choose such a value, the desired operating characteristics are considered. First, we define the configuration of parameters under the global null as and the alternative as . This represents that we believe there is an important effect of treatment in . For the metastatic breast cancer example in Section 2, this reflects that the HER2 negative subgroup is expected to respond well to the treatment. Equation (6) can then be solved for and . Since there are two unknowns, only two equations need be considered and we focus attention on those representing enrichment of the biomarker positive subgroup and continuing in the full population since these are the two most desirable outcomes in this order. As an example, with and , we therefore need and .
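Under the working assumption that the two interim statistics are independent and approximately normal with unit variance, the four selection probabilities can be sketched as below; the threshold and the means are illustrative inputs:

```python
import math

def Phi(x):
    # Standard normal cumulative distribution function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def selection_probs(c, mu1, mu2):
    p1 = (1 - Phi(c - mu1)) * Phi(c - mu2)            # enrich subgroup 1 only
    p2 = Phi(c - mu1) * (1 - Phi(c - mu2))            # enrich subgroup 2 only
    p_full = (1 - Phi(c - mu1)) * (1 - Phi(c - mu2))  # continue in full population
    p_stop = Phi(c - mu1) * Phi(c - mu2)              # stop for futility
    return p1, p2, p_full, p_stop

probs = selection_probs(0.0, 1.5, 0.0)
```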
Sensitivity analyses for different threshold selection rules are included in Section B of the Supplemental Material. The choice of is influential in the sample size calculation and should be at least 0.5 to ensure that asymptotic assumptions for the conditional score estimator are valid. The choice of has an effect on the number of required events at the final analysis.
We now present the joint distribution of the subgroup selection decision and the selected test statistic which will be needed for calculation of type 1 and type 2 error rates. Let be the conditional distribution of the test statistic given that has been selected. Then the joint probability density function is
We note that the random variable is not currently defined since if no subgroup is selected we cannot calculate a subgroup standardised statistic. However, it will be seen that the joint probability density function is independent of and this joint probability function still has meaning. By equation (5), the test statistics are such that for and and are independent. The conditional distribution is given by a truncated normal distribution bounded below by . Hence, we have
where and denote the standard normal probability density and cumulative distribution functions, respectively. We derive in Section B of the Supplemental Material.
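A sketch of the resulting density for the case where a single subgroup is selected, writing mu1 and mu2 for the means of the two interim statistics (pure-Python normal pdf and cdf; all names and inputs illustrative):

```python
import math

def phi(x):   # standard normal probability density function
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):   # standard normal cumulative distribution function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def density_select_1(z, c, mu1, mu2):
    # Normal density truncated below at the threshold c, weighted by the
    # probability that the other subgroup's statistic falls below c.
    if z <= c:
        return 0.0
    return phi(z - mu1) * Phi(c - mu2)

d = density_select_1(2.0, 1.0, 0.5, 0.0)
```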
The methods presented are unconventional in that we allow enrichment of the biomarker-negative subgroup. We have chosen this structure to allow maximum flexibility and a novel solution for the enrichment trial in which the investigator genuinely believes there is no hierarchy among subgroups. The proposed design can also be modified to adhere to conventional standards by making small adjustments. For example, the definition of the threshold selection rule becomes: select if and , select if and , otherwise stop the trial at the interim analysis if . The population index can now take values in the set . Then, the conditional distributions remain unchanged for and all following equations hold under this new definition.
Calculation of type 1 error and power
We now consider the possible pathways of the enrichment trial. Given the definition of the -statistics, the threshold selection rule and the joint probability density function, we are equipped to determine error rates for the study. We shall apply this method in Section 3.3 in order to create an enrichment trial using the joint model for longitudinal and TTE data. The familywise error rate (FWER), denoted by , is defined as the probability of rejecting one or more true null hypotheses, and power is denoted by .
The testing procedure for this adaptive enrichment trial is described in Figure 1. At analysis , let be an interval that splits the real line into three sections. We stop for futility if the test statistic of the selected subgroup, is below , reject the corresponding null hypothesis and stop for efficacy if is above and otherwise continue to analysis . Let be the global null hypothesis, . There are many pathways which lead to rejecting . Examples include select and reject at the interim or select then reject at the final analysis. Considering all options, we have
Here, we have specified that we will only test the hypothesis corresponding to the selected subgroup, since it has the highest chance of being significant. For alternative configurations testing all hypotheses, fixed sequence testing30 or other alpha propagation methods31 can be applied.
As is common in the literature,12,18,19 we define power as the conditional probability of rejecting given that subgroup is selected. Here, can be arbitrarily interchanged for or . This reflects the belief that a ‘successful’ trial is one where the subgroup which benefits is selected and also reports a positive trial outcome. Following the same arguments as for type 1 error, type 2 error rates are calculated as
It is now clear that the boundary points and can be calculated to satisfy pre-specified requirements of FWER , under and power under . Further, to ensure that we have four equalities for the four boundary points, we make additional requirements that is the type 1 error ‘spent’ and is the type 2 error spent at analysis where and Then solve
The decomposition of the error rates also ensures that the boundary points and can be calculated at the first analysis before observing the information levels at the second analysis. Hence, there may be the opportunity to stop the trial early without needing to calculate the information levels at the second analysis. This is particularly helpful in trials which use TTE endpoints as information levels are estimated using the data.
There are many options for the break-down of the error rates. For the models considered, we shall use an error spending design.32 In the group sequential setting (without subgroup selection), the error spending test requires specifying the maximum information and then error is spent according to the proportion of information observed at analysis . For the enrichment trial, we propose a similar structure considering to be the maximum information in the full population. Specifically, we shall use the functions and to determine the amount of error to spend. Then we set
We shall discuss the calculation of in the TTE (or joint modelling) setting in Section 4.4.
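As an illustration, a simple linear spending rule of the kind used in error spending designs can be sketched as follows; the paper's exact spending functions are not reproduced here, so treat this choice as an assumption:

```python
def spend(alpha, info, info_max):
    # Spend error in proportion to the observed fraction of maximum
    # information, capped at the total (a linear spending function).
    return alpha * min(info / info_max, 1.0)

pi_1 = spend(0.025, 40.0, 100.0)   # error spent at the interim analysis
pi_2 = 0.025 - pi_1                # remainder spent at the final analysis
```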
By construction, under , we have FWER exactly by equations (7) and (8). Hence, the FWER is protected in the weak sense. To prove that we also have strong control of the FWER, we impose the condition that the treatment effect in the full population, is non-negative. This ensures that the subgroup selected does not differ under scenarios and which is needed for the proof. The condition is not restrictive, since treatment effects other than are allowed to be negative and can equal zero.
For global null hypothesis and any such that is non-negative, we have
Proof. See Section B of the Supplemental Material.
In Section 6, we also show by simulation that the FWER is protected at significance level and is not conservative.
Trials with unpredictable information increments: Events based analyses
To complete the calculation of the boundary points and in equations (7) and (8), it remains to find the information level at analysis for the subgroups that have ceased to be observed. That is, suppose that is the subgroup that has been selected and the trial continues to analysis ; then is observed. However, we also need to know for all such that . Many enrichment trial designs focus on the simple example where the outcome measure is normally distributed with known variance. Hence, if the number of patients to be recruited is pre-specified, then information levels can be calculated in advance of the trial and this problem does not occur. However, in trials where the primary endpoint is a TTE variable, information is estimated using the data. We find that we can accurately predict the information levels at future analyses when we know the number of observed events. Hence, to mitigate the problem of not knowing , we shall pre-specify the number of observed events.
For subgroup , let be the number of events observed in subgroup by analysis . We plan that if no early stopping occurs, then the total number of observed events in the selected subgroup is the same regardless of which subgroup has been selected so that . Figure 1 identifies when the analyses are performed. Note that these values are set as design options and so will be known before commencement of the trial. We shall discuss how to choose these values in Section 4.4.
Further, we relate the number of events and information so that we can predict the information level at the second analysis for the unobserved subgroups. Freedman33 proves that, in the context of survival analysis, the variance of the log-rank statistic under is such that . For analysis methods using test statistics other than the log-rank, we shall extend this idea and assume that , where is a constant. Figure 2 shows evidence that the assumed relationship between number of events and information holds.
Calculation of constants and . The result shows that information is proportional to the number of events.
For now, we need only the assumption of the structural form of this relationship. At the interim analysis, each is observed for Hence, we can use the proportionality relationship to predict the information at the second analysis for the subgroup which is no longer observed. For , we can predict using
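A sketch of this projection under the assumed proportionality between information and event count (names illustrative):

```python
def predict_information(info_interim, events_interim, events_final):
    # Estimate the constant of proportionality from interim quantities,
    # then project the information level at the planned number of events.
    c = info_interim / events_interim
    return c * events_final

predicted = predict_information(20.0, 50, 125)
```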
Trial design – number of events
We have so far presented the calculation of the boundary points for a trial where the number of events at the interim and final analyses are known prior to commencement. We now discuss the design of the trial, in particular, determining the constants and information levels for and maximum information level . These in turn mean that the required numbers of events for and can be planned. The driving design feature is that we will plan the trial to have power under the parameterisation . We now describe a simulation scheme to determine the constants for
1. Under the parameterisation , simulate a data set of patients.
2. Let be the event times in subgroup .
3. For each : right-censor all patients at time , and calculate based on the data up to time .
4. Fit a linear model, without an intercept term, to the points .
5. Use this linear model to estimate the value of .
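The last two steps of the scheme amount to a least squares fit through the origin; a sketch with placeholder (events, information) pairs:

```python
import numpy as np

# Placeholder pairs of simulated event counts and information levels
events = np.array([10.0, 25.0, 50.0, 80.0])
info = np.array([4.1, 9.8, 20.3, 31.9])

# Least squares without an intercept term: c = sum(d * I) / sum(d^2)
c_hat = (events @ info) / (events @ events)
```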
Figure 2 gives a graphical representation of this scheme. It is now possible to calculate the required number of events at the first interim analysis. In the example in Section 4.1, we require which equates to events in subgroup . Further, we find that and which equates to and and this can be seen in Figure 2. The design of the trial does not require us to plan and , but this provides us with estimates of the number of events that will be observed at the first analysis. We can also determine the timing of the final analysis at . Consider the sequence of information levels given by the following equation:
for The value of is calculated such that boundary points satisfy when the information levels replace in equations (7) and (8) for and This is done using an iterative search method. Then, returning to the definition of , the total number of events can be found by solving for . In Section 6, we present the sample sizes which have been calculated for a range of parameter choices.
Alternative models and their analysis methods
Cox proportional hazards model
Methods which leverage information from biomarkers in TTE data in enrichment trials are yet to be established. The current best practice for adaptive designs with a TTE endpoint is to base analyses on Cox proportional hazards models. We follow this convention in order to assess the gain from including the longitudinal data in the analysis. To do so, we shall present a simple Cox proportional hazards model and define treatment effect estimates that can be used in accordance with the threshold selection rule to perform an enrichment trial.
Denote as the baseline hazard function, the treatment parameter and as the treatment indicator that patient in subgroup receives the new treatment. Then the hazard function for the survival model is given by
We note the similarities and differences between this model and the joint model of Section 3.2. In the results that follow in Section 6.3, we shall assume that the joint model is true (and simulate data from the joint model). However, we fit the Cox proportional hazards model to the data, highlighting that this is a misspecified model.
When analysing data using this model, the null hypothesis in equation (2) can be tested at analysis by calculating treatment effect estimates , information levels and -statistics for . As described in Section D of the Supplemental Material, is given as the root of the equation in which the partial score statistic is set equal to zero,34 and the information is given by the negative of the first derivative of the partial score statistic. Jennison and Turnbull34 proved that the resulting -statistics have the CJD given in equation (5), and so the methodology of Section 4 can be used to create an enrichment trial design.
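A minimal sketch of this fitting procedure: a hand-rolled Newton-Raphson iteration on the partial score for a single binary treatment covariate, with illustrative data and no tie handling; not the paper's implementation.

```python
import numpy as np

def cox_fit(time, event, treat, iters=25):
    # Sort by event time; at each event, compare the treated indicator with
    # its risk-set average to build the partial score and information.
    order = np.argsort(time)
    t, d, x = time[order], event[order], treat[order].astype(float)
    beta = 0.0
    for _ in range(iters):
        score, information = 0.0, 0.0
        for i in range(len(t)):
            if d[i]:
                at_risk = x[i:]
                w = np.exp(beta * at_risk)
                x_bar = (w * at_risk).sum() / w.sum()
                score += x[i] - x_bar
                information += (w * at_risk ** 2).sum() / w.sum() - x_bar ** 2
        beta += score / information   # Newton step on the concave likelihood
    return beta, information

time = np.array([2.0, 3.0, 5.0, 7.0, 8.0, 9.0])
event = np.array([1, 1, 0, 1, 1, 1])
treat = np.array([1, 0, 1, 0, 1, 0])
beta_hat, info_hat = cox_fit(time, event, treat)
```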
Cox proportional hazards model with longitudinal data as a time-varying covariate
A final option for analysis is one where the longitudinal data is included but is assumed to be free of measurement error. This requires a more sophisticated model than the simple Cox proportional hazards model of Section 5.1 and represents a trial where the longitudinal data is regarded as important enough to be included in the model. However, this is still a naive approach since the model will be misspecified in the presence of measurement error. For the purpose of assessing the necessity of correctly modelling the data, we shall fit a Cox proportional hazards model where the longitudinal data is treated as a time-varying covariate.
In what follows, the definitions of the treatment indicator and longitudinal data measurements remain the same as in Section 3.2. Let and be longitudinal data and treatment parameters, respectively, then the hazard function is given by
This model differs from the joint model because the assumption here is that is a function of time that is measured without error. In reality, we often have measurements for patient in subgroup that include noise around a true underlying trajectory.
In a similar manner to Section 5.1, the hypothesis in equation (2) can be tested by finding -statistics, with the CJD of equation (5)34 and following the enrichment trial design of Section 4.
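As a sketch of the data restructuring this model requires (illustrative, not the paper's code): each patient's record is expanded into (start, stop] intervals on which the biomarker is held at its last observed value, with the event indicator attached only to the final interval. This is the counting-process layout that standard time-varying Cox software expects:

```python
def to_counting_process(meas_times, meas_values, event_time, event):
    """Expand one patient's longitudinal record into (start, stop, value,
    status) rows for a Cox model with the biomarker as a time-varying
    covariate, carrying the last observation forward between visits."""
    rows = []
    for k, (t, v) in enumerate(zip(meas_times, meas_values)):
        if t >= event_time:
            break                          # measurements after the event are unused
        stop = meas_times[k + 1] if k + 1 < len(meas_times) else event_time
        stop = min(stop, event_time)
        rows.append((t, stop, v, 0))
    if rows:
        start, stop, v, _ = rows[-1]
        rows[-1] = (start, stop, v, event)  # event flag on the final interval
    return rows
```

Note that this construction treats the observed values as the true trajectory, which is exactly the measurement-error naivety discussed above.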
Results
Simulation set-up
In what follows, we perform simulation studies to assess the type 1 error rates and observed power for the three analysis methods of Sections 3 and 5. These methods shall hereafter be referred to as ‘Conditional score’, ‘Cox’ and ‘Cox with biomarker’, respectively. The purpose of this comparison is to assess the gain from including the longitudinal data and to decide whether correctly modelling the measurement error is necessary.
For the presented analyses, we shall assume that the joint model is true. Hence, the working model for data generation is given by equation (3). Each of the analysis methods has the advantage that we need not specify the baseline hazard function, since each method is semi-parametric and requires no assumptions regarding its form. Even when the method includes the longitudinal data, there are no distributional assumptions about the random effects, ensuring the approach is robust to some model misspecifications. For the purpose of simulation, however, we now describe the distributions used for data generation. We shall simulate data with baseline hazard function given by the following equation:
We have chosen to simulate from a model where the baseline hazard function is piece-wise constant with a single knot-point at time for simplicity. This is motivated by the metastatic breast cancer data, where we see a sharp difference in the baseline hazard at one year. It is straightforward to extend this to a general piece-wise constant baseline hazard function with multiple knot-points. We consider a random effects model where are independent and identically distributed with the following distribution:
The parameter values for simulation studies are informed using the metastatic breast cancer dataset.22 We removed patients whose ER status is negative, and ctDNA measurements recorded as ‘not detected’ were set to 1.5 copies/mL.35 The dataset contains multiple treatment arms and dosing schedules; hence, we use this dataset to represent standard of care (control group). The parameter values, which have been suitably rounded, shall remain fixed throughout the simulation studies and are given by the following equation:
We shall perform simulation studies for a range of and values. The interpretation of these parameters is as follows. describes the association between the biomarker and TTE outcomes. Higher values of lead to higher correlation between the two endpoints. The parameter controls the noise in the measurement error of the longitudinal data. Finally, represents the variance of the slopes of the random effects terms and therefore the degree of similarity between patients’ longitudinal trajectories.
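Sampling event times from the piece-wise constant baseline hazard described above is a one-line inverse-transform calculation: draw a unit-exponential variate and invert the cumulative hazard, which is linear on each side of the knot. A minimal sketch under these assumptions (parameter names are ours, not the paper's notation):

```python
import math
import random

def sim_event_time(h1, h2, knot, linpred, rng):
    """Inverse-transform sample from a survival model whose baseline
    hazard is piece-wise constant (h1 on [0, knot), h2 afterwards),
    multiplied by exp(linpred) for the covariate effects."""
    target = -math.log(1.0 - rng.random())   # unit-exponential draw
    m = math.exp(linpred)
    if target < h1 * m * knot:               # event occurs before the knot
        return target / (h1 * m)
    return knot + (target - h1 * m * knot) / (h2 * m)
```

With h1 equal to h2 the sampler reduces to an ordinary exponential distribution, which gives an easy sanity check; extending to multiple knot-points amounts to inverting a piece-wise linear cumulative hazard segment by segment.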
For our simulations, patients are recruited at a rate of 2 per week, so that enrolment is slow and adaptive methods are appropriate. The recruitment ratio of control to experimental treatment is fixed as 1:1 for all subgroups and all simulation studies. ctDNA observations will be collected, via a blood test, every 2 weeks for the first 3 months following entry to the study and then once per month. The final ingredient required for data generation is the mechanism for simulating censoring times, . We shall simulate these according to an exponential distribution with rate parameter (years), independently of the TTE outcome to reflect non-informative censoring. This results in roughly of patients being lost to follow-up.
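Overlaying the censoring mechanism is then a matter of taking the minimum of each event time and an independent exponential censoring draw, recording an event indicator alongside. A sketch (function name ours):

```python
import random

def apply_censoring(event_times, censor_rate, rng):
    """Overlay independent exponential censoring on simulated event times.
    Returns (observed_time, event_indicator) pairs; because the censoring
    draws are independent of the event times, censoring is non-informative
    by construction."""
    observed = []
    for t in event_times:
        c = rng.expovariate(censor_rate)
        observed.append((min(t, c), int(t <= c)))
    return observed
```

For exponential event times with rate λ and censoring rate μ, the expected proportion censored is μ/(λ+μ), which is how a target loss-to-follow-up fraction can be dialled in.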
To complete the set-up, we now present the sample sizes used for each simulation study; these values have been calculated by employing the methods of Section 4.4. The trial is planned with FWER and planned power . The number of events at the first analysis in subgroup , denoted , has been chosen to ensure that subgroup is selected roughly of the time, and the total number of events at the second analysis, , has been chosen to attain power of as described in Section 4.4. In all cases, the value of is large enough that the survival data is mature at the interim analysis and decisions can be made with confidence. These numbers of events are displayed in Table 1 for a range of values of and . As increases, we see that the required and increase. Similarly, the required number of events increases with . That is, more events, and hence more information, are needed to achieve the target power and selection probabilities when the longitudinal data is noisy. When and with a small number of events at the first interim analysis, it is not always possible to find a root to equation (4). Consequently, the required and are high to ensure that large sample properties of the estimator hold. We have not seen this problem occur for . The values of and appear to be insensitive to changes in .
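For orientation only, the familiar fixed-sample Schoenfeld approximation gives event numbers of the same order as Table 1; the paper's values come from the error-spending group-sequential design of Section 4.4, which needs somewhat more events than this benchmark. A hedged sketch:

```python
import math
from statistics import NormalDist

def schoenfeld_events(hazard_ratio, alpha=0.025, power=0.9):
    """Fixed-sample Schoenfeld approximation to the required number of
    events for a logrank comparison with 1:1 allocation and one-sided
    significance level alpha. A benchmark only: group-sequential
    boundaries such as the error-spending design in the text require
    more events than this fixed-sample figure."""
    z = NormalDist().inv_cdf
    return 4 * (z(1 - alpha) + z(power)) ** 2 / math.log(hazard_ratio) ** 2
```

For example, a hazard ratio of 0.6 at one-sided alpha 0.025 and power 0.9 gives roughly 161 events, below the 174 to 301 final-analysis events in Table 1, consistent with the extra price of interim monitoring and subgroup selection.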
Sample size calculations for the adaptive enrichment trial. is the required number of events in subgroup at the interim analysis and is the total number of events in the selected subgroup at the final analysis. Numbers of events are calculated to satisfy familywise error rate (FWER) 0.025 and power 0.9.
Events at interim analysis    Total events at final analysis
40                            174
47                            204
47                            206
50                            218
45                            194
47                            206
58                            252
69                            301
46                            198
44                            194
47                            206
47                            203
Type 1 error rate comparison
The first important comparison will be the type 1 error rate using each of the analysis methods: ‘Conditional score’, ‘Cox’ and ‘Cox with biomarker’.
To represent no differences between control and treated groups under , let for each . Figure 3 shows the results of a simulation study assessing the FWER for each method and different parameter values. For each simulation, a dataset of patients is generated from the joint model, then subgroup selection and decisions about are performed after and events have been observed according to Table 1. All three methods are applied to the same dataset and after the same number of events, so that differences can be attributed to the analysis methodology and not trial design features.
Type 1 error rates displaying changes in parameters and . All other parameters are as in (13). Numeric values of the points are presented in Section C of the Supplemental Material. For a study with simulations and family wise error rate (FWER) 0.025, simulation standard error is 0.00156.
It is clear that for the majority of cases, the FWER is controlled when the conditional score method is used to estimate the treatment effect in the joint model. For a study with simulations and planned significance value , the simulation error bound is . Hence, the observed FWER is within a reasonable distance of , given the number of simulations. The result of Theorem 1, together with the simulation results in Figure 3, gives us confidence that the FWER is controlled at the desired significance level using the joint modelling approach. The Cox model also appears to control the FWER but may be conservative for large values of . However, we see that the Cox with biomarker method has FWER considerably smaller than 0.025. This is particularly apparent for and all values of .
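The quoted simulation standard errors follow from the usual binomial formula; both the FWER figure of 0.00156 above and the power figure of 0.003 in the next subsection are reproduced by assuming 10,000 replicates per scenario (our inference from the quoted values, not a number stated in this excerpt):

```python
import math

def sim_se(p, n_sims):
    """Binomial standard error of a proportion estimated from n_sims
    independent simulation replicates: sqrt(p * (1 - p) / n_sims)."""
    return math.sqrt(p * (1 - p) / n_sims)
```

An observed error rate more than two or three of these standard errors from the nominal level would therefore signal a genuine departure rather than Monte Carlo noise.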
Efficiency comparison
We shall focus on power as a measure of efficiency between the different methods, and we compare some other outcome measures, such as number of hospital visits and expected stopping time, in Section C of the Supplemental Material. Under the alternative, only patients in subgroup will respond to treatment, represented by and . Figure 4 shows the power comparison between the different methods. Power is calculated as the proportion of simulations which reject out of those where subgroup is selected, as described in Section 4.2.
Observed power displaying changes in parameters and . All other parameters are as in (13). Numeric values of the points are presented in Section C of the Supplemental Material. For a study with simulations and power 0.9, simulation standard error is 0.003.
It is clear that the conditional score method is the most efficient, since power is highest across nearly all parameter combinations. When , the conditional score method may suffer a small loss in power in comparison to the other methods. This is the case where the longitudinal data has no impact on the survival outcome, so including it in the analysis is futile. For , however, a gain in power of up to 0.46 is seen.
Fitting the simple Cox model is very inefficient and, in the extreme cases, power is below . The sample size that would be needed to increase power to 0.9 in such a scenario is excessive. This simple method has power lower than the conditional score method whenever and becomes increasingly inefficient as increases and as increases. The efficiency of this method appears to increase slightly with . Hence, it is important to include the longitudinal data in the analysis when there is a suspected correlation between the longitudinal data and the survival endpoint.
The final method, where TTE outcomes are fitted to a Cox proportional hazards model with the longitudinal data as a time-varying covariate, appears to be a simple yet effective way of including longitudinal data in the analysis. The achieved power is at least 0.78 but is usually lower than for the conditional score method. However, the scenarios where this method outperforms the conditional score are when or , indicating that the longitudinal data is free of measurement error or that there are no between-patient differences in the slopes of the longitudinal trajectories. The efficiency decreases as the longitudinal data becomes noisier or as patient differences become larger, that is, as and increase.
An advantage of the two alternative Cox models is that there is no requirement for a patient to have a minimum of two longitudinal observations in order to be included in the at-risk process. In fact, for these alternative models, we need not specify the functional form of the trajectory of the longitudinal data, for example that it is linear in time. Taking these considerations into account, we believe that the most efficient and practical method is the conditional score, which includes the longitudinal data and accounts for the measurement error.
Discussion
We have shown that the threshold selection rule can be combined with an error spending boundary to create an efficient enrichment trial. This is potentially suitable for any trial where the primary outcome is a TTE variable, and we present a method to establish the required number of events at the design stage of the trial. A novel aspect of this work is that these methods can be applied to an endpoint which is the treatment effect in a joint model for longitudinal and TTE data. We have implemented the conditional score methodology to estimate the treatment effect and show that the estimator is robust to model assumptions provided that at least 20 events per treatment arm are observed at the interim analysis.
By including these routinely collected biomarker outcomes in the analysis, the enrichment trial has higher power than the corresponding trial where the longitudinal data is left out of the analysis. Bauer et al.36 showed that bias is prevalent in designs with selection. In our case, selection bias occurs because the treatment effect estimate in the selected subgroup is inflated in later analyses, which could affect the trial results. However, unlike most other selection schemes, the threshold selection rule adjusts for the magnitude of the treatment effect at the design stage, so another advantage is that selection bias is incorporated into the decision-making process.
We assessed the p-value combination approach, in which biomarker data are used for subgroup selection and survival outcomes alone for hypothesis testing, as an alternative option for implementing enrichment designs, but we found the joint modelling approach to perform best due to more efficient use of the available data. Further, we compared the joint modelling approach with a model which used the longitudinal data but naively assumed this was free of measurement error. Again, the joint model performed more effectively in most cases. This naive approach was more efficient when the longitudinal data was truly free from measurement error, there was no correlation between the two endpoints or there was no heterogeneity between patients’ biomarker trajectories. However, we believe that these situations are rare in practice and the gain in power from joint modelling outweighs this downside.
Supplemental Material
Supplemental material (sj-pdf-1-smm-10.1177_09622802241287711) for “Adaptive enrichment trial designs using joint modelling of longitudinal and time-to-event data” by Abigail J Burdon, Richard D Baird and Thomas Jaki, Statistical Methods in Medical Research.
Footnotes
Data availability statement
All data are simulated according to the specifications described.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement no 965397. TJ also received funding from the UK Medical Research Council (MC_UU_00002/14, MC_UU_00040/03). RB also acknowledges funding from Cancer Research UK and support for his early phase clinical trial work from the Cambridge NIHR Biomedical Research Centre (BRC-1215-20014) and Experimental Cancer Medicine Centre. For the purpose of open access, the author has applied a Creative Commons Attribution (CC BY) licence to any author accepted manuscript version arising.
ORCID iDs
Abigail J Burdon
Thomas Jaki
Supplemental material
Supplemental materials for this article are available online.
Software, in the form of R code, is available at .
References
1. Burnett T, Mozgunov P, Pallmann P, et al. Adding flexibility to clinical trial designs: an example-based guide to the practical use of adaptive designs. BMC Med 2020; 18: 1–21.
2. Pallmann P, Bedding AW, Choodari-Oskooei B, et al. Adaptive designs in clinical trials: why use them, and how to run and report them. BMC Med 2018; 16: 1–15.
3. Simon N, Simon R. Adaptive enrichment designs for clinical trials. Biostatistics 2013; 14: 613–625.
4. Burnett T, Jennison C. Adaptive enrichment trials: what are the benefits? Stat Med 2021; 40: 690–711.
5. Jennison C, Turnbull BW. Group sequential methods with applications to clinical trials. London: Chapman and Hall/CRC, 2000.
6. Wang S-J, Hung HMJ, O’Neill RT. Adaptive patient enrichment designs in therapeutic trials. Biom J: J Math Methods Biosci 2009; 51: 358–374.
7. Friede T, Parsons N, Stallard N. A conditional error function approach for subgroup selection in adaptive clinical trials. Stat Med 2012; 31: 4309–4320.
8. Mehta C, Schäfer H, Daniel H, et al. Biomarker driven population enrichment for adaptive oncology trials with time to event endpoints. Stat Med 2014; 33: 4515–4531.
9. Ondra T, Jobjörnsson S, Beckman RA, et al. Optimized adaptive enrichment designs. Stat Methods Med Res 2019; 28: 2096–2111.
10. Rosenblum M, Fang EX, Liu H. Optimal, two-stage, adaptive enrichment designs for randomized trials, using sparse linear programming. J R Stat Soc Ser B: Stat Methodol 2020; 82: 749–772.
11. Lin Z, Flournoy N, Rosenberger WF. Inference for a two-stage enrichment design. Ann Stat 2021; 49: 2697–2720.
12. Chiu YD, Koenig F, Posch M, et al. Design and estimation in clinical trials with subpopulation selection. Stat Med 2018; 37: 4335–4352.
13. Lai TL, Lavori PW, Tsang KW. Adaptive enrichment designs for confirmatory trials. Stat Med 2019; 38: 613–624.
14. Thall PF. Bayesian cancer clinical trial designs with subgroup-specific decisions. Contemp Clin Trials 2020; 90: 105860.
15. Zhang Z, Li M, Lin M, et al. Subgroup selection in adaptive signature designs of confirmatory clinical trials. J R Stat Soc Ser C: Appl Stat 2017; 66: 345–361.
16. Stallard N. Adaptive enrichment designs with a continuous biomarker. Biometrics 2023; 79: 9–19.
17. Ondra T, Dmitrienko A, Friede T, et al. Methods for identification and confirmation of targeted subgroups in clinical trials: a systematic review. J Biopharm Stat 2016; 26: 99–119.
18. Stallard N. A confirmatory seamless phase II/III clinical trial design incorporating short-term endpoint information. Stat Med 2010; 29: 959–971.
19. Friede T, Parsons N, Stallard N, et al. Designing a seamless phase II/III clinical trial using early outcomes for treatment selection: an application in multiple sclerosis. Stat Med 2011; 30: 1528–1540.
20. Henderson R, Diggle P, Dobson A. Joint modelling of longitudinal measurements and event time data. Biostatistics 2000; 1: 465–480.
21. Rizopoulos D. Joint models for longitudinal and time-to-event data: with applications in R. London: Chapman and Hall/CRC, 2012.
22. Dawson SJ, Tsui DWY, Murtaza M, et al. Analysis of circulating tumor DNA to monitor metastatic breast cancer. N Engl J Med 2013; 368: 1199–1209.
23. Magnusson BP, Turnbull BW. Group sequential enrichment design incorporating subgroup selection. Stat Med 2013; 32: 2695–2714.
24. Slamon DJ, Clark GM, Wong SG, et al. Human breast cancer: correlation of relapse and survival with amplification of the HER-2/neu oncogene. Science 1987; 235: 177–182.
25. Wan F, Titman AC, Jaki TF. Subgroup analysis of treatment effects for misclassified biomarkers with time-to-event data. J R Stat Soc Ser C: Appl Stat 2019; 68: 1447–1463.
26. Tsiatis AA, Davidian M. A semiparametric estimator for the proportional hazards model with longitudinal covariates measured with error. Biometrika 2001; 88: 447–458.
27. Stefanski LA, Carroll RJ. Conditional scores and optimal scores for generalized linear measurement-error models. Biometrika 1987; 74: 703–716.
28. Burdon AJ, Hampson LV, Jennison C. Joint modelling of longitudinal and time-to-event data applied to group sequential clinical trials. arXiv preprint, 2022. https://doi.org/10.48550/arxiv.2211.16138.
29. Wakefield J. Bayesian and frequentist regression methods. Berlin: Springer Science & Business Media, 2013.
30. Westfall PH, Krishen A. Optimally weighted, fixed sequence and gatekeeper multiple testing procedures. J Stat Plan Inference 2001; 99: 25–40.
31. Tamhane AC, Gou J, Jennison C, et al. A gatekeeping procedure to test a primary and a secondary endpoint in a group sequential design with multiple interim looks. Biometrics 2018; 74: 40–48.
32. Gordon Lan KK, DeMets DL. Discrete sequential boundaries for clinical trials. Biometrika 1983; 70: 659–663.
33. Freedman LS. Tables of the number of patients required in clinical trials using the logrank test. Stat Med 1982; 1: 121–129.
34. Jennison C, Turnbull BW. Group-sequential analysis incorporating covariate information. J Am Stat Assoc 1997; 92: 1330–1341.
35. Barnett HY, Geys H, Jacobs T, et al. Methods for non-compartmental pharmacokinetic analysis with observations below the limit of quantification. Stat Biopharm Res 2021; 13: 59–70.
36. Bauer P, Koenig F, Brannath W, et al. Selection and bias – two hostile brothers. Stat Med 2010; 29: 1–13.