Sage Journals: Discover world-class research

Abstract

This article presents a Bayesian approach to estimation in multistage experiments based on the reference prior theory. The idea of deriving design-dependent priors was first introduced using Jeffreys’ criterion. A theoretical framework was then established by showing that explicit reference to the design is fully Bayesian justified and Bayesian objectivity cannot ignore such information. Extending the work to multi-parameter problems, a general form of priors was derived from the reference prior theory. In this article, I evidence the good frequentist properties of the reference posterior estimators with normally distributed data. As a notable advance, I address the issue of the point and the interval estimations upon experiment termination. The approach is applied to a data set collected in a clinical trial in schizophrenia with the possibility to stop the trial early if interim results provide sufficient evidence of efficacy or futility. Finally, I discuss the idea of using the reference posterior estimators as a default choice for objective estimation in multistage experiment.

Keywords

Multistage design interim analysis Bayesian estimation Jeffreys’ criterion reference prior theory credible interval frequentist properties

Introduction

For ethical and economical reasons there is an increasing interest in experiments which allow for early stopping as soon as interim results are conclusive. Beyond the statistical significance of the test, the estimation of parameters is a key element in decision makings. However, it is known that data-dependent stopping rules affect the sampling distributions of estimators. The effect on the maximum likelihood estimator (MLE) of the mean of normally distributed data has been reported for a long time¹ along with the deficiencies of the coverage probability of the Wald confidence interval.² The issue of the bias induced by the stopping rules has given rise to various researches to propose estimators adapted to multistage experiments.

The frequentist solutions require ordering the observation space. The reason is that the sufficient statistic is the couple of variables formed by the stopping stage, noted $S$ in what follows, and the sample sum upon stopping noted $Y_{S}$ . An estimator of the mean based on the MLE ordering was studied in terms of frequentist characteristics³ whereas an approach to estimate confidence intervals based on the stage-wise ordering was described.⁴ This latter ordering assumes that the results corresponding to earlier termination are more extreme than those which terminate later. However, (pre-)ordering the observation space introduces some subjectivity in the inference.⁵ Other criticisms are that the solutions are not unique and may depend on the information levels for future (unobserved) stopping stages.

An uniformly minimum variance unbiased estimator (UMVUE) can be found from Rao-Blackwell’s theorem which states that, given an unbiased estimator and a sufficient and complete statistic, the conditional expectation of the first to the second is UMVUE. Although ( $S$ , $Y_{S}$ ) is not complete in the normal case, the expectation of the sample mean at stage 1 conditional to ( $S$ , $Y_{S}$ ) is still UMVU.⁶ A computationally quite extensive formula for calculating the UMVUE of the mean was derived³ but the lack of bias is accompanied by a relatively high variance. Another estimator was proposed which consists of adjusting the MLE by subtracting the estimate of its bias.⁷ The bias is calculated at the adjusted ML estimate using recursive method.

This article describes an objective Bayesian approach to the point and the interval estimations. A prior is objective if it has minimal impact on the posterior distribution. Any candidate objective prior should also results to posterior estimators with good frequentist properties. The idea of deriving design-dependent priors was first introduced using Jeffreys’ criterion.⁸ Based on numerical results, it was noted that the coverage of the credible intervals are improved in the negative binomial model.⁹ A theoretical framework was established by showing that explicit reference to the design is fully Bayesian justified and Bayesian objectivity cannot ignore such information.¹⁰ The property of correction for the stopping rule bias was then demonstrated in the Bernoulli sequential design.¹¹ On this basis, a class of point estimators in multistage binomial design was developed¹² and the work was extended to hypothesis testing.¹³ The Jeffreys prior enjoys many optimality properties for regular models where asymptotic normality holds, but in the presence of nuisance parameters this prior suffers from many deficiencies. The reference prior theory developed by Bernardo¹⁴ allows to overcome these deficiencies. The theory is based on the Kullback-Leibler divergence between the prior and the posterior. More recently, Sun and Berger¹⁵ derived a general form of the reference prior in analysis under sequential experimentation.

In this article, I describe the derivation of reference posteriors in multistage experiments with normally distributed data. The approach is implemented using basic programming. The frequentist properties of the posterior estimators are studied using simulations. The approach is then applied to a data set collected in the double-blind clinical trial FAST¹⁶ which compares the efficacy of a new compound versus placebo in the treatment of acute exacerbation of schizophrenia. The treatment effect was originally estimated using a model of covariance analysis. To study the properties of estimators, I introduce fictitious interim analyses with possibility to stop the trial early for futility or efficacy based on frequentist test results.

The key aspects of the reference prior theory are presented in the next section. The frequentist properties of the posterior estimators are investigated in Section “Frequentist properties”. The approach is then applied to the FAST trial in Section “Application of approach in clinical trial” in which I provide the results of the original analysis before turning to the multistage design cases. In the conclusion, I discuss the idea of using the reference posterior estimators as a default choice for objective estimation in multistage experiment. The R scripts to produce the results reported in this article along with computational details are provided in Supplemental Material.

Derivation of the reference prior

Reference prior theory

The idea behind reference prior is to maximize a distance between the prior and the posterior distributions as data are collected. Formally, the data have maximum influence on the posterior if the Kullback-Leibler (K-L) divergence is maximum. By considering the expectation of the K-L divergence, the reference prior can be defined based on virtual data before experiment. In this section, I expose the key aspects of the reference prior theory for the one-parameter problem in the fixed sample case. Many supports describing the theory can be consulted including an overall description,¹⁷ a formal definition,¹⁸ and a didactic tutorial.¹⁹ Let us consider the model $M_{θ} \equiv {p (x | θ), x \in X, θ \in Θ}$ , wherein $X$ is the outcome variable and $θ$ the parameter, together with a prior specification $π (θ)$ . The procedure described below applies for any sufficient statistic $T$ for $θ$ for which $p (x | θ)$ is in one-to-one correspondence with $p (t | θ)$ . Let us consider the $i$ -dimensional vector $x^{i}$ with density $p (x^{i} | θ)$ and consider the inferential scenario in which the components of $x^{i}$ are $i$ realizations of $X$ from independent experiments. The mutual information average is the average of the K-L divergence between the prior and the posterior with respect to the marginal distribution $p (x^{i})$ . This takes the form: $M I_{π (θ)} (θ, x^{i}) = \int_{X^{i}} p (x^{i}) [\int_{Θ} π^{i} (θ | x^{i}) log \frac{π^{i} (θ | x^{i})}{π^{i} (θ)} d θ] d x^{i} .$ A non-informative prior $π^{i} (θ)$ is obtained by maximizing $M I_{π (θ)} (θ, x^{i})$ . Formally, we have: $π^{i} (θ) = \underset{π (θ)}{argmax} M I_{π (θ)} (θ, x^{i}) .$ The reference prior is then derived by taking $π^{R} (θ) = {lim}_{i \to \infty} π^{i} (θ)$ where the limit allows for improper priors.

Asymptotic theory makes it possible to obtain a convenient form of $π^{i} (θ)$ by rewriting $M I_{π (θ)} (θ, x^{i})$ . The problem is reduced to computing: $π^{i} (θ) = exp \int_{X^{i}} p (x^{i} | θ) log π^{i} (θ | x^{i}) d x^{i} .$ An analytical solution for the limit as $i \to \infty$ can be found using the Bernstein Von Mises theorem, sometimes called the Bayesian central limit theorem. Therefore, $π^{i} (θ | x^{i})$ is asymptotically Gaussian and concentrated at a value $θ_{f}$ which is the ‘true value’ of $θ$ in the frequentist sens of the term, i.e., the value of $θ$ such that the components of $X^{i}$ are i.i.d. following $p (x | θ_{f})$ . Another consequence is that the variance is given by the Fisher information of the selected model. It comes that, in the one-parameter problem, the reference prior is identical to the Jeffreys prior, noted $π^{J} (θ)$ , obtained using Jeffreys’ criterion. If we denote by $I (θ)$ the expected Fisher information, we have: $π^{R} (θ) \propto π^{J} (θ) \propto I (θ)^{\frac{1}{2}} .$ (1)

Extension to multistage design

We now assume that the data are collected in a $K$ -stage experiment whose the design is noted $d_{K}$ . In the general case, the experiment outcome is obtained from a sequence of outcome values observed at the interim analyses. The analysis times are predefined according to the statistical information available at each analysis.

We consider a sequence of independent outcomes $x_{(s)} = (x_{1}, x_{2}, \dots, x_{s})$ which is observed until experiment termination and we assume that the density of $X_{(s)}$ is known. The model is now $M_{θ, d_{K}} \equiv {p (x_{(s)} | θ, d_{K}), x_{(s)} \in R_{(s)}, θ \in Θ}$ wherein $R_{(s)}$ is a $s$ -dimensional subset of the observation space restriction $R_{(K)}$ .

The likelihood function in the design $d_{K}$ takes the form: $L (θ; x_{(s)}, d_{K}) = {[L (θ; x_{1})]}^{1_{s = 1}} \times \dots \times {[L (θ; x_{(K)})]}^{1_{s = K}} .$ (2)The introduction of the design information implies rewriting the relation (1) as a function of the expected design-dependent Fisher information $I (θ | d_{K})$ such as: $π^{R} (θ | d_{K}) \propto π^{J} (θ | d_{K}) \propto I (θ | d_{K})^{\frac{1}{2}} .$ Based on (2), $I (θ | d_{K})$ can easily be expressed as a function of the expected naive (i.e., non design-dependent) Fisher information: $\begin{aligned} I (θ | d_{K}) & = - E_{θ} [\frac{\partial^{2}}{θ^{2}} log L (θ; x_{(s)}, d_{K})] \\ = I (θ) P_{θ} (S = 1) + \dots + I (θ) P_{θ} (S = K) \\ = I (θ) [1 + \frac{n_{2}}{n_{1}} P_{θ} (S \geq 2) + \dots + \frac{n_{K}}{n_{1}} P_{θ} (S = K)] \\ = I (θ) E_{θ} (S) . \end{aligned}$ Jeffreys’ criterion applied to the likelihood (2) yields a design-dependent prior which depends on the naive reference prior and the expected stopping time so that: $π^{R} (θ | d_{K}) \propto π^{R} (θ) E_{θ} {(S)}^{\frac{1}{2}} .$ (3)Based on (3), it is evident that $π^{R} (θ | d_{K})$ is proper if both the naive reference prior is proper and the stopping rule is proper as well (i.e., the stopping time is finite almost surely). The prior $π^{R} (θ | d_{K})$ reflects the degree of certainty associated with the projected design $d_{K}$ by over-weighing the probabilities about $θ$ values more likely leading to late termination. The greater the certainty about $θ$ values, the higher their prior probabilities. By so counterbalancing the expected effect of early stopping, the reference posterior estimators benefit from a correction for the stopping rule bias.

Extension to multi-parameter problems

In multi-parameter problems, it is often the case that only one or a subset of parameters is of interest. For the sake of clarity, we first consider the fixed sample case. The model is $M_{θ, λ} \equiv {p (x | θ, λ), x \in X, θ \in Θ, λ \in Λ}$ wherein $θ$ is the parameter of interest and $λ$ is a nuisance parameter. We would like to find the joint $θ$ -reference prior, noted $π_{θ} (θ, λ)$ , that captures our unequal interest in both parameters. To handle such situation, the reference prior theory allows a simplification which comes down to sequentially computing Jeffreys prior in one-parameter problem. This is described in the following procedure :

Determine Jeffreys prior of $p (x | λ, θ)$ conditional on $θ$ (i.e., assuming $θ$ is a constant) and derive the conditional reference prior $π^{R} (λ | θ)$ .

If $π^{R} (λ | θ)$ is proper, integrate out $λ$ to find: $p (x | θ) = \int_{Λ} p (x | λ, θ) π^{R} (λ | θ) d λ .$

Determine $π^{R} (θ)$ using the standard procedure for reference priors which yields Jeffreys prior of $p (x | θ)$ when $θ$ is one-dimensional.

The joint

θ

-reference prior is

π_{θ}^{R} (θ, λ) = π^{R} (λ | θ) π^{R} (θ)

. Whereas the derivation of the reference prior can be a tricky technical issue, a major simplification appears under posterior asymptotic normality when the parameter space

Λ

does not depend on

θ

.¹⁷ The asymptotic variance matrix of (

θ, λ

)

V (θ, λ) = [\begin{matrix} V_{θ θ} (θ, λ) & V_{θ λ} (θ, λ) \\ V_{λ θ} (θ, λ) & V_{λ λ} (θ, λ) \end{matrix}]

depends on the expected Fisher information matrix such that

V (θ, λ) = I^{- 1} (θ, λ) / n

. Let us now define

H (θ, λ) = V^{- 1} (θ, λ)

for the following statements.

Given $f$ and $g$ respectively functions of $θ$ and $λ$ to be determined, both $V_{θ θ} (θ, λ)$ and $H_{λ λ} (θ, λ)$ can factorize so that $V_{θ θ}^{- \frac{1}{2}} (θ, λ) \propto f_{θ} (θ) g_{θ} (λ) and H_{λ λ}^{\frac{1}{2}} (θ, λ) \propto f_{λ} (θ) g_{λ} (λ) .$ Then, the so-called $θ$ -reference prior takes the simple form: $π_{θ}^{R} (θ, λ) = π^{R} (λ | θ) π^{R} (θ) = f_{θ} (θ) g_{λ} (λ)$ This approach to nuisance parameter is based on an implicit ordering according which the first parameter is $θ$ and the second is $λ$ . If the parameter is now a $m$ -dimensional vector, says $θ$ , the principle can be extended by considering the ordering ( $θ_{1}$ ,…, $θ_{m}$ ). The reference prior relative to this ordering is obtained after successive conditioning such that: $π^{R} (θ) = π^{R} (θ_{m} | θ_{1}, \dots, θ_{m - 1}) \dots π^{R} (θ_{2} | θ_{1}) π^{R} (θ_{1}) .$ The derivation of the reference prior is greatly simplified if $H (θ)$ is block diagonal, particularly, if, for $j = 1, \dots, m$ , the $j^{t h}$ term can be factored into a product of a function of $θ_{j}$ and a function not depending on $θ_{j}$ , such that ${h_{j j} (θ)}^{\frac{1}{2}} = f_{j} (θ_{j}) g_{j} (θ),$ where $g_{j} (θ)$ does not depend on $θ_{j}$ . Then we have: $π^{R} (θ) \propto \prod_{j = 1}^{m} f_{j} (θ_{j}) .$ (4)Based on (4), the introduction of a stopping rule which only depends on the first component $θ_{1}$ is straightforward. The design-dependent reference prior takes the simple form: $π^{R} (θ | d_{K}) \propto π^{R} (θ) E_{θ_{1}} {(S)}^{\frac{1}{2}} .$ (5)By this way, we find the results previously obtained in a more general setting.¹⁵

Frequentist properties

In this section, I study the characteristics of reference posterior estimators of the mean of normally distributed data with known variance. Let $μ$ be the parameter of the mean. To this end, I generate a large number of $K$ -stage experiments by drawing sequences of outcomes ( $x_{1}, x_{2}, \dots, x_{k}$ ), $k = 1, \dots, K$ , from the random variables $X_{i} \sim i . i . d N (μ, 1)$ , $i = 1, \dots, k$ . The generation process is governed by a stopping rule which is defined based on the mean of the $X_{i}$ ’s at stage $k$ , i.e., $M_{k} = Σ_{i = 1}^{k} X_{i} / k$ . Let $J_{k}$ denote the interval for the observed values $m_{k}$ to pursue the experiment to stage $k + 1$ . The stopping stage $S$ is the first $k$ such that $m_{k} \notin J_{k}$ so that the stopping rule can be expressed by: $P_{μ} (S \geq k) = P_{μ} (X_{1} \in J_{1}, M_{2} \in J_{2}, \dots, M_{k - 1} \in J_{k - 1}) .$ (6)For the sake of readability, we denote by $m$ the mean of the $x_{i}$ ’s observed until the experiment termination, i.e., $m = m_{S}$ . Formally, $m$ is the experiment outcome.

As the naive reference prior of $μ$ follows a normal distribution with infinite variance, the density of the reference prior as expressed in (3) is proportional to $E_{μ} {(S)}^{\frac{1}{2}}$ , which is sometimes called the corrective term in what follows. Upon experiment termination, the reference posterior takes the form: $π^{R} (μ | m, d_{K}) \propto π^{R} (μ | m) E_{μ} {(S)}^{\frac{1}{2}} \sim N (m, 1 / S) E_{μ} {(S)}^{\frac{1}{2}}$ A method for simulating from this distribution is the following acceptance-reject algorithm:

Step 1. Sample values of $μ$ from the naive posterior distribution $N (m, 1 / S)$ .

Step 2. Numerically estimate $E_{μ} {(S)}^{\frac{1}{2}}$ by repeatedly simulating the experiment for each $μ$ value.

Step 3. Sample values of $U \sim U n i f o r m (0, 1)$ and accept $μ$ values if $u K^{\frac{1}{2}} \leq E_{μ} {(S)}^{\frac{1}{2}}$ , reject otherwise.

In two-stage design (

K = 2

), step 2 is simplified since

E_{μ} {(S)}^{\frac{1}{2}}

can be derived analytically. Let us define the continuation interval

J_{1} = [l_{1}, u_{1}]

for

x_{1}

at stage

1

and its counterpart

J_{1}^{'} = [l_{1} - μ, u_{1} - μ]

for

x_{1} - μ

, the corrective term is defined using the cumulative function for normal density

Φ

as follows:

E_{μ} {(S)}^{\frac{1}{2}} = (1 + P_{μ} (X_{1} \in J_{1}))^{\frac{1}{2}} = (1 + Φ (J_{1}^{'}))^{\frac{1}{2}} .

To figure out the influence of the design information on the reference posteriors, I now consider the case where the experiment stops early as soon as the mean of the

x_{i}

’s is negative. This rule is a special case of (6). It can be expressed by:

P_{μ} (S \geq k) = P_{μ} (X_{1} > 0, M_{2} > 0, \dots, M_{k - 1} > 0) .

(7)To ease the presentation, I focus on the two-stage design

d_{n e g 2}

(

K = 2

) where the experiment stops at stage

1

x_{1} \leq 0

and the five-stage design

d_{n e g 5}

(

K = 5

) where the experiment stops at stage

1

x_{1} \leq 0

or at stage

k

as soon as

m_{k} \leq 0

k = 2, 3, 4

. Figure 1 displays the curves of the reference posterior densities of

μ

in the two designs if the experiment stops at stage 1 and

x_{1} = 0

is observed. The stopping rule expressed in (7) implies that the negative values of

μ

are associated with an early stopping. In the design-dependent posteriors, this effect is counterbalanced by over-weighing the probabilities of the positive values of

μ

more likely leading to a late termination. This correction effect is more accentuated in the five-stage design wherein the stopping rule effect is stronger.

Figure 1.

Naive (- - -) and design-dependent (—) reference posterior densities of $μ$ if the experiment stops at stage 1 and $x_{1} = 0$ is observed in the designs $d_{n e g 2}$ and $d_{n e g 5}$ .

Table 1 shows the influence of the design information on the reference posterior estimates. The mean, the median, and the mode of the posteriors are provided in the two designs if the experiment stops at stage $1$ and $x_{1} = 0$ is observed or stops at stage 2 and $m_{2} = 0$ is observed. These estimates are also given in the five-stage design if the experiment continues to stage $5$ and $m_{5} = 0$ is observed. To complete the description, I give the $95 %$ -credible intervals based on the $2.5 %$ and the $97.5 %$ posterior quantiles. The raw shift of the point estimates from $0$ is a straightforward indicator to measure the influence of the design information. The influence is greater in the five-stage design since the shift of the posterior mean varies from $0.24$ if the experiment stops at stage $1$ (which is $24 %$ of the standard deviation of $X_{1}$ ) to $0.11$ if the experiment continues to stage $5$ (which is 5% of the standard deviation of $M_{5}$ ). The effect of the prior remains however important in the two-stage design since the shift of the posterior mean represents 9% of the standard deviation of $X_{1}$ if the experiment stops at stage $1$ and 5% of the standard deviation of $M_{2}$ if the experiment stops at stage $2$ .

Table 1.

Mean, median, and mode of the reference posteriors and $95 %$ -credible intervals if the experiment stops at stage $1$ and $x_{1} = 0$ is observed or stops at stage k and $m_{k} = 0$ is observed for $k = 2$ in the design $d_{n e g 2}$ or for $k = 2, 5$ in the design $d_{n e g 5}$ .

	Naive	Design-dependent posterior
	posterior	$K = 2$	$K = 5$
Stop at stage 1
mean,med,mod	$0$	$.09, .10, .08$	$.24, .26, .33$
$95 %$ -CI	$[- 1.96, 1.96]$	$[- 1.87, 2.02]$	$[- 1.76, 2.10]$
Stop at stage 2
mean,med,mod	$0$	$.07, .07, .05$	$.17, .19, .21$
$95 %$ -CI	$[- 1.39, 1.39]$	$[- 1.32, 1.43]$	$[- 1.24, 1.49]$
Stop at stage 5
mean,med,mod	$0$		$.11, .12, .13$
$95 %$ -CI	$[- .88, .88]$		$[- .79, .94]$

The bias is an important frequentist property of point estimators. Another obvious characteristic is precision, often quantified in terms of the mean squared error (MSE). The bias may be small while precision can be large. In the case that the bias reduction substantially increases the MSE compared to the MLE, the use of a bias-reduced estimator is more than questionable. Figure 2 shows the bias and the MSE of the reference posterior mean estimator in the two-stage and the five-stage designs. The curves are displayed for an interval of ‘true values’ (in the frequentist sens of the term) of the parameter of the mean, which is noted $μ_{f}$ not to confuse with $μ$ its Bayesian counterpart. As element of comparison, the curves of Whitehead’s bias-adjusted estimator which is known to have good MSE properties are also shown.

Figure 2.

Bias and MSE of the naive (- - -) and the design-dependent (—) reference posterior mean estimator and Whitehead’s bias-adjusted estimator ( $\cdot$ - $\cdot$ ) in the designs $d_{n e g 2}$ and $d_{n e g 5}$ .

The reference posterior mean estimator based on the design-dependent prior exhibits a lower bias magnitude compared to its naive counterpart with ranges decreasing from [ $- 0.20, 0$ ] to [ $- 0.13, 0.03$ ] in the two-stage design and from [ $- 0.35, 0$ ] to [ $- 0.22, 0.08$ ] in the five-stage design. Whereas the MSEs of the three estimators are comparable, Whitehead’s bias-adjusted estimator keeps an advantage in term of bias with the range [ $- 0.04, 0.04$ ] in the two-stage design and [ $- 0.08, 0.06$ ] in the five-stage design. It is important to recall here that this estimator has been specifically developed for bias reduction. In return, a major criticism is that the adjustment for the bias only depends on the ML estimate without consideration for the statistical information. On the contrary, the bias correction in the reference posterior estimators depends on the stopping stage: the greater the statistical information, the lower the correction. In line with the reference prior theory principle, the design information is appropriately and objectively used at each interim analysis.

An important frequentist characteristic for credible intervals is the coverage rate, i.e., the proportion of the time that the interval contains the ‘true value’ of $μ_{f}$ . Figure 3 shows the one-sided coverage rates of the $95 %$ -credible intervals based on the reference posteriors in the two-stage and the five-stage designs. For the upper and the lower CI limits, the design-dependent approach allows a lower departure of the coverage rates from the nominal level $0.975$ . For the upper limit, the coverage range improves from [ $0.958, 0.976$ ] using the naive approach to [ $0.962, 0.979$ ] using the design-dependent approach in the two-stage design and from [ $0.951, 0.975$ ] to [ $0.961, 0.985$ ] in the five-stage design. For the lower limit, the coverage range improves from [ $0.973, 0.992$ ] to [ $0.971, 0.990$ ] in the two-stage design and from [ $0.973, 0.994$ ] to [ $0.968, 0.990$ ] in the five-stage design.

Figure 3.

One-sided coverage rates of the $95 %$ -credible intervals based on the naive (- - -) and the design-dependent (—) posteriors in the designs $d_{n e g 2}$ and $d_{n e g 5}$ .

For many statisticians this presentation is of minimal value since the population mean can not be summarized by a fixed value of $μ_{f}$ . Consider now that the values of interest are summarized by a random variable $μ_{r}$ which is normally distributed with a standard deviation of 1. This value is chosen to ease the interpretation as it represents the (known) standard deviation of the outcomes by stage. In this way, the quantity of interest is not the bias for a specific value of $μ_{f}$ but the bias averaged over the values of $μ_{r}$ with respect to its distribution. Table 2 shows that the design-dependent approach allows substantial reductions in terms of so-called ‘average bias’. If the distribution of $μ_{r}$ is centered at 0 such that $μ_{r} \sim N (0, 1)$ , this quantity decreases from $- 0.14$ to $- 0.08$ in the two-stage design and from $- 0.23$ to $- 0.11$ in the five-stage design. Important improvements are also observed in terms of so-called ‘average CI coverage’. If the distribution of $μ_{r}$ is shifted to the critical value $1.96$ such that $μ_{r} \sim N (1.96, 1)$ , the average coverage of the upper limit improves from $0.969$ to $0.972$ in the two-stage design and from $0.963$ to $0.970$ in the five-stage design. These corrections toward the nominal value $0.975$ are also observed for the lower limit if $μ_{r} \sim N (- 1.96, 1)$ since the average coverage improves from $0.981$ to $0.977$ in the two-stage design and from $0.984$ to $0.977$ in the five-stage design.

Table 2.

Average bias of the posterior mean estimator and average coverage rate of the $95 %$ -credible interval based on the reference posteriors in the designs $d_{n e g 2}$ and $d_{n e g 5}$ .

	$K = 2$		$K = 5$
	naive	design-dep.	naive	design-dep.
	posterior	posterior	posterior	posterior
Average bias for $μ_{r} \sim N (0, 1)$	$- 0.14$	$- 0.08$	$- 0.23$	$- 0.11$
Upper limit coverage for $μ_{r} \sim N (1.96, 1)$	$0.969$	$0.972$	$0.963$	$0.970$
Lower limit coverage for $μ_{r} \sim N (- 1.96, 1)$	$0.981$	$0.977$	$0.984$	$0.977$

These results obtained with normally distributed data confirm the good frequentist properties that were evidenced in the Bernouilli and the binomial models.^12,13

Application of approach in clinical trial

Description of the case study

A clinical trial design is called ‘adaptive’ if an element can be modified at an interim analysis according to prospectively planned specifications with full control of the frequentist type 1 error.²⁰ However, the simple rejection of a null hypothesis is not sufficient to establish a convincing evidence of the efficacy of a treatment and full interpretation of the results should be based on point and interval estimates. In group sequential design, the trial can stop early if the results reveal sufficiently conclusive evidence to support an hypothesis but parameter estimation is subject to a bias toward the more extreme values of hypothesis. To remedy this, I apply the design-dependent Bayesian approach to estimate the treatment effect in a clinical trial.

The motivating dataset comes from the double-blind clinical trial FAST (NCT number : NCT02151656) to compare the efficacy of a new compound versus placebo in the treatment of acute exacerbation of schizophrenia.¹⁶ The primary outcome was the change of the positive and negative syndrome scale from the randomization to 6 weeks (noted $Δ$ PANSS). The analysis was based on an analysis of covariance model (ANCOVA) incorporating the PANSS baseline (noted BPANSS) as a continuous covariate and a stratum factor to adjust for possible imbalance across the four European countries where the trial was conducted. The ANCOVA model can be expressed by: $Δ {PANSS}_{h i j} = μ + α {BPANSS}_{h i j} + (i - 1) β + γ_{h} + e_{h i j}$ (8)where $Δ {PANSS}_{h i j}$ is the PANSS change value for the $j$ th subject ( $j = 1, \dots, n_{h i}$ ) receiving the treatment $i$ ( $i = 1, 2$ ) in the country $h$ ( $h = 1, 2, 3, 4$ ) and ${BPANSS}_{h i j}$ is his/her baseline value. The parameter $μ$ is the intercept and $β$ represents the treatment effect. A positive value of $β$ indicates a greater improvement in the test group than under placebo. Next, $γ_{h}$ is the fixed effect of the $h^{t h}$ country. To avoid model over-parametrization, the parameter of the last level $γ_{4}$ is set to 0. Last, $e_{h i j}$ is the experimental error assumed to be distributed according to a $N (0, σ^{2})$ .

The trial was powered for a one-sided hypothesis test to detect a treatment effect of $8$ with $σ = 19$ and a significance level set to $0.05$ . The ratio of the treatment effect to the standard deviation yields an effect size of $0.42$ . In the original protocol, the one-sided test was a gatekeeper for the two-sided test version in a hierarchical manner. In other words, the treatment effect is first tested at a $0.05$ one-sided significance level and, if this test is statistically significant, the treatment effect is then tested at a $0.05$ two-sided significance level. Totally, 142 subjects were planned to be randomized in a $1 : 1$ ratio.

Another important information is that an independent data monitoring committee (IDMC) was set-up to decide whether the trial should stop for futility or continue until its expected end based on interim results. The interim analysis was planned to be conducted on half of the patients having completed their 6-week follow-up. The IDMC statistician was in charge to run the programs of analysis using the actual randomization list. There were no formal rules to stop the trial for futility but to help them in their decision-making process the IDMC members were provided with information like the treatment-effect estimate and the frequentist conditional power. The IDMC members were also informed that the treatment effect estimated upon trial termination should be at least 4 to pursue the clinical development of the compound further. Of note, this treatment-effect magnitude corresponds to half of the expectation considered under the alternative testing hypothesis.

Actually, the interim analysis was conducted on $71$ patients using an ANCOVA model with the PANSS baseline as the only covariate. The interim estimate of the treatment effect was $3.3$ . Although this magnitude was under the clinical objective, the IDMC members decided to continue the trial until its expected end based on information combining global efficacy and tolerance. The final analysis was conducted on $134$ patients, $67$ patients in each treatment group, and the treatment effect estimated via model (8) was $6.2$ ( $2.5$ ) $p = 0.007$ (one-sided). Incidentally, this promising result obtained on the trial population revealed a discrepancy between the patients included before and after the interim analysis although there were no modifications in the inclusion criteria.

Estimation upon frequentist test termination

The objective Bayesian strategy to analyzing treatment effect via model (8) is extremely simple and goes as follows. To derive the reference prior, the parameters need to be grouped and set in descending order of interest. Let us consider ( $β, Ψ, σ$ ) with $Ψ = (μ, α, γ_{1}, γ_{2}, γ_{3})$ where $β$ is the parameter of interest and ( $Ψ, σ$ ) are the nuisance parameters. The expected Fisher information for this ordering takes the simple form of a three-bloc diagonal matrix with $1 / 2 σ$ in the bottom component and $1 / σ$ elsewhere. Since only the term $σ$ appears in the matrix, formula (4) yields a reference prior which is: $π^{R} (β, Ψ, σ) \propto π^{R} (σ | β, Ψ) \propto σ^{- 1} .$ The marginal reference posterior of $β$ was derived a long time ago.²¹ The authors used a prior proportional to $σ^{- 1}$ , in place of the Jeffreys prior which is proportional to $σ^{- 2}$ , based on intuitive arguments. The marginal reference posterior follows a Student distribution such that $π^{R} (β | d) = \frac{β - d}{\sqrt{M S E (\frac{1}{n_{1}} + \frac{1}{n_{2}})}} \sim t_{λ}$ (9)where $d$ is the difference between the treatment groups, $n_{1}$ and $n_{2}$ are the sample sizes, $M S E$ is the residual mean sum of squares, and $λ$ is the associated degree of freedom.

We now turn our attention to the estimation in multistage designs. According to (5), the reference posterior of $β$ given the difference $d$ and the design $d_{K}$ takes the form: $π^{R} (β | d, d_{K}) \propto π^{R} (β | d) E_{β} {(S)}^{\frac{1}{2}} .$ (10)To this end, the FAST dataset is reanalyzed by introducing fictitious interim analyses with possibility to stop the trial early for futility or efficacy. In what follows, we consider two stopping rules which are defined as:

Stop for futility if the difference between the treatment groups is below 4 ( $d < 4$ ): Interestingly enough, this rule corresponds to the context of the IDMC meeting during the trial. This rule was implicitly suggested as it was a condition to stop the development of the compound if such a difference was observed upon trial termination and this information was known by the IDMC members.

Stop and conclude for efficacy if the frequentist ‘ $p$ -value’ is lower than the Pocock boundaries for testing the no-effect hypothesis: Pocock’s method is used to preserve the frequentist type I error when several analyses to test the same hypothesis are planned. The method results to equal significance levels across analyses.

After completing the trial, the reference posterior (10) can be derived using the acceptance-rejection algorithm described in Section “Frequentist properties”. In step 2 of algorithm, the corrective term

E_{β} {(S)}^{\frac{1}{2}}

is estimated by repeatedly simulating trials for each

β

value generated from the naive posterior (9). For the futility stopping rule, the estimates of

E_{β} {(S)}^{\frac{1}{2}}

are based on simulated sequences of differences

d

, whereas sequences of

p

-values are simulated for the efficacy stopping rule. It is important to note that a Bayesian strategy could also be considered in either stopping rule. For example, the futility stopping rule could be based on the posterior probability that

β > 4

. In this case,

E_{β} {(S)}^{\frac{1}{2}}

is estimated from simulated sequences of posterior probabilities.

For the reanalysis of the FAST data set, we consider two-stage designs ( $K = 2$ ) with one interim analysis on half of the trial population and three-stage designs ( $K = 3$ ) with two interim analyses on one-third and two-third of the trial population. The selection of the patients in the interim analyses only depends on their inclusion dates. The implementation of the stopping rules results to the futility and the efficacy designs described in Table 3. To match with the first inference of the original analysis, the Pocock boundaries are given for one-sided tests to preserve an overall $α$ -risk of $0.05$ . In addition to the designs shown in Table 3, I introduce the ‘combined design’ which combines both stopping rules: the trial can stop early for futility if $d < 4$ or for efficacy if the $p$ -value is lower than the Pocock boundary.

Table 3.

Multistage designs based on the rules to stop the trial early for futility or efficacy.

		Stopping rules
		Futility	Efficacy
Two-stage	Analysis 1 ( $n = 67$ )	stop if $d < 4$	stop and reject $H_{0}$ if $p < 0.0294$
designs	Analysis 2 (n=134)	reject $H_{0}$ if $p < 0.05$	reject $H_{0}$ if $p < 0.0294$
Three-stage	Analysis 1 (n=45)	stop if $d < 4$	stop and reject $H_{0}$ if $p < 0.0221$
designs	Analysis 2 (n=90)	stop if $d < 4$	stop and reject $H_{0}$ if $p < 0.0221$
	Analysis 3 (n=134)	reject $H_{0}$ if $p < 0.05$	reject $H_{0}$ if $p < 0.0221$

Table 4 allows measuring the influence of the design information on estimation. The results are shown for all analyses at any stage regardless the earlier decision to stop or continue the trial. The reference posterior means are used to estimate the treatment effect. The interval estimation is based on the 90%-credible intervals whose the limits are given by the 5% and the 95% posterior quantiles. The coverage probability of 90% does not match with the Pocock boundaries. However, this unique coverage for all the analyses whatever the design eases the appraisal of the influence of the design information on the credible intervals.

Table 4.

Reference posterior means and 90%-credible intervals in the naive and the design-dependent approaches for each multistage design.

		Naive	Design-dependent approach
		approach	Futility	Efficacy	Combined
Two-stage	Analysis 1	$3.4$ $[- 2.2, 9.1]$	$3.8$ $[- 1.9, 9.3]$	$3.2$ $[- 2.3, 8.7]$	$3.5$ $[- 2.0, 9.0]$
designs	Analysis 2	$6.2$ $[2.1, 10.2]$	$6.3$ $[2.3, 10.3]$	$6.0$ $[1.9, 10.0]$	$6.1$ $[2.2, 10.2]$
Three-stage	Analysis 1	$4.5$ $[- 2.4, 11.3]$	$5.1$ $[- 1.7, 11.8]$	$3.9$ $[- 2.7, 10.6]$	$4.6$ $[- 2.2, 11.2]$
designs	Analysis 2	$4.8$ $[- 0.1, 9.8]$	$5.2$ $[0.3, 10.0]$	$4.5$ $[- 0.3, 9.4]$	$4.9$ $[0.0, 9.7]$
	Analysis 3	$6.2$ $[2.1, 10.2]$	$6.4$ $[2.4, 10.4]$	$5.9$ $[1.9, 9.9]$	$6.2$ $[2.2, 10.1]$

As observed in the previous section, the densities of the design-dependent reference posteriors are moved away from the stopping boundaries. Relative to the naive approach, the treatment-effect estimates are increased in the futility designs and decreased in the efficacy designs. In the combined designs, the posterior densities are moved away from both the futility and the efficacy boundaries and makes the distribution more concentrated on its central value. Consequently, the posterior means are close to those obtained in the naive approach whereas the lengths of the credible intervals are reduced. The fluctuation of the results is maximum in the three-stage designs at the first interim analysis with treatment-effect estimates varying from $3.9$ to $5.1$ . This variation is caused by both the influence of the design information which is stronger in the three-stage designs and the prior weight which is greater at the first interim analysis in presence of a limited number of patients. For the inverse reason, the results at the last planned analyses are more stable with estimates varying from $5.9$ to $6.4$ whatever the design configuration.

Whatever the design, no trials stop early for efficacy since none of the $p$ -values pass the Pocock boundaries. The significance levels are $p = 0.16$ at analysis 1 in the two-stage designs and $p = 0.14$ at analysis 1 and $p = 0.055$ at analysis 2 in the three-stage designs. However, the treatment effect estimated using model (8) is $3.4$ at analysis 1 in the two-stage designs so that this analysis is the final one in the futility and the combined designs. Table 5 shows the statistical decisions and the estimates obtained upon trial termination. For the efficacy and the combined designs, the CI limits are based on the $α_{p o c o c k}$ and the $1 - α_{p o c o c k}$ quantiles of the reference posterior densities, whereas the 90%-credible intervals are shown for the futility designs.

Table 5.

Statistical decisions and estimates obtained upon trial termination using the design-dependent Bayesian approach for each multistage design.

	Stopping rules
	Futility	Efficacy	Combined
Two-stage	stop at analysis 1	reject $H_{0}$ at analysis 2	stop at analysis 1
designs	Posterior mean= $3.8$	Posterior mean= $6.0$	Posterior mean= $3.5$
	$90 % - C I = [- 1.9, 9.3]$ *	$94.1 % - C I = [1.4, 10.7]$	$90 % - C I = [- 2.0, 9.0]$ *
Three-stage	reject $H_{0}$ at analysis 3	reject $H_{0}$ at analysis 3	reject $H_{0}$ at analysis 3
designs	Posterior mean= $5.2$	Posterior mean= $4.5$	Posterior mean= $4.9$
	$90 % - C I = [2.4, 10.4]$	$95.6 % - C I = [1.0; 10.9]$	$95.6 % - C I = [1.3; 11.0]$

* The $90 %$ -credible interval is given for descriptive purpose as the trial stops early for futility.

Concluding remarks

The use of data-dependent stopping rules in experiments has long been a source of controversies among theoretical statisticians. Some are reluctant to transgress the stopping rule principle according to which once the data have been obtained, the reasons for stopping the experiment should have no bearing on the evidence reported about the parameter. The stopping rule principle is the main consequence of the likelihood principle which states that all of the information about the parameter provided by an experimental outcome is expressed in the likelihood function. In turn, the likelihood principle is considered as a direct implication of Bayes’ theorem. However, any applied statistician considers that the design information cannot be ignored because of the bias induced by the stopping rule.

An important breakthrough was made by showing that the likelihood principle is no longer a direct implication of Bayes’ rule.¹⁰ Bayes’ rule was expressed with an explicit reference to the experimental design $d_{K}$ . Using the formalism of Section “Extension to multistage design”, $x_{(k)} = (x_{1}, x_{2}, \dots, x_{k})$ is a sequence of independent outcomes observed in the design $d_{K}$ . We assume that $X_{(k)}$ has a known density function $p (x_{(k)} | θ)$ which satisfies minimum conditions of regularity. Bayes’ rule can be expressed as: $π (θ | x_{(k)}, d_{K}) \propto π (θ | d_{K}) p (x_{(k)} | θ, d_{K}) .$ (11)Formulation (11) holds for any multistage design governed by a proper stopping rule. It becomes now evident that a state of prior ignorance cannot be characterized without reference to the experimental design. This formulation also shows that the approach developed in this article is fully Bayesian justified.

The idea behind the reference prior theory is to maximize a distance between the prior and the posterior distributions, as data are collected. In return, the ‘data collected’ have maximum influence on the posterior estimates. In this article, I extend this interpretation to the ‘data collected in a given experimental design’. I also show that the reference prior theory applied to multistage experiments yields a class of Bayesian estimators with good frequentist properties while allowing a unified approach to the point and the interval estimations. The reference posterior estimators make it possible to avoid the various problems of the alternatives such as the pre-ordering in the observation space in the frequentist approach or the bias-correction regardless the level of information as in Whitehead’s bias-adjusted point estimator. These arguments justify the use of the reference posterior estimators as a default choice for objective estimation in multistage experiments.

Footnotes

ORCID iD

Pierre Bunouf

Supplemental material

Supplementary material for this article is available online.

References

Armitage

. Numerical studies in the sequential estimation of a binomial parameter. Biometrika 1958; 45: 1–15.

Tsiatis

Rosner

Mehta

. Exact confidence interval following group sequential test. Biometrics 1984; 40: 797–803.

Emerson

Fleming

. Parameter estimation following group sequential hypothesis testing. Biometrika 1990; 77: 875–892.

Jennison

Turnbull

. Group Sequential Methods with Applications to Clinical Trials. New York: Chapman & Hall, 2000.

Whitehead

. The case for frequentism in clinical trials. Stat Med 1993; 12: 1405–1413.

Liu

Hall

. Unbiased estimation following a group sequential test. Biometrika 1999; 86: 71–78.

Whitehead

. On the bias of maximum likelihood estimation following a sequential test. Biometrika 1986; 73: 573–581.

Govindarajulu

. The Statistical Analysis of Hypothesis Testing, Point and Interval Estimation, and Decision Theory. Columbus, OH: American Sciences Press, 1981.

. Reference priors when the stopping rule depends on the parameter of interest. J Am Stat Assoc 1993; 88: 360–363.

10.

de Cristofaro

. On the foundations of likelihood principle. J Stat Plan Inference 2004; 126: 401–411.

11.

Bunouf

Lecoutre

. Bayesian priors in sequential binomial design. C R Acad Sci Paris, Ser I 2006; 343: 339–344.

12.

Bunouf

Lecoutre

. On Bayesian estimators in multistage binomial designs. J Stat Plan Inference 2008; 138: 3915–3926.

13.

Bunouf

Lecoutre

. An objective Bayesian approach to multistage hypothesis testing. Seq Anal 2010; 29: 88–101.

14.

Bernardo

. Reference posterior distributions for Bayesian inference. J Roy Statist Soc B 1979; 41: 113–147. (with discussion).

15.

Sun

Berger

. Objective Bayesian analysis under sequential experimentation. IMS Collections, Pushing The Limits of Contemporary Statistics: Contributions in Honour of Jayanta K Ghosh 2008; 3: 19–32.

16.

Bitter

Istvan

. Randomized, double-blind, placebo-controlled study of F17464, a preferential

D_{3}

antagonist, in the treatment of acute exacerbation of schizophrenia. Neuropsychopharmacology 2019; 44: 1917–1924.

17.

Bernardo

. Reference analysis. Handb Stat 2005; 25: 17–90. Elsevier.

18.

Berger

Bernardo

Sun

. The formal definition of reference priors. The Annals of Statistics 2009; 37: 905–938.

19.

Bernardo

. Bayesian Reference Analysis. A Postgraduate Tutorial Course, 1998, available at https://www.uv.es/bernardo/Monograph.pdf.

20.

FDA. Center for Drug Evaluation and Research (2019), Adaptive Design Clinical Trials for Drugs and Biologics Guidance for Industry.

21.

Box

Tiao

. Bayesian Inference in Statistical Analysis. New York: Wiley, 1992.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.18 MB

0.00 MB