Abstract
Introduction: Assessing Prediction Models to Support Clinical Decisions
What is the situation?
A fundamental problem of medical decision making is that of prognosis.1 The patient and clinician must decide which among the available treatments is likely to lead to the best outcome.
Statistical model selection and prediction assessment are long-standing problems in the field of statistics (see, eg, Refs. 2 and 3). Much effort has been focused on the statistical properties of predictive models and their predictions. Common evaluation criteria include the Brier score and the area under the receiver operating characteristic (ROC) curve. However, the clinical benefit of an improved predictive model remains difficult to assess. New measures are emerging that seek to quantify the clinical utility of predictions. These include reclassification measures (net reclassification improvement and integrated discrimination improvement 4 ), as well as decision curves 5 and relative utility curves.6,7 Decision curve analysis quantifies the clinical utility of a diagnostic prediction model by incorporating harms and benefits into an optimal decision threshold. The advantage of the Vickers and Elkin (VE) 5 approach is that a risk probability threshold can be used to “both categorize patients as positive or negative and to weight the false-positive and false-negative classifications”. 8 Baker et al. 6 extend decision curve ideas to evaluate the relative expected maximum utility. This is the ratio of expected utility achieved by a risk prediction model to that obtained by perfect prediction. A key idea in both Refs. 5 and 6 is that the importance of harms and benefits may differ from patient to patient. Both approaches consider a range of thresholds appropriate to a particular diagnostic situation.
What is our solution?
We propose a novel approach to evaluating prediction models using a decision analytic framework. Our work stems from the observation that a prediction model is clinically useful only if it changes a treatment decision and the prediction-supported treatment improves the patient's outcome compared to that which would have occurred with the original treatment choice. The clinical utility of prediction relies on the availability of better treatment options. Our approach combines a predictive model's ability to discriminate good from poor outcome with the benefits afforded by treatment. It also includes the (potential) negative consequences of treatment. We term this combination of predictive model and treatment efficacy the “combined benefit” (CB) of predictive treatment. We focus on a setting where the proposed treatment reduces a patient's risk (probability) of a poor primary outcome. The choice of whether to take a chemopreventive agent is our motivating example.
To preview our result, consider the probabilities of acquiring disease when doing nothing or taking a treatment,
That is, the reduction in risk of disease is greater than the cost-to-benefit ratio of treatment.
To assess a prediction model or treatment rule, we propose the CB criterion. This combines model-based predictions of acquiring disease (
where
The cost–benefit ratio may be considered a patient-specific threshold for selecting treatment. Competing prediction models and/or treatment rules can be compared at each threshold value. It is possible for one model to provide greater benefit when the treatment cost is high, but a different model to be superior with low treatment cost. Further, patients have heterogeneous attitudes toward treatment cost and benefit. Thus, identifying a relevant range of treatment thresholds is key to evaluating competing prediction models. We note that individual patients do not benefit directly from the proposed CB framework. The benefit is indirect, and is achieved through the use of decision support models tuned to problem-specific costs and benefits.
Links to similar approaches
Our approach follows directly from an application of decision analysis, and is related to several results reported previously. Observe that the decision rule above is related to a widely used measure of clinical effectiveness, the “number needed to treat” (NNT).10 This is the number of patients who must be treated to prevent one patient's disease. The form of this measure is NNT = 1 / (risk of disease without treatment − risk of disease with treatment).
Clearly, NNT is the reciprocal of the standard decision rule (eqn. 1, above), but in our approach, it is scaled by the relative benefit and cost of treatment.
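As a small worked illustration of NNT, the sketch below applies the reciprocal of the absolute risk reduction to the recurrence proportions reported for the motivating trial (41% with placebo, 12% with DFMO plus sulindac). The function name is hypothetical and is used here only for illustration.

```python
def number_needed_to_treat(risk_untreated: float, risk_treated: float) -> float:
    """Return the reciprocal of the absolute risk reduction."""
    risk_reduction = risk_untreated - risk_treated
    if risk_reduction <= 0:
        raise ValueError("treatment must reduce risk for NNT to be defined")
    return 1.0 / risk_reduction


# Recurrence proportions reported for the DFMO plus sulindac trial.
nnt = number_needed_to_treat(0.41, 0.12)
print(f"NNT = {nnt:.2f}")
```

With these proportions, roughly three to four patients would need to be treated to prevent one recurrence.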
Also, our approach is similar to Vickers and Elkin 5 and to Baker et al. 6 for evaluating diagnostic prediction.
where
However, our approach differs in several important ways. First, we are concerned with problems in which the proposed treatment reduces the risk of disease. This leads to a criterion based on the difference in risk probabilities. Conversely, Ref. 5 and 6 consider the problem of diagnosis, and their criterion follows the odds ratio. Second, our CB measure relies on both the predictive model and the costs and benefits of the treatment. VE's “net benefit” criterion combines predictive accuracy with the costs of misclassification. Finally, CB makes use of utilities from both treated and untreated patients, whereas net benefit considers only patients with a positive diagnosis. 11 By comparison, Baker et al. 6 developed a relative utility curve, which compares the performance of a risk prediction model with that achieved by perfect prediction. They also propose a “test threshold”: the minimum number of tests that would be traded for a true positive while maintaining non-negative expected utility.
Other perspectives
Our approach relies on a Bayesian perspective of decision making under uncertainty.12,13 Specifically, it allows personalistic, subjective probabilities and utilities. Despite a scientific history dating to the 1930s,14,15 this approach still faces practical difficulties and controversy over its philosophical foundations. Practically, evaluating and quantifying each patient's cost–benefit ratio (eg, in eqn. 1) is a key challenge. Both costs and benefits are composed of multiple objectives, and contribute to a patient's highly personalistic utility valuations. In addition, the philosophical foundations of Bayesian decision theory have been criticized for their subjective nature, behavioristic decision making (rather than scientific inference), and reliance on semi-empirical, a priori reasoning.16,17
The next section introduces a motivating example in the area of colorectal adenoma chemoprevention. We make use of data from a clinical trial 18 evaluating a drug treatment to prevent adenoma recurrence. This trial exhibits key features that motivate our approach, and is an informative example for evaluating a predictive model. Note, however, we do not consider this as an analysis of the trial data. Because formal decision analysis is frequently omitted from informatics, biostatistics, and epidemiology training, Section 3 reviews the principles involved. Section 4 develops the CB measure, and Section 5 demonstrates its use with the adenoma chemoprevention trial data. Finally, we discuss ramifications of using formal decision analysis techniques to evaluate patient treatment decisions.
Example: Chemoprevention of Colorectal Adenoma
To motivate development, we consider a chemoprevention trial to prevent recurrence of colorectal adenomas. 18 The trial showed a large reduction in recurrence, and has several features that make it informative for methodologic examination. We use its data to motivate development of the methods and to demonstrate the CB analysis of a predictive model.
Difluoromethylornithine (DFMO) and sulindac clinical trial overview
Three hundred seventy-five patients with a history of resected adenoma were randomly assigned to an oral chemopreventive, DFMO plus sulindac, or placebo following a stratified randomization scheme. Colonoscopies were performed at baseline and three years post-randomization. An independent data safety and monitoring board recommended early-stopping of the study for treatment efficacy. There were 267 evaluable patients: 129 in the placebo arm and 138 assigned to treatment with DFMO. Adenoma recurrence was 41% in the placebo group, and only 12% for patients treated with DFMO (risk ratio 0.30, 95% confidence interval 0.18–0.49,
Trial safety: side effects with chemopreventive treatment
Any chemopreventive may increase the risk of side effects and adverse events. The trial data suggest small increases in the risk of several side effects with DFMO treatment (shown in Table 1). None of the treatment group comparisons reached statistical significance (
Table 1. Reported frequency of side effects and adverse events (AE) from the DFMO plus sulindac trial, Meyskens et al. 2008.
Decision problem components
Suppose we now consider treating a new patient with a resected adenoma. The patient has the choice of taking a chemoprevention therapy (DFMO + sulindac) to prevent recurrence. “Should this patient take DFMO + sulindac or not?”
The patient's decision may involve (at least) the following questions:
What is the patient's risk of adenoma recurrence, say, in 3 years?
If chemoprevention is chosen, what is the risk of recurrence?
With chemoprevention, what are the risks and severity of side effects?
Are there additional treatment risks without chemoprevention (such as risk associated with more colonoscopies)?
The usual statistics of trial reporting (OR = 0.3,
Fundamentals of Decision Analysis
When faced with a decision in the context of uncertain risk and benefit, we rely on Bayesian decision analysis to provide a principled, coherent approach. We provide only a brief overview of the process. For textbook accounts of general Bayesian decision analysis, see, eg, Refs. 13 and 19. For a text focusing on medical decisions, see Ref. 20. Also, Ref. 9 provides a readable introduction to the implementation of evidence-based medicine as Bayesian decision making.
A decision analysis explicitly recognizes multiple components of a decision problem. We outline the components and their parallel in the chemoprevention example.
The decision maker (DM): patient (and her physician).
The set of actions available to DM: take DFMO + sulindac or not.
The possible outcomes or consequences that may be uncertain: adenoma recurrence, adverse events, hearing loss, carcinoma.
Information or evidence that may be relevant: DFMO and sulindac chemoprevention trial
Utility, an assessment of the DM's preferences for the different outcomes: weighs disease recurrence against possible side effects of medication. This also considers less well defined factors such as the requirement of taking daily medication, or increased risk from more colonoscopies. Patients’ utilities vary substantially by individual.
The DM's goal is to choose among the possible actions to achieve the best outcome. “Best” is defined by the probability-weighted outcome preferences; that is, by maximum expected utility.
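The maximum expected utility principle can be sketched numerically for the chemoprevention example. The recurrence probabilities below are the trial proportions (12% treated, 41% placebo); the utility values, including the treatment cost of 0.1, are hypothetical and chosen only for illustration, as are all names in the code.

```python
def expected_utility(outcome_probs: dict, utilities: dict) -> float:
    """Probability-weighted average of the utilities of each outcome."""
    return sum(prob * utilities[outcome] for outcome, prob in outcome_probs.items())


# Outcome probabilities for each action (trial recurrence proportions).
outcome_probs = {
    "treat": {"recurrence": 0.12, "no_recurrence": 0.88},
    "no_treat": {"recurrence": 0.41, "no_recurrence": 0.59},
}

# Hypothetical utilities: avoiding recurrence is worth 1, recurrence is
# worth 0, and treatment carries an assumed cost of 0.1 in the same units.
utilities = {
    "treat": {"recurrence": -0.1, "no_recurrence": 0.9},
    "no_treat": {"recurrence": 0.0, "no_recurrence": 1.0},
}

eu = {a: expected_utility(outcome_probs[a], utilities[a]) for a in outcome_probs}
best_action = max(eu, key=eu.get)
print(eu, best_action)
```

Under these assumed utilities the action with the larger expected utility is to treat; a patient with a higher personal cost of treatment could reach the opposite decision from the same probabilities.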
More formally, consider the set of actions
The information available about θ is denoted by
The expected utility for each potential action,
The best action is the
Note that we may rearrange the equation, and integrate
where
Combining Risk Prediction and Treatment Benefit
We develop the prediction–treatment CB criterion. To ease interpretation, we describe development in terms of the adenoma chemoprevention example. For this development, we assume that a model is available to predict the probability of adenoma recurrence. This model accounts for differences in baseline risk associated with patient-specific covariates, and for differences in risk associated with chemopreventive treatment. In the next section, we describe one modeling approach to predict heterogeneous probability of recurrence.
For each person, we estimate the reduction in probability of adenoma recurrence associated with DFMO treatment. Let
We also posit a benefit of avoiding disease recurrence:
A standard decision analysis result (Ashby and Smith, 2000) says to treat only if
We define the
Side effects should also be considered under uncertain outcomes, and their risk modeled. For simplicity we consider them fixed for each patient.
Thus, we treat only if predicted risk reduction is greater than δ
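The treatment rule can be written as a short sketch: δ is taken to be the ratio of treatment cost to the benefit of avoiding recurrence, and treatment is chosen only when the predicted risk reduction exceeds δ. The function names and the numeric cost and benefit values are assumptions made for illustration.

```python
def relative_cost(cost: float, benefit: float) -> float:
    """Cost-to-benefit ratio of treatment (delta); benefit must be positive."""
    if benefit <= 0:
        raise ValueError("benefit of avoiding disease must be positive")
    return cost / benefit


def should_treat(risk_untreated: float, risk_treated: float, delta: float) -> bool:
    """Treat only if the predicted risk reduction is greater than delta."""
    return (risk_untreated - risk_treated) > delta


# Hypothetical patient-specific cost and benefit, in the same utility units.
delta = relative_cost(cost=0.2, benefit=1.0)

# A patient at average placebo risk (0.41) versus a lower-risk patient (0.25),
# both with a predicted risk of 0.12 under treatment.
print(should_treat(0.41, 0.12, delta))  # risk reduction 0.29 exceeds 0.20
print(should_treat(0.25, 0.12, delta))  # risk reduction 0.13 does not
```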
CB
Now consider a fixed risk reduction threshold
Table 2. Treatment decision and outcomes for a specific value of δ.
The table entries
Now, for a fixed δ the expected benefit (expected utility) of the combined treatment and prediction model is
Expected Benefit
Consider this the average benefit per person.
To derive CB, we perform some algebra adding and subtracting (
Expected benefit
Note that the last term is constant for all values of δ, and can be ignored for decision making. Finally, we divide by
For any risk reduction, δ =
Use of CB
The relative cost of treatment, δ, is a useful index to aid treatment decisions. At the indifference threshold, δ may be interpreted as both the relative cost of treatment and (predicted) risk reduction necessary to justify treatment. For treatments with a small relative cost (eg, taking a multivitamin), only a small reduction in risk is needed to accept treatment. Conversely, when the relative cost is high (eg, prophylactic colonectomy), then the risk reduction must be large to justify treatment.
CB can be used to compare different prediction models or rules, as well as the treat ALL and treat NONE decision rules. Prediction models enter CB(δ) through the computed values
We care only about a specific range of δ values for each decision. Better prediction outside that range is not clinically relevant.
CB may be improved by better identification of patients likely to be helped by treatment.
CB is also improved by identifying patients unlikely to benefit from treatment.
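A minimal sketch of how competing rules might be compared at several thresholds follows. It assumes one form of the criterion that is consistent with the decision rule: per person, a treated patient contributes the probability of avoiding disease under treatment minus δ, and an untreated patient contributes the probability of avoiding disease without treatment. This form, the function names, and the five predicted risks are assumptions for illustration, not necessarily the exact expression used in the derivation.

```python
from typing import Sequence


def combined_benefit(p_untreated: Sequence[float], p_treated: Sequence[float],
                     treat: Sequence[bool], delta: float) -> float:
    """Average per-person benefit of a treatment rule at relative cost delta.

    Assumed form: treated patients contribute (1 - p_treated) - delta,
    untreated patients contribute (1 - p_untreated).
    """
    total = 0.0
    for p0, p1, treated in zip(p_untreated, p_treated, treat):
        total += ((1.0 - p1) - delta) if treated else (1.0 - p0)
    return total / len(p_untreated)


def model_rule(p_untreated: Sequence[float], p_treated: Sequence[float],
               delta: float) -> list:
    """Treat only when the predicted risk reduction exceeds delta."""
    return [p0 - p1 > delta for p0, p1 in zip(p_untreated, p_treated)]


# Hypothetical predicted risks for five patients (illustration only).
p0 = [0.25, 0.35, 0.45, 0.60, 0.75]  # predicted risk without treatment
p1 = [0.12] * 5                      # predicted risk with treatment

for delta in (0.05, 0.20, 0.40):
    cb_model = combined_benefit(p0, p1, model_rule(p0, p1, delta), delta)
    cb_all = combined_benefit(p0, p1, [True] * 5, delta)
    cb_none = combined_benefit(p0, p1, [False] * 5, delta)
    print(f"delta={delta:.2f} model={cb_model:.3f} all={cb_all:.3f} none={cb_none:.3f}")
```

In this toy cohort the prediction-based rule coincides with treat ALL at the smallest threshold, because every patient's predicted risk reduction exceeds δ, and it separates from both reference rules as δ grows.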
Predicting a Patient's Risk of Recurrence
We outline our procedure for predicting risk of adenoma recurrence; the details are given in Appendix 1. The goal of the CB criterion is to evaluate the clinical relevance of a prediction model and treatment decisions based on the predictive distributions. The model developed for our adenoma example is intended to illustrate the procedure. It is not intended as an exhaustive analysis of adenoma recurrence.
The primary outcome is adenoma recurrence after three years of follow-up. We model the probability of recurrence using logistic regression with Bayesian model averaging (BMA).21 BMA accounts for uncertainty in the selection of the prediction model, as well as in the model coefficients. This approach has been shown to improve model predictive performance, and appears less prone to overfitting than alternative procedures.
We fit separate models for placebo- and DFMO-treated patients. In each model, potential predictors include patient demographics (age, sex, body mass index [BMI], aspirin use), as well as characteristics of their baseline adenoma. These characteristics include:
Location: proximal or distal colon
Large adenoma (>1 cm)
Number of adenomas
Villous (yes/no)
Potential molecular (PGE2, putrescine, spermidine) and genotypic (
BMA results overview
For patients receiving placebo, the model average fitting summary is shown in Table 3. The second column, Pr(β ≠ 0), sums the posterior probabilities across models that include a given predictor. Unlike
Table 3. Distribution of BMA logistic regression coefficients for placebo patients. Results average over 30 best models retained by BMA.
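The Pr(β ≠ 0) column and the model-averaged prediction can be illustrated with a toy calculation. The four candidate models, their posterior probabilities, and their per-model predicted risks are hypothetical; they are not taken from the fitted models in Appendix 1.

```python
# Each entry: (predictors included in the model, posterior model probability,
# that model's predicted recurrence risk for one patient). All hypothetical.
models = [
    ({"location", "age"}, 0.40, 0.50),
    ({"location"}, 0.35, 0.45),
    ({"age", "large_adenoma"}, 0.15, 0.38),
    (set(), 0.10, 0.41),
]


def inclusion_probability(predictor: str) -> float:
    """Sum the posterior probabilities of all models containing the predictor."""
    return sum(weight for predictors, weight, _ in models if predictor in predictors)


def model_averaged_risk() -> float:
    """Weight each model's prediction by its posterior model probability."""
    return sum(weight * risk for _, weight, risk in models)


print(f"Pr(location in model) = {inclusion_probability('location'):.2f}")
print(f"Pr(age in model)      = {inclusion_probability('age'):.2f}")
print(f"Model-averaged risk   = {model_averaged_risk():.4f}")
```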
Figure 1 shows the posterior predictive probability of recurrence for patients assigned to the placebo group. We observe substantial heterogeneity of recurrence risk ranging from 25% to about 75%. The error bars indicate uncertainty associated with modeling. These regions indicate 66% (black) and 95% (gray) posterior predictive probability. For DFMO-treated patients, none of the predictors has substantial probability of model inclusion. With DFMO treatment, our best prediction is that all patients have about 12% risk of recurrence. This inability to detect important predictors of recurrence is likely because of the small number of recurrences among treated patients (17 of 138). These posterior predictive probabilities will be used in the calculation of the CB criterion. For each patient, they represent our best estimates of

Figure 1. Predicted probability of recurrence for patients with placebo treatment. Center point is the Bayesian model average prediction. Error bars show 66% (black) and 95% (gray) model uncertainty intervals. Orange line denotes the predicted recurrence with DFMO plus sulindac treatment (with 95% credible region).
Results: CB Curves to Assess Prediction
We use the BMA results of the previous section to demonstrate the CB curve method. We use point estimates for disease probabilities and patient fractions
Figure 2 shows the CB curve [CB(δ)] for the BMA prediction model of adenoma recurrence (blue line). Small values of δ correspond to low cost treatments (those with mild side effects), while large values are associated with high treatment cost. The dashed (black) line corresponds to the CB of treating ALL patients, while the horizontal dotted line denotes the benefit of treating NONE. At δ = 0 (no cost of treatment), the benefit of treating ALL vs. NONE is denoted by the vertical distance between lines (0.88 - 0.59 = 0.29). This is the difference in non-recurrence probabilities in treated and placebo arms.

Figure 2. The CB of prediction and treatment (
At treatment cost δ = 0.29 (the observed reduction in recurrence), the treat ALL and treat NONE lines cross. Thus, if the cost of treatment is equivalent to 0.29 adenoma recurrences, there is no net benefit to treating all patients (compared with treating none).
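The reference-line values quoted above can be checked with a few lines of arithmetic. The sketch assumes the treat ALL line is the treated non-recurrence probability minus δ and the treat NONE line is the placebo non-recurrence probability, which matches the values 0.88 and 0.59 at δ = 0; the function names are hypothetical.

```python
P_RECUR_PLACEBO = 0.41  # observed recurrence, placebo arm
P_RECUR_TREATED = 0.12  # observed recurrence, DFMO plus sulindac arm


def cb_treat_all(delta: float) -> float:
    """Assumed treat ALL line: non-recurrence under treatment, less the cost."""
    return (1.0 - P_RECUR_TREATED) - delta


def cb_treat_none() -> float:
    """Assumed treat NONE line: non-recurrence without treatment (flat in delta)."""
    return 1.0 - P_RECUR_PLACEBO


# Vertical distance between the two lines when treatment is free.
gap_at_zero = cb_treat_all(0.0) - cb_treat_none()

# The lines cross where the cost equals the observed risk reduction.
crossing_delta = P_RECUR_PLACEBO - P_RECUR_TREATED

print(f"gap at delta=0: {gap_at_zero:.2f}")
print(f"lines cross at delta = {crossing_delta:.2f}")
```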
The figure shows that for treatment thresholds between 0.13 and 0.50, the BMA prediction provides substantial benefit compared with the treat ALL and treat NONE strategies. This benefit is provided by not treating selected patients with small treatment-related reductions in risk of recurrence.
Note that CB for the BMA prediction and treat ALL strategies coincide for treatment thresholds δ < 0.13. This occurs because the prediction model cannot reliably identify patients with recurrence probabilities less than 0.25 (
What is the relevant range of thresholds (δ) associated with the DFMO plus sulindac treatment?
Figure 3 shows the same CB curves with an approximate range of relevant treatment thresholds. The DFMO plus sulindac treatment may contribute to potentially serious side effects, but these are only weakly indicated by the trial data. Thus, we posit that small-to-moderate reductions in recurrence risk (0.02–0.20) are sufficient to indicate treatment. Note that the BMA prediction model provides only limited benefit at the upper end of this range. Among patients who are most averse to taking a chemopreventive, we may identify a few with low enough baseline risk to justify avoidance of treatment. This indicates that if we wish to improve prediction in this situation, we should focus on patients with low recurrence probabilities.

Figure 3. The relevant threshold region for DFMO plus sulindac treatment is indicated by the orange shaded region. Patients with recurrence risk reduction between 0.02 and 0.20 receive limited benefit with DFMO, and might prefer to avoid chemopreventive treatment. The BMA prediction model is relatively poor at identifying such patients.
We next illustrate how CB can be used to compare different prediction models or rules. Rather than using the full prediction model, suppose we instead choose a risk cut-point and treat all patients exceeding that point. Figure 4 shows the CB curve when that risk probability is 0.40 (approximate frequency of recurrence in the placebo arm). With this simplified rule, we obtain much of the benefit afforded by the full BMA model, and vastly exceed the CB obtained by the treat ALL rule. This benefit is obtained by excusing low-risk patients from treatment. Note that this simple rule fixes each patient's decision threshold at δ = 0.28. 2

Figure 4. CB curve for a fixed decision probability of 0.40. This simpler rule achieves much of the benefit of the full BMA prediction. The equivalent threshold is δ = 0.28.
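The equivalence between the risk cut-point of 0.40 and δ = 0.28 follows from simple arithmetic, sketched below under the assumption stated earlier that every treated patient has a predicted recurrence risk of about 12%. The function name is hypothetical.

```python
def equivalent_threshold(risk_cutpoint: float, risk_treated: float) -> float:
    """Risk reduction implied by treating everyone above a baseline-risk cut-point.

    Assumes a common predicted risk under treatment for all patients.
    """
    return risk_cutpoint - risk_treated


delta_equiv = equivalent_threshold(risk_cutpoint=0.40, risk_treated=0.12)
print(f"equivalent delta = {delta_equiv:.2f}")
```

A patient at exactly the cut-point has a predicted risk reduction of 0.40 - 0.12 = 0.28, so treating everyone above the cut-point is the same as applying the risk reduction rule at that single threshold.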
Finally, Figure 5 compares the performance of the full BMA prediction model with a restricted model that omits adenoma location. This demonstrates how predictions based on different covariates (eg, biomarkers) can be compared. The inclusion of adenoma location provides a modest improvement in predictive performance. However, this improvement is realized primarily at smaller threshold values (δ < 0.33). These smaller thresholds are more relevant for this chemoprevention treatment decision.

Figure 5. CB curves for BMA prediction with all covariates (blue) and for model averaged predictions with adenoma location omitted (restricted model, red). Note that the full model outperforms the restricted model up to δ = 0.33. The two models exhibit similar performance at higher thresholds.
Discussion
Summary
We have developed a criterion that combines a patient's predicted outcomes under different treatment options with consideration of loss associated with the treatment.
Note 2: This threshold (δ = 0.28) is greater than our proposed range of relevant thresholds for this treatment decision. We include it to demonstrate the use of CB.
The CB curve helps us focus on the relevant risk groups by considering only the range of risk reduction that is consistent with the relative cost of treatment. The CB curves can be used to compare different prediction models, the contribution of potential biomarkers to an existing model, and different treatment decision rules.
In our motivating example for chemoprevention of colorectal adenoma, we observe that there is substantial interpatient heterogeneity of recurrence risk among untreated patients. However, over the risk region of interest we are unable to identify patients who would benefit by avoiding treatment. This example demonstrates that clinically beneficial improvements in prediction (eg, new biomarkers) should identify patients with very low risk of recurrence – those who would benefit by avoiding treatment. While not addressed in our example, it would also be useful to identify patients with high risk of experiencing side effects associated with treatment.
For the medical community to fully embrace personalized medicine, we need improved approaches for assessing treatment decisions. These include improvements in
predicting what will happen to individual patients,
evaluating predictive models,
incorporating treatment benefits and consequences, and
understanding patient utilities for outcomes.
The decision analytic approach outlined above demonstrates how these components interact, and that evaluation of individual components in the absence of the others is incomplete. We argue that prediction–decision statistical approaches are more relevant for clinical decision support than
Why not use clinical trials for benefit assessment?
Clinical trials provide a wealth of information about patients with disease or those susceptible to it. In addition, trials include a formal monitoring mechanism to assess outcomes, and to evaluate side effects. As we demonstrate, this information is useful for estimating patient outcome predictive distributions, and is necessary to evaluate clinical benefit (not just treatment efficacy). A slight expansion of current clinical trial protocols would include information about patient utilities. This additional information would allow a more complete picture of the benefits of treatment.
Our societal trend toward personalized medicine indicates that we need more information about “who to treat,” and less focus on “which treatment to use.” Such a shift in perspective would change the focus of clinical trials from drug superiority to one of patient benefit. This seems much more relevant for health care than the usual
Author Contributions
DB and BL conceived the concepts and wrote the first draft of the manuscript. DB, CEM, and BL analyzed the data. DB, EWG, CEM, and BL contributed to the writing, made critical revisions, and approved the final version. All authors reviewed and approved the final manuscript.
