Background
Risk prediction models that objectively quantify the probability of clinically important events based on observable characteristics are critical tools for efficient patient care. A risk prediction model is typically constructed using a development (or training) sample, but before it is adopted for use in a target population, its performance needs to be assessed in an independent (external) validation sample drawn from that population. In examining the appropriateness of a risk model, 2 fundamental aspects are discrimination and calibration. The former refers to the capacity of the model to properly stratify individuals with different risk profiles, and the latter refers to the degree to which predicted risks are close to the true risks.1
The receiver-operating characteristic (ROC) curve and the area under the ROC curve (AUC, or the c-statistic) are classical tools for assessing model discrimination.2 When evaluating a risk prediction model in a sample, the discriminatory performance of the model can be affected by both the distribution of predictor variables (case mix) and the validity of the model in that sample.3 Consequently, when comparing the performance of a model between development and validation samples, differences in case mix between the 2 samples can make comparisons difficult. One aim of the present work is to untangle these 2 sources of discrepancy. Early progress in this area was made by Vergouwe et al.,3 who proposed benchmarks based on simulating responses from predicted risks and fitting the model in the validation sample. More recent work has largely focused on the AUC, an overall summary measure of the ROC curve.4–6
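For concreteness, the c-statistic admits a simple empirical estimate: the proportion of case-control pairs in which the case received the higher predicted risk, with ties counted as 1/2. A minimal sketch (the function name is ours, not from the paper):

```python
import numpy as np

def empirical_auc(pred, y):
    """Empirical AUC (c-statistic): the probability that a randomly chosen
    case (y = 1) receives a higher predicted risk than a randomly chosen
    control (y = 0), counting ties as 1/2 (the Wilcoxon form)."""
    pred, y = np.asarray(pred, float), np.asarray(y, int)
    cases, controls = pred[y == 1], pred[y == 0]
    # Compare every case against every control
    diff = cases[:, None] - controls[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size
```

An AUC of 1 indicates perfect separation of cases and controls; 0.5 is no better than chance.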
Compared with model discrimination, examining model calibration has not received the same level of attention.7,8 Model calibration is often neglected in the evaluation of the overall performance of risk prediction models, so much so that it has been called “the Achilles’ heel of predictive analytics.”9 In the context of a logistic model for binary responses, Van Calster et al.10 proposed a hierarchy of definitions for model calibration. In particular, a model is “moderately calibrated” if the average observed risk across all subjects with a given predicted risk equals the predicted risk. Moderate calibration is contrasted with “mean” calibration, in which the expected values of the predicted and true risks are equal; with “weak” calibration, in which a linear calibration plot has an intercept of 0 and a slope of 1; and with “strong” calibration, in which predicted and observed risks are equal for all covariate patterns (an unrealistic condition in practical situations).10 Moderate calibration is typically assessed using the calibration plot, which shows the average value of the observed risk as a function of the predicted risk after grouping or smoothing response values.
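As an illustration of the grouped calibration plot just described, a minimal sketch (the function name and the equal-sized-group scheme are ours; smoothing-based variants are not reproduced):

```python
import numpy as np

def calibration_points(pred, y, groups=10):
    """Points of a grouped calibration plot: split subjects into equal-sized
    groups by rank of predicted risk, then pair each group's mean predicted
    risk with its observed event rate. Under moderate calibration the
    points should lie near the 45-degree line."""
    pred, y = np.asarray(pred, float), np.asarray(y, int)
    order = np.argsort(pred)  # ranks of predicted risk
    return [(pred[g].mean(), y[g].mean()) for g in np.array_split(order, groups)]
```

Plotting these pairs against the identity line gives the usual grouped calibration plot.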
In this work, we propose a model-based ROC (mROC) analysis. We show that the mROC connects ROC analysis, a classical means of evaluating model discrimination, to model calibration. With the help of a stylized example, we demonstrate how the mROC enables investigators to disentangle the effect of case mix and model validity on the shape of the ROC curve. We propose a novel statistical test for the assessment of model calibration that does not require specification of smoothing or grouping factors and evaluate its performance through simulation studies. Through a case study, we put the developments in a practical context.
Notation and Context
Our main interest is in the “external validation” context, in which a previously developed risk prediction model for a binary outcome is applied to a new independent (external) sample to examine its performance in that sample’s target population. The risk prediction model is given by the deterministic function
Empirical ROC Curve
Two fundamental probability distributions underlie the ROC curve: the distribution of predicted risks among individuals who experience the event (positive individuals, or cases) and among individuals who do not experience the event (negative individuals, or controls). Let
The true-positive and false-positive probabilities are closely linked with the distribution of risk among the positive and negative individuals, respectively:
where
With the external data set, consistent estimators for
and
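In concrete terms, the empirical ROC curve is traced by sweeping a threshold over the observed predicted risks and recording, at each threshold, the fraction of cases and the fraction of controls at or above it. A minimal sketch (names are ours):

```python
import numpy as np

def empirical_roc(pred, y):
    """Empirical ROC: for each threshold c taken at the observed predicted
    risks (plus infinity, to anchor the curve at the origin), TPR(c) is the
    fraction of cases with risk >= c and FPR(c) the fraction of controls
    with risk >= c."""
    pred, y = np.asarray(pred, float), np.asarray(y, int)
    cases, controls = pred[y == 1], pred[y == 0]
    thresholds = np.r_[np.inf, np.sort(np.unique(pred))[::-1]]
    tpr = np.array([(cases >= c).mean() for c in thresholds])
    fpr = np.array([(controls >= c).mean() for c in thresholds])
    return fpr, tpr
```

The curve starts at (0, 0) for the infinite threshold and ends at (1, 1) at the smallest observed risk.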
mROC Curve
The
Let
and
The application of Bayes’s rule leads to the following estimators in the external sample:
and
Hence, one can generate a “model-based” ROC or
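A minimal sketch of this construction, under our reading of the Bayes'-rule step: if the predicted risks are correct, each subject contributes to the case distribution with weight equal to its predicted risk and to the control distribution with the complementary weight, so the curve can be generated from the predicted risks alone, without outcome data (names are ours):

```python
import numpy as np

def mroc_curve(pred):
    """Model-based ROC (mROC): the ROC expected if the predicted risks are
    correct. Each subject is a case with probability pi and a control with
    probability 1 - pi, so the risk distribution among cases weights each
    subject by pi, and among controls by 1 - pi. Outcomes are never used:
    the curve depends only on the predicted risks in the validation sample."""
    p = np.asarray(pred, float)
    thresholds = np.r_[np.inf, np.sort(np.unique(p))[::-1]]
    tpr = np.array([(p * (p >= c)).sum() / p.sum() for c in thresholds])
    fpr = np.array([((1 - p) * (p >= c)).sum() / (1 - p).sum() for c in thresholds])
    return fpr, tpr
```

Because no outcomes enter the computation, the mROC reflects the case mix of the validation sample combined with the association carried by the model.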
Connection Between the mROC Curve, Case Mix, and Model Calibration
The limiting forms (population equations) of the estimated CDFs
Unlike in the expression of
Consider the mROC and empirical ROC curves in the validation sample when examining the external validity of a model. The former carries the association between the predictors and outcome from the development sample through the prediction model, whereas the latter captures such association in the validation sample. However, both are based on the case mix in the validation sample. Because of the shared case mix, discrepancies between these curves point toward model miscalibration in the validation sample. This can be demonstrated using a stylized example: Consider a single predictor

Empirical receiver-operating characteristic (ROC; black) and model-based ROC (mROC; red) curves for the stylized example. *Distribution of the single predictor in the validation population: X ∼ Normal(µ = 0, σ = 0.5). †Association model in the validation population: P(Y = 1) = 1/(1 + exp(−X/2)). ‡Predictor distribution same as in panel B, and association model same as in panel C.
mROC as the Basis of a Novel Statistical Test for Model Calibration
Although moderate calibration is a sufficient condition for the convergence at all points of the empirical ROC and mROC curves, moderate calibration on its own might not be a necessary condition for such convergence. To progress, in Supplementary Material section 4 we show that at the population level, the equivalence of ROC and mROC curves guarantees moderate calibration if an additional condition is imposed. This condition is mean calibration, i.e.,
Based on this finding, we propose a statistical inference procedure. We define the null hypothesis (
These hypotheses jointly provide the necessary and sufficient conditions for the risk prediction model to be calibrated.
For
For
Given that both
The null distributions of both
where
where
Individually, the 2 statistics provide insight about the performance of the model. However, it is more desirable to obtain a single overall
would have a chi-square distribution with 4 degrees of freedom. However, as the 2 statistics are generated from the same data, they are dependent. An adaptation of Fisher’s method for dependent
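The simulation-based machinery behind such a test can be sketched as follows. The statistic definitions here are simplified stand-ins (mean predicted-minus-observed risk for the mean-calibration statistic, and an AUC distance between the empirical and model-based curves for the ROC-equality statistic); the paper's exact statistics and its dependent-Fisher combination are not reproduced:

```python
import numpy as np

def mean_calib_stat(pred, y):
    # Stand-in for the mean-calibration statistic: mean predicted minus observed risk
    return np.mean(pred) - np.mean(y)

def roc_equality_stat(pred, y):
    # Stand-in for the ROC/mROC distance: |empirical AUC - model-based AUC|
    cases, controls = pred[y == 1], pred[y == 0]
    emp = (cases[:, None] > controls[None, :]).mean()
    w1, w0 = pred / pred.sum(), (1 - pred) / (1 - pred).sum()
    mod = (w1[:, None] * w0[None, :] * (pred[:, None] > pred[None, :])).sum()
    return abs(emp - mod)

def calibration_pvalues(pred, y, n_sim=500, seed=0):
    """Build null distributions for both statistics by simulating outcomes
    from the predicted risks (under H0 the model is calibrated, so
    Y* ~ Bernoulli(pi)), and return one p-value per statistic."""
    rng = np.random.default_rng(seed)
    a_obs, b_obs = abs(mean_calib_stat(pred, y)), roc_equality_stat(pred, y)
    a_null, b_null = [], []
    for _ in range(n_sim):
        y_sim = rng.binomial(1, pred)
        if 0 < y_sim.sum() < len(y_sim):  # need both cases and controls
            a_null.append(abs(mean_calib_stat(pred, y_sim)))
            b_null.append(roc_equality_stat(pred, y_sim))
    return (np.mean(np.array(a_null) >= a_obs),
            np.mean(np.array(b_null) >= b_obs))
```

The two p-values would then be combined into the unified statistic with an adjustment for their dependence, as described in the text.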
Simulation Studies
We performed simulation studies to evaluate the finite-sample properties of the proposed test and compare its performance against the conventional Hosmer-Lemeshow and likelihood ratio tests of model calibration. We modeled a single predictor
In the second set, the true risk model remained the same as above, and we modeled nonlinear miscalibrations as

Relationship between predicted (
We calculated the power of the mean calibration test, the mROC/ROC equality test, the unified test, the Hosmer-Lemeshow test (based on decile groups), and the likelihood ratio test in detecting miscalibration at the 0.05 significance level. Following recommendations on objectively deciding on the number of simulations,17 we obtained the results through 2,500 Monte Carlo iterations, such that the maximum SE around the probability of rejecting H0 would be 0.01. Within each iteration,
Results of the first set of simulations are provided in Supplementary Material section 6. The power of the unified test was very close to that of the likelihood ratio test across all scenarios examined. Figure 3 provides the empirical ROC and mROC curves for the second set of simulations. As all the mappings from

Receiver-operating characteristic (ROC; black) and model-based ROC (mROC; red) curves for the second simulation scenario. The panels positionally correspond to the calibration plots and simulation parameters presented in Figure 2. The ROC curves approximate the population-level curves as they are based on a large sample size (10,000 simulated observations). The area under the ROC curve is 0.740 in all scenarios. ROC, receiver operating characteristic; B, ROC equality statistic; mAUC, area under the model-based ROC curve.
The performances of all tests are summarized in Figure 4. The middle panel on the top row, where

Probability of rejecting the null hypothesis for the mean calibration (white bar), receiver-operating characteristic (ROC) equality (gray bar), unified (orange bar), Hosmer-Lemeshow (dark blue bar), and likelihood ratio (light blue bar) tests. The panels positionally correspond to the calibration plots and simulation parameters presented in Figure 2. Results are based on 2,500 simulations for each scenario.
Application
Chronic obstructive pulmonary disease (COPD) is a common chronic disease of the airways. Periods of intensified disease activity, referred to as exacerbations, are an important feature of the disease. Individuals vary widely in their tendency to exacerbate.19 Predicting who is likely to experience an exacerbation, especially a severe one that will require hospital admission, will provide opportunities for preventive interventions.20
We used data from the MACRO21 and STATCOPE22 trials, 2 clinical trials in COPD patients with exacerbations as the primary outcome, to develop and validate, respectively, a risk prediction model for the occurrence of COPD exacerbations in the first 6 mo of follow-up. Baseline characteristics of both samples are provided in Table 1.
Baseline Characteristics and Outcomes for MACRO and STATCOPE Samples
IQR, interquartile range; SGRQ, St. George Respiratory Questionnaire; FEV1, forced expiratory volume at 1 s; LABA, long-acting beta agonist; LAMA, long-acting antimuscarinic agent.
Of note, these data have previously been used for a more sophisticated prediction model.23 Here, we focus on a simpler approach, as the nuances of model development are beyond the scope of this work. We used a logistic regression model based on the data from the MACRO trial that included the predictors listed in Table 1, chosen a priori based on prior knowledge of possible association with the outcome. We considered 2 outcomes, all exacerbations and severe exacerbations, and developed 2 distinct models. The regression coefficients for both models are provided in Table 2. The study was approved by the University of British Columbia and Providence Health Research Ethics Board (H11–00786).
Regression Coefficients for the Risk Prediction Models (Based on the MACRO Sample) for All and Severe Exacerbations
SE, standard error; SGRQ, St. George Respiratory Questionnaire; FEV1, forced expiratory volume at 1 s; LABA, long-acting beta agonist; LAMA, long-acting anti-muscarinic agent.
We included a coefficient for randomized treatment (azithromycin), but it was set to 0 for prediction (as the model is applicable to those who are not on preventive therapy, and none of the individuals in the validation sample were on such a therapy).
Log-odds ratios are for a 1-unit increase in continuous variables.
Figure 5 provides the empirical ROC curve from the development sample (MACRO) as well as the empirical ROC and mROC curves from the validation sample (STATCOPE) and the calibration plot for both outcomes. For all exacerbations, the mROC curve was very close to the development empirical ROC curve but not to the external empirical ROC curve. This indicates that the reduction in the discriminatory performance of the model in the validation sample is due to miscalibration. Indeed, both components of the proposed test indicated a departure from calibration. The mean calibration test produced

The empirical ROC curves from the MACRO development (blue) and STATCOPE validation (black) samples, the mROC curve from the STATCOPE validation sample (red; left panels), and the calibration plots (right panels). AUCdev, area under the curve (c-statistic) in the development sample; AUCval, area under the curve (c-statistic) in the validation sample; mAUC, area under the model-based ROC curve; An, mean calibration statistic; Bn, ROC equality statistic.
The model for severe exacerbations had higher discriminatory power. All 3 ROC curves were generally aligned with each other. The mean calibration test produced
Discussion
This article introduces the model-based ROC (mROC) curve: the ROC curve that should be expected if the model is at least moderately calibrated in an external validation sample. We showed that moderate calibration is a sufficient condition for the convergence of the empirical ROC and mROC curves. We extended these results by proving that, together, mean calibration and the equivalence of the mROC and ROC curves in the population are sufficient conditions for the model to be moderately calibrated. To test for such equivalences within a sample, we suggested a simulation-based test. Our simulations empirically verified the postulated properties of this novel test. To the best of our knowledge, this is the first time that the ROC plot, a classical means of communicating model discrimination, has been connected to model calibration. We have implemented the proposed methodology in an R package, available from https://github.com/resplab/predtools/.
Previous investigators have suggested using case mix–corrected performance metrics in judging the external validity of a model. Vergouwe et al.3 proposed a general approach for calculating different model-based metrics by simulating responses from predicted risks in the validation sample and comparing the resulting metrics with the empirical ones in the validation sample. Van Klaveren et al.5 focused on one such metric, the c-statistic, and developed closed-form estimators that quantify the expected change in a model’s discriminative ability due to case-mix heterogeneity. Our methodology extends such work to the entire ROC curve and, in doing so, establishes a connection between mROC/ROC equality and model calibration that enables formal statistical inference on moderate calibration. The test classically associated with calibration plots is the Hosmer-Lemeshow test, which is criticized for its sensitivity to the grouping of the data and its lack of information about the direction of miscalibration.24 Our proposed test is free from arbitrary grouping of the data or the choice of smoothing factors. Given the shortcomings of the Hosmer-Lemeshow test, alternative inferential techniques for evaluating model calibration have been proposed. Allison24 reviewed the measures of fit of logistic regression models and categorized them as indices of predictive power (such as
These developments can be used in practice in different ways. Steyerberg and Vergouwe14 proposed an “ABCD” approach for external validation of a model (where A is mean calibration; B, calibration slope; C, c-statistic; and D, decision curve analysis). The B step in their approach can be replaced with the mROC’s B statistic, which, together with the A step (the same as the A step in the unified test), will test moderate calibration, the most desired form of calibration, as opposed to the weak calibration tested via the calibration slope.10 Further, if the research involves simultaneous model development and external validation, drawing the empirical ROC curves from both samples alongside the mROC curve in the validation sample will provide visual information on the causes of difference in the performance of the model between the 2 samples (as demonstrated in our case study). Incompatibility between the mROC and empirical ROC in the validation sample will rule out moderate calibration. Conversely, while agreement between the 2 curves does not rule in moderate calibration per se, it does so provided that mean calibration (calibration-in-the-large) is achieved. This visual interpretation can be augmented with formal hypothesis testing using the proposed unified statistic. Such comparisons can also be made for subgroups within the sample, although multiple hypothesis testing should be controlled for in such circumstances. Even when investigators are not planning to produce ROC curves, the proposed test for moderate calibration can be reported independently. This can complement scalar metrics that measure the degree of miscalibration but are not based on formal hypothesis testing, such as Harrell’s Emax,26 Austin and Steyerberg’s Integrated Calibration Index,27 and Van Hoorde et al.’s Estimated Calibration Index.28
There are several ways the proposed methodology can be extended. The ROC curve has been extended to categorical29 as well as to time-to-event data,30,31 and similar developments can also be pursued for the mROC methodology. Development of inferential methods that would not require Monte Carlo simulations can also be of potential value. As the ROC curve can be interpreted as a CDF,11 nonparametric statistics based on the distance between CDFs can conceivably be developed to test the equivalence of mROC and ROC curves. However, the calculation of the simulation-based
One of the promises of precision medicine is to empower patients in making informed decisions based on their specific risk of outcomes.33 Basing medical decisions on miscalibrated predictions can be harmful. Our contribution is the development of mROC analysis, a simple method for separating the effect of case mix and model miscalibration on the ROC curve and for inference on model calibration. Recent arguments and counterarguments indicate that the methodological research community is divided in its opinion on the utility of ROC curves in the assessment of risk prediction models.34,35 ROC curves, however, remain a widely adopted tool among applied researchers in understanding and communicating the discriminatory performance of such models. The mROC methodology adds to the utility of ROC curves by enabling the examination of model calibration using the ROC plot. These developments can result in more attention to model calibration as an often-neglected but crucial aspect in the development of risk prediction models.
Supplemental Material
Supplemental material (sj-docx-1-mdm-10.1177_0272989X211050909) for “Model-Based ROC Curve: Examining the Effect of Case Mix and Model Calibration on the ROC Plot” by Mohsen Sadatsafavi, Paramita Saha-Chaudhuri, and John Petkau, in Medical Decision Making.