Decision models are routinely used in health economics and public health for the purpose of evaluating the harms and benefits of competing health interventions. The results of these models may be used to inform clinical guidelines, optimize health care resources, or guide public health policies. 1 An individual-level state transition simulation model (IL-STM) is a type of decision model in which individual life histories are simulated through multiple health states. These models are flexible enough to be used for a wide range of policy evaluations in complex decision problems. However, reflecting individual characteristics comes at a cost of increased model complexity, resulting in a high number of model parameters. In addition, it may be necessary to simulate thousands or even millions of life histories to minimize stochastic uncertainty and obtain stable model outcomes.
Several tasks could then easily become computationally expensive, namely, 1) calibration, due to a large number of required IL-STM runs; 2) policy evaluation, if many treatment regimens or screening policies are evaluated2,3; and 3) uncertainty quantification, in the form of a probabilistic analysis (PA) 1 (also known as probabilistic sensitivity analyses) or value-of-information analyses (VOI). 4 For instance, Rutter et al. 5 performed 180,000 model runs (2 million life histories) to calibrate their model parameters using a Markov chain Monte Carlo algorithm, van Hees et al. 3 performed 19,200 model runs (10 million life histories) for policy analyses of colorectal cancer screening, and VOI may require an analyst to run the model more than 100,000 times. 6
This has led to the development of metamodels or emulators, regression models for the relationship between the IL-STM parameters and outcomes, which are computationally inexpensive to train and run.7,8 The main purpose of metamodeling is to substantially decrease the amount of computation time needed to perform calculations with the IL-STM by replacing it with a fast approximation (metamodel), which requires a small number of IL-STM runs to train. Previous studies in health economics claimed that metamodels could reduce the computational expense of the analyses between 85% 9 and more than 99%, 10 and this has been named as a “key area for further research work” in 2018 by the Second Panel on Cost-effectiveness in Health and Medicine. 11 Emulators have been applied for multiple tasks within decision modeling in health economics, including model calibration, PA, VOI, inference of parameter influence, and resource optimization9,12–14; however, they are still not routinely used. 8
The routine use of emulators is hindered by a lack of guidance and training on how to perform emulation in an efficient way. 11 A key step of metamodeling is the choice of statistical model. Common choices include linear regression and Gaussian process regression (GP). GP is widely used in the engineering literature, as it can handle multiple shapes of smooth curves. On the other hand, GP has the limitation 15 that model fitting becomes slow when the number of parameters is relatively large (>15). These facts motivated the use of alternative nonlinear statistical models, including generalized additive models (GAMs) and artificial neural networks (ANNs).12,16 However, there is not yet any formal comparison between these statistical models, and hence, researchers may adopt a suboptimal type of metamodel, compromising the runtime and/or accuracy of their analyses.
The goal of this study is to provide guidance on choosing an accurate metamodel with a fast computation time for uncertainty quantification using R. In particular, we focus on PA, which reflects the uncertainty in IL-STM outcomes caused by uncertainty in IL-STM parameters by repeatedly drawing parameter values from relevant sampling distributions with Monte Carlo simulation and generating an empirical distribution for the IL-STM outcomes.
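To make the PA procedure concrete, the sketch below draws parameter values repeatedly from their sampling distributions and collects the model outcomes into an empirical distribution. It is written in Python for illustration (the paper's implementations are in R), and the simulator function, parameter values, and distributions are all invented toy stand-ins for a real IL-STM.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for an IL-STM: maps a parameter vector to a single outcome.
def simulator(theta):
    return 10.0 * theta[0] - 2.0 * theta[1]

# PA: repeatedly draw parameter values from their sampling distributions
# and collect the model outcomes into an empirical distribution.
draws = 1000
outcomes = np.empty(draws)
for i in range(draws):
    theta = np.array([rng.normal(1.0, 0.1),   # e.g., a transition hazard
                      rng.normal(2.0, 0.2)])  # e.g., a treatment effect
    outcomes[i] = simulator(theta)

# Summaries of the empirical outcome distribution (mean, 95% interval).
mean = outcomes.mean()
lo, hi = np.percentile(outcomes, [2.5, 97.5])
```

In a metamodeling workflow, the expensive `simulator` call inside the loop is what gets replaced by the fitted metamodel.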
We first survey previously used metamodels in health economics and public health and corresponding R packages. We then evaluate each R package with respect to the computational speed to fit the metamodel and prediction accuracy for PA in a simulation study. We investigate the role of IL-STM characteristics, related to the level of parameter uncertainty (size of the 95% confidence interval or uncertainty interval), number of model runs, number of simulated life histories, low/high rate of health state transitions, and Markov/semi-Markov assumption, on the choice of the most accurate model. Finally, we apply the metamodels in a case study concerning a PA for a cost-effectiveness analysis comparing 2 treatments for stage I non–small-cell lung cancer. 17
Methods
Metamodeling Steps for Uncertainty Quantification
In Supplementary Figure S1, we show the steps of metamodeling for uncertainty quantification.
Training data requirements
Before starting the analysis, there are at least 2 prerequisites related to minimizing the amount of first-order uncertainty (stochastic uncertainty) to ensure a reasonable metamodel prediction accuracy: 1) the number of simulated life histories by the IL-STM should be large enough to obtain stable outcomes with low simulation error, and 2) in each simulation run, common random numbers should be used, that is, each simulation run should use the same seed number for the random generator. Each parameter included in the analyses should have a corresponding sampling distribution.9
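The role of common random numbers can be illustrated with a toy stochastic model (a minimal Python sketch; `run_model` and its hazard parameter are hypothetical). Reusing the seed makes two runs at identical parameter values exactly reproducible, so differences between runs reflect parameter changes rather than simulation noise.

```python
import numpy as np

# Toy stochastic "microsimulation": mean exponential survival time
# over n_hist simulated life histories with a given hazard rate.
def run_model(rate, n_hist, seed):
    rng = np.random.default_rng(seed)            # common random numbers:
    times = rng.exponential(1.0 / rate, n_hist)  # same seed -> same noise
    return times.mean()

# With a common seed, two runs at the same parameter value are identical,
# and a run at a different parameter value differs only because of the
# parameter, not because of a different random-number stream.
a = run_model(0.10, n_hist=10_000, seed=42)
b = run_model(0.10, n_hist=10_000, seed=42)
c = run_model(0.12, n_hist=10_000, seed=42)
```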
To reduce the computational expense associated with generating the training data or fitting the metamodel, the researcher could use prior knowledge about the model behavior and/or variable selection techniques to select the model parameters with the strongest effect on the model outcomes and leave out of the metamodel the parameters with little to no impact.18,19 Examples of variable selection techniques include, among others, running a linear regression and selecting only the variables with a statistically significant coefficient.
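As an illustrative sketch of this linear-regression approach to variable selection (Python; the toy data-generating model and the 5% significance threshold are assumptions for illustration, not the paper's procedure), one can fit an ordinary least squares model and retain only the parameters with significant coefficients:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Training data: 100 runs of a toy model with 5 parameters, of which
# only the first two actually influence the outcome.
n, p = 100, 5
X = rng.uniform(0.0, 1.0, size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 0.05, n)

# Ordinary least squares with an intercept.
Xd = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
resid = y - Xd @ beta
sigma2 = resid @ resid / (n - p - 1)
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xd.T @ Xd)))

# Keep only parameters whose coefficient is significant at the 5% level.
tvals = beta[1:] / se[1:]
pvals = 2 * stats.t.sf(np.abs(tvals), df=n - p - 1)
selected = np.flatnonzero(pvals < 0.05)
```

The influential parameters (here the first two) are recovered, and the metamodel is then fit only to the selected columns.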
Generate training data
Let x_i = (x_{i1}, ..., x_{ip}) denote the ith sampled vector of IL-STM parameter values, i = 1, ..., n, drawn from the relevant parameter space (e.g., by Latin hypercube sampling), and let y_i be the IL-STM outcome obtained by running the simulation model at x_i. The pairs (x_i, y_i) constitute the training data for the metamodel.
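A minimal sketch of generating such training data (Python; the two parameter ranges and the toy outcome function are invented for illustration), using a hand-rolled Latin hypercube design:

```python
import numpy as np

def latin_hypercube(n, ranges, rng):
    """Draw n points from a Latin hypercube over the given parameter ranges:
    each dimension is split into n equal strata, with one point per stratum."""
    p = len(ranges)
    strata = rng.permuted(np.tile(np.arange(n), (p, 1)), axis=1).T  # (n, p)
    u = (strata + rng.uniform(size=(n, p))) / n                     # in [0, 1)
    lo = np.array([r[0] for r in ranges])
    hi = np.array([r[1] for r in ranges])
    return lo + u * (hi - lo)

rng = np.random.default_rng(7)
ranges = [(0.01, 0.10), (0.5, 2.0)]   # e.g., a hazard and a hazard ratio
X_train = latin_hypercube(20, ranges, rng)

# Each design point is then fed to the IL-STM (a toy stand-in here)
# to produce the corresponding training outcome.
y_train = np.array([x[0] * x[1] for x in X_train])
```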
Choice of regression model for the emulator
The relationship between simulator input parameters and outcomes may be characterized by a high degree of nonlinearity and possibly multiple interactions and correlations between model parameters. However, for a PA, we are usually interested in only a small subset of the whole parameter space, based on the distribution of each model parameter. If the associated confidence intervals are relatively small and/or if the relationship between model parameters and outcomes is relatively simple, then even linear regression could be a good approximation to the effect of model inputs on the outcomes. Alternatively, we may obtain a better fit by using nonlinear models. For instance, GP regression is a flexible model that makes metamodel predictions by interpolating between design points, GAMs allow for nonlinear relationships by including smooth functions of model parameters, and ANNs allow for arbitrary complex nonlinear relationships.
In Table 1, we review previously used statistical models as metamodels in health economics. Our search strategy consisted of combining the results of a February 2018 review on metamodeling in which one of the coauthors participated (H.K.) 8 and our own literature search on PubMed to capture articles published between 2018 and January 2020 and/or studies that were missed in the review. We found 5 studies9,21–24 of 300 PubMed search results (search took place in March 2020; see Supplementary Figure S2) in addition to the 13 studies found in Degeling et al. 8 The most common choices in health economics are linear regression (LM) and GP regression with a squared exponential covariance matrix. 8 Alternative choices include GAMs,16,25 ANNs,12,14 GP using Matern and rational quadratic covariance matrices, 22 and symbolic regression. 26 Although symbolic regression is valuable because it does not assume any prior model structure, it is relatively more difficult to implement and therefore was excluded from this study (Table 1).
Review of Statistical Models Used as a Metamodel in Health Economics and Corresponding R Packagesa,b
This review is based on Degeling et al. 8 Some studies used multiple statistical models as a metamodel in health economics (linear model, neural networks, and Gaussian process regression), including Cevik et al. (2015), 12 Tappenden et al. (2004), 45 and Alam and Briggs (2019). 21 In Tappenden et al., 45 the metamodels were used for value-of-information analyses. In Cevik et al., 12 the metamodels were used for calibration. In the study by Alam and Briggs, 21 the metamodels were used for sensitivity analyses.
While 2 R packages were found for this model, 1 has been removed from the main R package repository (last checked April 2020; rgp) and the other (gramEvol) would require an additional R package to fit a proper metamodel. 28
“Classical” models for emulation
Linear regression and GP
A simple metamodel could be built by assuming that the model outcome of interest is well approximated by a linear function of the model parameters,

y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i,

where y_i denotes the IL-STM outcome for the ith sampled parameter vector, x_{ij} the jth model parameter, \beta_0, ..., \beta_p the regression coefficients, and \varepsilon_i an error term.
However, the linear model could be a poor approximation of the relationship between the model parameters and the outcome. GP regression instead treats the outcome as a realization of a Gaussian process,

y_i = f(x_i) + \varepsilon_i, \qquad f \sim \mathcal{GP}\big(m(\cdot), k(\cdot, \cdot)\big),

where m(\cdot) is a mean function and k(\cdot, \cdot) a covariance (kernel) function. With the commonly used squared exponential covariance,

k(x, x') = \sigma^2 \exp\!\left(-\sum_{j=1}^{p} \frac{(x_j - x'_j)^2}{2\theta_j^2}\right).

The key parameter here is the vector of length scales \theta = (\theta_1, ..., \theta_p), which determines how quickly the correlation between outcomes decays as parameter values move apart in each dimension.
The R packages used in this study contain different implementations of GP (Table 1).
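For illustration, the sketch below uses Python's scikit-learn rather than the R packages evaluated in the paper: it fits a linear metamodel and a GP metamodel with a squared exponential (RBF) kernel, one length scale per parameter, to a smooth toy simulator and compares their prediction errors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

rng = np.random.default_rng(3)

# Training data: 50 runs of a smooth, noiseless toy "simulator"
# with 2 parameters (invented for illustration).
X = rng.uniform(0.0, 1.0, size=(50, 2))
y = np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2

# Linear metamodel.
lm = LinearRegression().fit(X, y)

# GP metamodel with a squared exponential (RBF) kernel and one
# length scale per parameter, as is common in emulation.
kernel = ConstantKernel(1.0) * RBF(length_scale=np.ones(2))
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

# Predictions at new parameter values.
X_new = rng.uniform(0.0, 1.0, size=(200, 2))
y_new = np.sin(3.0 * X_new[:, 0]) + X_new[:, 1] ** 2
rmse_lm = np.sqrt(np.mean((lm.predict(X_new) - y_new) ** 2))
rmse_gp = np.sqrt(np.mean((gp.predict(X_new) - y_new) ** 2))
```

On this smooth nonlinear toy function, the GP interpolates between design points and clearly outperforms the linear metamodel.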
Alternative choices for emulator/metamodel
Generalized additive model
Another way to extend the linear regression would be to consider a GAM,

y_i = \beta_0 + \sum_{j=1}^{p} f_j(x_{ij}) + \varepsilon_i,

where each f_j is a smooth function (e.g., a spline) of the jth model parameter, estimated from the training data.
Artificial neural networks
ANN is a nonlinear regression model suitable for modeling highly complex nonlinear relationships. Each ANN consists of a combination of neurons, layers, and an activation function.39
In this study, we consider only ANNs with a single hidden layer,

\hat{y} = b_0 + \sum_{k=1}^{K} w_k \, g\!\left(b_k + \sum_{j=1}^{p} w_{jk} x_j\right),

where K is the number of neurons in the hidden layer, g is the activation function, and the weights w and biases b are estimated during training.
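A single-hidden-layer ANN metamodel of this form can be sketched with scikit-learn's MLPRegressor (Python for illustration; the layer size, activation, and optimizer are illustrative choices, not the paper's tuned hyperparameters).

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(11)

# Toy training data from a smooth 2-parameter "simulator".
X = rng.uniform(0.0, 1.0, size=(200, 2))
y = np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2

# Single hidden layer of 16 logistic neurons; inputs are standardized
# first, which usually helps the optimizer converge.
ann = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(16,), activation="logistic",
                 solver="lbfgs", max_iter=5000, random_state=0),
).fit(X, y)

# Prediction error at new parameter values.
X_new = rng.uniform(0.0, 1.0, size=(200, 2))
y_new = np.sin(3.0 * X_new[:, 0]) + X_new[:, 1] ** 2
rmse = np.sqrt(np.mean((ann.predict(X_new) - y_new) ** 2))
```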
Metamodel validation
Before using the metamodel, the analyst should verify whether it produces an accurate prediction of the IL-STM outcome. One could validate the metamodel by using a training-test set approach, in which the decision model is run for an additional set of parameter values and the metamodel predictions are compared with the corresponding IL-STM outcomes, or by using cross-validation on the training data.
Simulation Study: Prediction Accuracy and Computational Speed of Statistical Models and Their Corresponding R Packages
Aims
We evaluate the accuracy of metamodels when the goal is to carry out a PA, given different IL-STM characteristics. We also evaluate the computation time of each R package used, to prevent usage of inefficient packages (i.e., with a computation time much longer than other packages with similar or higher accuracy; Table 1).
Data-generating process: IL-STM
We build a simple continuous-time microsimulation multistate model with 5 health states and 10 parameters in total (see Supplementary Table S1). The health states correspond to 1, healthy; 2, mild disease; 3, severe disease; 4, disease-specific death; 5, other-cause death. The main outcome is life-years gained due to treatment. This is obtained by running the IL-STM with and without an effect of treatment. The simulation model is governed by a matrix of transition hazards between the health states; transitions to other-cause death can occur from any of the living health states.
The range for the simulation model parameter values is shown in Supplementary Table S1.
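The ingredients above can be sketched as a miniature continuous-time microsimulation (Python for illustration; the state structure follows the 5 states described here, but all hazard values, the treatment hazard ratio, and the time horizon are invented, not the values in Supplementary Table S1).

```python
import numpy as np

# States: 0 healthy, 1 mild, 2 severe, 3 disease death, 4 other-cause death.
# Competing exponential hazards; treatment multiplies progression hazards.
def simulate_life_years(h12, h23, h3d, h_other, hr_treat, n_hist, seed,
                        horizon=100.0):
    rng = np.random.default_rng(seed)   # common random numbers via the seed
    total = 0.0
    for _ in range(n_hist):
        t, state = 0.0, 0
        while t < horizon and state < 3:          # states 3, 4 are absorbing
            if state == 0:
                rates = {1: h12 * hr_treat, 4: h_other}
            elif state == 1:
                rates = {2: h23 * hr_treat, 4: h_other}
            else:                                  # severe disease
                rates = {3: h3d, 4: h_other}
            # Competing risks: the earliest exponential event time wins.
            times = {s: rng.exponential(1.0 / r) for s, r in rates.items()}
            state = min(times, key=times.get)
            t = min(t + times[state], horizon)
        total += t                                 # life-years lived
    return total / n_hist

# Life-years gained: run the model with and without the treatment effect,
# using the same seed so the comparison is not dominated by noise.
ly_no_tx = simulate_life_years(0.05, 0.10, 0.20, 0.02, 1.0, 2000, seed=1)
ly_tx = simulate_life_years(0.05, 0.10, 0.20, 0.02, 0.5, 2000, seed=1)
gain = ly_tx - ly_no_tx
```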
Evaluation of metamodel accuracy
Metamodeling scenarios
We evaluate how metamodel accuracy and choice may change under different commonly observed scenarios in health economic modeling. In Table 2, we describe the different scenarios. The exact details of the simulation settings are shown in Supplementary Table S2. We examine the following metamodeling characteristics: 1) the level of uncertainty in the model parameters (for the transition hazards, this is based on the sample size of the underlying individual-level data sets); 2) the number of model runs; 3) the number of simulated life histories; 4) a low or high rate of health state transitions; and 5) the Markov/semi-Markov assumption.
Modeling Scenarios Used to Evaluate Metamodel Accuracy a
IL-STM, individual-level state transition simulation model.
See Supplementary Table S1 for the range of simulated model parameters and Supplementary Table S2 for the basic simulation settings.
Simulation procedure
The simulation study has 2 phases. First, we generate the uncertainty range in which each model parameter is allowed to vary during PA. Then, we use this range to simulate training data to fit and validate the metamodels (Figure 1). The algorithm for the simulation is described below.

Steps of the simulation study to evaluate metamodel prediction accuracy.
Algorithm: Simulation Procedure
LHS, Latin hypercube sampling.
The first step is to sample microsimulation model parameter values using Latin hypercube sampling within the ranges given in Supplementary Table S1.
Statistical models/R packages
The R packages to be tested are shown in Table 1 and correspond to previously used metamodels in the health economic/public health literature. See Supplementary Table S3 for hyperparameter details. The main source for the R packages used is the Comprehensive R Archive Network (CRAN).
Performance measures
We evaluate the model accuracy by computing the root mean square error (RMSE) divided by the range of the model outcome and the percentage absolute error (PAE), averaged over 50 data sets. We also report the mean absolute error and the mean square error. For cross-validation, we left 10% of the data out of the training sample in each iteration.
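These performance measures can be sketched as follows (Python; the exact PAE formula is not given in this excerpt, so the definition below, the absolute error of the mean outcome as a percentage of the mean, is one plausible reading, and the example vectors are invented).

```python
import numpy as np

def rmse_range(y_true, y_pred):
    """Root mean squared error divided by the range of the model outcome."""
    rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
    return rmse / (y_true.max() - y_true.min())

def pae(y_true, y_pred):
    """Percentage absolute error: absolute error of the mean outcome as a
    percentage of the mean (one plausible reading of the paper's PAE)."""
    return 100.0 * abs(y_pred.mean() - y_true.mean()) / abs(y_true.mean())

# Invented example: simulated outcomes vs. metamodel predictions.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.2, 1.9, 3.2, 3.8])
```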
Evaluation of metamodel computation time
We fit the metamodels to a single simulated training data set and record the computation time for each R package.
Case Study: PA for Cost-Effectiveness Analysis of 2 Treatments for Stage I Non–Small-Cell Lung Cancer
The case study is based on Wolff et al. (2020). 17 This is a modeling study comparing the cost-effectiveness of 2 treatments for stage I non–small-cell lung cancer (NSCLC), stereotactic body radiation therapy (SBRT), and video-assisted thoracic surgery (VATS). This study contained a PA that required about 10,000 microsimulation model runs. This could be run in approximately 1 h, and in practice, we would not need to use a metamodel in this example. We use this case study to illustrate how to apply metamodeling to reduce the computation expense of uncertainty quantification by replacing most of the simulation model runs by the metamodel.
The main model outcomes are total discounted costs and total discounted quality-adjusted life-years (QALYs) gained by SBRT and VATS. A flowchart of the microsimulation model is shown in Supplementary Figure S3. For the PA, we include the same 22 model parameters included in the original analyses consisting of tumor growth parameters, probabilities for receiving treatment, excess mortality due to treatment, health utilities, and costs. Parameter values and their confidence intervals are given in Supplementary Table S4. The sample size for training is 10 times the number of included model parameters in the analyses (
Results
Accuracy of Different Statistical Models for Emulation
In Figure 2, we measure the effects of varying the number of model runs, the number of simulated life histories, and the level of parameter uncertainty on metamodel accuracy.

Accuracy of different statistical models to emulate a simulation model with 10 parameters. (a) RMSE.range denotes the average of the root mean squared error divided by the range of the simulated model outcome, computed over 50 data sets.

Accuracy of different statistical models: effect of parameter value (upper row), semi-Markov scenario, and disease-specific mortality with treatment (bottom row). (a) RMSE.range denotes the average of the root mean squared error divided by the range of the simulated model outcome, computed over 50 data sets.
In the base-case scenario given in the upper left corner of Figure 2, commonly used metamodels in health economics (LM, GP, GAM, and ANNs) have a similar accuracy (RMSE = 0.02 and PAE <1%).
In Figure 3, no significant changes in the ranking of the metamodels were observed. In the upper row, we analyzed the effects of low and high transition rates.
Computational Speed of R Packages to Fit a Metamodel
In Supplementary Figure S5, we show the computation speed of each R package. Most metamodels can be fit in less than 1 s. The exception is GP.
Case Study: PA for Cost-Effectiveness Analyses of 2 Treatments for NSCLC
Figure 4 shows the accuracy of different metamodels for predicting 10,000 simulation IL-STM runs for SBRT costs, whereas in Supplementary Table S8, we show the results for all metamodels and model outcomes. For both SBRT and VATS costs metamodels, 13 parameters were included, whereas for SBRT and VATS QALYs gained, respectively, 18 and 17 parameters were included.

Observed (simulation model) and predicted (metamodels) total costs of SBRT. (a) SBRT denotes stereotactic body radiation therapy. Costs are deflated to 2018 Euros. The discount rate for costs is 1.5%, based on Dutch guidelines. The average total discounted cost of SBRT during the probabilistic analysis (PA) using the individual-level state transition simulation model was 24,998 Euros. (b) Each emulator is denoted by its R package name or R function.
Most metamodels had an RMSE less than 0.01 for SBRT and VATS costs because of the relatively wide range of treatment costs. The best metamodels were GAMs.
Discussion
In this study, we systematically evaluated the prediction accuracy of several statistical models for the purpose of uncertainty quantification. In general, prediction accuracy improved with an increasing number of model runs and simulated life histories.
In the case study, a PA of an IL-STM for NSCLC was reanalyzed. The goal was to verify whether we could replace a majority of the 10,000 simulation model runs by a metamodel at a cost of a small prediction error. For all 4 model outcomes examined, the best metamodel had a reasonable prediction accuracy. In the worst case, the best metamodels had a PAE of 2.1%. In total, about 220 model runs were used to train the metamodels for the different model outcomes, which represents a reduction of more than 97% in the number of simulation model runs.
The 2 most commonly used metamodels are the LM and GP regression. LM performs well when the relationship between model parameters and outcomes is linear on the domain being considered (i.e., either this relationship is relatively simple or the confidence intervals are relatively small). It is much easier to understand, implement, and interpret than alternative models, especially GP and ANNs. The computation time to fit a GP regression model grows steeply with the number of IL-STM parameters and the number of simulation runs. Modern machine-learning methods can easily handle dozens of model parameters in a fraction of the computation time. However, in this context of a small training sample and few parameters, machine-learning methods could be at a disadvantage. Namely, models based on decision trees performed poorly, and Bayesian networks had a performance similar to the linear model. It also seems that the accuracy of GP is in some cases superior to that of ANN (Figure 2), and therefore, we should not completely discard classic GP regression if the number of model parameters is smaller than 15 and/or if we are not using cross-validation. GAMs are an interesting alternative to ANNs and GPs, especially if the number of parameters is large and the researcher is concerned about the interpretability of the model parameters.
Another factor to be considered when choosing a metamodel could be the number of hyperparameters used by each statistical method. Metamodels with a high number of hyperparameters (e.g., ANNs) may have a lower bias; however, this comes at the cost of a higher variance of the metamodel predictions. We may be more interested in methods with lower variance if we use metamodeling for policy analyses, in which the confidence intervals of the metamodel predictions could also be relevant. Recent results in high-dimensional statistics29,44 suggest that, under certain conditions, overparametrized models (i.e., models with a number of parameters exceeding the number of training observations) can still achieve good prediction accuracy.
We are not aware of any other study systematically evaluating metamodel prediction accuracy; however, 3 studies12,21,45 reported the prediction accuracy of 2 or 3 metamodels, ANNs, GP, and LM. All studies reported a better accuracy of ANN and/or GP as compared with the LM. ANN was also superior to GP in one study. The exception is the metamodel for QALYs from Alam and Briggs, 21 in which LM has roughly similar accuracy to the ANNs and GP.
The models that were found to have the highest prediction accuracy are not necessarily the "optimal models." Our goal was to find a set of models that results in a good accuracy, within a reasonable amount of time and without too much effort from the analyst. Therefore, we deliberately tested only a handful of possible hyperparameter combinations (a maximum of 6 per statistical model) to find the best models. For some models, such as ANNs or GAM boosting, accuracy might be further improved by a more extensive hyperparameter search.
These results are applicable to a certain class of microsimulation models used to model chronic, nontransmissible diseases. The main assumption here is that small changes in input values will cause a small change in the output. The second assumption is that the region of interest for metamodeling is relatively small. If this is large, such as, for calibration, then the rule of thumb of using 10 times the number of parameters as the number of simulation model runs for training does not apply and may result in a poor approximation of the model outcome regardless of the metamodel used (Supplementary Figure S4). The main outcome measure of the simulation study is life-years gained, whereas the main outcome of interest in cost-effectiveness analyses is a ratio (cost per QALY), which could introduce additional complexity in the relationship between model parameters and outcomes compared with the simulation study and case study presented.
The generalizability of these results is also limited by the fact that we considered in the simulation study a single decision model with only 10 parameters, without any interactions or correlations. IL-STMs used in public health could include between 50 and 100 parameters. However, in practice, it is unlikely that an analyst would consider including all model parameters in the metamodel, as this would require running the decision model thousands of times to generate training data, which could be computationally unfeasible. In a typical decision model, only a handful of model parameters are influential, whereas most parameters will have a limited effect on the outcome. The analyst could simplify the problem by using prior knowledge about the model to select the most influential model parameters or use variable selection techniques.
There are several possible statistical models that an analyst could use as a metamodel.
To facilitate the use of metamodeling in the health economics and public health community, we share the code used to fit the metamodels at https://github.com/tmdecarvalho. This will provide a reference on how to implement different metamodels in R. Intermediate/advanced users could then adapt the script and “play” with the hyperparameters until they find the best specification for their needs or even add their own models and use these specifications as a benchmark.
Supplemental Material
Supplemental material, sj-pdf-1-mdm-10.1177_0272989X211016307 for Choosing a Metamodel of a Simulation Model for Uncertainty Quantification by Tiago M. de Carvalho, Joost van Rosmalen, Harold B. Wolff, Hendrik Koffijberg and Veerle M. H. Coupé in Medical Decision Making