Abstract
Introduction
Model validation is defined as the process of determining the degree to which a model is an accurate representation of the real world from the perspective of the intended uses of the model.1 In other words, it asks how accurately the predictions of a model match observations from physical experiments. With the increasing use of computational modeling in engineering design, performance estimation, and safety assessment, a quantitative measure, called a validation metric, is needed to quantify the mismatch between predictions and experimental observations. Because both aleatory and epistemic uncertainties can be present in model validation practice, characterizing the effect of these mixed uncertainties on the validation metric is a challenge.
The topic of validation metrics has received considerable attention in the last decade. A validation metric is a formal measure of the mismatch between predictions and data that have not previously been used to develop the model. A useful validation metric has several desirable properties. First, it is an objective measure of distance. Second, it should reflect differences over the full distributions of the predictions and the data. Third, it should express its result in physical units rather than esoteric statistical units.1 The area metric2–4 is one of the most popular validation metrics because (1) it is expressed in the same physical units as the predictions and (2) it is not overly sensitive to the long tails of distributions. Ferson et al.1 used the area metric to characterize model form uncertainty by comparing the distributions of predictions and experimental data. Li et al.4 extended the concepts of the area metric and the u-pooling method to multiple correlated responses, although the resulting validation metrics are still expressed in statistical units, which is inconvenient for prediction. Roy and Oberkampf2 presented a validation framework based on probability boxes (p-boxes); in their study, the area metric was still used to compare a p-box with another p-box or a distribution. However, their method still yields a single value based on the definition of the minimal area. Such a result emphasizes only the evidence of mismatch and may underestimate the risk of model form uncertainty.
In practice, a model is often used to make several different predictions, and multiple values of the validation metric are, therefore, needed to assess the accuracy of the model. The u-pooling method4,5 combines all the usable values by pooling the individual comparisons onto a universal scale via the probability transformation.6 This strategy allows us to pool fundamentally incomparable data in terms of the relevance of each datum as evidence of the mismatch between the model and the experimental data. To date, however, the u-pooling method applies only to prediction distributions; it is inapplicable when the predictions and/or the experimental data are represented as p-boxes.
The thermal challenge problem is one of the three problems posed at the Sandia Validation Challenge Workshop. The mathematical model and the solution provided by Dowding et al.7 are based on one-dimensional, linear heat conduction in a solid slab with heat-flux boundary conditions. Multiple approaches have been used to evaluate the validity of the provided mathematical solution for a specified application with a defined regulatory criterion and to predict regulatory compliance. Ferson et al.1 discussed validation questions involving multiple predictions of different outputs, with predictions and/or experiments described by probability distributions, and presented the area metric. Rutherford8 discussed a methodology for compensating for computational/experimental discrepancies identified in the validation analysis. Hills and Dowding9 presented a multivariate validation metric accounting for model parameter uncertainty and the correlation between multiple measurement/prediction differences. However, none of these studies validated the problem within a mixed-uncertainties framework. In 2018, Wang et al.10 proposed a model validation approach based on evidence theory to account for epistemic uncertainty, but evidence theory discards the probability information and the separation between aleatory and epistemic uncertainties.
To overcome these issues, a validation and uncertainty quantification (UQ) framework based on p-boxes is proposed in this article. By introducing an interval-valued area validation metric and an interval-valued u-pooling method, model form uncertainty can be quantified in physical units. The thermal challenge problem is used to examine the performance of the proposed method.
The organization of this article is as follows: first, the thermal challenge problem is briefly introduced. Second, UQ based on p-boxes is proposed to quantify the uncertainty associated with the data from the challenge problem. The meaning of the interval-valued area metric, and how it differs from the original single-valued area metric, is demonstrated in detail, and a new u-pooling method based on p-boxes is also presented. Finally, the proposed method is applied to characterize the predictive capability of the model in the thermal challenge problem.
Summary of the thermal challenge problem
The thermal challenge problem consists of a set of mathematical models, three sets of experimental data of different sizes ("low," "medium," and "high"), and a regulatory requirement. The mathematical model of the temperature under heating is formulated as

$$T(x,t) = T_i + \frac{qL}{k}\left[\frac{\alpha t}{L^2} + \frac{1}{3} - \frac{x}{L} + \frac{1}{2}\left(\frac{x}{L}\right)^2 - \frac{2}{\pi^2}\sum_{n=1}^{\infty}\frac{1}{n^2}\exp\left(-n^2\pi^2\frac{\alpha t}{L^2}\right)\cos\left(n\pi\frac{x}{L}\right)\right] \quad (1)$$

where $\alpha = k/(\rho C_p)$ is the thermal diffusivity, $q$ is the applied heat flux, $L$ is the slab thickness, and $T_i$ is the initial temperature, with given material properties of the device associated with a particular manufacturing process. The regulatory requirement states that the probability of the surface temperature exceeding 900°C at $t = 1000$ s must be less than 0.01:

$$P\{T(x=0,\, t=1000\,\mathrm{s}) > 900\,^{\circ}\mathrm{C}\} < 0.01 \quad (2)$$

The challenge is to use the available empirical data to assess whether the regulatory requirement in equation (2) can be satisfied. The empirical data concern the two material properties, the thermal conductivity $k$ and the volumetric heat capacity $\rho C_p$.
The data provided for the thermal challenge problem include (1) material characterization data for the thermal conductivity $k$ and the volumetric heat capacity $\rho C_p$, (2) ensemble validation data from configurations similar to the intended application, and (3) accreditation data obtained at conditions closest to the application.
For the sake of simplicity, the effects of the different experimental data sets are not considered in this article. The material characterization data from the "medium" data set are used to account for the uncertainty associated with the samples. P-boxes are used to characterize both the aleatory and epistemic uncertainties associated with the input and output variables. The focus of this work is on UQ and propagation, the validation metric, and extrapolation with p-boxes.
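As a minimal runnable sketch, the one-dimensional slab conduction model can be written as follows; the series form and the parameter values (q = 3500 W/m², L = 0.019 m, Ti = 25°C) are assumptions adopted here for illustration rather than values quoted from the text above.

```python
import math

# Illustrative parameter values for the slab problem (assumed here):
# heat flux Q [W/m^2], thickness L [m], initial temperature TI [deg C].
Q, L, TI = 3500.0, 0.019, 25.0

def temperature(x, t, k, rho_cp, n_terms=500):
    """Series solution for 1-D linear heat conduction in a slab with a
    constant heat-flux boundary; k is the thermal conductivity and
    rho_cp the volumetric heat capacity."""
    alpha = k / rho_cp                  # thermal diffusivity
    tau = alpha * t / L**2              # dimensionless time
    xi = x / L                          # dimensionless position
    series = sum(math.exp(-n**2 * math.pi**2 * tau) *
                 math.cos(n * math.pi * xi) / n**2
                 for n in range(1, n_terms + 1))
    return TI + (Q * L / k) * (tau + 1.0 / 3.0 - xi + xi**2 / 2.0
                               - (2.0 / math.pi**2) * series)
```

With nominal material properties, the predicted surface temperature rises monotonically in time, which is the behavior exploited in the regulatory assessment below.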
UQ based on p-boxes
Probability boxes
P-boxes can characterize both epistemic and aleatory uncertainties in a way that does not confound the two. Probability bounds analysis, on which p-boxes are based, takes advantage of probability theory and set theory while keeping aleatory and epistemic uncertainties separate. As shown in Figure 1, the horizontal breadth of the p-box quantifies the amount of epistemic uncertainty in the system response quantity (SRQ), and the slope of the p-box reflects the frequency distribution of the aleatory uncertainty. See Ferson11 for a detailed discussion of p-boxes.

Representation of p-box for both epistemic and aleatory uncertainty.
Constructing p-boxes for a small amount of data
There are many methods to construct p-boxes. For instance, with a small amount of data, the family of distributions can often be specified while the distribution parameters remain unknown. In this situation, it is straightforward to construct a p-box that encompasses all the possible cumulative distribution functions (CDFs): the upper and lower confidence bounds of the CDFs can be identified by parameter estimation and the associated confidence intervals. In a more general case, without any distributional hypothesis, one can obtain a CDF and its confidence bounds via the Kaplan–Meier estimate and the Greenwood formula. More recently, kernel density estimation and the bootstrap have been used to estimate the bounds of p-boxes.
For the thermal challenge problem, 20 observations are available for the thermal conductivity $k$ in the "medium" data set. Assuming a normal distribution family, the interval estimate of the mean and the point estimate of the standard deviation define the corresponding p-box.

p-box and the distribution family for the thermal conductivity.
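A minimal sketch of this construction for the normal family, using the Student-t confidence interval of the mean as the interval-valued parameter (the hard-coded critical value for n − 1 = 19 degrees of freedom is an assumption of this sketch):

```python
import statistics

def normal_pbox_from_data(samples, t_crit=2.093):
    """Build a normal p-box from a small sample: the confidence
    interval of the mean becomes an interval-valued parameter, while
    the standard deviation is kept as a point estimate.
    t_crit ~ 97.5% Student-t quantile for 19 degrees of freedom."""
    n = len(samples)
    m = statistics.mean(samples)
    s = statistics.stdev(samples)
    half = t_crit * s / n**0.5
    return (m - half, m + half), s   # interval mean, point sigma
```

The resulting p-box is the envelope of all normal CDFs whose mean lies in the returned interval, matching the interval-mean/point-sigma characterization used later for the property data.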
Validation based on p-boxes
Validation metric for p-boxes
A validation metric is a formal measure of the mismatch between predictions and observations that have not previously been used to develop the model. We are interested in a validation metric that can be applied when the predictions are p-boxes. In this section, by extending the definition of the single-valued area metric to an interval-valued area metric, a new validation metric is developed to compare quantities characterized by p-boxes.
In the context of p-boxes, the area metric between two p-boxes $[\underline{F}, \overline{F}]$ and $[\underline{G}, \overline{G}]$ is defined through their inverse CDFs.12 At each probability level $u$, the two p-boxes define the intervals $[\overline{F}^{-1}(u), \underline{F}^{-1}(u)]$ and $[\overline{G}^{-1}(u), \underline{G}^{-1}(u)]$, and the lower bound of the metric is

$$d_{\min} = \int_0^1 D\big([\overline{F}^{-1}(u), \underline{F}^{-1}(u)],\, [\overline{G}^{-1}(u), \underline{G}^{-1}(u)]\big)\, du$$

where

$$D([a, b], [c, d]) = \max(0,\, c - b,\, a - d)$$

which is the shortest distance between two intervals. The upper bound $d_{\max}$ is obtained analogously by integrating the longest distance between the two intervals, giving the interval-valued metric $[d_{\min}, d_{\max}]$. When one of the two p-boxes degenerates to a precise distribution, the definition reduces to the area metric between a p-box and a distribution.

Area metric between p-box and distribution/p-box.
When
When one of
where the range of
As shown in Figure 4, assume that sufficient experimental data are available and that the experimental distribution is normal(5, 1). The predicted p-boxes are normal([4, 6], 1) and normal([6, 8], 1), respectively. Figure 5 illustrates the area measure of the mismatch between the predicted p-boxes and the experimental distribution.

Example of mismatch between an experimental distribution and different prediction p-boxes.

Example of mismatch between an experimental p-box and different prediction p-boxes.
It is now assumed that the experimental p-box is normal([4, 6], 1), while the predicted p-boxes are normal([5, 7], 1) and normal([7, 9], 1), respectively. Figure 6 shows the area measure of the mismatch between the predicted p-boxes and the experimental p-box.

Translation of observations (spikes) through prediction distributions (gray) to a universal probability scale. 1
Figures 4 and 5 both show how the interval-valued area metric differs from a single-valued area metric. As these examples illustrate, the original definition of the area metric corresponds to the lower bound of the interval-valued metric, which may be much smaller than the upper bound; the potential risk could therefore be underestimated.
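For the normal p-box examples above, the interval-valued metric can be sketched numerically as follows; the function name and the midpoint discretization are illustrative choices, not the paper's implementation:

```python
from statistics import NormalDist

def interval_area_metric(mu_lo, mu_hi, sigma, exp_mu, exp_sigma, n=2000):
    """Interval-valued area metric between a normal p-box
    normal([mu_lo, mu_hi], sigma) and a precise normal experimental
    distribution, integrating interval distances over the probability axis."""
    d_min = d_max = 0.0
    for i in range(n):
        u = (i + 0.5) / n
        # inverse-CDF interval of the p-box at probability level u
        a = NormalDist(mu_lo, sigma).inv_cdf(u)
        b = NormalDist(mu_hi, sigma).inv_cdf(u)
        x = NormalDist(exp_mu, exp_sigma).inv_cdf(u)
        d_min += max(0.0, a - x, x - b) / n       # shortest point-interval distance
        d_max += max(abs(x - a), abs(x - b)) / n  # longest point-interval distance
    return d_min, d_max

print(interval_area_metric(4, 6, 1, 5, 1))  # ≈ (0.0, 1.0)
print(interval_area_metric(6, 8, 1, 5, 1))  # ≈ (1.0, 3.0)
```

The first case (overlapping p-box) gives a lower bound of zero but an upper bound of one physical unit; the second (shifted p-box) gives the interval [1, 3], illustrating how the single minimal-area value can understate the mismatch.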
u-pooling for p-boxes
When several observations are collected for a single prediction distribution (or p-box), the empirical distribution (or p-box) of those observations can be pooled into a single object for comparison with the prediction. However, such pooling is not possible when the data are to be compared with different distributions (or p-boxes). One could compute the areas separately for each pair of prediction distribution (or p-box) and corresponding observations, but all the areas then need to be merged into an aggregate measure of the overall discrepancy.
By integrating the evidence from all the relevant data over the entire validation domain into a single measure, the overall mismatch can be assessed by the u-pooling approach, which relies on the probability integral transform theorem of statistics. The process of u-pooling can be described briefly as follows:1
1. Transforming to get u-values. Each observation is transformed through its corresponding prediction distribution into a u-value on the universal probability scale.
2. Pooling and back-transformation. All these u-values are pooled on [0, 1] and back-transformed through the prediction distribution at the intended condition.
3. Comparing to get the area. According to the real experimental data at the different conditions, the pooled back-transformed values construct a dummy experimental distribution for the intended condition. The area between the dummy experimental distribution and the predicted distribution at the intended condition is calculated as the validation metric, as shown in Figure 7 (right).
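The three steps can be sketched for precise (non-p-box) prediction distributions as follows, using normal distributions as an illustration:

```python
from statistics import NormalDist

def u_pooling(observations, pred_dists, target_dist):
    """Classical u-pooling: transform each observation through its own
    prediction CDF, pool the u-values, back-transform them through the
    target (archetypal) prediction distribution, and return the area
    between the pooled empirical CDF and the target CDF."""
    # Step 1: universal probability scale
    us = sorted(d.cdf(y) for y, d in zip(observations, pred_dists))
    # Step 2: back-transform to the archetypal (target) scale
    ys = [target_dist.inv_cdf(u) for u in us]
    # Step 3: area between the empirical CDF of ys and the target CDF,
    # integrated numerically along the probability axis
    n = len(ys)
    area, m = 0.0, 4000
    for i in range(m):
        u = (i + 0.5) / m
        emp = ys[min(int(u * n), n - 1)]   # empirical inverse CDF (step function)
        area += abs(emp - target_dist.inv_cdf(u)) / m
    return area
```

For example, `u_pooling([0, 1, 2], [NormalDist(0, 1), NormalDist(1, 1), NormalDist(2, 1)], NormalDist(0, 1))` pools three perfectly central observations and returns the mean absolute deviation of the target normal from its median.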

Back-transformation from the u-scale to an archetypal scale determined by a predicted distribution
Note that the procedure differs slightly when the corresponding prediction distribution in step 1 and the regular prediction distribution in step 2 are p-boxes. Finally, two p-boxes are compared to obtain the area metric in step 3. A set of interval-valued u-values, rather than a single u, is then obtained for each observation.

Translation to a universal probability scale and back-transformation to archetypal scale by p-boxes.
Following the flowchart in Figure 9, the extended u-pooling for p-boxes is calculated with the following procedure:
1. Transforming to get interval-valued u-values. Each observation is transformed through the bounding CDFs of its corresponding prediction p-box, yielding an interval of u-values on the universal probability scale.
2. Back-transformation and pooling. One needs the back-transformation for all the interval-valued u-values through the prediction p-box at the intended condition; the pooled back-transformed intervals construct a dummy experimental p-box.
3. Obtaining the area. The areas between the dummy experimental and predicted p-boxes are calculated to give an interval-valued area metric.

Flowchart of validation based on the extended u-pooling method.
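Under the assumption of normal p-boxes with interval-valued means, steps 1 and 2 of the extended procedure can be sketched with illustrative helpers:

```python
from statistics import NormalDist

def interval_u(y, mu_lo, mu_hi, sigma):
    """Interval of u-values for one observation transformed through a
    normal p-box with interval mean [mu_lo, mu_hi]: the normal CDF at y
    decreases as the mean grows, so the extreme means give the bounds."""
    return NormalDist(mu_hi, sigma).cdf(y), NormalDist(mu_lo, sigma).cdf(y)

def back_transform(u_lo, u_hi, mu_lo, mu_hi, sigma):
    """Back-transform an interval of u-values through a target normal
    p-box; the result is an interval on the physical (temperature) scale."""
    return (NormalDist(mu_lo, sigma).inv_cdf(u_lo),
            NormalDist(mu_hi, sigma).inv_cdf(u_hi))
```

Pooling the back-transformed intervals over all observations yields the dummy experimental p-box, which is then compared with the predicted p-box via the interval-valued area metric.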
Validation and prediction for the challenge problem
The normal distribution hypothesis is accepted by Lilliefors hypothesis tests for the "medium" (20 samples) data set of the thermal problem. The interval estimates of the means and the point estimates of the standard deviations for the pooled property data are as follows
The p-boxes can be constructed as given in equation (7), and nested iterations are performed; that is, each sample drawn from the epistemic variables in the outer loop results in a sampling over the aleatory variables in the inner loop. The resulting predicted p-box of the surface temperature after 1000 s, based on equation (1), is shown in Figure 10 as the solid and dotted distributions; the dashed line is the distribution without epistemic uncertainty, which lies inside the p-box. Both the minimum and maximum estimated probabilities are much larger than 0.01.

The predicted p-box against the regular requirement and the exceeded probability.
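The nested (double-loop) sampling described above can be sketched generically as follows; the model and the interval bounds are placeholders, not the challenge-problem values:

```python
import random

def double_loop_pbox(model, n_outer=50, n_inner=200, seed=1):
    """Nested-sampling sketch for uncertainty propagation: the outer
    loop samples the epistemic (interval) parameters, the inner loop
    samples the aleatory variability, and the envelope of the inner
    quantile curves approximates the output p-box. The interval and
    spread values below are purely illustrative."""
    rng = random.Random(seed)
    curves = []
    for _ in range(n_outer):
        mu = rng.uniform(0.055, 0.065)   # epistemic: mean drawn from its interval
        samples = sorted(model(rng.gauss(mu, 0.005)) for _ in range(n_inner))
        curves.append(samples)            # one empirical quantile curve
    # pointwise envelope of the quantile curves bounds the output p-box
    lower = [min(c[i] for c in curves) for i in range(n_inner)]
    upper = [max(c[i] for c in curves) for i in range(n_inner)]
    return lower, upper
```

The two returned quantile curves bound all inner-loop empirical CDFs, so the exceedance probability of any threshold can be read off as an interval, mirroring the minimum/maximum probabilities quoted above.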
Every temperature observation in the validation domain is thereby paired with a prediction p-box of temperature. There are 140 of these pairs in the “medium” data set. The pairs define the

p-box of u-values compared to theoretical uniform distribution.

p-box of back-transformed u-values compared to predicted temperature p-box.
When no direct experimental observations are available under the given conditions, in order to extrapolate the model form uncertainty to arbitrary conditions with given
At the configuration for the regulatory requirement, this equals 112 +
How to extrapolate the area metric is still an open question; the parallel push of both sides of the prediction p-box4 is adopted here, as shown in Figure 13. With the epistemic uncertainties quantified, the risk of exceeding 900°C ranges from 0.01 to 0.91, a range that may seem too wide to a decision-maker. However, the range can be reduced by increasing the number of experimental samples, improving the heat transfer model, and accounting for the relation between the input parameters and the temperature.

Parallel push for estimated area metric value as predictive capability.
Summary/conclusion
This article has proposed a new model validation metric based on p-boxes. The UQ processes in the context of p-boxes have been developed, including input UQ, propagation, comparison, and prediction UQ. The extended interval-valued u-values are used to build the back-transformed predicted p-box in the validation domain. The extended interval-valued area metric is then used to obtain the model form uncertainty for given parameters, and the relations between the area metric and the parameters are captured via regression. The extended interval-valued area metric may, however, expand the uncertainty range so much that it becomes unacceptably wide, especially in extrapolation, which can be overly conservative. Although the interval-valued area metric can characterize more complicated epistemic uncertainty, more reasonable definitions of the area metric and extrapolation methods should be studied in future work.
