Abstract
Keywords
Introduction
In clinical research, when the characteristic of interest is continuous, Bland & Altman's limits of agreement (LoA) method1,2 is one of the most frequently used methodologies for assessing the agreement/interchangeability between two measurement methods (their 1986 paper 2 has been cited over 50,000 times as of May 2022). Often the motivation is to evaluate a new, and perhaps less expensive or easier, method of measurement against an established reference standard. To evaluate the comparability of the methods, the investigator collects repeated measurements from each method on a sample of subjects. Bland & Altman's LoA are then computed by adding to and subtracting from the mean difference (that is, the estimate of the average bias) 1.96 times the estimated standard deviation of the differences. A scatter plot of the differences versus the means of the two variables, with the LoA superimposed, is used to visually appraise the level of agreement. Further, a regression line of the differences as a function of the means is added to the plot to indicate the direction and amplitude of the bias. 3 A final decision regarding the agreement between the two methods is based on whether the LoA lie within pre-defined clinical tolerance limits.
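As a minimal illustration of the LoA computation described above, the following Python sketch assumes one pair of measurements per subject (the simplest setting, without repeated measurements) and uses hypothetical readings; the function name `limits_of_agreement` is ours and is not taken from any package.

```python
import numpy as np

def limits_of_agreement(y1, y2, z=1.96):
    """Return (bias, lower LoA, upper LoA) for paired measurements.

    bias is the mean of the differences y1 - y2, and the limits are
    bias -/+ z times the sample standard deviation of the differences.
    """
    d = np.asarray(y1, dtype=float) - np.asarray(y2, dtype=float)
    bias = d.mean()
    sd = d.std(ddof=1)  # sample standard deviation of the differences
    return bias, bias - z * sd, bias + z * sd

# Hypothetical paired readings (new method, reference standard)
y_new = [10.2, 11.8, 9.9, 12.4, 10.7]
y_ref = [10.0, 12.0, 10.1, 12.0, 10.5]

bias, lower, upper = limits_of_agreement(y_new, y_ref)
print(f"bias = {bias:.3f}, LoA = ({lower:.3f}, {upper:.3f})")
```

The two methods would then be deemed interchangeable only if both limits fall inside the pre-defined clinical tolerance limits.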
However, it has been shown that in the presence of a proportional bias, or in situations where the variances of the measurement errors of each of the methods are not constant (i.e. heteroscedastic), Bland & Altman's plot may be misleading. 4 When this is the case, the regression line may show an upward or downward trend when there is no bias, or a zero slope when there is a bias. Therefore, a new statistical methodology to assess the bias, precision, and agreement of a new measurement method, which circumvents the deficiencies of Bland & Altman's LoA method, has recently been developed.4,5 This methodology, however, is in the same spirit as that of Bland & Altman, and the level of agreement is not directly quantified (except for the concept of “percentage of agreement,” which, however, does not depend on pre-defined tolerance limits). Rather, based on inspection of the Bias, Precision, and Agreement plots, the investigator has to decide whether disagreement (not agreement) is not too high for the two methods to be deemed to be interchangeable.
To quantify the level of agreement more directly, Lin et al. 6 have proposed the concept of “coverage probability,” defined as the probability that the absolute difference between the two measurements made on the same subject is less than a pre-defined value. Their methodology, however, allows one to assess only the overall agreement and does not take into account that the level of agreement might depend on the value of the true latent trait in a continuous way. In addition, homoscedastic measurement errors are implicitly assumed (an assumption that is often too strong) and the presence of a possible bias is not assessed. For these reasons, Stevens et al. 7 have extended this methodology to allow the coverage probability to depend on the value of the latent trait, as well as on the amount of bias. Later, they further extended their methodology to allow for heteroscedastic measurement errors. 8 They have called their extended agreement concept “probability of agreement.”
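To fix ideas, the coverage probability can be evaluated in closed form if one assumes, for illustration only, that the difference D between the two measurements is Normal with mean `mu_d` and standard deviation `sd_d`. The sketch below is not the estimator of Lin et al.; the function names and numerical values are hypothetical.

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def coverage_probability(mu_d, sd_d, delta):
    """P(|D| < delta) for D ~ Normal(mu_d, sd_d**2) (Normality is an assumption)."""
    if sd_d <= 0 or delta <= 0:
        raise ValueError("sd_d and delta must be positive")
    upper = normal_cdf((delta - mu_d) / sd_d)
    lower = normal_cdf((-delta - mu_d) / sd_d)
    return upper - lower

# Hypothetical values: the same tolerance, without and with a bias of one unit
cp_no_bias = coverage_probability(mu_d=0.0, sd_d=1.0, delta=1.96)
cp_bias = coverage_probability(mu_d=1.0, sd_d=1.0, delta=1.96)
print(f"no bias: {cp_no_bias:.3f}, bias of 1: {cp_bias:.3f}")
```

Under this assumption a non-zero mean difference lowers the coverage probability for a given tolerance, which is why a single overall value can hide a bias that varies with the latent trait.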
There are, however, several important limitations to the Stevens et al.7,8 methodology. First, it essentially relies on a parametric specification of the distribution of the latent trait, which is undesirable as usually little is known about this distribution, particularly given that measurement errors make its identification difficult. Second, the investigator has to decide over which support the index will be computed, thereby possibly extrapolating to impossible or unrealistic values of the genuine latent trait. Third, they considered constant tolerance limits, whereas often the variance of the measurement errors is heteroscedastic and increases with the value of the latent trait. Consequently, non-constant tolerance limits, which depend on the latent trait, might be better suited. Fourth, they computed pointwise, and not simultaneous, confidence bands around the probability of agreement, which is an important limitation, as in practice one may be interested in assessing the agreement level at different values of the latent trait or even across an entire interval and not just at a single point of the support. Fifth, they did not show any simulation results regarding the coverage rate of their confidence band, nor did they use, for example, the logit transformation to prevent negative values from being included in the confidence band (as illustrated by Figure 2 in their 2018 paper 7 ). Actually, their developments were based on the use of the delta method, which may not always provide reliable results (as demonstrated by the results of the present study), particularly when the function of the parameters is highly non-linear. Indeed, the same issue with the delta method has been demonstrated in a recently published study. 5 Sixth, since the probability of agreement index depends on the value of the true trait, which is by definition latent and not estimated in their approach, it is practically not possible to assess the true level of agreement.
Lastly, their unconditional probability of agreement concept, 7 to assess the overall agreement level, is not a proper unconditional or marginal probability; rather it is a conditional probability evaluated at the mean value of the latent trait and therefore it does not summarize the overall agreement appropriately.
Many other methods to assess agreement/interchangeability have been proposed in the literature and we refer the interested reader to some recently published papers for a more extensive literature review.8–10
Therefore, the main goal of this paper is to develop an extended methodology, based on the coverage probability/probability of agreement concepts of Lin et al. 6 and Stevens et al., 8 which overcomes the above-mentioned limitations. For this, we will rely on an empirical Bayes approach (to predict the value of the true latent trait), which has already been fruitfully used to overcome the limitations of Bland & Altman's LoA methodology.4,5 The inference will be developed to build both pointwise and simultaneous confidence bands around the conditional agreement curve, such that the investigator may adopt either method according to their inference goal.
Methodology of the clinical tolerance limits
Consider the general measurement error model:
where
When method 2 is the reference standard and method 1 is the new method to be evaluated, the model reduces to:
Note that this measurement error model is slightly different from the classical measurement error model 11 in that the heteroscedasticity depends on the latent trait and not on an observed average. In addition, we have considered a simple linear relationship between
Consider the differences:
One may obtain the
As regards the clinical tolerance limits,
One simple alternative is to set constant values, which do not depend on the true latent trait:
Other choices are of course possible and the tolerance limits need not be linear.
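To make the role of the tolerance limits concrete, the following is a minimal sketch and not the estimation procedure of this paper. It assumes, purely for illustration, that conditional on the latent trait x the difference y1 − y2 is Normal with mean alpha + (beta − 1)x, that the two measurement errors are independent, and that their standard deviations are supplied as functions of x. All function names and numerical values are hypothetical.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def conditional_agreement(x, alpha, beta, sd1, sd2, lower, upper):
    """P(lower(x) < y1 - y2 < upper(x) | x) under the illustrative Normal model.

    alpha, beta : differential and proportional bias of method 1 (assumed form)
    sd1, sd2    : callables giving the error standard deviations at x
    lower, upper: callables giving the tolerance limits at x
    """
    mean_d = alpha + (beta - 1.0) * x
    sd_d = sqrt(sd1(x) ** 2 + sd2(x) ** 2)  # independent errors assumed
    return normal_cdf((upper(x) - mean_d) / sd_d) - normal_cdf((lower(x) - mean_d) / sd_d)

# Hypothetical heteroscedastic errors whose spread grows with the latent trait
sd_new = lambda x: 0.5 + 0.05 * x
sd_ref = lambda x: 0.3 + 0.02 * x

# Constant tolerance limits of -3 and +3 units
const_lo, const_hi = (lambda x: -3.0), (lambda x: 3.0)
p_at_10 = conditional_agreement(10.0, -1.0, 1.1, sd_new, sd_ref, const_lo, const_hi)
p_at_40 = conditional_agreement(40.0, -1.0, 1.1, sd_new, sd_ref, const_lo, const_hi)

# Tolerance limits that widen linearly with the latent trait
lin_lo, lin_hi = (lambda x: -(1.0 + 0.1 * x)), (lambda x: 1.0 + 0.1 * x)
p_lin_40 = conditional_agreement(40.0, -1.0, 1.1, sd_new, sd_ref, lin_lo, lin_hi)

print(f"constant limits: {p_at_10:.3f} (x=10), {p_at_40:.3f} (x=40)")
print(f"linear limits:   {p_lin_40:.3f} (x=40)")
```

With these hypothetical values, constant limits give a conditional agreement that falls as x increases, whereas limits that widen with x partly offset the growing error variance.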
We do not want to make strong distributional assumptions regarding the true latent trait
where for the sake of notational convenience we have suppressed the dependence of the density functions
Notice, that with sufficient repeated observations per individual (e.g. between 5 and 10) and a sufficient number of individuals (e.g. 100) the empirical distribution of the BLUP of
Estimates of the standard deviation of the measurement errors are obtained following a similar approach to that of Bland and Altman3,4:
Once the parameters have been estimated and conditional agreement computed:
To proceed, we will rely on two different approaches, the first, the conventional multivariate delta method, and the second, based on a simulation method. In our experience, with complicated functions involving ratios of random variables, the delta method may perform poorly. 5 Also, to guarantee confidence bands comprising values between 0 and 1, the logit transform (of the conditional agreement) will be used.
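The effect of the logit transform on the limits of an interval can be illustrated with a short sketch. It assumes that an estimate `p_hat` of the conditional agreement and a standard error `se_logit` on the logit scale are already available from whichever variance estimator is used; both inputs and the function names are hypothetical.

```python
from math import exp, log

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return log(p / (1.0 - p))

def expit(t):
    """Inverse of the logit transform."""
    return 1.0 / (1.0 + exp(-t))

def logit_interval(p_hat, se_logit, z=1.96):
    """Normal-approximation interval built on the logit scale, then back-transformed."""
    if not 0.0 < p_hat < 1.0:
        raise ValueError("p_hat must lie strictly between 0 and 1")
    centre = logit(p_hat)
    return expit(centre - z * se_logit), expit(centre + z * se_logit)

# Hypothetical estimate close to the upper boundary
low, high = logit_interval(p_hat=0.97, se_logit=1.0)
print(f"95% interval: ({low:.3f}, {high:.3f})")
```

Because the back-transform maps the whole real line into (0, 1), both limits are valid probabilities even when the estimate is close to 0 or 1, at the price of an interval that is asymmetric around the estimate.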
When the goal is to assess the agreement level for a specific value of the latent trait, a pointwise confidence interval is adequate, as it guarantees that on average 95% of the computed intervals will cover the true value. However, when interest lies in several points of the support or in the whole curve, a simultaneous confidence band is required, as it guarantees a proper coverage rate for the simultaneous inference, whatever the number of points of the support.
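The difference between the two kinds of band can be sketched with a generic simulation-based construction, which is not necessarily the procedure developed in this paper. It assumes that a matrix of simulated curves on the logit scale (one row per simulation draw, one column per grid point of the latent trait) is available; the data below are synthetic and the function name is hypothetical.

```python
import numpy as np
from statistics import NormalDist

def confidence_bands(estimate, draws, level=0.95):
    """Pointwise and simultaneous bands around a curve, from simulated draws.

    estimate : array (G,) with the estimated curve on the logit scale
    draws    : array (B, G) with simulated curves on the same scale
    Returns (pointwise, simultaneous, c_sim); each band is a (lower, upper) tuple.
    """
    estimate = np.asarray(estimate, dtype=float)
    draws = np.asarray(draws, dtype=float)
    sd = draws.std(axis=0, ddof=1)                   # pointwise standard deviation
    z = NormalDist().inv_cdf(0.5 + level / 2.0)      # pointwise critical value
    # Largest standardized deviation of each simulated curve over the grid
    max_dev = (np.abs(draws - draws.mean(axis=0)) / sd).max(axis=1)
    c_sim = float(np.quantile(max_dev, level))       # simultaneous critical value
    pointwise = (estimate - z * sd, estimate + z * sd)
    simultaneous = (estimate - c_sim * sd, estimate + c_sim * sd)
    return pointwise, simultaneous, c_sim

# Synthetic example: 1000 draws on a grid of 20 latent-trait values
rng = np.random.default_rng(0)
estimate = np.linspace(2.0, 0.5, 20)                 # hypothetical logit-scale curve
draws = estimate + 0.3 * rng.normal(size=(1000, 20))

pointwise, simultaneous, c_sim = confidence_bands(estimate, draws)
to_prob = lambda t: 1.0 / (1.0 + np.exp(-t))         # back to the probability scale
sim_lower, sim_upper = to_prob(simultaneous[0]), to_prob(simultaneous[1])
print(f"simultaneous critical value: {c_sim:.2f} (pointwise: 1.96)")
```

The simultaneous critical value exceeds 1.96 because it must control the largest deviation over the whole grid rather than the deviation at a single point, so the resulting band is wider everywhere.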
The variance of
Step 1: For each individual
Step 2: As
Step 3: 1000 values for
Step 4: Finally, 1000 values for
A pointwise 95% confidence band for
With regard to a
The probability statement (12) is equivalent to:
As in our case, as
Concretely, a
Step 1:
Step 2: For
Step 3:
To guarantee values comprised strictly between 0 and 1, the logit transform (of the conditional agreement) is used in all the calculations and the inverse transform is applied in the end to get the required band.
The performance of these alternative variance estimators and confidence bands has been studied in the simulation section below.
The goal of this simulation study is to assess the performance of the various confidence bands developed in section 2 and introduce the “Conditional agreement plot.”
Performance of the confidence bands
We simulated 1000 data sets according to the following data generating process:
For the sake of brevity, we focused on five different settings, which may be relevant for clinical practice: (1) zero differential and no proportional biases; (2) negative differential bias with proportional bias less than 1; (3) negative differential bias with proportional bias larger than 1; (4) positive differential bias with proportional bias less than 1; (5) positive differential bias with proportional bias larger than 1. To allow comparisons with results from a previous study, 5 the following pairs of values for the differential and proportional biases were used
The number of repeated measurements from the reference standard was drawn in the three distributions: (1)
Notice that
As regards the clinical tolerance limits, for the sake of simplicity, they have been set to constant values and do not depend on the true latent trait:
Table 1. Coverage rates (nominal 95%) of the pointwise confidence band (Normal approximation/quantiles of the simulation distribution/delta method).
Table 2. Coverage rates (nominal 95%) of the simultaneous confidence band.
Coverage rates are generally very good when the variance is computed by the simulation method, the Normal approximation is used, and at least five to 10 repeated measurements per individual are available from the reference method. With few repeated measurements per individual, the method based on the quantiles of the simulation distribution did not perform as well as that based on the simulated variance and Normal approximation. Overall, the coverage rate of the delta method is poor.
For example, when
Coverage rates are generally on the conservative side with at least five to 10 repeated measurements per individual by the reference method.
To introduce the “Conditional agreement plot,” the following data generating process has been considered:
where
Consider, first, the setting where the clinical tolerance limits have been set to constant values and do not depend on the true latent trait:

Figure 1. Scatter plot of the simulated data.
On the scatter plot of the simulated data the proportional bias of the new method is clearly apparent, as well as the heteroscedasticity of the measurement errors.
Figure 2, left, shows the true Tolerance limit plot, which serves as a benchmark for the comparison with the Tolerance limit plot (Figure 3 left) in which the unknown true trait has been predicted by the empirical Bayes method (i.e. BLUP of

Figure 2. (Left) Scatter plot of the differences y1–y2 versus the true latent trait with tolerance limits, (right) scatter plot of the true conditional probability of agreement.

(Left) Scatter plot of the differences y1–y2 versus the BLUP of
In this example, due to the presence of a proportional bias and heteroscedasticity of measurement errors, the true conditional probability of agreement is not constant and decreases with the level of the true latent trait value (
Based on the estimates of the parameters of model (2), the conditional probability of agreement and BLUP of the true latent trait
The Tolerance limits plot from Figure 3 is very similar to that of Figure 2, which illustrates that already with five repeated measurements per individual the BLUP of
For the sake of comparison, Figure 4 also shows Bland & Altman's LoA plot and Taffé's agreement plot.

Figure 4. (Left) Bland & Altman's LoA plot, (right) agreement plot.
In comparison with the Conditional probability of agreement plot, which directly provides the level of agreement, the investigator must carefully inspect the LoA on the Bland & Altman's and Taffé's agreement plots to assess the degree of agreement. In this respect, the percentage of agreement (defined as
To fix this issue, one may define tolerance limits that depend on the true latent trait in a specific form, for example, linearly:

(Left) Scatter plot of the differences y1–y2 versus the BLUP of
Clearly, the percentage of agreement curve and the Conditional agreement plot now give the same message: the level of agreement is best for values of the latent trait around 25.
A worked example based on real data is presented in the Supplemental Appendix.
In this study, we have further extended the methodology proposed, first, by Lin et al. 6 and, then, extended by Stevens et al., 8 on the coverage probability/probability of agreement concepts, by relaxing the strong parametric assumptions regarding the distribution of the latent trait. In addition, we have extended the tolerance limits concept, which is allowed to depend on the true latent trait, and developed an inference theory based on different variance estimation methods. The focus has been on developing simultaneous, and not simply pointwise, confidence bands to allow for simultaneous inferences, that is, to allow the inference for the whole curve and not only for a specific value of the latent trait. This is particularly relevant from a clinical perspective, as it may turn out that the probability of agreement is high enough only for a limited range of the values of the latent trait. In addition, by superimposing on the same plot the conditional agreement curves obtained from several competing new measurement methods, along with their simultaneous confidence bands, one may assess which of these competing methods performs best and for which values of the latent trait.
We have investigated two different methods to compute the variance of the conditional agreement: the standard delta method and a simulation method. Simulation results (in Table 1) have shown that the delta method did not perform well, whereas the simulation method performed very well when there were at least five to 10 repeated measurements per individual by the reference method and only one measurement by the new method. This poor result of the delta method has already been observed in Taffé 5 and it is recommended to abandon it in favor of the simulation method. It is worth mentioning that, to avoid impossible negative values in the confidence band (CB), the logit transform should be used when using the Normal approximation or delta method. Also, in some settings, the pointwise confidence band based on the quantiles of the simulation distribution did not achieve as good a coverage rate as that based on the simulated variance and Normal approximation, particularly with few repeated measurements per individual. The reason is not clear and should be investigated in further research.
The coverage rate of the simultaneous CB (in Table 2) has been found to be quite conservative in almost all the settings investigated. This result has already been observed in Taffé 5 and by others in the setting of longitudinal data and mixed models.18,19 However, in the latter studies, the authors did not have to deal with the tricky issues posed by the use of a predicted latent variable as one of the regressors. Further research should strive to improve this.
Notice that, as shown in Taffé, 5 the repeated measurements need not be from the reference standard, and the estimation method may be easily adapted to the setting where the repeated measurements come from the new method. This is a great asset of the proposed methodology, as sometimes it may turn out to be easier to perform many measurements by the new measurement method. Nevertheless, requiring repeated measurements by one of the two methods might discourage the applied researcher from using our methodology. However, this is necessary for statistical identification. Indeed, when the variance of the measurement errors of each measurement method is not constant or their ratio is unknown, which is usually the case in the biomedical field (the variance of measurement errors often increases as the latent trait increases), having only one measurement by each of the two measurement methods does not allow one to identify all the parameters of model (2). 11
The marginal probability of agreement has simply been estimated by the proportion of observations between the two tolerance limits, and its confidence interval computed by the method of Agresti and Coull, 17 although other methods such as that of Wilson could also have been used. We have also investigated the performance of our methodology for other distributions of the latent trait (Normal and Gamma, with skewness 1 and kurtosis 4.5, results not shown), and different parameter values for the heteroscedasticity. It still performed very well. This is the great asset of the empirical Bayes approach, which can accommodate virtually any distribution of the latent trait, provided there are enough repeated measurements per individual.
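The estimate of the marginal probability of agreement and its Agresti and Coull interval can be sketched as follows. The sketch treats the differences as independent observations, which is a simplification when there are repeated measurements per individual; the function names and the data are hypothetical.

```python
from math import sqrt

def agresti_coull(successes, n, z=1.96):
    """Agresti-Coull interval for a binomial proportion, truncated to [0, 1]."""
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2.0) / n_adj
    half_width = z * sqrt(p_adj * (1.0 - p_adj) / n_adj)
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)

def marginal_agreement(differences, lower, upper):
    """Proportion of differences strictly inside (lower, upper), with its interval."""
    n = len(differences)
    inside = sum(1 for d in differences if lower < d < upper)
    return inside / n, agresti_coull(inside, n)

# Hypothetical differences y1 - y2 and constant tolerance limits of -1 and +1
diffs = [0.1, -0.2, 0.5, 1.5, -0.7, 0.3, -1.2, 0.0]
prop, (ci_low, ci_high) = marginal_agreement(diffs, -1.0, 1.0)
print(f"marginal agreement = {prop:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```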
We have seen that, depending on how the tolerance limits have been defined (based or not on the value of the latent trait), the conditional agreement may be similar in shape to the percentage of agreement proposed by Taffé. 5 As mentioned above, the former is based on pre-defined tolerance limits, whereas the latter depends on the width of the LoA and the amount of bias, and does not require the investigator to set tolerance limits. Which one should be preferred depends on the information available a priori to the investigator for setting the limits. We have illustrated that the definition of the tolerance limits may have an important leverage effect on the level of the conditional agreement calculated, whereas the percentage of agreement depends solely on the variability and bias found in the data. It is recommended to compute both measures of agreement and thoroughly inspect the plots before deciding on the agreement.
Finally, it is important to emphasize that our modeling strategy rests on the assumption that the individual latent trait is constant within individuals, that is,
In summary, we have extended the methodology proposed by Lin et al. 6 and Stevens et al.,7,8 on the coverage probability/probability of agreement, by relaxing the strong parametric assumptions regarding the distribution of the latent trait and developing an inference method allowing us to compute pointwise and simultaneous CBs. The methodology requires repeated measurements by at least one of the two methods and can accommodate heteroscedastic measurement errors. It often performs very well even when one has only one measurement by one of the two methods and five to 10 repeated measurements from the other. This methodology will be made available in a future Stata package.
Supplemental Material
sj-docx-1-smm-10.1177_09622802221137743 - Supplemental material for Use of clinical tolerance limits for assessing agreement
Supplemental material, sj-docx-1-smm-10.1177_09622802221137743 for Use of clinical tolerance limits for assessing agreement by Taffé Patrick in Statistical Methods in Medical Research