Abstract
Keywords
Introduction
Intensive care unit–acquired weakness (ICU-AW) is a frequent complication of critical illness and is associated with a prolonged ICU stay and increased short- and long-term morbidity and mortality.1-3 Before structural muscle and nerve damage is detectable, muscle and nerve dysfunction occurs, which may be fully reversible.4-6 Electrophysiological signs of critical illness polyneuropathy or myopathy can often be detected within the first week after ICU admission and may resolve before ICU discharge.6 Therefore, future treatments may be most beneficial early in the disease. Furthermore, being able to predict the development of ICU-AW in a given patient may allow more timely initiation of supportive interventions, such as early mobilization.7,8
Intensive care unit–acquired weakness is currently diagnosed by assessment of manual muscle strength.9 This is often not possible in the first days after ICU admission because of impaired consciousness or attentiveness, for example, due to delirium, coma, or sedation.10-12 To avoid this diagnostic delay, we previously developed a prediction model for ICU-AW that includes 3 early available predictors obtained 2 days after ICU admission.13 The model showed fair discriminative performance after internal validation but was built on data collected in only 1 hospital.
Before prediction models can be applied in practice, the external validity should be studied in a new independent population. 14 The aim of this study was to externally validate and, if necessary, update the previously developed prediction model for ICU-AW. External validation included both temporal (patients from a later time period) and geographical (patients from other institutions) validation to assess generalizability of the model.
Methods
Design and Ethical Approval
We performed a multicenter prospective observational cohort validation study. This study was reported according to the recently published TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) guidelines.15,16
The institutional review board of the Academic Medical Center, Amsterdam, the Netherlands, decided that the Medical Research Involving Human Subjects Act does not apply to this study (decision notice W13_193#13.17.0239), and therefore, written informed consent was not needed. Verbal consent to use patient data was obtained from all included patients. The study was registered in the Netherlands Trial Register (#NTR4331).
Study Setting
The study was conducted in medical-surgical ICUs of 5 hospitals in the Netherlands: 2 university hospitals, 2 university affiliated teaching hospitals, and 1 regional hospital.
Inclusion and Exclusion Criteria
Consecutive, newly admitted ICU patients, ≥18 years old, who were mechanically ventilated at 48 hours after ICU admission were included (irrespective of the duration of mechanical ventilation). This differed from the development study, in which patients who were mechanically ventilated for >2 days after admission were included. As in the development study, we excluded patients with an admission diagnosis of cardiac arrest, neuromuscular disease, or central nervous system (CNS) disease (stroke, traumatic brain or spinal cord injury, CNS infection, or CNS tumor). Furthermore, patients with preexisting spinal injury, patients with a poor pre-ICU functional status (modified Rankin scale ≥4),17 and patients who were expected to die within 48 hours were excluded.
Predictor Assessment
All 20 candidate predictors of the development study13 were assessed. These predictors were defined, collected, and interpreted as in the model development study, except for the lowest PaO2/FiO2 (P/F) ratio.
Additionally, 3 new candidate predictors, based on newly described risk factors, were collected: erythrocyte transfusion,18 hypercalcemia,19 and hypophosphatemia (own unpublished data). These were defined as any erythrocyte transfusion within 24 hours before ICU admission or in the first 48 hours after ICU admission, highest ionized calcium (mmol/L), and lowest phosphate (mmol/L) in the first 48 hours after ICU admission, respectively. All predictors were prospectively assessed and recorded in an online case report form by local investigators, who were blinded to the strength assessment results.
Strength Assessment (Reference Standard)
As in the development study, trained physiotherapists assessed muscle strength as soon as patients were alert (Richmond Agitation-Sedation Scale [RASS] between −1 and 1) and attentive (able to follow verbal commands using facial expressions).12,20,21 Muscle strength was assessed using the Medical Research Council (MRC) score in 6 prespecified muscle groups, as in the development study.13,22 The average MRC score of these muscle groups was used for the analysis (values were not imputed when a muscle group could not be assessed). Intensive care unit–acquired weakness was defined by an average MRC score <4, in accordance with international consensus statements.1,9 Physiotherapists were blinded to the predictors (except age, gender, and admission reason).
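As a minimal sketch (not the study's actual software), the average-MRC scoring rule described above, including the skipping of unassessable muscle groups, can be expressed as:

```python
def mean_mrc(scores):
    """Average Medical Research Council (MRC) score over the assessed muscle
    groups (each graded 0-5); groups that could not be assessed are passed
    as None and skipped rather than imputed, as in the study."""
    assessed = [s for s in scores if s is not None]
    return sum(assessed) / len(assessed)

def has_icuaw(scores, threshold=4.0):
    """ICU-AW is defined as an average MRC score below 4."""
    return mean_mrc(scores) < threshold

# 6 prespecified muscle groups; average 22/6 = 3.67 < 4 -> ICU-AW
print(has_icuaw([4, 4, 3, 3, 4, 4]))  # → True
print(has_icuaw([5, 5, 4, 4, 5, 5]))  # → False
```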
Additional Data Collected
We additionally collected the following clinical characteristics: the Acute Physiology and Chronic Health Evaluation IV (APACHE IV) score, the maximal Sequential Organ Failure Assessment (SOFA) score of the first 2 days after ICU admission, day of MRC assessment, number of days on mechanical ventilation, length of stay in the ICU, and ICU mortality.
Data Analysis
We applied the original model, with its predictors and assigned weights as estimated in the development study, 13 to our new data. The original model was:
We assessed the performance by calibration and discrimination. Calibration reflects the agreement between the ICU-AW risk predicted by the model and the observed ICU-AW frequency in the validation cohort. This was assessed for each decile of predicted risk, ensuring 10 equally sized groups, by calculating the ratio of predicted ICU-AW risk to observed ICU-AW frequency. Calibration was analyzed graphically and using a goodness-of-fit (Hosmer-Lemeshow) test. Discrimination, the ability of the model to correctly classify patients with and without the disease, was assessed by the area under the receiver operating characteristic curve (AUC-ROC). We defined AUC-ROC values between 0.90 and 1 as excellent, 0.80 and 0.90 as good, 0.70 and 0.80 as fair, 0.60 and 0.70 as poor, and <0.60 as failed.
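The discrimination and decile-based calibration measures described above can be sketched in plain Python; this is a simplified illustration on synthetic data, not the study's analysis code:

```python
import random

def auc_roc(y, p):
    """Rank-based AUC-ROC: the probability that a randomly chosen event
    receives a higher predicted risk than a randomly chosen non-event."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum((pi > pj) + 0.5 * (pi == pj) for pi in pos for pj in neg)
    return wins / (len(pos) * len(neg))

def decile_calibration(y, p, groups=10):
    """Mean predicted risk vs observed outcome frequency per equally sized
    risk group (deciles of predicted risk)."""
    order = sorted(range(len(p)), key=lambda i: p[i])
    size = len(p) // groups
    table = []
    for g in range(groups):
        idx = order[g * size:(g + 1) * size] if g < groups - 1 else order[g * size:]
        pred = sum(p[i] for i in idx) / len(idx)
        obs = sum(y[i] for i in idx) / len(idx)
        table.append((pred, obs))
    return table

# Synthetic, well-calibrated predictions: outcomes drawn with probability p
random.seed(0)
p = [random.random() for _ in range(400)]
y = [1 if random.random() < pi else 0 for pi in p]
print(round(auc_roc(y, p), 2))
```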
Next, to improve the performance of the original model, we used updating methods, which combine the information captured in the original model with the information from the new patients, rather than developing an entirely new prediction model.
The previously described updating methods23 vary from simple recalibration (reestimation of the intercept or slope of the linear predictor) to more extensive revisions, such as reestimation of some or all regression coefficients and model extension with new predictors. Before stepwise addition to the model, the distributions of the 3 new candidate predictors were checked for normality. The AUC-ROCs of the updated models were calculated, and the change in Akaike information criterion (AIC) between each updated model and the recalibrated model was compared.24 A model in which the AIC was at least 2 points lower than that of the recalibrated model was considered an improved model. In this improved model, the reestimated predictors were shrunk toward the recalibrated model, any new predictors were shrunk toward zero, and the intercept was then reestimated.
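The simplest updating methods refit the intercept, or the intercept plus a single slope, of the original model's linear predictor. The following sketch does this by Newton-Raphson on synthetic data whose "true" coefficients are chosen for the example (not the study's data or model):

```python
import math
import random

def sigmoid(x):
    # numerically stable logistic function
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def recalibrate(y, lp, refit_slope=True, iters=25):
    """Refit logit(p) = a + b*lp by Newton-Raphson, where lp is the linear
    predictor of the original model. With refit_slope=False, only the
    intercept is updated (b stays 1)."""
    a, b = 0.0, 1.0
    for _ in range(iters):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for yi, xi in zip(y, lp):
            p = sigmoid(a + b * xi)
            w = p * (1.0 - p)
            g_a += yi - p            # gradient wrt intercept
            g_b += (yi - p) * xi     # gradient wrt slope
            h_aa += w                # Hessian entries (observed information)
            h_ab += w * xi
            h_bb += w * xi * xi
        if refit_slope:
            det = h_aa * h_bb - h_ab * h_ab
            a += (h_bb * g_a - h_ab * g_b) / det
            b += (h_aa * g_b - h_ab * g_a) / det
        else:
            a += g_a / h_aa
    return a, b

# Synthetic miscalibration: outcomes actually follow logit -1 + 0.5*lp,
# so the original predictions (intercept 0, slope 1) are too extreme.
random.seed(1)
lp = [random.gauss(0, 2) for _ in range(2000)]
y = [1 if random.random() < sigmoid(-1 + 0.5 * x) else 0 for x in lp]
a, b = recalibrate(y, lp)
print(round(a, 2), round(b, 2))  # close to -1 and 0.5
```

A slope below 1, as recovered here, is the signature of predictions that are "too extreme", which is exactly the calibration pattern reported for the original model in this validation cohort.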
To further assess improved discrimination, we evaluated the degree of correct reclassification using the continuous net reclassification improvement (cNRI), which is more sensitive to change than the AUC-ROC. 25 The cNRI of the updated model was compared to the recalibrated model. We also assessed the cNRI of the APACHE IV score and maximal SOFA score in the first 2 ICU days.
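The cNRI can be computed directly from paired risk predictions of two models; a minimal sketch on toy data (not study data):

```python
def cnri(y, p_old, p_new):
    """Continuous net reclassification improvement: net proportion of events
    whose predicted risk increases under the new model, plus net proportion
    of non-events whose predicted risk decreases. Range: -2 to +2 (often
    reported as a percentage)."""
    ev = [(pn > po) - (pn < po) for yi, po, pn in zip(y, p_old, p_new) if yi == 1]
    ne = [(po > pn) - (po < pn) for yi, po, pn in zip(y, p_old, p_new) if yi == 0]
    return sum(ev) / len(ev) + sum(ne) / len(ne)

# Toy check: the new model moves every event up and every non-event down,
# giving the maximum possible cNRI.
y = [1, 1, 0, 0]
p_old = [0.4, 0.5, 0.5, 0.6]
p_new = [0.6, 0.7, 0.3, 0.4]
print(cnri(y, p_old, p_new))  # → 2.0
```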
As a sensitivity analysis to assess the influence of missing data, we examined calibration and discrimination in data sets in which missing data were imputed using multivariate imputation by chained equations (10 iterations of 10 imputations).26 All predictors and the outcome (ICU-AW) were used for the imputation model, and we checked the validity of the imputed data. The AUC-ROCs and corresponding confidence intervals (CIs) of the 10 imputed data sets were averaged using Rubin's rules, a method that takes into account variation within and between multiply imputed data sets.27
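Rubin's rules pool an estimate and its uncertainty across imputed data sets; a sketch with illustrative AUC values and standard errors (not the study's numbers):

```python
import math

def pool_rubin(estimates, ses):
    """Pool m imputation-specific estimates and standard errors using
    Rubin's rules: total variance = within-imputation variance plus
    (1 + 1/m) times between-imputation variance."""
    m = len(estimates)
    qbar = sum(estimates) / m                               # pooled estimate
    ubar = sum(se * se for se in ses) / m                   # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)   # between-imputation variance
    t = ubar + (1.0 + 1.0 / m) * b                          # total variance
    return qbar, math.sqrt(t)

# Hypothetical AUCs from 10 imputed data sets with a common standard error
aucs = [0.59, 0.60, 0.58, 0.61, 0.59, 0.60, 0.58, 0.59, 0.60, 0.59]
ses = [0.03] * 10
est, se = pool_rubin(aucs, ses)
print(round(est, 3), round(se, 3))
```

Note that the pooled standard error is always at least as large as the average within-imputation standard error, because between-imputation variation is added on top.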
A second sensitivity analysis assessed the influence of the difference in inclusion criteria between the development study and the validation study. We repeated the analysis for patients who had been mechanically ventilated for 2 days at the time of inclusion. As an additional sensitivity analysis, we assessed the performance of the model in only the patients of the hospital in which the model was developed.
Furthermore, we used the combined data from the development and validation cohorts to build a new prediction model. Predictor selection was done as comprehensively described in the development study.13 In short, we used bootstrapped backward selection and selected those predictors that were selected in >50% of the bootstrap samples (n = 1000).
Proportions are presented as percentages with total numbers, mean values with standard deviation, and median values with interquartile range. Differences between proportions were assessed using the χ2 test and differences between normally distributed variables using Welch's t test.
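Welch's t statistic, with the Welch-Satterthwaite degrees of freedom, can be sketched as follows (toy data for illustration):

```python
import math

def welch_t(x, y):
    """Welch's unequal-variance t statistic and approximate degrees of
    freedom for two independent samples."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((xi - mx) ** 2 for xi in x) / (nx - 1)  # sample variances
    vy = sum((yi - my) ** 2 for yi in y) / (ny - 1)
    se2 = vx / nx + vy / ny
    t = (mx - my) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df

x = [5.1, 4.9, 5.4, 5.0, 5.2]
y = [4.2, 4.5, 4.1, 4.4]
t, df = welch_t(x, y)
print(round(t, 2), round(df, 1))
```

Unlike Student's t test, this version does not assume equal variances in the two groups, which is why the degrees of freedom are estimated rather than fixed at nx + ny − 2.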
Power Calculation
Empirical evidence suggests a minimum of 100 events and 100 nonevents for external validation studies.28 With an ICU-AW incidence of about 50%, at least 200 patients were needed for validation (and updating). To further validate an updated model, 200 additional patients would be needed. We aimed to include at least 500 patients to account for patients in whom MRC measurements could not be performed (eg, because they died before strength could be measured or had ongoing delirium).
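The events-based sample-size reasoning above amounts to simple arithmetic; a sketch:

```python
import math

def min_sample_size(incidence, min_events=100, min_nonevents=100):
    """Smallest n such that the expected number of events (n * incidence)
    and nonevents (n * (1 - incidence)) both reach the required minimum."""
    return max(math.ceil(min_events / incidence),
               math.ceil(min_nonevents / (1.0 - incidence)))

print(min_sample_size(0.5))  # → 200, as in the study's calculation
# The target of 500 inclusions then adds a margin for patients in whom
# strength cannot be measured (death, transfer, ongoing delirium, etc).
```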
Results
Screened and Included Patients
Figure 1 displays the flowchart. Consecutive ICU patients were screened for inclusion from February 2014 to December 2015. A total of 538 patients fulfilled the inclusion criteria and did not meet the exclusion criteria. In 349 (65%) patients, muscle strength could be assessed; 190 (54%) of these patients were classified as having ICU-AW. Loss to follow-up was larger than expected; we therefore deviated from the analysis plan and chose to only validate and update the model, which meant that no separate cohort of patients was left for validation of the updated model.

Flowchart of screened and included patients. Center 1 is the center in which the original model was developed. ICU-AW indicates intensive care unit–acquired weakness; MRC, Medical Research Council.
Relatedness Between the Development and External Validation Cohorts
Table 1 shows the study and patient characteristics of the development and validation studies. Table 2 shows the distributions of the assessed predictors in the development and external validation cohorts.
Study and Patient Characteristics.
Abbreviations: APACHE IV, Acute Physiology and Chronic Health Evaluation IV; ICU-AW, intensive care unit–acquired weakness; IQR, interquartile range; LOS ICU, length of stay in the intensive care unit; MRC, Medical Research Council; MV, mechanical ventilation; SD, standard deviation; SOFA, Sequential Organ Failure Assessment.
Distributions of Candidate Predictors.a
Abbreviations: Ca, calcium; ICU, intensive care unit; IQR, interquartile range; P/F, PaO2/FiO2 ratio.
aThe predictors in italics are those included in the original prediction model.
Performance of the Original Model in the Validation Cohort
The original model was applied to our validation cohort. Calibration was poor with evidence for lack of fit (Figure 2). The predictions were too extreme: for low predicted probabilities by the model, the true fraction with ICU-AW was higher; and for high predicted probabilities, the true fraction was lower. The AUC-ROC was 0.60 (95% CI: 0.54-0.66), which is interpreted as poor discrimination.

Model performance: calibration and discrimination of original model. A, The model calibration assessed with a fitted curve based on Loess regression with 95% confidence interval. Perfect calibration is illustrated by the dotted line. Triangles represent deciles of predicted probability and grey points represent predicted probabilities of individual patients. Goodness of fit was assessed with the Hosmer-Lemeshow test. B, Model discrimination assessed with the receiver operating characteristic curve. AUC, area under the curve; ICU-AW indicates intensive care unit–acquired weakness.
Model Updating
We tried several methods to update the model (Table 3 and Figure 3), using recalibration, reestimation, and extension with new candidate predictors. Model updating using method 6, in which the new candidate predictors were added one-by-one to the recalibrated model (method 3), improved the model only when lowest phosphate was added. With all updating methods, calibration improved, but the AUC-ROC remained 0.60 (95% CI: 0.54-0.66). The cNRI of the updated model was similar to that of the recalibrated model.
Model Updating Results.a,b
Abbreviation: AUC-ROC, area under the receiver operating characteristic curve.
aMethod 1 is the original model. The model was recalibrated by adjusting only the intercept (method 2) or both the intercept and slope (method 3). With method 4, we investigated whether predictors had a clearly different effect in the validation cohort, by selective reestimation of one or more of the included predictors. None of the selective reestimations improved the model; therefore, no selective reestimations were retained. In method 5, the model was fitted to the validation data by reestimation of the intercept and the regression coefficients of all predictors. In method 6, the 3 new predictors were added one-by-one to the recalibrated model; only adding lowest phosphate improved the model. In method 7, model 5 was extended with the new predictors. In method 8, a model with all old and new predictors was assessed.
bShrinkage was applied to the improved model (method 6), and the intercept was recalculated.
cUniform shrinkage factor applied.
dShrinkage toward recalibrated values.
eTransformed using the natural logarithm.
fShrinkage toward zero.

Calibration plots of updated models. Model calibration of the updated models from Table 3 were assessed with a fitted curve based on Loess regression with 95% confidence interval. Perfect calibration is illustrated by the dotted line. Triangles represent deciles of predicted probability and grey points represent predicted probabilities of individual patients. Goodness of fit was assessed with the Hosmer-Lemeshow test.
Comparison With SOFA and APACHE IV Scores
The AUC-ROC of the maximal SOFA score in the first 2 days after admission for prediction of ICU-AW in the validation cohort was 0.63 (95% CI: 0.58-0.69), and the AUC-ROC of the APACHE IV score was 0.63 (95% CI: 0.57-0.69). Compared to the SOFA score, the updated model reduced classification by 31% (cNRI; 95% CI: 9-52), whereas it performed as well as the APACHE IV score (21% [95% CI: 1-43]).
Sensitivity Analysis
Of the predictors used in external validation and updating analyses, highest lactate levels were missing in 2 patients and lowest phosphate in 8 patients; the other predictors included in the original model did not have missing values. The combined AUC-ROC of the imputed data sets was 0.59 (95% CI: 0.53-0.65).
When the original model was applied only to patients who had been mechanically ventilated for 2 days at the time of inclusion (n = 291), the AUC-ROC was 0.58 (95% CI: 0.51-0.64), and when it was applied to the patients of the hospital of the development study (center 1, n = 123), the AUC-ROC was 0.59 (95% CI: 0.49-0.69).
New Prediction Model
The following predictors were included in >50% of the bootstrap samples: RASS score, gender, highest lactate, lowest P/F ratio, highest glucose, and ICU treatment with corticosteroids. In the final model (based on 536 patients because of missing values for RASS score [n = 6] and lactate [n = 19]), RASS score, gender, highest lactate, and treatment with corticosteroids were included (selected by a drop in AIC >2). A uniform shrinkage factor (0.94) was applied to adjust for overfitting. The new prediction model is described with the following formula:
Calibration was excellent (Figure 4). The AUC-ROC after internal validation was 0.70 (95% CI: 0.66-0.75). Discrimination improved when using the new prediction model compared to the SOFA or APACHE IV score (cNRI: 38% [95% CI: 21-55] and 30% [95% CI: 13-47], respectively).
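As a hedged illustration of how such a shrunken logistic model turns predictor values into a predicted ICU-AW probability: the predictor names follow the new model described above, but the coefficient values below are hypothetical placeholders, not those of the published formula.

```python
import math

# Hypothetical coefficients for illustration only; the study's actual
# coefficients are given in its published formula.
INTERCEPT = -1.2
COEFS = {"rass": 0.15, "male": 0.40, "lactate_log": 0.80, "corticosteroids": 0.55}
SHRINKAGE = 0.94  # uniform shrinkage factor, as applied in the study

def predict_icuaw(features):
    """Apply a shrunken logistic model: each coefficient is multiplied by
    the uniform shrinkage factor before computing the linear predictor,
    which is then converted to a probability."""
    lp = INTERCEPT + sum(SHRINKAGE * COEFS[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-lp))

# Lactate is log-transformed, matching the natural-logarithm note in Table 3
risk = predict_icuaw({"rass": -1, "male": 1,
                      "lactate_log": math.log(2.5), "corticosteroids": 0})
print(round(risk, 2))
```

Shrinking the coefficients pulls predictions toward the average risk, which counteracts the overfitting that makes unshrunken predictions too extreme in new patients.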

Calibration plot of new model. Calibration plot of new model based on combined data of the development and validation cohort. Model calibration was assessed with a fitted curve based on Loess regression with 95% confidence interval. Perfect calibration is illustrated by the dotted line. Triangles represent deciles of predicted probability and grey points represent predicted probabilities of individual patients. Goodness of fit was assessed with the Hosmer-Lemeshow test. ICU-AW indicates intensive care unit–acquired weakness.
Discussion
In this study, we assessed the performance of a previously developed prediction model for ICU-AW. The model showed both poor calibration and poor discrimination in our new patient cohort. Updating methods improved calibration but not discrimination. A new prediction model based on combined data from the development and validation cohorts showed excellent calibration, with an AUC-ROC of 0.70 after internal validation, and classified patients better than the SOFA and APACHE IV scores.
Reasons for Poor Performance of the Model
Poor performance in new data sets is often seen and can have several causes.16 First, it can be caused by differences in case mix. The distributions of baseline characteristics and predictors differed between the development and the validation cohort. Patients in the validation cohort seemed to be less severely ill, as indicated by lower SOFA scores, less frequent sepsis, fewer days of mechanical ventilation, less repeated administration of neuromuscular blockers, and earlier MRC assessment; on the other hand, these patients had lower urine production and lower P/F ratios. A major difference in the use of aminoglycosides was seen, because they were used regularly in the center in which the model was developed but less frequently in the other centers. The differences in case mix cannot solely be explained by the multicenter design, since these differences were also seen when the development cohort was compared with validation cohort patients only from the hospital in which the model was developed. Although no major changes in standard of care were noted, unrecognized changes in care over time may cause differences in case mix, which could be a reason for failed temporal validation.
Besides differences in subject-level characteristics as described previously, differences in study-level characteristics (such as inclusion criteria) can also lead to worse performance. Our inclusion criteria in the validation cohort differed slightly from those of the development cohort, possibly selecting less severely ill patients because some patients had a shorter duration of mechanical ventilation in the first 48 hours after admission. We chose this inclusion criterion to make inclusion easier for the investigators and to increase the number of eligible patients, assuming that this population would be comparable to the population in the development study. However, sensitivity analyses showed that when the original model was applied to only those patients who had been mechanically ventilated for 2 days at inclusion (83% of the patients in the validation cohort), as in the development cohort, calibration and discrimination remained poor. Thus, differences in inclusion criteria do not explain the poor performance.
Finally, the fact that the performance of the original model could not be reproduced in the validation cohort may be attributable to the small sample size of the development study, causing unstable predictions and possibly incorrect predictor selection. Indeed, in the model newly developed on the combined development and validation cohorts, other predictors (except for lactate) were selected.
Importance of External Validation
The performance of any prediction model tends to be lower than expected when it is applied to new patients.29 Therefore, every prediction model should be validated in new individuals before it is applied in practice or implemented in guidelines. This step is, erroneously, often skipped.16
This study underlines the importance of external validation. It showed that the generalizability and transportability of the previously developed model were poor and that the original model could thus not be used in clinical practice, even after extensive updating. Even the maximal SOFA score in the first 2 days after admission predicted ICU-AW better than the updated models.
Limitations of the Study
This study has some limitations. Of all patients in whom strength could not be measured (n = 189), 64 were transferred before they were attentive. As the clinical condition of these patients allowed a transfer from the ICU to the ward, this group may have been less severely ill and may have contained fewer patients with ICU-AW, masking the true incidence of ICU-AW in the validation cohort.
Furthermore, because strength measurements were available in fewer patients than anticipated, we did not have enough data to validate the model, update it, and then externally validate the updated or new model. Therefore, we used all available data to validate and update the model. Future studies should account for a greater loss of patients (due to death, transfer, delirium, etc).
Development of a New Prediction Model
Model updating did not result in a useful model with sufficient discrimination; therefore, a new model was developed using the development and validation cohorts together. This new model included RASS score, gender, highest lactate, and treatment with corticosteroids as predictors. The new model was based on a much larger cohort than the original development cohort, resulting in more stable estimates. The AUC-ROC was fair (0.70 [95% CI: 0.66-0.75]) and comparable with that of the original model. External validation is needed to confirm performance and clinical usefulness in a new validation cohort.
Recently, another prediction model for ICU-AW was proposed,30 including the following predictors: steroid therapy, intensive insulin therapy, number of days on mechanical ventilation, sepsis, renal failure, and hematologic failure. This model, based on data from 4157 patients who were mechanically ventilated for at least 12 hours, of whom only 3% had ICU-AW, showed good discrimination (AUC-ROC: 0.81 [95% CI: 0.78-0.84]). Calibration was, however, not reported, and external validation was not performed. In that study, the definition of ICU-AW was operational and not based on the MRC score; therefore, patients with mild-to-moderate ICU-AW were likely missed, explaining the very low ICU-AW incidence (3%) in their cohort. These differences make comparison of the study results difficult. No other studies investigating the early prediction of ICU-AW using clinical parameters have been published.
Conclusions
External validation of a previously developed prediction model for ICU-AW showed poor calibration and discrimination. Updating methods improved calibration but not discrimination. A new prediction model using data from the development and validation cohort showed fair discrimination and classified patients better than the APACHE IV and the SOFA scores. However, early prediction of ICU-AW, using clinical parameters, with good discrimination seems to be challenging.
