
Abstract

When a prediction model developed on clustered data does not account for cluster heterogeneity in its parameterization, cluster-based validation is crucial for assessing how well the model generalizes to unseen clusters. This article introduces a clustered estimator of the network information criterion (NIC) that approximates the leave-one-cluster-out deviance for standard prediction models with twice-differentiable log-likelihood functions, and so serves as a fast alternative to cluster-based cross-validation. Stone proved that the Akaike information criterion (AIC) is asymptotically equivalent to leave-one-observation-out cross-validation for true parametric models with independent and identically distributed observations. Ripley noted that the NIC, derived from Stone’s proof, is a better approximation when the model is misspecified. For clustered data, we derive the clustered NIC by substituting the Fisher information matrix in the NIC with a clustering-adjusted estimator. The clustered NIC imposes a greater penalty when the data exhibit stronger clustering, which helps it guard against over-parameterization. In a simulation study and an empirical example, we used standard regression to develop prediction models for clustered data with Gaussian or binomial responses. Compared with the commonly used AIC and Bayesian information criterion (BIC) for standard regression, the clustered NIC approximates the leave-one-cluster-out deviance much more accurately and yields more accurate model size and variable selection, as judged by cluster-based cross-validation, especially when the data exhibit strong clustering.
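To make the idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the network information criterion takes the commonly stated form deviance + 2 tr(J⁻¹K), where J is the observed information (negative Hessian of the log-likelihood) and K is the sum of outer products of score vectors. It further assumes that the clustering adjustment sums the scores within each cluster before forming K, as in cluster-robust sandwich estimators. The exact estimator in the article may differ. The function names `fit_logistic` and `information_criteria`, the key `nic_clustered`, and the simulated data are all hypothetical.

```python
import numpy as np


def fit_logistic(X, y, n_iter=50, tol=1e-10):
    """Maximum-likelihood logistic regression via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)                           # score vector
        hess = X.T @ (X * (p * (1.0 - p))[:, None])    # observed information
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def information_criteria(X, y, clusters):
    """Deviance, AIC, and two NIC-style criteria for a logistic model.

    ASSUMPTION: criterion = deviance + 2 * trace(J^{-1} K); the clustered
    variant builds K from cluster-level score sums. This is an illustrative
    form, not necessarily the article's exact estimator.
    """
    beta = fit_logistic(X, y)
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    deviance = -2.0 * np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

    scores = X * (y - p)[:, None]                # one score vector per row
    J = X.T @ (X * (p * (1.0 - p))[:, None])     # observed information
    K_iid = scores.T @ scores                    # treats rows as independent

    # Sum the score vectors within each cluster, then take outer products.
    _, idx = np.unique(clusters, return_inverse=True)
    cluster_scores = np.zeros((idx.max() + 1, X.shape[1]))
    np.add.at(cluster_scores, idx, scores)
    K_cluster = cluster_scores.T @ cluster_scores

    J_inv = np.linalg.inv(J)
    return {
        "deviance": deviance,
        "aic": deviance + 2.0 * X.shape[1],
        "nic": deviance + 2.0 * np.trace(J_inv @ K_iid),
        "nic_clustered": deviance + 2.0 * np.trace(J_inv @ K_cluster),
    }


# Hypothetical simulated data: 40 clusters of 10 observations, a random
# intercept per cluster, and a cluster-level covariate z with no true effect.
rng = np.random.default_rng(0)
n_clusters, cluster_size = 40, 10
clusters = np.repeat(np.arange(n_clusters), cluster_size)
u = rng.normal(0.0, 1.0, n_clusters)[clusters]   # cluster random intercept
x = rng.normal(size=n_clusters * cluster_size)   # observation-level covariate
z = rng.normal(size=n_clusters)[clusters]        # cluster-level covariate
X = np.column_stack([np.ones_like(x), x, z])
eta = -0.5 + 0.8 * x + u
y = (rng.random(x.size) < 1.0 / (1.0 + np.exp(-eta))).astype(float)

result = information_criteria(X, y, clusters)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions, the clustered penalty reduces to the ordinary one when every observation forms its own cluster, and for a correctly specified model with independent observations the trace term is close to the number of parameters, which recovers the AIC penalty. Within-cluster correlation of the scores is what would inflate the clustered penalty in this sketch.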

Keywords

Predictive modeling, clustered data, network information criterion, cluster-based cross-validation, Fisher information matrix




Fast leave-one-cluster-out cross-validation using clustered network information criterion
