Abstract
Introduction
According to the International Diabetes Federation (IDF) 2021 Atlas, 537 million adults aged 20–79 worldwide, approximately 1 in 10, are currently living with diabetes.1,2 This alarming statistic is projected to rise to 784 million by 2045, with China bearing one of the heaviest burdens of diabetes worldwide.1,3 Type 2 diabetes accounts for approximately 90% of adult diabetes and is the primary focus of our research.1,4 Diabetic kidney disease (DKD), a common complication of diabetes, arises from damage to small blood vessels, leading to impaired kidney function or failure.5,6 The global incidence of DKD due to type 2 diabetes has surged from approximately 1.4 million cases in 1990 to 2.4 million in 2017, indicating a 74% increase.1,2
Diabetes and DKD have become significant global public health concerns.2,7,8 Despite numerous studies analyzing risk factors for diabetic kidney damage in type 2 diabetes patients,9–11 there remains a need for research that utilizes a more comprehensive set of multidimensional clinical data to explore these associations in depth. Our study addresses this gap by incorporating a wide range of clinical variables and employing efficient machine learning techniques for a thorough analysis. Our study used a large-scale dataset encompassing 70,000 individuals, incorporating more than 60 feature variables including epidemiological characteristics, geographical information, clinical test results, and medical prescription histories.
Identifying kidney complications in patients with diabetes based on clinical data was a crucial task in our research. Early detection and intervention for kidney damage in patients with diabetes are essential for patients and healthcare professionals.12 Compared with regression analyses, machine learning methods offer clear advantages in classifying high-dimensional data.11–13 Leveraging a retrospective cohort with 10 years of electronic medical record data from our hospital, focusing on inpatients with type 2 diabetes, we trained several interpretable machine learning models.14,15 These models aim to predict the occurrence of DKD in type 2 diabetes and analyze the factors influencing its development. Our research seeks to provide data and model support for the establishment of a comprehensive DKD prevention and control system, to enhance the accuracy of DKD diagnosis, and to support early prevention.
Data and method
Ethics approval statements
Data collection and analysis in this study were approved by the Ethics Committee of the First Affiliated Hospital of Zhengzhou University (No. 2023-KY-0810). All research processes were anonymous and retrospective, with the need for written informed consent waived.
Study population
Our study used a large, retrospective, single-center cohort from the First Affiliated Hospital of Zhengzhou University and comprised data from hospitalized patients with type 2 diabetes from January 2013 to December 2022. The dataset included information from 101,896 individuals. Individuals aged ≥18 years diagnosed with type 2 diabetes were included. The exclusion criteria included missing data, an initial hospitalization diagnosis of advanced cancer, or death (Figure 1(a)). Finally, 73,101 individuals were included in the analysis.

Figure 1. Geographic, demographic, and temporal insights, and study workflow in diabetes patient hospitalization analysis. (a) Data curation and data analysis process. (b) Residential geographic distribution of the population. (c) Gender distribution in the NDKD and DKD groups. (d) Number of persons with diabetes newly admitted in different years. (e) Overall process of the study: diabetes patients hospitalized once or multiple times, data recorded in the hospital electronic medical record system; after data acquisition, the model is trained, evaluated, and analyzed.
The provided data had been anonymized and included demographic characteristics, laboratory data, comorbidities, and medication information. In total, 67 distinct variables were considered in our analysis, encompassing height, weight, blood pressure, laboratory results including blood and urine analyses, and medication details, all uniformly recorded at admission. Laboratory data were primarily obtained from the laboratory department. Urine protein quantification and urinary albumin-to-creatinine ratio (UACR) assessments were mainly performed in the renal disease laboratory. In this study, the endpoint event was defined as the clinical diagnosis of DKD.
Comorbidities and medication history
The occurrence of diabetic complications was primarily determined based on the information provided in medical records, and these data are routinely submitted to health authorities for quality assessment. We defined the incidence of complications such as coma; ketoacidosis; ophthalmic, neurological, and peripheral circulatory complications; and diabetic arthropathy based on ICD-10 codes16 (refer to the figure for the incidence rates of these complications in both groups). ICD-10 codes were also used to determine the occurrence of hypertension (I10–I15), heart disease (I20–I52), and cerebrovascular diseases (I60–I69).
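The code-range rule for comorbidities can be illustrated with a minimal sketch. The three ranges are those quoted above; the function names, the flag labels, and the assumption that each diagnosis arrives as a string such as "I63.9" are hypothetical and are not taken from the study's pipeline.

```python
import re

# Code ranges quoted in the text; the group labels are illustrative names.
COMORBIDITY_RANGES = {
    "hypertension": ("I10", "I15"),
    "heart_disease": ("I20", "I52"),
    "cerebrovascular_disease": ("I60", "I69"),
}

_CODE_RE = re.compile(r"^([A-Z])(\d{2})")


def icd10_category(code):
    """Return the three-character category (e.g. 'I10') of an ICD-10 code, or None."""
    match = _CODE_RE.match(code.strip().upper())
    return match.group(1) + match.group(2) if match else None


def comorbidity_flags(codes):
    """Map a patient's ICD-10 codes to one boolean flag per comorbidity group."""
    flags = {name: False for name in COMORBIDITY_RANGES}
    for code in codes:
        category = icd10_category(code)
        if category is None:
            continue  # skip strings that do not look like ICD-10 codes
        for name, (low, high) in COMORBIDITY_RANGES.items():
            # One letter plus two digits, so string order matches code order.
            if low <= category <= high:
                flags[name] = True
    return flags


example = comorbidity_flags(["E11.5", "I10", "I63.9"])
```

In this sketch a patient is flagged for a group as soon as any one of their codes falls inside the corresponding range.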
Information regarding medication history was sourced from prescription records and follow-up data, and the use of combination drugs was determined based on their primary components. This approach ensured a comprehensive understanding of the occurrence of comorbidities and medication usage patterns and provided valuable insights for our analysis.
Datasets and classifiers
The final dataset, consisting of 73,101 individuals, was randomly divided into training (80%) and testing (20%) datasets. All models were trained in Python 3.8 using widely adopted machine learning methods for binary classification.12,13 These methods include boosting-based algorithms such as XGBoost,17 CatBoost,18 LightGBM,19 AdaBoost,20 and gradient boosting decision tree (GBDT); stochastic gradient descent (SGD); and other commonly used tree-based methods, including decision tree and random forest. The use of these diverse classifiers ensured a comprehensive exploration of the predictive models in our analysis.
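A random 80/20 split of this kind might look like the following sketch. It uses only the Python standard library; the function name, the fixed seed, and the use of integer patient identifiers are assumptions made for illustration, since the text does not report how the split was implemented or whether it was stratified.

```python
import random


def train_test_split_ids(patient_ids, test_fraction=0.2, seed=42):
    """Shuffle patient identifiers and split them into training and testing lists."""
    ids = list(patient_ids)
    rng = random.Random(seed)  # the seed is an assumption; none is reported in the text
    rng.shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    # The first n_test shuffled identifiers form the test set, the rest the training set.
    return ids[n_test:], ids[:n_test]


train_ids, test_ids = train_test_split_ids(range(73101))
```

Splitting on patient identifiers rather than on individual records keeps all rows for one patient on the same side of the split, which matters if a patient contributes more than one admission.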
Hyperparameter optimization
Hyperparameter optimization of the models was conducted using Optuna (v3.5.0; https://optuna.org/).21 Optuna is an open-source project that employs Bayesian optimization algorithms to search the hyperparameter space, providing support for a wide range of machine learning frameworks. This approach enhanced the efficiency of our models by systematically tuning the hyperparameters to achieve optimal performance.
Performance evaluation
To assess model performance, we calculated and compared various metrics, including the accuracy, precision, recall, area under the receiver operating characteristic curve (AUC), and the area under the precision-recall curve (AUPR). These metrics provide comprehensive insights into the predictive capabilities of the models.22
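As an illustration of what the AUC measures, the sketch below computes it as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, counting ties as one half. The labels and scores are made-up values, and this pairwise form is shown for clarity only; it is not the routine used in the study.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = DKD) and predicted scores for six hypothetical patients.
labels = [1, 0, 1, 0, 1, 0]
predicted = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(labels, predicted)
```

The pairwise loop is quadratic in the number of cases, so it suits small examples; library implementations use sorting to reach the same value more efficiently.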
Precision is defined as the ratio of correctly predicted positive observations to the total number of predicted positives and indicates the accuracy of the positive predictions made by the model, revealing the number of predicted positive instances that are true positives. Recall is the ratio of correctly predicted positive observations to all the actual positives. Recall assesses the ability of a model to capture all the actual positive instances, highlighting its sensitivity to positive cases.
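The two definitions above can be written out directly from the confusion counts. The following is a minimal sketch with made-up labels, where 1 denotes DKD and 0 denotes no DKD; the function names are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives, and true negatives."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up example: eight hypothetical patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
precision, recall = precision_recall(y_true, y_pred)
```

Here three of the five predicted positives are correct, and three of the four actual positives are found, so the two metrics differ even though they share the same numerator.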
In addition, we obtained the importance levels of the different factors within each model and conducted a comparative analysis to interpret the significance of these factors in the occurrence of DKD. This comprehensive evaluation provided a nuanced understanding of the predictive capabilities of the models and insights into the relative importance of individual factors in the development of DKD.
SHAP (SHapley Additive exPlanations)
Shapley values, which have desirable properties, are widely used in cooperative game theory.23 We used the SHAP Python package (v0.44) to explain progressively more complex models, such as XGBoost and CatBoost.24 SHAP values not only can provide a unified explanation for different models but can also explain how different factors influence predictions in various directions. We used SHAP to create bar charts of the average SHAP values and beeswarm summary plots for the groups with and without diabetic nephropathy. Additionally, we provided an example waterfall plot for one patient in each group. The waterfall plot starts with the base value (the average prediction across the dataset) and sequentially adds the effect of each feature, thus illustrating how the cumulative addition of features leads to the final prediction. This method provides a clear and intuitive way to understand the incremental impact of each feature on the model's output.
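The additive property behind the waterfall plot can be checked on a toy example. The sketch below computes exact Shapley values by averaging marginal contributions over every feature ordering. The scoring function, its coefficients, and the feature names are hypothetical, and a single all-zero reference patient stands in for the dataset-average base value; the SHAP package uses faster model-specific algorithms rather than this brute-force enumeration.

```python
from itertools import permutations


def toy_model(features):
    """Hypothetical risk score with one interaction term; not the study's model."""
    score = 0.1
    score += 0.4 * features["uacr_high"]
    score += 0.2 * features["other_kidney_disease"]
    score += 0.1 * features["uacr_high"] * features["high_bmi"]
    return score


def exact_shapley(model, instance, baseline):
    """Average each feature's marginal contribution over all feature orderings."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = model(current)
        for name in order:
            current[name] = instance[name]  # switch this feature to the patient's value
            value = model(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}


patient = {"uacr_high": 1, "other_kidney_disease": 1, "high_bmi": 1}
reference = {"uacr_high": 0, "other_kidney_disease": 0, "high_bmi": 0}

base_value = toy_model(reference)
contributions = exact_shapley(toy_model, patient, reference)
# Waterfall logic: the base value plus all contributions equals the prediction.
reconstructed = base_value + sum(contributions.values())
```

The interaction between the two flags is shared equally between them, which is why the contribution of `uacr_high` exceeds its standalone coefficient.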
Statistical analysis and software
Statistical analyses were performed using Excel (Microsoft Corporation) and Python (v3.8, Python Software Foundation). For visualization, we employed matplotlib (v3.8.2),25 and the machine learning framework used was scikit-learn (v1.4).15
For normally distributed continuous data, descriptive statistics, such as means and standard deviations, were employed. Non-normally distributed continuous data were characterized using the median and interquartile range. Categorical data are presented as percentages. Univariate analysis was used to identify potential correlation factors. Statistical tests, including the Mann–Whitney
Result
Demographic and clinical characteristics
We analyzed the data of 73,101 patients hospitalized with type 2 diabetes at the First Affiliated Hospital of Zhengzhou University between January 2013 and December 2022. The dataset included demographic characteristics, laboratory data, comorbidities, and medication information, as shown in Table 1. Our study population covered all provincial-level administrative units in China except the Macao Special Administrative Region (Figure 1(b)), demonstrating considerable representativeness, especially in the Yellow and Huai River basins. The average age of diabetes patients in our cohort was 59.68 ± 12.23 years, with men constituting 44.8% in the non-diabetic kidney disease (NDKD) group and 37.9% in the DKD group (Figure 1(c)). The independent sample
Table 1. Demographic and clinical characteristics.
Data are expressed as percentage (
BMI: body mass index; SBP: systolic pressure; DBP: diastolic pressure; BUN: blood urea nitrogen; eGFR: estimated glomerular filtration rate; TG: triglycerides; TC: total cholesterol; Scr: serum creatinine; HbA1c: glycated hemoglobin A1c; AST: aspartate aminotransferase; ALT: alanine aminotransferase; γ-GT: γ-glutamyl transpeptidase; ALP: alkaline phosphatase; PT: prothrombin time; APTT: activated partial thromboplastin time; TT: thrombin time; CRP: C-reactive protein; ESR: erythrocyte sedimentation rate; 24h-UP: 24-h urine protein quantification; UACR: urine albumin-to-creatinine ratio.
In analyzing complications and medication information, it was evident that hypertension was the most prevalent non-diabetes-related complication in DKD, a finding corroborated by a macro-level data analysis.1,26 In the hospitalized DKD patient group, 68.8% also had hypertension, compared with 52.8% in the NDKD group. Peripheral circulatory complications emerged as the most common diabetic complications in patients with DKD, affecting approximately 34.8%. Ophthalmic and neurological complications occurred in 29.9% and 27.5% of patients, respectively. In comparison, among patients without DKD, these numbers were notably lower at 10.2%, 7.5%, and 6.5%, respectively (Supplementary Table S1). Ophthalmic complications in diabetes mellitus (E11.3) include diabetic cataract (H28.0*) and diabetic retinopathy (H36.0*); neurological complications encompass diabetic amyotrophy (G73.0*), diabetic autonomic neuropathy (G99.0*), diabetic mononeuropathy (G59.0*), and diabetic polyneuropathy (G63.2*). These complications were categorized according to ICD-10 diagnostic codes.
Performance comparison
We employed a diverse array of machine learning methods utilizing 67 variables as input features to predict DKD. The models applied include XGBoost,17 CatBoost,18 LightGBM,19 AdaBoost, GBDT, SGD, and random forest.15 The receiver operating characteristic and precision-recall curves are illustrated in Figure 2(a) and (b), respectively. The model performance metrics, including the AUC, AUPR, accuracy, precision, and recall, are presented in Table 2.

Figure 2. Performance evaluation and SHAP for model interpretation. (a) Receiver operating characteristic (ROC) curve of the testing dataset; (b) precision-recall curve of the testing dataset; (c) frequency count of the top 20 important features for each variable across different models; (d) SHapley Additive exPlanations (SHAP) bar plots of XGBoost and CatBoost: these display the mean absolute SHAP values of each factor, with UACR and non-diabetic kidney disease (NDKD) consistently ranking in the top 2 for both models; (e) summary plots for both models; (f) waterfall plot drawn from a sample dataset of the models: it reveals UACR and serum creatinine, among others, as the primary risk factors leading the model to predict DKD (red indicates positive SHAP values, signifying that these factors are risk elements for the occurrence of DKD; blue indicates the opposite). BMI: body mass index; UP: urine protein; UACR or ACR: urine albumin-to-creatinine ratio; TCR: urine total protein-to-creatinine ratio; UP24h: 24-h urine protein quantification; SCr: serum creatinine; WBC: leukocyte; UA: uric acid; CysC: cystatin C; eGFR: estimated glomerular filtration rate; AST: aspartate aminotransferase; ALT: alanine aminotransferase; TP: serum total protein; Alb: serum albumin; K: serum potassium; ARB: angiotensin receptor blocker; E11.3: with ophthalmic complications; E11.4: with neurological complications; E11.5: with peripheral circulatory complications; E11.7: with multiple diabetic complications; Multi: multiple other complications.
Table 2. Performance of models.
AUC: area under the receiver operating characteristic curve; AUPR: area under the precision-recall curve.
Among the various models, CatBoost and XGBoost demonstrated particularly promising outcomes. In the test set, CatBoost achieved an AUC of 0.97 and an AUPR of 0.84, while XGBoost exhibited an AUC of 0.95 and an AUPR of 0.76. These models showcase the ability to accurately predict the occurrence of DKD. A comprehensive evaluation of the model performance metrics provided insights into the effectiveness of different machine learning approaches in the context of DKD prediction.
Factor analysis
The machine learning models we selected were interpretable,14,15 and the top 20 important features of XGBoost and CatBoost are depicted in Supplementary Figure S3. Among them, UACR, diabetes with peripheral circulatory complications, and other kidney diseases emerged as the top three factors in terms of importance in both models. Supplementary Figure S3 illustrates the frequency with which each factor entered the top 20 feature importance levels across all seven models. Diabetes with peripheral circulatory complications and BMI stood out as the most frequent features, consistently appearing in the top 20 features of every model (Figure 2(c)). This emphasis on interpretability enhances the transparency of our machine learning approach, shedding light on the key factors influencing the prediction of DKD.
In addition, we employed SHAP to analyze and interpret the two best-performing models, XGBoost and CatBoost. The bar plots show the mean absolute SHAP value of each factor within the models. Notably, renal diseases not attributed to diabetes and the UACR were among the top three factors influencing the onset of DKD, underscoring the pivotal role of proteinuria as an early marker for DKD identification (Figure 2(d)). The summary plot shows the Shapley values of the individual features for each sample, highlighting the most influential factors and their respective impacts on the dataset (Figure 2(e)). The risk of developing DKD is elevated in patients with renal diseases unrelated to diabetes. Furthermore, consistent with previous analyses, BMI was identified as one of the top five factors across both models, indicating that a higher BMI correlates with an increased risk of DKD. Additionally, we randomly selected one diabetic patient's data to construct a waterfall plot using SHAP, where blue denotes a negative contribution of the factor and red signifies a positive contribution, illustrating the role of each indicator within the model (Figure 2(f)).
Discussion
In this retrospective study, we used a dataset comprising 67 variables, including demographic characteristics, laboratory data, comorbidities, and medication information, to train and validate multiple machine learning models. This study aimed to predict the onset of DKD in patients with diabetes. Among the models, CatBoost and XGBoost demonstrated superior performance on our dataset, including both the training and validation sets.
In the feature importance analysis of various models, the presence of NDKD emerged as a significant factor for the development of DKD. Patients with other kidney diseases prior to the onset of diabetic renal damage or diabetes are more likely to experience a vicious cycle of diabetic kidney injury. This is considered a dynamic process. Clinically, DKD, NDKD, and a combination of both are often considered static diagnoses. However, NDKD in patients with diabetes may evolve into a combined state of DKD and NDKD owing to the progression of diabetes. Diabetes with peripheral circulatory complications, specifically diabetic gangrene, diabetic peripheral angiopathy, and diabetic ulcers, also ranked highly as a risk factor in many studies.27–29 DKD, as a microvascular complication, is more highly correlated with diabetic retinopathy than with any other diabetes-related complication. This association has been consistently reported in several studies.9,10,30 In our model, diabetes with ophthalmic complications ranked after peripheral circulatory complications and neurological complications.31–33 Compared to the insidious onset of renal damage, peripheral circulatory complications, which are among the more dangerous diabetic complications, tend to present more symptomatic changes due to the severe hyperglycemic state34 and heightened oxidative stress and inflammatory responses.35 They share common risk factors with DKD. Urine protein remains a critical early marker for the identification of DKD. In several models predicting end-stage renal disease in DKD, the urinary albumin level is used to assess the severity of DKD and is a key predictor of its progression.36–40
Consequently, patients with other NDKDs may exhibit urinary albumin abnormalities. Even with a confirmed diagnosis of NDKD, vigilance for DKD is of paramount importance. BMI was one of the most frequently top-ranked indicators across the models, indicating an intrinsic link between BMI and DKD. This link includes obesity-related renal damage and the insulin-resistant state associated with BMI,41 which complicates glycemic control and maintenance and may affect renal function through various mechanisms, including increased renal blood flow and pressure and chronic inflammation, perpetuating a vicious cycle that accelerates the onset of DKD.
Additionally, we used regular expressions and large language models for text analysis to extract the duration of diabetes and the age of onset of diabetes from past EMRs. In our dataset, only 43,849 patients had these data, and the data were not entirely accurate. Nevertheless, we conducted similar analyses on these data: we divided the 43,849 records into an 80% training set and a 20% testing set. For the duration of diabetes, we categorized the data into three groups: 0–5 years as 0, 6–10 years as 1, and more than 10 years as 2. As shown in Supplementary Figure S5, both the duration of diabetes (T2DMTRANK) and the age of onset of diabetes (T2DMAGE) were among the top three important variables in the CatBoost, LightGBM, and AdaBoost models, highlighting their significance. However, data missingness is unavoidable in many large retrospective datasets and real-world data. We processed a large amount of medical text information, which was challenging and might not be entirely accurate. Nonetheless, based on this information, we achieved good results in the newly constructed dataset: the AUCs for XGBoost, CatBoost, LightGBM, RF, AdaBoost, and GBDT were all above 0.9 in both the training and testing datasets (Supplementary Figures S6 and S7). Adding the duration of diabetes and the age of onset of diabetes to the models did not significantly change the final evaluation results, suggesting that other indicators might also relate to these variables. Nevertheless, the duration of diabetes is undeniably an important and objective factor in the occurrence of DKD.
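The three-level encoding of diabetes duration described above can be sketched as follows. The function name is illustrative, and the handling of fractional durations between 5 and 6 years (assigned to group 1 here) is an assumption, since the text only gives whole-year ranges.

```python
def duration_category(years):
    """Encode diabetes duration in years as 0 (0-5), 1 (6-10), or 2 (more than 10)."""
    if years < 0:
        raise ValueError("duration cannot be negative")
    if years <= 5:
        return 0
    if years <= 10:
        return 1
    return 2


# Boundary values for each group.
encoded = [duration_category(y) for y in (0, 5, 6, 10, 11, 25)]
```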
Our medical center, located in an urban area, has observed a notable trend in our diabetic patient population: farmers constitute the largest occupational group, accounting for approximately 38.38% (28,053 individuals). This finding highlights the severe burden of DKD in rural areas. According to data from the IDF, by 2045, rural China is projected to have the highest number of individuals with diabetes in the world.1 Consequently, our efforts to prevent and treat diabetes and its complications should extend to rural areas. Our center is situated in the Huang-Huai region, which encompasses the Yellow and Huai River basins. This area has been densely populated since the pre-Qin era and has a long-standing agricultural civilization. To date, it remains a core agricultural zone comparable to the Corn Belt in the central United States and the Chornozem (Black Earth) region of Ukraine. With a high proportion of rural population, this region presents a crucial direction for future research and applications in rural areas.
Our ongoing efforts to enhance patient care and medical research include the development of a sophisticated and convenient follow-up and predictive modeling system that leverages the capabilities of mobile networks and applications. This innovative approach involves deploying predictive models to send alerts about abnormal values and increased risks of DKD directly to patients, physicians, and researchers. Furthermore, these alerts will be shared with the patients’ family doctors, integrating our efforts with tiered healthcare systems. Currently, we are updating our follow-up data system and integrating it into a mobile application. This initiative aims to make the follow-up system accessible not only to researchers and physicians, but also to patients, offering a user-friendly interface for seamless data exchange. This dual approach empowers both healthcare professionals and patients to proactively engage in the early prevention of DKD and other related complications.
This study had several limitations. Although our cohort was large and the data were derived from four different hospital branches in various administrative districts of Zhengzhou City, this remains a single-center study. Fortunately, our data comprehensively recorded patients’ residential and work locations. The distribution of our data broadly represents inpatients with diabetes from both rural and urban areas of the Yellow and Huai River basins. As our research progresses and with the development of the aforementioned data center and updates to the follow-up system, we plan to collaborate with hospitals in surrounding cities for multicenter studies. This expansion will serve to validate and enhance our predictive models and enlarge the cohort of hospitalized patients with diabetes.
Conclusion
Diabetic nephropathy often has a concealed onset, making it particularly important to identify people with diabetes who have DKD. This study provides a comprehensive approach for predicting DKD in patients with type 2 diabetes, employing a large dataset and machine learning techniques. This approach allowed us to identify factors involved in the development of DKD. We believe that our study makes a significant contribution to the literature because we show that UACR, peripheral circulatory complications, other kidney diseases, and BMI were the most important factors predicting DKD.
Supplemental Material
Supplemental material, sj-docx-1-dhj-10.1177_20552076241265220 for A 10-year retrospective cohort of diabetic patients in a large medical institution: Utilizing multiple machine learning models for diabetic kidney disease prediction by Guangpu Li, Jia Li, Fei Tian, Jingjing Ren, Zuishuang Guo, Shaokang Pan, Dongwei Liu, Jiayu Duan and Zhangsuo Liu in DIGITAL HEALTH
