Abstract
Keywords
Background and introduction
Cardiovascular disease (CVD) is the leading cause of death worldwide.1–3 Acute coronary syndrome (ACS) encompasses unstable angina, non-Q-wave myocardial infarction (MI), and Q-wave MI.4 Prompt recognition of a patient with ACS is important because appropriate therapy can markedly improve the prognosis. To assist in improving prognosis, many CVD risk prediction models have been developed from prognostic factors using both regression-based methods and machine learning–based approaches. Examples of regression-based methods, which aim to support clinical diagnosis by converting prognostic factors into risk indices, are the Framingham risk score (FRS),5–8 QRISK,9,10 and GRACE11–14 models. On the other hand, there are machine learning–based prediction models of CVD occurrence such as random forests (RFs),15 neural networks (NNs),16 and support vector machines (SVMs).17 Machine learning–based approaches are known to address the limitations of traditional regression-based CVD prediction models. Their basic objectives are to find associations between different diseases, achieve high prediction accuracy, and handle missing and outlier data well. They can also analyze small and incomplete training data sets with dependent variables, which is a disadvantage of regression-based models (logistic regression and Cox proportional hazards regression).18
The critical and challenging issues of previous CVD prediction models for ACS patients can be summarized as follows. First, most previous regression-based CVD prediction models do not achieve high accuracy in the prognosis and diagnosis of CVD occurrence in patients at moderate risk. For example, approximately half of MIs and strokes occur in people who are not predicted to be at risk of CVD.19 Even with guidelines for CVD risk diagnosis and prediction, doctors often administer unnecessary treatment to patients at moderate risk. Second, regression-based CVD prediction models implicitly assume that each prognostic factor is independently associated with the occurrence of major adverse cardiovascular events (MACEs), a composite of death, MI, or repeat coronary revascularization of the target lesion; the non-linear, interactive relations among prognostic factors are therefore oversimplified.18 Third, conventional regression-based CVD prediction tools include major prognostic factors such as age, blood pressure (BP), heart rate, diabetes, cholesterol, smoking, and history of heart disease, whereas machine learning–based CVD prediction models involve different prognostic factors,19 as shown in Table 6. Fourth, machine learning–based CVD prediction models have already been used in various medical areas but mainly focus on analyzing medical images using convolutional neural networks (CNNs).20 In particular, there is little research on machine learning–based mortality prediction for clinical patients with ACS, which motivates the mortality analysis and prediction in this research.
Given the unavailability of medical facilities, the growing cost of health care, and staff shortages in emergency situations, it is essential to find a solution that redresses the aforementioned problems, predicts the degree of risk from patients' previous medical follow-up records provided by hospital emergency departments, and identifies the factors affecting patient severity.21–24
Therefore, this article proposes a machine learning–based model that predicts mortality during the 1-year follow-up after hospital discharge in clinical ACS patients. Its aim is to assess the degree of risk in patients with CVD and to develop a clinical decision support system that accurately predicts the mortality of ACS patients during the 1-year follow-up after discharge. Our contributions can be summarized as follows. First, we used a data set of the Korea Acute Myocardial Infarction Registry (KAMIR) and preprocessed it using the one-hot encoding rule. Second, we selected 8962 subjects, excluding from the population the 5923 people who failed to follow up after hospital discharge or had missing values. From these 8962 subjects, we finally selected 8227 subjects (7832 alive and 395 dead), excluding 735 patients who died during hospital admission. Third, the selected data set was divided through random sampling into a training data set of 6606 subjects (80.297%) and a testing data set of 1621 subjects (19.703%). Fourth, we implemented our machine learning–based mortality prediction model using a gradient boosting machine (GBM),25,26 a generalized linear model (GLM),27 RF,15 and a deep neural network (DNN)16,28 for the 1-year follow-up after discharge. Then, we compared the performances of the machine learning–based mortality prediction models using the area under the receiver operating characteristic (ROC) curve (AUC), precision, recall, accuracy, and F1-score.
Method
Data collection
KAMIR was the first nationwide, multicenter online registry designed to describe the characteristics and clinical outcomes of patients with MI, reflecting the current management of patients with ACS in Korea.29 It is a data set of Asian patients with acute MI and reflects real-world medical information and treatment practice for all patients. The data were collected by expert coordinators using a standardized form and protocol approved by the ethics committees of all participating institutions. The registry includes 52 community and university hospitals capable of primary percutaneous coronary intervention (PCI). Data were collected retrospectively at each site by trained study coordinators following the standardized protocol. All enrolled subjects were emergency patients who were diagnosed with ACS and had chest pain within 24 h. In this article, we used 14,885 ACS subjects enrolled in KAMIR from 1 November 2005 to 31 January 2008. The experimental data set consisted of 14,885 records with 22 continuous variables (e.g. age, body mass index (BMI), waist-to-hip ratio (WHR), symptom-to-balloon time, and arrival-to-balloon time) and 43 categorical variables (gender, pain, dyspnea, previous angina before MI symptom, etc.), as described in Table 1. The main outcome of this article is cardiac and sudden death during the 1-year clinical follow-up after hospital discharge; predicting MACEs will help determine the risk of patient mortality. Death after hospital discharge includes cardiac and non-cardiac death. There were also four discrete variables, such as Killip class and lesion type, which indicate the severity of the patient's condition.
Applied variables for the mortality prediction model.
BMI: body mass index; WHR: waist-to-hip ratio; SBP: systolic blood pressure; DBP: diastolic blood pressure; LV: left ventricular; CK-MB: creatine kinase-muscle brain; DOA: dead on arrival; ECG: electrocardiogram; PCI: percutaneous coronary intervention; CABG: coronary artery bypass grafting; HDL: high-density lipoprotein; LDL: low-density lipoprotein; NT-proBNP: N-terminal of the prohormone brain natriuretic peptide; hsCRP: high-sensitivity C-reactive protein; STEMI: ST-segment elevation myocardial infarction; NSTEMI: non-ST-segment elevation myocardial infarction.
Data preprocessing
During data preprocessing, all outliers in numeric data (e.g. special characters, out-of-range numeric values such as 999, and invalid datetimes) are converted into null values. In the data source, every attribute that can be subdivided is split into independent classes, and each class generates a new attribute. In accordance with the one-hot encoding rule, a representation of categorical variables as binary vectors, the new attribute is encoded as 1 if the attribute value is true for the new class and 0 otherwise. This first requires that each categorical value be mapped to an integer value. Each integer value is then represented as a binary vector that is all zeros except at the index of the integer, which is marked with a 1.
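As an illustration, the one-hot step described above can be sketched in Python. This is a minimal sketch, not the registry's actual preprocessing code; the attribute name and class values are hypothetical:

```python
def one_hot(value, classes):
    """Encode a categorical value as a binary vector over the known classes.

    A null or unknown value maps to an all-zero vector, mirroring the rule
    that null values are replaced with 0 during the conversion of
    categorical variables.
    """
    vec = [0] * len(classes)
    if value in classes:
        vec[classes.index(value)] = 1
    return vec

# Hypothetical categorical attribute "pain" with two classes.
pain_classes = ["yes", "no"]
print(one_hot("yes", pain_classes))  # [1, 0]
print(one_hot(None, pain_classes))   # [0, 0] -- null becomes all zeros
```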
In this article, we omit the detailed encoding rules for numerical and categorical variables. A numeric value is converted into 1 if the attribute value is true for the new class and 0 otherwise, in accordance with the reference values of the Mayo Clinic, a global reference laboratory that supports healthcare facilities worldwide and makes its practices and specialized tests accessible to physicians.30 For example, attribute "age" is converted into six new attributes: "<36," "36–45," "46–55," "56–65," "66–75," and "⩾76." WHR is preprocessed as 1 for obesity (⩾1 in men and ⩾0.85 in women) and 0 for normal otherwise. BMI is converted into four attributes: "⩽18.5," "18.5–22.99," "23–24.99," and "⩾25."31 During the preprocessing of categorical variables, all values of each variable generate new attributes, and each attribute takes the value 1 in the column corresponding to the true category and 0 otherwise. During this conversion of categorical variables, all null values are replaced with 0.
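The age and WHR rules above can be sketched as follows. This is an illustrative reconstruction of the stated cut points, not the authors' code:

```python
AGE_BINS = ["<36", "36-45", "46-55", "56-65", "66-75", ">=76"]

def encode_age(age):
    """One-hot encode age into the six groups described in the text."""
    edges = [36, 46, 56, 66, 76]  # lower bounds of the 2nd..6th groups
    idx = sum(age >= e for e in edges)  # count of thresholds reached
    vec = [0] * len(AGE_BINS)
    vec[idx] = 1
    return vec

def encode_whr(whr, male):
    """1 for obesity (WHR >= 1.0 in men, >= 0.85 in women), else 0."""
    return 1 if whr >= (1.0 if male else 0.85) else 0

print(encode_age(45))  # [0, 1, 0, 0, 0, 0] -- falls in "36-45"
print(encode_whr(0.9, male=False))  # 1 -- obese by the female threshold
```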
Data extraction
For the experiments in this study, we used the data of 14,885 ACS patients enrolled in KAMIR from 1 November 2005 to 31 January 2008.29 We selected 8962 subjects from the original data set, excluding 5923 people who failed to follow up after hospital discharge. Table 2 lists the criteria by which subjects failed the 1-year follow-up after discharge. In Table 2, a null value means that tracking of the patient during the 1-year follow-up after hospital discharge failed. Our data set therefore excludes all subjects who failed the 1-year follow-up after discharge but retains the subjects who suffered cardiac or non-cardiac death during the follow-up period.
Criteria in patients who failed at the 1-year follow-up after hospital discharge.
After that, we finally selected 8227 subjects (7832 alive and 395 dead), excluding 735 patients who had died during hospital admission from the 8962 subjects. The overall data extraction process is shown in Figure 1. The 8227 subjects were then subdivided through random sampling into a training data set of 6606 subjects (80.297%) for model learning and a testing data set of 1621 subjects (19.703%) for evaluating the prediction model. The training and testing data sets include the deaths of 305 and 90 patients, respectively.
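A split of this kind, keeping the death rate roughly constant in both parts, can be sketched in Python. This is a minimal stratified-sampling sketch under our own assumptions, not the authors' implementation:

```python
import random

def stratified_split(subjects, is_dead, test_frac=0.2, seed=42):
    """Split subjects into train/test sets while keeping the proportion
    of deaths roughly constant in both parts (stratified sampling)."""
    rng = random.Random(seed)
    dead = [s for s in subjects if is_dead(s)]
    alive = [s for s in subjects if not is_dead(s)]
    rng.shuffle(dead)
    rng.shuffle(alive)
    n_dead_test = round(len(dead) * test_frac)
    n_alive_test = round(len(alive) * test_frac)
    test = dead[:n_dead_test] + alive[:n_alive_test]
    train = dead[n_dead_test:] + alive[n_alive_test:]
    return train, test

# Toy example: 100 subjects, the first 10 labeled as deaths.
train, test = stratified_split(list(range(100)), lambda s: s < 10)
print(len(train), len(test))  # 80 20
```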

Experimental data extraction.
Architecture of the proposed mortality prediction model
To develop a mortality prediction model for patients with ACS, we employed machine learning algorithms, namely GBM,25,26 GLM,27 RF,15 and DNN.16,28 First, a DNN is an artificial neural network (ANN) with multiple hidden layers between the input and output layers; ours comprises three hidden layers and can capture non-linear patterns in unstructured data.28,32 Second, GBM combines boosting with gradient descent. It creates a model, fits a second model to the residuals, and combines the two. If residuals remain in the combined model, another model is fitted to them, and the final prediction model is generated by repeating this process until the residuals vanish. Third, GLM is an extension of the linear regression model that can be applied even when the dependent variable does not follow a normal distribution. The GLM combines traditional statistical methods and machine learning techniques: the dependent variable is linearly related to the independent variables through a specified link function, and the combination of hyperparameter values is found through grid search. RF builds multiple decision trees and merges them to obtain a more accurate and stable prediction. It is a flexible, simple supervised ensemble machine learning algorithm that usually produces accurate results without hyperparameter tuning and can be used for both classification and regression tasks.
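The fit-residuals-and-combine loop that defines GBM can be made concrete with a toy one-dimensional example. This is a didactic sketch using depth-1 regression trees (stumps) as the weak learner, not the GBM implementation used in our experiments:

```python
def fit_stump(xs, rs):
    """Fit a depth-1 regression tree to residuals rs over inputs xs:
    pick the threshold minimizing the squared error of two leaf means."""
    best = None
    for thr in sorted(set(xs)):
        left = [r for x, r in zip(xs, rs) if x < thr]
        right = [r for x, r in zip(xs, rs) if x >= thr]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda x: lm if x < thr else rm

def gbm_fit(xs, ys, rounds=20, lr=0.5):
    """Start from the mean, repeatedly fit a stump to the current
    residuals, and add the (shrunken) corrections together."""
    base = sum(ys) / len(ys)
    stumps, preds = [], [base] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + lr * sum(s(x) for s in stumps)

model = gbm_fit([1, 2, 3, 4], [0, 0, 1, 1])
```

After 20 rounds the residuals have shrunk essentially to zero, so `model(1)` is close to 0 and `model(4)` is close to 1.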
Our mortality prediction model employed four machine learning algorithms, and Figure 2 shows the overall processing architecture of the mortality prediction model for patients with ACS. Its processing phases can be summarized as follows. First, the preprocessed data are subdivided through random sampling into a training data set (80%) for learning the models and a test data set (20%) for evaluation. During the random sampling, the rates of death and survival should remain constant. Second, we selected ranges of hyperparameters within which to search for the best prediction model of each machine learning algorithm, including RF, GBM, GLM, and DNN. For each algorithm, we created a machine learning–based mortality prediction model for clinical patients with ACS, fitting the hyperparameters over those ranges through grid search on the training data and evaluating each combination by fourfold stratified cross-validation. Third, we found the best prediction model with the highest performance for each machine learning algorithm and extracted its hyperparameters. Fourth, each machine learning–based model employed the best hyperparameters and was evaluated on the test data. Finally, we compared the performances of the mortality prediction models and then selected the best mortality prediction model for the 1-year follow-up in patients with ACS.

Processing architecture of our proposed mortality prediction model during the 1-year follow-up tracking using machine learning algorithms.
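The tuning phase above, grid search scored by stratified k-fold cross-validation, can be sketched as follows. This is a generic illustration (the `evaluate` callback stands in for training and scoring one model), not the tuning code used in our experiments:

```python
import itertools
import random

def stratified_kfold(labels, k=4, seed=0):
    """Yield (train_idx, valid_idx) pairs; each fold keeps the class mix."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in set(labels):
        idx = [i for i, y in enumerate(labels) if y == label]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)  # deal indices round-robin per class
    for f in range(k):
        valid = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, valid

def grid_search(grid, evaluate, labels, k=4):
    """Return the hyperparameter combination with the best mean CV score.

    `grid` maps parameter names to candidate values; `evaluate` takes
    (params, train_idx, valid_idx) and returns a score to maximize.
    """
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        scores = [evaluate(params, tr, va)
                  for tr, va in stratified_kfold(labels, k)]
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_params, best_score = params, mean
    return best_params, best_score
```

A real run would plug in a model-fitting `evaluate`; here the structure is the point: every parameter combination is scored on the same stratified folds, and the best mean score wins.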
Statistical analysis and implementation environments
In the statistical analysis, continuous variables (e.g. age and BP) are represented as mean ± standard deviation, and categorical variables (e.g. gender, discharge medication (DM), and smoking) are reported as frequencies and rates. We use independent two-sample tests to compare the survival and death groups.
Performance measures
We apply the test data sets to evaluate the accuracy of the mortality prediction model in patients with ACS. The prediction results are reported in a table together with the AUC. The performance measures of the machine learning–based mortality prediction models and the regression-based model (GRACE) are compared in a table including AUC, precision, recall, accuracy, and F1-score.
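For reference, these measures can be computed from scratch. The sketch below assumes binary labels with 1 denoting death; the AUC uses the standard rank-statistic formulation:

```python
def confusion_metrics(y_true, y_pred):
    """Precision, recall, accuracy, and F1 from binary labels (1 = death)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(y_true)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, accuracy, f1

def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(confusion_metrics([1, 1, 0, 0], [1, 0, 1, 0]))  # (0.5, 0.5, 0.5, 0.5)
print(auc([1, 1, 0, 0], [0.9, 0.8, 0.4, 0.3]))        # 1.0
```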
Results
In this section, we evaluate the machine learning–based mortality prediction model for patients with ACS during the 1-year follow-up after hospital discharge. Before evaluating the prediction models, the baseline characteristics of the subjects were analyzed for the survival and death groups during the 1-year follow-up after hospital discharge. Then, we compared the top nine primary prognostic factors between the regression-based prediction model (GRACE) and the machine learning–based models (GBM, DNN, GLM, and RF), as well as the performances of their mortality prediction models after hospital discharge, in terms of AUC, precision, recall, accuracy, and F1-score.
Baseline characteristics
In this article, we selected 8227 experimental subjects (7832 survivors and 395 deaths after hospital discharge), excluding from the population of 14,885 ACS patients the 5923 people who failed the 1-year clinical follow-up and the 735 in-hospital deaths. The subjects were then subdivided into two groups, survival (alive) and death, and their baseline characteristics are summarized in Table 3. The average age of the subjects was 62.19 ± 12.54 years; the difference between the survival group (61.67 ± 12.40) and the death group (72.43 ± 10.77) was around 10 years and highly statistically significant.
Baseline characteristics of subjects after hospital discharge.
BMI: body mass index; WHR: waist-to-hip ratio; LV: left ventricular; CK-MB: creatine kinase-muscle brain; NT-proBNP: N-terminal of the prohormone brain natriuretic peptide; ECG: electrocardiogram; RBBB: right bundle branch block; LBBB: left bundle branch block; MI: myocardial infarction; HDL: high-density lipoprotein; LDL: low-density lipoprotein; hsCRP: high-sensitivity C-reactive protein.
Table 4 describes the medication characteristics of the subjects after hospital discharge. The survival rate was high in patients who were prescribed medicines such as aspirin, angiotensin-converting enzyme (ACE) inhibitors, clopidogrel, statins, and nitrates after hospital discharge, while the death rate was high in patients who were prescribed medicines such as diuretics, digoxin, amiodarone, and spironolactone.
Discharge medication characteristics of all participants.
ACE: angiotensin-converting enzyme.
The angiographic characteristics of the subjects after hospital discharge are described in Table 5. In the coronary angiographic findings, the attribute "coronary angiography was not performed" was 2.9% in the survival group and 21.0% in the death group, the latter about seven times as high as the former. The proportion of patients with one-vessel disease was about twice as high in the survival group as in the death group, a statistically significant difference. For the attribute "LV ejection fraction," when it was under 35%, the rate in the death group was about three times higher, while for values over 50%, the rate in the survival group was about three times higher. For the attribute "PCI stent types with Taxus and Cypher," the value in the survival group was significantly higher.
Angiographic characteristics of the subjects after hospital discharge.
LV: left ventricular; PCI: percutaneous coronary intervention; BMS: bare metal stent; DESs: drug-eluting stents.
Variable significance in mortality prediction model after hospital discharge
The significance of each variable in the prediction models was calculated as a percentage. The degree of significance ranged from 0 to 1, where 1 denotes the most significant (100%) and 0 the least significant (0%). Table 6 describes the top nine primary prognostic factors that each prediction model uses to predict mortality during the 1-year clinical follow-up in ACS patients. The primary prognostic factors differed considerably depending on the applied model (DNN, GBM, GLM, RF, or GRACE). For example, the variable "age >76" played an important role in the RF, GBM, and GLM mortality prediction models, and older age had a large impact on the death rate. The variable "age ranging from 66 to 75" also had a large impact on the mortality prediction models; accordingly, we divided age into six groups. In addition, the variables "coronary angiogram was not performed in angiographic findings," "diuretics," "LV ejection fraction," "aspirin discharge medication," "creatinine," and "Killip class III" had an important impact on the RF and GBM mortality prediction models, and among them, "coronary angiogram was not performed in angiographic findings," "diuretics," and "Killip class III" carried greater weight in the death group than in the survival group (Tables 3 to 5). The higher the creatinine level, the higher the death rate, while the lower the "LV ejection fraction," the higher the death rate. In previous works, the variables age, creatinine, and Killip class were significantly important in machine learning algorithms. Note, however, that the variable importances in DNN were very different from those in the other machine learning models (RF, GBM, and GLM), as shown in Table 6.
Descending ranks of the top nine primary prognostic factors during the 1-year clinical follow-up after hospital discharge.
RF: random forest; GBM: gradient boosting machine; GLM: generalized linear model; DNN: deep neural network; PCI: percutaneous coronary intervention; BMI: body mass index; LDL: low-density lipoprotein; HR: heart rate; DM: discharge medication; TH: thrombolysis; MT: medical therapy.
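The ranking in Table 6 amounts to normalizing each model's raw importance scores to [0, 1] and taking the nine largest. A minimal sketch, with hypothetical variable names and scores:

```python
def top_factors(importances, k=9):
    """Scale raw importance scores so the largest becomes 1 (100%) and
    return the k highest-ranked variables in descending order."""
    peak = max(importances.values())
    scaled = {var: score / peak for var, score in importances.items()}
    return sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Hypothetical raw scores, not values from Table 6.
raw = {"age >76": 8.0, "creatinine": 4.0, "diuretics": 2.0}
print(top_factors(raw, k=2))  # [('age >76', 1.0), ('creatinine', 0.5)]
```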
Discussion
In this article, we employed four machine learning algorithms (RF, GBM, GLM, and DNN) in the mortality prediction model for the 1-year clinical follow-up in patients with ACS, and their performances were compared with the GRACE Risk Score 2.0.11–14 Machine learning models are normally evaluated on different performance measures such as AUC, precision, recall, accuracy, and F1-score.
Comparison of the performance in mortality prediction models during the 1-year clinical follow-up tracking after hospital discharge.
AUC: area under the receiver operating characteristic curve; GBM: gradient boosting machine; GLM: generalized linear model; DNN: deep neural network.
Figure 3 shows the ROC curves of the mortality prediction models in patients with ACS during the 1-year clinical follow-up after hospital discharge. The AUC values were in the decreasing order of DNN, GBM, RF, GLM, and GRACE. Overall, GBM was superior to the other approaches in AUC, recall, accuracy, and F1-score.

The ROC curves in mortality prediction models during the 1-year clinical follow-up tracking after hospital discharge.
Conclusion
This article proposed a mortality prediction model using machine learning algorithms, including DNN, GBM, GLM, and RF, for the 1-year clinical follow-up after hospital discharge in Korean patients with ACS. Our main contributions can be summarized as follows. First, this article developed a machine learning–based 1-year mortality prediction model for clinical patients with ACS. Second, this model can forecast mortality during the 1-year clinical follow-up after hospital discharge in Korean patients with ACS because the underlying data well reflect Korean demographic characteristics. Third, the performances of the machine learning–based mortality prediction models were shown to be superior to GRACE. Finally, these results are expected to contribute to the development of a future tool for diagnosing and forecasting the occurrence of MACEs in clinical ACS patients.
Finally, there were some potential limitations to our research. First, we used only 8227 experimental subjects; this data set may be insufficient because machine learning algorithms need a large-scale data set. Second, our proposed model is limited to diagnosing and forecasting mortality in Korean patients with ACS. Third, it is difficult to explain the prediction results of the machine learning–based approaches because GBM, RF, and DNN are non-linear models, whereas in regression-based prediction models it is easy to explain how the major prognostic factors are associated with mortality in patients with ACS because they are based on statistical analysis. Finally, the mortality prediction could only be verified over the short clinical follow-up period of 1 year after hospital discharge in our experimental data set.
