Introduction
With continual scientific and technological breakthroughs, we are now in an era of massive cloud data. To confront complicated and ever-expanding data, appropriate analysis methods must be adopted to explore its potential value. In recent years, research fields such as data mining, artificial intelligence, and machine learning have developed rapidly, and classification technology, as one of their important components, has become a research focus.
Data imbalance refers to an uneven distribution of samples across the categories of a data set. According to sample size, such a data set can be divided into the majority class (negative class) and the minority class (positive class).1 The classification of imbalanced data arises in many areas of life, such as medical diagnosis, information security, text mining, and target detection.2 Because traditional classification algorithms aim to maximize overall accuracy and assume that the data distribution across categories is relatively balanced, their misclassification rate for minority samples is high, and they cannot be applied maturely and stably to imbalanced data.3 Therefore, how to obtain a more accurate and ideal classification effect on imbalanced data sets is an urgent problem, with great practical significance and application value.
Comprehensively, the research of scholars at home and abroad on the classification of imbalanced data over the years has mainly focused on three levels: data, algorithm, and evaluation index.4
The core of data-level research is to reconstruct and adjust the sample distribution of the original data set in various ways to reduce or eliminate its imbalance. The main methods include data resampling and feature selection.5
The key of algorithm-level research is to overcome the limitations of traditional classification algorithms on imbalanced data, mainly through cost-sensitive learning and ensemble learning methods. When dealing with imbalanced data, cost-sensitive learning distinguishes classes by setting different misclassification costs for different samples; it is more flexible, but risks overfitting and has a high learning cost. Ensemble learning methods mainly combine traditional algorithms with other improved algorithms. For example, Chawla et al.6 proposed the SMOTEBoost algorithm, which combines the classical over-sampling algorithm synthetic minority over-sampling technique (SMOTE) with AdaBoost and improves generalization ability. Research at the evaluation index level is mainly aimed at exploring and optimizing classification indexes; for example, Cheng et al.7 tried to improve the evaluation index for imbalanced classification.
The extreme gradient boosting algorithm XGBoost8 is an ensemble learning algorithm with high flexibility, strong predictive and generalization ability, high scalability, efficient model training, and great robustness. Current research on XGBoost mainly focuses on direct application,9–14 integration with other algorithms,15–18 and parameter optimization.19–21 In terms of imbalanced data research, Jia22 combined a clustering-based improvement of SMOTE with XGBoost and applied ensemble learning to realize anomaly detection in the bolt process. Cui23 combined the EasyEnsemble under-sampling algorithm with XGBoost, comprehensively using majority-class sample information for classification and prediction. However, due to the diversity of imbalanced data distributions, the effect of combining a single over-sampling or under-sampling method with XGBoost is often not ideal.
In view of this, the primary goal of this article is to optimize and improve the classification performance of XGBoost in the case of imbalanced data, and an XGBoost classification algorithm combining mixed sampling technology and ensemble learning is proposed. First, at the data level, SVM–SMOTE and EasyEnsemble are used to reduce the imbalance of the data. Then, at the algorithm level, XGBoost is used to train the model, and a Bayesian optimization algorithm automatically searches for the optimal parameters. Analysis of the experimental results shows that the proposed classification model outperforms three representative classification models (RUSBoost,24 CatBoost,25 and LightGBM26) as well as the XGBoost algorithm based on mixed sampling designed by Yue.27
Related algorithms
SVM–SMOTE
As a classic over-sampling algorithm with universal applicability, SMOTE28 synthesizes data by random interpolation between a minority sample and its nearest neighbors, expanding the feature decision area of the minority class, and can effectively balance data. However, it also suffers from low-quality synthetic samples, fuzzy class boundaries, and uneven distribution of the minority samples.29
The main improved algorithms that followed include Borderline-SMOTE, ADASYN, and SVM–SMOTE. Borderline-SMOTE and ADASYN mainly address the quality of the generated samples. SVM–SMOTE instead divides the minority samples into safe (more than half of the nearest neighbors belong to the minority class), dangerous (more than half of the nearest neighbors belong to the majority class), and noise (all nearest neighbors belong to the majority class). It then trains an SVM on the dangerous samples and uses the support vectors found by the SVM to generate new samples close to the boundary between the majority and minority classes. This gives full play to the advantage of the SVM in boundary decision-making and addresses both the quality of generated samples and class-boundary ambiguity, making SVM–SMOTE an effective over-sampling method.
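The interpolation at the heart of the SMOTE family can be illustrated with a minimal, library-free sketch. This is only the core synthesis step, x_new = x + u·(x_nn − x); the dangerous-sample filtering and SVM boundary search of the full SVM–SMOTE are deliberately omitted, and the function names are illustrative, not the paper's implementation:

```python
import math
import random

def nearest_minority_neighbor(x, minority):
    """Return the closest other minority sample to x (Euclidean distance)."""
    candidates = [m for m in minority if m is not x]
    return min(candidates, key=lambda m: math.dist(x, m))

def smote_synthesize(minority, n_new, seed=0):
    """Generate n_new synthetic samples by interpolating a randomly chosen
    minority sample toward its nearest minority neighbor:
        x_new = x + u * (x_nn - x),  u ~ Uniform(0, 1)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        x_nn = nearest_minority_neighbor(x, minority)
        u = rng.random()
        synthetic.append(tuple(xi + u * (ni - xi) for xi, ni in zip(x, x_nn)))
    return synthetic
```

In practice, `imblearn.over_sampling.SVMSMOTE` implements the complete algorithm, including the SVM-based detection of boundary (dangerous) samples.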
EasyEnsemble
EasyEnsemble30 is a hybrid ensemble under-sampling algorithm. It uses Bagging to fuse random under-sampling with the AdaBoost algorithm, which makes up for the general under-sampling defect of losing important classification information, while choosing AdaBoost as the base classifier improves classification accuracy and generalization ability. Different from the supervised combination used by another representative algorithm, BalanceCascade, EasyEnsemble is based on unsupervised under-sampling. It has low time complexity and high data utilization, largely avoiding the waste of limited data resources, and is an effective and extensible method. The principle steps of EasyEnsemble are shown in Table 1.
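The under-sampling side of EasyEnsemble can be sketched in a few lines: draw several independent random subsets of the majority class, each the size of the minority class, and pair each with the full minority class. This is a simplified illustration under stated assumptions; the AdaBoost base learners trained on each balanced set and their fusion are omitted, and the function name is hypothetical:

```python
import random

def easy_ensemble_subsets(majority, minority, n_subsets=4, seed=0):
    """Build n_subsets balanced training sets: each combines one random
    subset of the majority class (size = |minority|, drawn without
    replacement) with the full minority class."""
    rng = random.Random(seed)
    balanced_sets = []
    for _ in range(n_subsets):
        subset = rng.sample(majority, k=len(minority))
        balanced_sets.append(subset + minority)
    return balanced_sets
```

A full implementation, with AdaBoost base estimators, is available as `imblearn.ensemble.EasyEnsembleClassifier`.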
Algorithm steps of EasyEnsemble.
XGBoost
For a given training set with $n$ samples $D=\{(x_i,y_i)\}_{i=1}^{n}$, XGBoost builds an additive model of $K$ regression trees, whose prediction is $\hat{y}_i=\sum_{k=1}^{K}f_k(x_i)$.

The modeling process of XGBoost is to leave the original model unchanged and set the input of the next tree to the residual of the current prediction, so that after $t$ rounds $\hat{y}_i^{(t)}=\hat{y}_i^{(t-1)}+f_t(x_i)$.

Initialize $\hat{y}_i^{(0)}=0$; the objective function of round $t$ is

$$Obj^{(t)}=\sum_{i=1}^{n}l\big(y_i,\hat{y}_i^{(t-1)}+f_t(x_i)\big)+\Omega(f_t) \tag{1}$$

In the above formula, $l$ is a differentiable convex loss function measuring the deviation between prediction and label, and $\Omega(f_t)$ is a regularization term penalizing the complexity of the new tree.

Since $l$ is generally not quadratic, XGBoost applies a second-order Taylor expansion around $\hat{y}_i^{(t-1)}$, with first- and second-order gradients $g_i=\partial_{\hat{y}^{(t-1)}}l(y_i,\hat{y}_i^{(t-1)})$ and $h_i=\partial^{2}_{\hat{y}^{(t-1)}}l(y_i,\hat{y}_i^{(t-1)})$, so that, after dropping constant terms, $Obj^{(t)}\approx\sum_{i=1}^{n}\big[g_if_t(x_i)+\tfrac{1}{2}h_if_t^{2}(x_i)\big]+\Omega(f_t)$.

In XGBoost, the number of leaf nodes $T$ and the leaf weights $w_j$ define the regularization term $\Omega(f_t)=\gamma T+\tfrac{1}{2}\lambda\sum_{j=1}^{T}w_j^{2}$.

In general, formula (1) can be rewritten in the following format

$$Obj^{(t)}=\sum_{j=1}^{T}\Big[G_jw_j+\tfrac{1}{2}(H_j+\lambda)w_j^{2}\Big]+\gamma T \tag{2}$$

where $I_j$ is the set of samples falling on leaf $j$, $G_j=\sum_{i\in I_j}g_i$, and $H_j=\sum_{i\in I_j}h_i$.

Assuming that the tree structure has been determined, the optimal solution obtained by directly deriving formula (2) with respect to each $w_j$ is $w_j^{*}=-\frac{G_j}{H_j+\lambda}$, with the corresponding optimal objective value $Obj^{*}=-\tfrac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j+\lambda}+\gamma T$.

XGBoost calculates the gain value through the formula

$$Gain=\frac{1}{2}\left[\frac{G_L^{2}}{H_L+\lambda}+\frac{G_R^{2}}{H_R+\lambda}-\frac{(G_L+G_R)^{2}}{H_L+H_R+\lambda}\right]-\gamma$$

and selects the feature with the largest corresponding gain value, and the split point under that feature, for splitting.

Among them, $G_L$, $H_L$ and $G_R$, $H_R$ are the gradient and Hessian sums of the left and right child nodes after the split, and $\gamma$ is the complexity cost introduced by adding a leaf.
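The optimal leaf weight and split gain above translate directly into code; a minimal sketch (function names are illustrative):

```python
def optimal_leaf_weight(G, H, lam):
    """w* = -G / (H + lambda) for a leaf with gradient sum G and Hessian sum H."""
    return -G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Gain = 1/2 [G_L^2/(H_L+lam) + G_R^2/(H_R+lam)
                   - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma
```

A split is accepted only when its gain is positive, which is exactly how the $\gamma$ term prunes low-value splits.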
Algorithm optimization and design
Regularization term optimization
Because the
then formula (2) becomes
The optimal solution of the derivative solution is
and the gain calculation formula becomes
It can be seen from the above formula that the new definition makes
Classification algorithm
Since mixed sampling can take into account both under-sampling and over-sampling, it often produces better results. Therefore, this article integrates a mixed sampling method with an ensemble learning method and designs an XGBoost imbalanced data classification algorithm combining the SVM–SMOTE and EasyEnsemble algorithms. Its strategy is as follows:
1. Divide the original data set into a training set and a test set.
2. Use the SVM–SMOTE algorithm to over-sample the minority samples in the training set and generate supplementary minority samples.
3. Use the EasyEnsemble under-sampling algorithm to independently and randomly extract multiple subsets from the majority class, and combine each subset with the minority samples to form balanced training subsets.
4. Use the parameters of XGBoost and the area under the curve (AUC) value as the input and output of the objective function in the Bayesian optimization search, adjust to the best parameter combination in time, and perform training and ensembling on the balanced subsets with XGBoost as the base learner.
5. Based on the model obtained after tuning the parameters, complete the final prediction on the test set.
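The strategy above can be sketched end to end on synthetic data. This is an illustrative simplification, not the paper's implementation: scikit-learn's GradientBoostingClassifier stands in for XGBoost so the sketch stays self-contained, the SVM–SMOTE step is only marked, and the Bayesian parameter search of step 4 (e.g. via scikit-optimize's BayesSearchCV) is omitted:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Step 1: imbalanced toy data, 7:3 stratified train/test split.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Step 2 (SVM-SMOTE over-sampling) is omitted here; imblearn's SVMSMOTE
# would expand the minority class of (X_tr, y_tr) at this point.

# Step 3: EasyEnsemble-style balanced subsets of the majority class,
# one base learner per subset.
min_idx = np.flatnonzero(y_tr == 1)
maj_idx = np.flatnonzero(y_tr == 0)
models = []
for _ in range(5):
    sub = rng.choice(maj_idx, size=len(min_idx), replace=False)
    idx = np.concatenate([sub, min_idx])
    clf = GradientBoostingClassifier(random_state=0)  # stand-in for XGBoost
    models.append(clf.fit(X_tr[idx], y_tr[idx]))

# Step 5: average the base learners' probabilities and score by AUC.
proba = np.mean([m.predict_proba(X_te)[:, 1] for m in models], axis=0)
auc = roc_auc_score(y_te, proba)
print(round(auc, 3))
```

Averaging the per-subset probabilities is one simple fusion rule; the paper's pipeline additionally tunes the base learner's parameters by Bayesian search before this final ensembling.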
The flow framework of the algorithm is shown in Figure 1.

Framework diagram of algorithm flow.
Experiment and result analysis
Experimental platform and data introduction
The experimental platform is a Windows 10 (64-bit) operating system with 8 GB memory and an Intel(R) Core(TM) i7-7500 CPU @ 2.70 GHz; the experimental tools are mainly Python 3.9 and packages including xgboost 1.3.3, jupyter 1.0.0, seaborn 0.11.1, sklearn 0.0, pandas 1.2.2, numpy 1.20.1, matplotlib 3.3.4, and imblearn 0.0.
In order to prove the classification performance of the proposed algorithm on imbalanced data sets, two public imbalanced data sets are used in the experiments. The first is the Taiwan credit card data set publicly provided on the UCI website. It records bank customers' arrears history, credit data, statistical characteristics, billing statements, and other information from April to September 2005. It has 24 characteristic attributes and one category label, which can be used to predict customer default. The specific meaning of the features is shown in Table 2. The second is the credit fraud data set provided by the ULB machine learning laboratory. It contains the credit card transactions of a European bank in September 2013, including 492 frauds among 284,807 transactions, so the data categories are extremely imbalanced. For security reasons, the original data have been desensitized and reduced by principal component analysis (PCA), leaving 29 feature attributes and 1 category label. Among them, the features "V1"–"V28" are the principal components obtained by PCA, while "Time" and "Amount" were not processed by PCA. "Time" represents the time interval between each transaction and the first transaction, and "Amount" represents the transaction amount. For the class label "Class," 1 represents fraud and 0 represents normal.
Characteristic description.
For PAY_0 and PAY_2–PAY_6, a value of −1 means that the customer repaid on time, 1 means the repayment was postponed for 1 month, 2 means it was postponed for 2 months, and so on. In addition, if the value of PAY_AMT is greater than or equal to the previous month's BILL_AMT value, the customer is deemed to have repaid on time; if it is less than the previous month's BILL_AMT value but greater than the lower repayment limit set by the bank, it is regarded as delayed repayment; and repaying less than the minimum repayment amount is regarded as default.
Data analysis
The analysis below mainly presents the first data set. The second data set needs no cleaning and shows no significant correlation between characteristic variables; its processing flow is similar to that of the first data set.
First, by loading and viewing the data set and performing missing-value testing and descriptive statistical analysis, it is found that the data have 30,000 rows and 25 columns, no feature has missing values, and all values are integers. Due to the large differences between value ranges, standardization is required. Part of the data and the test results are shown in Figures 2 and 3.

Partial data display.

Missing value test results.
Second, by examining the sample distribution of each feature and testing for outliers, it was found that the features EDUCATION and MARRIAGE contain abnormal values. EDUCATION takes the values 0, 5, and 6 beyond those given in the data description, with 345 outliers, which is small relative to the total sample size, so these values are merged into category 4. MARRIAGE takes the value 0 beyond the description, with 54 outliers, which are merged into category 3 for correction and filling.
At the same time, the analysis shows that 6636 customers will default next month, far fewer than the 23,364 non-defaulting customers, so the data are obviously imbalanced. The information of the two experimental data sets is summarized in Table 3.
Experimental data set information.
Moreover, analysis of the variables in the credit card data set shows that, among all customers, the default ratio of males is 24.2% and that of females is 20.8%. Customers aged between 30 and 40 repay on time in the largest numbers, and the older the customer, the higher the default rate. Unmarried customers have the largest number of on-time repayments, while their number of defaults is similar to that of married customers. High-school-educated customers have the highest default rate, and the higher the education level, the lower the default rate.
Finally, comprehensive feature analysis shows that only PAY_0 carries complete information among the repayment-status features, and the two feature groups BILL_AMT and PAY_AMT are highly internally correlated, so features with large correlation coefficients with default.payment.next.month can be selected for modeling. Therefore, ID, PAY_2–PAY_6, BILL_AMT2, BILL_AMT4–BILL_AMT6, and PAY_AMT3–PAY_AMT6 are deleted, and default.payment.next.month is separated as the label. The specific feature correlation heatmap is shown in Figure 4.

Feature correlation heatmap.
Performance evaluation index
In traditional classification problems, single evaluation indexes such as recall and accuracy are often used and reflect algorithm performance well. However, for imbalanced classification with skewed data, a single index no longer has good reference value. Therefore, this article selects the comprehensive evaluation indexes G-mean and AUC to compare and analyze the prediction effects in the experiments.
In the confusion matrix, TP and FN denote the numbers of minority (positive) samples predicted correctly and incorrectly, and TN and FP denote the numbers of majority (negative) samples predicted correctly and incorrectly. The true positive rate is $TPR=\frac{TP}{TP+FN}$, the true negative rate is $TNR=\frac{TN}{TN+FP}$, and $G\text{-}mean=\sqrt{TPR\times TNR}$.

Therefore, G-mean contains the two single evaluation indexes TPR and TNR, and is large only when the classification accuracy of both classes is high, which makes it suitable for evaluating imbalanced classification.
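G-mean as defined above is a one-liner over the confusion-matrix counts; a small sketch:

```python
import math

def g_mean(tp, fn, tn, fp):
    """Geometric mean of the true positive rate (minority-class recall)
    and the true negative rate (majority-class recall)."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return math.sqrt(tpr * tnr)
```

Because it is a geometric mean, a classifier that ignores the minority class (TPR near 0) scores near 0 regardless of how well it handles the majority class.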
Experiment settings and results
Experiment settings
After data analysis and feature selection, we first divide the original imbalanced data set into a training set and a test set in the ratio 7:3, and design two groups of comparative experiments. The average over 10-fold cross-validation is used as the experimental result to make it more objective.
Group 1: (compare the feasibility of the proposed algorithm at different levels)
The training set undergoes no sampling operation; the XGBoost classifier is constructed directly for the experiment, with all parameters at their defaults.
Apply SVM–SMOTE over-sampling and EasyEnsemble under-sampling at the data level, respectively, and model with XGBoost to compare the classification effect after adding sampling (the models are abbreviated as SS-XGB and EE-XGB).
Use the SVM–SMOTE + EasyEnsemble + XGBoost (SE-XGB) model, where the sampling proportion of SVM–SMOTE is set to 30% of the majority class.
Use the SVM–SMOTE + EasyEnsemble + Bayesian search tuning + XGBoost (SEB-XGB) model.
Group 2: (comparison between SEB-XGB and other imbalanced classification models)
In order to prove the effectiveness of the proposed algorithm model, it is compared with RUSBoost, a classic improved algorithm in the field of imbalanced data classification, the recently popular improved algorithms CatBoost and LightGBM, and the EBB-XGBoost algorithm proposed by Yue.27
Result analysis
The comparison results of the Group 1 and Group 2 experiments on the data sets are shown in Tables 4 and 5, respectively.
Results of Experiment I.
AUC: area under the curve.
Results of Experiment II.
AUC: area under the curve.
Table 6 shows the optimal parameter combinations found by the Bayesian automatic search. The tuned parameters are: learning_rate, max_depth, min_child_weight, colsample_bytree, subsample, n_estimators, lambda, and alpha.
SEB-XGB parameter combination.
Among them, for the credit card data set, the AUC value of SEB-XGB after 10-fold cross-validation reaches 0.7796, while for the credit fraud data set it reaches 0.9998.
It can be seen from Table 4 that the classification performance of the SEB-XGB model improves as data-level sampling, the combination of mixed sampling with ensemble learning, and finally Bayesian parameter tuning are added in turn. Compared with a single XGBoost, SEB-XGB increases the G-mean and AUC values by 12.4% and 2.51%, respectively, on the first data set, and by 6.49% and 4.36%, respectively, on the second, which proves the feasibility of the proposed algorithm.
As can be seen from Table 5, on both data sets the G-mean and AUC values of SEB-XGB are the best overall compared with the other improved classification models, indicating a higher recognition rate and a better classification prediction effect.
Conclusion
In order to improve the classification performance of XGBoost when the data are imbalanced, this article proposes the SEB-XGB algorithm, which combines sampling technology and ensemble learning from the two aspects of algorithm principle and imbalanced data processing. The algorithm first uses SVM–SMOTE over-sampling at the data level to generate supplementary minority samples, and then uses EasyEnsemble under-sampling to balance the data categories. At the algorithm level, XGBoost is used as the base learner for training and ensembling, and the final model is obtained by Bayesian automatic parameter search and optimization. Two groups of comparative experiments show that the proposed algorithm is feasible and outperforms both the original single XGBoost algorithm and other improved classification algorithms.
However, this article only studies the imbalanced binary classification problem, which has certain limitations. As the research deepens, we will explore multi-classification problems in the future.
