Introduction
With continual scientific and technological breakthroughs, we are now in an era of massive cloud data. To confront complicated and ever-expanding data, appropriate analysis methods must be adopted to explore its potential value. In recent years, research fields such as data mining, artificial intelligence, and machine learning have developed rapidly, and classification technology, as one of their important components, has become a research focus.
Data imbalance refers to an uneven distribution of samples across the categories of a data set. According to sample size, such a data set can be divided into the majority class (negative class) and the minority class (positive class).1 The classification of imbalanced data arises in many areas of life, such as medical diagnosis, information security, text mining, and target detection.2 Because traditional classification algorithms aim to maximize overall accuracy and assume that the data distribution across categories is relatively balanced, their misclassification rate for minority samples is high, and they cannot be applied maturely and stably to imbalanced data.3 Therefore, how to obtain a more accurate and ideal classification effect on imbalanced data sets is an urgent problem, with great practical significance and application value.
Comprehensively, the research of scholars at home and abroad on the classification of imbalanced data over the years has mainly focused on three levels: data, algorithm, and evaluation index.4
The core of data-level research is to reconstruct and adjust the sample distribution of the original data set in various ways to reduce or eliminate its imbalance. The main methods include data resampling and feature selection.5
The key of algorithm-level research is to overcome the limitations of traditional classification algorithms on imbalanced data, mainly through cost-sensitive learning and ensemble learning methods. When dealing with imbalanced data, cost-sensitive learning distinguishes classes by setting different misclassification costs for different samples; it is more flexible, but risks overfitting and has a high learning cost. Ensemble learning methods mainly combine traditional algorithms with other improved algorithms. For example, Chawla et al.6 proposed the SMOTEBoost algorithm, which combines the classical over-sampling algorithm synthetic minority over-sampling technique (SMOTE) with AdaBoost and improves generalization ability. Research at the evaluation index level is mainly aimed at exploring and optimizing classification indexes; for example, Cheng et al.7 tried to improve the evaluation index for imbalanced classification.
The extreme gradient boosting algorithm XGBoost8 is an ensemble learning algorithm with high flexibility, strong predictive and generalization ability, high scalability, efficient model training, and great robustness. Current research on XGBoost mainly focuses on direct application,9–14 integration with other algorithms,15–18 and parameter optimization.19–21 In terms of imbalanced data research, Jia22 combined a clustering-based improvement of SMOTE with XGBoost and applied ensemble learning to realize anomaly detection in the bolt process. Cui23 combined the EasyEnsemble under-sampling algorithm with XGBoost, comprehensively using majority-class sample information for classification and prediction. However, due to the diversity of imbalanced data distributions, the effect of combining a single over-sampling or under-sampling method with XGBoost is often not ideal.
In view of this, the primary goal of this article is to optimize and improve the classification performance of XGBoost in the case of imbalanced data, and an XGBoost classification algorithm combining mixed sampling technology and ensemble learning is proposed. First, at the data level, SVM–SMOTE and EasyEnsemble are used to reduce the imbalance of the data. Then, at the algorithm level, XGBoost is used to train the model, and a Bayesian optimization algorithm automatically searches for the optimal parameters. Analysis of the experimental results shows that the proposed classification model outperforms three representative classification models (RUSBoost,24 CatBoost,25 and LightGBM26) as well as the XGBoost algorithm based on mixed sampling designed by Yue.27
Related algorithms
SVM–SMOTE
As a classic over-sampling algorithm with universal applicability, SMOTE28 synthesizes data by random interpolation between a minority sample and its nearest neighbors, expanding the feature decision area of the minority class, and can effectively balance data. However, it also suffers from low-quality synthetic samples, fuzzy class boundaries, and uneven distribution of the minority samples.29
The main improved algorithms that followed include Borderline-SMOTE, ADASYN, and SVM–SMOTE. Borderline-SMOTE and ADASYN mainly address the quality of the generated samples. SVM–SMOTE instead divides the minority samples into safe (more than half of the nearest neighbors belong to the minority class), dangerous (more than half of the nearest neighbors belong to the majority class), and noise (all nearest neighbors belong to the majority class). It then trains an SVM on the dangerous samples and uses the support vectors found by the SVM to generate new samples close to the boundary between the majority and minority classes. This gives full play to the advantage of the SVM in boundary decision-making and addresses both the quality of generated samples and class-boundary ambiguity, making SVM–SMOTE an effective over-sampling method.
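The interpolation at the heart of the SMOTE family can be illustrated with a minimal, library-free sketch. This is only the core synthesis step, x_new = x + u·(x_nn − x); the dangerous-sample filtering and SVM boundary search of the full SVM–SMOTE are deliberately omitted, and the function names are illustrative, not the paper's implementation:

```python
import math
import random

def nearest_minority_neighbor(x, minority):
    """Return the closest other minority sample to x (Euclidean distance)."""
    candidates = [m for m in minority if m is not x]
    return min(candidates, key=lambda m: math.dist(x, m))

def smote_synthesize(minority, n_new, seed=0):
    """Generate n_new synthetic samples by interpolating a randomly chosen
    minority sample toward its nearest minority neighbor:
        x_new = x + u * (x_nn - x),  u ~ Uniform(0, 1)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        x_nn = nearest_minority_neighbor(x, minority)
        u = rng.random()
        synthetic.append(tuple(xi + u * (ni - xi) for xi, ni in zip(x, x_nn)))
    return synthetic
```

In practice, `imblearn.over_sampling.SVMSMOTE` implements the complete algorithm, including the SVM-based detection of boundary (dangerous) samples.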
EasyEnsemble
EasyEnsemble30 is a hybrid ensemble under-sampling algorithm. It uses Bagging to fuse random under-sampling with the AdaBoost algorithm, which makes up for the general under-sampling defect of losing important classification information, while choosing AdaBoost as the base classifier improves classification accuracy and generalization ability. Different from the supervised combination used by another representative algorithm, BalanceCascade, EasyEnsemble is based on unsupervised under-sampling. It has low time complexity and high data utilization, largely avoiding the waste of limited data resources, and is an effective and extensible method. The principle steps of EasyEnsemble are shown in Table 1.
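The under-sampling side of EasyEnsemble can be sketched in a few lines: draw several independent random subsets of the majority class, each the size of the minority class, and pair each with the full minority class. This is a simplified illustration under stated assumptions; the AdaBoost base learners trained on each balanced set and their fusion are omitted, and the function name is hypothetical:

```python
import random

def easy_ensemble_subsets(majority, minority, n_subsets=4, seed=0):
    """Build n_subsets balanced training sets: each combines one random
    subset of the majority class (size = |minority|, drawn without
    replacement) with the full minority class."""
    rng = random.Random(seed)
    balanced_sets = []
    for _ in range(n_subsets):
        subset = rng.sample(majority, k=len(minority))
        balanced_sets.append(subset + minority)
    return balanced_sets
```

A full implementation, with AdaBoost base estimators, is available as `imblearn.ensemble.EasyEnsembleClassifier`.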
Algorithm steps of EasyEnsemble.
XGBoost
For a given training set with $n$ samples $D=\{(x_i,y_i)\}_{i=1}^{n}$, XGBoost builds an additive model of $K$ regression trees, whose prediction is $\hat{y}_i=\sum_{k=1}^{K}f_k(x_i)$.

The modeling process of XGBoost is to leave the original model unchanged and set the input of the next tree to the residual of the current prediction, so that after $t$ rounds $\hat{y}_i^{(t)}=\hat{y}_i^{(t-1)}+f_t(x_i)$.

Initialize $\hat{y}_i^{(0)}=0$; the objective function of round $t$ is

$$Obj^{(t)}=\sum_{i=1}^{n}l\big(y_i,\hat{y}_i^{(t-1)}+f_t(x_i)\big)+\Omega(f_t) \tag{1}$$

In the above formula, $l$ is a differentiable convex loss function measuring the deviation between prediction and label, and $\Omega(f_t)$ is a regularization term penalizing the complexity of the new tree.

Since $l$ is generally not quadratic, XGBoost applies a second-order Taylor expansion around $\hat{y}_i^{(t-1)}$, with first- and second-order gradients $g_i=\partial_{\hat{y}^{(t-1)}}l(y_i,\hat{y}_i^{(t-1)})$ and $h_i=\partial^{2}_{\hat{y}^{(t-1)}}l(y_i,\hat{y}_i^{(t-1)})$, so that, after dropping constant terms, $Obj^{(t)}\approx\sum_{i=1}^{n}\big[g_if_t(x_i)+\tfrac{1}{2}h_if_t^{2}(x_i)\big]+\Omega(f_t)$.

In XGBoost, the number of leaf nodes $T$ and the leaf weights $w_j$ define the regularization term $\Omega(f_t)=\gamma T+\tfrac{1}{2}\lambda\sum_{j=1}^{T}w_j^{2}$.

In general, formula (1) can be rewritten in the following format

$$Obj^{(t)}=\sum_{j=1}^{T}\Big[G_jw_j+\tfrac{1}{2}(H_j+\lambda)w_j^{2}\Big]+\gamma T \tag{2}$$

where $I_j$ is the set of samples falling on leaf $j$, $G_j=\sum_{i\in I_j}g_i$, and $H_j=\sum_{i\in I_j}h_i$.

Assuming that the tree structure has been determined, the optimal solution obtained by directly deriving formula (2) with respect to each $w_j$ is $w_j^{*}=-\frac{G_j}{H_j+\lambda}$, with the corresponding optimal objective value $Obj^{*}=-\tfrac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j+\lambda}+\gamma T$.

XGBoost calculates the gain value through the formula

$$Gain=\frac{1}{2}\left[\frac{G_L^{2}}{H_L+\lambda}+\frac{G_R^{2}}{H_R+\lambda}-\frac{(G_L+G_R)^{2}}{H_L+H_R+\lambda}\right]-\gamma$$

and selects the feature with the largest corresponding gain value, and the split point under that feature, for splitting.

Among them, $G_L$, $H_L$ and $G_R$, $H_R$ are the gradient and Hessian sums of the left and right child nodes after the split, and $\gamma$ is the complexity cost introduced by adding a leaf.
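The optimal leaf weight and split gain above translate directly into code; a minimal sketch (function names are illustrative):

```python
def optimal_leaf_weight(G, H, lam):
    """w* = -G / (H + lambda) for a leaf with gradient sum G and Hessian sum H."""
    return -G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Gain = 1/2 [G_L^2/(H_L+lam) + G_R^2/(H_R+lam)
                   - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma
```

A split is accepted only when its gain is positive, which is exactly how the $\gamma$ term prunes low-value splits.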
Algorithm optimization and design
Regularization term optimization
Because the
then formula (2) becomes
The optimal solution of the derivative solution is
and the gain calculation formula becomes
It can be seen from the above formula that the new definition makes
Classification algorithm
Since mixed sampling can take into account both under-sampling and over-sampling, it often produces better results. Therefore, this article integrates a mixed sampling method with an ensemble learning method and designs an XGBoost imbalanced data classification algorithm combining the SVM–SMOTE and EasyEnsemble algorithms. Its strategy is as follows:
1. Divide the original data set into a training set and a test set.
2. Use the SVM–SMOTE algorithm to over-sample the minority samples in the training set and generate supplementary minority samples.
3. Use the EasyEnsemble under-sampling algorithm to independently and randomly extract multiple subsets from the majority class, and combine each subset with the minority samples to form balanced training subsets.
4. Use the parameters of XGBoost and the area under the curve (AUC) value as the input and output of the objective function in the Bayesian optimization search, adjust to the best parameter combination in time, and perform training and ensembling on the balanced subsets with XGBoost as the base learner.
5. Based on the model obtained after tuning the parameters, complete the final prediction on the test set.
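The strategy above can be sketched end to end on synthetic data. This is an illustrative simplification, not the paper's implementation: scikit-learn's GradientBoostingClassifier stands in for XGBoost so the sketch stays self-contained, the SVM–SMOTE step is only marked, and the Bayesian parameter search of step 4 (e.g. via scikit-optimize's BayesSearchCV) is omitted:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Step 1: imbalanced toy data, 7:3 stratified train/test split.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Step 2 (SVM-SMOTE over-sampling) is omitted here; imblearn's SVMSMOTE
# would expand the minority class of (X_tr, y_tr) at this point.

# Step 3: EasyEnsemble-style balanced subsets of the majority class,
# one base learner per subset.
min_idx = np.flatnonzero(y_tr == 1)
maj_idx = np.flatnonzero(y_tr == 0)
models = []
for _ in range(5):
    sub = rng.choice(maj_idx, size=len(min_idx), replace=False)
    idx = np.concatenate([sub, min_idx])
    clf = GradientBoostingClassifier(random_state=0)  # stand-in for XGBoost
    models.append(clf.fit(X_tr[idx], y_tr[idx]))

# Step 5: average the base learners' probabilities and score by AUC.
proba = np.mean([m.predict_proba(X_te)[:, 1] for m in models], axis=0)
auc = roc_auc_score(y_te, proba)
print(round(auc, 3))
```

Averaging the per-subset probabilities is one simple fusion rule; the paper's pipeline additionally tunes the base learner's parameters by Bayesian search before this final ensembling.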
The flow framework of the algorithm is shown in Figure 1.

Framework diagram of algorithm flow.
Experiment and result analysis
Experimental platform and data introduction
The experimental platform is a Windows 10 (64-bit) operating system with 8 GB memory and an Intel(R) Core(TM) i7-7500 CPU @ 2.70 GHz; the experimental tools are mainly Python 3.9 and packages including xgboost 1.3.3, jupyter 1.0.0, seaborn 0.11.1, sklearn 0.0, pandas 1.2.2, numpy 1.20.1, matplotlib 3.3.4, and imblearn 0.0.
In order to prove the classification performance of the proposed algorithm on imbalanced data sets, two public imbalanced data sets are used in the experiments. The first is the Taiwan credit card data set publicly provided on the UCI website. It records bank customers' arrears history, credit data, statistical characteristics, billing statements, and other information from April to September 2005. It has 24 characteristic attributes and one category label, which can be used to predict customer default. The specific meaning of the features is shown in Table 2. The second is the credit fraud data set provided by the ULB machine learning laboratory. It contains the credit card transactions of a European bank in September 2013, including 492 frauds among 284,807 transactions, so the data categories are extremely imbalanced. For security reasons, the original data have been desensitized and reduced by principal component analysis (PCA), leaving 29 feature attributes and 1 category label. Among them, the features "V1"–"V28" are the principal components obtained by PCA, while "Time" and "Amount" were not processed by PCA. "Time" represents the time interval between each transaction and the first transaction, and "Amount" represents the transaction amount. For the class label "Class," 1 represents fraud and 0 represents normal.
Characteristic description.
For PAY_0 and PAY_2–PAY_6, a value of −1 means that the customer repaid on time, 1 means the repayment was postponed for 1 month, 2 means it was postponed for 2 months, and so on. In addition, if the value of PAY_AMT is greater than or equal to the previous month's BILL_AMT value, the customer is deemed to have repaid on time; if it is less than the previous month's BILL_AMT value but greater than the lower repayment limit set by the bank, it is regarded as delayed repayment; and repaying less than the minimum repayment amount is regarded as default.
Data analysis
The analysis below mainly presents the first data set. The second data set needs no cleaning and shows no significant correlation between characteristic variables; its processing flow is similar to that of the first data set.
First, by loading and viewing the data set and performing missing-value testing and descriptive statistical analysis, it is found that the data have 30,000 rows and 25 columns, no feature has missing values, and all values are integers. Due to the large differences between value ranges, standardization is required. Part of the data and the test results are shown in Figures 2 and 3.

Partial data display.

Missing value test results.
Second, by examining the sample distribution of each feature and testing for outliers, it was found that the features EDUCATION and MARRIAGE contain abnormal values. EDUCATION takes the values 0, 5, and 6 beyond those given in the data description, with 345 outliers, which is small relative to the total sample size, so these values are merged into category 4. MARRIAGE takes the value 0 beyond the description, with 54 outliers, which are merged into category 3 for correction and filling.
At the same time, the analysis shows that 6636 customers will default next month, far fewer than the 23,364 non-defaulting customers, so the data are obviously imbalanced. The information of the two experimental data sets is summarized in Table 3.
Experimental data set information.
Moreover, analysis of the variables in the credit card data set shows that, among all customers, the default ratio of males is 24.2% and that of females is 20.8%. Customers aged between 30 and 40 repay on time in the largest numbers, and the older the customer, the higher the default rate. Unmarried customers have the largest number of on-time repayments, while their number of defaults is similar to that of married customers. High-school-educated customers have the highest default rate, and the higher the education level, the lower the default rate.
Finally, comprehensive feature analysis shows that only PAY_0 carries complete information among the repayment-status features, and the two feature groups BILL_AMT and PAY_AMT are highly internally correlated, so features with large correlation coefficients with default.payment.next.month can be selected for modeling. Therefore, ID, PAY_2–PAY_6, BILL_AMT2, BILL_AMT4–BILL_AMT6, and PAY_AMT3–PAY_AMT6 are deleted, and default.payment.next.month is separated as the label. The specific feature correlation heatmap is shown in Figure 4.

Feature correlation heatmap.
Performance evaluation index
In traditional classification problems, single evaluation indexes such as recall and accuracy are often used and reflect algorithm performance well. However, for imbalanced classification with skewed data, a single index no longer has good reference value. Therefore, this article selects the comprehensive evaluation indexes G-mean and AUC to compare and analyze the prediction effects in the experiments.
In the confusion matrix, TP and FN denote the numbers of minority (positive) samples predicted correctly and incorrectly, and TN and FP denote the numbers of majority (negative) samples predicted correctly and incorrectly. The true positive rate is $TPR=\frac{TP}{TP+FN}$, the true negative rate is $TNR=\frac{TN}{TN+FP}$, and $G\text{-}mean=\sqrt{TPR\times TNR}$.

Therefore, G-mean contains the two single evaluation indexes TPR and TNR, and is large only when the classification accuracy of both classes is high, which makes it suitable for evaluating imbalanced classification.
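G-mean as defined above is a one-liner over the confusion-matrix counts; a small sketch:

```python
import math

def g_mean(tp, fn, tn, fp):
    """Geometric mean of the true positive rate (minority-class recall)
    and the true negative rate (majority-class recall)."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return math.sqrt(tpr * tnr)
```

Because it is a geometric mean, a classifier that ignores the minority class (TPR near 0) scores near 0 regardless of how well it handles the majority class.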
Experiment settings and results
Experiment settings
After data analysis and feature selection, we first divide the original imbalanced data set into a training set and a test set in the ratio 7:3, and design two groups of comparative experiments. The average over 10-fold cross-validation is used as the experimental result to make it more objective.
Group 1: (compare the feasibility of the proposed algorithm at different levels)
The training set undergoes no sampling operation; the XGBoost classifier is constructed directly for the experiment, with all parameters at their defaults.
Apply SVM–SMOTE over-sampling and EasyEnsemble under-sampling at the data level, respectively, and model with XGBoost to compare the classification effect after adding sampling (the models are abbreviated as SS-XGB and EE-XGB).
Use the SVM–SMOTE + EasyEnsemble + XGBoost (SE-XGB) model, where the sampling proportion of SVM–SMOTE is set to 30% of the majority class.
Use the SVM–SMOTE + EasyEnsemble + Bayesian search tuning + XGBoost (SEB-XGB) model.
Group 2: (comparison between SEB-XGB and other imbalanced classification models)
In order to prove the effectiveness of the proposed algorithm model, it is compared with RUSBoost, a classic improved algorithm in the field of imbalanced data classification, the recently popular improved algorithms CatBoost and LightGBM, and the EBB-XGBoost algorithm proposed by Yue.27
Result analysis
The comparison results of the Group 1 and Group 2 experiments on the data sets are shown in Tables 4 and 5, respectively.
Results of Experiment I.
AUC: area under the curve.
Results of Experiment II.
AUC: area under the curve.
Table 6 shows the optimal parameter combinations found by the Bayesian automatic search. The tuned parameters are: learning_rate, max_depth, min_child_weight, colsample_bytree, subsample, n_estimators, lambda, and alpha.
SEB-XGB parameter combination.
Among them, for the credit card data set, the AUC value of SEB-XGB after 10-fold cross-validation reaches 0.7796, while for the credit fraud data set it reaches 0.9998.
It can be seen from Table 4 that the classification performance of the SEB-XGB model improves as data-level sampling, the combination of mixed sampling with ensemble learning, and finally Bayesian parameter tuning are added in turn. Compared with a single XGBoost, SEB-XGB increases the G-mean and AUC values by 12.4% and 2.51%, respectively, on the first data set, and by 6.49% and 4.36%, respectively, on the second, which proves the feasibility of the proposed algorithm.
As can be seen from Table 5, on both data sets the G-mean and AUC values of SEB-XGB are the best overall compared with the other improved classification models, indicating a higher recognition rate and a better classification prediction effect.
Conclusion
In order to improve the classification performance of XGBoost when the data are imbalanced, this article proposes the SEB-XGB algorithm, which combines sampling technology and ensemble learning from the two aspects of algorithm principle and imbalanced data processing. The algorithm first uses SVM–SMOTE over-sampling at the data level to generate supplementary minority samples, and then uses EasyEnsemble under-sampling to balance the data categories. At the algorithm level, XGBoost is used as the base learner for training and ensembling, and the final model is obtained by Bayesian automatic parameter search and optimization. Two groups of comparative experiments show that the proposed algorithm is feasible and outperforms both the original single XGBoost algorithm and other improved classification algorithms.
However, this article only studies the imbalanced binary classification problem, which has certain limitations. As the research deepens, we will explore multi-classification problems in the future.
