Abstract
The state of health (SOH) of a power battery is an important parameter of the battery management system (BMS). It reflects the aging of the battery, [1] and its value usually directly determines whether the battery module or pack of a device needs to be replaced. [2] In current research, SOH is generally defined as the ratio of the maximum capacity to which the battery can currently be charged to its rated capacity when it left the factory. [3]
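The formula is:

$$\mathrm{SOH} = \frac{C_{\mathrm{lef}}}{Q_N} \times 100\% \tag{1}$$

where $C_{\mathrm{lef}}$ is the maximum capacity to which the battery can currently be charged and $Q_N$ is the rated capacity when it left the factory.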
At present, the commonly used traditional model-based approaches to power battery state estimation are the equivalent circuit model, [5] the electrochemical model [6] and the finite element model. However, the internal chemical reactions of lithium batteries cannot be directly observed or measured, and the measurements contain a certain amount of noise. [7] These model-based methods therefore show some deviation, and it is difficult to establish a model that can be applied reasonably and accurately under actual conditions. [8]
In response to this problem, this paper studies the application of data-driven algorithms to estimating the state of health of power batteries. [9] Compared with the traditional model-based approach, [10] this method takes into account the actual usage scenarios and habits of users, such as the average charging current and the temperature, which have a considerable impact on the actual SOH. It only requires analyzing a large amount of historical usage data from the data platform, [11] and the estimation accuracy of the algorithm can be verified against the corresponding experimental data. [12]
Data preprocessing
The data used in this paper were provided by the Shanghai New Energy Vehicle Public Data Collection and Monitoring Research Center. They come from the Shanghai new energy vehicle operation data collected by the data center, and all of them have been desensitized. The data set covers 25 pure electric vehicles, with a collection interval of 10 s and a time span of 180 days, and includes vehicle operating data, battery extreme value data, and static information. Part of the sample data is shown in Table 1.
Sample data of new energy vehicles.
Data segmentation
Since the data used in this paper are raw driving data and do not contain the model label, SOH, the first task is to estimate the real SOH of each vehicle from the raw data. According to the definition of SOH in formula (1), the maximum rechargeable capacity in the current state needs to be calculated, so each charging event must first be extracted from the raw data. The charging event segmentation algorithm in this paper considers two possible abnormal situations. The first is that, when the vehicle is charging in an underground parking garage or another place with a poor network signal, data may not be uploaded in time, so that subsequent frames of the charging process are lost. The second is that the charging pile is abnormal and its output current is unstable. When the SOC algorithm in the BMS captures such a maximum or minimum current, it can trigger a program bug, with the result that the SOC of a later frame during charging is smaller than that of the previous frame, whereas SOC should increase steadily during normal charging. [13] Considering these situations, and in order to reduce the misclassification of charging events, the following segmentation algorithm is designed, which helps ensure the accuracy of the subsequent labels:
Case 1: When the time interval between the previous frame and the next frame is ≤ 10 min and the SOC of the previous frame is ≤ the SOC of the next frame, the two frames are considered part of the same charging event;
Case 2: When the time interval between the previous frame and the next frame is > 10 min and < 30 min and the SOC of the previous frame is < the SOC of the next frame, the two frames are considered part of the same charging event;
Case 3: Otherwise, the next frame is considered the start of the next charging event.
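The three cases can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `segment_charging_events` and the `(timestamp, soc)` record layout are assumptions, and the input is assumed to contain only charging-state frames of one vehicle, already sorted by time.

```python
from datetime import datetime, timedelta

def segment_charging_events(records):
    """Split time-ordered (timestamp, soc) charging frames into charging events.

    Hypothetical helper: applies Case 1 to Case 3 to each pair of adjacent frames.
    """
    events, current = [], []
    for ts, soc in records:
        if current:
            prev_ts, prev_soc = current[-1]
            gap = ts - prev_ts
            # Case 1: gap <= 10 min and SOC did not decrease.
            case1 = gap <= timedelta(minutes=10) and prev_soc <= soc
            # Case 2: 10 min < gap < 30 min and SOC strictly increased.
            case2 = (timedelta(minutes=10) < gap < timedelta(minutes=30)
                     and prev_soc < soc)
            if not (case1 or case2):
                # Case 3: close the current event and start a new one.
                events.append(current)
                current = []
        current.append((ts, soc))
    if current:
        events.append(current)
    return events

# Illustrative frames (made-up values): a ~20 min upload gap inside one
# charge, followed by a separate charge three hours later.
t0 = datetime(2020, 1, 1, 8, 0, 0)
frames = [
    (t0, 40),
    (t0 + timedelta(seconds=10), 40),
    (t0 + timedelta(seconds=20), 41),
    (t0 + timedelta(minutes=20), 45),
    (t0 + timedelta(hours=3), 30),
    (t0 + timedelta(hours=3, seconds=10), 31),
]
events = segment_charging_events(frames)
event_sizes = [len(e) for e in events]
```

In this example the upload gap is bridged by Case 2 because the SOC increased across it, while the drop in SOC after three hours starts a new event under Case 3.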
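According to formula (1), $Q_N$ is a fixed, known quantity. $C_{\mathrm{lef}}$ can be estimated from actual operating data as the maximum charging capacity of the vehicle, so formula (1) can be rewritten as formula (2):

$$\mathrm{SOH} = \frac{\dfrac{\int_{t_{\mathrm{start}}}^{t_{\mathrm{end}}} |I(t)|\,\mathrm{d}t}{\mathrm{SOC}_{\mathrm{end}} - \mathrm{SOC}_{\mathrm{start}}}}{Q_N} \times 100\% \tag{2}$$

where $I(t)$ is the charging current and $\mathrm{SOC}_{\mathrm{start}}$ and $\mathrm{SOC}_{\mathrm{end}}$ are the SOC at the start and end of the charging event.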
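As a worked illustration of the ampere-hour integration behind formula (2), the sketch below assumes a fixed sampling interval (the data set reports 10 s), approximates the integral with a rectangle rule, and takes the magnitude of the current. The function name `estimate_soh`, its parameters and the numbers in the example are hypothetical.

```python
def estimate_soh(currents_a, dt_s, soc_start_pct, soc_end_pct, rated_capacity_ah):
    """Estimate SOH (as a fraction) for one charging event.

    currents_a        : sampled charging currents in A (sign is ignored)
    dt_s              : sampling interval in seconds (assumed constant)
    soc_start_pct     : SOC at the start of the event, in percent
    soc_end_pct       : SOC at the end of the event, in percent
    rated_capacity_ah : rated capacity Q_N in Ah
    """
    delta_soc = (soc_end_pct - soc_start_pct) / 100.0
    if delta_soc <= 0:
        raise ValueError("SOC must increase over a charging event")
    # Rectangle-rule ampere-hour integral: sum(|I|) * dt, converted from As to Ah.
    charged_ah = sum(abs(i) for i in currents_a) * dt_s / 3600.0
    # Scale the charged capacity up to a full 0-100 % SOC window.
    full_capacity_ah = charged_ah / delta_soc
    return full_capacity_ah / rated_capacity_ah

# One hour at a constant 20 A adds 20 Ah; raising SOC from 30 % to 80 %
# implies a 40 Ah full capacity, i.e. 0.8 of a hypothetical 50 Ah rating.
soh = estimate_soh([-20.0] * 360, 10.0, 30.0, 80.0, 50.0)
```

Dividing by the SOC window is what makes short events unreliable: the smaller the window, the more a one-point SOC error distorts the result.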
K-means distinguishes fast and slow charging
From the segmented charging events, those that allow the SOH to be evaluated more accurately are selected. The conditions are as follows:
If a charging event is too short, that is, if $\mathrm{SOC}_{\mathrm{end}} - \mathrm{SOC}_{\mathrm{start}}$ is too small, the capacity term in the numerator of formula (2) is divided by a value close to zero, so small SOC errors are greatly amplified and the estimated SOH becomes abnormal. The first condition in this paper is therefore that $\mathrm{SOC}_{\mathrm{end}} - \mathrm{SOC}_{\mathrm{start}}$ is > 30%. Fast charging poses two further problems. Firstly, for safety and to prolong the service life of lithium batteries, the charging pile includes a lower-current stage during fast charging: [14] a high current is used in the low SOC range, while in the high SOC range charging switches from constant current to constant voltage and the current is gradually reduced to protect the battery. [15] If data loss occurs during this period, it causes an error in the calculated ampere-hour integral and hence in the SOH. Secondly, the current during fast charging is so large that part of it is actually dissipated as heat, so the capacity calculated by the ampere-hour integral is too high and the estimated SOH is too large. [16] Therefore, of all the charging events, only the slow charging events are retained for SOH estimation, which prepares the labels for model training. The criterion for distinguishing fast charging from slow charging is the current. [17]
As shown in Figure 1, the current distribution has two distinct peaks, corresponding to fast charging and slow charging. Because the current ranges of fast and slow charging cannot be specified manually in advance, the problem is unlabeled and suits an unsupervised learning model. In this paper, all charging current data are fed into a K-means clustering model. Since there are only two classes, fast charging and slow charging, K = 2 is used to classify the charging events. The clustering result shows that the cluster center of slow charging events is −13.7 A and that of fast charging events is −76.4 A (charging currents are recorded as negative values), which is consistent with the distribution shown in Figure 1.
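To make the clustering step concrete, here is a minimal one-dimensional K-means sketch in plain Python. It is an illustration only: the paper does not say which implementation was used, the quantile-based initialization is an assumption, and the currents in the example are invented values placed near the two reported cluster centers.

```python
def kmeans_1d(values, k=2, max_iter=100):
    """Plain Lloyd's algorithm on scalar data; returns (centers, labels)."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumed initialization: evenly spaced quantiles of the sorted data.
    centers = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        # Recompute each center as the mean of its members (keep it if empty).
        new_centers = [
            sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
    return centers, labels

# Invented mean currents of eight charging events, in A (negative = charging).
currents = [-12.0, -14.0, -13.0, -15.0, -75.0, -78.0, -77.0, -76.0]
centers, labels = kmeans_1d(currents, k=2)
# The slow-charging cluster is the one whose center has the smaller magnitude.
slow_cluster = min(range(2), key=lambda j: abs(centers[j]))
slow_events = [c for c, lab in zip(currents, labels) if lab == slow_cluster]
```

Only the events assigned to the slow-charging cluster would then be passed on to SOH estimation.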

Charging current distribution diagram.
Screening of charging events
Using the algorithms in the Data segmentation and K-means subsections, 1468 events meeting the conditions were screened out of all 6029 charging events, and the SOH of the 25 vehicles was estimated with formula (2). This estimate differs from the SOH calculated on board, because the SOH in vehicle-mounted BMSs on the market is generally obtained either by multiplying the rated capacity by a coefficient derived from cycle test experiments or by a look-up table, neither of which considers the actual use conditions of each vehicle. Since all 25 vehicles need to be placed on a common axis and the exact cumulative driving time is not known, the absolute cumulative mileage is selected as the abscissa. As can be seen from Figure 2, because of differences in charging current, vehicle model and other characteristics, the SOH varies considerably at the same mileage, which is in line with the actual situation; nevertheless, the SOH estimated from the 1468 charging events shows a gradual decline as the accumulated mileage increases. Next, a straight line is fitted by least squares, shown as the green dashed line in Figure 2, and the 95% confidence interval around the regression line is calculated, shown as the region between the red dashed lines in Figure 3. Points outside this interval are regarded as abnormal SOH values and deleted, which further ensures the accuracy of the subsequent labels. The remaining 1387 charging events are used as labels for subsequent model training.
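The screening step can be sketched as a least-squares line plus a residual band. The paper does not state how the 95% interval was computed, so the sketch below makes an explicit assumption: the band is taken as ±1.96 times the residual standard deviation around the fitted line. The function names and the synthetic data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def screen_by_band(xs, ys, z=1.96):
    """Return (slope, intercept, keep) where keep[i] is True inside the band."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    # Residual standard deviation with n - 2 degrees of freedom.
    sigma = (sum(r * r for r in residuals) / (len(xs) - 2)) ** 0.5
    keep = [abs(r) <= z * sigma for r in residuals]
    return slope, intercept, keep

# Synthetic example: SOH declining with mileage, with one abnormal estimate.
mileage = list(range(20))                       # e.g. in units of 5000 km
soh = [1.0 - 0.005 * m for m in mileage]
soh[10] += 0.2                                  # injected outlier
slope, intercept, keep = screen_by_band(mileage, soh)
kept = [(m, s) for m, s, k in zip(mileage, soh, keep) if k]
```

A band of this kind removes roughly 5% of points when residuals are close to normal, which is of the same order as the reduction from 1468 to 1387 events reported above.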

95% Confidence interval screening of charging events.

Charging events after screening.
Feature engineering
The health status of a power battery is affected by many factors, such as the number of cycles, time in use, battery consistency, and the user's driving habits. [18] This paper identifies 13 features that affect battery health status, as shown in Table 2.
Feature selection of SOH model.
Among them, the number of times the highest/lowest temperature probe code changes during a charge reflects the stability or disorder of the temperature distribution inside the battery pack. The number of times the highest-voltage cell code changes during a charge reflects the consistency of the battery pack: when consistency is good, the code of the highest-voltage cell changes from one cell to another, whereas if the highest and lowest voltages always appear on the same two cells, the pack has poor consistency. The initial SOC of a charge reflects the depth of discharge (DOD) of the battery pack and, indirectly, the user's habits. The month feature is One-Hot encoded and thus converted into a categorical feature. The pairwise distributions of all features are plotted in Figure 4, and Figure 5 is a partial magnification of the upper left corner of Figure 4. The two features marked by the red box in the figure, the voltage at the end of the charge (end_volt) and the SOC at the end of the charge (end_soc), show an obvious positive correlation. This indicates that the features are redundant and would cause multicollinearity in the model, making the regression model unstable and making it difficult to distinguish the individual effect of each explanatory variable, so only end_soc is retained.
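Two of the features described above are simple to compute from the frames of one charging event. The sketch below is illustrative: the function names are hypothetical, and the 12-slot month encoding is an assumption (with a 180-day span, only some months actually occur in the data).

```python
def count_code_changes(codes):
    """Count how often the reported code changes between consecutive frames.

    `codes` is the per-frame sequence of, for example, the highest-voltage
    cell code or the highest-temperature probe code during one charge.
    """
    return sum(1 for prev, cur in zip(codes, codes[1:]) if prev != cur)

def one_hot_month(month):
    """One-Hot encode a calendar month (1-12) as a list of 12 zeros and ones."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    return [1 if m == month else 0 for m in range(1, 13)]

# Highest-voltage cell code per frame: it moves between cells three times.
changes = count_code_changes([3, 3, 7, 7, 3, 12])
march = one_hot_month(3)
```

A pack whose highest-voltage code never changes would score 0 here, which by the reasoning above points to poor consistency.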

Distribution map between features.

Partial magnification of Figure 4.
At this point there are 12 features, 11 of which are numeric and 1 of which is categorical (month). The Pearson correlation coefficients between the numeric features are calculated and drawn as a heat map in Figure 6, where each value is the Pearson correlation coefficient between a pair of input features. The purpose is to visualize the degree of correlation between the input features as a step in feature engineering. No coefficient in the figure approaches 1, indicating that the remaining features are suitable for model training. The 11 numeric features are then scaled with min-max normalization to eliminate the dimensional differences between features before model training. The results of the linear analysis with Pearson and Spearman coefficients are shown in Table 3 and Table 4.
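For reference, the two operations in this step can be written out directly. This is a minimal sketch in plain Python with hypothetical function names; the sample values are invented.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def min_max_scale(xs):
    """Scale values linearly to [0, 1]; a constant feature maps to all zeros."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]

# Perfectly collinear features give r = 1, the kind of pair that would be
# dropped as redundant.
r = pearson([1, 2, 3, 4], [2, 4, 6, 8])
scaled = min_max_scale([10, 20, 30])
```

When the data are split by mileage for training, one reasonable design choice is to take the minimum and maximum from the training portion only, so that the test set does not influence the scaling.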

Feature correlation heat map.
Pearson coefficients.
Spearman coefficients.
SOH estimation model based on linear regression
This paper selects four linear regression models, Linear Regression, Lasso, Ridge and Elastic Net, to build the SOH estimation model of the power battery, and compares the four. [19] The four regression algorithms are introduced below:
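(1) Linear Regression
Linear regression assumes that the target value is linearly related to the features, that is, that it satisfies a multivariate linear equation. A loss function is constructed, and the parameters w and b that minimize it are solved for. The objective function is:

$$\min_{w,b} \sum_{i=1}^{m} \left( y_i - w^{\top} x_i - b \right)^2 \tag{3}$$

where $x_i$ and $y_i$ are the feature vector and the target value of the $i$-th of $m$ samples.
(2) Lasso
To mitigate the over-fitting problem of Linear Regression, a regularization term based on the L1 norm is added to objective function (3). The result is called Lasso, and its objective function is:

$$\min_{w,b} \sum_{i=1}^{m} \left( y_i - w^{\top} x_i - b \right)^2 + \lambda \lVert w \rVert_1 \tag{4}$$

(3) Ridge
Like Lasso, Ridge addresses the over-fitting problem of Linear Regression. Unlike Lasso, Ridge adds a regularization term based on the L2 norm to the objective function. The objective function is:

$$\min_{w,b} \sum_{i=1}^{m} \left( y_i - w^{\top} x_i - b \right)^2 + \lambda \lVert w \rVert_2^2 \tag{5}$$

(4) Elastic Net
Elastic Net is a mixture of Lasso and Ridge that uses both L1 and L2 regularization. The objective function is as follows:

$$\min_{w,b} \sum_{i=1}^{m} \left( y_i - w^{\top} x_i - b \right)^2 + \lambda_1 \lVert w \rVert_1 + \lambda_2 \lVert w \rVert_2^2 \tag{6}$$

Here $\lambda$, $\lambda_1$ and $\lambda_2$ are non-negative regularization coefficients.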
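To illustrate how the L2 penalty acts, the sketch below fits Ridge in closed form by minimizing the sum of squared errors plus λ times the squared L2 norm of w, with an unpenalized intercept b. It is a small pure-Python illustration on invented data, not the training code used in the paper; the function names and the exact scaling of the penalty are assumptions.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def ridge_fit(X, y, lam):
    """Return (w, b) minimizing sum (y - w.x - b)^2 + lam * ||w||_2^2."""
    n, d = len(X), len(X[0])
    mean_x = [sum(row[j] for row in X) / n for j in range(d)]
    mean_y = sum(y) / n
    # Centering the data removes the intercept from the penalized problem.
    Xc = [[row[j] - mean_x[j] for j in range(d)] for row in X]
    yc = [v - mean_y for v in y]
    # Normal equations: (Xc^T Xc + lam * I) w = Xc^T yc.
    A = [[sum(r[i] * r[j] for r in Xc) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    rhs = [sum(r[i] * t for r, t in zip(Xc, yc)) for i in range(d)]
    w = solve_linear(A, rhs)
    b = mean_y - sum(wj * mj for wj, mj in zip(w, mean_x))
    return w, b

# Invented data generated from y = 2*x1 - x2 + 1.
X = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]
y = [0.0, 3.0, 4.0, 7.0]
w_ols, b_ols = ridge_fit(X, y, 0.0)      # lam = 0 reduces to least squares
w_ridge, b_ridge = ridge_fit(X, y, 10.0)  # lam > 0 shrinks the weights
```

With λ = 0 the fit recovers the generating coefficients; increasing λ shrinks every weight toward zero without forcing any of them to exactly zero, which is the behavior that distinguishes Ridge from Lasso when only a few features are available.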
This paper treats the preprocessed data as time series data, sorts the samples by accumulated mileage in ascending order, and trains the model with the forward validation method, which is suited to time series models. The sorted data are divided into a training set, a validation set and a test set in the ratio 6:2:2. The cumulative mileage of all data ranges from 6930 km to 141,583 km. Accordingly, the training set covers accumulated mileage from 6930 km to 80,792 km, the validation set from 80,792 km to 107,722 km, and the test set from 107,722 km to 141,583 km. The model evaluation index selected in this paper is the Mean Absolute Error (MAE), which is the average of the absolute error between the predicted value and the true value:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right| \tag{7}$$

where $\hat{y}_i$ is the predicted SOH, $y_i$ is the true SOH and $n$ is the number of samples.
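The ordered 6:2:2 split and the MAE metric can be sketched as follows. The function names, the dictionary-based sample layout and the example values are hypothetical; the point is only that the samples are sorted by mileage before splitting, so the model is always evaluated on higher mileage than it was trained on.

```python
def forward_split(rows, key, ratios=(0.6, 0.2, 0.2)):
    """Sort rows by `key` and cut them into train/validation/test in order."""
    ordered = sorted(rows, key=key)
    n = len(ordered)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]
    return train, val, test

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Ten invented samples identified only by cumulative mileage (in 1000 km).
rows = [{"mileage": m} for m in [50, 10, 90, 30, 70, 20, 100, 60, 40, 80]]
train, val, test = forward_split(rows, key=lambda r: r["mileage"])
mae = mean_absolute_error([0.90, 0.80], [0.85, 0.90])
```

Because no shuffling takes place, the lowest-mileage 60% of samples form the training set and the highest-mileage 20% form the test set, mirroring the mileage ranges listed above.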
Horizontal comparison of models.
A comparison of the four linear regression models shows that Lasso and Elastic Net, which include L1 regularization, both perform poorly. Because vehicles in actual operation do not upload data for each individual cell, this paper could select only 12 features that affect SOH, which is a small number. If L1 regularization is then used for dimensionality reduction, the model becomes too simple and the error increases. The results in this paper therefore indicate that Ridge, with L2 regularization, performs well and is the better choice. [20]
In this paper, the Ridge linear regression model is used to predict SOH, and the prediction is plotted in Figure 7. The blue dots in the figure are the real SOH at the current accumulated mileage, and the orange dots are the SOH predicted by the model. The figure shows that the predictions made by this method are basically consistent with the real SOH and meet practical requirements.

Ridge model predicts SOH.
Conclusion
This paper uses the operating data of 25 new energy vehicles provided by the Shanghai New Energy Vehicle Public Data Collection and Monitoring Research Center to estimate SOH and build an SOH estimation model. It presents the entire process, including data preprocessing, SOH evaluation, feature engineering, and the training and validation of the final SOH estimation model, and compares four linear regression models, Linear Regression, Lasso, Ridge and Elastic Net, for predicting power battery health status. The results show that the MAE of all four is less than 5%. However, in low-dimensional feature scenarios without cell-level data, the Lasso model, which is the most commonly used in industry because of its dimensionality reduction capability, is not the most suitable. Instead, the simpler Linear Regression, or Ridge with its L2 regularization term, is more applicable and meets the needs of practical use.
