Abstract
Introduction
Novelty detection, or outlier/anomaly detection, techniques have been applied in many practical domains, such as fraud detection for credit cards, intrusion detection for cyber-security, and fault detection for industrial systems, to name but a few.1–3 However, developing dedicated novelty detectors for industrial control systems has rarely been considered, let alone for an electric arc furnace (EAF) control system. In recent years, many industrial systems (including EAF) have introduced data-driven techniques to facilitate modeling and control, as more and more process data can be collected and stored. Inspired by this, novelty detection is drawing increasing attention in industrial systems, because anomalous observations adversely affect both the modeling and the control processes of any data-driven technique. In the control system of an EAF, outliers refer to observations that do not reflect the normal system states. 4
Recently, several advanced control strategies such as adaptive control and model predictive control have been proposed for EAF systems in order to improve control performance and save energy.5–8 In these control methods, machine learning algorithms such as neural networks have been used to establish the process model of the EAF. It is well known that such data-driven models are often very sensitive to outliers in the training set, and the resultant control performance will deteriorate when outliers are included. (This may be the main reason why these advanced data-driven control strategies have not been extensively used in practical EAF control systems.) In this situation, implementing an efficient novelty detector in the EAF control system may be beneficial. It is noteworthy that such a novelty detector is subtly different from a process monitoring model: here, we are only interested in the controlled variables or the variables used by data-driven control strategies, whereas process monitoring focuses more on the state variables of the system.
Although few dedicated detectors have been proposed for EAF in the literature, many existing techniques from machine learning and data mining can be used here. According to the availability of supervision, these novelty detectors are often categorized into three groups: supervised detectors, unsupervised detectors, and semi-supervised detectors. Supervised detectors use labeled training data to learn conventional classifiers, such as support vector machines (SVM) and decision trees, that can separate normal observations from anomalous ones. A crucial drawback of this type of detector lies in the need for labeled training data, which is highly challenging for most practical applications including EAF, because labeling process data is a time- and labor-intensive task. In contrast, unsupervised detectors use similarity criteria such as distance and density to mine potential outliers in databases. These detectors are therefore usually used in an off-line manner, and the most representative methods are the distance-based detector and the local outlier factor (LOF).9,10 Semi-supervised detectors are also referred to as one-class (OC) classifiers or data description techniques. The basic idea of this type of detector is that a normal pattern can be learnt because all training samples are assumed to come from the target class. In this paper, we will use the term OC classifiers to denote semi-supervised detectors. The most representative OC classifier is support vector data description (SVDD), which aims to enclose all training data with a hypersphere whose volume is as small as possible. 11
Based on the mechanisms of these detectors, semi-supervised ones are more appropriate for novelty detection in an online manner. To improve on the performance of single detectors, several ensemble models have been proposed.12–14 By combining diverse base detectors, the final detection performance can be enhanced. A notable limitation of these ensemble models is that the base detectors must be accurate and diverse simultaneously, an assumption that can hardly be satisfied when no data labels are available. In this paper, we propose a dynamic selective model that uses SVDD as the base detector. Our detector can also be deemed an ensemble model, as several base learners are necessary. In contrast to the above ensemble detectors, where a fixed model structure is used for all test points, our detector selects the most competent base detector for each test point dynamically. To facilitate the selective procedure, we use a trick to generate artificial outliers. These outliers are also used to obtain optimal parameters for the SVDD algorithm in our detector; we have noted that this problem is often ignored by studies that use SVDD. In addition, a clustering technique is used in the selective process to determine the validation set for each test point.
Here, we conclude the contributions as follows:
A dynamic novelty detection model is proposed for EAF control system.
Artificial outlier examples are generated to determine validation sets and optimize parameters of SVDD.
Datasets from real-world EAF system are used to verify the effectiveness of our detection model.
The rest of the paper is organized as follows. Related works and preliminaries are presented in section “Related works and preliminaries.” The proposed method is introduced in section “Methodology,” followed by the experiments in section “Experiments and analysis.” Finally, conclusions are drawn in section “Conclusion.”
Related works and preliminaries
Several related works regarding novelty detection in EAF systems are introduced despite their scarcity. Some necessary preliminaries are then briefly presented.
Related work
In Liu et al., 4 a model-based novelty detection method is proposed for the process control system of EAF. In this method, an improved radial basis function (RBF) neural network is first used to establish the process model of the EAF. A hidden Markov model (HMM) is then used to analyze the residuals between the true measurements and the outputs of this process model. From our point of view, the main drawback of this method is that it can only be used for univariate data; for multivariate datasets, several such models may be necessary. It also depends heavily on the predictive model, and the detection performance deteriorates considerably when the predictions are biased. In Wang and Mao, 12 a clustering-based ensemble detector is proposed for the EAF control system. In this method, a clustering algorithm is first used to separate the training set into several subsets, in each of which a single detector is established. A test point is then labeled as an outlier if it is rejected by all base detectors. In Wang and Mao, 13 the Random Subspace (RS) technique is used to develop an ensemble detector. RS is first used to divide the feature space into several subspaces; all training points are then projected onto these subspaces to generate several training subsets, on which the corresponding base detectors are trained. A combination rule is then used to derive the ultimate result for each test point. As mentioned previously, a main drawback of these ensemble detectors is that generating accurate and diverse base detectors may be difficult in some situations.
In the fields of machine learning and data mining, novelty detection has long been an active topic, since outliers often indicate interesting data patterns. In general, existing novelty detection methods can be categorized into probabilistic, distance-based, reconstruction-based, and domain-based detection models. 15 Among probabilistic detection models, the Gaussian mixture model (GMM) is one of the most popular parametric methods, with HMM and the Kalman filter being two other commonly used parametric ones; kernel density estimation is the most popular non-parametric method. Among distance-based detection models, methods based on nearest neighbors and clustering techniques are often used in applications. Neural network (NN)-based models and principal component analysis (PCA) are two commonly used reconstruction-based detection models. Finally, SVDD and OC-SVM are two representative domain-based detectors.
Preliminaries
Basic concepts concerning SVDD and ensemble learning are briefly introduced.
SVDD
Algorithm SVDD defines a model by using a hypersphere to give a closed boundary around all observations in the training set. This hypersphere can be characterized by center $a$ and radius $R > 0$. Minimizing its volume while allowing for some slack leads to the primal problem

$$\min_{R,\,a,\,\xi} \; R^2 + C\sum_i \xi_i \quad \text{s.t.} \quad \|x_i - a\|^2 \le R^2 + \xi_i, \;\; \xi_i \ge 0,$$

where the parameter $C$ controls the trade-off between the volume of the hypersphere and the number of training errors, and the $\xi_i$ are slack variables.

By setting the partial derivatives of the Lagrangian to zero, the dual optimization problem becomes

$$\max_{\alpha} \; \sum_i \alpha_i (x_i \cdot x_i) - \sum_{i,j} \alpha_i \alpha_j (x_i \cdot x_j) \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \;\; \sum_i \alpha_i = 1.$$

The solution of this dual optimization problem is a set of values of $\alpha_i$, from which the center can be expressed as

$$a = \sum_i \alpha_i x_i,$$

where only the training points with $\alpha_i > 0$ (the support vectors) contribute.

For any test point $z$, its distance to the center is computed as

$$\|z - a\|^2 = (z \cdot z) - 2\sum_i \alpha_i (z \cdot x_i) + \sum_{i,j} \alpha_i \alpha_j (x_i \cdot x_j).$$

If $\|z - a\|^2 \le R^2$, the test point is accepted as a target (normal) observation; otherwise it is rejected as an outlier. In practice, the inner products are replaced by a kernel function $K(x_i, x_j)$, typically the Gaussian kernel, to obtain more flexible boundaries.
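As a minimal illustration of this decision rule, the sketch below trains a kernelized one-class boundary and classifies two test points. scikit-learn does not ship SVDD directly, so we use its `OneClassSVM`, which with a Gaussian kernel is known to yield a decision boundary equivalent to SVDD's; the data and the `gamma`/`nu` values are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(200, 6))  # 6 features, like the EAF current/voltage data

# nu plays a role analogous to the SVDD trade-off parameter C
# (an upper bound on the fraction of rejected training points)
det = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X_train)

z_normal = np.zeros((1, 6))        # at the center of the target data
z_outlier = np.full((1, 6), 8.0)   # far outside the hypersphere
print(det.predict(z_normal))   # +1: accepted as target
print(det.predict(z_outlier))  # -1: rejected as outlier
```

`decision_function` gives a continuous score (signed distance to the boundary) that later serves as the raw outlier score to be normalized.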
Ensemble of OC classifiers
Designing parallel ensembles of OC classifiers is easier than designing sequential ones, because techniques used in conventional classification problems can be applied directly. The rationale of a parallel ensemble lies in reducing variance by inducing diverse base detectors (OC classifiers) and aggregating them. Several strategies can be used to enhance this diversity. The most commonly used technique is Bagging, which uses bootstrap sampling; by fusing individuals learnt on different training subsets, Bagging is expected to obtain more robust results. Another well-known strategy for enhancing diversity is to use different feature subsets, with RS and feature bagging (FB) being the most commonly used techniques of this type. Note that subspace-based outlier ensembles are particularly effective on high-dimensional datasets, since outliers can easily be masked there yet may be exposed in certain subspaces. Apart from these two strategies, using different model parameters (or initializations) and even different algorithms has also been proposed to enhance the diversity of outlier ensembles.
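A minimal sketch of the Bagging strategy just described: each base OC classifier is trained on a bootstrap sample, and the ensemble votes on each test point. `OneClassSVM` stands in for a generic OC classifier; the ensemble size, `nu`, and the majority-vote fusion rule are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def bagging_oc_ensemble(X, n_estimators=10, seed=0):
    """Train OC classifiers on bootstrap samples of X (a parallel ensemble)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample with replacement
        models.append(OneClassSVM(kernel="rbf", nu=0.05).fit(X[idx]))
    return models

def majority_vote(models, Z):
    """Flag a point as an outlier (-1) if most base detectors reject it."""
    votes = np.mean([m.predict(Z) for m in models], axis=0)
    return np.where(votes < 0, -1, 1)

X = np.random.default_rng(1).normal(size=(300, 6))
models = bagging_oc_ensemble(X)
print(majority_vote(models, np.zeros((1, 6))))  # [1]: the center point is accepted
```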
EAF
The EAF is a highly energy-intensive process used to convert scrap metal into molten steel. EAFs range in capacity from a few tons to as many as 400 tons. Figure 1 gives a simple description of the EAF operation. The graphite electrodes connected to the electrical supply convert electrical energy into thermal energy via the electric arcs between the electrodes and the scrap surface. In addition, natural gas and oxygen are injected into the furnace so that the released chemical energy is also converted into thermal energy. The scrap keeps melting by absorbing this thermal energy. As parts of the scrap melt and are removed, the scrap surface becomes irregular, changing its contour and causing disturbances in the arc length. The electrode regulation system responds to these disturbances by adjusting the distance from the electrodes to the scrap surface so that the optimal arc length is maintained. When sufficient space becomes available inside the furnace, another scrap charge is added. The melting process then proceeds until a flat bath of molten steel is formed at the end of the batch.

Electric arc furnace operations.
Methodology
Base detectors
As discussed previously, our detection model can be deemed an ensemble model, so the generation of base detectors is indispensable. In order to train diverse and accurate SVDD models in our ensemble, a subspace-based ensemble technique called FB 17 is used. It has been observed that subspace-based ensemble techniques are often more effective than subsampling techniques in unsupervised learning, even when the dimension of the given data is not very high. 18 The basic steps of FB can be described as follows:
Sample an integer $r$ uniformly from the range between $\lfloor d/2 \rfloor$ and $d-1$, where $d$ is the dimension of the data;
Select $r$ features at random, without replacement, to form a subspace;
Use the base detector on the projected representation of the training set;
Repeat the above steps until $m$ base detectors have been obtained.
The rationale is to sample an integer number of features at random in each round, so that the base detectors are trained on different subspaces: this enhances their diversity while each subspace still retains at least half of the original features.
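The steps above can be sketched as follows. `OneClassSVM` again stands in for SVDD, and the subspace-size range mirrors the common FB choice of between $\lfloor d/2 \rfloor$ and $d-1$ features; the ensemble size and `nu` are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def feature_bagging_svdd(X, n_estimators=10, seed=0):
    """Train one OC classifier per random feature subspace (the FB steps)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    ensemble = []
    for _ in range(n_estimators):
        r = rng.integers(d // 2, d)                   # subspace size in [d/2, d-1]
        feats = rng.choice(d, size=r, replace=False)  # features without replacement
        model = OneClassSVM(kernel="rbf", nu=0.05).fit(X[:, feats])
        ensemble.append((feats, model))               # remember the projection
    return ensemble

X = np.random.default_rng(1).normal(size=(300, 6))
ensemble = feature_bagging_svdd(X)
# Scoring a test point requires projecting it onto each stored subspace first
scores = [m.decision_function(np.zeros((1, len(f))))[0] for f, m in ensemble]
print(len(ensemble))  # 10 base detectors
```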
Artificial outliers
Before introducing the generation of artificial outliers, we first give a brief description of dynamic ensembles, so that the purpose of the artificial outliers can be explained. The general training and test process of dynamic ensemble learning is illustrated in Figure 2.

General description of dynamic ensemble learning.
Once the base detectors have been trained, a selective procedure is implemented for each test point. As Figure 2 shows, a validation set is necessary to complete the selection. The objective of this validation set is to provide reference examples with which the competence of each base detector can be assessed with respect to the test point. The most competent base detector(s) are then selected according to the result of these competence calculations.
From the above description of dynamic ensembles, we can see that the role of the validation set is critical, as it is the premise of the competence calculation. However, in novelty detection no labeled training examples are available to constitute this validation set, so we have to generate artificial ones instead. A simple strategy for generating artificial outliers is to sample examples from a bounded uniform distribution. Another strategy is to assume that outlier examples are located in sparse regions of the target domain, that is, regions where the target data are either absent or isolated from the rest of the data. 19 Fan et al. 20 propose to generate outliers close to the target data by constraining the learning algorithm to form an accurate boundary between known classes and anomalies: the value of one feature of a target point is changed randomly while leaving the other features unchanged. A major drawback of these methods is the impossibility of generating a sufficient number of outlier examples in high-dimensional settings due to the curse of dimensionality. To this end, Tax and Duin 21 propose to generate a uniform hyper-spherical outlier distribution that fits more tightly around the target class than a hyper-box distribution. However, this approach is mainly used to optimize the parameters of OC classifiers, and the resultant outlier class heavily overlaps the target class; consequently, it is not appropriate for providing reference validation sets in dynamic ensembles.
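As a minimal sketch of Tax and Duin's hyper-spherical strategy, the following samples artificial outliers uniformly from a hypersphere enclosing the target data. The `margin` factor and the synthetic data are hypothetical choices; the $u^{1/d}$ radius scaling is the standard trick for uniform sampling inside a ball.

```python
import numpy as np

def sample_hypersphere_outliers(X, n_samples, margin=1.1, seed=0):
    """Draw artificial outliers uniformly from a hypersphere enclosing X."""
    rng = np.random.default_rng(seed)
    center = X.mean(axis=0)
    radius = margin * np.max(np.linalg.norm(X - center, axis=1))
    d = X.shape[1]
    # Uniform direction: normalized Gaussian vectors
    dirs = rng.normal(size=(n_samples, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # Uniform radius inside a d-ball needs u**(1/d) scaling
    radii = radius * rng.random(n_samples) ** (1.0 / d)
    return center + dirs * radii[:, None]

X = np.random.default_rng(1).normal(size=(200, 6))
Z = sample_hypersphere_outliers(X, 50)
print(Z.shape)  # (50, 6)
```

Note that, as the text points out, such samples cover the target region as well, which is why they suit parameter optimization better than validation-set construction.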
In Désir et al., 22 randomization principles from ensemble learning are used to subsample the number of features and the number of target training instances, generating artificial outliers efficiently from a computational point of view. The random subspace method (RSM) and Bagging are used to subsample the features and the training set, respectively, so that the number of required outliers is much smaller than in the original version. Furthermore, sparsity information extracted from the original training set is used to make the artificial outlier class complementary to the target class. Experimental results on several benchmark datasets have shown the superiority of this method. Inspired by this, we also use such a strategy to generate outliers in this paper, with some adjustments: RSM and Bagging are not used to sample the features and the training set; only FB is used instead, since it is also used to train the base detectors.
Score normalization
Before the selective procedure, a normalization procedure for all base detectors is necessary in order to achieve an unbiased selection. It has been shown that even when the same base detection method and identical parameterization are used, outlier scores obtained from different subspaces can vary considerably if the subspaces have largely different scales. 23
For techniques concerning outlier score normalization, those converting the outlier scores of different base detectors into probability estimates are the most widely acknowledged. As claimed by Gao and Tan, 24 there are many advantages to transforming outlier scores into well-calibrated probability estimates; a dominant one is that probability estimates are more appropriate for developing an ensemble outlier detection framework. Sigmoid functions and mixture modeling are accordingly used to fit outlier scores into probability values in that study. Kriegel et al. 25 provide a more general framework of outlier score normalization. Its fundamental motivation is to establish sufficient contrast between outlier scores and inlier scores so that outliers can easily be separated from inliers. This seems more practical than merely interpreting outlier scores, because in some applications we actually need to pick out the outliers. However, a problem arises if we use these normalization methods directly. The methods in Gao and Tan 24 and Kriegel et al. 25 are designed for mining outliers in a given database: the normalization or scaling procedures are computed with samples only from that database. When applying them to unseen test samples, the probability of an observation being an outlier may fall out of the range of
To address this problem, we first make some adjustments to the outputs of the base detectors before converting them to probabilistic estimates. Generally, this means normalizing them to the range [0, 1] in a way that also holds for unseen test samples.
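A minimal sketch of this adjustment, assuming simple min–max scaling: the scaling bounds are fitted on the training scores, and unseen test scores are clipped so the resulting value always stays in [0, 1]. The scores here are made-up numbers.

```python
import numpy as np

def fit_minmax(scores):
    """Store min/max of the training scores so test scores can be clipped later."""
    return scores.min(), scores.max()

def to_probability(scores, lo, hi):
    """Map raw outlier scores to [0, 1]; clipping keeps unseen scores in range."""
    scaled = (scores - lo) / (hi - lo)
    return np.clip(scaled, 0.0, 1.0)

train_scores = np.array([0.2, 0.5, 0.9, 1.4])
lo, hi = fit_minmax(train_scores)
# A test score of 2.0 lies outside the training range but is clipped to 1.0
print(to_probability(np.array([0.5, 2.0]), lo, hi))  # [0.25 1.  ]
```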
Dynamic selection
Validation set
In general, the K-nearest neighbor (KNN) algorithm is used to determine the validation set of a test point in dynamic ensembles. However, finding the K nearest neighbors requires calculating the distances to all data points in the training set, and this computational cost can be too expensive for online applications. With a clustering-based method, by contrast, only the cluster that the test point belongs to must be determined, provided that all data points have been divided into clusters at the training phase. We therefore prefer to use a clustering algorithm to determine the validation set for each test point.
Here, we choose three representative clustering algorithms as candidates and make some quantitative comparisons: the classical K-means clustering, GMM, and a density-based algorithm named DBSCAN (density-based spatial clustering of applications with noise). K-means is probably the most popular clustering algorithm due to its simple theory and implementation, 29 but its drawbacks lie in its sensitivity to initial values, noise, and outliers; correspondingly, several improved versions have been proposed. GMMs are among the most statistically mature methods for clustering: each cluster is represented by a Gaussian distribution, and the clustering process turns into estimating the parameters of the Gaussian mixture, usually by the Expectation-Maximization algorithm. 30 Its probabilistic output is an advantage, allowing GMM clustering to be combined with other statistical learning models more smoothly and naturally; its drawbacks lie in its distributional assumption, which also places higher demands on sample size and representativeness. The largest advantage of DBSCAN is its ability to discover clusters of arbitrary shape. 31 It also requires fewer input parameters and does not need the number of clusters in advance, but it becomes unstable when detecting border objects of adjacent clusters.
The quantitative criterion we use is the Calinski–Harabasz (CH) index. 32 For a clustering of $n$ points into $k$ clusters, this quantity is defined as

$$\mathrm{CH} = \frac{\operatorname{tr}(B_k)/(k-1)}{\operatorname{tr}(W_k)/(n-k)},$$

where $B_k$ and $W_k$ are the between-cluster and within-cluster dispersion matrices,

$$B_k = \sum_{q=1}^{k} n_q (c_q - c)(c_q - c)^{T}, \qquad W_k = \sum_{q=1}^{k} \sum_{x \in C_q} (x - c_q)(x - c_q)^{T},$$

where $C_q$ is the set of points in cluster $q$, $c_q$ its centroid, $n_q$ its size, and $c$ the centroid of all the data. A larger CH value indicates more compact and better-separated clusters.
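The CH index is available in scikit-learn as `calinski_harabasz_score`. The toy comparison below, on synthetic blobs rather than EAF data, selects the cluster count that maximizes it; K-means is used as an example clusterer.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(0)
# Three well-separated blobs in 6-D
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 6)) for c in (0.0, 3.0, 6.0)])

scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)  # higher is better

best_k = max(scores, key=scores.get)
print(best_k)  # 3, the true number of blobs
```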
When classifying a test point, we should decide its belonging cluster first. Then all data points in that cluster constitute the validation set with respect to this test point.
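A minimal sketch of this clustering-based lookup, using K-means as an illustrative choice among the candidates above (the data and cluster count are synthetic assumptions): at test time only one cluster assignment is computed, and that cluster's members form the validation set.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 6))

# Clustering is done once, at the training phase
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)

def validation_set(z, km, X):
    """Return the training points in the cluster that test point z falls into."""
    cluster = km.predict(z.reshape(1, -1))[0]  # one assignment, not N distances
    return X[km.labels_ == cluster]

V = validation_set(np.zeros(6), km, X_train)
print(V.shape[1])  # 6 features; the size of V depends on the clustering
```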
Competence calculation
Ideally, we would select the most competent base detector by computing the competences through the validation sets. With the artificial outliers in the training set, we can employ selection mechanisms proposed for traditional classification problems. Rather than estimating classifier accuracy simply as the percentage of correctly classified samples, here we use a probabilistic measure to select the most competent classifier.
Let
where
where
where
As outputs of base classifiers have been transformed into posterior probability estimates, we can exploit this information to measure classifier competence
where
The term
The term
Then a weight is also assigned to each neighbor pattern to reduce the uncertainty triggered by the neighbor size. Finally, competence of classifier
We summarize this procedure in Algorithm 1.
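The exact probabilistic, neighbor-weighted competence measure follows the derivation above; as a simplified stand-in, the sketch below scores each detector by the mean probability it assigns to the correct class on the validation set (target points labeled +1, artificial outliers −1) and selects the argmax. All names and numbers are illustrative assumptions, not the paper's formulas.

```python
import numpy as np

def competence(model_probs, labels):
    """Mean probability assigned to the correct class over the validation set.

    model_probs: array (n_val,) with P(target) for each validation point;
    labels: +1 for target points, -1 for artificial outliers.
    """
    p_correct = np.where(labels == 1, model_probs, 1.0 - model_probs)
    return p_correct.mean()

def select_detector(all_probs, labels):
    """Pick the index of the base detector with the highest competence."""
    comps = [competence(p, labels) for p in all_probs]
    return int(np.argmax(comps))

labels = np.array([1, 1, -1, -1])
probs_a = np.array([0.9, 0.8, 0.2, 0.1])   # confident, mostly correct detector
probs_b = np.array([0.6, 0.5, 0.5, 0.6])   # near-chance detector
print(select_detector([probs_a, probs_b], labels))  # 0
```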
Optimization of SVDD
In algorithm SVDD, two parameters need to be optimized: the trade-off parameter $C$ and the width of the Gaussian kernel.
where
where
where
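A hedged sketch of how artificial outliers can drive this parameter selection: a grid search that balances target acceptance against artificial-outlier rejection. `OneClassSVM` (with `nu` and `gamma` as the two tunable parameters) stands in for SVDD, and the grids and the uniform outlier box are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def tune_ocsvm(X_target, X_artificial, nus, gammas):
    """Grid-search OC-SVM parameters using artificial outliers as negatives."""
    best, best_acc = None, -1.0
    for nu in nus:
        for gamma in gammas:
            model = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X_target)
            acc_t = np.mean(model.predict(X_target) == 1)       # target acceptance
            acc_o = np.mean(model.predict(X_artificial) == -1)  # outlier rejection
            acc = 0.5 * (acc_t + acc_o)                         # balanced accuracy
            if acc > best_acc:
                best, best_acc = (nu, gamma), acc
    return best, best_acc

rng = np.random.default_rng(0)
X_target = rng.normal(0, 1, size=(200, 6))
X_art = rng.uniform(-6, 6, size=(200, 6))  # crude box-sampled artificial outliers
params, acc = tune_ocsvm(X_target, X_art, [0.01, 0.05, 0.1], [0.1, 0.5, 1.0])
print(params, acc)
```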
Experiments and analysis
Datasets
In EAF control systems, the three secondary currents and three secondary voltages are often used by data-driven control strategies; for example, in Li and Mao, 6 these six variables are used to identify the process model of the EAF. In this paper, we also use these variables to constitute the training and test sets. In total, we use six datasets: three synthetic ones generated by a simulation model and three real-world ones. A simple description of these datasets is given in Table 1.
Description of datasets.
The first three datasets are generated using the simulation model, with different faults simulated in different datasets. The last three datasets are collected from real-world EAF control systems. In each dataset, 70% of the normal data are randomly selected to constitute the training set, and all remaining samples constitute the test set. This process is repeated 10 times, and the average values are used as the final results.
Competitors and metrics
In order to put the experimental results of our method into context, we compare it with several competitors:
Random subspace SVDD (RS-SVDD) proposed in Wang and Mao. 13 In this detection model, technique RSM is used to develop a parallel ensemble model.
Bagging SVDD (BA-SVDD) proposed in Ge and Song. 33 In this detection model, technique Bagging is used to develop a parallel ensemble model.
Clustering-based SVDD (C-SVDD) proposed in Wang and Mao. 12 In this detection model, clustering technique is used to develop a parallel ensemble model.
Here, we refer to our method as Dynamic selection SVDD (DS-SVDD) as it is a dynamic selective SVDD model.
In this paper, we use three metrics, that is,
Confusion matrix of two-class classification problem.
Then we can formulate
This metric evaluates the degree of inductive bias in terms of a ratio of positive accuracy and negative accuracy.
where
The ROC curve describes the trade-off between the true-positive rate and the false-positive rate. (Note that normal data are regarded as positive in this paper, so the true-positive rate indicates the rate of correctly detected normal data.) It thus evaluates general performance rather than performance at only one operating point. In practice, the area under the ROC curve (AUC) is used, since directly comparing the ROC curves of different detectors is difficult. For a novelty detection task, the AUC of a perfect algorithm equals 1, implying that all outliers are identified while no normal data are misclassified. Algorithms with AUC values smaller than 0.5 are deemed invalid, since random guessing obtains an AUC of 0.5. Here, we employ the method in Huang and Ling 34 to calculate the AUC.
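For reference, AUC can also be computed directly from labels and scores with scikit-learn. The tiny example below uses made-up scores in which every outlier is ranked below every normal point, so the AUC is exactly 1, matching the perfect-detector case described above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# 1 = normal (positive in this paper), 0 = outlier; scores = estimated P(normal)
y_true = np.array([1, 1, 1, 0, 0])
scores = np.array([0.9, 0.8, 0.6, 0.4, 0.2])
print(roc_auc_score(y_true, scores))  # 1.0: all outliers rank below all normals
```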
Result and analysis
Results on all six datasets with respect to the three metrics are shown in Tables 3–5, respectively. Apart from the per-dataset values, we also provide the averages over all datasets to give insight into the general performance. The comparison of these average values is visualized in Figure 3, from which we can see that our method (DS-SVDD) achieves the best overall result on all three metrics. We then compare DS-SVDD with its competitors one by one:
Comparative result with respect to
SVDD: support vector data description.
The best result is in bold.
Comparative result with respect to
SVDD: support vector data description.
The best result is in bold.
Comparative result with respect to
AUC: area under the receiver operating characteristic curve; SVDD: support vector data description.
The best result is in bold.

Comparison result of the averaging values on three metrics.
Conclusion
To facilitate the development of advanced data-driven control strategies in EAF systems, this paper proposes a dedicated novelty detection model based on dynamic ensemble learning. In this detection model, SVDD plays the role of base detector. Artificial outliers are generated with two objectives: to complete the dynamic selection and to optimize the two parameters of SVDD. A clustering technique is then used to determine the validation set for each test point, and a probabilistic method is used to compute the competence of the base detectors. To validate the proposed detection model, we compare it with several competitors on three synthetic and three real-world datasets; the comparative results show the superiority of our method.
However, several issues regarding our method remain open. For example, the procedure for generating artificial outliers may not be appropriate in some situations, and when the training set contains unknown outliers, the robustness of our method may be poor. These problems have not been considered in this paper and will be our future research directions.
