Sage Journals: Discover world-class research

Abstract

Due to the increasing arrival rate and complex relationships of behavior data streams, detecting sequential behavior anomalies efficiently and accurately has become an emerging challenge. However, most of the existing literature simply calculates the anomaly score for segmented sequences, and little work investigates data stream segmentation and structural relationships in depth. Moreover, existing studies cannot meet efficiency requirements because of the large number of projected subsequences. In this article, we propose EADetection, an efficient and accurate sequential behavior anomaly detection approach over data streams. EADetection adopts time interval and fuzzy logic–based correlation to segment the event stream adaptively based on a rolling window. Through dynamic projection space–based fast pruning, a large number of repeated patterns are removed to improve detection efficiency. Meanwhile, EADetection calculates the anomaly score by top-k pattern–based abnormal scoring built on a directed loop graph–based storage strategy, which ensures the accuracy of detection. In particular, we design and implement a streaming anomaly detection system based on EADetection to perform real-time detection. Extensive experiments confirm that EADetection achieves real-time detection and improves accuracy: compared with the existing approach, it reduces latency by 36.8% and the false positive rate by 6.4%.

Keywords

User behavior, anomaly detection, sequence pattern, data stream, stream segment, projection pruning

Introduction

Sequential behavior data streams occur in a wide variety of applications, such as system call logs in a computer, operational logs of an aircraft flight, and alerts of an intrusion detection system. Anomaly detection in such data streams has attracted considerable attention because of the potential harm caused by abnormal behaviors.¹ For instance, more than three-quarters of government agencies are more focused on combating insider threats by user behavior anomaly detection.^2,3 To reduce harm, user behavior anomaly detection is required to be efficient and accurate. First, real-time data processing is necessary to sustain continuous detection. A large gap between the data processing rate and the arrival rate makes anomaly detection less meaningful.⁴ Second, detection has to be accurate to find and prevent abnormal behaviors. False positives or false negatives may lead to large resource waste.⁵

These two requirements pose emerging challenges to sequential behavior anomaly detection over data streams. (1) Event data streams usually arrive at a high rate. For instance, the intrusion detection systems employed in a real-world network generate more than 9 million alerts per day.⁶ The detection approach must be able to support such a high arrival rate of live event data. (2) In a continuously collected event stream, window-based data processing is widely used. As there is no exact boundary to distinguish different behaviors, it is hard to fully reconstruct the user behavior scene.⁷ The key is to preserve complex event correlation in behaviors through the stream segment strategy and pattern abstraction. (3) As an event data stream is an unbounded sequence, the size of historical knowledge increases as events arrive, which occupies a large amount of memory and slows down detection. The key is to design a proper data structure to store historical knowledge for highly efficient anomaly detection.

Generally, sequential behavior anomaly detection consists of two stages:⁸ (1) to establish patterns of behaviors and (2) to determine whether a behavior is abnormal based on the patterns. To establish patterns more efficiently, Hidden Markov Model,^9,10 data mining,^11,12 neural network,¹³ clustering,¹⁴ and some other techniques have been applied to anomaly detection research. However, most of them build many redundant patterns by scanning the data stream repeatedly, which brings either computational complexity or high memory overhead and hinders real-time processing. Meanwhile, these methods take little account of the features of user behaviors, which directly affects the accuracy of pattern construction.

To determine whether a behavior is abnormal, existing studies quantify detection results with statistical or probabilistic methods by comparing with the normal pattern. Many algorithms have been used for this purpose, such as Bayesian inference,¹⁵ edit distance,¹⁶ Markov transfer,¹⁰ fuzzy logic,¹⁷ and kernel methods.¹⁸ Note that the storage structure of patterns has a significant impact on the performance of the detection algorithms. The predominant storage structure is the tree, which has two limitations:¹⁹ (1) a large number of overlapping subtrees, which occupy enormous memory, and (2) complex traversal in depth and breadth, which makes generating and traversing the tree too time-consuming.

Besides, the existing anomaly detection systems place more emphasis on the data processing module.^20,21 There are almost no anomaly detection systems that integrate data collection, data distribution, data processing, and data storage as a whole, which leaves streaming anomaly detection in a confused state and makes it inconvenient for both research and practical application. Hence, it is necessary to build a high-efficiency streaming anomaly detection system (SADS).

To address these challenges, we propose EADetection, an efficient and accurate sequential behavior anomaly detection approach over data streams. Specifically, we focus on three problems: (1) How to mine representative patterns from event streams to characterize behavior? (2) How to calculate the anomaly score for a detected sequence? (3) How to design a high-efficiency streaming anomaly detection platform? We make the following contributions:

To realize the adaptive segment of user behavior sequences, EADetection adopts time correlation and fuzzy logic to define the association intensity of events, which preserves the behavior correlation based on a rolling window. Through dynamic projection space–based fast pruning, a large number of repeated patterns are removed to improve the detection efficiency.

EADetection adopts the directed loop graph–based storage strategy to store the structural relationship of historical pattern and calculates the anomaly score by Bayesian network after top-k pattern collision test, which ensures the accuracy of detection.

We further design and implement a SADS based on EADetection to perform real-time detection. To improve stability and scalability, we adopt autocorrelation for modules to simplify configuration and topology mapping to achieve rapid deployment.

We conduct extensive experiments over uncertain behavior streams, which are based on SADS in the real deployment, to verify the high efficiency and accuracy of EADetection under different parameter settings.

The remainder of this article is organized as follows: Section “Overview of EADetection” presents some definition and the workflow of EADetection. In sections “Construction of Sequential Pattern” and “Real-time Anomaly Detection,” we put forward EADetection algorithm in detail, including rolling windows–based adaptive segment, projection space–based fast pruning, directed loop graph–based pattern storage and top-k pattern–based abnormal scoring. Section “SADS” presents SADS, which includes the introduction of architecture and some optimizations. Evaluation results will be shown in section “Evaluation.” Finally, we provide our key conclusions and future directions in the last section of this article.

Overview of EADetection

In an unbounded sequential behavior data stream $S = \{e_{0}, e_{1}, e_{2}, \dots, e_{t - 1}, e_{t}, \dots\}$, each $e_{i}$ represents a continuously collected event with two features: name, which represents the event type, and ts, which is the event occurrence timestamp.

Definition 1 (Behavior). A behavior is defined as a subset of S with a strong structural relationship: $B = \langle e_{i_{0}}, e_{i_{1}}, e_{i_{2}}, \dots, e_{i_{n - 1}} \rangle$, where $i_{k} < i_{k + 1}$ and $e_{i_{k}} \in S$.

Definition 2 (Sequence). A sequence is defined as a sequential subset of S: $B = \langle e_{i}, e_{i + 1}, e_{i + 2}, \dots, e_{i + m - 1} \rangle$, where $e_{k} \in S$.

Definition 3 (Correlation). Let $e_{i}, e_{j}, e_{k}$ be three events and $B_{0}, B_{1}$ be two behaviors, with $e_{i}, e_{j} \in B_{0}$ and $e_{k} \in B_{1}$. We assume that events in the same behavior are strongly correlated, such as $\langle e_{i}, e_{j} \rangle$. Conversely, there is only a weak behavior correlation between events in different behaviors, such as $\langle e_{j}, e_{k} \rangle$. Correlation is the basis on which the data stream is segmented to characterize behaviors.

Definition 4 (Pattern). The pattern P of behavior B is defined as $P = \langle e_{j_{0}}, e_{j_{1}}, e_{j_{2}}, \dots, e_{j_{m - 1}} \rangle$, where $e_{j_{k}} \in B$. P is a representative subset of B. The patterns of B are used in anomaly detection instead of B itself.

Definition 5 (Rolling window). As shown in Figure 1, a rolling window is a subset of the unbounded data stream. Once the segment condition is triggered, all the data in the window are popped for further processing.

Figure 1.

An example of rolling window.

As there are no exact boundaries for behaviors in an event stream, we segment the stream into sequences while preserving the behavior correlation. Then, we mine patterns from the sequences to characterize behavior for further processing. We illustrate the workflow and dataflow of EADetection in Figure 2. The process can be divided into two stages: construction of sequential pattern and real-time anomaly detection. In the construction stage, EADetection adopts the rolling windows–based adaptive segment strategy to transform the stream into sequences, based on the intensity of behavior correlation defined by time interval and fuzzy logic. Then, we use the dynamic projection space–based fast pruning algorithm to train on the sequences and develop a pattern library, which is used to characterize the behaviors. The algorithm considers the randomness and repeatability of behaviors to adapt to them. Meanwhile, it constantly prunes projected subsequences to achieve real-time sequence mining.

Figure 2.

The overview of EADetection.

For real-time anomaly detection, we propose top-k pattern–based abnormal scoring based on a directed loop graph–based pattern storage structure called DLG-PStructure. The DLG-PStructure preserves the structural relationship to guarantee the accuracy and reduce the false positive rate (FPR). Moreover, the algorithm reduces the traversal level by the top-k strategy to improve the efficiency.

Construction of sequential pattern

Rolling windows–based adaptive segment

EADetection maintains a rolling window for each user to establish sequences. The popping condition of the rolling window is based on the intensity of behavior correlation, which is defined by time interval and fuzzy logic. The time interval refers to the difference in arrival time between adjacent events, which is inversely related to behavior correlation. Moreover, we present fuzzy logic–based correlation to accommodate behaviors with uncertain processing time; it maintains specific event pairs together with a measure of their occurrences. The stream is adaptively segmented into sequences by these two indicators.

Time interval–based correlation

Time interval–based correlation (TIBC) is calculated based on the assumption that the correlation intensity has an inverse relationship with the length of the time interval. If the time interval between two events is beyond a given threshold, then we consider the correlation of the two events to be weak, which means the possibility that they belong to the same behavior is low. As events arrive, the TIBC of the current event $e_{cur}$ is defined as

$TIBC(e_{cur}) = \frac{1}{1 + e^{tg}}$ (1)

where $tg = e_{cur} . ts - e_{tail} . ts$ , $e_{tail}$ is the last event in the rolling window.
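Equation (1) can be illustrated with a minimal Python sketch. The function name `tibc` and the overflow guard are assumptions made for illustration, and the sketch assumes timestamps are expressed in a unit for which $e^{tg}$ is meaningful (for example, seconds or minutes rather than raw epoch milliseconds).

```python
import math


def tibc(ts_cur: float, ts_tail: float) -> float:
    """Time interval-based correlation of equation (1): 1 / (1 + e^tg)."""
    tg = ts_cur - ts_tail  # time gap between the current event and the window tail
    if tg > 700:
        # math.exp raises OverflowError slightly above exp(709); the limit is 0
        return 0.0
    return 1.0 / (1.0 + math.exp(tg))
```

Because events arrive in timestamp order, $tg \geq 0$ and the value lies in the interval (0, 0.5], so thresholds compared against TIBC only make sense inside that range.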

Fuzzy logic–based correlation

Fuzzy logic is employed to handle the concept of partial truth, where the truth value may range between completely true and completely false.²² It is important and useful to employ fuzzy logic sets in uncertain scenarios.^23,24 Motivated by this, we propose fuzzy logic–based correlation to handle complex behaviors with long processing time. Each user is bound to a fuzzy logic set, called fuzzyLib, which stores the event pairs with different levels of correlation. For an arriving event with low TIBC, we put the last event of the rolling window $e_{tail}$ and the current event $e_{cur}$ into fuzzyLib as an event pair, which is associated with a dynamic membership function w:

$w(e_{tail}, e_{cur}) = \begin{cases} w(e_{tail}, e_{cur}) \cdot θ, & (e_{tail}, e_{cur}) \in fuzzyLib, \\ α, & \text{otherwise} \end{cases}$ (2)

As shown in equation (2), if it is the first time the event pair appears in the rolling window, the weight is initialized with a constant $α$. Otherwise, the weight increases at rate $θ$ with increasing matching frequency. Event pairs with high weight are segmented into the same behavior with high probability.
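The update rule of equation (2) amounts to a dictionary lookup. The sketch below is illustrative only: the name `update_weight` and the default values of `alpha` and `theta` are hypothetical, since the text does not give concrete settings, and fuzzyLib is modeled as a plain `dict` keyed by event-name pairs.

```python
def update_weight(fuzzy_lib: dict, pair: tuple,
                  alpha: float = 0.25, theta: float = 2.0) -> float:
    """Apply equation (2) to one (e_tail, e_cur) pair and return the new weight."""
    if pair in fuzzy_lib:
        fuzzy_lib[pair] *= theta   # pair seen before: weight grows at rate theta
    else:
        fuzzy_lib[pair] = alpha    # first occurrence: initialize with alpha
    return fuzzy_lib[pair]
```

The weight only grows when $θ > 1$, so a pair that keeps recurring eventually exceeds any fixed threshold.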

As shown in Algorithm 1, for an arriving event $e_{cur}$, we first calculate the TIBC between $e_{cur}$ and the last event in the rolling window $e_{tail}$ by equation (1). If $TIBC(e_{cur}) < segThr$, then the rolling window is popped as a sequence. Otherwise, if $TIBC(e_{cur}) > tsThr$, we add $e_{cur}$ into the rolling window. If $segThr ⩽ TIBC(e_{cur}) ⩽ tsThr$, we add $e_{cur}$ and $e_{tail}$ into the fuzzyLib and update the weight between them for further processing. Once the event pair in the fuzzyLib has a higher weight than the given threshold, $e_{cur}$ is added into the rolling window.

Algorithm 1. Rolling Windows–Based Adaptive Segment

Input: Current Event $e_{cur}$, Rolling Window RW
Output: Sequence Queue BQueue
1: if $TIBC(e_{cur}) < segThr$ then
2:     RW is popped as a sequence B;
3:     add B into BQueue;
4: if $segThr ⩽ TIBC(e_{cur}) ⩽ tsThr$ then
5:     insert $\langle e_{tail}, e_{cur} \rangle$ into fuzzyLib;
6:     update weight of $w(e_{tail}, e_{cur})$ by equation (2);
7:     if $w(e_{tail}, e_{cur}) > wThr$ then
8:         add $e_{cur}$ into RW;
9: if $TIBC(e_{cur}) > tsThr$ then
10:     add $e_{cur}$ into RW;
11: return BQueue.
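The segmentation procedure of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `RollingWindowSegmenter` and the default `alpha` and `theta` are hypothetical, events are `(name, ts)` tuples, the first event simply seeds the window, and the event that triggers a pop starts the next window (Algorithm 1 leaves both cases unspecified).

```python
import math
from collections import deque


class RollingWindowSegmenter:
    """Sketch of Algorithm 1 for a single user's event stream."""

    def __init__(self, seg_thr, ts_thr, w_thr, alpha=0.1, theta=1.5):
        self.seg_thr, self.ts_thr, self.w_thr = seg_thr, ts_thr, w_thr
        self.alpha, self.theta = alpha, theta
        self.window = []        # rolling window RW of (name, ts) events
        self.fuzzy_lib = {}     # (tail_name, cur_name) -> weight, equation (2)
        self.b_queue = deque()  # BQueue of popped sequences

    @staticmethod
    def _tibc(tg):
        # Equation (1), guarded against float overflow for very large gaps.
        return 0.0 if tg > 700 else 1.0 / (1.0 + math.exp(tg))

    def push(self, name, ts):
        if not self.window:                 # assumption: first event seeds RW
            self.window.append((name, ts))
            return
        tail_name, tail_ts = self.window[-1]
        corr = self._tibc(ts - tail_ts)
        if corr < self.seg_thr:             # weak correlation: pop RW as a sequence
            self.b_queue.append(self.window)
            self.window = [(name, ts)]      # assumption: e_cur opens the next window
        elif corr > self.ts_thr:            # strong correlation: extend RW
            self.window.append((name, ts))
        else:                               # uncertain: consult fuzzyLib
            pair = (tail_name, name)
            if pair in self.fuzzy_lib:
                self.fuzzy_lib[pair] *= self.theta
            else:
                self.fuzzy_lib[pair] = self.alpha
            if self.fuzzy_lib[pair] > self.w_thr:
                self.window.append((name, ts))
```

As in Algorithm 1, an event in the uncertain band whose pair weight has not yet passed `w_thr` is recorded in fuzzyLib but not appended to the window.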

Projection space–based fast pruning

EADetection adopts patterns with a high support degree to characterize segmented sequences. The number of projection spaces is the key factor that influences the pattern mining efficiency. Therefore, the projection space–based fast pruning strategy removes duplicate projected subsequences and merges highly similar ones to avoid the bottleneck caused by repeated mining.

Different from traditional frequent pattern mining,²⁵ events sequences here are approximations to behaviors and do not have exact boundaries. Therefore, it is not necessary to keep the intermediate subsequences contained by other sequences. Figure 3 shows a pattern mining example with prefix a in event stream $〈 a, b, c, b, c, a, d, b, c 〉$ . There are a large number of repeated or similar patterns in the iteration results, which lead to resource waste and high mining latency. After projection space–based fast pruning, only $〈 a, b, c, d 〉, 〈 a, b, c, c 〉, 〈 a, b, c, c 〉, 〈 a, c, b, c 〉, 〈 a, d, b, c 〉$ need to be stored.

Figure 3.

An example of pattern mining.

As shown in Algorithm 2, after obtaining the event list by scanning the sequence, prefix trees with a high support degree are added into the projection space. Then, we remove repeated projected subsequences and merge highly similar ones to obtain the sequential pattern during sequence mining. After that, the algorithm proceeds to the next mining iteration.

Algorithm 2. Projection Space Based Fast Pruning

Input: Sequence B
Output: Sequence Pattern P
1: Scan B to get the event list L;
2: for all $l_{i}$ in L do
3:     if $Support(l_{i}) / Support(L) > S_{0}$ then
4:         Extract the suffix sequence of $l_{i}$;
5:         Add $l_{i}$ to the prefix trees of sequential pattern;
6:         Decrease repeating projected subsequences and contained ones;
7:         Mining sequential patterns recursively;
8: Obtain the sequential pattern P;
9: return P.
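The pruning step in line 6 of Algorithm 2 (dropping repeated projected subsequences and those contained in others) can be sketched in isolation. The sketch below covers only that step: it does not perform the recursive mining or the merging of highly similar subsequences, and the function names are hypothetical.

```python
def is_subsequence(short, long):
    """True if `short` can be obtained from `long` by deleting elements."""
    it = iter(long)
    return all(x in it for x in short)  # `in` advances the shared iterator


def prune_patterns(patterns):
    """Drop exact duplicates and patterns contained in another pattern."""
    unique = list(dict.fromkeys(tuple(p) for p in patterns))  # keeps first-seen order
    return [p for p in unique
            if not any(p != q and is_subsequence(p, q) for q in unique)]
```

The pairwise containment check is quadratic in the number of candidate patterns, which is acceptable for a sketch; a streaming implementation would more likely prune while the projection space is being built.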

Real-time anomaly detection

Directed loop graph–based pattern storage

As the structure of pattern storage has a direct impact on real-time performance and accuracy, we analyze the features of sequences to propose a directed loop graph–based pattern storage structure, called DLG-PStructure, which effectively reduces the traversal level to improve memory utilization and detection efficiency

$DLG\text{-}PStructure = \{eventInfo, patternOutline, curWeight, isUpdated\}$ (3)

$eventInfo = \{eventName, preCount, nextEvents\}$ (4)

As shown in equations (3) and (4), DLG-PStructure consists of four elements: (i) eventInfo is the node in the graph, which is associated with three variables: (1) eventName, which represents the called command name; (2) preCount, the cumulative number of the prefix node to current node; and (3) nextEvents, the connected next m-nodes. (ii) patternOutline is a list structure, which holds summary information about the patterns in DLG-PStructure and characterizes the difference from other patterns. (iii) curWeight ( $0 ⩽ curWeight ⩽ 1$ ) represents the weight of DLG-PStructure, which declines over the updated time. (iv) isUpdated keeps a record of the renewal to avoid double counting of weight.
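The fields of equations (3) and (4) map naturally onto two small record types. The following Python sketch is one possible reading, with several assumptions: `preCount` is interpreted as the number of times a node has been reached, `patternOutline` is modeled as the ordered list of distinct event names, and the `add_pattern` helper is hypothetical. The decay of `curWeight` over time is not modeled.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity comparison: value equality would recurse through loops
class EventInfo:
    event_name: str
    pre_count: int = 0                               # assumption: arrivals at this node
    next_events: dict = field(default_factory=dict)  # event name -> EventInfo


@dataclass
class DLGPStructure:
    event_info: dict = field(default_factory=dict)       # event name -> EventInfo node
    pattern_outline: list = field(default_factory=list)  # assumption: distinct event names
    cur_weight: float = 1.0                              # 0 <= curWeight <= 1
    is_updated: bool = False

    def add_pattern(self, events):
        """Insert one pattern; a repeated event name closes a directed loop."""
        prev = None
        for name in events:
            node = self.event_info.setdefault(name, EventInfo(name))
            node.pre_count += 1
            if prev is not None:
                prev.next_events[name] = node  # directed edge prev -> node
            if name not in self.pattern_outline:
                self.pattern_outline.append(name)
            prev = node
        self.is_updated = True
```

Because each event name maps to a single node, a pattern such as a, b, c, a reuses the node for a instead of creating a second copy, which is the redundancy that a tree layout would incur.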

Figure 4 shows a brief comparison between the tree and DLG-PStructure. Compared with the tree, DLG-PStructure contains little redundant data, which improves memory utilization. Moreover, DLG-PStructure stores all the information required for anomaly detection. In particular, the directed property preserves the structural relationship well, which directly affects the accuracy. Meanwhile, the loop structure is adopted to solve the problem of undefined boundaries in data streams, which is a compensatory measure for the partition of data streams.

Figure 4.

A brief comparison.

Top-k pattern–based abnormal scoring

In this section, we propose top-k pattern–based abnormal scoring, to obtain real-time detection results with high accuracy.

Let $B = \{e_{0}, e_{1}, e_{2}, \dots, e_{n - 1}\}$ be the detected behavior sequence and $P = \{P_{0}, P_{1}, \dots, P_{i}, \dots, P_{n}\}$ be the pattern library. Hence, the generating probability of B in $P_{i}$ can be defined as

$GP_{i} = \prod_{j = 0}^{n} \frac{preCount_{j}}{\sum_{i = 0}^{m} preCount_{i}}$ (5)

Note that the weaker the correlation between $P_{i}$ and B, the lower the $GP_{i}$, which may even be zero. To improve the efficiency of detection, we introduce the top-k strategy, which characterizes B by the k most relevant patterns, to reduce the high complexity caused by traversal detection. Specifically, the k patterns are acquired by collision tests between B and the patternOutline of $P_{i}$. Although additional computation is introduced, it still reduces latency because the complexity of collision tests is far below that of pattern detection. Then, we score the anomaly by equation (6)

$AS = \frac{\sum_{i = 0}^{k} GP_{i} \times curWeight_{i}}{k}$ (6)

As shown in equation (6), $GP_{i}$ has little impact on the abnormal score when the behavior correlation is extremely weak. In other words, the influence of the ignored patterns on the detection results is almost negligible. Hence, the top-k abnormal scoring algorithm guarantees accuracy, which is illustrated in Algorithm 3.
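Equations (5) and (6) can be made concrete with a small Python sketch. It rests on an interpretation that the text does not spell out: each factor of equation (5) is read as the `preCount` of the node reached by the next event, divided by the summed `preCount` of all candidate next nodes of the preceding node, and the first event contributes a factor of 1. The function names and the dictionary-based graph layout are hypothetical, and equation (6) is implemented as the average over the k selected patterns.

```python
def generating_probability(behavior, pre_count, next_events):
    """Equation (5) under the stated assumptions.

    pre_count:   event name -> preCount of that node
    next_events: event name -> list of names of its connected next nodes
    """
    gp = 1.0
    for prev, cur in zip(behavior, behavior[1:]):
        candidates = next_events.get(prev, [])
        total = sum(pre_count[c] for c in candidates)
        if cur not in candidates or total == 0:
            return 0.0  # B leaves the pattern graph: no correlation
        gp *= pre_count[cur] / total
    return gp


def anomaly_score(gps, cur_weights):
    """Equation (6): mean of GP_i * curWeight_i over the selected patterns."""
    if not gps:
        raise ValueError("at least one pattern is required")
    return sum(g * w for g, w in zip(gps, cur_weights)) / len(gps)
```

For example, with `pre_count = {"a": 4, "b": 3, "c": 1}` and `next_events = {"a": ["b", "c"], "b": ["c"]}`, the behavior a, b, c yields 3/4 × 1/1 = 0.75, while a, c yields 1/4 = 0.25.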

Algorithm 3. Top-k Pattern–Based Abnormal Scoring

Input: Detected Behavior sequence B, Sequential Pattern P, Pattern Matching Number k
Output: Abnormal Score $λ$
1: Current collision similarity $δ = 0$;
2: Current matching number $γ = 0$;
3: Empty the matching patterns PK;
4: for all $P_{i}$ in P do
5:     Calculate the collision similarity $S_{0}$ between B and the patternOutline of $P_{i}$;
6:     if $S_{0} ⩽ δ$ then
7:         continue;
8:     Insert $P_{i}$ into PK;
9:     if $γ < k$ then
10:         $γ = γ + 1$; $δ = S_{0}$;
11:     else
12:         Remove the one with the smallest collision similarity in PK;
13: for all $PK_{i}$ in PK do
14:     Calculate abnormal score $O_{i}$ of B in $PK_{i}$ by $GP_{i} \times curWeight_{i}$;
15:     Add $O_{i}$ to the total abnormal score Scores;
16: return the abnormal score $λ$, which is the average value of Scores.

As shown in Algorithm 3, we acquire the top-k patterns by collision tests. First, we calculate the collision similarity $S_{0}$ and compare it with the current one $δ$, which determines whether $P_{i}$ characterizes B (Lines 4–6). Second, we update the top-k patterns either by inserting $P_{i}$ into PK when the number of patterns in PK is less than k or by exchanging $P_{i}$ with the one that has the smallest collision similarity in PK (Lines 7–11). Then, we calculate the abnormal score by equation (6) in PK to determine whether B is abnormal (Lines 12–15).
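The selection phase of Algorithm 3 can be sketched with a standard top-k utility. Two points are assumptions rather than details from the text: the collision similarity is not defined in the article, so a Jaccard overlap between event sets stands in for it, and `heapq.nlargest` replaces the explicit threshold-and-swap bookkeeping of the algorithm while selecting the same kind of result (the k outlines most similar to B).

```python
import heapq


def collision_similarity(behavior, pattern_outline):
    """Hypothetical stand-in: Jaccard overlap of the two event-name sets."""
    b, p = set(behavior), set(pattern_outline)
    union = b | p
    return len(b & p) / len(union) if union else 0.0


def top_k_patterns(behavior, outlines, k):
    """Return the names of the k patterns whose outlines best match `behavior`.

    outlines: pattern name -> patternOutline (list of event names)
    Patterns with zero similarity are skipped, mirroring the initial delta = 0.
    """
    scored = ((collision_similarity(behavior, o), name)
              for name, o in outlines.items())
    return [name for sim, name in heapq.nlargest(k, scored) if sim > 0]
```

Only the selected patterns then need the more expensive generating-probability computation, which is where the latency saving described above comes from.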

SADS

Architecture of SADS

To perform real-time detection, we design and implement the SADS. Taking account of the flexibility, stability, and scalability of SADS, we adopt a hierarchical design to decouple the service modules of anomaly detection over data streams.²⁶ The overall structure is summarized in Figure 5.

Figure 5.

Architecture of SADS.

As shown in Figure 5, there are several layers and modules in the framework. The bottom layer is the fundamental communication platform, which is mainly responsible for the communication of the nodes involved in the process of anomaly detection. The layer above it, the business implementation layer, handles the input and management of data streams and mainly consists of three submodules: (1) the collecting module, which is mainly responsible for collecting real-time data streams, such as system calls and Unix shell commands; (2) the analyzing module, which mainly covers sequence mining and anomaly detection; and (3) the monitoring module, which mainly provides data support for the optimization strategy, such as running condition and node load.

As the interaction layer, service interface layer mainly provides a series of interfaces for users to interact with the system, which includes (1) the topology design, which maps the physical cluster into topology to achieve rapid deployment; (2) the autocorrelation, which achieves the automatic configuration and dynamic distribution of nodes; (3) the cluster monitor, which mainly reflects the cluster state and (4) the version control, which is mainly responsible for updating the library.

Implementation of EADetection on SADS

In this section, we implement EADetection in a real cluster environment to achieve real-time collection, monitoring, and analysis for anomaly detection based on SADS. The process is shown in Figure 6.

Figure 6.

Implementation of EADetection on SADS.

First, we design and submit a topology with the dashboard provided by SADS for cluster deployment, and an example is given in Figure 6. Then, SADS analyzes the submitted topology to deploy the cluster, which includes (1) assigning modules to nodes, (2) installing modules and (3) configuring the cluster. Finally, we run the topology to implement EADetection in real deployment environment for anomaly detection, which includes collecting, monitoring and analyzing.

Note that in a conventional deployment, communication between modules requires an additional, complex configuration process, which restricts the scalability of deployment. Moreover, manual assignment made without knowledge of node status easily leads to node overload, which affects the stability of analysis and may result in the failure of anomaly detection. To solve these problems, we propose the technique of autocorrelation, which dynamically assigns modules to nodes and automatically configures relevant modules based on real-time cluster status. Meanwhile, traditional cluster deployment for anomaly detection involves a lot of repeated work when different modules are installed on different nodes by manual distribution, which restricts the flexibility of the system. Thus, we propose the technique of topology mapping to achieve rapid deployment, which improves the flexibility of deployment.

Autocorrelation

Let $M = \{m_{0}, m_{1}, \dots, m_{i}, \dots, m_{n - 1}\}$ be the modules used in the system, with corresponding weights $W = \{w_{0}, w_{1}, \dots, w_{i}, \dots, w_{n - 1}\}$. We then calculate the loading capacity of each node by equation (7) to rank the nodes, which determines the allocation of modules. Briefly, the lowest-ranked node has the top priority for allocation. Note that the real-time cluster status is obtained by active pull. Based on the dynamic assignment of nodes, SADS automatically configures the relevant modules to achieve autocorrelation

$LC = \sum_{i = 0}^{n - 1} m_{i} \times w_{i}$ (7)

Autocorrelation effectively simplifies the configuration of modules and avoids the inflexibility caused by manual assignment. Meanwhile, the dynamic allocation mechanism based on ranking effectively improves the scalability of the system. In particular, measuring the load capacity of nodes helps avoid node overload and achieve load balance.
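The ranking step behind equation (7) can be sketched in a few lines of Python. The interpretation of $m_{i}$ as the number of instances of module i currently placed on a node is an assumption, as are the function names and the cluster layout; the unit weights in the usage match the setting $w = 1$ that the article adopts for simplicity.

```python
def loading_capacity(module_counts, weights):
    """Equation (7): LC = sum of m_i * w_i for one node."""
    return sum(m * w for m, w in zip(module_counts, weights))


def pick_node(cluster, weights):
    """Return the node with the lowest loading capacity.

    cluster: node name -> list of per-module counts m_i (assumed meaning)
    """
    return min(cluster, key=lambda node: loading_capacity(cluster[node], weights))
```

With `cluster = {"n0": [2, 1, 0], "n1": [0, 1, 0], "n2": [1, 1, 1]}` and unit weights, the loading capacities are 3, 1, and 3, so the next module would be assigned to `n1`.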

Topology mapping

SADS provides an online method for designing the topology visually based on workflow, which makes the design effort convenient and effective. During the process of installing modules, SADS deploys the cluster by analyzing the submitted topology without human intervention. That is, SADS maps the cluster topology to physical nodes and abstracts the cluster into a topology, which makes node details transparent to users.

Topology mapping eliminates a lot of repetitive work, such as a different deployment for each node. In particular, it avoids the human errors caused by frequently switching between hosts during installation and configuration. Hence, topology mapping achieves rapid deployment and improves flexibility, which allows developers and researchers to focus on the application logic.

Optimizations and discussions

From the above introduction of SADS, we can see that there are still some open problems that need to be further considered and solved, which are also the directions of our current and future work.

Stability analysis. The main parameters of SADS are the weights of all modules. The weights vary in different application scenarios. Here, for simplicity and generality, we set the weight w = 1 for each module. Currently, we are working on a strategy for self-adaptive weight assignment based on CPU and memory utilization.

Communication of modules. The communication of the processing modules affects not only the efficiency of SADS but also the efficiency of anomaly detection. Generally, the communication includes the communication between the modules and the communication between the modules and the function iteration layer. The former is the fundamental structure, which is determined by the modules. The latter is determined by the system and directly relates to the detection efficiency. Since the system needs fast response and the available memory is limited, we believe that the heartbeat method is more suitable than alternatives such as active pull. Therefore, we choose the heartbeat method for the communication of modules.

Load balancing for all nodes. Load balancing aims to avoid the case in which the load on some nodes is too heavy while other nodes are idle.²⁶ To improve resource utilization and meet the real-time requirement, the load of nodes should be considered when the system assigns tasks. Besides, SADS should be able to dynamically adjust the load of nodes in order to avoid performance degradation or the failure of a single node. Thus, exploring dynamic load balancing strategies is also a very important aspect for SADS.

Fault-tolerance of detection process. Fault tolerance is the guarantee of detection accuracy and robustness. Many factors may lead to failures, such as instructions issued by mistake, allocation of nodes, or runtime errors. Moreover, as the number of components or nodes increases, so does the probability of failure. Once failures occur, the anomaly detection may be interrupted, which may lead to extremely serious results. Therefore, we need to study fault-tolerant strategies, consisting of fault detection and rapid recovery, to solve these problems.

Dynamic scalability of cluster. The change of nodes or topology is a normal and frequent operation in the streaming anomaly detection process. For existing systems, once the nodes or the cluster topology change, we have to interrupt the detection process and restart the system, which is inconvenient and introduces potential threats. Moreover, the new form of shared global economy requires cloud services to be collaboratively provisioned by different cloud providers in a geo-distributed manner. For instance, JointCloud²⁷ is a cross-cloud cooperation architecture for integrated Internet service customization. The topology will be much more flexible under such circumstances. So, we should explore dynamic scalability strategies for the cluster to avoid the interruption and restart.

Self-adaption of repository. The repository directly relates to the commonality and transportability of SADS. Incompatibility among detection modules can easily lead to failures that interrupt anomaly detection. Besides, modules coming from multiple sources increase the manual maintenance cost. Hence, we should design and implement a self-adapting technique for the repository including (1) self-updating, (2) collision detection, and (3) auto-merging.

Evaluation

Experimental environment

We have implemented EADetection based on SADS to verify its effectiveness and real-time performance in a real cluster environment. All the hosts are homogeneous, each of which is configured with an Intel Core i5-4590 CPU, 4 GB main memory, 1 TB hard disk, and gigabit Ethernet. The SADS is implemented in Java running on the Ubuntu operating system.

The experiment datasets used in this article are the Sendmail dataset from the University of New Mexico (UNM)²⁸ and the Unix Shell dataset from Purdue.²⁹ The former includes 7 normal data files and 19 abnormal ones, as shown in Table 1. The latter includes the log records of 8 users over 2 years. One user's data are taken as normal data and are mixed with a small amount of other users' data as abnormal data.

Table 1.

UNM system call dataset files.

Normal files (7)	Abnormal: decode (2)	Abnormal: fwd-loops (5)	Abnormal: sscp (3)	Abnormal: failed (5)	Abnormal: syslog (4)
bounce.int
bounce-1.int		fwd-loops-1.int		cert-chasin-1.int
bounce-2.int		fwd-loops-2.int	sm-10763.int	cert-recursive.int	syslog-local-1.int
plus.int	sm-280.int	fwd-loops-3.int	sm-10801.int	cert-sm5x-1.int	syslog-local-2.int
queue.int	sm-314.int	fwd-loops-4.int	sm-10814.int	cert-sm565a-1.int	syslog-remote-1.int
sendmail.int		fwd-loops-5.int		cert-smdhole-1.int	syslog-remote-2.int
sendmail.log.int

UNM: University of New Mexico.
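As a rough illustration of how such trace files can be turned into behavior sequences, the sketch below assumes that each line of a `.int` file holds a process ID followed by a system call number, and it cuts each process's calls into fixed-length windows. Both the file layout and the helper names `parse_trace` and `windows` are assumptions made for this example; the fixed-length window is only a stand-in and does not reproduce the adaptive segmentation of EADetection.

```python
from collections import defaultdict


def parse_trace(text):
    """Group system call numbers by process ID, preserving arrival order.

    Assumes one "pid syscall" pair per line (an assumption about the format).
    """
    calls = defaultdict(list)
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        pid, syscall = int(parts[0]), int(parts[1])
        calls[pid].append(syscall)
    return dict(calls)


def windows(events, length):
    """Return every contiguous run of `length` events, sliding by one event."""
    return [tuple(events[i:i + length])
            for i in range(len(events) - length + 1)]


# Hypothetical five-line trace: process 10 issues four calls, process 11 one.
sample = "10 4\n10 2\n11 5\n10 66\n10 66\n"
per_process = parse_trace(sample)
sequences = windows(per_process[10], 3)
```

A trace shorter than the window length simply yields no sequences, so short-lived processes are dropped rather than padded.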

PrefixSpan is a scalable and efficient sequence mining method that performs well in sequence anomaly detection over data streams.³⁰ EADetection likewise adopts sequence mining to achieve low-latency and accurate anomaly detection. Thus, we compare EADetection with PrefixSpan to verify its latency and accuracy.

To verify the latency, we adopt two metrics: (1) the number of projection spaces, which directly affects the efficiency of sequence mining, so that reducing it significantly reduces latency, and (2) processing delay, which characterizes the latency itself. The number of projection spaces is affected by the initialized threshold, which is calculated from the minimum support and the length of the behavior sequence. To verify the accuracy, we adopt two metrics: (1) true negative rate (TNR) and (2) FPR. Both metrics are affected by the anomaly degree threshold, which is used to distinguish normal sequences from abnormal ones. To analyze how the behavior sequence length affects detection efficiency and accuracy, we compare the detection results on the UNM dataset and the Purdue dataset.

Experiment performance

Performance of efficiency

To analyze the real-time performance, we first compare EADetection and PrefixSpan by the number of projected spaces, which is the key factor influencing efficiency. Then, we compare the processing latency of the two approaches to show the performance intuitively.

As shown in Figure 7, EADetection significantly reduces the number of projected spaces under different sequence lengths. The number of projected spaces decreases as the minimum support degree grows, which indicates that the minimum support degree is a key factor in the generation of projected spaces. When we increase the sequence length from 8 to 11, the number of projected spaces increases rapidly. This is because a larger sequence length can increase the support degree of patterns, so more projected spaces are generated if the minimum support degree remains the same.

Figure 7.

The number of projection spaces of PrefixSpan and EADetection with different minimum support degrees under four sequence length settings: (a) sequence length = 8, (b) sequence length = 9, (c) sequence length = 10, and (d) sequence length = 11.
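To make the relation between the minimum support degree and the number of projected spaces concrete, the following is a minimal Python sketch of plain PrefixSpan over sequences of single events. It is not the pruning strategy of EADetection; the names `prefixspan_count`, `min_support`, and `n_projections` are illustrative assumptions, and the sketch simply builds one projected database per frequent prefix and counts them.

```python
from collections import defaultdict


def prefixspan_count(sequences, min_support):
    """Mine frequent sequential patterns and count projected databases.

    sequences: list of lists of comparable, hashable events.
    Returns (patterns, n_projections); patterns maps a tuple to its support.
    """
    patterns = {}
    n_projections = 0

    def mine(prefix, projected):
        nonlocal n_projections
        # Support of an event = number of suffixes that contain it.
        support = defaultdict(int)
        for suffix in projected:
            for event in set(suffix):
                support[event] += 1
        for event, count in sorted(support.items()):
            if count < min_support:
                continue  # infrequent extension: no projection is built
            new_prefix = prefix + (event,)
            patterns[new_prefix] = count
            # Project: keep what follows the first occurrence of the event.
            new_projected = []
            for suffix in projected:
                if event in suffix:
                    rest = suffix[suffix.index(event) + 1:]
                    if rest:
                        new_projected.append(rest)
            n_projections += 1
            mine(new_prefix, new_projected)

    mine((), [list(s) for s in sequences])
    return patterns, n_projections


# Toy input (not taken from the UNM or Purdue data).
toy = [[1, 2, 3], [1, 3], [2, 3], [1, 2]]
patterns_low, n_low = prefixspan_count(toy, min_support=2)
patterns_high, n_high = prefixspan_count(toy, min_support=3)
```

On this toy input, raising `min_support` from 2 to 3 reduces the number of projected databases from 6 to 3, which mirrors the downward trend described for Figure 7.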

Figure 8 shows that EADetection significantly reduces the processing latency compared with PrefixSpan. Unlike the trend in Figure 7, processing latency does not always decrease as the minimum support degree grows; it remains stable once the minimum support degree reaches 8. When the minimum support degree is small, there is a large number of projected spaces, and the many distinct patterns increase the processing latency. As the minimum support degree grows, the number of patterns becomes small, and the value of k in the top-k detection strategy becomes the main factor influencing the processing latency. Processing latency then remains stable because the value of k is unchanged.

Figure 8.

The processing latency (ms) of PrefixSpan and EADetection with different minimum support degrees under four sequence length settings: (a) sequence length = 8, (b) sequence length = 9, (c) sequence length = 10, and (d) sequence length = 11.

Performance of accuracy

We evaluate the accuracy of EADetection with TNR and FPR. Suppose that the total count of sequences in the abnormal dataset is $N_{\mathrm{abnormal}}$, the total count of sequences in the normal dataset is $N_{\mathrm{normal}}$, the count of sequences judged abnormal in the abnormal dataset is $TJ_{\mathrm{abnormal}}$, the count of sequences judged abnormal in the normal dataset is $FJ_{\mathrm{normal}}$, and the count of sequences judged normal in the abnormal dataset is $FJ_{\mathrm{abnormal}}$. Then

$TNR = \frac{TJ_{\mathrm{abnormal}}}{N_{\mathrm{abnormal}}}$ (8)

$FPR = \frac{FJ_{\mathrm{normal}}}{N_{\mathrm{normal}}}$ (9)
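Equations (8) and (9) can be computed directly from labeled detection results. The short Python sketch below follows the definitions given above, with TNR taken over the abnormal dataset and FPR over the normal dataset; the function name `accuracy_metrics` and the boolean encoding (`True` for abnormal) are assumptions made for illustration.

```python
def accuracy_metrics(labels, judged):
    """Compute TNR and FPR as defined in equations (8) and (9).

    labels: True if the sequence comes from the abnormal dataset.
    judged: True if the detector judged the sequence abnormal.
    """
    if len(labels) != len(judged):
        raise ValueError("labels and judged must have the same length")
    n_abnormal = sum(1 for y in labels if y)
    n_normal = len(labels) - n_abnormal
    if n_abnormal == 0 or n_normal == 0:
        raise ValueError("both datasets must be non-empty")
    # TJ_abnormal: abnormal sequences that were judged abnormal.
    tj_abnormal = sum(1 for y, j in zip(labels, judged) if y and j)
    # FJ_normal: normal sequences that were judged abnormal.
    fj_normal = sum(1 for y, j in zip(labels, judged) if not y and j)
    return tj_abnormal / n_abnormal, fj_normal / n_normal


# Hypothetical outcome: 4 abnormal and 5 normal sequences.
labels = [True] * 4 + [False] * 5
judged = [True, True, True, False, True, False, False, False, False]
tnr, fpr = accuracy_metrics(labels, judged)
```

With these hypothetical results, three of the four abnormal sequences are caught and one of the five normal sequences raises a false alarm, giving TNR = 0.75 and FPR = 0.2.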

As shown in Figure 9, EADetection achieves high detection accuracy, with TNR close to 98% and FPR below 10%. The TNR of EADetection increases with the minimum support degree and becomes steady once the minimum support degree reaches 10. FPR stays low as the minimum support degree changes, because EADetection removes and merges a large number of short sequences.

Figure 9.

Accuracy of PrefixSpan and EADetection with different minimum support degrees: (a) TNR and (b) FPR.

To further verify the effectiveness of top-k pattern–based abnormal scoring, we evaluate the accuracy with different values of k and without top-k. As shown in Figure 10, TNR does not change much as k changes, because EADetection chooses the k patterns with the highest correlation intensity. Unlike TNR, FPR decreases significantly as k increases, which indicates that removing too many compared patterns causes more false alarms. After k reaches 9, FPR becomes steady, so removing patterns with weak correlation intensity does not impact the detection accuracy. Therefore, we can achieve high TNR while keeping FPR low by adjusting the value of k.

Figure 10.

Accuracy of PrefixSpan and EADetection with different minimum support degrees under different values of k: (a) TNR and (b) FPR.
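The effect of k can be pictured with a small Python sketch that keeps the k patterns with the highest correlation intensity and compares a sequence against them. The selection step follows the description above; the score itself (the fraction of selected patterns that the sequence does not contain as a subsequence) is only an assumed stand-in for the Bayesian scoring of EADetection, and all names and intensity values are hypothetical.

```python
import heapq


def is_subsequence(pattern, sequence):
    """True if the pattern's events occur in the sequence in order."""
    it = iter(sequence)
    return all(event in it for event in pattern)


def top_k_patterns(intensity, k):
    """Return the k patterns with the highest correlation intensity."""
    return heapq.nlargest(k, intensity, key=intensity.get)


def mismatch_score(sequence, patterns):
    """Fraction of the given patterns that the sequence fails to contain.

    This is an illustrative stand-in score, not the paper's scoring rule.
    """
    if not patterns:
        raise ValueError("need at least one pattern")
    missed = sum(1 for p in patterns if not is_subsequence(p, sequence))
    return missed / len(patterns)


# Hypothetical normal patterns with made-up correlation intensities.
intensity = {
    ("open", "read"): 0.9,
    ("read", "close"): 0.8,
    ("open", "close"): 0.5,
    ("write",): 0.1,
}
selected = top_k_patterns(intensity, 2)
score_ok = mismatch_score(["open", "read", "close"], selected)
score_bad = mismatch_score(["open", "write", "close"], selected)
```

In this toy setting, a smaller k means each sequence is compared against fewer patterns, so a single missing pattern moves the score more; this is one plausible reading of why FPR is sensitive to k in Figure 10.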

Performance on different datasets

To analyze how the behavior sequence length affects detection efficiency and accuracy, we also experiment on the Purdue dataset. Compared with the UNM dataset, the sequences in the Purdue dataset are shorter. The same number of events is taken from each of the two datasets so that the efficiency and accuracy of detection can be compared; the results are shown in Figure 11.

Figure 11.

The result comparison of Purdue and UNM.

From Figure 11, we can observe that EADetection has an obvious efficiency advantage on the Purdue dataset. This is mainly caused by two factors under short-transaction conditions: (1) a small projected database during sequence pattern mining and (2) low-level traversal. Although the advantage in accuracy is relatively weak, the results suggest that EADetection is better suited to detection scenarios with relatively short transactions.

Conclusion and future work

User behavior anomaly detection over data streams has received considerable attention in numerous fields, but it poses immense challenges to researchers. In this article, we propose EADetection, an efficient and accurate sequential behavior anomaly detection approach over data streams. To segment user behavior sequences adaptively, EADetection adopts time correlation and fuzzy logic to reconstruct user behavior scenes based on a rolling window. Through dynamic projection space–based fast pruning, a large number of repeated patterns are removed to improve detection efficiency. Meanwhile, EADetection adopts the directed loop graph–based storage strategy to preserve the structural relationship of user behavior and calculates the anomaly score by Bayesian network after the top-k pattern collision test, which ensures the accuracy of detection. In particular, we design and implement an SADS based on EADetection to perform real-time detection. Extensive experimental results demonstrate that EADetection achieves the two goals of streaming anomaly detection: (1) improving detection efficiency to meet low-latency requirements and (2) guaranteeing high accuracy.

Although our proposed approach efficiently completes user behavior anomaly detection over data streams, much work remains for future research in this area. For instance, handling concept drift is meaningful and challenging. Our future work will incorporate it into EADetection to further improve the performance of anomaly detection.

Footnotes

This is an extended version of our paper “A User Behavior Anomaly Detection Approach based on Sequence Mining over Data Streams” presented at PDCAT2016.

Handling Editor: Antonino Staiano

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Key Research and Development Program of China (2016YFB1000101), the National Natural Science Foundation of China (Grant No. 61379052), the Natural Science Foundation for Distinguished Young Scholars of Hunan Province (Grant No. 14JJ1026), and Specialized Research Fund for the Doctoral Program of Higher Education (Grant No. 20124307110015).

References

1. Modi, Patel, Borisaniya, et al. A survey of intrusion detection techniques in cloud. J Netw Comput Appl 2013; 36(1): 42–57.

2. Symantec. Internet security threat report, https://www.symantec.com/security-center/threat-report (accessed 12 March 2017).

3. Vormetric. Vormetric insider threat report, http://www.vormetric.com/campaigns/insiderthreat/2015 (accessed 12 March 2017).

4. Han, Kamber. Data mining: concepts and techniques. Amsterdam: Elsevier, 2011.

5. Song, Ben Salem, Hershkop, et al. System level user behavior biometrics using fisher features and Gaussian mixture models. In: IEEE security and privacy workshops, San Francisco, CA, 23–24 May 2013, pp.52–59. New York: IEEE.

6. Wang, et al. A survey of queries over uncertain data. Knowl Inf Syst 2013; 37(3): 485–530.

7. Salem, Stolfo. Modeling user search behavior for masquerade detection. In: Sommer, Balzarotti, Maier (eds) International workshop on recent advances in intrusion detection. Berlin: Springer, pp.181–200.

8. Garcia-Teodoro, Diaz-Verdejo, Maciá-Fernández, et al. Anomaly-based network intrusion detection: techniques, systems and challenges. Comput Secur 2009; 28(1): 18–28.

9. Hoang. An efficient hidden Markov model training scheme for anomaly intrusion detection of server applications based on system calls. In: 12th IEEE international conference on networks, Singapore, 19 November 2004, vol. 2, pp.470–474. New York: IEEE.

10. Qiu. A simple and efficient hidden Markov model scheme for host-based anomaly intrusion detection. Network 2009; 23(1): 42–47.

11. Chand, Thakkar, Ganatra. Sequential pattern mining: survey and current research challenges. Int J Soft Comput Eng 2012; 2(1): 185–193.

12. Deypir, Sadreddini, Tarahomi. An efficient sliding window based algorithm for adaptive frequent itemset mining over data streams. J Inf Sci Eng 2013; 29(5): 1001–1020.

13. Vieira, Schulter, Westphall, et al. Intrusion detection for grid and cloud computing. IT Prof 2010; 12(4): 38–43.

14. Koucham, Rachidi, Assem. Host intrusion detection using system call argument-based clustering combined with Bayesian classification. In: SAI intelligent systems conference, London, 10–11 November 2015, pp.1010–1016. New York: IEEE.

15. Hill, Minsker, Amir. Real-time Bayesian anomaly detection for environmental sensor data. Water Resour Res 2007; 45(4): 450–455.

16. Quan, Jinlin, Wei, et al. Improved edit distance method for system call anomaly detection. In: 12th international conference on computer and information technology, Chengdu, China, 27–29 October 2012, pp.1097–1102. New York: IEEE.

17. Usman, Muthukkumarasamy, XW. Mobile agent-based cross-layer anomaly detection in smart home sensor networks using fuzzy logic. IEEE T Consum Electr 2015; 61(2): 197–205.

18. O’Reilly, Gluhak, Imran MA. Adaptive anomaly detection with kernel eigenspace splitting and merging. IEEE T Knowl Data En 2015; 27(1): 3–16.

19. Pyun, Yun, Ryu KH. Efficient frequent pattern mining based on linear prefix tree. Knowl-Based Syst 2014; 55: 125–139.

20. Bhuyan, Bhattacharyya, Kalita JK. Network anomaly detection: methods, systems and tools. IEEE Commun Surv Tut 2014; 16(1): 303–336.

21. Agrafiotis, Legg, Goldsmith, et al. Towards a user and role-based sequential behavioural analysis tool for insider threat detection. J Internet Serv Inf Secur 2014; 4(4): 127–137.

22. Esposito, Ficco, Palmieri, et al. Smart cloud storage service selection based on fuzzy logic, theory of evidence and game theory. IEEE T Comput 2016; 65(8): 2348–2362.

23. Wang, et al. Distributed host-based collaborative detection for false data injection attacks in smart grid cyber-physical system. J Parallel Distr Com 2017; 103: 32–41.

24. Esposito, Castiglione, Palmieri. Information theoretic-based detection and removal of slander and/or false-praise attacks for robust trust management with Dempster-Shafer combination of linguistic fuzzy terms. Concurr Comp: Pract E 2018; 30: e4302.

25. Aloysius, Binu. An approach to products placement in supermarkets using PrefixSpan algorithm. J King Saud Univ: Comp Inf Sci 2013; 25(1): 77–87.

26. Wang, Zhao, et al. GPS: a general framework for parallel queries over data streams in cloud. In: International conference on high performance computing and communications, Zhangjiajie, China, 13–15 November 2013, pp.1139–1146. New York: IEEE.

27. Wang, Shi, Zhang. Jointcloud: a crosscloud cooperation architecture for integrated internet service customization. In: 2017 IEEE 37th international conference on distributed computing systems (ICDCS), Atlanta, GA, 5–8 June 2017, pp.1846–1855. New York: IEEE.

28. University of New Mexico (UNM). System call datasets, http://www.cs.unm.edu/~immsec/data-sets.htm (accessed 19 March 2017).

29. Lane. Machine learning techniques for the computer security domain of anomaly detection. Doctoral Dissertation, Purdue University, West Lafayette, IN, 2000.

30. Huang, Fox, et al. Online system problem detection by mining patterns of console logs. In: Ninth IEEE international conference on data mining, Miami, FL, 6–9 December 2009, pp.588–597. New York: IEEE.

EADetection: An efficient and accurate sequential behavior anomaly detection approach over data streams
