Abstract
Introduction
Due to the development of new applications such as social networks, location based service, video sharing, the scale of Internet continues to expand. At the same time, network traffic data are also showing a trend of explosive growth, which has brought severe challenges to network traffic anomaly detection. Specially, the security and privacy problems caused by these data and their solutions are particularly important.1,2 There are many kinds of definitions of anomaly, one widely accepted definition is that of Hawkins: 3 “An anomaly is an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism.” To alleviate the impact of abnormal network traffic, network anomaly detection plays a key role on anomaly handling. Network anomaly detection was originally proposed by Denning, 4 which refers to filtering out abnormal information from traffic data, identifying and diagnosing the security status of the network so as to ensure proper functioning of the network.
Network anomaly detection is a crucial part of network security; it is an useful technology to understand the status and performance of network, which is important to the management of Internet and systems. Network anomaly detection has been applied to many fields, such as wireless sensor networks (WSNs), 5 mobile network, 6 healthy and medical application.7,8
The current mainstream methods for detecting anomalies from network traffic include statistical based, time series–based, sketch-based, and machine learning–based methods. Among them, the Isolation Forest algorithm has many advantages, including high accuracy at low false positive rates, high efficienc,y and extensibility, compared to others. In particular, its computational complexity is linear with respect to the size of data sample, and it can deal with high-dimensional data. However, the Isolation Forest algorithm is designed for single computer, meaning that it cannot work when the size of dataset exceeds the memory limit of a single machine. Therefore, in order to process large-scale network traffic data, it is necessary to realize the parallelization of the Isolation Forest algorithm.
To solve the above problems, we propose a parallel algorithm for network traffic anomaly detection based on Isolation Forest and Spark (SPIF). Our contributions in this work can be summarized as follows:
We use parallel strategies to construct multiple trees simultaneously, which improves the efficiency of modeling process.
We propose a parallel algorithm for network anomaly evaluation. We use parallel strategies to evaluate multiple data simultaneously, which improves the efficiency of abnormal evaluation.
We integrate our algorithm in the Spark framework. The data read operation of Spark is based on memory’s secondary caching mechanism, so it can avoid frequent disk Input/Output operations and thus improve the data processing efficiency.
The remainder of this article is organized as follows. Section “Related work” reviews the related work. In section “Preliminaries,” we provide a brief introduction to some preliminaries. Then, we propose our parallel algorithm for network traffic anomaly detection based on SPIF in section “Proposed model.” In section “Experimental study,” we briefly introduce the experimental environment and data we used, and the experimental results are analyzed in detail. Finally, conclusions are drawn and further work is given in section “Conclusion.”
Related work
Network traffic anomaly detection is usually classified into four main categories: statistical based, time series based, sketch based, and machine learning based. For sketch based and machine learning based approaches, the detection system needs labeled data to train a detection model which is quite time consumption. However, they always have a higher accuracy. As for the first two categories, they are mostly unsupervised, which means they do not need labeled data.
Statistical-based approaches are suitable for the detection of anomaly. Matsuda et al. 9 used principal component analysis (PCA)–based network traffic anomaly detection technology to project the measured flow data into normal subspace and abnormal subspace to detect traffic anomaly. This method increased the robustness of anomaly detection system and reduced the computational cost. De la Hoz et al. 10 proposed a network traffic anomaly detection method based on PCA and self-organizing map (SOM), which can quickly realize the intrusion detection system (IDS) to deal with the current link bandwidth. Bereziski et al. 11 studied the application of anomaly detection in the field of network intrusion detection and verified that the entropy-based method is suitable for the abnormal detection of modern botnet. In addition, Markov model was also anomaly traffic detection methods which are based on statistical analysis. 12 Markov anomaly detection model has self-learning function, it can determine training time according to data size, and it is easy to implement. But its real-time performance of anomaly detection is poor, and it is difficult to detect abnormal in a relatively short time.
Time series–based approaches include auto-regression and moving average model (ARMA) regression, 13 wavelet transform, 14 empirical mode decomposition (EMD) transformation, instantaneous frequency analysis, and so on. When these methods are used for network traffic anomaly detection, they are suitable to deal with the network traffic data that meet the requirements of quantization, and they can use the technology of signal processing flexibly. The research of Celenk et al. 15 presented a method of anomaly detection based on Wiener filtering of noise and ARMA modeling of network flow data. In the research, they use network-monitoring metrics for traffic features to dynamically calculate noise and traffic signal statistics. Han and Zhang 16 performed a real-time EMD on the network traffic and used weighted self-similarity parameter to detect abnormal activities over the Internet. Yu et al. 17 proposed a traffic anomaly detection algorithm for WSNs based on the autoregressive integrated moving average (ARIMA) model. Jiang et al. 18 proposed a high-speed backbone network traffic anomaly detection method based on multi-scale analysis. In the first step of this method, network traffic will be transformed on a continuous scale, and the PCA will be carried out. Then, they can extract the characteristics of abnormal network traffic and construct new mapping functions to detect abnormal traffic.
Sketch is a distributed profile data structure that can handle large amounts of data in a short time, so it is widely used in network traffic anomaly detection. Huang and Lee 19 proposed a data structure of traffic anomaly detection based on distributed architecture. This structure combines the traditional counter-based and sketch-based technologies; its detection phase is divided into two stages: local detection and distributed detection, which ensures the accuracy and scalability of the model. Chen et al. 20 implemented the detection of abnormal Internet Protocol (IP) source address by combining the sketch data structure with the improved multi-scale principal component analysis (MSPCA) detection algorithm. This approach takes advantages of the characteristics of PCA and wavelet analysis, so MSPCA can identify anomalies efficiently.
Machine learning–based approaches can quickly and effectively deal with large-scale network traffic data through self-learning method. The main machine learning methods include classification, clustering, pattern recognition, neural network, and decision tree. Literature21,22 applied the clustering of machine learning to network traffic anomaly detection and improved the efficiency of anomaly detection. Shon et al. 23 proposed a machine learning framework for anomaly detection which uses genetic algorithm (GA) for feature selection and support vector machine (SVM) for packet classification. Casas et al. 24 solved the automation problem of network traffic anomaly detection and classification using decision tree and proposed a multi-detector method to improve the performance of the whole anomaly detection system. Sheikhan and Jadidi 25 used multi-layer perceptron (MLP) neural classifier to distinguish benign and malicious flow of network intrusion detection system (NIDS) based on flow, and on this basis, they used modified gravitational search algorithm (MGSA) to optimize the interconnection weight of the neural anomaly detector and improved the performance of the detection system.
Preliminaries
Isolation Forest
There are two quantitative characteristics used in our proposed method: (1) anomalies account for a very small proportion of the whole dataset and (2) there is a big difference in their attribute values between the anomaly and normal instance, which make them “few and different” and more susceptible to isolate them from normal instances. In this article, we use tree structure to identify every single instance. Since anomalies’ susceptibility to isolation, they will be closer to the root of the tree. On the contrary, normal instances will locate at the leaf level of the tree. Those are the foundations of our method for anomaly detection, and we call it Isolation Tree or iTree. 26
Isolation Forest or iForest is a combination of iTrees, which is an efficient method of anomaly detection based on ensemble, and it is an effective and popular algorithm that can deal with large data. It is different from traditional anomaly detection methods for its high accuracy, linear time complexity, and low memory cost.
Isolation Forest returns the anomaly score of each sample using the Isolation Forest algorithm. The Isolation Forest “isolates” observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. Since recursive partitioning can be represented by a tree structure, the number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node. This path length, averaged over a forest of such random trees, is a measure of normality and our decision function. Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produces shorter path lengths for particular samples, they are highly likely to be anomalies.
In our work, iForest algorithm only has two parameters: the member of iTrees
The iForest algorithm contains two stages: (1) modeling stage, which randomly samples the dataset to get subsets and builds an ensemble of iTrees; (2) evaluating stage, which passes test data through iTrees and records the path length of each test instance, then calculates the anomaly score for each test instance. The key of iForest algorithm is the construction of Isolation Forest, which is a collection of iTrees and each iTree is a binary tree.
Modeling stage
In this stage, iTrees will be constructed by recursively executing the process of partitioning of the giving dataset until instances are divided up completely or the tree reaches the maximum depth. Details of this stages are given as follows:
Randomly collect
Randomly select an attribute and a split point
The split point
Repeat steps 2 and 3 in the child nodes, and continue to construct the left and right subtrees, until any of the three conditions (1)
Repeat the above steps 1–4, produce
Evaluating stage
Evaluating stage is the process to evaluate every instance on the basis of the iTrees constructed in modeling stage and assign a anomaly score to each instance. Details of this stages are given as follows:
For each instance
Calculate the path length of instance
Use the abnormal score obtained from step 2 to evaluate instance
In order to accurately identify anomaly instance, some definitions are given below.
Definition 1
where
In equation (1)
When
When
When
Big data processing platform
Spark is an open source and universal distributed computing framework that borrows from MapReduce’s distributed concurrency ideas, and it uses memory-based methods to store intermediate results, which is more efficient than Hadoop and is more suitable for batch processing of large-scale data. Spark has the following advantages:
Running speed: Spark introduces the directed acyclic graph (DAG) execution engine and memory-based computation into the processing of big data, which increases performance by nearly 100 times.
Easy to use: Compared to Hadoop, which provides only Map and Reduce operations, Spark offers dozens of operations, and it supports three languages—Java, Python, and Scala, and can be implemented on a Hadoop cluster.
Versatility: Spark’s ecosystem is composed of multiple components, and components can be called from each another, which provides a one-stop data processing platform for seamless integration and avoids the independent running of multiple systems. Moreover, it also allows advanced components benefit from improvements in the underlying components.
Run everywhere: Spark can implement the read operation of HBase, Hadoop, S3, Cassandra, and Tachyon, achieve the read–write operation of the native data of the persistence layer, and is highly adaptable.
Spark has obvious advantage in big data processing, it has been used by many researchers and industrial practitioners worldwide and has become the top project of Apache. In recent years, the ecosystem of Spark has been gradually improved and is fully functional, and the Spark ecosystem has many components, as shown in Figure 1.

The ecosystem of Spark.
Spark Core is the core component of Spark ecosystem; its primary responsibility is to implement data reading and application analysis, which ensures the execution efficiency of distributed computing by taking advantages of memory computing and directed acyclic graph mechanism to minimize the data reading time for multiple iterations. Besides, Spark Streaming provides Streaming real-time big data processing services, Spark Submit and Spark Shell provide interactive services for batch processing, SparkSQL provides instant query service, Machine Learning Library (MLlib) provides Machine Learning algorithm service, GraphX provides image processing services, and SparkR provides data analysis and computing services. Multiple components can be seamlessly combined with each other, which can be applied to most of today’s task scenarios.
Proposed model
Although the traditional iForest algorithm has the advantages of high accuracy, linear time complexity, and low memory requirement, it has a obvious disadvantage in dealing with big data—it is based on a single machine design, which means it is unable to handle large-scale data. With the explosive growth of network traffic data, the computational and storage limit of a single node can no longer meet the need of anomaly detection, and this greatly limits the applicability of the iForest algorithm.
To solve the problem of computation bottleneck on single machine, we propose a parallel model for network traffic anomaly detection based on Isolation Forest. The system architecture is described in Figure 2.

The system architecture of the parallel model for network traffic anomaly detection based on big data processing platform.
System model
The model is composed of network traffic data acquisition layer, network traffic data preprocessing layer, network traffic anomaly detection layer, and application service layer. Each layer is independent of the others, but they are closely linked through the big data processing platform, and they need work together to complete the processing of network traffic anomaly detection.
The network traffic data acquisition layer is one of the basic layers of the model. It uses a variety of network traffic acquisition techniques to collect large-scale network traffic of corresponding network nodes, and it is also responsible for transferring the collected network traffic information to the network traffic preprocessing layer.
Network traffic data preprocessing layer is also one of the basic layers of the model. The main task of this layer is to use the data preprocessing algorithm to process the data into the format required by the network traffic anomaly detection method. After all operations are finished, the data will be handed over to the network traffic anomaly detection layer.
The network traffic anomaly detection layer is the core layer of the model. Its first task is to ensure the parallelization of network traffic anomaly detection through the big data processing platform–Spark and Hadoop, improve the efficiency of anomaly detection, and submit the test results to the application service layer.
The application service layer can provide a series of traffic anomaly detection application services and provide the basis for the analysis of network security situation. This layer is based on the network traffic anomaly detection layer.
The big data processing platform is the carrier of the whole model, which provides powerful computation power. As a distributed computing framework for large-scale network traffic data, Spark can realize efficient use of cloud computing resources and it is highly scalable.
Algorithm implementation
To solve the problem that iForest algorithm is unable to deal with large-scale network traffic data, we take advantages of the iForest algorithm in anomaly detection and efficiency of Spark technology in big data processing and design a parallel algorithm for anomaly detection based on SPIF. This method realizes the parallelization of the iForest algorithm in modeling process and the batch processing of anomaly evaluation. The frame of the proposed model is presented in Figure 3.

The system framework of SPIF.
Modeling stage
Since the iForest algorithm needs to build multiple iTrees, this sequential construction method is time-consuming and is limited by the maximum memory capacity, and cannot adapt to anomaly detection on large-scale network traffic data. However, the construction process of each iTree is independent and has no influence on each other, so it is feasible to improve performance through parallel execution. In view of the above reasons, this article proposes the SPIF algorithm. Based on the process of model construction, SPIF uses the Spark platform to divide the work of iTrees construction across multiple computation nodes to execute simultaneously, thus achieving parallelism of the construction process. Algorithm 1 gives the pseudo-code of iForest for constructing iTrees in parallel.
SPIF-Construction(D, n, samplesize).
In Algorithm 1, D represents the network traffic dataset,
In Step 4 of Algorithm 1, the
Evaluation stage
The original iForest algorithm can only evaluate one instance at a time. In the process of calculating the anomaly score, the average path length of an instance needs to be calculated, and the calculation of anomaly score of each instance needs to iterate through the whole set of iTrees, which is time-consuming. To solve this problem, SPIF algorithm adopts the strategy of parallel evaluation of multiple instances at the same time, to improve the efficiency of anomaly evaluation. The process of anomaly evaluation of SPIF algorithm is given in Algorithm 2.
SPIF-AnomalyEval(D, iForest).
In Algorithm 2,
For Algorithm 1, we suppose the time cost to construct a iTree is
Experimental study
To evaluate the effectiveness of our proposed SPIF algorithm, we conduct extensive experiments on real network traffic datasets. We investigate the performance of the proposed algorithm from three aspects, that is, detection efficiency, effectiveness, and scalability on Spark platform and Hadoop platform. To improve the reliability of experimental results, we repeat each experiment
Experimental environment
The experiment is carried out on cloud platform. The spark platform we built consists of four machines (or nodes), where one node is configured as the master and the other three are worker nodes. The configuration of the cluster is shown in Tables 1–3.
The software configuration of Spark cluster.
The software configuration of Hadoop cluster.
Resource configuration of Spark and Hadoop cluster.
Experiment dataset
The experiment dataset used in the experiment is UNSW-NB1527,28—the comprehensive dataset used by the latest network IDS in academia. The dataset is obtained by the network security laboratory of Australia’s network security center, which is created using IXIA PerfectStorm tool, and it is used to simulate the normal network activity and attack behavior of real applications. The dataset consists of four csv files, a total of
Summary of dataset UNSW-NB15.
In order to meet the requirements of different experiments, we divide the dataset to different sizes, as shown in Table 5.
The size of different datasets.
Experiment result and analysis
Detection efficiency
In order to verify the performance of the proposed SPIF method in terms of detection efficiency, this section compares SPIF with the iForest algorithm in the single-machine environment, and a parallel algorithm for anomaly detection based on Isolation Forest and Hadoop (HPIF for short). Since this experiment is to verify the anomaly detection efficiency of massive network traffic data in a single-machine environment and a cluster environment, we need large-scale data. In order to increase the reliability of the experimental results, we use datasets Data2, Data3, Data4, and Data5 to verify the efficiency of the algorithms. Figure 4 gives the running time of the iForest algorithm based on single-machine environment, HPIF and SPIF, under different settings of data size and the number of iTrees.

Running time versus different database sizes and number of iTrees.
From Figure 4, we can see that SPIF and HPIF are significantly better than the iForest algorithm in single-machine environment when dealing with large-scale network traffic data. When dataset size and the number of iTrees are both small, the performance of the iForest algorithm based on single environment is slightly different from SPIF and HPIF. With the increase of network traffic dataset size, SPIF and HPIF obviously outperform the iForest algorithm based on single-machine environment, and the performance gap is widening as the amount of data increases. This is because SPIF algorithm and HPIF algorithm can distribute the work of modeling to the nodes of the cluster in parallel. When the number of iTrees increases, the task allocation between nodes will also be adjusted automatically. Hence, SPIF algorithm and HPIF algorithm are not sensitive to the number of iTrees. However, the iForest algorithm based on single-machine environment can only build iTrees in a sequential manner; as the number of iTrees increases, its running time will also increase linearly. Meanwhile, SPIF algorithm is based on Spark, which can put network traffic data in the memory cache, so it can read data directly from memory when iterative operations are executed. This can avoid excessive disk I/O operations, improve the iterative efficiency, and greatly reduce processing time.
Experimental results show that when dealing with anomaly detection tasks on large-scale network traffic data, the proposed SPIF method clearly outperforms HPIF algorithm and the iForest algorithm based on single-machine environment, since SPIF reduces the running time for anomaly detection.
Effectiveness
We use UNSW-NB15 dataset to verify the effectiveness of the proposed SPIF algorithm and iForest algorithm. The dataset we use is Data5 and the evaluation measures are area under curve (AUC) and Accuracy. AUC is an index normally used to evaluate the efficiency of classifiers, which is defined as the area under the receiver operating characteristic (ROC) curve, and Accuracy is the proportion of instances in the network traffic data that are detected correctly. The results are shown in Table 6.
AUC and Accuracy of iForest and SPIF.
SPIF: Isolation Forest and Spark; AUC: area under curve.
As can be seen from Table 6, the AUC and Accuracy of the SPIF algorithm and iForest are basically consistent, and there is no obvious difference. In other words, the experimental results show that the proposed SPIF algorithm can effectively reduce the data processing time and improve the execution efficiency of network traffic anomaly detection without degradation in accuracy. Therefore, SPIF is suitable for anomaly detection on large-scale network traffic data.
Scalability
We compared the SPIF algorithm with HPIF algorithm and iForest algorithm based on single-machine environment to verify their scalability by investigating their running time, respectively. In this experiment, the data sample size is set to

Running time of iForest, HPIF, and SPIF under different data sizes.
From Figure 5, we can see that for given fixed data sample size and the number of iTrees, the running time of iForest algorithm based on single-machine environment is increasing linearly with dataset size, whereas the running time of HPIF algorithm increases slowly, much lower than that of the iForest algorithm based on single-machine environment. SPIF algorithm has the shortest running time and remains stable. The experimental results show that the proposed SPIF algorithm outperforms the iForest algorithm based on single-machine environment and HPIF algorithm; hence, SPIF is more suitable for anomaly detection on large-scale network traffic data.
In order to investigate how SPIF performs when the number of computation nodes increases, we use Speedup for this purpose, which is defined as
where
From Figure 6, given a fixed number of iTrees, the Speedup value of the SPIF algorithm increases gradually as the number of computation nodes increases. On the other hand, given a fixed the number of computation nodes, the Speedup of SPIF shows a steep increasing trend with the increase of the number of iTrees. The experimental results show that SPIF algorithm can effectively accelerate the process of constructing iTrees, and reduce the time consumption of anomaly evaluation. It can meet the needs of large-scale network traffic anomaly detection, and complete the process in a relatively short time. To sum up, Spark’s parallel processing technology can effectively improve the efficiency of network traffic anomaly detection, and make SPIF algorithm has good extendibility.

The Speedup of SPIF for different number of nodes and iTrees.
Conclusion
This article introduces the problem of efficient anomaly detection from large-scale network traffic data and proposes a parallel network traffic anomaly detection algorithm SPIF that is based on iForest and Spark. By taking advantages of Isolation Forest algorithm for network traffic anomaly detection and Spark technology for big data processing, SPIF can perform well in parallel the modeling process and the anomaly evaluation process. We verify the superiority of SPIF method through extensive experiments on real datasets. The experimental results show that compared with the iForest method based on single-machine environment, the SPIF algorithm not only has higher accuracy and good scalability, but also runs faster. Meanwhile, compared with the HPIF algorithm, SPIF algorithm successfully avoids frequent disk I/O operations and significantly reduces data processing time; thus, it can used for efficient anomaly detection from large-scale network traffic data.
In our future work, we will investigate dimension reduction techniques for our network traffic data so as to further improve the efficiency of our SPIF algorithm. And we also plan to study network security risks by analyzing anomaly detection results.
