Abstract
Increasing interest in and advancement of internet and communication technologies have made network security a vibrant research domain. Network intrusion detection systems (NIDSs) have emerged as indispensable defense mechanisms in cybersecurity, employed in the discovery and prevention of malicious network activities. In recent years, researchers have proposed deep learning approaches for the development of NIDSs owing to their ability to extract better representations from large corpora of data. In the literature, the convolutional neural network architecture is extensively used for spatial feature learning, while long short-term memory networks are employed to learn temporal features. In this paper, a novel hybrid method that learns discriminative spatial and temporal features from network flows is proposed for detecting network intrusions. A two-dimensional convolutional neural network extracts the spatial characteristics, whereas a bi-directional long short-term memory network extracts the temporal features of network traffic data samples, together forming a deep hybrid neural network architecture for the identification and classification of network intrusion samples. Extensive experimental evaluations were performed on two well-known benchmark datasets: CIC-IDS 2017 and NSL-KDD. The proposed network model demonstrated state-of-the-art performance, with experimental results showing that its accuracy and precision scores are significantly better than those of other existing models. These results depict the applicability of the proposed model to spatial-temporal feature learning in network intrusion detection systems.
Keywords
Introduction
With the rapid increase in the importance of cyber security, research on Intrusion Detection Systems (IDS) has been active. In addition, the rapid growth in the volume, velocity, veracity and variety of network data and connected devices bears intrinsic security risks and threats. Modern-day organizations are in constant search of solutions to combat these threats in order to safeguard the integrity, confidentiality and availability of their information assets [25]. Research shows that intrusions on computer networks result in losses amounting to trillions of dollars annually, impede the availability of critical systems for end users, drive up costs significantly, and damage corporate reputation [51].
Owing to these adverse effects, several efforts have been made by researchers to help curb this menace by developing network-based intrusion detection systems (NIDS) as a solution to security problems in network administration. These systems are touted as among the most effective approaches to detecting present-day network attacks [4,34,46,56]. Depending on the method of intrusion detection, NIDSs are categorized into two classes: signature-based detection (or misuse detection) and anomaly-based detection (behaviour-based detection) [8,36]. Signature-based detection defines a set of rules (or signatures) that decide whether a given pattern is an attack or not. Thus, signature-based systems are capable of achieving high levels of accuracy and a minimal number of false positives while identifying intrusions. On the other hand, anomaly-based detection systems search for abnormal traffic by comparing actual behaviour with normal system behaviour and flag any intrusions. They are well suited for the detection of unknown and new attacks, leading to their wide acceptance in the research community [23].
An NIDS is, in essence, a classifier: it analyses collected network traffic and compares it against a baseline defined for the system. It monitors and analyses the traffic entering or exiting an organization's network devices and raises alarms if an intrusion is observed. As a classifier, the NIDS differentiates malicious entries from legitimate entries in network traffic data, and thus lends itself to machine learning techniques. These systems use supervised, semi-supervised, or unsupervised mechanisms to learn patterns of both benign and malicious activities when exposed to large quantities of network flow data. In recent years, several techniques and models have been developed and employed to retrieve significant information from data obtained in large and high-speed networks. The NIDS pioneered by Denning [18] in 1987 has been successfully used to detect and prevent malicious network behaviours over the years. Numerous machine-learning-based NIDSs have been developed and evaluated, with supervised deep learning approaches, such as deep neural networks (DNN), convolutional neural networks (CNN), and long short-term memory (LSTM) networks, gaining increased interest in recent years. These deep learning algorithms have been widely utilized in the network security domain, which involves large-scale data in modern cyberspace networks, since they are capable of learning deeply integrated features.
The scope of NIDS detection targets both the host and network levels. Essentially, NIDSs open a frontier for deployment at strategic points in modern-day networks to monitor incoming traffic and discover malicious packets in the network traffic [49]. In view of the increasing scope and typology of security risks inherent in large and high-speed networks, network monitoring based on machine learning techniques has been evolving rapidly [8,26,36]. However, according to [33], two main challenges arise when developing an efficient and flexible NIDS for detecting novel network attacks. First, selecting features from the network traffic dataset is difficult, as features suited to one class of attack may not work well for other categories owing to constantly changing and evolving attack scenarios. Secondly, network administrators are reluctant to report intrusions that might have occurred in their networks, owing to the need to preserve the confidentiality of the internal organizational network structure as well as the privacy of various users.
A key factor in designing an effective NIDS is selecting significant features for making decisions. This involves extracting relevant attributes from the huge corpus of noisy, high-dimensional data that is characteristic of large and high-speed networks. Data mining and machine learning techniques have emerged over the years and have been subjected to extensive research for developing NIDSs using different intrusion detection datasets [25]. Deep learning, an offshoot of machine learning, has come to the fore as a multifaceted tool for the synthesis and analysis of volumes of heterogeneous data, offering reliable predictions of complex and uncertain phenomena [5,28]. Deep learning is inspired by neuroscientific efforts to understand the working of the human brain, attributed to the human ability to "think" [45]. Given their multiple levels of abstraction, deep networks learn to map input features to the output, where learning does not hinge on handcrafted features [46]. Additionally, deep learning algorithms offer enormous flexibility in designing each part of the network architecture, resulting in several ways of discovering the most efficient activation functions and improving generalizability [17].
In this paper, a novel hybrid network intrusion detection algorithm is proposed that seeks to improve detection speed and accuracy. Since deep learning requires huge amounts of data for classification and regression tasks, this necessitates the design and use of algorithms that are efficient in this regime. The supervised deep learning algorithms considered are the Convolutional Neural Network (CNN) and the Long Short-Term Memory (LSTM) network. These neural networks use multiple layers in which non-linear transformations extract higher-level features from the input data [50]. The hybrid architecture integrates a CNN and a bi-directional long short-term memory (Bi-LSTM) network to learn both spatial and temporal features from network flow data for the detection and classification of attacks in large and high-speed networks. The combination of CNN and LSTM units has recently produced promising results in problems requiring spatial and temporal information classification [4,23,31]. For the experiments, we developed three deep learning models – a CNN, a Bi-LSTM, and the combined CNN+Bi-LSTM – and compared the results. Experimental results were generated on two benchmark datasets: CIC-IDS 2017 and NSL-KDD. After training and validation, the proposed model architecture was found to achieve higher accuracy, precision and detection speed. The key contributions of this study are as follows:
A novel deep learning network structure of hybrid CNN-Bi-LSTM is designed, in which the CNN module can effectively extract the spatial features and the Bi-LSTM module extracts the temporal features for network intrusion prediction.
To verify the performance of the hybrid CNN-Bi-LSTM model, the study was conducted on two publicly available network intrusion datasets. The performance results demonstrate the ability of the proposed model architecture to effectively detect network intrusions in modern-day organizations. The proposed model can be used in the design of NIDSs to accurately predict network intrusions, which can help network administrators flag intrusions and make evidence-based decisions.
The rest of this paper is organized as follows. Section 2 presents the motivation for this paper. Section 3 presents an analysis of related works. Section 4 provides detailed information about spatial and temporal feature learning. Section 5 presents the proposed model architecture and the generic steps taken in order to develop the model using deep learning strategies. Section 6 describes the methods and materials for developing, training and evaluating the proposed hybrid network flow prediction model. Section 7 presents experimental results. Lastly, Section 8 presents the conclusions and future research direction.
Motivation
Safeguarding the security of an organization's information and systems from unauthorized access, use, disclosure, disruption, modification, or destruction is of paramount importance for its success. Today, organizations are investing considerable resources to protect the confidentiality, integrity, availability and privacy of their information assets against a host of new and evolving cybersecurity threats [55]. These threats evolve on a daily basis as organizations adopt information technology products and services to run their day-to-day activities and share information and data online. Further, the increasing number of connection points to the internet has led to increased cybersecurity threats. Network intrusion detection systems support network administrators in detecting network security breaches. Thus, developing an efficient and accurate NIDS would assist in reducing network security threats.
In recent years, researchers have focused their attention on the selection and extraction of specific features from network traffic, with the objective of facilitating the discrimination of benign and malicious traffic data [15,16,64]. The ability to capture intrusions in time, particularly in large-scale networks, is critical and very challenging, as is recognizing the spatial and temporal features in network traffic data that can be effectively extracted [75]. Deep learning-based approaches have been touted as effective in this process, leading to the design and implementation of effective and flexible NIDSs owing to their potential to extract better representations from the data [9,28,51]. Motivated by these issues, this paper presents a novel hybrid deep learning-based model for detecting network intrusions in organizations with high detection rates [12,62]. We apply a bi-directional Long Short-Term Memory (LSTM) network and a Convolutional Neural Network (CNN) in the design of a novel hierarchical NIDS model structure that extracts discriminative spatial and temporal characteristics from normal and malicious network traffic, train and evaluate the model using the NSL-KDD and CIC-IDS 2017 benchmark datasets, and measure the performance using standard metrics.
Related work
We reviewed several related works that exemplify hybrid deep learning architectures for network intrusion detection. Owing to their efficiency in finding ideal solutions with both large and finite amounts of data, deep learning approaches have garnered significant research attention [57]. Wang et al. [73] proposed a model combining a CNN and an LSTM for the analysis and detection of attacks in network flows. Their model employs an architecture in which the CNN initially learns low-level spatial features and the LSTM then learns high-level temporal features from the network flow. The method was reported to attain remarkable results in terms of accuracy and detection rate. Sun et al. [64] developed a deep learning-based intrusion detection system (dubbed DL-IDS) that exploits the hybrid architecture of CNN and LSTM to mine both the spatial and temporal features from flow data and yield a superior model. Yao et al. [79] proposed an intrusion detection algorithm founded on the cross-layer feature fusion of LSTM and CNN networks for Advanced Metering Infrastructure.
Kolosnjaji et al. [43] built a model comprising recurrent and convolutional layers, capable of extracting classification features for a malware detection system. They obtained an architecture for hierarchical feature extraction that combines the benefits of the convolution process from the network's convolutional layers and sequence modelling from the recurrent layers. Ahsan and Nygard [4] introduced a novel approach that used a hybrid algorithm of CNN and LSTM to detect network intrusions, achieving high accuracy on the standard NSL-KDD dataset without applying any hyperparameter tuning. The work of [75] proposed LuNet, a hybrid hierarchical network architecture comprising a CNN and an LSTM. The model extracts spatial and temporal features by synchronizing the CNN learning and the LSTM learning into multiple steps, with the learning granularity gradually increased from coarse-grained to fine-grained, and was tested on both the NSL-KDD and UNSW-NB15 datasets.
In [63], the model combined bi-directional long short-term memory (Bi-LSTM), an attention mechanism, and multiple convolutional layers. The approach used structured network traffic information to generate time-series features. The multiple convolutional layers extracted local features, the Bi-LSTM produced the packet vectors, and the attention mechanism screened the network flow comprising the packet vectors. Finally, the model was tested on the NSL-KDD and KDD-CUP99 datasets, with a Softmax classifier used for the final classification.
More recently, Abdallah et al. [23] proposed a hybrid intrusion detection mechanism capable of capturing the spatial and temporal features of network traffic. The model combined a CNN and an LSTM and achieved a detection accuracy of 96.32% when applied in a Software-Defined Networking environment. Table 1 gives a summary of the relevant hybrid deep learning-based NIDS models reviewed in the literature.
Two major gaps were recognized in the literature on deep learning-based NIDSs. First, these methods experience a high false-alarm rate [76,78], which can demand additional resources and time to discount the numerous alerts generated [55,72]. Secondly, developing an intrusion detection system for a dynamically changing computing environment, which requires a fast and suitable feature selection method [80], remains challenging, given that most NIDSs are dependent on the deployed environment [36]. In light of these gaps, our study develops and validates a hybrid deep learning detection model and presents an evaluation through standardized performance metrics for model classification. Further, the proposed model adopts a hierarchy of combined CNN and bi-directional LSTM, a variation of RNN, which reduces the computational burden and proneness to overfitting compared with the models proposed by [4,73], and [75].
Summary of reviewed hybrid deep learning based NIDS
Spatial feature learning
Learning spatial features involves mastering expressive feature representations of data containing spatial properties. Deep network architectures are utilized to take advantage of this characteristic in network traffic data, which customarily contains spatial properties along some dimensions. The design of CNNs is based on spatial coherence, and they are frequently used for spatial feature learning. CNNs introduce layers with specialized operations into deep neural networks [28], considering the spatial properties of the data and thereby building more efficient networks. CNNs hold powerful learning ability largely owing to their multiple feature extraction stages (hidden layers) that facilitate automatic learning of representations from the data [38]. Two-dimensional CNNs are used for spatial feature learning, and with spatial feature extraction there is potential for redundancy between spectral bands. Consequently, such a framework is normally combined with a feature-dimensionality-reduction algorithm that minimizes the spectral dimension before spatial feature extraction [16,37]. The general architecture of a CNN has three main layers: the convolutional layer, the pooling layer, and the fully connected layer. The general structure of a CNN is shown in Fig. 1.

Structure of a CNN (source: [21]).
The convolutional layer comprises multiple filters that are convolved with the input to generate activation maps [41]. Multiple maps are obtained by increasing the number of filters (or kernels) and biases, which makes learning coarse-grained at the beginning of the network and fine-grained at the end [62]. The filters have receptive fields that are trained to learn specific features from an image. The convolutional layer is formulated as presented in Eq. (1):

$x_j^l = f\big(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l\big)$ (1)

where $x_j^l$ is the $j$-th feature map of layer $l$, $M_j$ is the set of input feature maps, $k_{ij}^l$ is the convolution kernel, $b_j^l$ is the bias term, $*$ denotes the convolution operation, and $f(\cdot)$ is the activation function.
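The convolution operation of Eq. (1) can be sketched in a few lines of NumPy. This is an illustrative single-channel "valid" convolution (implemented as cross-correlation, the convention used by most deep learning frameworks), not the paper's implementation, and the filter values are arbitrary:

```python
import numpy as np

def conv2d(x, kernel, bias=0.0):
    """Single-channel 'valid' 2-D convolution (cross-correlation form)."""
    kh, kw = kernel.shape
    h, w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            # Receptive field times kernel weights, plus bias (Eq. (1) without f).
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel) + bias
    return out

x = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 input map
k = np.array([[1.0, 0.0], [0.0, -1.0]])        # arbitrary 2x2 filter
feat = conv2d(x, k)                            # 3x3 activation map
```

In a full CNN, an activation function such as ReLU would then be applied element-wise to `feat`, and many such filters would run in parallel to produce multiple feature maps.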
The pooling layer
This layer collects the discriminative information by eliminating irrelevant details and makes the convolutional features more invariant to small translations of the input. In other words, the pooling layer offers translation invariance while diminishing the resolution of the activation maps, thereby reducing the computational complexity [48]. The pooling operation reduces the number of parameters in the network structure, effectively preventing overfitting. There are two types of pooling strategies: rank-based and value-based. In the literature, most existing deep CNN models adopt the max-pooling strategy, which belongs to value-based pooling [58,77]. Max-pooling retains the strongest value from each local region of the activation map.
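The max-pooling operation described above can be sketched in NumPy as follows; this is a minimal non-overlapping 2 × 2 pooling, not tied to any particular framework:

```python
import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping max pooling: keep the strongest value per local window."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

a = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 5.],
              [0., 1., 7., 2.],
              [3., 2., 4., 6.]])
pooled = max_pool2d(a)   # 4x4 map reduced to 2x2
```

Each output entry is the maximum of one 2 × 2 window, so the activation map's resolution (and the downstream parameter count) drops by a factor of four while the strongest responses survive small input translations.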
Fully connected layer
The latter part of the CNN usually comprises several fully connected layers, whose output feeds the classifier. Each neuron in a fully connected layer connects to all neurons in the previous layer, thus accommodating learning from all activation maps of the previous layer. At each iteration, the convolutional filters and fully connected layers are updated with the goal of minimizing the average loss; for classification this is typically the cross-entropy:

$L = -\frac{1}{N}\sum_{n=1}^{N}\sum_{c=1}^{C} y_{n,c}\,\log \hat{y}_{n,c}$

where $N$ is the number of training samples, $C$ is the number of classes, $y_{n,c}$ indicates whether sample $n$ belongs to class $c$, and $\hat{y}_{n,c}$ is the predicted probability.
Temporal feature learning
The objective in temporal feature learning is to encapsulate temporal features in an efficient manner that permits generalization to series of arbitrary length. This may be achieved by sharing parameters through time, rather than re-learning them at every step. The recurrent neural network (RNN) is the prominent architecture for training on temporal data [79]. RNNs introduce recurrent connections in time, enabling parameter sharing in a deeper way. Ferrag et al. [25] describe an RNN as a neural network whose connection graph contains at least one cycle, extending the feedforward architecture by permitting recurrent connections within layers. The previous model state is treated as a supplementary input at each time step, enabling the RNN to form a memory of previous inputs in its hidden state [53].
Several types of RNN have been proposed in the literature. Long short-term memory (LSTM), a variant of RNN, has been adopted by many cybersecurity researchers since it uses memory cells in place of the conventional hidden units [34]. Moreover, the LSTM has long-term memory owing to its slowly changing weights over time, in addition to its capacity for short-term memory over short ranges. This structure allows the LSTM to recall long-range features better than conventional RNNs. The basic unit of the LSTM hidden layer is the memory module. This module comprises one memory cell and three adaptive multiplicative gating units, namely the input gate, output gate and forget gate [35]. The principal information of the LSTM is conveyed along the cell-state line, where the LSTM forgets old information and learns new information via the three gate units [79]. Figure 2 presents the basic architecture of a long short-term memory network.

A basic structure of a conventional LSTM network (source: [31]).
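A single LSTM step with its three gates and memory cell can be sketched in plain NumPy. This is an illustrative implementation of the standard LSTM equations with randomly initialized weights, not the trained network used in this paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step; W maps [h_prev, x_t] to stacked gate pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    n = h_prev.size
    f = sigmoid(z[0 * n:1 * n])       # forget gate: what to discard from the cell
    i = sigmoid(z[1 * n:2 * n])       # input gate: what new information to store
    o = sigmoid(z[2 * n:3 * n])       # output gate: what part of the cell to emit
    g = np.tanh(z[3 * n:4 * n])       # candidate cell state
    c_t = f * c_prev + i * g          # updated memory cell
    h_t = o * np.tanh(c_t)            # hidden state passed to the next step
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.standard_normal((4 * n_hid, n_hid + n_in))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.standard_normal(n_in), h, c, W, b)
```

Because the cell state is carried forward additively, gradients can flow across many time steps, which is what gives the LSTM its long-range memory.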
In our work, a bi-directional LSTM (Bi-LSTM) was used. The Bi-LSTM processes a sequence of data in both forward and backward directions by means of two distinct hidden layers and then joins them at the same output layer, so the two sub-networks work alongside each other and learn from both the past and the future of the sequence [7]. The output layer yields an output vector that combines the forward and backward hidden states at each time step.

A Bi-directional LSTM architecture (source: [54]).
The Bi-LSTM is superior in capturing the relations among elements in an entire sequence by exploiting information in both directions, rather than recalling features in only one direction as a conventional LSTM does. During training, the Bi-LSTM adopts the backpropagation-through-time algorithm. The computation for every neuron node in the LSTM proceeds as follows. At time $t$, the input gate decides which part of the current input to store in the cell state:

$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$

where $x_t$ is the current input, $h_{t-1}$ is the hidden-layer output of the previous moment, $W_i$ and $b_i$ are the corresponding weight matrix and bias, and $\sigma$ is the sigmoid function.

The forget gate, based on the previous moment of the hidden-layer output $h_{t-1}$ and the current input $x_t$, decides which information to discard from the cell state:

$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$

The output gate determines what will be output by carrying out a sigmoid function that selects which part of the cell state the LSTM is going to output. The result is then passed through a Tanh layer (values between −1 and 1) so that only the selected part is passed on to the next neuron, as shown in the equations:

$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o), \quad h_t = o_t \odot \tanh(C_t)$

In the meantime, the memory cell state value $C_t$ is updated by combining the retained old state with the new candidate state:

$C_t = f_t \odot C_{t-1} + i_t \odot \tanh(W_C [h_{t-1}, x_t] + b_C)$
Since the Bi-LSTM comprises two LSTM networks, a forward LSTM and a backward LSTM, the following feature extraction steps are performed:
1. The forward and backward propagation is achieved using the internal self-recurrent state update method, and the output value of each single module is computed.
2. The state of the Bi-LSTM at time $t$ is obtained by concatenating the forward hidden state $\overrightarrow{h_t}$ and the backward hidden state $\overleftarrow{h_t}$, i.e. $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$.
3. From the two directions of time and the network layers, each LSTM's error term is computed in reverse.
4. The gradient of each weight is calculated from the error term.
5. A gradient-based optimization algorithm is applied to update the weights.
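The bidirectional scheme can be illustrated compactly: run a recurrent pass forward, run another over the reversed sequence, realign it in time, and concatenate the states per step. A simple tanh recurrence stands in for the LSTM unit purely to keep the sketch short, and the weights are random placeholders:

```python
import numpy as np

def rnn_pass(xs, W, U, b):
    """Recurrent pass over a sequence (a tanh cell stands in for the LSTM unit)."""
    h = np.zeros(U.shape[0])
    hs = []
    for x in xs:
        h = np.tanh(W @ x + U @ h + b)
        hs.append(h)
    return np.stack(hs)

rng = np.random.default_rng(1)
T, n_in, n_hid = 5, 3, 4
xs = rng.standard_normal((T, n_in))
W = rng.standard_normal((n_hid, n_in))
U = rng.standard_normal((n_hid, n_hid))
b = np.zeros(n_hid)

h_fwd = rnn_pass(xs, W, U, b)                  # past -> future
h_bwd = rnn_pass(xs[::-1], W, U, b)[::-1]      # future -> past, realigned in time
h_bi = np.concatenate([h_fwd, h_bwd], axis=1)  # per-step [forward; backward] state
```

At every time step the concatenated state carries context from both the preceding and the following parts of the sequence, which is exactly what the Bi-LSTM exploits.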
The proposed model
In this section, the system model is formalized by combining a CNN with a Bi-LSTM to extract the discriminative features, and a deep hierarchical network model is constructed. The proposed model is shown in Fig. 4. It performs pre-processing, feature extraction through network training and testing, and the final classification.

The proposed network intrusion detection model (source: authors, 2022).
Network traffic data may be described as multi-variate time-series data that has spatial similarities. The spatial-temporal data can be represented as a matrix whose rows correspond to time steps and whose columns correspond to traffic features.
In the proposed model, both the spatial and temporal features of the network flow data are extracted by constructing a hierarchical network model that combines the CNN and Bi-LSTM algorithms. Since the CNN and Bi-LSTM networks differ in input format, the extracted spatial features are first adjusted at the CNN output to conform to the input format of the Bi-LSTM network. The CNN's fully connected layer outputs a 1×128 feature vector. This vector is reshaped to an input size of 64 with the time step set to 2 before being forwarded to the input layer of the Bi-LSTM network. During flattening, the model input vector is automatically adjusted in the fully connected layer, so the Bi-LSTM input size is consistent with the CNN output size, and the two Bi-LSTM layers then perform temporal feature extraction.
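The interface between the two sub-networks amounts to a simple reshape of the CNN's 1×128 fully connected output into a (time steps = 2, features = 64) sequence for the Bi-LSTM. A sketch with dummy data:

```python
import numpy as np

cnn_out = np.arange(128, dtype=float)   # stand-in for the 1x128 FC output
bilstm_in = cnn_out.reshape(2, 64)      # 2 time steps of 64 features each
```

No information is lost: the 128 spatial features are merely regrouped so that the Bi-LSTM can treat them as a short two-step sequence.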
The Sigmoid function was used as the activation function in each layer to perform non-linear operations. The result of each Bi-LSTM recursive operation is obtained as a fusion of all previous features and the current features. In the model architecture, one fully connected layer is linked to the output layer of the Bi-LSTM. The previously extracted features are integrated, and the output value of the last fully connected layer is passed to the softmax function. After the learning process, the model performs multi-class classification to categorise the output. The following is a description of each component and its role in the proposed NIDS model architecture:
The paper considers the detection of malicious attacks based on a network flow, which has a distinct hierarchy, as presented in Fig. 5. According to a specific network protocol format, a number of flow bytes are combined to form a network packet, and then multiple network packets are combined to form a network flow. The network flow is subsequently classified as normal or malicious, with the discriminative features learnt using a deep learning algorithm.

Structure of network flow (source: [21]).
The raw network traffic data, captured in libpcap file format (.pcap), are transformed into numeric vectors. Normally, the input flow data contains a multiplicity of features, some of them non-numeric. As such, they need to be encoded as numeric types before being fed into the neural network. One-hot encoding, a strategy that creates dummy variables by mapping each unique categorical value to a binary vector whose length equals the number of distinct categories, was applied to such features.
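One-hot encoding of the non-numeric flow attributes can be done, for example, with pandas; the column names below are illustrative stand-ins for the dataset's categorical features:

```python
import pandas as pd

flows = pd.DataFrame({
    "duration": [0.2, 1.5, 0.7],             # already numeric
    "protocol_type": ["tcp", "udp", "tcp"],  # categorical
    "service": ["http", "dns", "ftp"],       # categorical
})
# Each unique category becomes its own binary indicator column.
encoded = pd.get_dummies(flows, columns=["protocol_type", "service"])
```

The three rows now carry one numeric column plus five indicator columns (two for protocol_type, three for service), ready to feed into the neural network.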
Feature selection
Feature selection was done to reduce the model's susceptibility to overfitting, to speed up training, and to ensure that memory, storage, and processing requirements are reduced as much as possible. To address the problem of feature redundancy, this paper adopted the feature selection algorithm proposed by [15]. The algorithm, presented in Algorithm 1, is based on the Random Forest approach: it starts by calculating the significance of the sample features and ranks them in order of importance; it then analyses the correlation between features using Pearson's index, and lastly combines the two results to select the features. The feature selection process is demonstrated by the following pseudo-code:
Feature selection based on random forest
The Random Forest algorithm uses the decision tree as its base learner. The algorithm is capable of identifying significant features from huge numbers of sample features. According to [29], the crux of the algorithm is to analyse and compute the contribution of each sample feature in each tree, calculate its average value, and then compare features to identify the significant ones. The out-of-bag data error rate method for tuning-parameter selection and error estimation was used as the evaluation metric, owing to its ability to yield less biased estimates of the true prediction error [32].
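The correlation-pruning half of the procedure can be sketched as follows. The importance scores are placeholders for the Random Forest output, and the 0.9 threshold is illustrative, not the value used in [15]:

```python
import numpy as np

def select_features(X, importances, corr_threshold=0.9):
    """Walk features in descending importance; drop any feature whose absolute
    Pearson correlation with an already-kept feature meets the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in np.argsort(importances)[::-1]:
        if all(corr[j, k] < corr_threshold for k in kept):
            kept.append(j)
    return sorted(kept)

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 4))
X[:, 3] = X[:, 0] + 0.01 * rng.standard_normal(200)  # feature 3 duplicates feature 0
importances = np.array([0.4, 0.2, 0.3, 0.1])         # stand-in RF importances
selected = select_features(X, importances)           # the redundant feature is dropped
```

The near-duplicate feature is discarded in favour of its more important twin, while the independent features survive.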
The feature dataset customarily encompasses values with diverse scale ranges, which during learning tend to dominate the loss function. Data normalization is an important pre-processing step aimed at improving the efficiency and accuracy of the model and preventing a feature with a particularly large value range from dominating the eventual distance calculation [20]. For features with a very large difference between their minimum and maximum values, a logarithmic scaling method is applied to map them to a bounded range. The minimal-maximal (Min-Max) scaling approach [13], which linearly scales each feature to the interval [0, 1], was used. Min-Max scaling encloses all features in a common boundary without losing information. This technique helps gradient-descent-based algorithms converge more quickly towards the minima and ensures that the gradient-descent steps are updated at the same rate for all features while optimizing the loss function along a smoother path toward the global minimum [9], as shown in Eq. (9):

$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$ (9)

where $x$ is the original feature value, $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of that feature, and $x'$ is the scaled value.
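The Min-Max scaling of Eq. (9) translates directly into a per-column operation; a minimal NumPy sketch with toy values:

```python
import numpy as np

def min_max_scale(X):
    """Linearly scale each feature (column) to the interval [0, 1]."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

X = np.array([[10., 200.],
              [20., 400.],
              [30., 300.]])
X_scaled = min_max_scale(X)   # each column now spans [0, 1]
```

Note the division assumes x_max > x_min for every feature; constant columns would need to be dropped or handled separately.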
Feature extraction and dimensionality reduction
In the feature extraction stage, a dual input of flow data is used, with the goal of extracting the features of the netflow more comprehensively. The key objective of feature extraction is to obtain a set of representative vectors of the dataset while reducing its high dimensionality. The performance of an NIDS is considerably enhanced when the features are more discriminative and representative [2]. Empirical studies in network intrusion detection [1,69,70] have shown that the efficacy of machine learning algorithms can degrade when fitted on data containing redundant and irrelevant features. Consequently, it is necessary to reduce the number of input features. Among the most effective techniques proposed in the literature are Principal Component Analysis (PCA), Laplacian Eigenmaps, Independent Component Analysis, Kernel Principal Component Analysis (KPCA) [66], Fisher Linear Discriminant Analysis [68], Auto-encoders and Locally Linear Embedding [2,39].
In our work, PCA was adopted as the most suitable approach for solving intrusion detection problems owing to its simplicity and its ability to maintain most of the variability of the data [59]. PCA assumes linearity in the set of reduced data. Further, the PCA technique can deal with large datasets with fast run time while determining the ideal number of Principal Components desired for intrusion detection [69]. The Principal Components can be represented by Eq. (10):

$PC_i = w_i^{T}x = \sum_{j=1}^{p} w_{ij}x_j$ (10)

where $w_i$ is the eigenvector associated with the $i$-th largest eigenvalue of the covariance matrix, $x$ is the standardized feature vector, and $p$ is the number of original features.
Further, PCA reduces the computational costs and the error of parameter estimation by decreasing the number of dimensions of the feature space, thereby extracting a subspace that offers the best description of the data [44]. To be precise, after standardization, PCA extracts the eigenvectors and eigenvalues from the covariance matrix (CM) given by Eq. (11):

$CM = \frac{1}{n-1}\sum_{k=1}^{n}(x_k - \bar{x})(x_k - \bar{x})^{T}$ (11)

where $n$ is the number of samples, $x_k$ is the $k$-th standardized sample vector, and $\bar{x}$ is the mean vector.
The eigenvalues are arranged in descending order, and the eigenvectors associated with the largest eigenvalues define the principal components retained for the reduced feature space.
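The eigendecomposition route to PCA can be sketched in NumPy; this is a generic textbook implementation on synthetic data, not the study's pipeline:

```python
import numpy as np

def pca(X, n_components):
    """PCA via eigendecomposition of the covariance matrix of standardized data."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize each feature
    cov = np.cov(Xs, rowvar=False)                 # covariance matrix of the data
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]              # eigenvalues in descending order
    components = eigvecs[:, order[:n_components]]  # top principal directions
    return Xs @ components, eigvals[order]

rng = np.random.default_rng(3)
base = rng.standard_normal((300, 2))
X = np.column_stack([base[:, 0], base[:, 1],
                     base[:, 0] + base[:, 1]])     # 3rd feature is redundant
Z, eigvals = pca(X, n_components=2)
```

Because the third feature is a linear combination of the first two, the smallest eigenvalue is (numerically) zero and two components capture all the variance.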
Segmentation and conversion to matrix
According to Sun et al. [64], different methods are used to segment the original traffic data into fragments of different forms, and the chosen method evidently has an impact on the ensuing analysis. Network traffic can be sliced in six ways: by connection, session, network flow, TCP, service class, and host. The study adopted a session-sharding approach, in which a session comprises the packets of a bi-directional flow. To be precise, such packets share the same five-tuple structure (source IP, destination IP, source port, destination port, and transport-layer protocol), with the source and destination addresses and ports interchangeable.
The CNN input format must be three-dimensional (height, width, and channel). Each input is dimensionally transformed to match the format of a grayscale image; hence, for a single sample, the channel dimension is 1 when reshaping a single flow sample with a length of
Each network record of the data was dimensionally transformed into the grayscale-image format, i.e. with a channel depth of D = 1 (rather than the three RGB channels). For input into the CNN, the network data is reshaped into a matrix. The dataset is then shuffled to randomize the data and split into separate training and validation sets using Python and the Pandas library. The training set contains 65% of the flows, while the remaining 35% forms the validation set.
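The reshape-shuffle-split pipeline described above can be sketched as follows. The function name, the fixed seed, and the example 8 × 8 image size are illustrative assumptions; the paper does not specify them.

```python
import numpy as np

def prepare_inputs(flows, height, width, train_frac=0.65, seed=42):
    """flows: (n_samples, height*width) array of preprocessed feature vectors.
    Returns shuffled training and validation sets in grayscale-image format."""
    n = flows.shape[0]
    # Reshape each record into a grayscale "image": (height, width, 1 channel).
    images = flows.reshape(n, height, width, 1)
    # Shuffle, then split 65% / 35% into training and validation sets.
    idx = np.random.default_rng(seed).permutation(n)
    cut = int(train_frac * n)
    return images[idx[:cut]], images[idx[cut:]]
```

The study used Pandas for the same preprocessing; the NumPy version above only shows the shapes involved.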
Model training
The study introduces a deep learning architecture based on a two-layer two-dimensional CNN and a Bi-LSTM network for the detection of malicious attacks. Since each network flow is presented in a 1D format, we used NumPy's reshape() function to transform the one-dimensional array into a two-dimensional array with one column [47]. The two-dimensional array data was then transformed into two-dimensional feature maps for further processing. The grayscale feature maps are fed into convolutional modules (3 × 3 kernels with 1 × 1 strides) to extract features. The 2D convolution has shown excellent performance in a recent study [15]. The training process is presented in Algorithm 2.
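The two-stage architecture (2D convolutions for spatial features, a Bi-LSTM for temporal features, dropout, and a softmax output) can be sketched in Keras. The layer widths, the 8 × 8 input size, and the five output classes are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_hybrid_model(height=8, width=8, n_classes=5):
    """Sketch of the hybrid CNN + Bi-LSTM classifier (assumed layer sizes)."""
    inputs = keras.Input(shape=(height, width, 1))  # grayscale feature map
    # Two 2D convolutional layers with 3x3 kernels and 1x1 strides.
    x = layers.Conv2D(32, (3, 3), strides=(1, 1), padding="same",
                      activation="relu")(inputs)
    x = layers.Conv2D(64, (3, 3), strides=(1, 1), padding="same",
                      activation="relu")(x)
    # A single pooling layer after the last convolution.
    x = layers.MaxPooling2D((2, 2))(x)
    # Treat each row of the feature map as one time step for the Bi-LSTM.
    x = layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)
    x = layers.Bidirectional(layers.LSTM(64))(x)  # temporal features
    x = layers.Dropout(0.25)(x)                   # dropout rate from the paper
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

The softmax head outputs one probability per class, so the same skeleton serves both the binary and the multi-class variants by changing `n_classes`.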
Hybrid model: CNN, Bi-LSTM training process
The input layer receives input data. Normally, the size of the input layer is similar to the input data, for instance a vector
During model training, the ReLU activation function
A fully connected layer situated before the output layer is used for detection and classification. This layer targets high-level features consistent with the specific tasks of the output layer and performs the mapping. After the mapping, the layer applies the Softmax activation function to obtain the final classification result (either normal or an attack). The Softmax classification function [3] normalizes real numbers into a probability distribution: after applying it, every component lies in the interval (0, 1) and the components sum to 1, so the non-normalized outputs of the model are mapped to a probability distribution over the predicted output classes. Hence, the predicted class would be
Where
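The softmax normalization and the resulting class prediction can be illustrated in a few lines of NumPy; the example logits are arbitrary.

```python
import numpy as np

def softmax(z):
    """Normalize raw scores into a probability distribution (stable form)."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])      # non-normalized model outputs
probs = softmax(logits)                  # each in (0, 1), summing to 1
predicted_class = int(np.argmax(probs))  # class with the highest probability
```

Here the largest logit wins, so `predicted_class` is 0; each probability lies strictly between 0 and 1.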
A categorical cross-entropy loss function was used, given that the model output is a multi-class probability distribution with a value between 0 and 1 for each class. The cross-entropy loss can be computed as shown in Eq. (15):
Where
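The categorical cross-entropy of Eq. (15) can be sketched for one-hot labels and predicted probabilities; the clipping constant is a standard numerical safeguard, not something specified in the paper.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy between one-hot labels and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    # Average over samples of -sum_k y_true[k] * log(y_pred[k]).
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
```

A perfect prediction gives a loss near zero, and the loss grows as the predicted probability of the true class shrinks, which is what drives the weight updates below.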
The Adam optimizer was used to learn the network weight parameters. Empirical results have demonstrated that Adam has practical advantages over other optimizers [42].
Methodology
This section describes how the proposed system architecture for netflow detection and classification of attacks in large and high-speed networks is actualized. Figure 6 shows the flow chart of the proposed model.

The feature learning process.
A hybrid deep learning architecture is proposed that presents a hierarchical network combining a CNN and a bi-directional LSTM to perform both binary and multiclass classification. The CNN is used to learn spatial dependencies, whereas the Bi-LSTM computes temporal dependencies.
Deep learning approaches require voluminous data for training. In this work, two well-known benchmark datasets, CIC-IDS 2017 and NSL-KDD, were used.
The CIC-IDS 2017 benchmark dataset developed by the Canadian Institute for Cybersecurity [14] was used. The dataset consists of modern-day benign activities and malignant attacks that describe concurrent network traffic. It contains normal background traffic collected by means of a B-profile system, which constructs the abstract behaviour of 25 network users over the FTP, SSH, email, HTTP, and HTTPS protocols [72], accurately mimicking an actual network environment. The network traffic was collected over a period of 5 days: normal activity traffic was collected on the first day, and malicious attacks were introduced on the other days. The attack traffic comprises eight attack types: SSH-Patator, FTP-Patator, Heartbleed, DoS, Botnet, Infiltration, Web Attack, and DDoS. The dataset comprises about 80 network flow features generated using a PCAP analyzer and organized in a CSV file. Table 2 presents the distribution of packet samples for the normal and attack classes of CIC-IDS 2017.
Distribution of normal and attack behaviours in the CIC-IDS2017 dataset
The NSL-KDD dataset is a benchmark for modern-day internet traffic and includes all types of attacks. In the literature, many researchers have utilized the NSL-KDD dataset to develop and evaluate NIDSs. It consists of 41 features, plus one class attribute, organized into three main categories (basic features, traffic-based features, and content-based features), with each record labelled as either normal or attack along with the precise attack category [9]. The basic features are derived directly from a TCP/IP connection; the traffic features are accumulated over a window, either of time (e.g. two seconds) or of a number of connections; and the content features are extracted from the application-layer data of connections. Of the 41 features, three are nominal, four are binary, and the remaining 34 are continuous. The dataset is categorized into four clusters depicting the four common attack types [19]: denial of service, user to root, remote to local, and probing attacks. Table 3 presents the number of samples for the normal and attack classes.
Summary of normal and attack records in NSL–KDD dataset
A simulation experiment environment [71] was built on a 64-bit computer running the Microsoft Windows 10 operating system with an Intel® Core i7-9750H 2.60 GHz processor, 8 GB memory, an NVIDIA GeForce RTX 2060 6 GB GDDR6 GPU, and CUDA 10.2. The experimental programs were written in Python, with data processing performed using Scikit-learn, NumPy, and Pandas. Keras, an effective high-level deep learning library built on top of TensorFlow, was used to build the neural network model.
Parameter selection and optimization
A number of experiments were conducted in which some parameters were varied. In general, parameters may be adjusted with respect to the loss function, optimization algorithm, dropout rate, activation function, batch size, etc. The spatial and temporal features extracted by individual neurons may possibly be used to independently predict different variables [67]. The focus of this work is a multi-class classification problem; accordingly, the multi-class logarithmic loss function, categorical_crossentropy, was used. The number of iterations during the experiments was set to 100, with learning rates ranging from 0.01 to 0.25. In addition, the batch_size in the experiments was set to 120. The batch size is associated with the frequency of weight updates; if it is set too small, model convergence may be slow. The dropout (weight-inactivation) rate in the regularization method was set to 0.25. The Adam optimizer was used. The ReLU function was employed as the hidden-layer activation function, while softmax was used as the output-layer activation function.
Performance metrics
Four evaluation indicators, namely accuracy (AC), precision (P), recall (R), and F-measure (F1-score), were utilized for the performance analysis of our experiments. These metrics are derived from four classification outcomes: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Table 4 shows the detailed equations for the standard performance evaluation metrics based on a confusion matrix.
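The four metrics follow directly from the TP/TN/FP/FN counts; a minimal sketch of the standard formulas:

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard metrics computed from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # fraction of correct predictions
    precision = tp / (tp + fp)                   # correctness of positive predictions
    recall = tp / (tp + fn)                      # coverage of actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R
    return accuracy, precision, recall, f1
```

For multi-class problems these counts are taken per class and then averaged; `sklearn.metrics` provides the same computations (`precision_score`, `recall_score`, `f1_score`) with micro/macro averaging options.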
Model performance evaluation metrics
This section presents the performance of the proposed hybrid CNN and Bi-LSTM network model evaluated with two publicly available intrusion detection datasets extensively adopted in previous research works: NSL-KDD and CIC-IDS 2017 datasets. In addition to the experiments performed, we conducted ablation experiments to determine the effectiveness of the model hyperparameters that we adopted for the predictive effect of the proposed model. Further, to objectively evaluate the performance of our proposed model, a comparison of our network with some related work was done.
Ablation experiments
Ablation experiments give insights into the comparative contribution of different architectural and regularization components to a machine learning model's performance [61]. In this study, ablation experiments were conducted in which the two algorithms adopted for building the predictive model were trained and tested separately, to understand which of the two (CNN or Bi-LSTM) contributed most to model performance. Lastly, we compared the results with the proposed hybrid CNN+Bi-LSTM model. Figure 7 presents the loss and accuracy results of the ablation experiment conducted on the CNN algorithm for 28 iterations, as plotted using the CIC-IDS 2017 dataset.

Ablation experiment results for the CNN block.
The previous studies reviewed demonstrated that CNN and LSTM models successfully capture features, the correlations between features (through convolution units), and the time dependencies (through memory units), which are then used for intrusion detection. The CNN learns spatial features from the network data, with the algorithm aimed at learning a weight map representing the relative importance of activations for the spatial features. The loss function measures how far the classification prediction deviates from the ground truth. The training loss of the CNN model was lower than that of the LSTM model, showing that it trained more efficiently. Table 5 presents the results of the performance evaluation of the CNN and Bi-LSTM models in the ablation study using 28 epochs.
Performance evaluation in the ablation study
The results obtained from the ablation experiments indicate that the CNN model provides better results: it outperformed the Bi-LSTM model on all the performance evaluation metrics. The CNN is a discriminative classifier that learns highly discriminative spatial features in the hidden layers of the network and can better support inference during hidden-layer weight updates. The LSTM integrates a memory component to model long-term dependencies in time-series tasks and requires more parameters than the CNN, hence it is slower to train. The LSTM model also started to overfit at epoch 22, so early stopping was applied. However, its advantage derives from its capacity to process long input sequences without increasing the network size. The temporal parameters processed by the LSTM help the model identify hidden patterns in feature sequences.
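The early stopping used when the Bi-LSTM began to overfit can be expressed as a Keras callback; the patience value of 5 here is an assumed setting, as the paper does not report it.

```python
from tensorflow import keras

# Stop training when validation loss stops improving, keeping the best weights.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the validation loss
    patience=5,                  # tolerate 5 stagnant epochs (assumed value)
    restore_best_weights=True,   # roll back to the best epoch on stop
)
# Passed to model.fit(..., callbacks=[early_stop]) during training.
```

With this callback, a run that starts overfitting (validation loss rising while training loss falls) halts automatically instead of completing all scheduled epochs.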
These results were then compared with the hierarchical hybrid model, in which the features extracted and learnt by the CNN are passed to the LSTM layer as inputs to learn the temporal dependencies. The study tried max-pooling layers after different convolutional layers in the CNN and the proposed hybrid CNN+Bi-LSTM models, and established that a pooling layer following only the last convolutional layer improved the performance of both models. The developed hybrid deep learning model achieved performance improvements over the individual learning models. The results demonstrate that the CNN and LSTM models are complementary, and that combining both further improves classification results.
The data samples of the NSL-KDD dataset were divided into two parts: the training set used to build the model, and the testing set used for evaluation. Figure 8(a) presents the relationship between the loss, accuracy, and number of epochs for the proposed hybrid CNN-Bi-LSTM model. It can be observed that, as the number of epochs increased, the training loss gradually decreased and became stable at the 48th epoch. These results show that the model's structural design and hyperparameter settings were reasonable, consequently exhibiting good convergence ability.

Model loss and accuracy plotted against epochs.
After careful fine-tuning, experimental results show that at the 90th epoch, the proposed hybrid model reached 96.22% accuracy, 95.87% precision, 95.21% recall, and an F1-measure of 95.54%. Table 6 presents the confusion matrix for the testing data.
Confusion matrix for multi-class classification on test data
The results indicate that, of the 9,711 normal network flows in the evaluation dataset, 9,342 were detected as normal; 117 flows were predicted as DoS attacks, 109 as Remote to Local attacks, and 129 as Probes. Similarly, 7,181 of the 7,460 actual DoS attacks were predicted correctly; 87 flows were predicted as normal, 105 as Remote to Local attacks, 79 as Probe attacks, and 8 as User to Root attacks. The results demonstrate that the accuracy of the model in predicting the various classes is very high.
These results were compared with other hybrid models tested on the NSL-KDD dataset, as summarized in Table 7. According to our experiments, the proposed model has a comparative advantage over previous works. Based on the accuracy metric, the proposed deep hybrid detection model demonstrates good detection efficiency for abnormal traffic when tested on the NSL-KDD dataset.
Comparative results of hybrid CNN - Bi-LSTM models tested on NSL-KDD dataset
The accuracy of the proposed model was observed to be lower than that of Ahsan & Nygard [4] and Yao et al. [79]. Several factors may explain this. First, the study in [4] applied no hyperparameter tuning during training, yielding higher reported performance; in our study, hyperparameter tuning was performed to obtain optimal hyperparameter values. Hyperparameters were optimized through manual trial-and-error searches, since grid-search experiments suffer from poor coverage in high-dimensional spaces [11]. In addition, training a deep neural network to generalize well to unseen inputs is challenging. Further, model regularization reduces validation loss at the expense of increased training loss. Regularization techniques such as noise layers or early stopping are typically applied only during training, not during validation, subsequently resulting in smoother and usually better functions at validation time [50]. In our model, early stopping was applied through model callbacks, which may have affected the optimization outcome.
To further validate the proposed model, we also performed experiments on the CIC-IDS 2017 dataset. The model achieved 99.27% accuracy, 99.89% precision, 96.54% recall, and an F1-measure of 98.19%. Figure 9 presents the model accuracy results after 100 epochs.

Accuracy results for the model trained on CIC-IDS2017 dataset.
These experimental results are compared with the results obtained by other works in Table 8.
Comparative results of hybrid CNN+Bi-LSTM model tested on CIC-IDS 2017 dataset
As observed from the experimental results, the proposed hybrid detection model performs better than other state-of-the-art models in the literature tested on the CIC-IDS 2017 dataset. Our model exhibited better accuracy, precision, recall, and F1-measure results than those obtained by [64] and [40]. Compared with the experimental results of [81], our model is better on the precision metric, meaning that the instances the model classified as attacks were overwhelmingly classified correctly. The high precision score is a particularly valuable measure of success because the classes in the dataset are highly imbalanced.
It was further observed that the proposed hybrid model achieved the best performance with the highest number of hyperparameters. Although the LSTM has fewer hyperparameters than the CNN, the final model showed better validation performance without a large increase in computation time. The hybrid model outperformed the individual CNN and LSTM models, achieving the lowest training loss, which was also lower than its validation loss. Considering these results, our model is well suited to network intrusion detection, particularly in large, high-speed networks.
In this paper, we explored an effective and novel spatial-temporal feature-extraction model for network intrusion detection systems. A hybrid deep learning approach was proposed, comprising a two-dimensional CNN and a Bi-LSTM, for building a network intrusion detection system. The CNN component extracts spatial features, while the Bi-LSTM component extracts temporal features from the dataset. The two types of features are fused via a feature-fusion component, and the model is then trained to detect and classify attacks. The proposed model was implemented using the Keras library and the TensorFlow deep learning framework. Principal Component Analysis was applied as the dimensionality-reduction algorithm. The proposed model performed multi-class classification on two benchmark intrusion detection datasets. The accuracy and precision values were greater than those of previous approaches, which demonstrates the efficacy of the proposed network intrusion detection model.
The results further demonstrate that the proposed model exhibits great potential for solving the network intrusion detection problem, since it scales better with problem complexity and requires less memory for its parameters than existing hybrid deep learning networks.
Our future research will focus on the following two areas: first, we will focus on an efficient fusion method of the spatial and temporal features to further reduce the computation costs; and secondly, we shall study the semi-supervised approach for network intrusion detection and perform both binary and multi-class classification.
