Abstract
Increasing interest in and advancement of internet and communication technologies have made network security a vibrant research domain. Network intrusion detection systems (NIDSs) have emerged as indispensable defense mechanisms in cybersecurity, employed in the discovery and prevention of malicious network activities. In recent years, researchers have proposed deep learning approaches for the development of NIDSs owing to their ability to extract better representations from large corpora of data. In the literature, the convolutional neural network architecture is extensively used for spatial feature learning, while long short-term memory networks are employed to learn temporal features. In this paper, a novel hybrid method that learns discriminative spatial and temporal features from network flows is proposed for detecting network intrusions. A two-dimensional convolutional neural network extracts the spatial characteristics, whereas a bi-directional long short-term memory network extracts the temporal features of network traffic data samples, together forming a deep hybrid neural network architecture for the identification and classification of network intrusion samples. Extensive experimental evaluations were performed on two well-known benchmark datasets: CIC-IDS 2017 and NSL-KDD. The proposed network model demonstrated state-of-the-art performance, with experimental results showing that its accuracy and precision scores are significantly better than those of other existing models. These results depict the applicability of the proposed model to spatial-temporal feature learning in network intrusion detection systems.
Keywords
Introduction
With the rapid increase in the importance of cyber security, research on Intrusion Detection Systems (IDS) has been active. In addition, the rapid growth in the volume, velocity, veracity and variety of network data and connected devices bears intrinsic security risks and threats. Modern-day organizations are in constant search of solutions to combat these threats in order to safeguard the integrity, confidentiality and availability of their information assets [25]. Research shows that intrusions on computer networks result in losses amounting to trillions of dollars annually, impede the availability of critical systems for end users, drive up costs significantly, and damage corporate reputation [51].
Owing to these adverse effects, several efforts have been made by researchers to help curb this menace by developing network-based intrusion detection systems (NIDS) as a solution to security problems in network administration. These systems are touted as among the most effective approaches to detecting present-day network attacks [4,34,46,56]. Depending on the method of intrusion detection, NIDSs are categorized into two classes: signature-based detection (or misuse detection) and anomaly-based detection (behaviour-based detection) [8,36]. Signature-based detection defines a set of rules (or signatures) that decide whether a given pattern is an attack or not. Thus, signature-based systems are capable of achieving high levels of accuracy and a minimal number of false positives while identifying intrusions. On the other hand, anomaly-based detection systems search for abnormal traffic by comparing actual behaviour with normal system behaviour and flag any intrusions. They are well suited for the detection of unknown and new attacks, leading to their wide acceptance in the research community [23].
An NIDS is, in essence, a classifier: it analyses collected network traffic and compares it against a baseline defined for the system. It monitors and analyses the traffic entering or exiting an organization's network devices and raises alarms if an intrusion is observed. As a classifier, the NIDS differentiates malicious entries from legitimate entries in network traffic data, and thus lends itself to machine learning techniques. These systems use supervised, semi-supervised, or unsupervised mechanisms to learn patterns of both benign and malicious activities when exposed to large quantities of network flow data. In recent years, several techniques and models have been developed and employed to retrieve significant information from data obtained in large and high-speed networks. The NIDS pioneered by Denning [18] in 1987 has been successfully used to detect and prevent malicious network behaviours over the years. Numerous machine-learning-based NIDSs have been developed and evaluated, with supervised deep learning approaches, such as deep neural networks (DNN), convolutional neural networks (CNN), and long short-term memory (LSTM) networks, gaining increased interest in recent years. These deep learning algorithms have been widely utilized in the network security domain, which involves large-scale data in modern cyberspace networks, since they are capable of learning deeply integrated features.
The scope of NIDS detection targets both the host and network levels. Essentially, NIDSs open a frontier for deployment at strategic points in modern-day networks to monitor incoming traffic and discover malicious packets in the network traffic [49]. In view of the increasing scope and typology of security risks inherent in large and high-speed networks, network monitoring based on machine learning techniques has been evolving rapidly [8,26,36]. However, according to [33], two main challenges arise when developing an efficient and flexible NIDS for detecting novel network attacks. First, selecting features from the network traffic dataset is difficult, as features suited to one class of attack may not work well for other categories owing to constantly changing and evolving attack scenarios. Secondly, network administrators are reluctant to report intrusions that might have occurred in their networks, owing to the need to preserve the confidentiality of the internal organizational network structure as well as the privacy of various users.
A key factor in designing an effective NIDS is selecting significant features for making decisions. This involves extracting relevant attributes from the huge corpus of noisy, high-dimensional data that is characteristic of large and high-speed networks. Data mining and machine learning techniques have emerged over the years and have been subjected to extensive research for developing NIDSs using different intrusion detection datasets [25]. Deep learning, an offshoot of machine learning, has come to the fore as a multifaceted tool for the synthesis and analysis of volumes of heterogeneous data, offering reliable predictions of complex and uncertain phenomena [5,28]. Deep learning is inspired by neuroscientific efforts to understand the working of the human brain, attributed to the human ability to "think" [45]. Given their multiple levels of abstraction, deep networks learn to map input features to the output, where learning does not hinge on handcrafted features [46]. Additionally, deep learning algorithms offer enormous flexibility in designing each part of the network architecture, resulting in several ways of discovering the most efficient activation functions and improving generalizability [17].
In this paper, a novel hybrid network intrusion detection algorithm is proposed that seeks to improve detection speed and accuracy. Since deep learning requires huge amounts of data for classification and regression tasks, this necessitates the design and use of algorithms that are efficient in this regime. The supervised deep learning algorithms considered are the Convolutional Neural Network (CNN) and the Long Short-Term Memory (LSTM) network. These neural networks use multiple layers in which non-linear transformations extract higher-level features from the input data [50]. The hybrid architecture integrates a CNN and a bi-directional long short-term memory (Bi-LSTM) network to learn both spatial and temporal features from network flow data for the detection and classification of attacks in large and high-speed networks. The combination of CNN and LSTM units has recently produced promising results in problems requiring spatial and temporal information classification [4,23,31]. For the experiments, we developed three deep learning models – a CNN, a Bi-LSTM, and the combined CNN+Bi-LSTM – and compared the results. Experimental results were generated on two benchmark datasets: CIC-IDS 2017 and NSL-KDD. After training and validation, the proposed model architecture was found to achieve higher accuracy, precision and detection speed. The key contributions of this study are as follows:
A novel deep learning network structure of hybrid CNN-Bi-LSTM is designed, in which the CNN module can effectively extract the spatial features and the Bi-LSTM module extracts the temporal features for network intrusion prediction.
To verify the performance of the hybrid CNN-Bi-LSTM model, the study was conducted on two publicly available network intrusion datasets. The performance results demonstrate the ability of the proposed model architecture to effectively detect network intrusions in modern-day organizations. The proposed model can be used in the design of NIDSs to accurately predict network intrusions, which can help network administrators flag intrusions and make evidence-based decisions.
The rest of this paper is organized as follows. Section 2 presents the motivation for this paper. Section 3 presents an analysis of related works. Section 4 provides detailed information about spatial and temporal feature learning. Section 5 presents the proposed model architecture and the generic steps taken in order to develop the model using deep learning strategies. Section 6 describes the methods and materials for developing, training and evaluating the proposed hybrid network flow prediction model. Section 7 presents experimental results. Lastly, Section 8 presents the conclusions and future research direction.
Motivation
Safeguarding the security of an organization's information and systems from unauthorized access, use, disclosure, disruption, modification, or destruction is of paramount importance for its success. Today, organizations are investing considerable resources to protect the confidentiality, integrity, availability and privacy of their information assets against a host of new and evolving cybersecurity threats [55]. These threats evolve on a daily basis as organizations adopt information technology products and services to run their day-to-day activities and share information and data online. Further, the increasing number of connection points to the internet has led to increased cybersecurity threats. Network intrusion detection systems support network administrators in detecting network security breaches. Thus, developing an efficient and accurate NIDS would assist in reducing network security threats.
In recent years, researchers have focused their attention on the selection and extraction of specific features from network traffic, with the objective of facilitating the discrimination of benign and malicious traffic data [15,16,64]. The ability to capture intrusions in time, particularly in large-scale networks, is critical and very challenging, as is recognizing the spatial and temporal features in network traffic data that can be effectively extracted [75]. Deep learning-based approaches have been touted as effective in this process, leading to the design and implementation of effective and flexible NIDSs owing to their potential to extract better representations from the data [9,28,51]. Motivated by these issues, this paper presents a novel hybrid deep learning-based model for detecting network intrusions in organizations with high detection rates [12,62]. We apply a bi-directional Long Short-Term Memory (LSTM) network and a Convolutional Neural Network (CNN) in the design of a novel hierarchical NIDS model structure that extracts discriminative spatial and temporal characteristics from normal and malicious network traffic, train and evaluate the model using the NSL-KDD and CIC-IDS 2017 benchmark datasets, and measure the performance using standard metrics.
Related work
We reviewed several related works that exemplify hybrid deep learning architectures for network intrusion detection. Owing to their efficiency in finding ideal solutions with both large and finite amounts of data, deep learning approaches have garnered significant research attention [57]. Wang et al. [73] proposed a model combining a CNN and an LSTM for the analysis and detection of attacks in network flows. Their model employs an architecture in which the CNN initially learns low-level spatial features and the LSTM then learns high-level temporal features from the network flow. The method was reported to attain remarkable results in terms of accuracy and detection rate. Sun et al. [64] developed a deep learning-based intrusion detection system (dubbed DL-IDS) that exploits the hybrid architecture of CNN and LSTM to mine both the spatial and temporal features from flow data and yield a superior model. Yao et al. [79] proposed an intrusion detection algorithm founded on the cross-layer feature fusion of LSTM and CNN networks for Advanced Metering Infrastructure.
Kolosnjaji et al. [43] built a model comprising recurrent and convolutional layers, capable of extracting classification features for a malware detection system. They obtained an architecture for hierarchical feature extraction that combines the benefits of the convolution process from the network's convolutional layers and sequence modelling from the recurrent layers. Ahsan and Nygard [4] introduced a novel approach that used a hybrid algorithm of CNN and LSTM to detect network intrusions, achieving high accuracy on the standard NSL-KDD dataset without applying any hyperparameter tuning. The work of [75] proposed LuNet, a hybrid hierarchical network architecture comprising a CNN and an LSTM. The model extracts spatial and temporal features by synchronizing the CNN learning and the LSTM learning into multiple steps, with the learning granularity gradually increased from coarse-grained to fine-grained, and was tested on both the NSL-KDD and UNSW-NB15 datasets.
In [63], the model combined bi-directional long short-term memory (Bi-LSTM), an attention mechanism, and multiple convolutional layers. The approach used structured network traffic information to generate time-series features. The multiple convolutional layers extracted local features, the Bi-LSTM produced the packet vectors, and the attention mechanism screened the network flow comprising the packet vectors. Finally, the model was tested on the NSL-KDD and KDD-CUP99 datasets, with a Softmax classifier used for the final classification.
More recently, Abdallah et al. [23] proposed a hybrid intrusion detection mechanism capable of capturing the spatial and temporal features of network traffic. The model combined a CNN and an LSTM and achieved a detection accuracy of 96.32% when applied in a Software-Defined Networking environment. Table 1 gives a summary of the relevant hybrid deep learning-based NIDS models reviewed in the literature.
Two major gaps were recognized in the literature on deep learning-based NIDSs. First, these methods experience a high false-alarm rate [76,78], which can demand additional resources and time to discount the numerous alerts generated [55,72]. Secondly, developing an intrusion detection system for a dynamically changing computing environment, which requires a fast and suitable feature selection method [80], remains challenging, given that most NIDSs are dependent on the deployed environment [36]. In light of these gaps, our study develops and validates a hybrid deep learning detection model and presents an evaluation through standardized performance metrics for model classification. Further, the proposed model adopts a hierarchy of combined CNN and bi-directional LSTM, a variation of RNN, which reduces the computational burden and proneness to overfitting compared with the models proposed by [4,73], and [75].
Summary of reviewed hybrid deep learning based NIDS
Spatial feature learning
Learning spatial features involves mastering expressive feature representations of data containing spatial properties. Deep network architectures are utilized to take advantage of this characteristic in network traffic data, which customarily contains spatial properties along some dimensions. The design of CNNs is based on spatial coherence, and they are frequently used for spatial feature learning. CNNs introduce layers with specialized operations into deep neural networks [28], considering the spatial properties of the data and thereby building more efficient networks. CNNs hold powerful learning ability largely owing to their multiple feature extraction stages (hidden layers) that facilitate automatic learning of representations from the data [38]. Two-dimensional CNNs are used for spatial feature learning, and with spatial feature extraction there is potential for redundancy between spectral bands. Consequently, such a framework is normally combined with a feature-dimensionality-reduction algorithm that minimizes the spectral dimension before spatial feature extraction [16,37]. The general architecture of a CNN has three main layers: the convolutional layer, the pooling layer, and the fully connected layer. The general structure of a CNN is shown in Fig. 1.

Structure of a CNN (source: [21]).
The convolutional layer comprises multiple filters that are convolved with the input to generate activation maps [41]. Multiple maps are obtained by increasing the number of filters (or kernels) and biases, which makes learning coarse-grained at the beginning of the network and fine-grained at the end [62]. The filters have receptive fields that are trained to learn specific features from an image. The convolutional layer is formulated as presented in Eq. (1):

$x_j^l = f\big(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l\big)$ (1)

where $x_j^l$ is the $j$-th feature map of layer $l$, $M_j$ is the set of input feature maps, $k_{ij}^l$ is the convolution kernel, $b_j^l$ is the bias term, $*$ denotes the convolution operation, and $f(\cdot)$ is the activation function.
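The convolution operation of Eq. (1) can be sketched in a few lines of NumPy. This is an illustrative single-channel "valid" convolution (implemented as cross-correlation, the convention used by most deep learning frameworks), not the paper's implementation, and the filter values are arbitrary:

```python
import numpy as np

def conv2d(x, kernel, bias=0.0):
    """Single-channel 'valid' 2-D convolution (cross-correlation form)."""
    kh, kw = kernel.shape
    h, w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            # Receptive field times kernel weights, plus bias (Eq. (1) without f).
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel) + bias
    return out

x = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 input map
k = np.array([[1.0, 0.0], [0.0, -1.0]])        # arbitrary 2x2 filter
feat = conv2d(x, k)                            # 3x3 activation map
```

In a full CNN, an activation function such as ReLU would then be applied element-wise to `feat`, and many such filters would run in parallel to produce multiple feature maps.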
The pooling layer
This layer collects the discriminative information by eliminating irrelevant details and makes the convolutional features more invariant to small translations of the input. In other words, the pooling layer offers translation invariance while diminishing the resolution of the activation maps, thereby reducing the computational complexity [48]. The pooling operation reduces the number of parameters in the network structure, effectively preventing overfitting. There are two types of pooling strategies: rank-based and value-based. In the literature, most existing deep CNN models adopt the max-pooling strategy, which belongs to value-based pooling [58,77]. Max-pooling retains the strongest value from each local region of the activation map.
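The max-pooling operation described above can be sketched in NumPy as follows; this is a minimal non-overlapping 2 × 2 pooling, not tied to any particular framework:

```python
import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping max pooling: keep the strongest value per local window."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

a = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 5.],
              [0., 1., 7., 2.],
              [3., 2., 4., 6.]])
pooled = max_pool2d(a)   # 4x4 map reduced to 2x2
```

Each output entry is the maximum of one 2 × 2 window, so the activation map's resolution (and the downstream parameter count) drops by a factor of four while the strongest responses survive small input translations.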
Fully connected layer
The latter part of the CNN usually comprises several fully connected layers, whose output feeds the classifier. Each neuron in a fully connected layer connects to all neurons in the previous layer, thus accommodating learning from all activation maps of the previous layer. At each iteration, the convolutional filters and fully connected layers are updated with the goal of minimizing the average loss; for classification this is typically the cross-entropy:

$L = -\frac{1}{N}\sum_{n=1}^{N}\sum_{c=1}^{C} y_{n,c}\,\log \hat{y}_{n,c}$

where $N$ is the number of training samples, $C$ is the number of classes, $y_{n,c}$ indicates whether sample $n$ belongs to class $c$, and $\hat{y}_{n,c}$ is the predicted probability.
Temporal feature learning
The objective in temporal feature learning is to encapsulate temporal features in an efficient manner that permits generalization to series of arbitrary length. This may be achieved by sharing parameters through time, rather than re-learning them at every step. The recurrent neural network (RNN) is the prominent architecture for training on temporal data [79]. RNNs introduce recurrent connections in time, enabling parameter sharing in a deeper way. Ferrag et al. [25] describe an RNN as a neural network whose connection graph contains at least one cycle, extending the feedforward architecture by permitting recurrent connections within layers. The previous model state is treated as a supplementary input at each time step, enabling the RNN to form a memory of previous inputs in its hidden state [53].
Several types of RNN have been proposed in the literature. Long short-term memory (LSTM), a variant of RNN, has been adopted by many cybersecurity researchers since it uses memory cells in place of the conventional hidden units [34]. Moreover, the LSTM has long-term memory owing to its slowly changing weights over time, in addition to its capacity for short-term memory over short ranges. This structure allows the LSTM to recall long-range features better than conventional RNNs. The basic unit of the LSTM hidden layer is the memory module. This module comprises one memory cell and three adaptive multiplicative gating units, namely the input gate, output gate and forget gate [35]. The principal information of the LSTM is conveyed along the cell-state line, where the LSTM forgets old information and learns new information via the three gate units [79]. Figure 2 presents the basic architecture of a long short-term memory network.

A basic structure of a conventional LSTM network (source: [31]).
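A single LSTM step with its three gates and memory cell can be sketched in plain NumPy. This is an illustrative implementation of the standard LSTM equations with randomly initialized weights, not the trained network used in this paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step; W maps [h_prev, x_t] to stacked gate pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    n = h_prev.size
    f = sigmoid(z[0 * n:1 * n])       # forget gate: what to discard from the cell
    i = sigmoid(z[1 * n:2 * n])       # input gate: what new information to store
    o = sigmoid(z[2 * n:3 * n])       # output gate: what part of the cell to emit
    g = np.tanh(z[3 * n:4 * n])       # candidate cell state
    c_t = f * c_prev + i * g          # updated memory cell
    h_t = o * np.tanh(c_t)            # hidden state passed to the next step
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.standard_normal((4 * n_hid, n_hid + n_in))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.standard_normal(n_in), h, c, W, b)
```

Because the cell state is carried forward additively, gradients can flow across many time steps, which is what gives the LSTM its long-range memory.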
In our work, a bi-directional LSTM (Bi-LSTM) was used. The Bi-LSTM processes a sequence of data in both forward and backward directions by means of two distinct hidden layers and then joins them at the same output layer, so the two sub-networks work alongside each other and learn from both the past and the future of the sequence [7]. The output layer yields an output vector that combines the forward and backward hidden states at each time step.

A Bi-directional LSTM architecture (source: [54]).
The Bi-LSTM is superior in capturing the relations among elements in an entire sequence by exploiting information in both directions, rather than recalling features in only one direction as a conventional LSTM does. During training, the Bi-LSTM adopts the backpropagation-through-time algorithm. The computation for every neuron node in the LSTM proceeds as follows. At time $t$, the input gate decides which part of the current input to store in the cell state:

$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$

where $x_t$ is the current input, $h_{t-1}$ is the hidden-layer output of the previous moment, $W_i$ and $b_i$ are the corresponding weight matrix and bias, and $\sigma$ is the sigmoid function.

The forget gate, based on the previous moment of the hidden-layer output $h_{t-1}$ and the current input $x_t$, decides which information to discard from the cell state:

$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$

The output gate determines what will be output by carrying out a sigmoid function that selects which part of the cell state the LSTM is going to output. The result is then passed through a Tanh layer (values between −1 and 1) so that only the selected part is passed on to the next neuron, as shown in the equations:

$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o), \quad h_t = o_t \odot \tanh(C_t)$

In the meantime, the memory cell state value $C_t$ is updated by combining the retained old state with the new candidate state:

$C_t = f_t \odot C_{t-1} + i_t \odot \tanh(W_C [h_{t-1}, x_t] + b_C)$
Since the Bi-LSTM comprises two LSTM networks, a forward LSTM and a backward LSTM, the following feature extraction steps are performed:
1. The forward and backward propagation is achieved using the internal self-recurrent state update method, and the output value of each single module is computed.
2. The state of the Bi-LSTM at time $t$ is obtained by concatenating the forward hidden state $\overrightarrow{h_t}$ and the backward hidden state $\overleftarrow{h_t}$, i.e. $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$.
3. From the two directions of time and the network layers, each LSTM's error term is computed in reverse.
4. The gradient of each weight is calculated from the error term.
5. A gradient-based optimization algorithm is applied to update the weights.
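The bidirectional scheme can be illustrated compactly: run a recurrent pass forward, run another over the reversed sequence, realign it in time, and concatenate the states per step. A simple tanh recurrence stands in for the LSTM unit purely to keep the sketch short, and the weights are random placeholders:

```python
import numpy as np

def rnn_pass(xs, W, U, b):
    """Recurrent pass over a sequence (a tanh cell stands in for the LSTM unit)."""
    h = np.zeros(U.shape[0])
    hs = []
    for x in xs:
        h = np.tanh(W @ x + U @ h + b)
        hs.append(h)
    return np.stack(hs)

rng = np.random.default_rng(1)
T, n_in, n_hid = 5, 3, 4
xs = rng.standard_normal((T, n_in))
W = rng.standard_normal((n_hid, n_in))
U = rng.standard_normal((n_hid, n_hid))
b = np.zeros(n_hid)

h_fwd = rnn_pass(xs, W, U, b)                  # past -> future
h_bwd = rnn_pass(xs[::-1], W, U, b)[::-1]      # future -> past, realigned in time
h_bi = np.concatenate([h_fwd, h_bwd], axis=1)  # per-step [forward; backward] state
```

At every time step the concatenated state carries context from both the preceding and the following parts of the sequence, which is exactly what the Bi-LSTM exploits.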
The proposed model
In this section, the system model is formalized by combining a CNN with a Bi-LSTM to extract the discriminative features, and a deep hierarchical network model is constructed. The proposed model is shown in Fig. 4. It performs pre-processing, feature extraction through network training and testing, and the final classification.

The proposed network intrusion detection model (source: authors, 2022).
Network traffic data may be described as multi-variate time-series data that has spatial similarities. The spatial-temporal data can be represented as a matrix whose rows correspond to time steps and whose columns correspond to traffic features.
In the proposed model, both the spatial and temporal features of the network flow data are extracted by constructing a hierarchical network model that combines the CNN and Bi-LSTM algorithms. Since the CNN and Bi-LSTM networks differ in input format, the extracted spatial features are first adjusted at the CNN output to conform to the input format of the Bi-LSTM network. The CNN's fully connected layer outputs a 1×128 feature vector. This vector is reshaped to an input size of 64 with the time step set to 2 before being forwarded to the input layer of the Bi-LSTM network. During flattening, the model input vector is automatically adjusted in the fully connected layer, so the Bi-LSTM input size is consistent with the CNN output size, and the two Bi-LSTM layers then perform temporal feature extraction.
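The interface between the two sub-networks amounts to a simple reshape of the CNN's 1×128 fully connected output into a (time steps = 2, features = 64) sequence for the Bi-LSTM. A sketch with dummy data:

```python
import numpy as np

cnn_out = np.arange(128, dtype=float)   # stand-in for the 1x128 FC output
bilstm_in = cnn_out.reshape(2, 64)      # 2 time steps of 64 features each
```

No information is lost: the 128 spatial features are merely regrouped so that the Bi-LSTM can treat them as a short two-step sequence.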
The Sigmoid function was used as the activation function in each layer to perform non-linear operations. The result of each Bi-LSTM recursive operation is obtained as a fusion of all previous features and the current features. In the model architecture, one fully connected layer is linked to the output layer of the Bi-LSTM. The previously extracted features are integrated, and the output value of the last fully connected layer is passed to the softmax function. After the learning process, the model performs multi-class classification to categorise the output. The following is a description of each component and its role in the proposed NIDS model architecture:
The paper considers the detection of malicious attacks based on a network flow, which has a distinct hierarchy, as presented in Fig. 5. According to a specific network protocol format, a number of flow bytes are combined to form a network packet, and then multiple network packets are combined to form a network flow. The network flow is subsequently classified as normal or malicious, with the discriminative features learnt using a deep learning algorithm.

Structure of network flow (source: [21]).
The raw network traffic data, captured in libpcap file format (.pcap), are transformed into numeric vectors. Normally, the input flow data contains a multiplicity of features, some of them non-numeric. As such, they need to be encoded as numeric types before being fed into the neural network. One-hot encoding, a strategy that creates dummy variables by mapping each unique categorical value to a binary vector whose length equals the number of distinct categories, was applied to such features.
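One-hot encoding of the non-numeric flow attributes can be done, for example, with pandas; the column names below are illustrative stand-ins for the dataset's categorical features:

```python
import pandas as pd

flows = pd.DataFrame({
    "duration": [0.2, 1.5, 0.7],             # already numeric
    "protocol_type": ["tcp", "udp", "tcp"],  # categorical
    "service": ["http", "dns", "ftp"],       # categorical
})
# Each unique category becomes its own binary indicator column.
encoded = pd.get_dummies(flows, columns=["protocol_type", "service"])
```

The three rows now carry one numeric column plus five indicator columns (two for protocol_type, three for service), ready to feed into the neural network.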
Feature selection
Feature selection was done to reduce the model's susceptibility to overfitting, to speed up training, and to ensure that memory, storage, and processing requirements are reduced as much as possible. To address the problem of feature redundancy, this paper adopted the feature selection algorithm proposed by [15]. The algorithm, presented in Algorithm 1, is based on the Random Forest approach: it starts by calculating the significance of the sample features and ranks them in order of importance; it then analyses the correlation between features using Pearson's index, and lastly combines the two results to select the features. The feature selection process is demonstrated by the following pseudo-code:
Feature selection based on random forest
The Random Forest algorithm uses the decision tree as its base learner. The algorithm is capable of identifying significant features from huge numbers of sample features. According to [29], the crux of the algorithm is to analyse and compute the contribution of each sample feature in each tree, calculate its average value, and then compare features to identify the significant ones. The out-of-bag data error rate method for tuning-parameter selection and error estimation was used as the evaluation metric, owing to its ability to yield less biased estimates of the true prediction error [32].
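The correlation-pruning half of the procedure can be sketched as follows. The importance scores are placeholders for the Random Forest output, and the 0.9 threshold is illustrative, not the value used in [15]:

```python
import numpy as np

def select_features(X, importances, corr_threshold=0.9):
    """Walk features in descending importance; drop any feature whose absolute
    Pearson correlation with an already-kept feature meets the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in np.argsort(importances)[::-1]:
        if all(corr[j, k] < corr_threshold for k in kept):
            kept.append(j)
    return sorted(kept)

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 4))
X[:, 3] = X[:, 0] + 0.01 * rng.standard_normal(200)  # feature 3 duplicates feature 0
importances = np.array([0.4, 0.2, 0.3, 0.1])         # stand-in RF importances
selected = select_features(X, importances)           # the redundant feature is dropped
```

The near-duplicate feature is discarded in favour of its more important twin, while the independent features survive.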
The feature dataset customarily encompasses values with diverse scale ranges, which during learning tend to dominate the loss function. Data normalization is an important pre-processing step aimed at improving the efficiency and accuracy of the model and preventing a feature with a particularly large value range from dominating the eventual distance calculation [20]. For features with a very large difference between their minimum and maximum values, a logarithmic scaling method is applied to map them to a bounded range. The minimal-maximal (Min-Max) scaling approach [13], which linearly scales each feature to the interval [0, 1], was used. Min-Max scaling encloses all features in a common boundary without losing information. This technique helps gradient-descent-based algorithms converge more quickly towards the minima and ensures that the gradient-descent steps are updated at the same rate for all features while optimizing the loss function along a smoother path toward the global minimum [9], as shown in Eq. (9):

$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$ (9)

where $x$ is the original feature value, $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of that feature, and $x'$ is the scaled value.
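The Min-Max scaling of Eq. (9) translates directly into a per-column operation; a minimal NumPy sketch with toy values:

```python
import numpy as np

def min_max_scale(X):
    """Linearly scale each feature (column) to the interval [0, 1]."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

X = np.array([[10., 200.],
              [20., 400.],
              [30., 300.]])
X_scaled = min_max_scale(X)   # each column now spans [0, 1]
```

Note the division assumes x_max > x_min for every feature; constant columns would need to be dropped or handled separately.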
Feature extraction and dimensionality reduction
In the feature extraction stage, a dual input of flow data is used, with the goal of extracting the features of the netflow more comprehensively. The key objective of feature extraction is to obtain a set of representative vectors of the dataset while reducing its high dimensionality. The performance of an NIDS is considerably enhanced when the features are more discriminative and representative [2]. Empirical studies in network intrusion detection [1,69,70] have shown that the efficacy of machine learning algorithms can degrade when fitted on data containing redundant and irrelevant features. Consequently, it is necessary to reduce the number of input features. Among the most effective techniques proposed in the literature are Principal Component Analysis (PCA), Laplacian Eigenmaps, Independent Component Analysis, Kernel Principal Component Analysis (KPCA) [66], Fisher Linear Discriminant Analysis [68], Auto-encoders and Locally Linear Embedding [2,39].
In our work, PCA was adopted as the most suitable approach for solving intrusion detection problems owing to its simplicity and its ability to maintain most of the variability of the data [59]. PCA assumes linearity in the set of reduced data. Further, the PCA technique can deal with large datasets with fast run time while determining the ideal number of Principal Components desired for intrusion detection [69]. The Principal Components can be represented by Eq. (10):

$PC_i = w_i^{T}x = \sum_{j=1}^{p} w_{ij}x_j$ (10)

where $w_i$ is the eigenvector associated with the $i$-th largest eigenvalue of the covariance matrix, $x$ is the standardized feature vector, and $p$ is the number of original features.
Further, PCA reduces the computational costs and the error of parameter estimation by decreasing the number of dimensions of the feature space, thereby extracting a subspace that offers the best description of the data [44]. To be precise, after standardization, PCA extracts the eigenvectors and eigenvalues from the covariance matrix (CM) given by Eq. (11):

$CM = \frac{1}{n-1}\sum_{k=1}^{n}(x_k - \bar{x})(x_k - \bar{x})^{T}$ (11)

where $n$ is the number of samples, $x_k$ is the $k$-th standardized sample vector, and $\bar{x}$ is the mean vector.
The eigenvalues are arranged in descending order, and the eigenvectors associated with the largest eigenvalues define the principal components retained for the reduced feature space.
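The eigendecomposition route to PCA can be sketched in NumPy; this is a generic textbook implementation on synthetic data, not the study's pipeline:

```python
import numpy as np

def pca(X, n_components):
    """PCA via eigendecomposition of the covariance matrix of standardized data."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize each feature
    cov = np.cov(Xs, rowvar=False)                 # covariance matrix of the data
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]              # eigenvalues in descending order
    components = eigvecs[:, order[:n_components]]  # top principal directions
    return Xs @ components, eigvals[order]

rng = np.random.default_rng(3)
base = rng.standard_normal((300, 2))
X = np.column_stack([base[:, 0], base[:, 1],
                     base[:, 0] + base[:, 1]])     # 3rd feature is redundant
Z, eigvals = pca(X, n_components=2)
```

Because the third feature is a linear combination of the first two, the smallest eigenvalue is (numerically) zero and two components capture all the variance.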
Segmentation and conversion to matrix
According to Sun et al. [64], different methods are used to segment the original traffic data into fragments of different forms, and the chosen method evidently has an impact on the ensuing analysis. Network traffic can be sliced in six ways: by connection, session, network flow, TCP, service class, and host. The study adopted a session-sharding approach, in which a session comprises the packets of a bi-directional flow. To be precise, such packets share the same five-tuple structure (source IP, destination IP, source port, destination port, and transport-layer protocol), with the source and destination addresses and ports interchangeable.
The CNN input format must be three-dimensional (height, width, and channel). Each input is dimensionally transformed to match the format of a grayscale image; hence, for a single sample, the channel dimension is 1 when reshaping a single flow sample with a length of
Each network record of the data was dimensionally transformed into the grayscale-image format, i.e. with a channel depth of D = 1 (rather than the three RGB channels). For input into the CNN, the network data is reshaped into a matrix. The dataset is then shuffled to randomize the data and split into separate training and validation sets using Python and the Pandas library. The training set contains 65% of the flows, while the remaining 35% forms the validation set.
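The reshape-shuffle-split pipeline described above can be sketched as follows. The function name, the fixed seed, and the example 8 × 8 image size are illustrative assumptions; the paper does not specify them.

```python
import numpy as np

def prepare_inputs(flows, height, width, train_frac=0.65, seed=42):
    """flows: (n_samples, height*width) array of preprocessed feature vectors.
    Returns shuffled training and validation sets in grayscale-image format."""
    n = flows.shape[0]
    # Reshape each record into a grayscale "image": (height, width, 1 channel).
    images = flows.reshape(n, height, width, 1)
    # Shuffle, then split 65% / 35% into training and validation sets.
    idx = np.random.default_rng(seed).permutation(n)
    cut = int(train_frac * n)
    return images[idx[:cut]], images[idx[cut:]]
```

The study used Pandas for the same preprocessing; the NumPy version above only shows the shapes involved.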
Model training
The study introduces a deep learning architecture based on a two-layer two-dimensional CNN and a Bi-LSTM network for the detection of malicious attacks. Since each network flow is presented in a 1D format, we used NumPy's reshape() function to transform the one-dimensional array into a two-dimensional array with one column [47]. The two-dimensional array data was then transformed into two-dimensional feature maps for further processing. The grayscale feature maps are fed into convolutional modules (3 × 3 kernels with 1 × 1 strides) to extract features. The 2D convolution has shown excellent performance in a recent study [15]. The training process is presented in Algorithm 2.
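The two-stage architecture (2D convolutions for spatial features, a Bi-LSTM for temporal features, dropout, and a softmax output) can be sketched in Keras. The layer widths, the 8 × 8 input size, and the five output classes are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_hybrid_model(height=8, width=8, n_classes=5):
    """Sketch of the hybrid CNN + Bi-LSTM classifier (assumed layer sizes)."""
    inputs = keras.Input(shape=(height, width, 1))  # grayscale feature map
    # Two 2D convolutional layers with 3x3 kernels and 1x1 strides.
    x = layers.Conv2D(32, (3, 3), strides=(1, 1), padding="same",
                      activation="relu")(inputs)
    x = layers.Conv2D(64, (3, 3), strides=(1, 1), padding="same",
                      activation="relu")(x)
    # A single pooling layer after the last convolution.
    x = layers.MaxPooling2D((2, 2))(x)
    # Treat each row of the feature map as one time step for the Bi-LSTM.
    x = layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)
    x = layers.Bidirectional(layers.LSTM(64))(x)  # temporal features
    x = layers.Dropout(0.25)(x)                   # dropout rate from the paper
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

The softmax head outputs one probability per class, so the same skeleton serves both the binary and the multi-class variants by changing `n_classes`.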
Hybrid model: CNN, Bi-LSTM training process
The input layer receives input data. Normally, the size of the input layer is similar to the input data, for instance a vector
During model training, the ReLU activation function
A fully connected layer situated before the output layer is used for detection and classification. This layer targets high-level features consistent with the specific tasks of the output layer and performs the mapping. After the mapping, the layer applies the Softmax activation function to obtain the final classification result (either normal or an attack). The Softmax classification function [3] normalizes real numbers into a probability distribution: after applying it, every component lies in the interval (0, 1) and the components sum to 1, so the non-normalized outputs of the model are mapped to a probability distribution over the predicted output classes. Hence, the predicted class would be
Where
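The softmax normalization and the resulting class prediction can be illustrated in a few lines of NumPy; the example logits are arbitrary.

```python
import numpy as np

def softmax(z):
    """Normalize raw scores into a probability distribution (stable form)."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])      # non-normalized model outputs
probs = softmax(logits)                  # each in (0, 1), summing to 1
predicted_class = int(np.argmax(probs))  # class with the highest probability
```

Here the largest logit wins, so `predicted_class` is 0; each probability lies strictly between 0 and 1.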
A categorical cross-entropy loss function was used, given that the model output is a multi-class probability distribution with a value between 0 and 1 for each class. The cross-entropy loss can be computed as shown in Eq. (15):
Where
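The categorical cross-entropy of Eq. (15) can be sketched for one-hot labels and predicted probabilities; the clipping constant is a standard numerical safeguard, not something specified in the paper.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy between one-hot labels and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    # Average over samples of -sum_k y_true[k] * log(y_pred[k]).
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
```

A perfect prediction gives a loss near zero, and the loss grows as the predicted probability of the true class shrinks, which is what drives the weight updates below.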
The Adam optimizer was used to learn the network weight parameters. Empirical results have demonstrated that Adam has practical advantages over other optimizers [42].
Methodology
This section describes how the proposed system architecture for netflow detection and classification of attacks in large and high-speed networks is actualized. Figure 6 shows the flow chart of the proposed model.

The feature learning process.
A hybrid deep learning architecture is proposed that presents a hierarchical network combining a CNN and a bi-directional LSTM to perform both binary and multiclass classification. The CNN is used to learn spatial dependencies, whereas the Bi-LSTM computes temporal dependencies.
Deep learning approaches require voluminous data for training. In this work, two well-known benchmark datasets, CIC-IDS 2017 and NSL-KDD, were used.
The CIC-IDS 2017 benchmark dataset developed by the Canadian Institute for Cybersecurity [14] was used. The dataset consists of modern-day benign activities and malignant attacks that describe concurrent network traffic. It contains normal background traffic collected by means of a B-profile system, which constructs the abstract behaviour of 25 network users over the FTP, SSH, email, HTTP, and HTTPS protocols [72], accurately mimicking an actual network environment. The network traffic was collected over a period of 5 days: normal activity traffic was collected on the first day, and malicious attacks were introduced on the other days. The attack traffic comprises eight attack types: SSH-Patator, FTP-Patator, Heartbleed, DoS, Botnet, Infiltration, Web Attack, and DDoS. The dataset comprises about 80 network flow features generated using a PCAP analyzer and organized in a CSV file. Table 2 presents the distribution of packet samples for the normal and attack classes of CIC-IDS 2017.
Distribution of normal and attack behaviours in the CIC-IDS2017 dataset
The NSL-KDD dataset is a benchmark for modern-day internet traffic and includes all types of attacks. In the literature, many researchers have utilized the NSL-KDD dataset to develop and evaluate NIDSs. It consists of 41 features, plus one class attribute, organized into three main categories (basic features, traffic-based features, and content-based features), with each record labelled as either normal or attack along with the precise attack category [9]. The basic features are derived directly from a TCP/IP connection; the traffic features are accumulated over a window, either of time (e.g. two seconds) or of a number of connections; and the content features are extracted from the application-layer data of connections. Of the 41 features, three are nominal, four are binary, and the remaining 34 are continuous. The dataset is categorized into four clusters depicting the four common attack types [19]: denial of service, user to root, remote to local, and probing attacks. Table 3 presents the number of samples for the normal and attack classes.
Summary of normal and attack records in NSL–KDD dataset
A simulation experiment environment [71] was built on a 64-bit computer running the Microsoft Windows 10 operating system with an Intel® Core i7-9750H 2.60 GHz processor, 8 GB memory, an NVIDIA GeForce RTX 2060 6 GB GDDR6 GPU, and CUDA 10.2. The experimental programs were written in Python, with data processing performed using Scikit-learn, NumPy, and Pandas. Keras, an effective high-level deep learning library built on top of TensorFlow, was used to build the neural network model.
Parameter selection and optimization
A number of experiments were conducted in which some parameters were varied. In general, parameters may be adjusted with respect to the loss function, optimization algorithm, dropout rate, activation function, batch size, etc. The spatial and temporal features extracted by individual neurons may possibly be used to independently predict different variables [67]. The focus of this work is a multi-class classification problem; accordingly, the multi-class logarithmic loss function, categorical_crossentropy, was used. The number of iterations during the experiments was set to 100, with learning rates ranging from 0.01 to 0.25. In addition, the batch_size in the experiments was set to 120. The batch size is associated with the frequency of weight updates; if it is set too small, model convergence may be slow. The dropout (weight-inactivation) rate in the regularization method was set to 0.25. The Adam optimizer was used. The ReLU function was employed as the hidden-layer activation function, while softmax was used as the output-layer activation function.
Performance metrics
Four evaluation indicators, namely accuracy (AC), precision (P), recall (R), and F-measure (F1-score), were utilized for the performance analysis of our experiments. These metrics are derived from four classification outcomes: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Table 4 shows the detailed equations for the standard performance evaluation metrics based on a confusion matrix.
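The four metrics follow directly from the TP/TN/FP/FN counts; a minimal sketch of the standard formulas:

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard metrics computed from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # fraction of correct predictions
    precision = tp / (tp + fp)                   # correctness of positive predictions
    recall = tp / (tp + fn)                      # coverage of actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R
    return accuracy, precision, recall, f1
```

For multi-class problems these counts are taken per class and then averaged; `sklearn.metrics` provides the same computations (`precision_score`, `recall_score`, `f1_score`) with micro/macro averaging options.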
Model performance evaluation metrics
This section presents the performance of the proposed hybrid CNN and Bi-LSTM network model evaluated with two publicly available intrusion detection datasets extensively adopted in previous research works: NSL-KDD and CIC-IDS 2017 datasets. In addition to the experiments performed, we conducted ablation experiments to determine the effectiveness of the model hyperparameters that we adopted for the predictive effect of the proposed model. Further, to objectively evaluate the performance of our proposed model, a comparison of our network with some related work was done.
Ablation experiments
Ablation experiments give insights into the comparative contribution of different architectural and regularization components to a machine learning model's performance [61]. In this study, ablation experiments were conducted in which the two algorithms adopted for building the predictive model were trained and tested separately, to understand which of the two (CNN or Bi-LSTM) contributed most to model performance. Lastly, we compared the results with the proposed hybrid CNN+Bi-LSTM model. Figure 7 presents the loss and accuracy results of the ablation experiment conducted on the CNN algorithm for 28 iterations, as plotted using the CIC-IDS 2017 dataset.

Ablation experiment results for the CNN block.
The previous studies reviewed demonstrated that CNN and LSTM models successfully capture features, the correlations between features (through convolution units), and the time dependencies (through memory units), which are then used for intrusion detection. The CNN learns spatial features from the network data, with the algorithm aimed at learning a weight map representing the relative importance of activations for the spatial features. The loss function measures how far the classification prediction deviates from the ground truth. The training loss of the CNN model was lower than that of the LSTM model, showing that it trained more efficiently. Table 5 presents the results of the performance evaluation of the CNN and Bi-LSTM models in the ablation study using 28 epochs.
Performance evaluation in the ablation study
The results obtained from the ablation experiments indicate that the CNN model provides better results: it outperformed the Bi-LSTM model on all the performance evaluation metrics. The CNN is a discriminative classifier that learns highly discriminative spatial features in the hidden layers of the network and can better support inference during hidden-layer weight updates. The LSTM integrates a memory component to model long-term dependencies in time-series tasks and requires more parameters than the CNN, hence it is slower to train. The LSTM model also started to overfit at epoch 22, so early stopping was applied. However, its advantage derives from its capacity to process long input sequences without increasing the network size. The temporal parameters processed by the LSTM help the model identify hidden patterns in feature sequences.
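The early stopping used when the Bi-LSTM began to overfit can be expressed as a Keras callback; the patience value of 5 here is an assumed setting, as the paper does not report it.

```python
from tensorflow import keras

# Stop training when validation loss stops improving, keeping the best weights.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the validation loss
    patience=5,                  # tolerate 5 stagnant epochs (assumed value)
    restore_best_weights=True,   # roll back to the best epoch on stop
)
# Passed to model.fit(..., callbacks=[early_stop]) during training.
```

With this callback, a run that starts overfitting (validation loss rising while training loss falls) halts automatically instead of completing all scheduled epochs.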
These results were then compared with the hierarchical hybrid model, in which the features extracted and learnt by the CNN are passed to the LSTM layer as inputs to learn the temporal dependencies. The study tried max-pooling layers after different convolutional layers in the CNN and the proposed hybrid CNN+Bi-LSTM models, and established that a pooling layer following only the last convolutional layer improved the performance of both models. The developed hybrid deep learning model achieved performance improvements over the individual learning models. The results demonstrate that the CNN and LSTM models are complementary, and that combining both further improves classification results.
The data samples of the NSL-KDD dataset were divided into two parts: the training set used to build the model, and the testing set used for evaluation. Figure 8(a) presents the relationship between the loss, accuracy, and number of epochs for the proposed hybrid CNN-Bi-LSTM model. It can be observed that, as the number of epochs increased, the training loss gradually decreased and became stable at the 48th epoch. These results show that the model's structural design and hyperparameter settings were reasonable, consequently exhibiting good convergence ability.

Model loss and accuracy plotted against epochs.
After careful fine-tuning, experimental results show that at the 90th epoch, the proposed hybrid model reached 96.22% accuracy, 95.87% precision, 95.21% recall, and an F1-measure of 95.54%. Table 6 presents the confusion matrix for the testing data.
Confusion matrix for multi-class classification on test data
The results indicate that, of the 9,711 normal network flows in the evaluation dataset, 9,342 were detected as normal; 117 flows were predicted as DoS attacks, 109 as Remote to Local attacks, and 129 as Probes. Similarly, 7,181 of the 7,460 actual DoS attacks were predicted correctly; 87 flows were predicted as normal, 105 as Remote to Local attacks, 79 as Probe attacks, and 8 as User to Root attacks. The results demonstrate that the accuracy of the model in predicting the various classes is very high.
These results were compared with other hybrid models tested on the NSL-KDD dataset, as summarized in Table 7. According to our experiments, the proposed model has a comparative advantage over previous works. Based on the accuracy metric, the proposed deep hybrid detection model demonstrates good detection efficiency for abnormal traffic when tested on the NSL-KDD dataset.
Comparative results of hybrid CNN - Bi-LSTM models tested on NSL-KDD dataset
The accuracy of the proposed model was observed to be lower than that of Ahsan & Nygard [4] and Yao et al. [79]. Several factors may explain this. First, the study in [4] applied no hyperparameter tuning during training, yielding higher reported performance; in our study, hyperparameter tuning was performed to obtain optimal hyperparameter values. Hyperparameters were optimized through manual trial-and-error searches, since grid-search experiments suffer from poor coverage in high-dimensional spaces [11]. In addition, training a deep neural network to generalize well to unseen inputs is challenging. Further, model regularization reduces validation loss at the expense of increased training loss. Regularization techniques such as noise layers or early stopping are typically applied only during training, not during validation, subsequently resulting in smoother and usually better functions at validation time [50]. In our model, early stopping was applied through model callbacks, which may have affected the optimization outcome.
To further validate the proposed model, we also performed experiments on the CIC-IDS 2017 dataset. The model achieved 99.27% accuracy, 99.89% precision, 96.54% recall, and an F1-measure of 98.19%. Figure 9 presents the model accuracy results after 100 epochs.

Accuracy results for the model trained on CIC-IDS2017 dataset.
These experimental results are compared with the results obtained by other works in Table 8.
Comparative results of hybrid CNN+Bi-LSTM model tested on CIC-IDS 2017 dataset
As observed from the experimental results, the proposed hybrid detection model performs better than other state-of-the-art models in the literature tested on the CIC-IDS 2017 dataset. Our model exhibited better accuracy, precision, recall, and F1-measure results than those obtained by [64] and [40]. Compared with the experimental results of [81], our model is better on the precision metric, meaning that the instances the model classified as attacks were overwhelmingly classified correctly. The high precision score is a particularly valuable measure of success because the classes in the dataset are highly imbalanced.
It was further observed that the proposed hybrid model achieved the best performance with the highest number of hyperparameters. Although the LSTM has fewer hyperparameters than the CNN, the final model showed better validation performance without a large increase in computation time. The hybrid model outperformed the individual CNN and LSTM models, achieving the lowest training loss, which was also lower than its validation loss. Considering these results, our model is well suited to network intrusion detection, particularly in large, high-speed networks.
In this paper, we explored an effective and novel spatial-temporal feature-extraction model for network intrusion detection systems. A hybrid deep learning approach was proposed, comprising a two-dimensional CNN and a Bi-LSTM, for building a network intrusion detection system. The CNN component extracts spatial features, while the Bi-LSTM component extracts temporal features from the dataset. The two types of features are fused via a feature-fusion component, and the model is then trained to detect and classify attacks. The proposed model was implemented using the Keras library and the TensorFlow deep learning framework. Principal Component Analysis was applied as the dimensionality-reduction algorithm. The proposed model performed multi-class classification on two benchmark intrusion detection datasets. The accuracy and precision values were greater than those of previous approaches, which demonstrates the efficacy of the proposed network intrusion detection model.
The results further demonstrate that the proposed model exhibits great potential for solving the network intrusion detection problem, since it scales better with problem complexity and requires less memory for its parameters than existing hybrid deep learning networks.
Our future research will focus on the following two areas: first, we will focus on an efficient fusion method of the spatial and temporal features to further reduce the computation costs; and secondly, we shall study the semi-supervised approach for network intrusion detection and perform both binary and multi-class classification.
