Abstract
Introduction
Malware (i.e. malicious software) remain so far the major threat against the Internet. A malware is a
Early malware were easily detectable by standard signature-based detection techniques, which were widely used by antivirus tools. A signature is
The malware detection process requires a code analysis phase, which extracts different attributes to identify whether the analyzed file is malicious or not. There are two types of analyses: static and dynamic. 10 Dynamic analysis requires executing the program in a controlled environment, usually built using an emulator, that is, virtual environment. 11 Static analysis, however, does not require the execution of the program, and only disassembles and inspects its static code. Both techniques have their strengths and weaknesses. For instance, static analysis is very fast and provides a real-time response. However, in contrast to dynamic analysis, it is not resilient against code obfuscation techniques such as packing. Thus, in this work, we adopt the dynamic analysis approach, as it is effective in defending against code obfuscation and can accurately describe the programs’ behaviors. Behavior-based detection techniques, which are mainly based on dynamic analysis, allow to describe exactly what a program does once it is executed, and then, its behavior profile is generated and compared to either legitimate or malicious profiles. However, the behavioral techniques suffer from high false positives (i.e. legitimate programs that are incorrectly classified as malicious), which can be explained by the fact that there is no strict boundaries between legitimate and malicious behaviors.
In this work, we address the above-mentioned issue by proposing a solution that takes advantage of both signature-based and behavior-based malware detection techniques. To this end, we propose an approach that employs dynamic analysis, behavior monitoring, and rule-based decision-making process. In this approach, we use three different types of dynamic attributes, namely, (1) the list of application programming interfaces (API) calls, (2) API sequences, and (3) the network traffic features (i.e. IP addresses and domain names), which are combined with the aim to achieve the highest possible accuracy. After the features are extracted, we perform efficient feature selection and construction, which is based on two methods: (1) TF-IDF (term frequency–inverse document frequency) feature selection, which is applied on network traffic and API calls, and (2) the LCS (longest common subsequence) feature construction algorithm, which is applied on API sequences. To decide on the maliciousness of the analyzed file, we use a simple yet efficient rule-based decision-making process, which is based on the outcomes of the two components: TF-IDF and LCS. The rule-based decision mainly relies on YARA (Yet Another Recursive Acronym) (https://yara.readthedocs.io/en/v3.7.0) and allows both detection (i.e. malware or benign) and categorization of malware (i.e. virus, Trojan, worm, etc.). To the best of our knowledge, it is the first work that uses such combination of features, feature selection techniques, and decision mechanism for malware detection based on dynamic analysis.
This article is organized as follows: section “Related work” presents the related work in dynamic malware detection. In section “Proposed malware detection and classification system,” we describe our proposed approach for malware detection. We present in section “Evaluation” the experimental results, which are discussed in section “Discussion.” Finally, section “Conclusion” concludes our article and presents our future work.
Related work
In the literature, there are two main malware analysis techniques, namely, static and dynamic. Static analysis has been largely investigated, and used several types of features, such as API calls,1,12 portable executable (PE) structural information,1,13,14 as well as Opcode sequences.15,16 However, dynamic analysis has also been widely investigated by many researchers for the purpose of constructing malware detection systems.17–26,27 Most of the dynamic analysis techniques rely on the usage of API calls as the main features for detecting malware,18,20,21,23,25 in addition to the machine activity.24,26–28 Here, we present existing dynamic analysis solutions that fully or partly use the same features as our proposed system.
Firdausi et al.
20
introduced a behavior-based malware detection approach using machine learning techniques. It consists of extracting APIs and system calls, then creating a vector model for both benign and malicious behaviors. In the last step, the vectors are classified as benign or malicious using different machine learning algorithms, namely, the
Alazab et al.
18
presented a behavior-based malware detection approach using API calls as behavioral features, which are automatically extracted and then represented as
Nair et al. 21 presented a metamorphic malware detection system based on dynamic analysis and API sequences. The authors used hidden Markov model (HMM) to represent statistical properties of a set of metamorphic virus variants. The metamorphic virus data set was generated from metamorphic engines. The HMM was trained on a family of metamorphic viruses and used to determine whether a given program is similar to the viruses that are defined by HMM. They obtained an overall accuracy of 76.66%.
Mosli et al. 23 proposed an approach for malware detection based on dynamic analysis and machine learning. The authors extracted four different types of features, namely, memory artifacts, registry activity, imported libraries, and API calls. The proposed approach reached 96% accuracy using SVM classifier.
Xiaofeng et al. 25 proposed a combined machine learning and deep learning–based model for malware detection using dynamic analysis. In this system, once the malware samples are analyzed in the Cuckoo Sandbox, two types of features are constructed, namely, API calls frequencies and API sequences. API calls are then used to construct the machine learning classification model, which is based on a random forest classifier. API sequences, however, are used to construct the deep learning–based model, which is a bidirectional long short-term memory (LSTM) model. The two models are then combined by the antivirus server. The proposed malware detection approach was able to achieve an accuracy of 96.7%
Differently from these works, our proposed malware detection system takes as input three different types of dynamic features, namely, list of APIs, API sequences, and network traffic. These features have been combined together in a way to get the highest possible accuracy. Moreover, we propose a lightweight yet efficient decision mechanism based on YARA rules.
Proposed malware detection and classification system
The proposed malware detection and classification system has two main modules: (1)

Proposed system’s architecture.
Feature extraction
The feature extraction phase analyzes both malicious and legitimate files using Cuckoo Sandbox, which automatically launches the virtual machine and runs the files for a specific period of time. Afterwards, activity reports in JSON format are generated by Cuckoo Sandbox. We extract from the activity reports the information that is necessary for our detection approach, namely, the list of APIs, the sequences of APIs, and the network traffic data (IP addresses and domain names).
List of APIs
This type of features represents the names of all APIs that are invoked by a given program and their corresponding number of occurrences during the execution process. The list of APIs and their corresponding frequencies have been widely used in static analysis–based malware detection1,12,29,30 and have proven their effectiveness. Therefore, we decide to consider this type of features, as it is similar to those used in Belaoued and Mazouzi,1,12 but in our case, we extract them dynamically. The list of APIs can be found in the

Example of the activity report of the list of APIs.
API sequences
The API sequences represent the exact behavior of the file. Indeed, the sequence represents the temporal-order invocation of APIs. If we consider the duplication example, some malware duplicate themselves by looking for files to infect and inserting themselves into these files. To do so, malware use specific APIs such as

Example of the activity report of API sequence.
Network traffic
It is represented by IP addresses and domain names. Some malware need to connect to the attack platform, which is often hosted by remote command-and-control (C&C) servers in order to receive commands to execute, download additional functionalities, or an entire malware. Network traffic (IP addresses and domain names) is obtained from the

Example of the network traffic activity report.
Feature selection and construction
After extracting the features, we apply two discriminating methods: LCS algorithm and the TF-IDF as feature construction and selection methods, respectively.
LCS
The LCS aims to find the longest common sequence in a list of sequences that may not necessarily be adjacent. To illustrate the principle of the LCS, let A and B be two strings, defined as follows:
By applying the LCS on the above two sequences, we obtain “
In our system, we use the LCS as a feature constructor, and it allows to obtain the longest API sequences in common between different types of malware (viruses, Trojans, etc.). We also consider the longest API sequences in common between benign files. Algorithm 1 presents the execution of the Longest Common API Subsequences. The algorithm goes through two sequences S1 and S2, and looks for their longest common sequences by comparing each word (API) of the first sequence with that of the second one.
We used LCS algorithm since it is intuitively designed to extract subsequences from a set of compared sequences, in addition to its rapidity and effectiveness, which is adequate for constructing relevant features in a short period of time.
TF-IDF
The TF-IDF (http://www.tfidf.com/) is a weighting method often used in information retrieval and text mining. It can be defined as
Typically, the weight of TF-IDF is composed of two terms: the first calculates the normalized terminal frequency (TF), which represents the number of times a word appears in a document, divided by the total number of words in that document, as shown in the following formula
The second term is the inverse document frequency (IDF), calculated as the logarithm of the number of documents in the corpus divided by the number of documents where the specific term appears
The resulting weight is the multiplication of the two measures.
We use the TF-IDF method in our system as a feature selector, and it allows us to
Extract the most common APIs that are invoked by each type of malware (ransomware, viruses, etc.) and by benign files.
Extract the most common IP addresses and domain names that are used by each type of malware and by benign files.
Our choice of using the TF-IDF method is motivated by the fact that it is characterized by its rapidity and effectiveness, which is very important in our case (i.e. malware detection).
Generation of behavioral rules (YARA)
The next phase that follows the feature selection and construction is building the viral signature database. Indeed, using YARA, we have the ability to create malware family descriptions based on textual or binary models. Each description is called a rule, which represents a set of character strings and/or Boolean expressions that determine its logic. More complex and powerful rules can be created using regular expressions, special operators, and many other features, making YARA a tool to help malware researchers identify and classify malware samples. In our approach, this tool is essential for building our database using the different features. Moreover, the rules are easy to write and understand. In Figures 5–7, we present real examples of some of the used YARA rules for API sequences, list of APIs, and network traffic, respectively.

A malicious YARA signature constructed using API sequences.

A malicious YARA signature constructed using lists of APIs.

A benign YARA signature constructed using network traffic.
The above YARA rules represent simple examples, which state that any file containing one of the strings $a1, $a2, $a3, $a4, and $a5 is reported as malware.
Decision
Once the required rule signatures are generated and stored in the signature database, the rules generation stage is done. In the decision stage, the extraction of information for the test files is done in the same way as in the rules generation stage, but with an additional step, namely, the matching step. In this step, the features (list of APIs, API sequences, and network activity), which are extracted from the test set, are compared to the previously constructed YARA rules representing malicious and benign profiles. In fact, our system uses a majority voting decision-making mechanism, in which a score is computed, and we can get three different outcomes:
The file that has more matches with malicious rules will be considered as malware.
The file that has more matches with benign rules will be considered as benign.
The file that has equal matches with malicious and benign rules will also be considered as malicious.
Regarding the third outcome, our choice is motivated by the fact that it is safer to consider an unknown file as malicious than as benign. Moreover, according to the experimental results, such a choice allows to considerably increase the detection rate compared to the second option, which is labeling unknown files as benign.
Evaluation
Data set
To evaluate the performance of our malware detection system, we use a data set of 604 Windows Portable Executable 32 (PE32) malware and benign programs. The 214 benign files were collected from a clean installation of Windows XP 32-bit operating system and other executable files representing utilities such as Skype, Adobe Reader, μTorrent, and so on. The malicious data set consists of 390 malware collected from various sources such as virusign.com and malshare.com, and is divided into several categories, as shown in Table 1. We use 70% of the data set, which is considered as a training set (see Figure 1), to construct viral database (behavioral signatures), and 30% for the test (Test set). Table 1 describes the used malware data set.
Description of the malware data set.
Evaluation metrics
To test the performance of our proposed system, we use several evaluation metrics, which are accuracy (ACC), true positive rate (TPR), and false positive rate (FPR): 1
Hardware and software configuration
The hardware and software configuration, which is used to implement our system, is depicted in Table 2.
Hardware and software configuration.
CPU: central processing unit; RAM: random access memory; GB: gigabyte; OS: operating system; LTS: long-term support.
Overview of the extracted features
In the following, we present the best features obtained according to their degrees of relevance after applying LCS and TF-IDF. In Tables 3 and 4, we present the best three API calls, and the most used IP address and domain names for each file category. The latter have obtained the highest TF-IDF score. In Table 5, we present some of the obtained longest common API subsequences for each malware category and benign files, which were constructed using the LCS algorithm.
Top three APIs according to TF-IDF score.
API: application programming interface; TF-IDF: term frequency–inverse document frequency.
The most used IP addresses and domain names according to TF-IDF score.
TF-IDF: term frequency–inverse document frequency.
Overview of the obtained longest common API sequences.
API: application programming interface.
Experimental results (accuracy)
Based on the results in Table 6, it can be seen that the results are medium or even poor when the three types of features are used separately, with an accuracy that varies between 66% and 88%. However, the combination of the features with each other considerably increases the accuracy, especially when the three features (i.e. API sequences + List of APIs + network traffic) are combined. Indeed, we obtain the highest accuracy 97.22%, which represents an improvement of nearly 9% compared to the best accuracy when a single type of features is employed, namely, the API sequences (88.89%). Moreover, we notice that the lowest FPR is achieved by combining network traffic and the list of APIs (i.e. 3.13%). When using network traffic alone, the proposed system scores the highest TPR (i.e. 100%). However, it incurs the worst FPR (i.e. 95.31%).
Experimental results.
ACC: accuracy; TPR: true positive rate; FPR: false positive rate.
The bold values represent the best results for each evaluation metric, namely TPR, FPR, and ACC.
In Figure 8, we present the accuracy results (ACC) alongside the TPR and FPR for each combination of features. Moreover, Figure 9 shows the detection rate (TPR) for each type of malware, which allows us to evaluate the ability of our system to detect the different types of malware in the data set. According to Figure 9, our system ensures a full detection of four types of malware, namely, botnets, Trojans, ransomware, and backdoor (i.e. 100% of detection rate). Viruses have a slightly lower detection rate with 98.3% followed by benign files with 95.3%. The lowest detection rate, that is, 80%, is reached in case of Email-Worms.

Experimental results (FPR, TPR, and ACC) for each feature type and their combination.

Experimental results (detection rate) for each file type.
Experimental results (detection time)
The detection time is the time required by our system in order to make a decision on an analyzed file: malware or benign. In our case, we can divide the total detection time into three parts:
Analysis time: it is the time required to analyze a file in Cuckoo Sandbox and generate the activity report.
Behavioral rules generation time: it is the time required to extract the different features (i.e. API, APISEQ, and NET) from the activity report and generate the behavioral YARA rules.
Decision time: this is the time required for our system to classify a file as malware or benign.
The detection time differs from one file to another. However, we calculated the average time required for each phase, and the results are presented in Table 7.
Experimental results (detection time).
Comparison with existing approaches
In Figure 10, and Table 8, we compare our system and other state-of-the-art approaches, which are presented previously, in terms of accuracy.

Comparison with existing approaches.
Comparison with previous related work.
TPR: true positive rate; FPR: false positive rate; ACC: accuracy; API: application programming interface; YARA: Yet Another Recursive Acronym.
Firdausi et al., 20 which employed System and API Calls, achieved a detection accuracy that is 0.42% lower than that of our system. Xiaofeng et al. 25 achieved an accuracy which is 0.52% lower. Alazab et al., 18 which used API Calls alone, obtained an accuracy that is 0.72% lower than that of our system. Compared to Mosli et al. 24 and Nair et al., 21 our system outperforms both of them in terms of accuracy, with an improvement of 1.22% and 20.62%, respectively.
Discussion
The proposed rule-based behavioral approach for malware detection has proven to be very efficient, with an overall accuracy of 97.22% with 4.69% of FPR. There are several factors that contributed to reach such a high accuracy. First, the most important factor was the use of API sequences, which were collected in a chronological order. By applying LCS algorithm, we could find common sequences among the set of malicious program, as well as among the set of benign programs, which allowed to accurately reflect the behavior of the programs (benign and malicious). Indeed, using this type of features alone allowed reaching 88.89% of accuracy with 6.25% of FPR. Second, the use of single API calls with their call frequencies allowed to identify those that are most likely used by malicious/benign programs based on their TF-IDF scores. This second type of features alone was able to achieve 79.44% of accuracy with 4.69% of FPR. Finally, network traffic (IP address and domain name) is a factor that cannot be neglected, as malware in general compromise the system and then connect to remote C&C server that are hosted by a cyber-attack infrastructure. Although this type of features has given a very high FPR (i.e. 95.31%) and a low accuracy (i.e. 66.11%), they have contributed in increasing the accuracy by adding the two other features. Indeed, the accuracy results of features taken individually are average and for this reason the decision process was made by trying several combinations of features, and the most successful was the combination of the three features (API sequences + List of APIs + network traffic).
Conclusion
In this article, we proposed a malware detection system based on rule-based behavioral analysis. Three types of features are used: list of APIs, API sequences, and network traffic data (IP addresses and domain names). By employing feature selection and construction techniques, that is, TF-IDF and LCS, a new set of features are built, which are used to generate a set of decision rules using YARA. The decision is made by matching the analyzed file’s signature with those already stored in the signatures database. A voting decision-making mechanism, which is based on a score that computes the number of matching rules, is employed to predict the file class (malware or benign) and the malware category (virus, Trojan, worm, etc.). The system achieved an accuracy of 97.22% and an FPR of 4.69%.
As future work, we aim to improve our detection system by trying to find other discriminating features such as strings. Moreover, we will try to identify different types of malware or their families that use the same attack infrastructure, which can be accomplished by conducting a more in-depth analysis of the generated network traffic. Finally, we plan to experiment our system on a larger data set, by taking into consideration both malware types and their families.
