
Abstract

Malware have become the scourge of the century, as they are continuously evolving and becoming more complex and more damaging. Therefore, adequate protection against such threats is vital. Behavior-based malware detection techniques have been shown to be effective at overcoming the weaknesses of signature-based ones. However, they are known for their high false alarm rates, which remains a very challenging problem. In this article, we address this shortcoming by proposing a rule-based behavioral malware detection system, which inherits the advantages of both signature-based and behavior-based approaches. We apply the proposed detection system to a combined set of three types of dynamic features, namely, (1) list of application programming interface calls; (2) application programming interface sequences; and (3) network traffic, which represents the IP addresses and domain names used by malware to connect to remote command-and-control servers. Feature selection and construction techniques, that is, term frequency–inverse document frequency and longest common subsequence, are performed on the three extracted features to generate a new set of features, which are used to build behavioral YARA (Yet Another Recursive Acronym) rules. The proposed malware detection approach achieves an accuracy of 97.22% and a false positive rate of 4.69%.

Keywords

Malware detection, dynamic analysis, application programming interface sequences, network traffic

Introduction

Malware (i.e. malicious software) remain the major threat on the Internet. A malware is a computer program that is designed to accomplish unauthorized actions without the user’s consent.¹ Malware exist in various forms, such as viruses, Trojans, and worms, and are at the origin of most cyber attacks. For instance, they can be used to spread spam, which represents more than half (55%) of all exchanged emails,² and 26% of spam messages are malicious.³ Moreover, malware can be used to launch targeted attacks such as distributed denial of service (DDoS),⁴ which can have devastating consequences. As a result, cyber attacks cost the global economy billions, and even trillions, of dollars every year.^5,6

Early malware were easily detectable by standard signature-based detection techniques, which were widely used by antivirus tools. A signature is a short string of bytes, which is unique for each known malware so that its future variants can be correctly classified with a small error rate.⁷ Signatures offer a fast and efficient way to detect known malware, which have been previously analyzed. However, malware writers can easily generate variants of known malware and employ code obfuscation techniques, which make the signature-based detection completely ineffective.⁸ Therefore, finding new detection techniques is of paramount importance. Among these techniques, we find the behavior-based ones, which aim to detect malicious behaviors. As a result, they are able to detect unknown and obfuscated malware, and thus succeed where the signature-based techniques fail.⁹

The malware detection process requires a code analysis phase, which extracts different attributes to determine whether the analyzed file is malicious. There are two types of analysis: static and dynamic.¹⁰ Dynamic analysis requires executing the program in a controlled environment, usually an emulated or virtual environment.¹¹ Static analysis, in contrast, does not require executing the program; it only disassembles and inspects its code. Both techniques have their strengths and weaknesses. For instance, static analysis is very fast and provides a real-time response. However, unlike dynamic analysis, it is not resilient against code obfuscation techniques such as packing. Thus, in this work, we adopt the dynamic analysis approach, as it is effective against code obfuscation and can accurately describe a program’s behavior. Behavior-based detection techniques, which are mainly based on dynamic analysis, describe what a program does once it is executed; a behavior profile is then generated and compared to either legitimate or malicious profiles. However, behavioral techniques suffer from high false positive rates (i.e. legitimate programs that are incorrectly classified as malicious), which can be explained by the fact that there are no strict boundaries between legitimate and malicious behaviors.

In this work, we address the above-mentioned issue by proposing a solution that takes advantage of both signature-based and behavior-based malware detection techniques. To this end, we propose an approach that employs dynamic analysis, behavior monitoring, and a rule-based decision-making process. In this approach, we use three different types of dynamic attributes, namely, (1) the list of application programming interface (API) calls, (2) API sequences, and (3) the network traffic features (i.e. IP addresses and domain names), which are combined with the aim of achieving the highest possible accuracy. After the features are extracted, we perform feature selection and construction, which is based on two methods: (1) TF-IDF (term frequency–inverse document frequency) feature selection, which is applied to network traffic and API calls, and (2) the LCS (longest common subsequence) feature construction algorithm, which is applied to API sequences. To decide on the maliciousness of the analyzed file, we use a simple yet efficient rule-based decision-making process, which is based on the outcomes of the two components: TF-IDF and LCS. The rule-based decision mainly relies on YARA (Yet Another Recursive Acronym) (https://yara.readthedocs.io/en/v3.7.0) and allows both detection (i.e. malware or benign) and categorization of malware (i.e. virus, Trojan, worm, etc.). To the best of our knowledge, this is the first work that uses such a combination of features, feature selection techniques, and decision mechanism for malware detection based on dynamic analysis.

This article is organized as follows: section “Related work” presents the related work in dynamic malware detection. In section “Proposed malware detection and classification system,” we describe our proposed approach for malware detection. We present in section “Evaluation” the experimental results, which are discussed in section “Discussion.” Finally, section “Conclusion” concludes our article and presents our future work.

Related work

In the literature, there are two main malware analysis techniques, namely, static and dynamic. Static analysis has been extensively investigated using several types of features, such as API calls,^1,12 portable executable (PE) structural information,^1,13,14 and Opcode sequences.^15,16 Dynamic analysis has also been widely investigated by many researchers for the purpose of constructing malware detection systems.^17–26,27 Most dynamic analysis techniques rely on API calls as the main features for detecting malware,^18,20,21,23,25 in addition to machine activity.^24,26–28 Here, we present existing dynamic analysis solutions that fully or partly use the same features as our proposed system.

Firdausi et al.²⁰ introduced a behavior-based malware detection approach using machine learning techniques. It consists of extracting APIs and system calls, then creating a vector model for both benign and malicious behaviors. In the last step, the vectors are classified as benign or malicious using different machine learning algorithms, namely, the k-nearest neighbors (KNN), support vector machine (SVM), Naive Bayes, J48, and multilayer perceptron (MLP). The best performance was achieved by J48 classifier with 96.8% accuracy.

Alazab et al.¹⁸ presented a behavior-based malware detection approach using API calls as behavioral features, which are automatically extracted and then represented as n-grams. For the classification task, the authors used a well-known machine learning algorithm, namely, SVM. The approach produced very promising results with 96.50% accuracy using 1-gram APIs.

Nair et al.²¹ presented a metamorphic malware detection system based on dynamic analysis and API sequences. The authors used hidden Markov model (HMM) to represent statistical properties of a set of metamorphic virus variants. The metamorphic virus data set was generated from metamorphic engines. The HMM was trained on a family of metamorphic viruses and used to determine whether a given program is similar to the viruses that are defined by HMM. They obtained an overall accuracy of 76.66%.

Mosli et al.²³ proposed an approach for malware detection based on dynamic analysis and machine learning. The authors extracted four different types of features, namely, memory artifacts, registry activity, imported libraries, and API calls. The proposed approach reached 96% accuracy using SVM classifier.

Xiaofeng et al.²⁵ proposed a combined machine learning and deep learning–based model for malware detection using dynamic analysis. In this system, once the malware samples are analyzed in the Cuckoo Sandbox, two types of features are constructed, namely, API call frequencies and API sequences. API calls are then used to construct the machine learning classification model, which is based on a random forest classifier. API sequences, in turn, are used to construct the deep learning–based model, which is a bidirectional long short-term memory (LSTM) model. The two models are then combined by the antivirus server. The proposed malware detection approach achieved an accuracy of 96.7%.

Differently from these works, our proposed malware detection system takes as input three different types of dynamic features, namely, list of APIs, API sequences, and network traffic. These features have been combined together in a way to get the highest possible accuracy. Moreover, we propose a lightweight yet efficient decision mechanism based on YARA rules.

Proposed malware detection and classification system

The proposed malware detection and classification system has two main modules: (1) Rules generation module and (2) decision-making module, as shown in Figure 1. The first module aims at training the system by analyzing a number of malicious and legitimate programs and constructing the behavioral rules. It comprises four functional components: Feature extraction, Feature selection, Feature construction, and Behavioral rules generation. The second module aims at testing the proposed decision system on previously unseen malware and benignware files. It comprises two functional components: Feature extraction and Decision.

Figure 1.

Proposed system’s architecture.

Feature extraction

The feature extraction phase analyzes both malicious and legitimate files using Cuckoo Sandbox, which automatically launches the virtual machine and runs the files for a specific period of time. Afterwards, activity reports in JSON format are generated by Cuckoo Sandbox. We extract from the activity reports the information that is necessary for our detection approach, namely, the list of APIs, the sequences of APIs, and the network traffic data (IP addresses and domain names).

List of APIs

This type of features represents the names of all APIs that are invoked by a given program and their corresponding number of occurrences during the execution process. The list of APIs and their corresponding frequencies have been widely used in static analysis–based malware detection^1,12,29,30 and have proven their effectiveness. Therefore, we decide to consider this type of features, as it is similar to those used in Belaoued and Mazouzi,^1,12 but in our case, we extract them dynamically. The list of APIs can be found in the [behavior]/[apistats] section of the JSON file, as shown in Figure 2.

Figure 2.

Example of the activity report of the list of APIs.

API sequences

The API sequences represent the exact behavior of the file. Indeed, a sequence represents the temporal order in which APIs are invoked. Consider the duplication example: some malware duplicate themselves by looking for files to infect and inserting themselves into these files. To do so, malware use specific APIs such as FindFirstFileA, CopyFileA, GetFileType, and SetFilePointer.¹² The API sequences are extracted from the section [behavior]/[processes], as shown in Figure 3, and more specifically by browsing the behavior/processes/i/calls/j/api section.

Figure 3.

Example of the activity report of API sequence.

Network traffic

Network traffic is represented by IP addresses and domain names. Some malware need to connect to the attack platform, which is often hosted by remote command-and-control (C&C) servers, in order to receive commands to execute or to download additional functionalities or an entire malware. Network traffic (IP addresses and domain names) is obtained from the [network] section, as presented in Figure 4.

Figure 4.

Example of the network traffic activity report.
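The three extractions described above can be sketched as a single pass over a parsed activity report. This is a minimal illustration, not the authors' code: the function name `extract_features` is hypothetical, the `apistats` and `processes`/`calls`/`api` paths follow the sections named in the text, and the `hosts` and `domains` keys inside the network section are assumptions about the report layout that should be checked against the reports produced by the Cuckoo version in use.

```python
from collections import Counter


def extract_features(report):
    """Collect API counts, API sequences, and network indicators from one report dict."""
    behavior = report.get("behavior", {})

    # List of APIs: sum the per-process call counts found under behavior/apistats.
    api_counts = Counter()
    for per_process in behavior.get("apistats", {}).values():
        api_counts.update(per_process)

    # API sequences: one chronologically ordered list of API names per process.
    api_sequences = [
        [call["api"] for call in process.get("calls", [])]
        for process in behavior.get("processes", [])
    ]

    # Network traffic: "hosts" (IP strings) and "domains" (dicts with a
    # "domain" field) are assumed key names, not confirmed by the article.
    network = report.get("network", {})
    ip_addresses = set(network.get("hosts", []))
    domain_names = {entry["domain"] for entry in network.get("domains", [])}

    return api_counts, api_sequences, ip_addresses, domain_names


# Tiny synthetic report used only to exercise the function.
sample_report = {
    "behavior": {
        "apistats": {"1200": {"NtClose": 2, "NtOpenKey": 1}, "1300": {"NtClose": 1}},
        "processes": [{"calls": [{"api": "NtOpenKey"}, {"api": "NtClose"}]}],
    },
    "network": {
        "hosts": ["23.211.9.92"],
        "domains": [{"domain": "www.microsoft.com", "ip": "23.211.9.92"}],
    },
}
api_counts, api_sequences, ip_addresses, domain_names = extract_features(sample_report)
```

In practice the dictionary would come from loading the JSON activity report of each analyzed file with `json.load`.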

Feature selection and construction

After extracting the features, we apply two discriminating methods: the LCS algorithm and TF-IDF, as feature construction and feature selection methods, respectively.

LCS

The LCS aims to find the longest sequence that is common to a list of sequences; the elements of this common sequence do not necessarily have to be adjacent in the original sequences. To illustrate the principle of the LCS, let A and B be two strings, defined as follows:

A = “OMNIPOTENT”

B = “OVNI”

By applying the LCS on the above two sequences, we obtain “ONI.” Therefore, the common sequence is of length 3.
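The example above can be reproduced with the classic dynamic-programming formulation of the longest common subsequence. The sketch below is a textbook version given for illustration; the helper name `lcs` is hypothetical. It accepts strings as well as lists, so the same function could be applied to lists of API names.

```python
def lcs(a, b):
    """Return one longest common subsequence of a and b as a list of elements."""
    rows, cols = len(a), len(b)
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover the subsequence.
    result = []
    i, j = rows, cols
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]


common = "".join(lcs("OMNIPOTENT", "OVNI"))
```

Here `common` holds the three-letter result discussed above.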

In our system, we use the LCS as a feature constructor: it allows us to obtain the longest API sequences that are common to different types of malware (viruses, Trojans, etc.). We also consider the longest API sequences that are common to benign files. Algorithm 1 presents the execution of the Longest Common API Subsequences. The algorithm goes through two sequences S1 and S2 and looks for their longest common sequence by comparing each word (API) of the first sequence with each word of the second one.

Algorithm 1 Longest Common API Subsequences

Input: APIseq // Malicious or benign API sequences database
Output: APILSCS // The longest common sequence, of length L
LengthOfSeq = 0 // Length of the longest common sequence found so far
EndOfSequence = 0 // Position in S1 of the last word of that sequence
M = array(0..lengthOfS1, 0..lengthOfS2), initialized to 0
// DisplaySequence(Document, Start, End) returns the words of Document from Start to End
i = 1
for X in S1 do // X = every API in S1
    j = 1
    for Y in S2 do // Y = every API in S2
        if X = Y then
            M[i][j] = M[i-1][j-1] + 1
            if M[i][j] > LengthOfSeq then
                LengthOfSeq = M[i][j]
                EndOfSequence = i
            end if
        else
            M[i][j] = 0
        end if
        j = j + 1
    end for
    i = i + 1
end for
Start = EndOfSequence - LengthOfSeq
End = EndOfSequence
Return DisplaySequence(S1, Start, End)

We use the LCS algorithm because it is designed to extract subsequences from a set of compared sequences, and because it is fast and effective, which makes it adequate for constructing relevant features in a short period of time.
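A possible Python reading of Algorithm 1 is sketched below. It is an interpretation, not the authors' implementation: the function name `longest_common_run` is hypothetical, and indices start at 1 so that `M[i-1][j-1]` is always defined. Because the counter is reset to zero whenever two APIs differ, this reading returns the longest run of consecutive APIs shared by the two sequences.

```python
def longest_common_run(s1, s2):
    """Return the longest run of consecutive elements shared by s1 and s2."""
    # m[i][j] is the length of the common run ending at s1[i-1] and s2[j-1];
    # cells for mismatches keep their initial value of 0.
    m = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    length_of_seq = 0    # length of the longest run found so far
    end_of_sequence = 0  # 1-based position in s1 where that run ends

    for i, x in enumerate(s1, start=1):
        for j, y in enumerate(s2, start=1):
            if x == y:
                m[i][j] = m[i - 1][j - 1] + 1
                if m[i][j] > length_of_seq:
                    length_of_seq = m[i][j]
                    end_of_sequence = i

    start = end_of_sequence - length_of_seq
    return s1[start:end_of_sequence]


# Hypothetical sequences, chosen only to exercise the function.
trace_a = ["NtCreateFile", "NtWriteFile", "NtClose", "NtOpenKey"]
trace_b = ["NtOpenKey", "NtWriteFile", "NtClose"]
shared = longest_common_run(trace_a, trace_b)
```

To obtain a sequence common to a whole category, one option (an assumption, since the article does not detail this step) is to fold the function over all traces of that category, comparing the current result with the next trace each time.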

TF-IDF

The TF-IDF (http://www.tfidf.com/) is a weighting method often used in information retrieval and text mining. It can be defined as a numerical statistic that reflects the importance of a word to a document in a collection or corpus.³¹ The importance increases with the number of times the word appears in the document, but is offset by the frequency of the word in the corpus. Variations of the TF-IDF weighting scheme are often used by search engines as a central tool for scoring and ranking the relevance of a document with respect to a user’s query.

Typically, the TF-IDF weight is composed of two terms. The first is the normalized term frequency (TF), which is the number of times a word appears in a document divided by the total number of words in that document, as shown in the following formula

$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$ (1)

where $n_{i,j}$ is the number of occurrences of term $t_{i}$ in document $d_{j}$. The second term is the inverse document frequency (IDF), calculated as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears

$idf_{i} = \log \frac{|D|}{|\{d : t_{i} \in d\}|}$ (2)

where $|D|$ is the total number of documents in the corpus. The resulting weight is the product of the two measures, $tf_{i,j} \times idf_{i}$.

We use the TF-IDF method in our system as a feature selector. It allows us to:

Extract the most common APIs that are invoked by each type of malware (ransomware, viruses, etc.) and by benign files.

Extract the most common IP addresses and domain names that are used by each type of malware and by benign files.

Our choice of the TF-IDF method is motivated by its speed and effectiveness, which are very important in our case (i.e. malware detection).
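Equations (1) and (2) can be illustrated with a short sketch in which each document is the list of API names (or IP addresses and domain names) observed for one file or one file category. How the documents are grouped in the actual system is not specified in the text, so that grouping, like the function name `tf_idf`, is an assumption.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, following equations (1) and (2)."""
    n_documents = len(documents)

    # Number of documents in which each term appears at least once.
    document_frequency = Counter()
    for document in documents:
        document_frequency.update(set(document))

    weights = []
    for document in documents:
        counts = Counter(document)
        total = len(document)
        weights.append({
            term: (count / float(total)) * math.log(n_documents / float(document_frequency[term]))
            for term, count in counts.items()
        })
    return weights


# Hypothetical two-document corpus of API names.
corpus = [
    ["NtOpenKey", "NtClose", "NtClose"],
    ["CoInitializeEx", "NtClose"],
]
scores = tf_idf(corpus)
```

In this toy corpus, `NtClose` appears in every document, so its IDF is log(1) = 0 and its weight vanishes, whereas `NtOpenKey` keeps a positive weight. This is the property that makes the score useful for picking terms that characterize one category rather than all of them.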

Generation of behavioral rules (YARA)

The phase that follows feature selection and construction is building the viral signature database. Using YARA, we can create malware family descriptions based on textual or binary patterns. Each description is called a rule, which consists of a set of character strings and/or Boolean expressions that determine its logic. More complex and powerful rules can be created using regular expressions, special operators, and many other features, making YARA a tool that helps malware researchers identify and classify malware samples. In our approach, this tool is essential for building our database using the different features. Moreover, the rules are easy to write and understand. In Figures 5–7, we present real examples of some of the YARA rules used for API sequences, list of APIs, and network traffic, respectively.

Figure 5.

A malicious YARA signature constructed using API sequences.

Figure 6.

A malicious YARA signature constructed using lists of APIs.

Figure 7.

A benign YARA signature constructed using network traffic.

The above YARA rules represent simple examples, which state that any file containing one of the strings $a1, $a2, $a3, $a4, and $a5 is reported as malware.
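Since the rules in Figures 5–7 are not reproduced in the text, the following sketch shows how a rule of the kind described above could be generated from a list of selected features. It is a hypothetical illustration: the function name `build_yara_rule`, the rule name, and the choice of the three Trojan APIs from Table 3 are assumptions, and only basic YARA syntax (a `strings` section and the `any of them` condition) is used.

```python
def build_yara_rule(rule_name, values):
    """Return the text of a YARA rule that fires if any of the given strings is present."""
    lines = ["rule %s" % rule_name, "{", "    strings:"]
    for index, value in enumerate(values, start=1):
        # Escape backslashes and double quotes so the string literal stays valid.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append('        $a%d = "%s"' % (index, escaped))
    lines += ["    condition:", "        any of them", "}"]
    return "\n".join(lines)


# Hypothetical rule built from the top three Trojan APIs of Table 3.
example_rule = build_yara_rule(
    "Trojan_API_List",
    ["CryptCreateHash", "NetShareEnum", "CryptHashData"],
)
```

The generated text can be saved to a rule file and matched against the features extracted from an analyzed file; a stricter condition (for example, requiring all strings) would trade detection rate for fewer false positives.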

Decision

Once the required rule signatures are generated and stored in the signature database, the rules generation stage is done. In the decision stage, the extraction of information for the test files is done in the same way as in the rules generation stage, but with an additional step, namely, the matching step. In this step, the features (list of APIs, API sequences, and network activity), which are extracted from the test set, are compared to the previously constructed YARA rules representing malicious and benign profiles. In fact, our system uses a majority voting decision-making mechanism, in which a score is computed, and we can get three different outcomes:

The file that has more matches with malicious rules will be considered as malware.

The file that has more matches with benign rules will be considered as benign.

The file that has equal matches with malicious and benign rules will also be considered as malicious.

Regarding the third outcome, our choice is motivated by the fact that it is safer to consider an unknown file as malicious than as benign. Moreover, according to the experimental results, this choice considerably increases the detection rate compared to the alternative, which is labeling unknown files as benign.
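The three outcomes reduce to a single comparison, as the sketch below shows. The function name `decide` is hypothetical, and the two arguments are assumed to be the numbers of malicious and benign rules matched by the analyzed file; how these counts are weighted across the three feature types is not detailed in the text.

```python
def decide(malicious_matches, benign_matches):
    """Majority vote over matched rules; a tie is resolved as malware."""
    if malicious_matches >= benign_matches:
        return "malware"
    return "benign"


verdict_more_malicious = decide(3, 1)
verdict_more_benign = decide(1, 4)
verdict_tie = decide(2, 2)
```

Note that a file matching no rule at all also counts as a tie under this reading and is therefore labeled as malware, which is consistent with treating unknown files as malicious.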

Evaluation

Data set

To evaluate the performance of our malware detection system, we use a data set of 604 Windows Portable Executable 32 (PE32) malware and benign programs. The 214 benign files were collected from a clean installation of the Windows XP 32-bit operating system and from other executable files representing utilities such as Skype, Adobe Reader, μTorrent, and so on. The malicious data set consists of 390 malware collected from various sources such as virusign.com and malshare.com, and is divided into several categories, as shown in Table 1. We use 70% of the data set as a training set (see Figure 1) to construct the viral database (behavioral signatures), and the remaining 30% as a test set.

Table 1.

Description of the malware data set.

Malware type	Training samples	Testing samples
Trojan	42	18
Botnet	37	17
Backdoor	17	7
Ransomware	20	9
Virus	139	60
Email-Worm	19	5
Total	274	116

Evaluation metrics

To test the performance of our proposed system, we use several evaluation metrics, which are accuracy (ACC), true positive rate (TPR), and false positive rate (FPR):¹

TPR. Also called the detection rate, it refers to the percentage of correctly recognized malware and is calculated using the following equation

$TPR = \frac{\text{number of correctly detected malware}}{\text{total number of malware}}$ (3)

FPR. It refers to the percentage of benign files incorrectly classified as malicious and is calculated using the following equation

$FPR = \frac{\text{number of benign files classified as malware}}{\text{total number of benign files}}$ (4)

Accuracy (ACC). It measures the percentage of files that are correctly classified out of all files and can be expressed using the following equation

$ACC = \frac{\text{number of correctly classified files}}{\text{total number of files}}$ (5)
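Equations (3) to (5) can be checked against the headline figures of the article with a few lines of code. The confusion-matrix counts used below are not stated in the text; they are an assumption, chosen because a test set of 116 malware (Table 1) and 64 benign files (30% of 214, rounded) with 114 detected malware and 3 misclassified benign files reproduces the reported 98.28% TPR, 4.69% FPR, and 97.22% accuracy.

```python
def rates(true_pos, false_neg, false_pos, true_neg):
    """Return (TPR, FPR, ACC) in percent, following equations (3), (4), and (5)."""
    total = true_pos + false_neg + false_pos + true_neg
    tpr = 100.0 * true_pos / (true_pos + false_neg)
    fpr = 100.0 * false_pos / (false_pos + true_neg)
    acc = 100.0 * (true_pos + true_neg) / total
    return tpr, fpr, acc


# Inferred (not reported) counts: 114 of 116 malware detected,
# 3 of 64 benign files flagged as malware.
tpr, fpr, acc = rates(true_pos=114, false_neg=2, false_pos=3, true_neg=61)
```

The same function applied to other inferred counts (for instance, 112 detected malware and 2 false alarms) matches the API + NET row of the results table, which suggests that the reported percentages are mutually consistent.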

Hardware and software configuration

The hardware and software configuration, which is used to implement our system, is depicted in Table 2.

Table 2.

Hardware and software configuration.

Component	Configuration
Hardware configuration
CPU	Intel Pentium® CPU 4405 2.10 GHz × 4
RAM	4 GB
Software configuration
OS	Ubuntu 16.04 LTS
Virtual machine	VirtualBox 5.1.38 + Windows XP(32 bit)
Sandbox	Cuckoo Sandbox 2.0.5
Programming language	Python 2.7
Network simulation	INetSim 1.2.7

CPU: central processing unit; RAM: random access memory; GB: gigabyte; OS: operating system; LTS: long-term support.

Overview of the extracted features

In the following, we present the best features obtained according to their degrees of relevance after applying LCS and TF-IDF. In Tables 3 and 4, we present the best three API calls, and the most used IP address and domain names for each file category. The latter have obtained the highest TF-IDF score. In Table 5, we present some of the obtained longest common API subsequences for each malware category and benign files, which were constructed using the LCS algorithm.

Table 3.

Top three APIs according to TF-IDF score.

File type	Best three APIs
Virus	NtOpenKey, NtAllocateVirtualMemory, NtClose
Trojan	CryptCreateHash, NetShareEnum, CryptHashData
Botnet	CryptEncrypt, GetNativeSystemInfo, InternetCrackUrlW
Ransomware	WSARecv, DnsQuery_W, send
Email-Worm	LdrLoadDll, LdrGetProcedureAddress, NtFreeVirtualMemory
Backdoor	CryptExportKey, CryptGenKey, GetAsyncKeyState
Benign	CoInitializeEx, GetFileVersionInfoW, GetFileVersionInfoSizeW

API: application programming interface; TF-IDF: term frequency–inverse document frequency.

Table 4.

The most used IP addresses and domain names according to TF-IDF score.

File type	IP address	Domain name
Virus	219.232.224.226	www.game9988.cn
Trojan	13.107.21.200	Jyfllqilh.com
Botnet	78.88.88.24	www.orascomdm.com
Ransomware	128.31.0.39	Vjiuxtixi.com
Email-Worm	172.217.19.51	Ddd.com
Backdoor	67.227.226.240	www.poisonivy-rat.com
Benign	23.211.9.92	www.microsoft.com

TF-IDF: term frequency–inverse document frequency.

Table 5.

Overview of the obtained longest common API sequences.

File type	API sequence
Virus	NtCreateFile, NtSetInformationFile, NtSetInformationFile, NtWriteFile, NtReadFile, NtClose, NtClose
Trojan	NtClose, NtFreeVirtualMemory, NtClose, NtClose
Botnet	NtAllocateVirtualMemory, NtAllocateVirtualMemory, NtFreeVirtualMemory
Ransomware	FindResourceExW, LoadResource, FindResourceExW, LoadResource
Email-Worm	GetSystemMetrics, GetSystemMetrics, GetSystemMetrics, GetSystemMetrics
Backdoor	GetSystemMetrics, GetSystemMetrics, GetSystemMetrics, GetSystemMetrics, GetSystemMetrics, GetSystemMetrics, GetSystemMetrics
Benign	RegOpenKeyExW, RegOpenKeyExW, RegOpenKeyExW, RegQueryValueExW, RegQueryValueExW, RegCloseKey

API: application programming interface.

Experimental results (accuracy)

Table 6 shows that the results are moderate or even poor when the three types of features are used separately, with an accuracy that varies between 66.11% and 88.89%. However, combining the features considerably increases the accuracy, especially when the three features (i.e. API sequences + List of APIs + network traffic) are combined. Indeed, we obtain the highest accuracy, 97.22%, which represents an improvement of 8.33 percentage points over the best accuracy obtained with a single type of features, namely, the API sequences (88.89%). Moreover, we notice that the lowest FPR is achieved by combining network traffic and the list of APIs (i.e. 3.13%). When using network traffic alone, the proposed system scores the highest TPR (i.e. 100%). However, it incurs the worst FPR (i.e. 95.31%).

Table 6.

Experimental results.

Features	FPR (%)	TPR (%)	ACC (%)
API	4.69	70.69	79.44
APISEQ	6.25	86.21	88.89
NET	95.31	100	66.11
API + APISEQ	7.81	89.66	90.56
API + NET	3.13	96.55	96.67
APISEQ + NET	6.25	95.69	95.00
API + APISEQ + NET	4.69	98.28	97.22

ACC: accuracy; TPR: true positive rate; FPR: false positive rate.

The best results for each evaluation metric are an FPR of 3.13% (API + NET), a TPR of 100% (NET), and an ACC of 97.22% (API + APISEQ + NET).

In Figure 8, we present the accuracy results (ACC) alongside the TPR and FPR for each combination of features. Moreover, Figure 9 shows the detection rate (TPR) for each type of malware, which allows us to evaluate the ability of our system to detect the different types of malware in the data set. According to Figure 9, our system fully detects four types of malware, namely, botnets, Trojans, ransomware, and backdoors (i.e. a 100% detection rate). Viruses have a slightly lower detection rate of 98.3%, while 95.3% of benign files are correctly recognized as benign. The lowest detection rate, 80%, is obtained for Email-Worms.

Figure 8.

Experimental results (FPR, TPR, and ACC) for each feature type and their combination.

Figure 9.

Experimental results (detection rate) for each file type.
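The per-type detection rates can be related to the overall TPR by converting each rate back into a number of detected samples, using the test-set sizes of Table 1. The detected counts are not reported in the article; they are derived here by rounding, which is an assumption of this sketch.

```python
# Test-set sizes from Table 1 and detection rates (in percent) reported for Figure 9.
test_counts = {"Trojan": 18, "Botnet": 17, "Backdoor": 7,
               "Ransomware": 9, "Virus": 60, "Email-Worm": 5}
reported_rates = {"Trojan": 100.0, "Botnet": 100.0, "Backdoor": 100.0,
                  "Ransomware": 100.0, "Virus": 98.3, "Email-Worm": 80.0}

# Derived (not reported) number of detected samples per type.
detected = {
    name: int(round(test_counts[name] * reported_rates[name] / 100.0))
    for name in test_counts
}

total_detected = sum(detected.values())
overall_tpr = 100.0 * total_detected / sum(test_counts.values())
```

Under this reading, one virus and one Email-Worm sample are missed, and the resulting overall rate agrees with the TPR reported for the combination of the three feature types. With only five Email-Worm test samples, a single miss accounts for the whole 20-point drop for that category.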

Experimental results (detection time)

The detection time is the time required by our system in order to make a decision on an analyzed file: malware or benign. In our case, we can divide the total detection time into three parts:

Analysis time: it is the time required to analyze a file in Cuckoo Sandbox and generate the activity report.

Behavioral rules generation time: it is the time required to extract the different features (i.e. API, APISEQ, and NET) from the activity report and generate the behavioral YARA rules.

Decision time: this is the time required for our system to classify a file as malware or benign.

The detection time differs from one file to another. However, we calculated the average time required for each phase, and the results are presented in Table 7.

Table 7.

Experimental results (detection time).

Phase	Average required time
Analysis	3 min
Behavioral rules generation	1 min
Decision	2 s

Comparison with existing approaches

In Figure 10 and Table 8, we compare the accuracy of our system with that of the state-of-the-art approaches presented in section “Related work.”

Figure 10.

Comparison with existing approaches.

Table 8.

Comparison with previous related work.

Work	Features Used	Decision method	TPR	FPR	ACC
Firdausi et al.²⁰	API calls, System calls	Machine learning	95.9%	2.4%	96.8%
Alazab et al.¹⁸	API calls	Machine learning	N/A	1.91%	96.50%
Nair et al.²¹	API sequences	Machine learning	N/A	N/A	76.66%
Mosli et al.²³	Memory artifacts, registry activity, imported DLLs, and API calls	Machine learning	N/A	N/A	96%
Xiaofeng et al.²⁵	API calls, API sequences	Machine learning, Deep learning	N/A	N/A	96.7%
Proposed system	API calls, API sequences, network traffic	Rule-based (YARA)	98.28%	4.69%	97.22%

TPR: true positive rate; FPR: false positive rate; ACC: accuracy; API: application programming interface; YARA: Yet Another Recursive Acronym.

Firdausi et al.,²⁰ who employed system and API calls, achieved a detection accuracy that is 0.42 percentage points lower than that of our system. Xiaofeng et al.²⁵ achieved an accuracy that is 0.52 points lower. Alazab et al.,¹⁸ who used API calls alone, obtained an accuracy that is 0.72 points lower than that of our system. Our system also outperforms Mosli et al.²³ and Nair et al.²¹ in terms of accuracy, with an improvement of 1.22 and 20.56 percentage points, respectively.

Discussion

The proposed rule-based behavioral approach for malware detection has proven to be very effective, with an overall accuracy of 97.22% and an FPR of 4.69%. Several factors contributed to reaching such a high accuracy. First, the most important factor was the use of API sequences, which were collected in chronological order. By applying the LCS algorithm, we could find common sequences among the set of malicious programs, as well as among the set of benign programs, which allowed us to accurately reflect the behavior of the programs (benign and malicious). Indeed, using this type of features alone yielded an accuracy of 88.89% with an FPR of 6.25%. Second, the use of single API calls with their call frequencies allowed us to identify those that are most likely used by malicious or benign programs based on their TF-IDF scores. This second type of features alone achieved an accuracy of 79.44% with an FPR of 4.69%. Finally, network traffic (IP addresses and domain names) is a factor that cannot be neglected, as malware in general compromise the system and then connect to remote C&C servers that are hosted by a cyber-attack infrastructure. Although this type of features alone gave a very high FPR (i.e. 95.31%) and a low accuracy (i.e. 66.11%), it contributed to increasing the accuracy when combined with the two other feature types. Indeed, the accuracy results of the features taken individually are average; for this reason, several combinations of features were tried for the decision process, and the most successful was the combination of the three features (API sequences + List of APIs + network traffic).

Conclusion

In this article, we proposed a malware detection system based on rule-based behavioral analysis. Three types of features are used: list of APIs, API sequences, and network traffic data (IP addresses and domain names). By employing feature selection and construction techniques, that is, TF-IDF and LCS, a new set of features is built, which is used to generate a set of decision rules using YARA. The decision is made by matching the analyzed file’s features against the signatures already stored in the signature database. A voting decision-making mechanism, which is based on a score that counts the number of matching rules, is employed to predict the file class (malware or benign) and the malware category (virus, Trojan, worm, etc.). The system achieved an accuracy of 97.22% and an FPR of 4.69%.

As future work, we aim to improve our detection system by looking for other discriminating features such as strings. Moreover, we will try to identify different types of malware or malware families that use the same attack infrastructure, which can be accomplished by conducting a more in-depth analysis of the generated network traffic. Finally, we plan to evaluate our system on a larger data set, taking into consideration both malware types and their families.

Footnotes

Handling Editor: Mohsin Iftikhar

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Deanship of Scientific Research at King Saud University through research group no. RG-1439-021.

ORCID iDs

Mohamed Belaoued

Mohamed Amir Koalal
