Introduction
Malware (malicious software) continues to increase with the rapid development of mobile networks, especially on the Android platform and services. In November 2015, the McAfee Labs Threats Report 1 announced that in the first three quarters of the year, mobile malware incidents increased by approximately 4 million, more than twice the figure for the same period of 2014, and the total number of mobile malware incidents reached approximately 10 million. Android malware can infect and harm Android platforms and services through various methods such as malicious websites, spam, malicious SMS messages, and malware-bearing advertisements. Android malware causes security threats such as phishing, banking Trojans, spyware, bots, root exploits, SMS fraud, premium dialers, and fake installers. Moreover, the most frequent malware behaviors are escalating privileges, taking remote control, incurring financial charges, and stealing personal information. 2 Therefore, Android malware detection is a necessary and pressing task. There are two effective types of approaches: static analysis, which decompiles the source code, and dynamic analysis, which monitors application execution at runtime. 3 The datasets for most Android malware detection experiments are composed of both benign and malicious applications. The benign applications are collected from the Google Play Store, and the malware applications are collected from malware-sharing websites or by the researchers themselves. Because it is difficult to create a comprehensive malware collection, the datasets used in the experiments are typically imbalanced: the number of benign applications is larger than the number of malware applications.4–6 However, this imbalance problem is usually not taken seriously.
Dataset imbalance is not a new problem; it occurs in many real-world situations including intrusion detection, risk management, text categorization, and information filtering. 7 In these problems, just as in Android malware detection, it is important to correctly identify the minority class. There is a strong learning bias toward the majority class when using imbalanced datasets. As a result, the minority class examples are more likely to be misclassified; 8 therefore, the minority class should be given more attention. Machine learning from imbalanced datasets has been widely researched. Overall, the approaches for solving the imbalanced dataset problem can be divided into two types: resampling methods and imbalanced learning algorithms. 9 Resampling tries to balance the class distribution inside the training data through oversampling (adding examples to the minority class to approach the majority class size) and undersampling (removing examples from the majority class to approach the minority class size). The most famous oversampling method is the synthetic minority oversampling technique (SMOTE). 10 Later studies have either modified existing methods that perform well on balanced datasets or proposed new algorithms to resolve the imbalance problem.
In the SMOTE method, the minority class is oversampled to generate synthetic minority examples. For each minority example, the k nearest minority-class neighbors are found, and new synthetic examples are created at random positions along the line segments joining the example to those neighbors.
The imbalance ratio (IR) characterizes the imbalance degree of a dataset: it is the sample size of the majority class divided by that of the minority class. 11 The IR is considered an important factor affecting the classification accuracy of the minority class. However, recent studies have indicated that the IR is not the only factor: for some datasets with a high IR, standard classifiers still achieve high accuracy on the minority class. Other factors also reduce minority class classification accuracy in imbalanced datasets, such as overlap between classes12,13 and the presence of many minority examples within the majority class. 14 When multiple such factors occur together in imbalanced datasets, the accuracy of minority class classification can be seriously affected. Therefore, data distribution has a large impact on the quality of imbalanced datasets and on classifier performance. The membership degree of the examples to the classes can be computed to measure the distribution of the datasets. 15 Data distribution and the membership degree of the examples to the classes thus lead to the main goals of our study.
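As a concrete illustration, the IR defined above is a simple ratio of class sizes; a minimal sketch in Python (the class sizes in the example are hypothetical):

```python
def imbalance_ratio(n_majority, n_minority):
    """IR = size of the majority class divided by the size of the minority class."""
    if n_minority == 0:
        raise ValueError("minority class is empty")
    return n_majority / n_minority

# Hypothetical class sizes: 1017 benign (majority) vs 343 malware (minority).
print(round(imbalance_ratio(1017, 343), 2))  # -> 2.97
```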
Our main goal is to solve imbalanced dataset distributions by adding new minority examples and to improve the accuracy of the minority class. The most important task is to construct a principle for oversampling the minority class. This principle should act as a guide to help determine which minority examples should be used to generate new synthetic examples and how many new synthetic examples should be generated for each minority example.
The concept of membership degree was first proposed in fuzzy set theory: it reflects the degree of uncertainty about whether an example belongs to a set, and it permits the gradual assessment of the membership of the examples in a set. 16 Membership degree quantifies the relationship of each example to a given dataset in a range [0.0, 1.0]. When the value of the membership degree of an example equals 1.0, that example is sure to belong to the dataset. When the value of membership degree is between 0.0 and 1.0, the example is fuzzy and only partially belongs to the dataset. Fuzzy set theory provides a methodology for data analysis; here, we extend fuzzy set theory to the task of Android malware detection in imbalanced datasets.
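As a toy illustration of graded membership in [0.0, 1.0], the triangular membership function below is a standard fuzzy-set textbook example; it is not the membership formula used later in this paper:

```python
def triangular_membership(x, a, b, c):
    """Triangular fuzzy membership: rises from 0 at a to 1 at b, falls to 0 at c."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# An example at the peak fully belongs to the set; others belong only partially.
print(triangular_membership(5.0, 0.0, 5.0, 10.0))  # -> 1.0
print(triangular_membership(2.5, 0.0, 5.0, 10.0))  # -> 0.5
```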
According to fuzzy set theory, minority examples in the imbalanced datasets that have a low membership degree to the minority class can easily be misclassified. However, the decision region of the minority class can be broadened to ensure correct classification by creating new minority examples that are similar to the minority class. 17 The result is that classifiers are no longer biased against the minority class. Based on these concepts, we define a fuzzy region that contains the minority examples with low membership degree, for which new synthetic examples need to be generated. To generate such examples, the following two questions should be answered:
What is the range of the fuzzy region?
How many new synthetic examples should be generated for each minority example in the fuzzy region?
The reason for question (1) is that when the range of the fuzzy region is small and contains few minority examples, the number of new synthetic examples generated is small, and the data distribution may remain imbalanced. In contrast, when the range of the fuzzy region is large and contains more minority examples, redundant examples may be generated that increase the cost of system resources and waste time. Therefore, answering question (1) involves finding a proper range for the fuzzy region through a series of experiments. The reason for question (2) is that minority examples with low membership degrees are more important; therefore, more synthetic examples should be generated for them. Answering question (2) involves combining the IR of the imbalanced dataset with fuzzy-SMOTE to calculate the appropriate number of synthetic examples to generate for each minority example.
In our work, Android malware applications are viewed as the minority class in the imbalanced dataset. To improve the classification performance of the minority class, we propose a new algorithm called fuzzy-SMOTE, which is based on the membership degree in fuzzy set theory and the SMOTE method. First, fuzzy-SMOTE computes the membership degree of each minority example to the minority class. Second, fuzzy-SMOTE explores the fuzzy region containing the minority examples that have low membership degrees and are likely to be misclassified. Third, fuzzy-SMOTE generates new synthetic minority examples for the minority examples in the fuzzy region, using the same process as SMOTE. Compared with other SMOTE-based methods, fuzzy-SMOTE pays more attention to finding which minority examples should be oversampled and determining how many new synthetic examples should be generated for each minority example. Our research makes the following contributions:
We propose a new oversampling method called fuzzy-SMOTE, based on fuzzy set theory and SMOTE, to solve the imbalance problem in Android malware detection. The results indicate that fuzzy-SMOTE improves both the classification accuracy of the minority class and the overall performance on the entire dataset.
Fuzzy-SMOTE calculates the membership degree of each minority example to the minority class. In addition, it defines a fuzzy region that contains the minority examples with low membership degrees that are more likely to be misclassified.
Combined with the IR, fuzzy-SMOTE generates more synthetic examples for each minority example with a low membership degree. The new synthetic examples broaden the decision region of the minority class and reduce classifier bias toward the majority class.
The remainder of this article is organized as follows: section “Related work” reviews related work on Android malware detection methods and imbalanced learning. Section “Methodology” presents the details of the research methodology. Section “Experiment and discussion” introduces and discusses the experiments, and section “Conclusion” provides conclusions.
Related work
Methods for Android malware detection
Android malware detection methods are divided into three categories: static analysis, dynamic analysis, and hybrid analysis. 3 Static detection methods analyze the decompiled source code at the binary level or the API level without executing the Android applications. Dynamic detection methods analyze application behaviors at runtime by monitoring behaviors indicative of Android malware activity. Hybrid analysis methods combine both static and dynamic techniques.
Static analysis methods comprise rule-policy methods18,19 and machine learning methods.4–6 The machine learning detection process consists of feature extraction, feature selection, and classification. In the feature extraction step, permissions, 20 APIs, 4 combinations of permissions and APIs, 6 system calls, 21 signatures, 22 and component information 23 are extracted as features from the decompiled source files. In the feature selection step, the common methods are Information Gain (IG),6,24 CHI,6,25 and Fisher Score.25,26 Machine learning classification methods include KNN, naive Bayes (NB), decision tree (DT), logistic regression (LR), support vector machines (SVMs), and AdaBoost, among others.
Imbalanced datasets have been widely used in previous studies. In Cen et al., 6 five training datasets were used in which the ratios of malicious applications to benign applications were 0.0%, 0.23%, 4.79%, 0.25%, and 8.07%. In Aafer et al., 4 the dataset included 16,000 benign and 3987 malware applications. In Sanz et al., 5 the dataset included 1811 benign and 249 malware applications, and Jang et al. 32 included 109,193 benign and 9990 malware applications. Although the malware detection accuracy was high, reaching 99% in Cen et al., 6 the researchers did not pay attention to the imbalance problem or take special measures to solve it. In our work, by contrast, we aim to explore the nature of the imbalance problem and its solution, and we propose a new oversampling method to improve detection performance on imbalanced datasets.
Methods for imbalanced learning
The methods for solving the imbalanced dataset problem are divided into two types: data resampling methods and imbalanced learning algorithms. 9 Data resampling methods aim to balance the given dataset by adding or removing samples, while imbalanced learning algorithms focus on modifying existing machine learning mechanisms or creating new algorithms to improve the detection rate of the minority class.
Resampling methods consist mainly of two types: oversampling and undersampling. 33 Oversampling duplicates existing examples or generates new examples for the minority class, while undersampling removes examples randomly from the majority class. Oversampling and undersampling have also been combined to solve the imbalance problem. 10 Specifically, random oversampling is a non-heuristic method that balances the class distribution by randomly replicating minority class examples. 34 The most famous of these methods is SMOTE, 10 which generates new (synthetic) minority examples between each minority example and its nearest neighbors. This method makes decision regions larger and less specific; consequently, classifiers achieve better performance. Han et al. 35 presented Borderline-SMOTE, which oversamples only the minority examples near the borderline. Borderline-SMOTE achieved a better true positive (TP) rate and F-measure than SMOTE and random oversampling. Chawla et al. 36 integrated SMOTE and the standard boosting method to create SMOTEBoost, which synthesizes minority class examples and thus indirectly changes the updating weights to compensate for skewed distributions. Bunkhumpornpat et al. 37 proposed Safe-Level-SMOTE, which oversamples minority instances along the same line at a safe level; the safe level is defined based on the nearest-neighbor minority instances. Guo and Viktor 38 proposed DataBoost-IM, which generates synthetic examples for both the majority and minority classes. Qiong et al. 39 proposed the genetic algorithm–based synthetic minority oversampling technique (GASMOTE), which improves SMOTE using a genetic algorithm (GA) to set different sampling rates for different minority samples. Moreover, SMOTE has been integrated with numerous machine learning9,40–42 and deep learning algorithms. 43 Undersampling methods also take many different forms, including random undersampling, 34 inverse random undersampling, 44 and the EasyEnsemble and BalanceCascade undersampling strategies. 45
Imbalanced learning algorithms are divided into four types: cost-sensitive methods, kernel-based learning methods, active learning methods, and ensemble methods.46,47 (1) Cost-sensitive methods focus on the costs of the misclassified examples using different cost matrices and have outperformed various empirical sampling methods. 48 Cost-sensitive methods can be categorized into three types: those that use misclassification costs as a form of dataspace weighting, those that use cost-minimization techniques combined with ensemble methods to improve performance, and those that indirectly incorporate cost-sensitive functions into classification paradigms to train classification models. 46 Several cost-sensitive boosting methods based on the AdaBoost algorithm have been proposed. Sun et al. 49 changed AdaBoost’s weight-updating strategy by introducing cost items, proposing three cost-sensitive boosting methods: AdaC1, AdaC2, and AdaC3. Song et al. 50 proposed BABoost, which assigns higher weights to the misclassified examples in the minority class. Cost-sensitive DTs 51 and cost-sensitive neural networks 52 have also been widely studied for imbalanced learning. (2) Kernel-based learning methods such as SVM mainly center on statistical learning and can achieve relatively robust classification results when addressing imbalanced datasets. 9 SVM adopts different error cost terms to shift the decision boundary away from positive examples to guarantee that negative examples are classified correctly.53,54 (3) Active learning methods are often integrated into kernel-based learning methods. As an active learning method, SVM is used to select the most informative examples from unknown training data while retaining the kernel-based methods. 46 (4) Ensemble methods are effective in averaging prediction errors and reducing bias and error variance. Most current ensemble methods follow similar procedures for imbalanced datasets: resampling and voting. 55 Most of these ensembles are based on known strategies from bagging, boosting, or random forests. Moreover, the diversity of an ensemble determines its final prediction accuracy; consequently, creating algorithm strategies that ensure and enlarge diversity is a critical task. 56
Methodology
SMOTE
Chawla et al. 10 proposed the SMOTE approach, which can improve classifier accuracy on the minority class of imbalanced datasets. SMOTE performs oversampling by creating synthetic samples for the minority class rather than duplicating existing minority class samples. Each synthetic sample is generated at a random position along the line segment between a minority example and one of its nearest minority-class neighbors.
The principles of the SMOTE algorithm are as follows: (1) find the k nearest minority-class neighbors (KNNs) of each minority example; (2) randomly select one of these neighbors; and (3) generate a synthetic sample at a random point on the line segment joining the example and the selected neighbor, repeating until the required number of synthetic samples has been created.
More synthetic samples of the minority class can be obtained using the above steps. The new synthetic samples can maintain the distribution of the original minority class and balance the distribution of the entire training set. Therefore, SMOTE causes classifiers to create a new decision boundary that is no longer biased to the majority class.
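The steps above can be sketched in pure Python. This is a minimal illustration only; production implementations (e.g. imbalanced-learn's `SMOTE`) are more elaborate, and the `k` and `n_per_example` parameters here are arbitrary:

```python
import math
import random

def knn_indices(points, i, k):
    """Indices of the k nearest neighbors of points[i] among the other points."""
    dists = [(math.dist(points[i], p), j) for j, p in enumerate(points) if j != i]
    dists.sort()
    return [j for _, j in dists[:k]]

def smote(minority, k=3, n_per_example=1, rng=None):
    """Core SMOTE step: generate synthetic minority samples on the line
    segments between each example and one of its k nearest minority neighbors."""
    rng = rng or random.Random(0)
    synthetic = []
    for i, x in enumerate(minority):
        neighbors = knn_indices(minority, i, k)
        for _ in range(n_per_example):
            nb = minority[rng.choice(neighbors)]
            gap = rng.random()  # random position along the segment
            synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new = smote(minority, k=2, n_per_example=2)
print(len(new))  # -> 8 (two synthetic samples per minority example)
```

Because each synthetic point is an interpolation between two minority examples, the new samples stay inside the minority region, which is what broadens the decision region without duplicating data.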
SMOTE can improve classifier performance on imbalanced datasets. Meanwhile, improving SMOTE has attracted much research attention; consequently, several improved algorithms have been proposed, including Borderline-SMOTE and SMOTEBoost. 35 Borderline-SMOTE oversamples only the minority examples near the borderline. SMOTEBoost combines SMOTE with the boosting procedure, creating synthetic examples for the minority class and thus indirectly changing the updating weights to compensate for skewed distributions. These methods improve classifier performance on the minority class and overall performance on imbalanced datasets; however, they consider neither the distribution of the imbalanced dataset nor, specifically, the easily misclassified minority examples. Therefore, we propose an improved algorithm, called fuzzy-SMOTE, which is based on the membership degree concept from fuzzy set theory and the SMOTE method.
Fuzzy-SMOTE
For balanced datasets, most machine learning methods build an impartial decision boundary to achieve good overall performance. However, in the original imbalanced dataset space, this decision boundary tends to be biased toward the majority class because fewer minority examples exist to train the classification model. 8 To better identify the minority class, we concentrate on oversampling the minority examples. Therefore, we explore the distribution of the minority and majority classes and generate additional synthetic samples for those minority examples that are most easily misclassified. This is the difference between our method and the existing oversampling methods.
In the fuzzy-SMOTE method, we first calculate the membership degree of each minority example to the minority class based on fuzzy set theory; then, we define a fuzzy region in which the minority class examples have low membership degree to the minority class and are easily misclassified; and finally, by combining the imbalance factor, we create synthetic samples for the minority examples in the fuzzy region and add them to the original training set.
Suppose that the training set is
The IR is related to the oversampling rate of the training minority class. The oversampling rate
where
The fuzzy-SMOTE method is executed in the following steps:
Step 1: membership degree
Let the classes
where
A large value for

The distribution of the examples
Step 2: fuzzy region
Suppose example
In addition, examples in which the value of
Step 3: oversampling rate
For every minority example
When
Step 4: oversampling
First, we find the KNNs for each sample
The above procedure is repeated for each minority example in the fuzzy region until
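Putting steps 1–4 together, the overall flow can be sketched as follows. This is a heavily hedged illustration: the paper's exact membership-degree and oversampling-rate equations are given above, so the sketch substitutes an assumed distance-based membership (relative closeness to the minority-class mean) and a fixed `rate` parameter in place of the IR-derived rate, and picks a random minority neighbor rather than performing a full k-NN search:

```python
import math
import random

def class_mean(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def membership_to_minority(x, min_mean, maj_mean):
    """Assumed membership degree in [0, 1]: relative closeness to the
    minority-class mean (NOT the paper's exact formula)."""
    d_min = math.dist(x, min_mean)
    d_maj = math.dist(x, maj_mean)
    if d_min + d_maj == 0:
        return 1.0
    return d_maj / (d_min + d_maj)

def fuzzy_smote(minority, majority, region=(0.0, 1.0), rate=2, rng=None):
    """Oversample only the minority examples whose membership degree falls in
    the fuzzy region; 'rate' stands in for the IR-derived oversampling rate."""
    rng = rng or random.Random(0)
    min_mean, maj_mean = class_mean(minority), class_mean(majority)
    lo, hi = region
    synthetic = []
    for x in minority:
        mu = membership_to_minority(x, min_mean, maj_mean)
        if lo <= mu <= hi:  # low-membership example: generate synthetic samples
            for _ in range(rate):
                nb = rng.choice(minority)  # simplified neighbor selection
                gap = rng.random()
                synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(0.0, 0.0), (0.5, 0.5), (2.0, 2.0)]                  # toy minority class
majority = [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0), (6.0, 6.0)]      # toy majority class
print(len(fuzzy_smote(minority, majority, region=(0.0, 0.8), rate=2)))  # -> 2
```

In this toy data, only the minority example nearest the majority class has membership below 0.8, so only it is oversampled, which mirrors the paper's intent of concentrating synthesis on the easily misclassified examples.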
Experiment evaluation
To evaluate the performance of our proposed method, the datasets are divided into training sets and testing sets using 10-fold cross-validation. In this method, a full dataset is split into 10 independent folds. In turn, nine folds are used to train the classification models, and the remaining fold is applied as the testing set to validate and assess the model. Consequently, each fold is used once as a testing set. Formally, the complete cross-validation estimate is the average of the 10-fold estimates computed in a loop. 57 While 10-fold cross-validation can be computationally expensive, it does not require as much data as a fixed arbitrary test set and is quite suitable for small datasets. 58
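The 10-fold procedure described above can be sketched as follows; the `majority_scorer` below is a hypothetical stand-in for training and evaluating a real classifier:

```python
import random

def ten_fold_cv(examples, train_and_score, k=10, rng=None):
    """Split examples into k folds; each fold serves once as the test set.
    Returns the average of the k fold scores."""
    rng = rng or random.Random(0)
    idx = list(range(len(examples)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = [examples[j] for j in folds[i]]
        train = [examples[j] for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(train_and_score(train, test))
    return sum(scores) / k

# Toy scorer (an assumption for illustration): fraction of test labels that
# equal the majority label of the training set.
def majority_scorer(train, test):
    majority = max(set(y for _, y in train), key=[y for _, y in train].count)
    return sum(1 for _, y in test if y == majority) / len(test)

# 80 majority-class and 20 minority-class toy examples.
data = [((i,), 0) for i in range(80)] + [((i,), 1) for i in range(20)]
print(round(ten_fold_cv(data, majority_scorer), 2))  # -> 0.8
```

The toy result also illustrates the bias discussed in this paper: a majority-label predictor scores 0.8 overall accuracy while misclassifying every minority example.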
Traditionally, with balanced datasets, machine learning classifiers are evaluated by overall accuracy. However, to address the special characteristics of imbalanced datasets, other evaluation measures are adopted to provide a comprehensive assessment.10,56 In our work, the negative class (malicious applications) is the minority class and the positive class (benign applications) is the majority class. The evaluation metrics are defined as follows
where
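The confusion-matrix measures commonly used in imbalanced learning can be sketched as follows; the names follow standard usage and may differ in detail from the paper's exact equation set above:

```python
def imbalanced_metrics(tp, fp, tn, fn):
    """Common measures for imbalanced classification built from the
    confusion matrix counts."""
    recall = tp / (tp + fn)                 # true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    g_mean = (recall * specificity) ** 0.5  # balances both class accuracies
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"recall": recall, "precision": precision,
            "f_measure": f_measure, "g_mean": g_mean, "accuracy": accuracy}

# Hypothetical confusion counts for illustration.
m = imbalanced_metrics(tp=90, fp=30, tn=70, fn=10)
print(round(m["g_mean"], 3))  # -> 0.794
```

Unlike overall accuracy, the G-mean collapses to zero whenever one class is classified entirely wrongly, which is why such measures are preferred for imbalanced datasets.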
Experiment and discussion
This section presents the detailed methodology and results of the experiments. For each method, the imbalanced evaluation values (such as
Data source
We used 10 datasets in the experiments, as described in Table 1. Eight of these datasets, termed S1–S8, include both benign and malicious examples. We collected 1017 benign examples from the Google Play Store in March 2015, comprising top applications from 14 categories: Office, Lifestyle, Travel, Children, Shopping, Education, Finance, Photography, Social, Tools, Reading, Multimedia, Sports, and Themes. We chose the most frequently downloaded applications in each category. The malicious examples were collected from the UNB ISCX Android Botnet Dataset and span the period from 2010 to 2014. 60 These malicious examples are analyzed in Kadir et al. 61 We used only four Android malware sets: DroidDream, Nickspay, MisoSMS, and SandDroid. The original number of examples in the DroidDream family was 343. To achieve our goals, we randomly duplicated 217 examples to generate a DroidDream1 set and randomly selected some examples to generate the DroidDream2, DroidDream3, and DroidDream4 datasets. Finally, the benign samples and the malware sets were integrated to form the S1–S8 datasets.
Descriptions of the datasets used in the experiments.
The other two datasets, termed S9 and S10, were collected from the University of California, Irvine (UCI) Machine Learning Repository (http://archive.ics.uci.edu/ml). 62 S9 is taken directly from UCI. S10 contains 1528 majority examples (from the “M” class in Abalone) and 396 minority examples (randomly selected examples from the “I” class in Abalone). Their descriptions are also shown in Table 1.
The Android permission mechanism is used to restrict application access to system resources and guarantee runtime security. Permissions are declared in an AndroidManifest.xml file and requested when the applications perform sensitive behaviors through a corresponding set of APIs. Malware and benign applications tend to request permissions differently. Malware applications often request higher-risk permissions such as SEND_SMS, RECEIVE_SMS, and READ_SMS. Therefore, to some degree, malware can be differentiated from benign applications based on the permissions listed in the AndroidManifest.xml file. Previous works have demonstrated that permission features can identify whether an application is malicious.63–65 Therefore, in our work, we extract the permissions from decompiled AndroidManifest.xml files and use them as features for detecting malware. In each dataset, the permission features are the union of permissions declared by both benign and malware applications. The numbers of features in the different datasets are listed in Table 1.
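The feature-construction step described above (taking the union of declared permissions and encoding each application as a binary vector) can be sketched as follows. The permission names are real Android permissions mentioned in the text, but the two toy applications are hypothetical:

```python
def build_feature_space(apps):
    """Feature space = union of all permissions declared by the apps."""
    return sorted(set(p for perms in apps for p in perms))

def to_vector(perms, feature_space):
    """Binary feature vector: 1 if the app declares the permission, else 0."""
    declared = set(perms)
    return [1 if f in declared else 0 for f in feature_space]

apps = [
    {"android.permission.INTERNET"},                                  # benign-looking
    {"android.permission.SEND_SMS", "android.permission.READ_SMS"},   # higher risk
]
space = build_feature_space(apps)
print(len(space))               # -> 3 distinct permissions in the union
print(to_vector(apps[1], space))  # -> [0, 1, 1]
```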
Accuracy bias to the majority class
For imbalanced datasets, machine learning algorithms create a decision boundary biased toward the majority class. Consequently, the minority class examples are more likely to be misclassified. 8 In this section, we use the S1, S2, S4, S6, and S7 datasets to explore the accuracy bias toward the majority class in imbalanced datasets. The minority classes in these datasets are similar except for their sizes, so the datasets are comparable. We used SVM as the classifier. The results are shown in Table 2.
SVM performance on the imbalanced datasets.
SVM: support vector machine; IR: imbalance ratio.
As shown in Table 2, the results indicate that as the IR grows, the accuracy of the minority class
Performance for fuzzy region
In section “Fuzzy-SMOTE,” the fuzzy region in the fuzzy-SMOTE method is defined to contain the minority class examples that are easily misclassified and should therefore be given more attention. In this section, we explore the performance of the fuzzy region at different ranges. First, we calculate the membership degree

(a) The membership degree of the minority examples in dataset S1 to the minority class and (b) the membership degree of the minority examples in dataset S8 to the minority class.
The

(a) The number of synthetic samples generated by fuzzy-SMOTE with different fuzzy region ranges, (b) SVM
Comparison with existing oversampling methods
SMOTE 10 and Borderline-SMOTE 56 have been shown to improve classifier accuracy for the minority class. In this section, we compare several sample synthesis methods, including SMOTE, Borderline-SMOTE, and fuzzy-SMOTE with the fuzzy region [0.0, 1.0]. The oversampling rate of each training dataset is
The comparison results are illustrated in Tables 3–7. As shown in Tables 4 and 7, all the oversampling methods improve
The number of synthetic samples generated by several oversampling methods.
IR: imbalance ratio; SMOTE: synthetic minority oversampling technique.
Bold values indicate the larger value when the methods are compared.
Comparison with machine learning methods
C4.5 (a DT algorithm), NB, SVM, and AdaBoost have all been applied as classifiers in imbalanced dataset experiments.10,35,36,56 In the experiments described in the previous subsection, we showed that combining fuzzy-SMOTE with SVM outperforms SVM combined with the other oversampling methods. In this section, we explore how well fuzzy-SMOTE works with other machine learning methods. The principle of NB is to calculate the posterior probability of each class for a testing sample; the class with the highest posterior probability is the prediction outcome. The C4.5 algorithm is constructed using a recursive partitioning operation that splits the examples into successive subsets based on information gain. SVM constructs a hyperplane to divide the sample space into two parts: a positive part and a negative part. AdaBoost learns a series of weak classifiers from the training dataset and then combines the weak classifiers into a boosted classifier. This section compares the performances of these methods with fuzzy-SMOTE oversampling.
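As an illustration of the NB principle described above, here is a minimal naive Bayes classifier over discrete features; the Laplace smoothing is an implementation choice assumed for the sketch, and the training data are toy values:

```python
import math
from collections import Counter

def naive_bayes_predict(train, x):
    """Predict the class with the highest log-posterior under a naive Bayes
    model with Laplace smoothing over discrete features."""
    classes = Counter(y for _, y in train)
    n = len(train)
    best, best_lp = None, -math.inf
    for c, nc in classes.items():
        lp = math.log(nc / n)  # log prior P(c)
        for d, xd in enumerate(x):
            match = sum(1 for f, y in train if y == c and f[d] == xd)
            lp += math.log((match + 1) / (nc + 2))  # smoothed P(x_d | c)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Toy training set: the first feature stands for a hypothetical risky permission.
train = [((1, 0), "malware"), ((1, 1), "malware"),
         ((0, 0), "benign"), ((0, 1), "benign")]
print(naive_bayes_predict(train, (1, 0)))  # -> malware
```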
The performances of four basic machine learning methods with fuzzy-SMOTE are presented in Tables 8–11. First, comparing the classifier performances with and without fuzzy-SMOTE, we can conclude that these four machine learning methods achieve better performance on
The performances of NB and NB with fuzzy-SMOTE.
NB: naive Bayes; SMOTE: synthetic minority oversampling technique.
Bold values indicate the larger value when the methods are compared.
The performances of SVM and SVM with fuzzy-SMOTE.
SVM: support vector machine; SMOTE: synthetic minority oversampling technique.
Bold values indicate the larger value when the methods are compared.
The performances of C4.5 and C4.5 with fuzzy-SMOTE.
SMOTE: synthetic minority oversampling technique.
Bold values indicate the larger value when the methods are compared.
The performances of AdaBoost and AdaBoost with fuzzy-SMOTE.
SMOTE: synthetic minority oversampling technique.
Bold values indicate the larger value when the methods are compared.
Conclusion
Previous works have paid little attention to the problem of imbalanced datasets in Android malware detection, where the number of malware examples is much smaller than the number of benign examples; such imbalanced datasets can cause classifier decision boundaries to be biased toward the majority class. To solve this problem, we propose a new oversampling method called fuzzy-SMOTE, which is based on fuzzy set theory and the SMOTE method. Fuzzy-SMOTE generates synthetic examples for the minority class in the fuzzy region, where the minority examples have low membership degrees and are more likely to be misclassified. Compared with the traditional SMOTE and Borderline-SMOTE methods, classifiers trained with datasets oversampled by fuzzy-SMOTE achieve better accuracy on the minority class and better overall accuracy on the entire dataset.
Fuzzy-SMOTE improves classifier performances on imbalanced datasets because it broadens the minority decision boundary by generating additional minority class training examples. Based on the original class distribution of the training datasets, fuzzy-SMOTE calculates the membership degree of every minority example to the minority class. Then, it generates additional synthetic examples for those examples that have lower membership degrees. The new synthetic examples, which belong to the minority class, are then added to the original training dataset to create a new training set. When this new training set is used to train new classification models, the trained models pay more attention to the enhanced minority class. Consequently, the decision boundary of the minority class is enlarged and the classifier is no longer biased toward the majority class. The outcome is that classifiers trained with fuzzy-SMOTE classify more minority class examples correctly.
