Abstract
Keywords
Introduction
Identifying essential proteins is important for understanding the cellular processes in an organism because no other proteins can perform the functions of essential proteins. Once an essential protein is removed, dysfunction or cell death results. Thus, several studies have been conducted to identify essential proteins. Experimental approaches for identifying essential proteins include gene deletion, 1 RNA interference, 2 and conditional knockouts. 3 However, these methods are labor-intensive and time-consuming. Hence, alternative methods for identifying essential proteins are necessary.
The essential protein classification problem involves determining the necessity of a protein for sustaining cellular function or life. Among the methods available for identifying essential proteins, machine-learning based methods are promising approaches. Therefore, several studies have been conducted to examine the effectiveness of this technique. Chin 4 proposed a double-screening scheme and constructed a framework known as the hub analyzer (http://hub.iis.sinica.edu.tw/Hubba/index.php) to rank the proteins. Acencio and Lemke 5 used Waikato Environment for Knowledge Analysis (WEKA) 6 to predict the essential proteins. Hwang et al 7 applied a support vector machine (SVM) to classify the proteins.
Protein-protein interactions (PPIs) are well known to be significant characteristics of protein function. Several studies have attempted to predict and classify protein function 8 as well as analyze protein phenotype 9 by studying interactions. A previous study 10 further suggested that essential proteins and nonessential proteins can be discriminated by means of topological properties derived from the PPI network. Despite these useful properties, however, identifying PPIs experimentally is time-consuming. With the advent of high-throughput techniques such as yeast two-hybrid, 11 which can identify several PPIs in one experiment, obtaining PPI information has become easier. Since a PPI network is similar to a social network in many respects, some researchers apply social network techniques to analyze PPI networks. Thus, several topological properties have been extensively explored and studied in recent years.
Fundamental properties, such as sequence or physicochemical properties of proteins, were not examined in detail in previous studies. This may be because each of these preliminary properties alone is somewhat less relevant to essentiality. However, this information is highly accessible because only sequence information is required to derive these properties. Hence, we included these properties in our study. For topological properties, in addition to physical interactions, we incorporated a variety of interaction information, including metabolic, transcriptional regulation, integrated functional, and genomic context interactions. Our experimental results revealed that these features provide either complementary information for essentiality identification or other biological justification.
To identify the reduced feature subset, which is crucial for biological processes, previous studies have used feature selection techniques. The advantages of this approach include storage reduction, performance improvement, and better data interpretation. 12 According to whether the feature selection procedure is bound to the predictor, the methods are roughly classified into three categories: filter, wrapper, and embedded. Filter methods often provide a complete ordering of available features in terms of relevance measures. Methods such as Fisher score, 12 mutual information, minimal redundancy and maximal relevance (mRMR), 13 conditional mutual information maximization (CMIM), 14 and minimal relevant redundancy (mRR) 15 belong to this category. Both wrapper and embedded methods involve the selection process as a part of the learning algorithm. The former utilizes a learning machine to evaluate subsets of features according to some performance measurements. For example, sequential backward and forward feature selection 12 falls into this category. Embedded methods directly perform feature selection in the learning process and are usually specific to given learning machines. Examples include C4.5, 16 Classification and Regression Trees (CART), 17 and ID3. 18 Additionally, some researchers proposed an information gain-based feature selection method, 19 which examines the effectiveness of classifier combination.
In this paper, we used two datasets. The first one was from
Next, SVM models were built using the selected feature subsets. In this study, the SVM software LIBSVM 23 was adopted for the classification models. Each model was applied to both imbalanced and balanced data sets. The results were compared with those of previous studies, and statistical tests were conducted to examine significance. For the imbalanced
Background
The data set
In this paper, we used two data sets for experiments:
The
In the above two data sets, the ratios of nonessential proteins to essential proteins were approximately 4:1 and 5:1, respectively. This data imbalance inevitably leads to biased fitting toward nonessential proteins during the learning processes. Thus, we constructed another balanced data set. Taking the first data set, for example, we randomly selected 975 nonessential proteins and mixed them with the essential proteins to form a balanced data set. In the new data set, the numbers of nonessential and essential proteins were equal.
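The balancing step above amounts to random undersampling of the majority class. A minimal sketch follows; the protein identifiers are hypothetical placeholders, and the 975/3900 counts reflect the approximate 4:1 ratio of the first data set.

```python
import random

def balance_dataset(essential, nonessential, seed=0):
    """Randomly undersample nonessential proteins so both classes
    have equal size, then mix them into one balanced data set."""
    rng = random.Random(seed)
    sampled = rng.sample(nonessential, len(essential))
    return essential + sampled

# Hypothetical identifiers: 975 essential vs. 3900 nonessential (~4:1).
essential = ["E%d" % i for i in range(975)]
nonessential = ["N%d" % i for i in range(3900)]
balanced = balance_dataset(essential, nonessential)
```

Fixing the seed makes the sampled subset reproducible across experiments.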
Bootstrap cross-validation
We used bootstrap cross-validation (BCV) to compare the performance of the two classifiers using the
Performance measures
In this study, the performance measures included precision, recall, F-measure (F1), Matthews correlation coefficient (MCC), and top percentage of essential proteins. Their formulas are given as follows:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-measure: F1 = 2 × Precision × Recall / (Precision + Recall)
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
Top percentage of essential proteins: the ratio of truly predicted essential proteins among the top-ranked candidates (defined formally in the top percentage analysis section)
Here, an essential protein is represented by the positive observation. True positive (TP), true negative (TN), false positive (FP), and false negative (FN) represent the numbers of true positive, true negative, false positive, and false negative proteins, respectively. The value n denotes the total number of predictions. In addition, receiver operating characteristic (ROC) curve 18 and area under curve (AUC) were used to evaluate the classification performance.
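The confusion-matrix measures above can be computed directly from the four counts; a minimal sketch:

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Precision, recall, F1, and MCC from confusion-matrix counts,
    with the essential class treated as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return precision, recall, f1, mcc
```

For example, with TP = 8, TN = 88, FP = 2, FN = 2, precision and recall are both 0.8 while MCC is roughly 0.78, illustrating that MCC also rewards correct negative predictions.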
Feature extraction
The feature set we used included sequence properties (S), such as amino acid occurrence and average amino acid PSSM; protein properties (P), such as cell cycle and metabolic process; topological properties (T), such as bit string of double screening scheme and betweenness centrality related to physical interactions; and other properties (O), such as phyletic retention and essential index. There were a total of 45 groups and 90 features in the
Protein features.
The remaining features are detailed in the Appendix.
Lin et al 26 and Chin 4 proposed the double screening scheme. They used multiple ranking scores to sort essential proteins. The drawback is that each protein does not have a unique score. Thus, we propose a bit string implementation to incorporate these two properties into a single score.
An example of our bit string implementation is shown in Tables 2 and 3. Suppose that four proteins, W, X, Y, and Z, are to be ranked. In the first iteration, we want to find the top one protein. We first select the top 2 proteins using ranking method A, which are W and X. Next, we use method B to rank these two proteins. The ranks of W and X are 2 and 1, respectively. Hence, in the first iteration, X is finally selected. It follows that the bit M[X, 1] is set to 1, and the others, M[W, 1], M[Y, 1], and M[Z, 1], are set to 0. In the second iteration, the 2 top-ranking proteins are to be found. First, the four proteins W, X, Y, and Z are selected, because they are the top 4 proteins by ranking method A. Next, with ranking method
Ranking by two different methods, where smaller numbers indicate higher ranks.
Bit strings by the double screening method.
There is still an issue in the bit string implementation, namely that M may be too sparse to be handled by classifiers. Since the number of proteins being selected is around n/2, the sum of about n/2 bits is close to 0. In our experience, this makes it difficult to distinguish between proteins. To overcome this problem, for each protein, we added another score n - r to the sum of the bit string, where r is the rank of the protein by ranking method B. In this study, we used DMNC as ranking method A and MNC as ranking method B. In this example, n = 4, so the values n - r of W, X, Y, and Z are 0, 2, 3, and 1, respectively. We summed these values with the bit strings; hence, the final scores are 0, 4, 4, and 1. The overall procedure is given in the Procedure bit string implementation of DSS.
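The worked example above can be reproduced with a short sketch. It assumes the iteration scheme implied by the text: at iteration k, take the top 2k proteins by method A, keep the top k of those by method B, and finally add n − r per protein.

```python
def double_screening_scores(rank_a, rank_b):
    """Bit string implementation of the double screening scheme (DSS).

    rank_a, rank_b map protein -> rank (1 = best) under ranking
    methods A (e.g. DMNC) and B (e.g. MNC)."""
    proteins = list(rank_a)
    n = len(proteins)
    bit_sum = {p: 0 for p in proteins}
    for k in range(1, n // 2 + 1):
        # Select the top 2k proteins by method A ...
        candidates = sorted(proteins, key=lambda p: rank_a[p])[:2 * k]
        # ... then keep the top k of those by method B; set their bits.
        for p in sorted(candidates, key=lambda p: rank_b[p])[:k]:
            bit_sum[p] += 1
    # Add n - r (r = rank by method B) to counter the sparsity of M.
    return {p: bit_sum[p] + (n - rank_b[p]) for p in proteins}
```

With the ranks of Tables 2 and 3 (method A: W, X, Y, Z; method B: Y, X, Z, W), this reproduces the final scores 0, 4, 4, and 1.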
Sequential backward feature selection method
SVM is a well-established tool for data analysis which has been shown to be useful in various fields, such as text summarization, 27 intrusion detection, 28 and image coding. 29 In this study, we utilized the SVM software developed by Chang and Lin, called LIBSVM. 23 To address the data imbalance, we propose the modified sequential backward feature selection method.
Since most data were nonessential, using accuracy alone as the objective, or adopting conventional feature ranking schemes, favored the negative data. As more and more features were excluded, overall accuracy declined. Since the number of negative data elements was greater than that of positive ones, the true-positive rate thus decreased more than the true-negative rate. Thus, features should be selected such that most positive samples are correctly classified without deteriorating the overall accuracy too much. In this sense, rather than using accuracy alone to guide the feature selection, we used a composite score
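The exact form of the composite score is given in the procedure description; as an illustration only, one plausible form that matches the stated goal (keep the true-positive rate up while not sacrificing too much overall accuracy) is a weighted blend of recall and accuracy. The weight w below is a hypothetical parameter, not a value from the paper.

```python
def composite_score(tp, tn, fp, fn, w=0.5):
    """Illustrative composite objective: blend of recall (true-positive
    rate) and overall accuracy. The blend weight w is an assumption."""
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return w * recall + (1 - w) * accuracy
```

With TP = 8, TN = 88, FP = 2, FN = 2, the score is 0.5 × 0.8 + 0.5 × 0.96 = 0.88, sitting between the recall and the accuracy, as intended.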
Experimental procedure and results
For comparison purposes, we used two feature selection methods: mRMR and CMIM. In the
Experimental procedure
The overall procedure of our experiments is illustrated in Figure 1 and is described as follows.

Flowchart for the construction of SVM models and performance comparison.
Stage 1: Determine benchmark feature set
For the
Stage 2: Tune SVM parameters for best performance
For the above two feature sets (Hwang's and Gustafson's), we ran the SVM software and tuned the SVM parameters to achieve the highest average performances.
Stage 3: Adopt best performances as reference performances
After determining the best SVM parameters for the feature sets of Hwang and Gustafson, we recorded the SVM parameters and results. To compare our results with those of other models, such as those obtained using our methods, mRMR, and CMIM, we used the same SVM software and adjusted the cost parameters of the SVM in order to achieve similar levels of precision.
Stage 4: Perform feature selection
We randomly chose 50% of available data. Next, the backward feature selection procedure was applied to these selected data. In the beginning of our feature selection procedure, we imposed no penalty on the score calculation. Hence, the procedure attempts to achieve the highest score. In the subsequent runs, we added penalties for feature sizes to the score calculation. Subsets with smaller feature size but only slightly inferior in performance were selected. To compare our results with those of other methods, we also used the mRMR and CMIM feature ranking methods and chose subsets as in Procedure backward feature selection.
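The backward selection loop of Stage 4 can be sketched as a greedy procedure that repeatedly drops the feature whose removal best maintains a size-penalized score. The `evaluate` callable is a stand-in for the 5-fold cross-validated composite score used in the paper, and the penalty weight is a hypothetical parameter.

```python
def backward_feature_selection(features, evaluate, penalty=0.0):
    """Greedy sequential backward selection: drop a feature whenever the
    penalized score (evaluate minus penalty * subset size) does not fall."""
    selected = list(features)
    best_score = evaluate(selected) - penalty * len(selected)
    improved = True
    while improved and len(selected) > 1:
        improved = False
        for f in list(selected):
            trial = [x for x in selected if x != f]
            score = evaluate(trial) - penalty * len(trial)
            if score >= best_score:
                best_score, selected, improved = score, trial, True
                break
    return selected
```

With penalty = 0, the loop keeps the highest-scoring subset; increasing the penalty trades a slight performance loss for smaller feature sets, mirroring the subsequent runs described above.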
Stage 5: Perform 10-fold and bootstrap cross-validations
The data were prepared in both balanced and imbalanced manners. For each data set, we randomly partitioned all data into 10 disjoint groups and used the feature subsets selected in the previous stage to calculate various performance measures. The 10-fold cross-validation was repeated 10 times, and average performance measures were computed. Next, a bootstrap sampling procedure was conducted and 200 bootstrap samples were produced, including both balanced and imbalanced samples. Each bootstrap sample was also partitioned for 10-fold cross-validations, and performance measures were calculated. Note that all models were examined with the same sets of data partitions for both conventional and bootstrapping cross-validations.
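The bootstrap cross-validation of Stage 5 pairs each bootstrap sample (drawn with replacement) with a fixed 10-fold partition; a minimal sketch, with the seed fixed so all models see identical partitions:

```python
import random

def bootstrap_folds(data, n_boot=200, k=10, seed=0):
    """Yield (bootstrap_sample, folds): each sample is drawn with
    replacement and then split into k disjoint folds for CV."""
    rng = random.Random(seed)
    for _ in range(n_boot):
        sample = [rng.choice(data) for _ in data]
        rng.shuffle(sample)
        folds = [sample[i::k] for i in range(k)]
        yield sample, folds
```

Because the generator is seeded, re-running it for every feature subset reproduces the same sequence of samples and partitions, which is what makes the later significance tests paired comparisons.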
Stage 6: Perform significance tests
Once bootstrap cross validations were carried out, the significance tests were adopted accordingly. In addition to the average values of AUC, precision, recall, F-measure, and MCC, we conducted a statistical significance test for these performance measures. Additionally, we calculated ROC curves and top percentage values for imbalanced experiments.
Backward feature selection and mRMR/CMIM feature ranking
We used 50% of available data elements for feature selection. Taking the
For the subsequent runs, the value of
For each setting of
In addition to the methods of Hwang et al or Gustafson et al, 21 we also used the mRMR 13 and CMIM 14 feature selection methods for comparison. Using mRMR as an example, the data used in our feature selection procedure were input into the mRMR program, which produced a ranking score for each feature. The feature with the lowest score was removed first, and a subsequent 5-fold cross-validation with the preserved features was performed to calculate the composite score
Table 4 shows the selected feature subsets of different sizes for
Selected features for
After the feature subsets were selected, to conduct a performance comparison as well as to cope with randomness, we used Hwang's method to perform 10 10-fold cross-validations. Here, the true positive rates and false positive rates were input into a separate software program to calculate ROC curves and AUC values. In this study, the software package we used was ROCR, which was developed by Tobias et al. 30,31 Thus, the reported performance measures, including AUC, F1, MCC, precision, and recall values and ROC curves, were averaged over 10 10-fold cross-validations.
For the
We applied the same procedure for the
Selected features for
Bootstrap cross validations
During the bootstrapping stage, for each bootstrap sample, an identical 10-fold partition was employed for all feature subsets to carry out cross-validations and compute various average performance measures. The procedure was repeated for 200 distinct bootstrap samples. In order to perform parametric significance tests, we evaluated whether the distribution of the resultant performance measures was normal and whether the variances obtained from different feature subsets were similar. Consequently, the 200 results of each performance measure for each feature subset were subjected to the Kolmogorov-Smirnov test. 31 This test examines the null hypothesis that no systematic difference exists between the standard normal distribution and the underlying distribution, against the alternative asserting a systematic difference. The threshold was set to 0.05. If the

The

The
For a given performance measure, since the variances obtained with the various feature subsets were quite similar, we used an analysis of variance 33 (ANOVA) test to examine whether differences existed among the performance measures of different feature subsets. Here, one variance was obtained from the multiple experiments with each feature subset. According to the ANOVA, differences did exist. Next, all of these measures were compared with their associated benchmark to calculate performance deviations. The average deviation corresponding to each type of performance measure was evaluated by checking whether the 95% confidence interval covered 0 to determine significance.
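The one-way ANOVA used here reduces to an F statistic over the per-subset groups of bootstrap results; a minimal self-contained sketch (groups are lists of one performance measure, one list per feature subset):

```python
def anova_f(groups):
    """One-way ANOVA F statistic: between-group mean square over
    within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F relative to the F(k−1, n−k) critical value indicates that at least one feature subset's performance distribution differs from the others.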
Performance comparison and significance tests
In this section, we compare our experimental results with those of other feature selection methods and previous studies. For conciseness, we show only the most prominent results associated with mRMR and CMIM. We observed that the feature sizes identified by these two methods were relatively large. The comparison of smaller feature subsets, together with the working principles of these methods, is detailed in the Appendix.
S. cerevisiae
Table 6 lists the average values of five performance measures associated with a variety of feature subsets, which were obtained by 10 10-fold cross-validations for imbalanced data. We adjusted the SVM cost parameters in order to achieve similar levels of precision. The first four rows show the results of CMIM32 (32 features), mRMR31 (31 features), Hwang's (10 features), and Acencio's (23 features); the values in parentheses represent the numbers of features. Results produced by our method are listed in the subsequent rows of the table. Significance tests were carried out with the bootstrap cross-validations over 200 bootstrap samples. The first three symbols, which can be plus (+) or minus (-), following each numerical value represent results significantly higher or lower than the benchmark. Rows that serve as benchmarks are marked with star (*) symbols for clarity. For example, the recall of N6 was significantly higher than that of Hwang, while its AUC was significantly lower than those of mRMR31 and Hwang. For the feature subsets with a prefix name 'N', the fourth symbol behind each numerical value indicates the significance between two neighboring rows. For example, for N7, its AUC was significantly higher than that of N6, and its recall value was also significantly higher than that of N8. For values of the same performance measure in each column, the best is underlined. Values in the last row show the results with the full set of 90 features.
Performance comparison for the imbalanced
Based on Table 6, the CMIM32, mRMR31, and Hwang's predictors outperformed Acencio's in all performance measures. For our feature subsets, the performance measures were slightly higher than Hwang's. For N8, there was no performance difference from Hwang's in AUC, while the remaining measure values were higher than Hwang's. When the feature size exceeded 8, except for precision values, the improvement over Hwang's was consistently significant in most cases. In comparison with mRMR, our method performed nearly as well as mRMR31 when the feature size was between 9 and 13, with the exception of AUC values. When the feature size ranged from 14 to 18, there was no performance difference between our model and mRMR31. The most prominent predictor was CMIM32. Except for AUC values, our results achieved similar levels of performance when the feature size exceeded 14. Note that the numbers of features in CMIM32 and mRMR31 were 32 and 31, respectively, which is much higher than ours.
Table 7 shows the average performance measures in balanced experiments of the
Performance comparison for the balanced
In Table 6, we can observe that feature subsets N5, N7, N9, N13, N15, and N16 showed significant improvement in performance despite being smaller in feature size when compared with neighboring rows. In Table 7, the significant subsets were N5, N6, and N9. In addition, as shown in Tables 6 and 7, our models performed as well as CMIM32 and mRMR31 when the feature size was 16 or 17. We used N5, N9, and N16 to draw ROC curves.
E. coli
Tables 8 and 9 show the average values of five performance measures associated with a variety of feature subsets, which were obtained by 10 10-fold cross-validations for imbalanced and balanced experiments, respectively. The first two rows show the results of CMIM09 (9 features) and Gustafson's (29 features).
Table 8 shows that Gustafson's predictors outperformed CMIM09 in most performance measures in the imbalanced experiments. For our feature subsets, the performance measures were slightly higher than CMIM09's. When the feature size exceeded 6, the improvement over CMIM09 was consistently significant. Compared with Gustafson's method, ours performed almost as well when the feature size was over 11. Note that the number of features in Gustafson's was 29, which was higher than ours. Table 9, except for the least effective predictor mRMR13, shows almost no performance difference among most feature subsets in the balanced experiments. For further ROC analysis, in addition to CMIM09, mRMR13, and Gustafson's, we further used N4, N8, N11, and N80 to draw ROC curves. This allowed us to observe the performance of small, medium-sized, and full feature sets.
Performance comparison for imbalanced
Performance comparison for balanced
ROC analysis
S. cerevisiae
Figure 4 illustrates the average ROC curves and AUCs of various feature subsets for the imbalanced data experiments. Setting aside the most competent predictor, CMIM32: although the AUC of N5 was higher than that of Acencio's, an intersection can be observed at 0.5 on the horizontal axis. This indicates that N5 was the better predictor when the allowed maximal false positive rate was below 0.5, whereas Acencio's was better than N5 when the allowed false positive rate exceeded 0.5. Comparing N9 and Hwang's method, both AUC values were similar. For the feature subsets with sizes exceeding 8 (not all shown in this figure), all true positive rates were either higher than or at least close to Hwang's. This was also supported by the significance tests in Table 6 and suggests that the feature subsets with sizes exceeding 8 achieved higher AUC performance than Hwang's predictor.

The average ROC curves and AUCs for the imbalanced
Figure 5 illustrates the average ROC curves and AUCs of various feature subsets for the balanced data experiments. CMIM32 again was the most competent predictor. Additionally, N16 also achieved the same level of AUC. For the feature subsets of sizes ranging from 5 to 18 (not all shown), their true positive rates were either higher or at least close to Hwang's level. Thus, N5, N6, …, N18 outperformed or performed equally well for various combinations of true and false positive rates in the balanced experiments. Similarly to the imbalanced data set, the more features, the higher the AUC values. However, the improvement in AUC over the feature addition was not as significant as those in the imbalanced experiments. It should be noted that both the ROC curve and AUC of Acencio's predictor were reproduced by our experiments and thus they were slightly different from the original values reported by Acencio and Lemke. 5

The average ROC curves and AUCs for the balanced
E. coli
For imbalanced data set, Figure 6 illustrates the average ROC curves and AUCs of various feature subsets. It shows that all curves were similar below the 10% horizontal range. This indicates that there was little difference when the allowable false positive rate was less than 10%. For the horizontal range above 10%, N80 was the highest performer, Gustafson and N11 were secondary, and N4 was the worst. In contrast to the imbalanced data set, for Figure 7 corresponding to the balanced data set, N4 and N8 were the best performers. The remaining predictors showed few differences.

The average ROC curves and AUCs for the imbalanced
Top percentage analysis
S. cerevisiae
Table 10 shows the average top percentage information for the imbalanced data set. The top θ probability is defined as the ratio of truly predicted essential proteins among the top-ranked θ × 975 proteins, where the total number of true essential proteins is 975. The top θ probability shows the likelihood that the proteins are essential if the user decides to choose a specific number of top-ranked candidates. It differs slightly from precision because the top-ranked candidates (the denominator) need not be classified as essential. CMIM32, mRMR31, and Hwang's results again served as benchmarks, and they are denoted by star '*' symbols in the table. The minus symbol following each value indicates that the value was lower than the benchmark results.
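The top θ probability defined above depends only on a ranking of candidates; a minimal sketch (scores and essential sets are hypothetical):

```python
def top_theta_probability(scores, essential, theta, n_essential=None):
    """Fraction of truly essential proteins among the top-ranked
    theta * n_essential candidates (higher score = more essential)."""
    if n_essential is None:
        n_essential = len(essential)
    top_k = round(theta * n_essential)
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return sum(1 for p in ranked if p in essential) / top_k
```

Note that the denominator counts ranked candidates, not positive classifications, which is why this measure is not the same as precision.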
Percentage of essential proteins in the imbalanced
Both mRMR31 and Hwang's predictor were extremely effective within the 10% range. This indicates that these predictors were quite preferable when the total number of true essential proteins was known and the allowable top-ranked candidates were within 10%. Most of our predictors outperformed them beyond 10%. For CMIM32, our predictors outperformed it beyond 30%. Thus, N14 may be a better choice because it is relatively effective beyond 10%. Figure 8 depicts the average top percentage curves.

The average ROC curves and AUCs for the balanced

The average top percentage curves for the imbalanced
E. coli
Table 11 shows the average top percentage information for the imbalanced data set. CMIM09, mRMR13, and Gustafson's results serve as benchmarks. The CMIM09 predictor was the most effective over the entire range. Most of our predictors outperformed these predictors beyond 15%. N9 was the most prominent, since it was relatively effective over the entire range. Figure 9 depicts the average top percentage curves.
Percentage of essential proteins in the imbalanced

The average top percentage curves for the imbalanced
Discussion
By inspecting the
For the
With experimental results for the two data sets, we conclude that phyletic retention is the most important feature for identifying essential proteins. It is defined as the number of organisms in which an ortholog is present. Gustafson et al 21 analyzed different organisms to calculate phyletic retention for
In this study, we compiled various interaction information, including physical, metabolic, transcriptional regulation, integrated functional, and genomic context interactions. The experimental results revealed that various properties, such as degrees, were to some extent identified as important features. This implies that interaction information, not limited to physical interactions, may also be closely related to essentiality. According to the literature, hubs of the networks, which possess an abundance of interaction partners, are important because they play central roles in mediating interactions among numerous less-connected proteins. Thus, proteins involved in these complex mediation processes are more likely to be crucial for cellular activity or survival.
For the feature selection proposed in this study, let the size of all available and target selected features be
If we inspect Tables 4 and 5, we can find that more than one-third of the features were not significantly relevant and thus were not selected. These features are relatively easy to remove during backward feature selection procedure at the beginning stage. According to the authors' experience, the rounds of retry
Conclusion and Future Work
In this study, we incorporated several protein properties, including sequence, protein, topological, and other properties. There was a total of 45 groups and 90 features. The features were included in two data sets for experiments:
In the imbalanced
For
There are several possible methods for further improving the prediction capability. Features related to protein sequence properties may also be useful for identifying essentiality. Furthermore, since proteins with similar primary structures may possess similar functions, essentiality may be addressed from the sequence motif perspective. 34 In addition to the above approaches, performance can be improved by incorporating other tools or constructing hybrid predictors. Among these, the majority vote 35 is a strategy for combining classifiers and represents the simplest method for categorical data fusion. According to the literature, 36 the prerequisite for improvement is that each individual classifier must contain distinct information for discrimination. Otherwise, some negative effects may be imposed on the constructed ensemble.
Appendix
Feature extraction
Outdegree and indegree related to transcriptional regulation interaction: The feature represents the number of outgoing (or incoming) links to the gene
Betweenness centrality related to transcriptional regulation interactions: Let σ
Betweenness centrality related to physical interactions: The value τ
Protein properties: Acencio and Lemke 5 discovered that the integration of topological properties, cellular components, and biological processes possess good capability for predicting essential proteins. Hence, our features also contained cellular components (cytoplasm, endoplasmic reticulum, mitochondrion, nucleus or other localization) and biological processes (cell cycle, metabolic process, signal transduction, transcription, transport or other process).
The above four feature sets were obtained from Acencio and Lemke. 5
Betweenness centrality related to integrated functional, PI and GC network: The values are defined identically as those mentioned above while the paths here are represented in terms of integrated functional, PI and GC network interactions.
Degree related to integrated functional, PI and GC network: The values were defined identically to those mentioned above, while the paths here are represented in terms of integrated functional, PI, and GC network interactions.
For the above two feature sets, we first collected network information from Hu et al 22 and then conducted calculations using iGraph software. 37
Maximum neighborhood component and density of maximum neighborhood component: The maximum neighborhood component (MNC) and density of maximum neighborhood component (DMNC) properties were proposed by Lin et al 26 and Chin. 4
For a protein
For a protein
Sequence features: We used ten feature sets from Lin et al. 20
Let
Protein length:
Cysteine count:
Amino acid occurrence: The composition of amino acid
Average cysteine position:
Average distance of every two cysteines:
Cysteine odd-even index:
Average hydrophobicity:
Average hydrophobicity around cysteine: The
Cysteine position distribution: For 1 ≤
Average PSSM of amino acid: The average PSSM of residue
Phyletic retention: Gustafson et al 21 discovered that essential proteins are generally more conserved than nonessential proteins. Phyletic retention of protein
Essential index: 7 The essential index measures the ratio of essential proteins in the neighbors
Clique level: 7 The clique level of protein
Number of paralogous genes: It has been shown that genes are more likely to be essential if no duplicate exists in the same genome. 21 This feature is defined as the number of paralogous genes present in the same genome. In addition, their BLASTP E-values must be less than 10−20, and the length ratio of the larger gene to the smaller must not exceed 1.33.
Open reading frame length: Gustafson et al 21 observed that ancestral genes are more likely to be essential and that proteins generally become larger throughout evolution. Consequently, the open reading frame length may indicate essentiality.
Confidence intervals of performance measures and informational odds ratios
All performance measures were multiplied by 100. The confidence intervals were set at 95%. We used the informational odds ratio (IOR) 32 to represent the association between essentiality and the predictions. The IOR measures how much more likely a protein is to be essential when the learning machine outputs essentiality rather than nonessentiality. A value of 1.0 indicates no association between essentiality and the predictions produced by the learning machines. All confidence intervals of performance measures and informational odds ratios corresponding to each prediction model are shown in Tables 12-15.
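As an illustration only (the precise IOR definition is given in the cited reference 32), one odds-ratio form consistent with the description above compares the odds of essentiality given a positive prediction against the odds given a negative prediction:

```python
def informational_odds_ratio(tp, tn, fp, fn):
    """Assumed odds-ratio form: odds of being essential given a positive
    prediction over the odds given a negative prediction."""
    p_pos = tp / (tp + fp)   # P(essential | predicted essential)
    p_neg = fn / (fn + tn)   # P(essential | predicted nonessential)
    return (p_pos / (1 - p_pos)) / (p_neg / (1 - p_neg))
```

Under this form, a value of 1.0 means a positive prediction carries no information about essentiality, matching the interpretation stated above.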
Confidence intervals of performance measures (x100) and informational odds ratios for models produced by the imbalanced
Confidence intervals of performance measures (x100) and informational odds ratios for models produced by the balanced
Confidence intervals of performance measures (x100) and informational odds ratios for models produced by the imbalanced
Confidence intervals of performance measures (x100) and informational odds ratios for models produced by balanced
Comparison with other feature selection methods
We first introduced two feature selection methods that served as benchmarks, mRMR 13 and CMIM, 14 both of which are theoretical methods. Next, we compared them with our feature selection method when the feature subsets of equal size were selected.
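The incremental search that mRMR uses can be sketched compactly if the relevance of each feature to the class label and the pairwise redundancy between features (both typically mutual-information values) are precomputed; the dictionaries below are hypothetical placeholders for those quantities.

```python
def mrmr_select(relevance, redundancy, k):
    """Incremental mRMR: greedily add the feature maximizing
    relevance(f) minus mean redundancy to the already-selected set."""
    features = list(relevance)
    selected = [max(features, key=relevance.get)]
    while len(selected) < k:
        rest = [f for f in features if f not in selected]
        def score(f):
            red = sum(redundancy[f][s] for s in selected) / len(selected)
            return relevance[f] - red
        selected.append(max(rest, key=score))
    return selected
```

The greedy step avoids the exhaustive search over all subsets: a highly relevant feature can lose to a slightly less relevant one that is less redundant with the features already chosen.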
Unlike other methods that select top-ranking features based on F-score or mutual information without considering relationships among features, mRMR accommodates both feature relevance with respect to class label and dependency among selected features. The strategy combines both the maximal relevance and the minimal redundancy criteria. In order to take the above two criteria into consideration and to avoid an exhaustive search, mRMR adopts an incremental search approach. That is, the
For CMIM, a feature
In the following paragraphs, we compare our method with the above two feature selection methods when feature subsets of equal size were selected. We first ran the SVM software with Hwang's or Gustafson's feature sets and tuned the SVM parameters to achieve the highest average performances. To fairly compare the methods given feature subsets of the same sizes obtained by our method, mRMR, and CMIM, we used the same SVM software and adjusted the cost parameters in order to achieve similar levels of precision. For
For the
Performance comparison of our method vs. mRMR for the imbalanced
Performance comparison of our new method vs. CMIM for the imbalanced
Performance comparison of our method vs. mRMR for the balanced
Performance comparison of our method vs. CMIM for the balanced
For the
Performance comparison of our method vs. mRMR for the imbalanced
Performance comparison of our method vs. CMIM for the imbalanced
Performance comparison of our method vs. mRMR for the balanced
Performance comparison of our method vs. CMIM for the balanced
For methods such as mRMR and CMIM, both relevance and information redundancy are taken into consideration. Therefore, the obtained feature subsets were quite compact as well as effective. However, the relevance may be appropriate only for some performance measures, such as classification accuracy or precision. Our method took both the performance and the feature size into consideration. Consequently, the resultant feature subsets were more effective in some other performance measures, given an equal number of features and similar precision values.
