Introduction
Cancer is the second most common cause of death in the USA, accounting for nearly one out of four deaths. In 2014, about 585,720 Americans are projected to die of cancer. A key challenge in cancer treatment is classifying a cancer into its correct subtype, because subtype-specific treatment increases efficacy and reduces toxicity. 1 However, classification of cancer is a challenging task. As a result, developments in cancer classification have been central to advances in medical treatment. Traditional classification techniques are mainly based on biological insights and the morphological appearance of the tumor. 2 Existing approaches in this category, however, have serious limitations and yield low prediction accuracies.3,4 Cancers with similar morphological appearance can follow significantly different clinical courses and respond differently to therapy. 5
Expression levels of human genes show cancer-type specific variations. Because of such variations, the gene expression levels collected from patients provide great potential to improve the accuracy of cancer classification.1,6,7 Currently, expression levels of thousands of genes can be measured in parallel using experimental techniques such as microarrays or RNA-seq. Compared with measuring other cancer markers such as chromatin states, microarray technology is usually experimentally easier and slightly cheaper. For these reasons, microarrays measuring expression levels of genes at the genomic scale are often preferred for cancer classification studies.
Computational methods for predicting cancer type using gene expression levels are relatively new strategies with a promise of significantly better accuracy compared to the classical methods.1,8,9 They provide an alternative, cheaper, and more efficient approach to the low-efficiency traditional cancer classification techniques. They are not meant to replace the existing morphology-based approaches. However, with the advances in technology for collecting data and the algorithms for studying these data, we believe that the computational cancer classification methods will provide an additional and useful resource in clinical practice. Computational classification methods often build a classifier from a dataset called the
Many supervised statistical learning algorithms such as decision trees, k-nearest neighbor (kNN), naïve Bayes (NB), support vector machines (SVM), and random forest (RF) have been used for the classification of cancer using gene expression datasets.10,11 Most of these traditional classification methods depend on the expression levels of individual genes. For example, the kNN classification method uses expression levels of hundreds of genes to classify the samples into distinct cancer classes. However, gene expression levels usually show high variation in many subtypes of cancer; leukemia subtypes, for instance, belong to this category. Furthermore, cancer can alter gene expression through primary and secondary effects. Primary effects are the transcriptional changes that result from genetic and epigenetic mutations. Secondary effects are the indirect transcriptional changes arising from regulatory interactions of genes with other primarily or secondarily altered genes. As a result, considering only the expression levels of individual genes is not very informative and can be misleading when classifying complicated cancer types. New techniques that can summarize the collective aberrations in the expression of sets of genes are needed.
Our contributions are as follows:
In this study, we propose a new network-based classifier, called NBC. Briefly, our method works in two phases: learning and prediction. In the learning phase, for a given gene expression dataset of samples, we first extract the features most relevant for classifying the training samples into their correct classes. Note that the features in our study are the genes. For this reason, from now on we use the terms feature and gene interchangeably, depending on the context. In the next step, using the gene expression levels, we create a gene association network for each class describing the dependency between the selected genes. Each node in an association network denotes a gene. Each edge between a pair of nodes indicates the correlation between the expression levels of the two corresponding genes in that cancer class. In the final step of the learning phase, for each gene, we create a predictor function using its immediate neighbors in the network model built for each class. In the prediction phase, for each class, we use these functions to predict the expression levels of a given test sample and compare the prediction to the observed expression levels of that sample. We assign the test sample to the class that yields the least prediction error.
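The two phases described above can be illustrated with a minimal Python sketch. It is not the authors' implementation (which is described in their Supplementary Files): the function names `fit_nbc` and `predict_nbc`, the default correlation threshold of 0.8, the least-squares linear predictor, and the mean squared error criterion are all assumptions made here for illustration.

```python
import numpy as np

def fit_nbc(X, y, threshold=0.8):
    """Learning phase (sketch): one correlation network and predictor set per class.

    X: (samples, genes) matrix already restricted to the selected genes.
    y: class label per sample.
    threshold: assumed cutoff on |Pearson correlation| for drawing an edge.
    """
    models = {}
    for c in np.unique(y):
        Xc = X[y == c]
        corr = np.corrcoef(Xc, rowvar=False)   # gene-by-gene correlations in class c
        np.fill_diagonal(corr, 0.0)            # no self-edges
        per_gene = []
        for g in range(X.shape[1]):
            nbrs = np.flatnonzero(np.abs(corr[g]) >= threshold)
            # Assumed predictor: least-squares fit of gene g from its
            # immediate neighbors plus an intercept term.
            A = np.column_stack([Xc[:, nbrs], np.ones(len(Xc))])
            coef, *_ = np.linalg.lstsq(A, Xc[:, g], rcond=None)
            per_gene.append((nbrs, coef))
        models[c] = per_gene
    return models

def predict_nbc(models, x):
    """Prediction phase (sketch): pick the class whose predictors best explain x."""
    errors = {}
    for c, per_gene in models.items():
        pred = np.array([np.append(x[nbrs], 1.0) @ coef for nbrs, coef in per_gene])
        errors[c] = float(np.mean((pred - x) ** 2))  # assumed error measure
    return min(errors, key=errors.get)
```

In this sketch a gene with no neighbors above the threshold is predicted by its class mean alone, which is one possible convention among several.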
We compare NBC to five traditional classification methods using two- and multi-class cancer microarray datasets. More specifically, we compare our network-based classifier, NBC, to SVM, NB, C4.5, kNN, and RF using five recently published large-scale cancer microarray datasets. The datasets we used in our experiments cover a wide spectrum of scenarios; they include gene expression levels from cancer to normal patients, different cancer cell lines, or cancer subtypes.
One issue that affects the outcome of the classification analysis is the number of genes in the microarray datasets. Many of the genes are irrelevant to the classification of the cancer types. Thus, selecting the relevant genes improves the accuracy of the classification algorithms. Here we also compared the accuracy of five feature selection methods, namely support vector machine feature selection (SVM-FS), chi-square (χ2), 12 symmetrical uncertainty (SU), 13 information gain (IG), 14 and prediction analysis of microarrays (PAM). 15 We evaluated the class prediction efficiency of each classifier using these five feature selection methods.
In our experiments, we also studied the correlation-based co-expression network topology (degree, clustering coefficient, and closeness centrality distributions) of different cancer classes. For this purpose, we compared the distinctive networks that were suggested by the NBC method. Our analysis shows that different cancer classes lead to drastic changes in the network properties, which suggests that cancer leads to major changes in the gene-to-gene interactions in different cancer classes.
Methods
In this section, we first provide a short description of the datasets used in our study. Then, we present a detailed description of the NBC classifier and five state-of-the-art classifiers, namely C4.5, kNN, NB, SVM, and RF. We also describe the five feature selection methods: SVM-FS, χ2, IG, SU, and PAM.
To observe the performance of our methods under a broad spectrum of scenarios, we have used five cancer microarray datasets in our experiments with varying characteristics. The datasets are summarized below.
Lung Cancer Dataset 16
The lung cancer dataset consists of 120 samples (60 paired samples from tumor (class 1) and normal (class 2) tissues) with 54,675 probe sets from the Affymetrix chip. The dataset can be obtained from NCBI (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE19804).
Breast Cancer Dataset 17
The breast cancer dataset consists of 162 samples with 54,675 probe sets from the Affymetrix chip. The samples contain 57 women with breast cancer diagnosis, 37 women with benign diagnosis, 31 women with normal initial mammogram, 15 breast cancer patients following surgery, 15 patients with gastrointestinal cancer, and 7 patients with brain tumor. We excluded the 15 breast cancer patients following surgery since we do not know whether any of these patients had recurring diagnosis. We have also excluded 15 gastrointestinal cancer and 7 brain tumor patients as we focus on breast cancer. In the final dataset, we had 125 samples belonging to three different classes (class 1 = PBMC_Normal (31 samples), class 2 = PBMC_Malignant (57 samples), and class 3 = PBMC_Benign (37 samples)). The complete dataset can be obtained from NCBI (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE27562).
NCI60 Dataset 18
This dataset contains 174 samples spanning nine tumor types with 54,675 probe sets from the Affymetrix chip. Nine cancer tissue origins consist of class 1 = leukemia (18 samples), class 2 = breast (15 samples), class 3 = ovarian (21 samples), class 4 = melanoma (26 samples), class 5 = central nervous system (18 samples), class 6 = colon (21 samples), class 7 = renal (23 samples), class 8 = non-small cell lung (26 samples), and class 9 = prostate (6 samples). Among these, we excluded six prostate samples (class 9), since six samples are too few for classification studies. The NCI60 dataset can be obtained from NCBI (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE32474).
Leukemia Dataset 19
The leukemia dataset consists of 574 samples in 10 classes with 22,283 probe sets from the Affymetrix chip. Four CD34 and four CD10CD19 samples are excluded from the dataset since four samples are too few for classification studies. To focus on leukemia, we also excluded the 153 samples that do not have a known karyotype. The pruned dataset contains 413 samples belonging to seven different leukemia types: class 1 = hyperdiploid (115 samples), class 2 = TCF3-PBX1 (40 samples), class 3 = ETV6_RUNX1 (99 samples), class 4 = MLL (30 samples), class 5 = PH (23 samples), class 6 = hypodiploid (23 samples), and class 7 = T-ALL (83 samples). The complete leukemia dataset can be obtained from NCBI (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE33315).
Colon Cancer Dataset 20
The colon cancer dataset consists of 566 samples in six classes with 54,675 probe sets from the Affymetrix chip. The dataset contains six colon cancer subtypes: class 1 = CINImmune-Down (116 samples), class 2 = dMMR (104 samples), class 3 = KRASm (75 samples), class 4 = CSC (59 samples), class 5 = CINWntUp (152 samples), and class 6 = CINnormL (60 samples). The complete colon cancer dataset can be obtained from NCBI (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE39582).
Classifiers
We have compared the NBC classifier to five state-of-the-art classifiers in our experiments. We believe that these methods collectively constitute a significant portion of the key literature on this topic. We provide a short description of all of these five classifiers as well as our method below. We also provide a detailed description of our method in the Supplementary Files.
kNN
This method is a non-parametric similarity-based classification algorithm. 21 In this method, a sample is classified by majority vote of its k nearest neighbors (kNNs). More specifically, each testing sample is assigned to the class most common among its kNNs. We say that a sample in a training set is a neighbor of a given test sample if that training sample is one of the
C4.5
This method builds a decision tree, which consists of a set of internal and leaf nodes. The internal nodes are associated with a splitting criterion, which consists of a splitting feature and one or more splitting rules defined on this feature. The leaf nodes are labeled with a single class label. C4.5 employs a two-step algorithm to generate decision trees from a dataset, using information entropy. 22 In the first step, C4.5 builds decision trees from a set of training data, using the concept of information entropy. The dataset is a set of already classified samples
NB
This method uses probabilistic induction to assign class labels to test samples, assuming independence among the features. 23 Briefly, a naïve Bayes (NB) classifier creates rules based on Bayes' theorem. Bayes' theorem is a result in probability theory, which relates the conditional and marginal probability distributions of random events:
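In its standard textbook form, for two events A and B with P(B) > 0, the theorem reads as follows. The symbols A, B, c, and x_i are chosen here for illustration and may differ from the notation of the original article. The second line shows the usual NB consequence of the feature-independence assumption for a class c and feature values x_1, ..., x_n.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
\qquad\Longrightarrow\qquad
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```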
In short, the NB method works as follows. Given a sample
Using Bayes' rule above, we label a new case
SVM
This method is one of the fundamental supervised machine learning algorithms for binary classification.25,26 The most commonly used SVMs in the biological data classification are the linear SVMs because of their simplicity of implementation. For a given training dataset,
Linear SVM is a binary classifier; however, it can also be used for multi-class datasets in the same way as any multi-class problem can be reduced to binary classification problems. There are several strategies for this purpose. Here, we use one of the most common strategies known as the one-versus-all approach. This strategy transforms the single multi-class problem into
RF
This method is an ensemble approach that builds multiple decision trees (described above in the C4.5 section) using the training dataset to achieve a better classifier performance. 29 The test samples are classified by assigning them to the classes that take the majority vote over all the decision trees. The method works as follows. In the first step,
NBC Method
This method works in two phases: (A) learning and (B) prediction. In this section, we will summarize this method. A detailed description of this method is available in the Supplementary File.
The learning phase of the NBC method works in three steps: (i) feature selection, (ii) correlation network creation, and (iii) prediction function learning. In this phase, for a given gene expression training dataset
The prediction phase of the NBC method works in two steps: (i) gene expression prediction and (ii) class assignment. For a given testing sample set
Feature Selection Methods
Feature selection methods rank the features of the given dataset based on their importance. They remove noisy features (genes in our study) from the dataset, with the goal of increasing the classification accuracy while reducing the running time.14,30–33 The total number of genes available in the datasets used in our study is very large (ie, more than 10,000). It is therefore critical to reduce the number of genes to a small subset when classifying the samples. Typically, 50–150 genes have been selected in the literature for binary classification studies.1,34,35 Herein, we use a subset of the top 50–300 genes, depending on the feature selection algorithm employed. In this study, we compare five feature selection methods for selecting a small set of genes: SVM-FS, χ2, IG, SU, and PAM. These methods are summarized below.
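The rank-then-truncate idea above can be sketched in Python as follows. The sketch uses the standard textbook definitions of information gain and symmetrical uncertainty on discretized expression values. The function names, the equal-width discretization into three bins, and the default scoring function are assumptions made here and are not taken from the original study, which used Weka for these methods.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a discrete label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(feature, y):
    """H(class) - H(class | feature) for one discretized gene."""
    h_cond = 0.0
    for v in np.unique(feature):
        mask = feature == v
        h_cond += mask.mean() * entropy(y[mask])
    return entropy(y) - h_cond

def symmetrical_uncertainty(feature, y):
    """Standard SU: information gain normalized by the two entropies."""
    denom = entropy(feature) + entropy(y)
    return 0.0 if denom == 0 else 2.0 * information_gain(feature, y) / denom

def top_k_genes(X, y, k, score=symmetrical_uncertainty, bins=3):
    """Rank genes (columns of X) by `score` and return the indices of the top k."""
    scores = []
    for g in range(X.shape[1]):
        # Assumed discretization: equal-width bins over each gene's range.
        edges = np.histogram_bin_edges(X[:, g], bins=bins)
        disc = np.digitize(X[:, g], edges[1:-1])
        scores.append(score(disc, y))
    order = np.argsort(scores)[::-1]  # highest score first
    return order[:k]
```

Passing `score=information_gain` switches the ranking criterion without changing the rest of the procedure.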
SVM-FS
This feature selection method is based on the SVM classifier algorithm described above. SVM classifier is trained as
χ2 Feature Selection Method
This method evaluates each gene's importance for classification individually by measuring the χ2 statistics with respect to the classes. 12 First, the gene expression values are discretized into several intervals. Then using these discrete expression values, χ2 value of each gene is calculated as described below. Let us denote the number of samples in the
We then pick the top
IG Feature Selection Method
This method evaluates the worth of a gene by measuring the information gain (IG) with respect to the class. 14 Let us denote the total entropy of the class with
We then pick the top
SU Feature Selection Method
This method evaluates the worth of a gene by measuring the symmetrical uncertainty (SU) with respect to the class.13,36 SU is measured by
where
PAM Feature Selection Method
This method is based on the nearest shrunken centroid method. 15 The method works as follows. First, the method computes a centroid for each class and an overall centroid. Centroid for a class (
Cross-validation
Cross-validation is a model validation method for assessing how well classifiers generalize to independent datasets. It is a key step in classifier construction for assessing classifier performance.
Network Measures
Analysis of the differences and the similarities between the correlation-based co-expression networks in different cancer classes is key to understanding cancer. In this study, we used three network measures – namely, degree distribution, clustering coefficient, and closeness centrality – to compare different cancer classes. These network measures have been calculated as described below.
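As a rough illustration of these three measures, the following Python sketch computes them on an adjacency matrix derived from thresholded Pearson correlations. It uses the common textbook definitions: local clustering coefficient, and closeness as the number of reachable nodes divided by the sum of shortest-path distances. The function names and the default threshold are assumptions, and the exact formulas used in the study may differ.

```python
from collections import deque
import numpy as np

def correlation_network(X, threshold=0.8):
    """Boolean adjacency matrix: edge where |Pearson r| >= threshold (assumed cutoff)."""
    corr = np.corrcoef(X, rowvar=False)
    adj = np.abs(corr) >= threshold
    np.fill_diagonal(adj, False)  # no self-loops
    return adj

def degrees(adj):
    """Degree of every gene; a histogram of these gives the degree distribution."""
    return adj.sum(axis=1)

def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    nbrs = np.flatnonzero(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = adj[np.ix_(nbrs, nbrs)].sum() / 2  # each edge is counted twice
    return float(2.0 * links / (k * (k - 1)))

def closeness_centrality(adj, v):
    """Reachable nodes divided by the sum of shortest-path distances from v."""
    dist = {v: 0}
    queue = deque([v])
    while queue:  # breadth-first search over the unweighted graph
        u = queue.popleft()
        for w in np.flatnonzero(adj[u]):
            w = int(w)
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    total = sum(dist.values())
    return 0.0 if total == 0 else (len(dist) - 1) / total
```

Isolated genes receive a closeness of 0 in this sketch, which is a convention chosen here rather than one stated in the text.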
Degree Distribution
The degree of a node (gene in our case) is the number of connections it has to other genes. The degree distribution
Closeness Centrality
The closeness centrality of a gene
Clustering Coefficient
Clustering coefficient of a gene measures the degree to which the adjacent genes of that gene in a graph tend to connect together. More specifically, the clustering coefficient of a gene is defined as
where
Results
In this section, we evaluate the performance of our network-based classifier NBC. Many supervised classification algorithms have been proposed for predicting cancer types.10,11 Among them, we compared NBC to five traditional classifiers (SVM, NB, kNN, C4.5, and RF). Collectively, these five methods cover a broad spectrum of alternative methods. We selected a subset of the available genes in the transcriptome using five alternative feature selection methods, namely, SVM-FS, IG, SU, χ2, and PAM. We implemented the SVM, NB, kNN, C4.5, and RF classifiers and the PAM feature selection method in MATLAB, and the NBC classifier and the SVM-FS feature selection method in the C programming language. We used the Weka platform 39 for the other feature selection methods.
We used 10-fold cross-validation to calculate the prediction accuracy of each of the classifiers. More specifically, we kept one fold (one-tenth of the set of all samples) as the test samples, and selected relevant genes and trained classifiers on these genes using the remaining nine folds as the training data. This way, we avoided any positive bias toward the test samples. We repeated this 10 times by using each fold as the test data. We reported the average accuracy we observed in all the 10 folds.
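The protocol above, in which genes are selected and the classifier is trained inside each fold so that the test fold is never seen, can be sketched as follows. The function `cross_validated_accuracy` and its callback parameters are hypothetical names. The variance-based gene selector and the nearest-centroid classifier are simple stand-ins chosen only to make the sketch self-contained, not the methods evaluated in the study.

```python
import numpy as np

def cross_validated_accuracy(X, y, select, fit, predict, n_folds=10, seed=0):
    """Average accuracy over n_folds, with feature selection redone per fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    accuracies = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Gene selection sees only the nine training folds.
        genes = select(X[train_idx], y[train_idx])
        model = fit(X[train_idx][:, genes], y[train_idx])
        pred = predict(model, X[test_idx][:, genes])
        accuracies.append(np.mean(pred == y[test_idx]))
    return float(np.mean(accuracies))

# Stand-in components (assumptions for illustration only).
def top_variance_genes(X, y, k=2):
    """Keep the k genes with the largest variance in the training data."""
    return np.argsort(X.var(axis=0))[::-1][:k]

def fit_centroids(X, y):
    """Nearest-centroid model: one mean expression vector per class."""
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def predict_centroids(model, X):
    """Assign each sample to the class with the closest centroid."""
    classes, centroids = model
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]
```

Selecting genes on the full dataset before splitting would leak information from the test fold into training, which is the positive bias the per-fold design avoids.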
We tested the classifiers and feature selection methods using five cancer microarray datasets with varying characteristics. Descriptions of these datasets are provided in the Methods section. The microarray datasets selected for this study are diverse: they include gene expression levels from cancer vs normal patients, different cancer cell lines, or cancer subtypes. To ensure that the noise arising from combining different experimental techniques does not unfairly disadvantage any of the methods we compared, we focused on large-scale Affymetrix cancer microarray datasets in the gene expression omnibus (GEO) database, rather than combining datasets from varying experimental sources. We first normalized each of these datasets so that each gene expression value has a mean of 0 and a variance of 1. Next, we applied the feature selection methods to these normalized datasets. These methods rank all the genes with respect to their importance for cancer classification. We choose the top
In the following sections, we first report our findings on the comparison of feature selection and classification methods (Sections A, B, and C). Then, we analyze the NBC method in detail (Sections D, E, and F).
Evaluation of Feature Selection Methods
The choice of feature selection method and the underlying dataset impact the classification accuracy. The first question we seek to answer is how NBC compares to existing classification methods. Here, we present an extensive comparison of all the six methods on five different cancer datasets by using five alternative feature selection methods. Notice that each method can have a peak performance (ie, highest accuracy) at different number of features. As a result, setting the number of genes to a fixed value would give unfair advantage to some of the methods we evaluate. To avoid this, we report the highest accuracy of each method along with the number of features at which it is capable of achieving that accuracy. Table 1 presents the detailed results (see the last two columns).
A summary of 10-fold cross-validation prediction accuracy for lung, breast, NCI60, leukemia, and colon cancer datasets for different feature selection and classification methods when the maximum number of allowable genes is set to 100. In the tables, we report the best accuracy obtained by each combination of classifier and feature selection method, the number of genes used to obtain this accuracy level, and average (mean) and standard deviation (Std) of accuracies for each classifier and feature selection method. In the table, entries containing a pair of numbers (X, Y ) indicate the following: X refers to the number of genes with which we obtained the best classification accuracy (Y ). We use bold face to highlight the top five highest accuracies and the top two mean accuracies.
On the leukemia, colon, and breast cancer datasets, if the expression levels of only up to 100 genes are used, the gap between the worst and best mean accuracy levels for feature selection methods is large (62.43 vs 86.04%, 52.53 vs 68.49%, and 46.00 vs 55.33%, respectively). These gaps are significant, particularly given that the standard deviations of different feature selection methods vary from 4.38 to 10.90, 4.12 to 7.86, and 1.65 to 5.22, respectively. Thus, even if we consider the largest standard deviations for these datasets, the gaps are close to or more than two standard deviations. As the maximum number of admissible genes used for classification increases from 100 to 300, the gap between the worst and best mean accuracy levels for the leukemia, colon, and breast cancer datasets shrinks (70.18 vs 87.69%, 56.36 vs 69.17%, and 51.07 vs 56.67%, respectively) (Supplementary Tables 4 and 5). Similar but less severe behavior is observed for the lung and NCI60 cancer datasets. Overall, we observe that the SVM-FS and PAM feature selection methods do not perform as well as the other feature selection methods across the cancer datasets (Table 1). The only exception is the breast cancer dataset, for which SVM-FS performs comparably to the other feature selection methods. Thus, the SVM-FS and PAM feature selection methods perform worse than the other feature selection methods on average, especially when a small number of genes is selected. Among the remaining three feature selection methods, there is no significant winner. However, the SU method shows slightly better accuracy in most of the cancer datasets. Although the relevance of the genes selected by the feature selection methods to the specific cancer types is not a focus of our study, we checked the appropriateness of the genes selected by the SU method for different cancer types.
The genes selected by this method are usually relevant to the specific cancer type. As an example, the gene set selected for the colon cancer classification includes the colon cancer-related genes such as
We observe a reduction in the variation of the accuracy levels between different feature selection methods as the maximum number of permissible genes increases from 100 to 300. This suggests that although there might be variations in the top 100 genes, this variation decreases as the number of genes increases. However, the accuracy levels of different classifiers show more variation for the SVM-FS and PAM feature selection methods than for the other feature selection methods. This suggests that the SVM-FS and PAM feature selection methods are more sensitive to the underlying classification algorithm. The variation in classification accuracy is generally comparable and lower for the genes selected by the other methods.
Finally, we observe that as the number of genes used in the classification increases, the accuracy levels increase slightly, but the above observations for the feature selection methods remain unchanged.
The choice of an appropriate classification method is of utmost importance for assigning samples to the correct cancer types. Here, we compare the performance of the NBC method to the five traditional classifiers described above by considering the average performance across all five feature selection methods (see the last two rows in Table 1 for each dataset).
Our results suggest that the NBC method performs at comparable or better accuracy levels than the traditional classification methods for nearly all datasets (Table 1, and Supplementary Tables 4 and 5). The NBC classifier on average achieves 54.88–99.05% accuracy on the five cancer microarray datasets using up to only 100 genes. The classification accuracies for the other classifiers such as SVM in these five datasets are usually lower than that of the NBC method. This suggests that the performance of our network-based cancer classifier is very promising.
The NBC method outperforms the other classifiers for the NCI60, breast, and colon cancer datasets. For the leukemia and lung cancer datasets, the SVM and RF classifiers provide the best mean performance, respectively. The C4.5 decision tree classifier shows the worst mean performance for the NCI60, lung, and colon cancer datasets. The NB and kNN classifiers show the worst mean performance for the leukemia and breast cancer datasets, respectively. These behaviors are generally independent of the number of genes used for classification (Supplementary Tables 4 and 5).
Classifiers show different variation levels in their accuracies. While the NBC classifier exhibits the smallest standard deviation for the breast cancer dataset, the RF, kNN, and SVM classifiers show the least variation for the lung, NCI60, leukemia, and colon cancer datasets. Even for the four datasets (lung, NCI60, leukemia, and colon cancer) for which NBC does not have the smallest standard deviation, its standard deviation remains very small. These observations suggest that the NBC method, while providing comparable or better accuracy than the other methods, is also robust to the choice of feature selection method.
Finally, we observe that as the number of genes used in the classification increases, the above observations for the classification methods are mostly conserved (see Supplementary Tables 4 and 5).
Feature Selection and Classification Method Combination
So far, we have discussed the average behavior of different feature selection or classification methods. Here, we focus on specific combinations of feature selection and classification methods. An appropriate combination of feature selection and classification methods leads to the best performance on different multi-class cancer datasets. 46 We focus on the accuracy ranking of feature selection and classification methods using up to 100 genes selected from the five datasets (Table 2). However, our main results are mostly conserved as the number of genes used for classification increases (Supplementary Tables 6 and 7). Following from the results in Tables 1 and 2, we observe that the NBC and RF classifiers are the two best classifiers when we adopt SU as the feature selection method. The NBC classifier achieves the highest prediction accuracy in the NCI60 (accuracy: 100%) and breast cancer (accuracy: 59.2%) datasets, and its performance in the lung (accuracy: 95.83%; ranking: 2), colon (accuracy: 73.85%; ranking: 4), and leukemia (accuracy: 85.71%; ranking: 10) datasets is also very good. In the lung, colon, and leukemia cancer datasets, the combinations of the SVM classifier and the IG feature selection method (accuracy: 96.67%), the NBC classifier and the IG feature selection method (accuracy: 75.27%), and the kNN classifier and the IG feature selection method (accuracy: 90.8%) show the best performances, respectively. Although the RF classifier does not achieve the best accuracy in any of the five datasets when we adopt SU as the feature selection method, this combination consistently achieves high accuracy levels. Its rankings in the lung, breast, NCI60, leukemia, and colon cancer datasets are, respectively, 2, 3, 15, 2, and 3 (Table 2). Consequently, when we combined the rankings across the datasets into an average ranking, we observed that the RF classifier combined with the SU feature selection method is the second best algorithm (Table 2).
These observations suggest that our new classifier is comparable to or better than the state-of-the-art classifiers including SVM and RF. Furthermore, we observe that NBC works best when it is combined with SU as the feature selection method.
A summary of accuracy rankings for lung, breast, NCI60, leukemia, and colon cancer datasets for different feature selection and classification methods when the maximum number of allowable genes is set to 100. In the first five tables, we report the best accuracy obtained by each combination of classifier and feature selection method, and their ranking. In these tables, entries containing a pair of numbers V :W indicate the following: V refers to the best classification accuracy and W refers to its ranking. We use bold face to highlight the top five highest accuracies. In the last table, we report the ranking of each combination of classifier and feature selection method based on the average accuracy obtained over all the five datasets.
Another important observation that follows from these results is that an inappropriate combination of feature selection and classification methods may lead to poor performance. In Tables 1 and 2, we observe that the C4.5 and NB classifiers yield poor accuracy when we adopt PAM as the feature selection method. While the C4.5 classifier combined with the PAM feature selection method shows the poorest accuracy in the NCI60 and colon cancer datasets (accuracies: 72.02 and 46.82%, respectively), its performance in the lung (accuracy: 89.17%; ranking: 23), breast (accuracy: 48.00%; ranking: 25), and leukemia (accuracy: 67.31%; ranking: 25) datasets is also disappointing. Similarly, NB combined with the PAM feature selection method shows the worst performance in the breast cancer dataset (accuracy: 38.40%). In addition, the performance of this combination in the other datasets is also poor (ranking in lung cancer: 16, NCI60: 25, leukemia: 26, and colon cancer: 28). In the lung cancer and leukemia datasets, the C4.5 and NB algorithms show the worst performance when they are used in conjunction with the SVM-FS feature selection method (83.33 and 44.79%, respectively). These observations suggest that the kNN, NB, and C4.5 algorithms are not competitive against the NBC, RF, and SVM classifiers, especially when the latter are used in combination with SU as the feature selection method.
So far, in our experiments we have demonstrated that NBC yields better or similar accuracy as compared to state-of-the-art methods. Next, we focus further on the NBC method to understand its characteristics, strengths, and limitations. Briefly, two parameters characterize the predictive models generated by NBC. These are (i) the number of genes selected and (ii) the Pearson correlation threshold. These two parameters control the number of nodes and edges in the network models generated by NBC, respectively. We vary the values of these two parameters and report the accuracy of NBC for each parameter setting. More specifically, we vary the number of genes in the [50:300] interval and the Pearson correlation threshold in the [0.6:0.95] interval. Figure 1 presents the results.

Accuracy dependency of the NBC method to the number of genes and gene-to-gene associations in the network. Heat maps depicting the accuracy levels for varying number of genes and gene-to-gene interaction density are shown. In the figure, columns refer to the cancer datasets: leukemia, breast, lung, NCI60, and colon. Similarly, rows correspond to the feature selection methods: SVM-FS, symmetrical uncertainty, χ2, information gain, and PAM. The
The dependency of NBC on these two parameters is also governed by the underlying feature selection method and the dataset. While the SU, χ2, and IG algorithms show qualitatively similar behavior across datasets, the SVM-FS and PAM feature selection methods behave very differently. For example, with SVM-FS, classification accuracy mostly decreases as the number of genes increases at low correlation thresholds, in contrast to the SU, χ2, and IG feature selection methods. The accuracy behavior of the NBC classifier also differs markedly between cancer datasets, possibly because of their different network structures (discussed below).
Despite the feature selection method and dataset dependencies, there are clear patterns with regard to the number of genes used in the classification and the correlation threshold. For example, in general, the accuracy of the NBC method does not increase as the number of genes increases. We observe that our network-based classifier can predict at high accuracy levels while using only up to 75 genes in the lung, breast, and NCI60 cancer datasets. The other algorithms, in particular SVM, usually need more genes to reach similar accuracy levels in these three datasets (Supplementary Table 5). For example, in the breast cancer dataset, while NBC reaches an accuracy of 59.20% using only 50 genes, the SVM and C4.5 classifiers use 300 genes to reach approximately the same accuracy. The leukemia dataset was generally difficult for all the algorithms, which needed more genes to reach high accuracy levels (Supplementary Table 5). Since measuring gene expression levels is expensive, these observations suggest that our method is probably more relevant for biological applications, because it can reach high accuracy levels using fewer genes than traditional classifiers.
With regard to the Pearson correlation threshold, the NBC method shows nonlinear behavior. When the threshold is small, the network contains many false-positive connections that should not exist. When the threshold is large, we probably miss many gene-to-gene associations that should actually exist in the network. For this reason, as the correlation threshold increases, the accuracy of the NBC method first increases and then decreases dramatically. Despite this general behavior, no single threshold works best for all the cancer datasets. While a correlation cutoff of ∼0.7 works best for lung cancer, a cutoff of ∼0.8–0.9 is needed for the NCI60 dataset.
NBC method constructs a different and unique network for each cancer class and uses these networks and predictor functions constructed by linear regression to predict expression levels for the selected genes in each sample. In the next step, for each sample, it compares these class-specific predictions to actual gene expression levels in the sample. The method assigns the sample to the class that gives the minimum distance between the predicted and actual gene expression levels in the L2 norm. To see how distinctive our method is in separating different classes, we computed the prediction errors for inter- and intra-subclasses using each class-specific predictor function of the NBC method. More specifically, we computed the error using the relative L2 norm. Relative L2 norm is defined as ‖

Intra- and inter-class prediction errors for different cancer datasets. In each graph,
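The class-assignment rule described above can be sketched as follows. This is a minimal illustration, not the NBC implementation: the class-specific predictions are taken as given (the regression step is not reproduced), the function names and toy values are hypothetical, and the relative L2 error is assumed to take the common form ‖x − x̂‖2 / ‖x‖2, where x is the actual and x̂ the predicted expression vector, because the definition is not reproduced in full here.

```python
import math

def l2_distance(actual, predicted):
    """Euclidean (L2) distance between two expression vectors."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)))

def relative_l2_error(actual, predicted):
    """Assumed definition: ||actual - predicted||_2 / ||actual||_2."""
    norm = math.sqrt(sum(a * a for a in actual))
    if norm == 0:
        raise ValueError("relative error is undefined for an all-zero vector")
    return l2_distance(actual, predicted) / norm

def assign_class(actual, predictions_by_class):
    """Return the class whose predicted profile is closest in the L2 norm."""
    return min(
        predictions_by_class,
        key=lambda c: l2_distance(actual, predictions_by_class[c]),
    )

# Toy example (hypothetical values): one sample, two class-specific predictions.
actual = [3.0, 4.0]
predictions = {"class_A": [3.0, 3.0], "class_B": [0.0, 0.0]}

label = assign_class(actual, predictions)
errors = {c: relative_l2_error(actual, p) for c, p in predictions.items()}
print(label, errors)
```

In this toy case the class_A prediction lies at distance 1 from the sample and the class_B prediction at distance 5, so the sample is assigned to class_A. A low error for the matching class and a high error for the others is the pattern that an intra- versus inter-class comparison is meant to expose.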
Next, we focus on one of the most fundamental characteristics of the network models constructed by the NBC method: the density of the resulting networks (i.e., the average number of gene-to-gene associations) for different cancer datasets. Figure 3 plots the results for varying numbers of genes and Pearson correlation threshold values. We observe that network density depends on the number of genes and the correlation threshold. In general, as the number of genes increases and the correlation threshold decreases, the number of associations in the network increases. While this qualitative behavior is dataset independent, it shows slight quantitative differences. For example, some of the networks formed by the NBC method for the leukemia, NCI60, and lung cancer datasets are very dense (up to ∼99, ∼52, and ∼102 average gene-to-gene associations, respectively). However, the breast and colon cancer datasets show significantly lower density levels.

Dependency of the network density on cancer datasets and feature selection methods. Heat maps depicting the network density levels for varying number of genes and Pearson correlation cutoffs are shown. Columns refer to the cancer datasets: leukemia, breast, lung, NCI60, and colon. Rows correspond to the feature selection methods: SVM-FS, symmetrical uncertainty, χ2, information gain, and PAM. The
For the leukemia, lung, and NCI60 cancer datasets, network densities are specific to the feature selection method. For the leukemia dataset, the genes selected by the PAM method give a very sparse network (up to one average gene-to-gene association), whereas the other feature selection methods give up to ∼99 average adjacent genes. Similar behavior is observed for the NCI60 and lung cancer datasets. For the lung cancer dataset, genes selected by the PAM feature selection method give a dense network (up to ∼102 average adjacent genes) in comparison to the other feature selection methods (up to ∼3 average gene-to-gene associations). Similarly, for the NCI60 dataset, the SU and χ2 feature selection methods give a dense network (up to ∼52 average gene-to-gene associations) in comparison to the other feature selection methods (up to ∼5 average gene-to-gene associations).
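The density measure used above, the average number of gene-to-gene associations per gene, can be computed directly from a matrix of pairwise correlations. The sketch below is illustrative only: `average_degree`, the toy correlation matrix, and the use of the absolute correlation are assumptions rather than details taken from the NBC method.

```python
def average_degree(corr, threshold):
    """Average number of associations per gene at a given correlation cutoff.

    corr is a symmetric matrix (list of lists) of pairwise correlations.
    Each retained pair contributes one association to both of its genes.
    """
    n = len(corr)
    edges = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        # Assumption: the absolute correlation is compared to the cutoff.
        if abs(corr[i][j]) >= threshold
    )
    return 2 * edges / n

# Toy correlation matrix for four genes (hypothetical values).
corr = [
    [1.00, 0.92, 0.71, 0.10],
    [0.92, 1.00, 0.65, 0.05],
    [0.71, 0.65, 1.00, 0.30],
    [0.10, 0.05, 0.30, 1.00],
]

for t in (0.6, 0.7, 0.8, 0.95):
    print(t, average_degree(corr, t))
```

On the toy matrix the density falls from 1.5 to 0.0 as the cutoff rises from 0.6 to 0.95, which is the qualitative trend described for the heat maps: for a fixed gene set, a stricter threshold can only remove associations.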
Despite the vast amount of experimental and computational studies, we still have limited knowledge about the mechanisms of different cancer types. To understand cancer-dependent changes in the correlation-based co-expression networks, here we give a brief analysis of the network measures for the networks created by the NBC method for different cancer classes in the leukemia and NCI60 cancer datasets (Figs. 3–6). As suggested above, the best feature selection method for the NBC classifier is SU. For this reason, in this section we focus on the association networks formed by the genes selected by the SU feature selection method. Owing to their sparse network structures, we omitted the lung, breast, and colon cancer datasets from this experiment (see Figure 3). We compared the networks created for different cancer classes with respect to three network measures, namely the degree, clustering coefficient, and closeness centrality distributions of the nodes of the network models generated by NBC. For both datasets, we used the network that leads to the best classification when up to 100 genes are used (Table 1 and Fig. 1). For the NCI60 dataset, the best accuracy is achieved with 75 genes and a correlation threshold of 0.725, and for the leukemia dataset, with 100 genes and a correlation threshold of 0.75.
NCI60
In this cancer dataset, all of the correlation-based co-expression networks formed for the different cancer classes show scale-free behavior. However, the frequency of isolated genes and the highest degree in the networks vary slightly between cancer classes. While more than half of the genes in the networks formed by the NBC method for five classes (classes 3, 4, 6, 7, and 8) are isolated (Fig. 4A), in the remaining three classes (classes 1, 2, and 5) the fraction of isolated genes is 50% or slightly less. Similarly, while the maximum degree in classes 3, 4, and 7 ranges between 3 and 4, in classes 1, 5, 6, and 8 it ranges between 5 and 7. Class 2 shows the highest degree (10) in this dataset.
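The quantities compared across classes in this paragraph (fraction of isolated genes, maximum degree, and the degree distribution) follow directly from a network's adjacency structure. The sketch below shows one way to compute them. The function name `degree_summary` and the toy network are hypothetical, and the sketch does not test for scale-free behavior, which would need an additional fit of the distribution.

```python
from collections import Counter

def degree_summary(adj):
    """Summarize node degrees of a network given as gene -> set of neighbors."""
    degrees = [len(neighbors) for neighbors in adj.values()]
    n = len(degrees)
    counts = Counter(degrees)
    return {
        # Share of genes with no association at all.
        "isolated_fraction": counts[0] / n,
        "max_degree": max(degrees),
        # Relative frequency of each observed degree.
        "distribution": {d: c / n for d, c in sorted(counts.items())},
    }

# Toy network (hypothetical): one small hub and three isolated genes.
adj = {
    "a": {"b", "c"},
    "b": {"a"},
    "c": {"a"},
    "d": set(),
    "e": set(),
    "f": set(),
}
summary = degree_summary(adj)
print(summary)
```

Here half of the six genes are isolated and the maximum degree is 2, the kind of per-class summary that can be compared between networks.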

Network degree distributions of the networks in different cancer classes. The degree distributions of the networks are shown for the NBC method for NCI60 (
Next we measured the clustering coefficient and closeness centrality values for each gene. We observed that in classes 3,

Clustering coefficient distributions of the networks in different cancer classes. The clustering coefficient distributions of the networks are shown for the NBC method for NCI60 (

Closeness centrality distributions of the networks in different cancer classes. The closeness centrality distributions of the networks that are created by the NBC method for NCI60 (
These observations suggest that in classes 1, 2, and 5, the expression levels of the genes are slightly more correlated. Because of that, we observed genes with high degree, clustering, and centrality scores.
Leukemia
In this dataset, the networks contain fewer isolated genes than those of the NCI60 dataset. There are 30–35% isolated genes in classes 1, 2, 3, and 5; approximately 20% in classes 6 and 7; and 5% in class 4 (Fig. 4B). The networks for classes 1, 2, and 3 show scale-free behavior, although their distributions vary slightly. The maximum degree in these three classes ranges from 35 to 55. In classes 4, 5, 6, and 7, the degree distributions show non-scale-free behavior, with maximum degrees ranging between 57 and 65. In these classes, the frequency increases as the degree increases. This observation suggests that in classes 4, 5, 6, and 7, many genes probably form a tight clique, leading to higher frequency values for higher-degree genes.
The clustering coefficient and centrality measures of the genes in this cancer type also show class-dependent behavior. While the clustering coefficient distributions show Gaussian behavior in all seven classes, the variance of these distributions differs slightly. In three of the seven classes (classes 1, 2, and 3), non-isolated genes have clustering coefficients between 0.3 and 1. In contrast, in classes 4, 5, 6, and 7, genes have slightly higher clustering coefficients (0.5–1) (Fig. 5B). With regard to closeness centrality, we observed a Gaussian-like distribution of scores in classes 1, 2, and 3 (Fig. 6B), in which centrality scores range between 20 and 65. In contrast, in classes 4, 5, 6, and 7, genes with high closeness centrality scores are more frequent. In these classes, we observe more central genes, with centrality scores varying between 30 and 75.
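For readers who want to reproduce this kind of analysis, the sketch below computes the two per-gene measures with their standard textbook definitions: the local clustering coefficient, and closeness centrality restricted to the genes reachable from the gene in question. These definitions, the function names, and the toy network are assumptions. In particular, the closeness values computed here lie in [0, 1], whereas the scores reported above are on a different scale whose normalization is not stated.

```python
from collections import deque

def clustering_coefficient(adj, node):
    """Fraction of pairs of a node's neighbors that are themselves linked."""
    neighbors = list(adj[node])
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two neighbors: no pairs to close
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbors[j] in adj[neighbors[i]]
    )
    return 2 * links / (k * (k - 1))

def closeness_centrality(adj, node):
    """(reachable nodes) / (sum of shortest-path lengths to them), via BFS."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(dist.values())
    if total == 0:
        return 0.0  # isolated gene
    return (len(dist) - 1) / total

# Toy network (hypothetical): triangle a-b-c, d attached to c, e isolated.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": set(),
}
for gene in adj:
    print(gene, clustering_coefficient(adj, gene), closeness_centrality(adj, gene))
```

Genes inside a tightly connected group, such as a and b in the toy triangle, receive high clustering coefficients, which is consistent with the clique interpretation given for the classes with high clustering values.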
This network analysis suggests that the NBC method not only achieves high classification accuracies for different cancer types but also reveals potentially important insights into the topological differences among gene associations in various cancer types.
Discussion/Conclusion
Network-based approaches for cancer classification have been proposed previously, where a combination of protein-protein interaction (PPI) networks and gene expression levels is used in sub-network identification for cancer classification.47,48 In this study, we proposed a new network-based classifier, the NBC method, and compared its classification accuracy to that of five traditional classifiers using five different cancer microarray datasets. To reduce the dependency of the classification accuracy results on the genes used, we employed five alternative feature selection methods to choose up to 300 genes. Our experimental results demonstrated that the choice of feature selection method does not have a very big impact on the classification accuracy. We observed that the SU, χ2, and IG feature selection methods give results that are comparable to one another and better than those of the SVM-FS and PAM feature selection methods. Our results also showed that our network-based classifier, the NBC method, reaches similar or better accuracy levels in comparison to traditional classifiers such as SVM and RF. We also showed that the correct combination of feature selection and classification method is key for successful cancer classification studies. In this regard, we observed that the NBC and RF classifiers combined with the SU feature selection method showed the best overall performance in five different cancer datasets. Our results also support the view that the best choice of classifier and feature selection method is dataset specific.
In a recent study, Staiger and colleagues 49 argued that network-based cancer classification approaches do not really outperform single-gene-based classifiers. Below we explain potential reasons for this observation, the shortcomings of the earlier network-based classifiers, and how our method differs from them. The disappointing results observed in Staiger et al. 49 might have several causes:
The network-based methods compared in Staiger et al. 49 use PPI data in combination with microarray data. PPI datasets are usually generated by high-throughput biological experiments that result in noisy datasets. Inclusion of these noisy datasets might lead to the low cancer classification accuracy of network-based approaches. Our NBC method differs from these earlier network-based methods. Instead of using existing PPI networks, our method constructs correlation-based association networks. In this way, we believe that our method is not affected by experimental noise as much as the earlier network-based classifiers that use PPI datasets.
Earlier network-based cancer classifiers overlay mRNA expression data with protein-level information. Since these two data types reflect events on different molecular levels, combining them is not trivial and might lead to inaccurate results. In contrast, our NBC method uses only gene expression levels and results in better accuracy levels than the single-gene-based classifiers.
The network-based classifiers used in Staiger et al. 49 use a single network for all the cancer classes regardless of the key differences in the gene expression levels of different cancer classes. These methods then look for the network motifs that show significant expression-level changes between different cancer classes. Our method, however, constructs a different and unique network for each cancer class and uses these networks to model and classify different cancer types and subtypes. We believe that this significant technical difference between our approach and earlier network-based approaches gives our method an edge over earlier network-based and single-gene-based classifiers.
Staiger et al. 49 compared the network-based and single-gene-based methods only on breast cancer datasets. The results observed in that study might therefore be breast cancer specific. In our study, to reduce dataset-specific effects, we tested our method using five different cancer datasets.
The aforementioned differences between the NBC method and earlier network-based approaches are notable and suggest that our method, in contrast to earlier network-based methods, is more suitable for cancer classification.
Owing to high microarray costs, supervised cancer classification methods are still not employed in many cancer diagnoses. In this sense, new classifiers that can accurately classify different cancer types using a small number of genes are needed. Detailed analysis of the NBC method showed that our method could reach high classification accuracy levels, usually with fewer than 100 genes. In contrast, the traditional classifiers generally require more genes than the NBC method to reach similar accuracy levels. This suggests that our new network-based classifier might be medically more relevant than the traditional classifiers. Future work on the medical application of our method to the diagnosis of different cancer types is needed to elucidate this strength of our new classifier.
To analyze the class-dependent topological differences in gene-to-gene associations in different cancer types, we also analyzed the network measures (degree, clustering coefficient, and closeness centrality distributions) in the leukemia and NCI60 cancer datasets. In-depth analysis of the networks produced by the NBC method provided new insights into the class-to-class changes of gene-to-gene interactions in cancer. While in some cancer classes we observed scale-free behavior in the degree distributions of the genes, this scale-free behavior was lost in other cancer classes. Similarly, the clustering and centrality distributions of the genes show distinct behaviors in different cancer classes. These changes in the network properties suggest that different cancer classes will show distinct responses to similar drugs since their gene regulatory network topologies are different. They also suggest that the design of new cancer drugs should take into account the topological differences in the regulatory networks of different cancer classes. Finally, our study indicates the need for new network-based classification algorithms and analysis techniques to decipher cancer mechanisms and find new therapeutic treatments for cancer.
Author Contributions
Developed the NBC method: AA, DG, TK. Conceived and designed the experiments: AA, TK. Analyzed the data: AA, TK. Wrote the first draft of the manuscript: AA. Contributed to the writing of the manuscript: AA, TK. Agreed with manuscript results and conclusions: AA, DG, TK. Jointly developed the structure and arguments for the paper: AA, TK. Made critical revisions and approved the final version: AA, TK. All authors reviewed and approved the final manuscript.
