Abstract
Introduction
Prostate cancer is the leading type of cancer in men, with a 27% incidence. The Canadian Cancer Society estimates about 72,700 cancer deaths, of which 3,500 are men with prostate cancer. 1 There are several different types of treatment for prostate cancer, and some of the treatment techniques can substitute for each other. The choice of a treatment depends on the stage of the disease, i.e. the extent of cancer spread and whether the cancer has spread beyond the prostate. In essence, based on the information gathered about the disease through biopsy and/or prostatectomy, staging categorizes a patient. Since different treatments result in variable outcomes, staging helps assess the risk of cancer progression and death based on the current extent of cancer, tumor characteristics, metastasis in the lymph nodes, and the spread of the disease to other parts of the body. Staging also helps establish a tradeoff between the risk of death due to cancer and the risk of death or medical complications due to treatment. This is particularly true for prostate cancer, since a majority of men diagnosed with the disease are older adults, often suffering from multiple comorbidities.
Typically, medical doctors assess the clinical and pathological data about the individual patient to assign a cancer stage and to choose the most appropriate treatment procedure. This is also known as clinical decision making, and it is a complex process. Cancer staging performed by a doctor involves weighing multiple variables and processing information gathered by patient examination and by conducting various tests. However, this process may be subjective and therefore depends heavily on the doctor's experience, skills and knowledge. Machine learning techniques can also be employed to learn and model the underlying theory when provided with relevant information and data. A variety of statistical, probabilistic and optimization tools under the umbrella of machine learning can learn from past examples, and a number of information systems have been developed to aid the clinical decision-making process. 2
An automated cancer staging system was developed and is described in this paper. A classifier-based framework was used to draw a distinction between two primary stages of the disease: organ-confined disease and extraprostatic disease. The classifier was modeled using past data from patients diagnosed with prostate cancer who were selected to undergo surgery (prostatectomy). One of the key issues with such data sets is class imbalance, resulting from the number of patients with organ-confined disease considerably exceeding the number of patients with extraprostatic disease. Data imbalances pose problems when the less frequently observed patterns are of higher relevance, because most machine learning techniques tend to generalize the patterns observed over the majority of the data and ignore those observed over smaller portions of the data. 3 Three approaches to dealing with the class imbalance problem were explored in this study. We also investigated feature space reduction, because the data can potentially contain more information than is required to perform the classification.
There are a number of machine learning techniques that can be used to develop a classifier, but it is well known that specific techniques are more suitable for certain domains. That is, for a particular problem, specific techniques have superior performance, while other techniques are only able to produce mediocre or acceptable results. Even within one problem domain, different techniques can differ in efficiency over different (partial) ranges of the problem space. This leads to the conclusion that, when solving a problem such as the one considered here, it is most appropriate to rely primarily on actual data and information about the problem, rather than trying to generalize the performance of certain machine learning techniques in a generic manner.
Lately, a number of machine learning based tools have been developed for cancer diagnosis and prediction. The majority of this work can be categorized as heavily dependent either on expert domain knowledge or on extensive historic data. Garzotto et al. 4 and Spurgeon et al. 5 sequentially developed a decision tree approach, specifically CART (classification and regression tree), to classify patients with aggressive prostate cancer based on ultrasound and biopsy markers. Sensitivity and specificity were adopted as performance metrics in the study. Zlotta et al. 6 developed a set of two artificial neural networks (ANNs), where the first ANN predicts the pathological stage of the patient and the second ANN classifies patients into groups based on cancer stage; model performance was measured using the Receiver Operating Characteristic (ROC) curve. Veltri et al. 7 favorably compared an ANN model to logistic regression for prostate cancer prediction and staging. Percentage of correctly classified cases was adopted as the performance metric.
In this paper, we propose a novel system which selects the features automatically and couples appropriate techniques in order to maximize the system performance. Specifically, the performance of three re-sampling techniques and three feature selection techniques was evaluated. A genetic algorithm (GA) is used to search for optimal pairs of feature selection and sampling techniques, where optimality is based on the performance of the system. Once such pairs are identified, the respective classifiers are developed and a Dempster-Shafer (DS) method 8 is used to combine the component classifier outputs. The performance of such a system has been shown for two different types of classifiers: the k-nearest neighbor (KNN) method and the support vector machine (SVM). ROC curves 9 are used as performance measures for the proposed cancer staging system. The next section describes the data used for this study and Section 3 provides justification for the use of ROC as a performance metric. Section 4 introduces the proposed method along with brief descriptions of the component classifiers and the feature extraction and sampling methods. Section 5 details the classifier fusion and the obtained results. Section 6 provides details on the generic applicability of this method by validating it on a simulated prostate cancer dataset and on two publicly available datasets of breast and lung cancer.
Study Population
Staging and analysis of prostate cancer may be regarded as a function of the information (predictors) gathered during the diagnosis of every patient. Data from a total of 1174 patients with prostate cancer positive biopsies matched with their radical prostatectomies were used for this study. All specimens were processed in one laboratory between 07/2000 and 04/2005 and were reported using standard synoptic reports. The biopsy and the prostatectomy data were collected from the information system (Cerner) of the Calgary Laboratory Services. The biopsy predictors that were considered are: Patient Age, Primary Gleason Grade, Secondary Gleason Grade, Biopsy Gleason Score, Prostate Specific Antigen (PSA), PSA Density (PSAD), Digital Rectal Exam (DRE), Transrectal Ultrasound (TRUS), Gland Volume, Number of Positive Cores, Total Percent of Core Involvement, and Total Cancer Length in mm. Prostatectomy data included: Disease Stage (pTNM), Primary Gleason Grade, Secondary Gleason Grade, Prostatectomy Gleason Score, Tumor Volume, Seminal Vesicle Involvement, Surgical Resection Margin Status, and Pelvic Lymph Node Involvement. The prostate cancer dataset contained multiple stages of the disease; therefore, the stage data were converted into a binary format where extraprostatic disease, represented by stages pT3 and pT4, was denoted by 1 and organ-confined disease, represented by stage pT2, was denoted by 0. Table 1 presents the statistical description of the data used in this study. Of the 1174 patient records, 1054 records were retained after removing those with missing variables: 934 patients in the organ-confined stage and 120 with an extraprostatic extension.
Patient characteristics.
Abbreviations: PSA, prostate specific antigen; PCa, prostate cancer.
A classifier's performance typically reflects how well it can discriminate between objects belonging to different classes (two in this case). But the proposed system is expected to play a critical role, particularly for patients where the treatment modality may have a substantial impact on post-treatment prognosis and survival. In terms of classification performance, the cost of wrongly classifying a patient with extraprostatic disease is therefore much higher than the cost of wrongly classifying one in the organ-confined stage. Conventional measures of performance will not provide a relevant estimate in this setting; therefore, an ROC curve and the area under the curve (AUC) were used as performance indices in this study.
An ROC curve is obtained by plotting the true positive rate (TPR) against the false positive rate (FPR) for varying decision thresholds. As shown in Table 2, the TPR (FPR) represents the fraction of positive (negative) examples correctly (incorrectly) classified. They may be computed as follows:

TPR = TP / (TP + FN),  FPR = FP / (FP + TN).
ROC confusion matrix.
Abbreviations: TP, true positive; FN, false negative; FP, false positives; TN, true negatives.
The decision threshold or boundary for binary classification refers to a threshold below which an object is classified as negative and above which it is classified as positive. Such a threshold can be adjusted to trade off the cost of false positives against the cost of false negatives.
One should note that, depending on the outcome of misclassification, ideal decision thresholds may vary. For example, if the cost of misclassifying a patient with extraprostatic disease is lower than that of misclassifying a patient with organ-confined disease, then a reciprocal ROC curve (to the one discussed above) will be preferred. But in this study, a classifier with an ROC tending towards the top-left of the graph indicates better performance than one with a lower ROC. In addition, ROC curves for different classifiers tend to intersect each other, in which case the AUC is used as an alternate metric. The AUC ranges over the interval [0, 1], and the greater the AUC, the better the technique. The AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example. The AUC as a measure has been proved equivalent to the Wilcoxon test statistic 10 and the Gini index, 11 i.e. unlike a typical measure of classification accuracy, the AUC quantifies the likelihood that the underlying method will assign a higher probability of success to a patient having extraprostatic disease compared to a patient whose cancer is contained. Therefore such a measure provides true insight even in the case of imbalanced data. Another important advantage is that the respective ROC is invariant to monotone transformations of feature values, 12 which provides flexibility in manipulating the feature set if necessary.
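The rank-probability reading of the AUC described above can be computed directly by comparing every positive/negative score pair. The sketch below illustrates this Wilcoxon-style calculation; the scores are illustrative values, not outputs from the study.

```python
import itertools

def auc_rank(scores_pos, scores_neg):
    """AUC as the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one
    (ties count as one half), per the Wilcoxon-statistic equivalence."""
    wins = 0.0
    for p, n in itertools.product(scores_pos, scores_neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Illustrative classifier scores (hypothetical, not from the study):
pos = [0.9, 0.8, 0.6]        # scores for extraprostatic cases
neg = [0.7, 0.4, 0.3, 0.2]   # scores for organ-confined cases
print(auc_rank(pos, neg))    # 11/12 ≈ 0.917
```

Because only the ordering of scores matters, this quantity is unchanged by any monotone transformation of the scores, which is the invariance property noted above.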
The proposed system consists of four major parts: feature extraction, data sampling, classification and classifier fusion. Feature extraction helps identify the most prominent features in the search space, thereby reducing the required computational and interpretational effort. Data sampling provides a mechanism to eliminate the bias or imbalance that exists in the data by over-sampling the minority class, under-sampling the majority class, or a selective combination of both. Feature extraction and sampling enable the implementation of a classifier to model the class disparity in the data. A GA is then used to identify compatible sets of features, sampling techniques and classifiers in order to maximize the performance of the subsequent DS classifier fusion. The adopted methods are individually described in the following subsections.
Feature Selection
RS Features
Rough Set Theory 13 adopts an equivalence relation such that two objects (x, y) are indiscernible with respect to an attribute set B if they take identical values on every attribute in B. The lower approximation of a class is the set of objects that can be classified into that class with certainty; its union over all decision classes is also known as the Positive region, POS_B(D), for the decision attribute D. The significance of a variable is then expressed as a function of the dependency (γ) of a variable in classifying the objects into the classes of D:

γ_B(D) = |POS_B(D)| / |U|   (3)

σ(a) = γ_C(D) − γ_{C−{a}}(D)   (4)

where |·| denotes set cardinality, U is the set of all objects and C is the full set of condition attributes. Because dependency (defined by Equation 3) only considers the number of rules that cover various instances and not the number of instances that each of the rules represents, a parameterized average support heuristic method has been adopted to include both the number of rules and the number of instances supporting each rule in computing the average support function (similar to the measure of dependency), given as:

AS_B(D) = (1/|R_B|) Σ_{r∈R_B} |supp(r)|   (5)

where R_B is the set of decision rules induced by the attribute set B and supp(r) is the set of instances supporting rule r. The significance of a variable (Equation 4) is redefined by replacing the dependency γ with the average support AS:

σ_AS(a) = AS_C(D) − AS_{C−{a}}(D)   (6)
PCA 14 reduces the feature space by projecting the complete feature set onto fewer variables, known as the principal components, with the objective of maximizing the variance in a least squares sense. This produces uncorrelated components with minimal information loss. If X is the n × p matrix of n observations on p centered features, the principal component scores are obtained as

Y = XW,

where the columns of W are the eigenvectors of the sample covariance matrix

S = X^T X / (n − 1),

ordered by decreasing eigenvalue. The feature space is reduced by retaining only the first few columns of Y, which capture most of the variance.
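The PCA projection just described can be sketched in a few lines via an eigendecomposition of the sample covariance matrix. The data and the number of retained components below are illustrative assumptions, not values from the study.

```python
import numpy as np

def pca(X, k):
    """Project X (n x p) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)            # center each feature
    cov = np.cov(Xc, rowvar=False)     # p x p sample covariance matrix
    vals, vecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k] # indices of top-k eigenvalues
    return Xc @ vecs[:, order]         # n x k component scores

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # hypothetical 100 x 5 feature matrix
scores = pca(X, 2)
print(scores.shape)                    # prints (100, 2)
```

Because the eigenvectors of a symmetric covariance matrix are orthogonal, the resulting component scores are uncorrelated, which is the property the text relies on.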
GA 15 based search depends on a user-defined fitness function; in this study, the product of the number of features and the average error in the predicted output class was adopted as the fitness function. GA-based search uses a chromosome representation of the solutions and a set of genetic manipulations in order to arrive at an optimum solution. First, a chromosome of length equal to the total number of features in the input space is created. The value of the bit associated with a feature is set to 0 to indicate that the feature is not considered, whereas a bit value of 1 indicates that the feature will be considered. The search process begins with an initial generation whose population is generated randomly; all of the chromosomes in this generation are evaluated against the fitness function and the best chromosomes (representing better solutions) are chosen to propagate into the next generation. Through heuristic manipulation of the chromosome structures in every generation, it is ensured that newer generations always have an average fitness higher than previous generations. The search stops either when a fitness threshold is achieved, or when the search exceeds the limit on the number of generations. In this study, the population size per generation was set to 25 and the limit on the number of generations used in the search was set to 1000. These numbers were selected after a short sequence of random trials. Mutation and crossover operators were utilized to generate offspring for the next generation, and a simple natural selection based on the current fitness values was used to identify potential parents.
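The bit-string GA described above can be sketched as follows. The fitness (number of selected features multiplied by an error estimate) follows the product criterion in the text, but the error function here is a hypothetical placeholder rather than a trained classifier's error, and the population/generation sizes are scaled down for illustration.

```python
import random

def ga_select(n_features, error_fn, pop=25, gens=200, seed=0):
    """Toy GA over bit-string chromosomes for feature selection."""
    rnd = random.Random(seed)

    def fitness(c):  # lower is better: (#features) * error
        k = sum(c)
        return (k if k else n_features) * error_fn(c)

    popn = [[rnd.randint(0, 1) for _ in range(n_features)]
            for _ in range(pop)]
    for _ in range(gens):
        popn.sort(key=fitness)
        parents = popn[:pop // 2]            # natural selection (elitist)
        children = []
        while len(parents) + len(children) < pop:
            a, b = rnd.sample(parents, 2)
            cut = rnd.randrange(1, n_features)   # one-point crossover
            child = a[:cut] + b[cut:]
            i = rnd.randrange(n_features)        # point mutation
            child[i] ^= 1
            children.append(child)
        popn = parents + children
    return min(popn, key=fitness)

# Hypothetical error model: features 0 and 2 are the informative ones.
best = ga_select(6, lambda c: 1.0 - 0.4 * c[0] - 0.4 * c[2])
print(best)
```

With this stand-in error model, the product criterion drives the search toward the chromosome that keeps only the two informative features, since any extra bit inflates the feature count without reducing the error.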
Data Sampling
SMOTE over-samples the minority class by creating “synthetic” examples. SMOTE is essentially an interpolation approach, where the synthetic examples are created along the line segments joining the example under consideration and any or all of its k nearest minority-class neighbors: 1) for each minority class example, find its k nearest neighbors within the minority class; 2) randomly choose m (m ≤ k) of these neighbors; 3) calculate the differences between the minority class example under consideration and each of the m chosen neighbors; 4) multiply each difference by a random number between 0 and 1, and add the result to the minority class example under consideration, producing m synthetic examples for this minority class example.
In this study, we used three re-sampling techniques: 1) under-sampling 2) SMOTE and 3) combined under-sampling and SMOTE in the multi-classifier fusion diagnosis system.
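The SMOTE interpolation steps listed above can be sketched in pure Python. The minority points, k and m below are illustrative assumptions, not data from the study.

```python
import random

def smote(minority, k=3, m=2, seed=0):
    """Create m synthetic examples per minority example by interpolating
    toward randomly chosen members of its k nearest minority neighbors."""
    rnd = random.Random(seed)

    def dist(a, b):  # Euclidean distance
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    synthetic = []
    for x in minority:
        neighbors = sorted((p for p in minority if p is not x),
                           key=lambda p: dist(x, p))[:k]
        for nb in rnd.sample(neighbors, min(m, len(neighbors))):
            gap = rnd.random()  # random number in [0, 1)
            synthetic.append(tuple(xi + gap * (ni - xi)
                                   for xi, ni in zip(x, nb)))
    return synthetic

# Hypothetical minority-class points in 2-D feature space:
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2)]
new = smote(minority, k=2, m=2)
print(len(new))  # prints 8 (m synthetic examples per original)
```

Each synthetic point lies on the segment between a minority example and one of its neighbors, so the new examples stay inside the region already occupied by the minority class.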
Support Vector Machines
SVM 19 is a supervised machine learning methodology used for classification and regression. Compared with traditional statistical and neural network methods, SVM has the advantage of effectively avoiding local minima due to its convexity property. In addition, SVM uses the kernel trick to apply linear classification techniques to nonlinear classification problems. When Gaussian kernels are used, the resulting SVM corresponds to a radial basis function (RBF) network with Gaussian radial basis functions. In comparison with a traditional RBF network, SVM automatically determines the model size by selecting the support vectors through a quadratic programming (QP) procedure; the hidden neurons correspond to the support vectors, which serve as the centers of the basis functions in the RBF network. For a linear SVM, the decision function is given in linear form as

f(x) = sign(w^T x + b),

where w is the weight vector and b is the bias obtained from the QP solution.
The decision value produced by an SVM is not an estimate of the posterior probability. Here, a binning technique 19 is used to transform the output of the SVM into a calibrated probability. The binning technique proceeds by first sorting the training examples according to their decision values, and then dividing the value range into b bins containing equal numbers of examples; the calibrated probability assigned to a test example is the fraction of positive training examples in the bin into which its decision value falls.
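A minimal sketch of this binning calibration is given below; the decision values, labels and number of bins are illustrative assumptions, not outputs of the study's SVM.

```python
import numpy as np

def fit_bins(decision_values, labels, b=3):
    """Sort by decision value, split into b equal-frequency bins, and
    record each bin's fraction of positive labels."""
    order = np.argsort(decision_values)
    values = np.asarray(decision_values)[order]
    labels = np.asarray(labels)[order]
    bins = np.array_split(np.arange(len(values)), b)
    edges = [values[idx[-1]] for idx in bins[:-1]]  # upper edges of bins
    probs = [labels[idx].mean() for idx in bins]    # P(positive | bin)
    return edges, probs

def calibrate(value, edges, probs):
    """Map a raw SVM decision value to its bin's positive fraction."""
    for edge, p in zip(edges, probs):
        if value <= edge:
            return p
    return probs[-1]

# Hypothetical decision values and 0/1 labels:
edges, probs = fit_bins([-2.0, -1.5, -0.2, 0.1, 1.0, 2.2],
                        [0, 0, 0, 1, 1, 1], b=3)
print(calibrate(1.5, edges, probs))  # prints 1.0 (top bin is all positives)
```

The calibrated value can then be treated as an estimate of the posterior probability, which is what the subsequent DS fusion consumes.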
KNN 20 is a statistical method for classifying objects based on the classes of their k closest training examples in the feature space. The distance between two objects is typically measured with the Euclidean metric,

d(x, y) = sqrt(Σ_i (x_i − y_i)^2),

and an object is assigned to the class most common among its k nearest neighbors. The conditional probability of each class is estimated as the fraction of the k nearest neighbors that belong to that class.
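The KNN class-probability estimate just described can be sketched as follows; the training points and query are illustrative assumptions.

```python
from collections import Counter
import math

def knn_proba(train, query, k=5):
    """Estimate class probabilities as the fraction of the k nearest
    neighbors (Euclidean distance) belonging to each class.

    train: list of (feature_tuple, label) pairs."""
    nearest = sorted(train, key=lambda t: math.dist(t[0], query))[:k]
    counts = Counter(label for _, label in nearest)
    return {c: n / k for c, n in counts.items()}

# Hypothetical 2-D training set with two classes:
train = [((0.0, 0.0), 0), ((0.1, 0.2), 0), ((0.2, 0.1), 0),
         ((1.0, 1.0), 1), ((0.9, 1.1), 1)]
print(knn_proba(train, (0.1, 0.1), k=3))  # prints {0: 1.0}
```

These per-class fractions play the same role as the calibrated SVM outputs: they supply the probabilities that are later converted into basic probability assignments for fusion.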
Dempster-Shafer Theory

Suppose the universal set Θ = {θ_1, ..., θ_n}, known as the frame of discernment, contains all mutually exclusive propositions. Subsets A of Θ are assigned basic probability masses m(A) ≥ 0, with m(∅) = 0 and Σ_{A⊆Θ} m(A) = 1. The belief value of a proposition A is the total mass committed to it:

Bel(A) = Σ_{B⊆A} m(B).

Evidence from two independent sources is combined using Dempster's rule,

m_{1,2}(A) = (1 / (1 − K)) Σ_{B∩C=A} m_1(B) m_2(C),

where K = Σ_{B∩C=∅} m_1(B) m_2(C) is a normalizing factor, which measures how much the two sources of evidence conflict.
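Dempster's rule of combination can be sketched for the two-class frame used here. The mass values below are illustrative, not outputs of the study's classifiers, and the labels 'OC'/'EP' (organ-confined/extraprostatic) are hypothetical shorthand.

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule: combine two mass functions whose focal elements
    are frozensets of propositions (masses in each must sum to 1)."""
    combined, conflict = {}, 0.0
    for (a, pa), (b, pb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + pa * pb
        else:
            conflict += pa * pb          # K: total conflicting mass
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

OC, EP = frozenset({'OC'}), frozenset({'EP'})
m1 = {OC: 0.6, EP: 0.4}   # evidence from classifier 1 (illustrative)
m2 = {OC: 0.7, EP: 0.3}   # evidence from classifier 2 (illustrative)
fused = combine(m1, m2)
print(round(fused[OC], 3))  # prints 0.778
```

Note how two moderately confident sources agreeing on 'OC' yield a fused belief stronger than either alone, which is the behavior the classifier fusion exploits.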
Multiple Classifier Fusion
For the prostate cancer stage prediction problem, there are two exhaustive and mutually exclusive propositions, θ_1 (extraprostatic disease) and θ_2 (organ-confined disease), so that Θ = {θ_1, θ_2}. The calibrated probability output of each component classifier is converted into a basic probability assignment over Θ, and the assignments produced by the component classifiers are combined pairwise using Dempster's rule, accumulating the belief in each proposition across all classifiers in the fused set.
The final decision is made according to the approach used to deal with the imbalanced data. If re-sampling is used in the fused diagnosis system, the final decision is made by assigning the label of the class with the largest belief value:

ω = argmax_i Bel(θ_i).

If the imbalance is dealt with by changing the decision threshold, the class of the example is determined by thresholding the belief value of the positive class: the example is labeled positive if Bel(θ_1) ≥ t, where t is the chosen decision threshold.
Experiments were done to assess the performance of all of the feature extractor-sampling-classifier combinations. The available data set was divided in two: one part for building the models and the other for testing the developed models. An appropriate ratio of training to testing data for the SVM was identified by running different trials, as shown in Figure 1. The best testing performance was observed when 70% of the total samples were used for training. The dip in SVM performance beyond a training data size of 70% can be attributed to overfitting: the trained model lost its ability to generalize and was instead rigid to the training data. As the performance of KNN depends on the number of neighbors considered in the output class allocation, the optimum number of neighbors was identified by running trials with different sizes; 5 was optimal. Although a higher number of neighbors may seem to generalize better, it is the separability of the data according to the assigned classes that has the greatest influence on the appropriate number of neighbors.

SVM performance for different training data sizes.
ROC curves were generated for all of the feature-sampling-classifier sets by altering the decision threshold. Because the outputs of the classifiers for all combinations were transformed into conditional probabilities, altering the decision threshold simply meant altering the respective probability threshold. By varying the probability threshold, the testing examples are re-labeled, yielding a series of (FPR, TPR) pairs. Each (FPR, TPR) pair is a point on the ROC curve. For individual classifiers, ROC curves are created by changing the threshold on the conditional probability of the positive class.

KNN performance for different number of neighbors.

ROC curves of SVM and Rough Set Features.

ROC curves of KNN and Rough Set Features.

ROC curves of SVM and PCA Features.

ROC curves of KNN and PCA Features.

ROC curves of SVM and GA Features.

ROC curves of KNN and GA Features.

ROC curves of SVM and Under-sampling.

ROC curves of SVM and SMOTE.

ROC curves of SVM and combined sampling.

ROC curves of KNN and Under-sampling.

ROC curves of KNN and SMOTE.

ROC curves of KNN and combined sampling.
AUC values with SVM for PCa.
AUC values with KNN for PCa.
To show the effect of combining multiple sets of features and the effect of combining different types of sampling, DS fusion of the classifier outputs was performed over various combinations of sampling and feature selection methods. The following notation was adopted to refer to the classifiers developed in this study;
where ‘C’ represents the type of classifier used, i.e. {SVM, KNN}; ‘F’ represents the type of feature selection tool used, i.e. {R (rough sets), P (principal component analysis), G (genetic algorithm)} and ‘S’ represents the type of data sampling adopted, i.e. {U (under sampling), S (SMOTE), US (under-sampling + SMOTE)}. Therefore SVM-P-S would imply the classifier based on SVM trained over PCA features obtained from SMOTE sampled data.
It is evident in Table 3 that the SVM trained over RST-based features had superior performance. Similarly, among the sampling types, the SVM trained with under-sampling combined with SMOTE performed best. DS fusion of SVM-R-[U, S, US] achieved a favorable AUC of 86.1%. KNN performed poorly as a classifier over all sampling and feature combinations. Although much lower than the average SVM, the combination KNN-R-S has the highest favorable AUC at 80.9%, as illustrated in Table 4. In general, DS fusion improved the performance over any single model. We also note that under-sampling proved a more efficient method of re-sampling the data than generating synthetic samples. As has been observed previously, 18 generating synthetic samples can cause the decision boundary to spread, resulting in poor performance. Although fewer samples make parameter estimation difficult in SVMs and neural networks, under-sampling of the data should be preferred, since the performance degradation from synthetic sampling is greater than that from training an SVM over smaller data sets.
The methodology proposed in this study relies on the use of a GA to identify the optimal set of classifiers for fusion, where fitness is defined as the overall fused performance. A total of 18 models (9 each for SVM and KNN, with variations in the sampling method and features) were developed, and the GA was used to choose the best set of fusion classifiers. Once an optimal set of classifiers is identified, their outputs are combined using the DS fusion described above.
Performance of GA optimized fusion for PCa.
In order to contrast the performance of the proposed method with existing techniques, a C4.5 decision tree and a two-layer neural network were additionally developed using the same data. C4.5, like expert systems and nomograms, is a simple and transparent technique that can also be regarded as a means of knowledge representation. An ANN, like SVM and nonlinear regression, is on the other hand a complex nonlinear model. These two models were chosen to represent the majority of present-day classifiers. As can be observed from Table 5 (B), the proposed method fares very well compared to the other two. Moreover, the performance of a 2-model combination (AUC: 0.8617, Accuracy: 89.4%) is better than that of the DS (C4.5 + ANN) method (AUC: 0.8580, Accuracy: 88.5%).
A number of such techniques have been reported in the literature, and these techniques perform very well for the individual datasets they were built for. The data used in the previous section form a large, unbiased and consecutive patient cohort originating from one institution. It should therefore be comparable to other current data obtained from patients who undergo prostatectomy in larger North American centers. Although we currently have no access to data sets from different hospitals, we present the results of applying the proposed method to three different cancer datasets to demonstrate that the proposed method is generic in applicability and that it outperforms other existing techniques. First, we considered a simulated prostate cancer (SimPCa) dataset. 1000 patient records were synthetically generated using a combination of the under-sampling and SMOTE methods. It was ensured that the overall statistics of the data remained consistent with the dataset reported in Table 1. In order to facilitate comparison, the same classifiers, feature extractors and sampling techniques were adopted. ROC curves were generated and the respective AUC values are given in Tables 6–8. The performance of the classifiers is very similar to the results in Section 5. The SVM trained over RST-identified features had the best performance, and KNN remained a weak classifier for this dataset. Consistent with the other results, GA optimization identified the same four-model fusion.
AUC values with SVM for SimPCa.
AUC values with KNN for SimPCa.
Performance of GA optimized fusion for SimPCa.
The other two datasets considered for this purpose have been adopted from the University of California-Irvine data repository. Using the two public datasets we demonstrate that the proposed method compares favorably to the existing techniques. The Wisconsin Breast Cancer (BCa) data 22 consists of 683 patient records, each described using a set of 9 features. The objective was to accurately distinguish patients suffering from benign and malignant breast cancer. The Hong-Yang Lung Cancer (LCa) data 23 consists of 32 patient records, each described with a set of 56 features. The objective was to classify the patients into groups of three types of lung cancer. Tables 9–10 present the performance of SVM and KNN as classifiers over all sampling and feature extraction combinations for the breast cancer data and Tables 13–14 for the lung cancer data. Tables 11 and 15 depict the performance of the GA optimized classifier fusion.
AUC values with SVM as the classifier for BCa.
AUC values with KNN as the classifier for BCa.
Performance of the GA optimized fusion for BCa.
Classification accuracy for BCa.
AUC values with SVM as the classifier for LCa.
AUC values with KNN as the classifier for LCa.
It can be observed from Table 9 that SVM outperforms KNN as a classifier, and the best SVM classifier was built using GA for feature extraction and under-sampling as a remedy for data imbalance. Unlike for the prostate cancer data, GA optimization identified a three-model fusion with the same AUC as higher-order combinations. The best overall classification accuracy was 99.4%. Table 12 summarizes the performance of different methods in the literature for the same data. Not all the works listed in Table 12 use ROC as a metric; therefore classification accuracy has been used to draw the comparison. Quinlan 24 and Pena-Reyes and Sipper 25 applied C4.5 decision tree and fuzzy-GA methods, respectively. Although these methods are relatively simple and yield transparent, user-friendly models, their overall accuracy is compromised. On the other hand, Goodman et al., 26 Ubeyli, 27 Polat and Gunes 28 and Akay 29 applied thoroughly nonlinear techniques. It is evident that the proposed method has superior performance to the other techniques, although the margin over Akay 29 is small.
For the lung cancer data, KNN and SVM have similar performance. The combination of under-sampling and SMOTE with PCA feature extraction performed best for SVM as a standalone classifier, whereas the combination of SMOTE and PCA feature extraction performed best for a single KNN-based classifier. GA optimization yielded a four-model fusion, as shown in Table 15, with an overall classification accuracy of 97.5%. Table 16 compares the performance of the proposed method to two other methods from the existing literature. The proposed method performs much better than Aeberhard's 30 RDA model and the neuro-fuzzy model adaptation used by Luukka. 31 Although Luukka 31 reports a 99.99% accuracy for a larger training-testing ratio, a performance of 65.48% is given for the 70:30 ratio used in this work. Therefore, it is concluded that the proposed GA-based optimization of multiple-model fusion is generically applicable across a wide range of data and, in addition, ensures better performance than most existing techniques.
Performance of the GA optimized fusion for LCa.
Classification Accuracy for LCa.
A number of classifier models based on KNN and SVM were developed to test the automatic prediction of cancer stage using different features and data samples. We proposed a novel approach of using a GA to select the best models and combining their outputs using DS theory. Owing to DS fusion and GA optimization, the overall performance improved over all tested individual models. In particular, three sampling and three feature selection methods were employed in this study. Under-sampling and rough-set-based features were identified as the most useful in improving the overall performance of the system.
Disclosure
The authors report no conflicts of interest.
