Abstract
Introduction
The massive amount of data generated by high-throughput technologies has increased the need for computational analysis. 1
Among these technologies, microarray chips are used to measure the expressions of tens of thousands of mRNAs simultaneously. Supervised and unsupervised classification are the two broad approaches for analyzing the gene expressions measured by these devices. Supervised classification methods are the major tool for extracting quantitative information from the datasets: the analyst selects representative variables from a training dataset (
In transcriptomic studies, a
Regarding the possible use of these classifiers in clinical routine, it is of utmost importance to minimize the number of genes in the molecular signatures, because predictors relying on small-sized signatures are easier to implement with cheap, low-throughput technologies. Moreover, small signatures can be of higher value for biological investigations when gene interaction networks are the main concern, because the complexity of a gene interaction network is exponential in its size.
In the present work we addressed the computation of molecular signatures as the optimization of an explicit bi-objective function. 5
Because we sought robust classifications and small molecular signatures, we aimed at both maximizing the interclass distance relative to the signature and minimizing the signature's size. Because the molecular signatures were sets of genes, the total number of potential molecular signatures was 2^n, n being the number of probesets.
We assessed our predictive modeling on seven datasets in oncology. Five of them were used for benchmarking: we assessed the predictive performances of the optimal signatures and compared them to those of previous methods. We analyzed the last two datasets more thoroughly, paying attention to the biological relevancy of the optimal signatures and comparing the signatures and prediction performances to previously reported methods and results. Both datasets came from a clinical trial of preoperative chemotherapy in breast cancer.
Methods
Definition of the Objective Function
Let
We aimed at maximizing the interclass distance and minimizing the size of the signature S, which are two conflicting objectives. Hence, we proposed to combine them into a single bi-objective function
For any fixed weight w, the optimal solution was a signature
where
Regarding the weight parameter
Computing the Optimal Signature
Let
where
The difference was positive if and only if
Because the weight parameter
Hence, the optimal signatures followed an inclusion property:
Now let the set 𝕊 of all the probesets be ranked in decreasing order according to their contributions δ to the interclass distance:
Otherwise stated, among the weight values of the real unit interval [0, 1] the only values of interest were those of the finite set
This result holds for our bi-objective function, which is the convex combination of the signature size and the interclass distance. Other combinations of objectives would make sense; which of them share the optimal-signature inclusion property is an open question that we do not address in the present article.
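The ranking step described above can be sketched as follows. This is an illustrative reconstruction, not the authors' released R code; it assumes a probeset's contribution δ is the squared difference of the two class means on that probeset, so that the contributions add up to the squared Euclidean distance between the class centroids.

```python
import numpy as np

def delta_ranking(X, y):
    """Rank probesets by their contribution to the interclass distance.

    X : (samples, probesets) expression matrix
    y : binary class labels (0/1)
    Returns probeset indices sorted by decreasing delta, plus the deltas.
    Assumption: delta_g = (mean_class1_g - mean_class0_g)**2, ie, the
    contribution of probeset g to the squared Euclidean distance
    between the two class centroids.
    """
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    delta = (mu1 - mu0) ** 2
    order = np.argsort(-delta)  # decreasing contribution
    return order, delta

# The n+1 candidate optimal signatures are then exactly the prefixes of
# this ranking: {}, {g(1)}, {g(1), g(2)}, ..., up to all n probesets.
```

The inclusion property reduces the search space from 2^n subsets to these n + 1 nested prefixes.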
Automatic Selection of the Predictor's Optimal Size
The above analysis shows that the optimal solutions of
An in-depth study of the prediction of the response to preoperative chemotherapy in breast cancer 8 evaluated a total of 780 distinct classifiers (sets of genes and classifier models). Its conclusions were that the Diagonal Linear Discriminant Analysis (DLDA) classifier had better prediction performances on gene expression data 9 and higher sensitivities. 10 For these reasons we decided to use DLDA for predictive modeling, taking as input the expressions of the genes belonging to the signatures that were the optima of the bi-objective function.
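DLDA scores each class by a Gaussian log-likelihood in which the full covariance matrix is replaced by its diagonal, with one pooled variance per gene. A minimal sketch of the idea, assuming equal class priors (not the implementation used in the cited studies):

```python
import numpy as np

class DLDA:
    """Diagonal Linear Discriminant Analysis: a Gaussian classifier with
    one pooled (within-class) variance per feature, ie, a diagonal
    covariance matrix shared by all classes."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # pooled per-feature variance of the within-class residuals
        resid = np.concatenate([X[y == c] - X[y == c].mean(axis=0)
                                for c in self.classes_])
        self.var_ = resid.var(axis=0) + 1e-12  # guard against zero variance
        return self

    def predict(self, X):
        # score each class by the diagonal Gaussian log-likelihood
        # (constant terms and equal priors omitted)
        scores = np.stack([-(((X - m) ** 2) / self.var_).sum(axis=1)
                           for m in self.means_])
        return self.classes_[scores.argmax(axis=0)]
```

Because the covariance is diagonal, DLDA stays well-posed even when the number of probesets exceeds the number of samples, which is the usual microarray regime.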
The classifier model being chosen, we selected the optimal-sized signature by a wrapper approach, searching for the signature whose accuracy (see
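The wrapper step can be sketched as follows. This is a simplified illustration, not the authors' code: a nearest-centroid rule stands in for DLDA to keep the example self-contained, and accuracy is measured on a single validation split rather than the paper's cross-validation.

```python
import numpy as np

def select_signature_size(X_train, y_train, X_val, y_val):
    """Wrapper step: among the nested prefix signatures of the delta
    ranking, keep the one with highest validation accuracy (the
    smallest size wins ties, since we scan sizes in increasing order)."""
    mu0 = X_train[y_train == 0].mean(axis=0)
    mu1 = X_train[y_train == 1].mean(axis=0)
    order = np.argsort(-(mu1 - mu0) ** 2)  # delta ranking

    best_size, best_acc = 1, -1.0
    for k in range(1, X_train.shape[1] + 1):
        sig = order[:k]  # k-th prefix signature
        # nearest-centroid prediction restricted to the signature
        d0 = ((X_val[:, sig] - mu0[sig]) ** 2).sum(axis=1)
        d1 = ((X_val[:, sig] - mu1[sig]) ** 2).sum(axis=1)
        acc = np.mean((d1 < d0).astype(int) == y_val)
        if acc > best_acc:
            best_size, best_acc = k, acc
    return best_size, order[:best_size]
```

Only n candidate sizes are evaluated, so the wrapper remains cheap even for large probeset counts.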
Datasets
We compared the performances of our predictors to those of the major published ones on seven different datasets in oncology. The purpose of this benchmarking was to assess the relevancy and predictive power of the optimal signatures our method unveiled. To this end, we computed the signatures on a significant number of different prediction problems in oncogenomics and compared their performances (using a DLDA classifier) to state-of-the-art published predictive modelings. Each dataset was a two-class dataset, whose characteristics are summarized in Table 1. First, we used five datasets of tumors in different tissues (Datasets I–V) to evaluate the performance of our predictors. We invite the reader to refer to the cited articles for further explanations of the methods and protocols used in the reported studies. In a second stage, we conducted a detailed study on two breast cancer datasets (Datasets VI–VII).
Description of the seven publicly available cancer datasets used in this study.
Cross Validation
Three different cross validation techniques were used in this study: 1)
Experimental Protocol (Non-Biased)
In all of the aforementioned experiments, the signatures, their sizes and the parameters of the classification model were computed on the training set only, without any reference to the test sets.11,12 Then the performances of the model designed in the training phase were computed on the external test sets. The same protocol was used in the cross-validation procedures: at each run of the cross-validation, the whole predictive modeling was repeated on the new independent training set of samples.
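The non-biased protocol above can be sketched as follows; the key point is that feature selection is redone from scratch inside each training fold, so the held-out samples never influence the signature. A nearest-centroid rule stands in for DLDA to keep the sketch short, and folds are taken by index stride for determinism (both are simplifications, not the paper's setup).

```python
import numpy as np

def unbiased_cv_accuracy(X, y, n_folds=3, sig_size=2):
    """Cross-validation in which the whole predictive modeling
    (signature selection included) is repeated on each training fold."""
    accuracies = []
    for i in range(n_folds):
        test = np.arange(len(y)) % n_folds == i
        Xtr, ytr, Xte, yte = X[~test], y[~test], X[test], y[test]
        # signature selected from the training fold ONLY
        mu0 = Xtr[ytr == 0].mean(axis=0)
        mu1 = Xtr[ytr == 1].mean(axis=0)
        sig = np.argsort(-(mu1 - mu0) ** 2)[:sig_size]
        # nearest-centroid prediction on the held-out fold
        d0 = ((Xte[:, sig] - mu0[sig]) ** 2).sum(axis=1)
        d1 = ((Xte[:, sig] - mu1[sig]) ** 2).sum(axis=1)
        accuracies.append(np.mean((d1 < d0).astype(int) == yte))
    return float(np.mean(accuracies))
```

Selecting the signature once on the full dataset and then cross-validating only the classifier would leak information from the test folds and inflate the estimated accuracy; the loop above avoids that bias.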
Results
We applied our predictive modeling (Methods: Experimental Protocol (Non-Biased)) to seven different microarray datasets in oncology. Five of them were used for benchmarking, and two for a thorough analysis of breast cancer signatures.
Evaluation of the Performance of the Predictors Unveiled by the Bi-Objective Optimization
We first performed a benchmarking on different datasets to objectively assess the performances of the signatures predicted by the bi-objective optimization. To this end, we applied our predictive modeling to Datasets I–V (Methods: Datasets) and conducted two kinds of cross-validation procedures (Methods: Cross-Validation): 1) a three-fold cross-validation, whose results are reported in Table 2, and 2) a leave-one-out cross-validation, whose results are in Table 3. The results outlined three interesting features: 1) the signature sizes (between 3 and 10 probesets on average) were rather small compared to what is commonly reported in the literature (most of the time between 20 and 50 genes), a property exemplified in Table 4 where we compare different published classifiers to our model; 2) the performances were high and concordant (high accuracy, with both high specificity and sensitivity, see Terminology and Abbreviations): only the brain cancer dataset seemed to harbor relatively weaker performances, but one still had ~68% accuracy with signatures of ~10 probesets; 3) the two cross-validation protocols yielded very similar results, which strengthens the confidence in the robustness of the predictions.
Average performances and size of the signatures predicted by bi-objective function optimization; three-fold cross-validation. We computed the δ-filtered signatures 100 times for different random training/testing subsets on Datasets I–V (Methods: 3-fold CV). Average performances across the runs are reported along with their standard deviation.
Average performances and size of the signatures predicted by bi-objective function optimization; leave-one-out cross-validation. We computed the δ-filtered signatures for different random training/testing subsets on Datasets I–V (Methods: Leave- one-out CV). Performances are computed from the summary of the runs.
Comparison of average performances and size of signatures reported in the literature. In this table are reported the mean accuracies (Ac., in percent) and mean number of probesets (
Next, we compared our results to previously published ones, which we selected because they were non-biased (Methods: Experimental Protocol (Non-Biased)). The comparative results are reported in Table 4. For the colon cancer dataset, our predictive modeling was outperformed only by the random forest (accuracy ~89% vs. ~85%), but the sizes of these predictors were five times larger (~50 vs. ~10 probesets). For the leukemia dataset, our predictive modeling was not outperformed (accuracy ~98% with ~3 variables). Regarding the prostate cancer dataset, our modeling was not outperformed either: the GLMPath method had the same performance (accuracy ~94%) with signatures half the size (~2 vs. ~4 variables). Finally, our model was not outperformed in either the brain cancer (accuracy ~68% with ~9 variables) or the lymphoma dataset (~88%, ~5 variables).
Hence, the performances of our predictors were at the level of, or higher than, those of the previous studies. Moreover, our signatures were most of the time significantly smaller. A noticeable exception is GLMPath, 13 whose performances, even though slightly lower than ours, were obtained with remarkably small signature sizes. Our predictive modeling appeared to be among the two best methods, if not the best one. Besides this result, one should stress that our predictive modeling is very simple, quick to compute on commonly available personal computers, and fully automated (no parameter to tune).
Application in Breast Cancer: Molecular Signature Designed for Prediction of Preoperative Chemotherapy Treatments
After having assessed the performances of the predictive modeling we propose, we applied it on two breast cancer datasets: Datasets VI & VII (Methods: Datasets).
Dataset VI comes from a clinical trial conducted at the MD Anderson Cancer Center (Houston, Texas, USA). One of its purposes was to provide data for designing predictors of the response to preoperative chemotherapy treatments in breast cancer. We used the same protocol as in 8 : the set of 133 patient cases was split into a fixed training set of 82 patient samples and a test set of 51 patient samples, each one showing the same ratio of responder (Pathologic Complete Response, PCR) to partially responding patient cases (NoPCR or Residual Disease, RD), respectively 1/3 and 2/3.
According to our predictive modeling, the probesets were ranked by decreasing contributions to the interclass distance measured on the learning set of 82 patient cases, 8 and the signatures were computed in this ranking. In Table 5, we compare the performances of our predictive modeling to two other recently published classifiers. We computed two signatures containing respectively 30 (δ-DLDA-30) and 11 (δ-DLDA-11) probesets. The δ-DLDA-30 signature was created by taking the 30 top-ranked genes according to their δ score, and was built to compare its performance to the two published classifiers of the same size. The δ-DLDA-11 signature was built by the automated procedure, which chooses the optimal size of the signature on the training dataset without human supervision (see Methods). The results of our predictive modeling were compared to two published studies on the same datasets: 1) a predictor using the same DLDA classifier (DLDA-30) designed on the 30 probesets of smallest
Comparison of the performances of signatures predicted on breast cancer Dataset VI (training set = 82 samples). The first column of this table (δ-DLDA-30) corresponds to the result of our predictive modeling with a fixed size of 30 probesets (for direct comparison with other methods). The second column (δ-DLDA-11) contains the results of our predictive modeling obtained without fixing the number of probesets of the signature; the optimal non-singular predictor found by our method contained 11 probesets. The third (DLDA-30 8 ) and fourth (Bi-Majority-30 14 ) columns report results found in the literature with the same data/protocol.
Figure 1 shows heatmaps for each of the four signatures cited above, applied to the test set of Dataset VI (51 test samples). We can observe that the signatures unveiled by our method (Panels A&B) clearly dichotomize the responder patients. The up-regulated and down-regulated genes are clearly visible, while this seems less apparent for the two other signatures from the literature (Panels C&D): the classifier rules would probably be more complicated to analyze/interpret. Interestingly, we found that patients PERU11, PERU14, M120, M353 and M503 were misclassified by all the methods, which suggests that these five patients form one or several sub-clusters with regard to the response to therapy.

Heatmaps of the four signatures on testing data (51 patients).
In order to focus on the biological relevancy of the genes unveiled in the δ-DLDA-11 signature, we detail the characteristics of the 11 selected probesets in Table 6 and plot their expression levels in Figure 2. Table 6 unveils a relatively high overlap between our signature and the two other published ones. Such concordance across studies is rarely observed in the literature and is worth mentioning. The highest-ranked gene of our signatures was ESR1 (the gene coding for estrogen receptor 1). Although this gene is known to be one of the most determinant markers of the response to chemotherapy (eg, 15 ), it had been selected neither in 8 nor in 14. Besides ESR1, three other genes of the δ-DLDA-11 signature were neither in the DLDA-30 signature nor in the bi-majority-30 one: AGR2, SOX11, and TBC1D9 (see Discussion section).
Detailed characteristics of the δ-DLDA-11 signature. This table contains descriptions of the 11 probesets of highest contribution to the interclass distance unveiled by our predictive modeling on Dataset VI (rank of the probesets following our prioritization by δ scores, names of the targeted genes, Affymetrix references of the probesets, values of their contributions to the interclass distance, and

Boxplots of the expressions of the 11 probesets of the δ-DLDA-11 signature.
Dataset VII is a second independent dataset of 91 patient cases 16 coming from a cohort subject to the same clinical trial and protocol as Dataset VI. 8 Measurements of the expression levels were conducted with the same microarray (Affymetrix U133A). With this dataset we aimed at validating the performances of the predictors computed on Dataset VI (82 training samples); hence, we used this new dataset only as a test dataset. The performances of our predictors are reported in Table 7 together with those of the DLDA-30 8 and bi-majority-30 14 signatures. The performances of the δ-DLDA predictors were close to those of the two other predictors. More specifically, the δ-DLDA-11 predictor's specificity was slightly lower than that of DLDA-30, but the two predictors had the same sensitivity and almost the same high negative predictive value, ie, the probability of a partial response given that a partial response to the treatment is predicted.
Performances of the predictors trained on breast cancer Dataset VI (82 training samples) and applied on Dataset VII (91 test samples). In the first column (δ-DLDA-30) are the performances of our predictive modeling with the 30-probeset signature predicted on Dataset VI (for direct comparison with other methods). In the second column (δ-DLDA-11) are the results of our predictive modeling obtained with the optimal non-singular DLDA predictor (11-probeset signature) predicted on Dataset VI. The third (DLDA-30 8 ) and fourth (Bi-Majority-30 14 ) columns report the performances of two predictors whose signatures were the 30 probesets of smallest
Finally, we used the two test datasets to assess the robustness of the δ-DLDA-11 predictor more deeply. In this last experiment, the signature was kept constant while the parameters of the DLDA classifier were tuned at each run of the 3-fold cross-validation procedure (Methods: Cross-Validation). The results of these two distinct cross-validations are reported in Table 8. Neither the DLDA-30 8 nor the bi-majority-30 14 study reported a cross-validation assessment. The results found on both Dataset VI (51 test samples) and Dataset VII (91 test samples) are concordant with (and thus reinforce) those found without cross-validation.
Three-fold cross-validation average performances of the δ-DLDA-11 signature applied on the two external test datasets. The table contains the results of our predictive modeling obtained with the 11-probeset signature (δ-DLDA-11) predicted on Dataset VI (82 training samples). The predictor was applied on both test datasets following a 3-fold cross-validation experimental protocol (Methods: 3-fold cross-validation). Average performances and standard deviations are reported.
Global Assessment of the Robustness of the Method
We statistically assessed the robustness of our predictive modeling on the 133-case dataset 8 by means of permutation resampling procedures. We first verified that our predictor was better than a random predictor. To this end, we performed 1,000 random subsampling cross-validations (Methods: Cross-Validation) and conducted our predictive modeling twice at each run: once on the learning set of samples, leading to a first predictor, and once on the same learning set with the samples randomly reassigned to the classes. The null hypothesis was
Next, we verified that this result was not the consequence of the classifier model alone or, otherwise stated, that the signatures were relevant for the predictions. To this end, we assessed the optimal signatures by designing another random subsampling cross-validation procedure (1,000 runs; Methods: Cross-Validation) conducted on the same dataset. At each run the predictive modeling was conducted, together with the design of a predictor with a random signature of the same size. The null hypothesis was stated as
Discussion
In this study we proposed a feature selection method by which the molecular signatures are the optima of a bi-objective function expressing the tradeoff between class discrimination (interclass distance) and the size of the signature. We demonstrated that this optimization reduces to ranking the probesets by their contributions to the distance between the class centroids. This method of signature selection is non-parametric, because the contribution to the interclass distance makes no assumption about the distribution of the expression levels. Moreover, we proposed a fully automated method for computing the optimal-sized signature on a training dataset without supervision.
In this study we used the DLDA classifier, following the conclusions of an in-depth study that compared several predictive modelings. 8 Other classifier methods could have been used, such as SVM, Naïve Bayes, Random Forest, etc. Investigating the effect of each classifier model on the performances obtained with the molecular signatures unveiled by our feature selection method would be interesting. However, since we used the classifier model after the feature selection step, and according to the referenced study, 8 we may reasonably assume that the difference would be small or in favor of the DLDA classifier.
In this study we defined the interclass distance as the Euclidean distance between the centroids of the two classes. This definition implies two specific choices in our objective function: 1) the “representative” element of each class, here defined as the centroid of all the samples pertaining to that class, and 2) the Euclidean distance to measure the difference between the two classes' representative elements. Arguably, other representative elements could be used, such as the vector of medians, the medoids, 17 etc. However, since the distance between the representative elements will always increase with the dimensionality (ie, the number of probesets included in the signature), this choice does not change the conclusions of our study. It has the potential to increase (or decrease) the performances of the predictors, 18 but the global methodology remains the same.
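As an illustration of swapping the representative element, the δ score can be parameterized by the per-class statistic. This is a hypothetical generalization for illustration only, not part of the published method:

```python
import numpy as np

def delta_scores(X, y, representative=np.mean):
    """Per-probeset contribution to the squared interclass distance,
    with a pluggable class representative: np.mean gives the centroids
    used in this study, np.median the vector-of-medians alternative."""
    r0 = representative(X[y == 0], axis=0)
    r1 = representative(X[y == 1], axis=0)
    return (r1 - r0) ** 2
```

With an outlier in one class, the two choices rank probesets differently, but the downstream methodology (ranking, prefix signatures, wrapper selection) is unchanged.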
Concerning the choice of the bi-objective function, we should also stress that it could have been different. Indeed, concerning the second objective, the minimization of the size is produced by the right part of Equation 1): 1 - |
We applied our method to microarray data, but it could as well be applied to RNA-Seq, methylation, or other types of data. The complexity of the signature computation is that of assigning each gene its δ value (linear time) plus that of sorting the genes according to this score (log-linear time). This first step, which finds the n + 1 optimal signatures, is very fast and drastically reduces the dimension of the search space, so that the computation of the optimal-sized signature is very efficient: the final wrapper step selects the best signature among this small subset of n + 1 optimal signatures. The complexity of the whole process is therefore low (close to log-linear), which makes the method a good candidate for predictor design on Next Generation Sequencing data.
Our results are supported by a study of expression data of breast cancer tumors. 19 The authors assessed three factors underlying the ability to predict phenotypes in breast cancer: 1) the presence of genes whose expressions had high fold changes between the different classes, 2) the number of such genes in the microarray, and 3) the number of learning cases in which these genes had strongly different mean expressions.
The conclusion of this study was that the fold change had the greatest influence on the success of model building, followed by the size of the gene signature, then by the number of informative cases. Our molecular signatures are designed by selecting the genes in the ranking of their contributions to the interclass distance. Because this ranking and the fold-change ranking are equivalent, our method and results are coherent with that study. 19
In 20, the authors compared and assessed a large panel of feature selection methods that are available for DNA microarray studies. The conclusion of the study was:

Scatterplot of the probesets.
Regarding molecular signatures, the motivation for selecting subsets of genes whose contributions to the interclass distance are the highest (implying that they are of small
Concerning the breast cancer datasets and the issue of predicting the responses to preoperative chemotherapy treatments, our predictive modeling was at the level of, or outperformed, the previous studies, with three times smaller signatures. The highest-ranked gene of our signatures was ESR1 (the gene coding for estrogen receptor 1, cf. Table 6). Although this gene is known to be one of the most determinant markers of the response to chemotherapy (eg, 15 ), it was selected neither in 8 nor in 14. In the latter study the probesets were ranked according to their “
It should be noted that two of the probesets belonging to the 30 δ-signature (but not to the 11 δ-signature) had
We saw in Table 4 that only one method, GLMPath, had prediction performances very close to ours, with signature sizes equal to or smaller than ours. The GLMPath method is a generalized LASSO method (Least Absolute Shrinkage and Selection Operator). The LASSO algorithm is based on a regression method estimating the best parameters of a Gaussian multi-variable model while simultaneously discarding the non-explicative variables. An analysis of LASSO conducted by R. Tibshirani, 28 the author of the method, showed that LASSO is a greedy algorithm, solving a linear regression model by adding at each step the variable most correlated with the current residual. Hence the LASSO method computes a greedy solution to the problem of molecular signature selection, and this greedy solution is not the optimal one in the general case. In contrast, we demonstrated that the greedy solutions of our bi-objective function were its global optima: they were not greedy approximations of the optimal signatures, they were the optimal signatures. We also demonstrated that our bi-objective function had a very small number of optima, precisely n + 1.
Multi-objective functions whose optima can be found by a greedy optimization are very rare. The reason why this optimization problem could be solved by a greedy approach is that the two objectives of the function are
Because of the third criterion this three-objective function is not
Programs Availability
A Web application and scripts written in R language that implement the feature selection procedure can be freely accessed or downloaded from http://gardeux-vincent.eu/DeltaRanking.php.
Terminology and Abbreviations
The performances of a two-class predictor (binary classification) can be measured according to the following criteria, where TP, TN, FP and FN are respectively the numbers of True Positive, True Negative, False Positive and False Negative of the predictor.
Accuracy: true prediction rate. Accuracy = (TP + TN) / (TP + FP + FN + TN)
Sensitivity (or Recall): true positive prediction rate. Sensitivity = TP / (TP + FN)
Specificity: true negative prediction rate. Specificity = TN / (TN + FP)
PPV (Positive Predictive Value, or Precision): probability that a case belongs to the positive class when the positive class is predicted. PPV = TP / (TP + FP)
NPV (Negative Predictive Value): probability that a case belongs to the negative class when the negative class is predicted. NPV = TN / (FN + TN)
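These definitions translate directly into code; the small helper below simply restates the formulas above with matching names.

```python
def binary_metrics(tp, tn, fp, fn):
    """Performance criteria of a two-class predictor, computed from the
    counts of true/false positives and negatives."""
    return {
        "accuracy":    (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),   # recall, true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv":         tp / (tp + fp),   # precision
        "npv":         tn / (fn + tn),
    }
```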
The following abbreviations are used for specific signatures and classification methods:
DLDA: Diagonal Linear Discriminant Analysis, a classification method.
δ-DLDA-30: The signature found by our method, using a fixed number of 30 features (the 30 top-ranked genes).
δ-DLDA-11: The signature found by our method, using the automatic process that selects an optimal number of features, here 11 (the 11 top-ranked genes).
DLDA-30: The 30-gene signature found in 8.
Bi-majority-30: The 30-gene signature found in 14, corresponding to the highest bi-informative values and using majority voting as classifier.
Conclusions
Defining molecular signatures as optima of a bi-objective function expressing a tradeoff between the interclass distance and the signature size is relevant. Computing the signatures along the ranking of the genes' contributions to the interclass distance is well founded and effective. Our benchmarking showed that the performances of the linear discriminant predictors designed on these optimal signatures were at the level of the best reported studies, and that the signatures were significantly smaller than almost all of them. Applying the method to two breast cancer datasets yielded very small and biologically relevant molecular signatures, which further helped design robust and efficient predictors of the response to chemotherapy.
Author Contributions
Defined the method: VG, RN, PS, RC. Performed the experiments: VG. Provided and analyzed the data and results: LP, RR, FR. Wrote the paper: VG, RN, PS, RC, LP, APB, MFBW. All authors reviewed and approved of the final manuscript.
