Abstract
Background
Cancer and many other diseases are caused by aberrant alterations of genes responsible for proper cell growth, differentiation, and division, acting at different levels of regulation, for example, genomic mutations, transcription, and translation. There is considerable heterogeneity within every cancer type, and improved identification of homogeneous subgroups of patients would significantly improve patient care and decrease morbidity and mortality due to disease.1–4 Examining the gene expression patterns of cancer has proven effective in identifying novel clinically relevant subgroups and genetic signatures.4–6 Common clinical outcome variables of interest are distant metastasis, tumor recurrence, and patient survival. These variables further help in understanding the underlying biology and in selecting appropriate patient care options for different cancer subgroups. For example, in the study by Calon et al, gene expression profiling of colorectal cancer samples identified elevated expression of the transforming growth factor genes TGFB1 and TGFB3 in poor-prognosis cases. The transcriptomic profiling results of these experiments led to the identification of a therapeutic TGF-β inhibitor, LY2157299, capable of reducing disease progression in an immunodeficient mouse model.6
Clustering algorithms are used to identify patterns in gene expression and discover clinically relevant subgroups. These algorithms use the expression of selected genes and a choice of distance/similarity metric to discover samples that exhibit similar transcriptomic profiles, thereby identifying sample subgroups. A variety of clustering algorithms have been applied for such purposes in cancer studies. Hierarchical clustering is the most common form of clustering used in the literature,7–10 followed by partitioning methods such as k-means clustering.
Hierarchical clustering is one of the most popularly employed methods of clustering because it is amenable to visualization of the relationships and because there is no need to set the number of clusters a priori.7 Hierarchical clustering divides observations into clusters and creates a tree diagram, or dendrogram, where each node of the tree represents a cluster. This algorithm is used to visualize each sample's or gene's relationship with other samples/genes within the cluster and among all clusters.7,11 In addition to the right choice of genes, this method of clustering depends on the choice of distance metric, that is, how the distance between data points is defined for clustering. A common distance metric used in hierarchical clustering is Euclidean distance, or the straight-line distance between two points in geometric space. In addition, Pearson's correlation coefficient and unsigned Pearson's correlation coefficient have also been popularly used. Ultimately, using different distance metrics can influence the shape and organization of the clusters.16,17 The linkage method is another important component of hierarchical clustering that determines the distance between two clusters. Many linkage criteria are available, including complete, single, and average linkage, and Ward's minimum variance method.16,18 One of the most common linkage methods is average linkage, in which the distance between two clusters is calculated as the average distance over all pairs of points in the two clusters. However, there is no clear evidence for the superiority of average linkage over the other linkage methods.19–21
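The distance metrics and the average-linkage rule described above can be sketched in a few lines. The package itself is implemented in R; this is an illustrative Python sketch with hypothetical function names.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two expression profiles
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    # 1 - Pearson correlation: profiles with a similar *shape* are close,
    # regardless of absolute expression level (assumes non-constant profiles)
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return 1 - cov / (sa * sb)

def average_linkage(cluster1, cluster2, dist=euclidean):
    # Mean distance over all pairs of points drawn from the two clusters
    pairs = [(p, q) for p in cluster1 for q in cluster2]
    return sum(dist(p, q) for p, q in pairs) / len(pairs)
```

Note how `pearson_distance([1, 2, 3], [2, 4, 6])` is essentially zero even though the profiles differ in magnitude, which is why correlation-based metrics group co-regulated genes that Euclidean distance would separate.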
Another important class of data clustering is partitioning. Partitioning methods such as k-means divide observations into a prespecified number of clusters by iteratively assigning each observation to the cluster with the nearest centroid.
An important step toward applying the above clustering algorithms to transcriptomic data is feature (gene or probe) selection, or selecting a subset of variables or features relevant to clustering.24,25 Feature selection is an integral part of this discovery process because the quality and relevance of clustering are sensitive to the selected features or genes. Feature selection is needed in the context of whole transcriptomic profile analysis because many irrelevant genes may dilute the signal required for clustering tumor samples.26 A typical feature selection method involves ranking genes based on a criterion and selecting a certain number of genes at the top of the ranked list (Fig. 1). Some frequently used gene ranking methods include variance-based ranking, Gaussian Mixture Modeling, coefficient of variation, and mean-variance methods.27–29 However, all of these methods remain in use, as none has been shown to be superior to the others, and they are often inconsistent in identifying biologically relevant clusters across different types of data.
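The ranking-then-selection pattern described above can be sketched as follows; an illustrative Python sketch of standard-deviation and coefficient-of-variation ranking followed by a fixed top-N cut (function names are hypothetical, not the package's API).

```python
import statistics

def rank_genes(expr, method="SD_Rank"):
    # expr: dict mapping gene -> list of expression values across samples.
    # Genes are ranked from most to least "variable" under the chosen score.
    def score(vals):
        sd = statistics.pstdev(vals)
        if method == "CV_Rank":
            m = statistics.mean(vals)
            return sd / m if m else 0.0  # coefficient of variation = sd / mean
        return sd                        # SD ranking: plain standard deviation
    return sorted(expr, key=lambda g: score(expr[g]), reverse=True)

def select_top(ranked, n):
    # "Fixed" selection: keep the n highest-ranked genes
    return ranked[:n]
```

A flat housekeeping-like gene (e.g., constant expression) scores zero under both criteria and falls to the bottom of the ranked list, while a gene whose expression swings widely across samples rises to the top.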

Flowchart depicting the commonly used steps in a typical cluster analysis of cancer gene expression data. First, genes are ranked based on a chosen statistical metric expected to capture the relevance of genes. Second, the number of relevant genes is selected. Third, the appropriate number of clusters is chosen to discretize genes and samples. Fourth, clustering is performed using either a hierarchical or a k-means clustering algorithm.
A challenge in feature selection is determining which method is the most effective at identifying the subset of relevant genes.8,9,24 Another challenge in most feature selection methods is determining how many features, or genes in the dataset, are deemed relevant. Current approaches have been largely arbitrary, with little attention paid to the effect of the number of genes on the quality of the resultant clustering. Choosing the appropriate method is difficult, as recent work9 demonstrates that no method is a clear winner based on generic clustering metrics such as F-score and entropy.
There is currently no package that facilitates the choice of gene selection methods from a variety of options (ranking and choice of the number of genes) in combination with commonly used clustering algorithms. Hence, in this review, we describe our R-package, multiClust, which offers gene ranking and selection from a wide variety of methods used in the literature and the choice of hierarchical or k-means clustering.
Methods and Package Workflow
In this section, we describe the components and workflow of our R-package, multiClust, which include:
- An in-depth vignette explaining how to use our package and obtain publicly available gene expression datasets with clinical outcome information
- Four gene ranking options that order genes based on different statistical criteria: CV_Rank, CV_Guided (novel method), SD_Rank, and Poly
- Four different ways to select the number of genes: Fixed, Percent, Poly (novel method), and GMM
- Two ways to determine the cluster number: Fixed and Gap statistic
- Two clustering algorithms: hierarchical and k-means clustering
- A function to calculate the average gene expression in each sample cluster
- A function to correlate sample clusters with clinical outcome
A summary of the necessary steps to use our package is depicted in Figure 2.

Flowchart of the multiClust package workflow.
If users intend to obtain gene expression and clinical data from public databases such as Gene Expression Omnibus (GEO),30 they should refer to our package vignette for more guidelines (Supplementary Package Vignette File). Our package is compatible with both microarray and RNA-seq gene expression data. The vignette includes instructions on how to obtain gene expression and clinical data from GEO using R, as well as gene expression normalization procedures for both data types. Common methods of normalization for GEO microarray datasets include robust multi-array average (RMA) normalization, or quantile normalization followed by log2 scaling.31,32 When using RNA-seq data in the form of fragments per kilobase of transcript per million mapped reads (FPKM) values, a common normalization method is to log2 transform the data; however, other systematic methods are available.33 After downloading and preprocessing the gene expression dataset (Fig. 2), users have the choice of using our package function or other available R functions to load the gene expression matrix into an R object. Users who have their own gene expression datasets may also use our package function to load their data into R. However, if the expression data are already loaded as an R object, the user may proceed to the gene ranking and selection steps in the package.
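The two normalization steps mentioned above, log2 scaling and quantile normalization, can be sketched as follows. This is an illustrative Python sketch of the underlying arithmetic, not the Bioconductor implementation.

```python
import math

def log2_transform(matrix, pseudocount=1.0):
    # log2(value + pseudocount) compresses the dynamic range of FPKM values;
    # the pseudocount avoids log(0) for unexpressed genes
    return [[math.log2(v + pseudocount) for v in row] for row in matrix]

def quantile_normalize(columns):
    # columns: one list of values per sample. Forces every sample to share
    # the same empirical distribution: each value is replaced by the mean of
    # the values holding the same rank across all samples.
    n = len(columns[0])
    sorted_cols = [sorted(c) for c in columns]
    ref = [sum(col[i] for col in sorted_cols) / len(columns) for i in range(n)]
    out = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # indices by rank
        new = [0.0] * n
        for rank, idx in enumerate(order):
            new[idx] = ref[rank]
        out.append(new)
    return out
```

After quantile normalization, every sample has exactly the same set of values; only the assignment of those values to genes differs, which removes sample-wide intensity biases before clustering.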
After obtaining the gene expression and clinical data and loading them into R, the next step is to specify the gene ranking option. As summarized in Table 1, there are four different gene ranking methods to choose from within multiClust.
Choice of gene ranking methods offered in multiClust.
Choice of four gene selection methods available in multiClust.
The Poly algorithm is a novel iterative method, akin to robust regression, that uses a series of three second-order polynomial regressions to filter out genes in each dataset. Each regression models the expected standard deviation of a gene given its mean. After each regression is fit, genes with a standard deviation below the expected value are filtered out and excluded from the next regression. This model is more selective than a constant standard deviation cutoff, as each cutoff function depends on the mean of the remaining genes. The filtering process is carried out three times, as this removes more than 90% of genes with low variance.29
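A minimal sketch of this iterative mean-versus-standard-deviation filtering, assuming a simple quadratic least-squares fit. This is illustrative Python, not the package code; function names and the gene representation are hypothetical.

```python
def quad_fit(xs, ys):
    # Least-squares fit of y = a*x^2 + b*x + c via the 3x3 normal equations
    S = [sum(x ** k for x in xs) for k in range(5)]
    T = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    A = [[S[4], S[3], S[2], T[2]],
         [S[3], S[2], S[1], T[1]],
         [S[2], S[1], S[0], T[0]]]
    # Gaussian elimination with partial pivoting
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        for r in range(i + 1, 3):
            f = A[r][i] / A[i][i]
            A[r] = [v - f * w for v, w in zip(A[r], A[i])]
    coef = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        coef[i] = (A[i][3] - sum(A[i][j] * coef[j]
                                 for j in range(i + 1, 3))) / A[i][i]
    return coef  # a, b, c

def poly_filter(genes, rounds=3):
    # genes: list of (name, mean, sd) tuples. Each round fits a quadratic of
    # sd on mean over the *remaining* genes and drops genes whose sd falls
    # below the value the curve predicts for their mean.
    kept = list(genes)
    for _ in range(rounds):
        a, b, c = quad_fit([m for _, m, _ in kept], [s for _, _, s in kept])
        kept = [g for g in kept if g[2] >= a * g[1] ** 2 + b * g[1] + c]
    return kept
```

Because the curve is refit on the survivors of each round, the cutoff tightens adaptively rather than applying one constant standard deviation threshold to every gene.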
In the GMM method, Gaussian Mixture Models are used to calculate the number of Gaussian distributions (mixture components) that best fit the gene expression data.
The terms of this model selection criterion balance the goodness of fit of the mixture model against its complexity, so that an appropriate number of components is selected.
Figure 3 portrays scatterplots that illustrate the top 1000 genes selected by the four ranking methods. In each scatterplot, the selected genes are plotted against all genes based on their mean and standard deviation values across samples. Selected genes are depicted in red, while filtered genes are shown in black. Once the gene ranking and selection process is completed, multiClust proceeds to cluster the samples.

Illustration of the choice of relevant genes by various gene ranking methods on a mean–variance plot using the GSE3494 breast cancer dataset.27 (A) CV_Rank method. (B) CV_Guided method; the yellow line is the coefficient of variation for the dataset, while the blue lines represent the mean and standard deviation cutoffs for a given number of genes to be selected. (C) SD_Rank method. (D) Poly method.
After selecting genes, users can determine the number of clusters to group samples into using two different methods, as summarized in Table 3. The first method, "Fixed", requires the user to input the number of clusters into which they would like to partition samples. This number is generally arbitrary; however, in many studies, it is chosen as the number of known molecular subtypes of that cancer.4,6,35 In contrast, the second method, "Gap Statistic", uses a function from the R package "cluster" to calculate a goodness-of-clustering measure for the tumor samples.36 Following this calculation, the cluster number with the highest goodness-of-clustering score is chosen to discretize samples.37
Summary of the methods available to identify the number of clusters in multiClust.
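The idea behind the gap statistic, comparing the observed within-cluster dispersion to that of uniform reference data, can be sketched as follows. This is a simplified one-dimensional Python illustration; the package relies on the R package "cluster" instead, and all names here are hypothetical.

```python
import math, random

def kmeans1d(xs, k, iters=50, seed=0):
    # Tiny 1-D k-means (Lloyd's algorithm), used only to measure dispersion
    rng = random.Random(seed)
    centers = rng.sample(xs, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda j: abs(x - centers[j]))].append(x)
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return groups

def log_wk(groups):
    # log of the pooled within-cluster sum of squares W_k
    total = 0.0
    for g in groups:
        if g:
            m = sum(g) / len(g)
            total += sum((x - m) ** 2 for x in g)
    return math.log(total) if total > 0 else float("-inf")

def gap_statistic(xs, k, n_ref=20, seed=0):
    # Gap(k) = E[log W_k(uniform reference)] - log W_k(data);
    # a larger gap means k captures more structure than chance would
    rng = random.Random(seed)
    lo, hi = min(xs), max(xs)
    obs = log_wk(kmeans1d(xs, k))
    ref = [log_wk(kmeans1d([rng.uniform(lo, hi) for _ in xs], k))
           for _ in range(n_ref)]
    return sum(ref) / n_ref - obs
```

On data with two well-separated groups, the gap at k = 2 exceeds the gap at k = 1, which is the signal the goodness-of-clustering score exploits.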
Next, users have the option to cluster samples via hierarchical clustering or k-means clustering.
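For illustration, bottom-up hierarchical (agglomerative) clustering with average linkage can be sketched as below. This is illustrative Python only; the package uses R's clustering routines.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(samples, n_clusters, linkage=None):
    # Start with every sample as its own cluster, then repeatedly merge the
    # two closest clusters (average linkage by default) until n_clusters remain
    if linkage is None:
        linkage = lambda c1, c2: (sum(euclidean(p, q) for p in c1 for q in c2)
                                  / (len(c1) * len(c2)))
    clusters = [[s] for s in samples]
    while len(clusters) > n_clusters:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Recording the order and height of the merges, rather than stopping at a fixed cluster count, is what yields the dendrogram shown in Figure 4.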

The dendrogram of tumor samples from the GSE3494 breast cancer dataset27 generated using hierarchical clustering. The top 1000 genes were selected via the SD_Rank method. Genes and samples were clustered using Euclidean distance and Ward.D2 linkage. Samples were split into four clusters.

A Java TreeView heat map of clustering of breast cancer dataset, GSE3494. 27 A total of 1000 genes were selected using SD_Rank and were clustered using centered Pearson's correlation. Samples were clustered using Euclidean distance metrics with Ward.D2 linkage. In this heat map, rows are genes and columns are tumor samples.
After the clustering of genes and samples, multiClust calculates the average expression of the selected genes within each sample cluster.
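Computing the average expression of each gene within each sample cluster amounts to a grouped mean; a hypothetical Python sketch of the calculation:

```python
def cluster_means(expr, sample_clusters):
    # expr: dict gene -> dict sample -> expression value
    # sample_clusters: dict sample -> cluster id
    # Returns, per gene, the mean expression within each sample cluster
    out = {}
    for gene, values in expr.items():
        sums, counts = {}, {}
        for sample, v in values.items():
            c = sample_clusters[sample]
            sums[c] = sums.get(c, 0.0) + v
            counts[c] = counts.get(c, 0) + 1
        out[gene] = {c: sums[c] / counts[c] for c in sums}
    return out
```

These per-cluster means summarize each subgroup's expression profile and make the clusters easier to interpret biologically than the raw sample-level matrix.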
Lastly, our package uses Cox proportional hazards models to test the clinical relevance of the sample clusters and thereby identify novel cancer subgroups. After cluster analysis of the selected genes and tumor samples, a Kaplan–Meier plot is produced (Fig. 6) portraying patient survival probability over time. A Cox proportional hazards model test from the R package "survival" is performed for each clustered dataset and clinical outcome measure provided.39,40 Several of the gene expression datasets used in our study had multiple clinical outcome measures available, as shown in Table 1. As a result, we performed multiple survival tests on these datasets and only one survival test on those datasets with a single clinical outcome measure. Following each survival test, a P-value is obtained indicating whether the sample clusters are significantly associated with the clinical outcome.
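The Kaplan–Meier survival probabilities plotted in such figures come from the product-limit estimator, sketched below in illustrative Python; the package uses the R "survival" package for this step.

```python
def kaplan_meier(times, events):
    # times: follow-up time per patient; events: 1 = event (death), 0 = censored.
    # At each event time t, survival is multiplied by (1 - deaths / at_risk).
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    surv, curve = 1.0, []
    i = 0
    while i < len(order):
        t = times[order[i]]
        deaths = censored = 0
        while i < len(order) and times[order[i]] == t:
            if events[order[i]]:
                deaths += 1
            else:
                censored += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= deaths + censored
    return curve  # list of (time, survival probability) steps
```

Censored patients drop out of the at-risk count without forcing a step down, which is how the estimator uses incomplete follow-up; plotting one such step curve per sample cluster gives the stratified plot that the Cox model then tests formally.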

Kaplan–Meier plot of disease-specific survival (DSS) of patients from the GSE3494 dataset.27 Patients were clustered using the top 1000 genes from the SD_Rank ranking and hierarchical clustering with Euclidean distance and Ward.D2 linkage.
Results
In this study, we used multiClust to evaluate the performance of different combinations of gene ranking, gene selection, and clustering methods on publicly available cancer gene expression datasets.
Summary of gene expression datasets used in this study.
We downloaded publicly available solid tumor gene expression datasets with clinical data from NCBI GEO and The Cancer Genome Atlas (TCGA). These gene expression and clinical outcome files can be viewed in the Supplementary Gene Expression File and Supplementary Clinical Outcome File.
Gene expression datasets were downloaded from GEO using the "GEOquery" package30,41 and then normalized using the robust multi-array average (RMA) procedure. Specifically, the Bioconductor package "affy"31 was used in R to perform the normalization. Cohorts consisting of multiple gene expression datasets, such as the GSE26712–14764 cohort, were prepared using an alternative method: the raw expression data were merged into a collective dataset, quantile normalized using the Bioconductor package "preprocessCore",32 and then log2 scaled. Both types of normalization resulted in gene expression values in a range of approximately 3–16. The TCGA glioblastoma dataset was downloaded from the TCGA Data Portal and was composed of median expression data from three different expression centers: the Broad Institute, the University of North Carolina, and the Lawrence Berkeley National Laboratory.42 These data were normalized prior to our obtaining them from the TCGA database. Prior to clustering, all gene expression data were further normalized to a range of 0 to 1 via feature scaling.
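Feature scaling each gene to the range 0 to 1 is a simple min–max transformation; an illustrative Python sketch:

```python
def feature_scale(rows):
    # Rescale each gene (row) to [0, 1]: (x - min) / (max - min).
    # A constant row has no spread, so it is mapped to all zeros.
    out = []
    for row in rows:
        lo, hi = min(row), max(row)
        span = hi - lo
        out.append([(v - lo) / span if span else 0.0 for v in row])
    return out
```

This puts every gene on the same scale so that highly expressed genes do not dominate the Euclidean distances used during clustering.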
For each dataset, a fixed number of clusters was determined in order to divide the samples. This number was kept the same during each multiClust analysis of that dataset.
Hierarchical and k-means clustering were then applied to each dataset using each combination of the gene ranking and gene selection methods.

Histograms of performance on clinical relevance of clusters identified for different combinations of gene ranking and gene selection.
As mentioned earlier, some gene expression datasets provided multiple clinical outcome measures (RFS, DFS, etc.), and survival tests were conducted for each of these outcomes. A panel of graphs comparing the number of survival tests with a significant association between sample clustering and clinical outcome across the different gene ranking, gene selection, and clustering methods can be seen in Figure 8. The method of choosing the top 1000 genes with CV_Guided ranking and the method of choosing the top 1% of genes with Poly ranking were the most effective for hierarchical clustering using Euclidean distance and average linkage. These combinations of gene ranking and selection each yielded 6 survival tests out of 21 with a significant correlation (Fig. 8B). All other gene ranking and selection methods for this clustering algorithm produced fewer than six survival tests with statistical significance. For hierarchical clustering using Euclidean distance and Ward.D2 linkage, the combination of gene selection and ranking that produced the highest number of significant results was the top 1000 genes with CV_Guided ranking. This combination yielded 14 significant survival tests, whereas all other combinations of methods produced fewer than 14 survival tests with significant relationships between sample clustering and clinical outcome (Fig. 8A). Overall, hierarchical clustering with Ward.D2 linkage produced more survival tests with significant associations than average linkage clustering. Lastly, the SD_Rank gene ranking method was the most effective in combination with the k-means clustering algorithm.

Histograms of performance on clinical relevance of clusters identified for different combinations of gene ranking and gene selection.
Discussion
Tuning the overall methodology for clustering including gene ranking, gene selection, and clustering method with appropriate choice of parameters is essential to elicit relevant sample groups using gene expression data. There are several studies that have evaluated the efficacy of various gene expression clustering methods.7,9,17 In addition, there are several Bioconductor packages that have been made available for studying gene expression clustering. Examples include “iBBiG” 43 and “rqubic” 44 for biclustering of gene expression data. Similarly, the “EBarrays” 45 software package offers tools for analyzing microarray data such as the expectation maximization algorithm for gene expression mixture models. However, there are currently no known software packages that permit the tuning of overall gene selection and clustering methodologies to examine the efficacy of these steps.
To meet such a need, we presented an R-package called multiClust.
As a use case, we used multiClust to identify the combinations of gene selection and clustering methods that yield clinically relevant sample clusters across several cancer gene expression datasets.
The SD_Rank method has not been highlighted in the literature, as it is sensitive to the choice of the number of features. However, this method was identified to perform the best when used in conjunction with our GMM-based method of identifying the number of genes. The GMM gene selection method fits Gaussian mixture models to the data using an Expectation-Maximization (EM) algorithm28 in combination with a model selection criterion to determine the number of components.
Furthermore, the SD_Rank method proved to be the most effective when using a k-means clustering framework. The k-means algorithm uses Euclidean distance and variance to cluster samples.47
This clustering algorithm may be the most effective because it groups samples in the same way in which genes were ranked and ordered. Several studies have demonstrated that k-means clustering is more effective than hierarchical clustering at analyzing the similarity of observations within a cluster relative to all other observations.48,49 Furthermore, k-means clustering has been widely applied in the gene expression literature.
Hierarchical clustering methods often have problems with incorrectly merging neighboring clusters.48 Observations can be mistakenly assigned to the same cluster at early stages of the clustering process when they rightfully belong to different clusters, and in hierarchical clustering, the assignment of an observation to a cluster cannot be changed. However, in k-means clustering, observations are reassigned at each iteration, allowing early misassignments to be corrected.
Furthermore, our results show that hierarchical clustering using the Ward.D2 linkage criterion was more effective than the average linkage method, which is commonly used in gene expression studies. Average linkage takes the mean distance between all pairs of data points of two clusters and therefore does not give special weight to sample outliers; if these outliers are merged with larger clusters, interesting local cluster structure can be lost.16 It is possible that hierarchical clustering with the average linkage criterion is not effective at assigning samples with variable features to the appropriate clusters. In other words, samples with completely different gene expression profiles may be mistakenly grouped into one sample cluster, preventing us from identifying cancer patient groups with distinct clinical outcomes. Ward.D2 linkage implements Ward's minimum variance method,18 which uses an error sum of squares calculation to cluster samples. This linkage criterion measures the sum of squared differences between each data observation and the group's mean.18 Such a method groups observations while minimizing the sum of squares, ensuring that similar observations end up in the same clusters. This linkage method is more effective at identifying sample groups with similar expression profiles because it groups samples based on the variation of their expression profiles rather than an average of their profiles. Even though the methodology we suggested above yielded the most datasets and survival tests, there are cases for which alternative methodologies also worked. Hence, other types of diseases and gene expression datasets need to be studied to identify the respective best-performing methodologies. Our multiClust package facilitates such systematic comparisons.
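Ward's criterion can be stated as the increase in the error sum of squares caused by a merge, which for two clusters reduces to a weighted squared distance between their centroids; an illustrative Python sketch (function name hypothetical):

```python
def ward_cost(c1, c2):
    # Increase in total within-cluster sum of squares caused by merging
    # clusters c1 and c2: (n1 * n2 / (n1 + n2)) * ||centroid1 - centroid2||^2
    def centroid(c):
        return [sum(p[d] for p in c) / len(c) for d in range(len(c[0]))]
    g1, g2 = centroid(c1), centroid(c2)
    d2 = sum((a - b) ** 2 for a, b in zip(g1, g2))
    return len(c1) * len(c2) / (len(c1) + len(c2)) * d2
```

Because the n1·n2/(n1+n2) factor grows with cluster size, merging two large, well-separated clusters is penalized heavily, which is why Ward linkage resists the outlier-driven merges that can blur local structure under average linkage.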
Another future objective will be to incorporate more complex model-based clustering algorithms into our multiClust package.
In conclusion, our R package multiClust provides users with a choice of gene selection and clustering methods to identify clinically relevant sample clusters in cancer gene expression data.
Author Contributions
Authored the
