Abstract
Introduction
Triple-negative breast cancer (TNBC), characterized by a lack of estrogen receptor (ER), progesterone receptor (PR) and human epidermal growth factor receptor 2 (HER2) expression, has been a challenging breast cancer subtype for oncological therapy.1 TNBC accounts for 10%–20% of all breast cancer cases and is diagnosed more frequently in younger individuals, those with BRCA1 mutations, and African-American/Hispanic women.2 Chemotherapy is the only systemic treatment for TNBC, but TNBC patients receiving standard treatment have a higher rate of distant relapse and a poorer prognosis than patients with other breast cancer subtypes.3,4
There is significant overlap between TNBC and basal-like breast cancer; however, evidence from immunohistochemical expression, molecular features, and prognosis suggests that these two breast cancer subtypes are not equivalent.5,6 Although TNBC has been considered a unique breast cancer subtype, TNBCs display heterogeneous morphological, genetic, immunophenotypic and clinical features.7–9 The survival curve of TNBC patients supports this heterogeneity: the risk of distant recurrence rises sharply during the first one to three years after diagnosis, but drops dramatically thereafter and, after five years, shows a pattern similar to that of non-TNBCs.4 Thus, a better understanding of the subtypes within TNBC is necessary for developing personalized treatment for TNBC patients.
Genomic profiling can be a powerful tool for gaining insight into complex diseases such as cancer. Using microarray gene expression data, Perou et al (2000) used intrinsic gene signatures to define five breast cancer subtypes.10 We recently collected 587 TNBC gene expression profiles from 3,247 breast cancer cases in 21 publicly available data sets. Based on this gene expression meta-dataset, we established six TNBC subtypes, namely two basal-like (BL1 and BL2) subtypes, an immunomodulatory (IM) subtype, a mesenchymal (M) subtype, a mesenchymal stem-like (MSL) subtype and a luminal androgen receptor (LAR) subtype, together with the corresponding gene signatures.11
Based on these TNBC gene signatures, we were able to predict the subtypes of several breast cancer cell lines representing each of the six TNBC subtypes. Cell lines modeling each of the subtypes responded differentially to chemotherapeutic and targeted agents. Cell lines from both the BL1 and BL2 subtypes were highly sensitive to cisplatin. M and MSL subtypes responded to NVP-BEZ235 (a PI3K/mTOR inhibitor) and dasatinib (an Abl/Src inhibitor). The LAR cell lines were sensitive to bicalutamide (an AR antagonist). Our analysis and experiments were among the first systematic transcriptomic profiling studies to identify TNBC subtypes, and the results are promising in terms of TNBC biomarker and drug target discovery. Herein we describe our recently developed web-based subtyping tool, which classifies TNBC samples from any high-throughput gene expression platform using subtype signatures derived from our collected gene expression meta-data.
Methods and Implementation
In this section, we describe the analysis workflow and the data sources we used for predicting breast cancer subtypes. In addition, we illustrate the web interface for data loading and results delivery.
Data collection and TNBC identification
We collected 2,353 breast cancer gene expression profiles from 14 publicly available microarray datasets for the identification of TNBCs and the discovery of subtypes. Another cohort of 894 breast cancer gene expression profiles from seven public data sets was used for the identification of TNBCs and subtype validation. All data sources are listed in Supplementary Table 1. The analysis workflow is displayed in Figure 1. All gene expression profiles were generated using Affymetrix platforms and RMA (Robust Multi-array Analysis) was used to normalize each independent dataset. The three Affymetrix probes 205225_at, 208305_at and 216836_s_at were selected to represent

Workflow for developing the TNBC subtype gene signature.
Based on K-means clustering analysis, we defined six subtypes, each characterized by its canonical pathways and differentially expressed genes: basal-like 1 (BL1), basal-like 2 (BL2), immunomodulatory (IM), mesenchymal (M), mesenchymal stem-like (MSL), and luminal androgen receptor (LAR).11
TNBC subtype gene signature derivation
In this analysis, we selected genes that are relatively unique to each subtype. The 20% of genes with the highest and lowest expression levels in at least 50% of the samples in each of the six subtypes were initially selected. For all selected genes, the Kruskal-Wallis test was used to identify those showing a significant difference in at least two subtypes. We chose Bonferroni adjusted
TNBC subtype prediction
We computed six centroids for the TNBC subtypes based on the six gene signatures and the training cohort of 386 samples. For candidate TNBC samples, especially those based on Affymetrix platforms, we first applied quantile normalization. Next, each gene was standardized by subtracting its sample mean (calculated across all testing samples) and dividing by its sample standard deviation. Using Spearman correlation, individual candidate tumor or cell line samples were correlated with each of the six subtype centroids. Because the number of genes differs among the six signatures (size effect), the statistical significance of the correlation coefficients is not directly comparable between subtypes; we therefore applied a permutation test to remove this size effect. Candidate samples were then assigned to the TNBC subtype with the highest correlation, and those that had low correlation (correlation coefficient < 0.1 or
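The standardization and centroid-correlation steps described above can be illustrated with a minimal sketch. This is not the TNBCtype implementation: the function names, the dictionary-based data layout, the toy centroids and the "unclassified" label are all assumptions made for illustration. Quantile normalization and the permutation test are omitted, and only the correlation cutoff of 0.1 stated in the text is applied.

```python
from statistics import mean, stdev

def rank(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))

def standardize(matrix):
    """matrix: {gene: [value per sample]} -> per-gene z-scores across samples."""
    out = {}
    for gene, vals in matrix.items():
        m, sd = mean(vals), stdev(vals)  # sample standard deviation (n - 1)
        out[gene] = [(v - m) / sd if sd else 0.0 for v in vals]
    return out

def predict(matrix, centroids, min_corr=0.1):
    """centroids: {subtype: {gene: value}} (hypothetical layout).
    Returns one (label, correlation) pair per sample."""
    z = standardize(matrix)
    n_samples = len(next(iter(z.values())))
    calls = []
    for s in range(n_samples):
        best, best_r = None, float("-inf")
        for subtype, centroid in centroids.items():
            genes = [g for g in centroid if g in z]
            r = spearman([z[g][s] for g in genes], [centroid[g] for g in genes])
            if r > best_r:
                best, best_r = subtype, r
        # "unclassified" is an illustrative label for low-correlation samples
        calls.append((best if best_r >= min_corr else "unclassified", best_r))
    return calls

# Toy data: two made-up centroids "A" and "B", four genes, three samples
centroids = {"A": {"g1": 1.0, "g2": 0.5, "g3": -0.5, "g4": -1.0},
             "B": {"g1": -1.0, "g2": -0.5, "g3": 0.5, "g4": 1.0}}
matrix = {"g1": [9, 1, 5], "g2": [8, 2, 5], "g3": [2, 8, 5], "g4": [1, 9, 5]}
calls = predict(matrix, centroids)
```

Because each gene is standardized across the uploaded samples, the result for one sample depends on which other samples are in the same matrix, which is why cohort composition matters for prediction.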
Impact of ER-positive samples on prediction and solution
For probe-based gene expression platforms, we highly recommend pre-processing and normalizing the raw data for TNBC samples only. The distinctions between TNBC subtypes are relatively subtle compared to the dramatic difference between TNBC and ER-positive breast cancer samples at the transcriptome level. Thus, the presence of ER-positive samples alongside TNBC samples could affect TNBC gene expression normalization, and hence the final prediction results. We performed a series of experiments to illustrate the impact of ER-positive samples on subtype prediction. We chose a dataset (GSE7904) from our initial training cohort that contains 43 breast cancer microarray samples, of which 17 were identified as TNBC and matched the reported IHC status. Thus, the subtype membership assignments for these 17 samples, based on clustering analysis of the 386 patients in the training cohort, can be treated as a “gold standard”. First, we normalized the 17 TNBC samples alone and used TNBCtype to predict subtype memberships. As expected, the prediction results matched the original subtype assignments (Fig. 2A). Second, we normalized all 43 samples (including the same 17 TNBC samples and the other, ER-positive samples) and performed predictions (Fig. 2B). The differences between these two predictions were striking: nine samples were classified as the basal-like 1 (BL1) subtype in the second prediction procedure. This result demonstrates how TNBC sample predictions can be skewed toward basal-like subtypes if the TNBC test cohort is contaminated by ER-positive samples. The same analysis was applied to another dataset (GSE12276) from our initial testing cohort, which included 49 TNBC samples. This comparison is shown in Supplementary Figure 1, and the results are similar to those for GSE7904. Thus, identification and removal of ER-positive samples from the candidate cohort are necessary steps to ensure the accuracy of TNBC subtype prediction.

ER-positive samples dramatically affect TNBC subtype prediction results.
Given that the prediction results can be greatly impacted by ER-positive samples and that ER classification by IHC can miss 15.1%–21.8% of ER-driven cancers,14 we developed an ER-positive filter to remove potential false-negative ER samples from a given test set. For the GSE7904 dataset, we calculated the percentile of ER gene expression among all genes for each sample. This comparison reveals a dramatic difference in ER expression percentile between TNBC samples (n = 17) and ER-positive samples (n = 16) (Fig. 3). The above analysis suggests that filtering on the percentile of ER expression within each sample could be an effective approach to identify and remove ER-positive samples from unannotated data, as well as samples that were falsely identified as negative by IHC. Therefore, we examined the distribution of ER expression percentiles within our 386-sample TNBC training cohort and found that ER expression was below the 75th percentile of all genes in 96% of the samples (data not shown). Thus, we have implemented a quality control step in the TNBCtype program to remove samples in which ER expression is greater than the 75th percentile at the transcriptome level.
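The percentile-based ER filter can be sketched as follows. This is an illustrative sketch under stated assumptions, not the TNBCtype code: it assumes that ER expression is read from a row named "ESR1" (the gene symbol for ER), and it defines the percentile as the percentage of genes in the sample whose expression is at or below the ER value. The exact percentile definition used by the tool is not given in the text. The function names are hypothetical.

```python
def er_percentile(sample_expr, er_gene="ESR1"):
    """Percentile of ER expression among all genes in one sample.

    sample_expr: {gene symbol: expression value} for a single sample.
    Assumed definition: percentage of genes with expression <= the ER value.
    """
    er_value = sample_expr[er_gene]
    at_or_below = sum(1 for v in sample_expr.values() if v <= er_value)
    return 100.0 * at_or_below / len(sample_expr)

def er_filter(samples, cutoff=75.0, er_gene="ESR1"):
    """Split sample IDs into (kept, flagged) using the 75th-percentile cutoff.

    samples: {sample ID: {gene symbol: expression value}}.
    A sample is flagged as potentially ER-positive when its ER percentile
    exceeds the cutoff.
    """
    kept, flagged = [], []
    for sample_id, expr in samples.items():
        if er_percentile(expr, er_gene) > cutoff:
            flagged.append(sample_id)
        else:
            kept.append(sample_id)
    return kept, flagged

# Toy example with eight genes per sample (values are made up)
samples = {
    "T1": {"ESR1": 2, "A": 1, "B": 3, "C": 4, "D": 5, "E": 6, "F": 7, "G": 8},
    "E1": {"ESR1": 8, "A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7},
}
kept, flagged = er_filter(samples)
```

Because the percentile is computed within each sample, the filter does not depend on how the cohort was normalized, which makes it usable before the TNBC-only normalization step.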

ER gene expression for TNBC and ER positive samples.
Website of TNBCtype
To accelerate genomic research on TNBC in the community, we designed a user-friendly interface for TNBC subtype prediction, available at http://cbc.mc.vanderbilt.edu/tnbc. Users can classify TNBC tumor or cell line samples by uploading a normalized (but not standardized) gene expression data matrix and providing a valid email address. The input data matrix must consist of gene expression values in a .csv file with gene symbols as rows and sample IDs as columns. Once the uploaded data matrix passes a data format check, an automatic email is sent to the user for confirmation. If a sample does not pass the ER filter, the user is notified to remove the possible ER-positive sample and redo the normalization procedures. The user then receives another email when the analysis is complete and the results are ready to be retrieved.
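Before uploading, the input layout described above (gene symbols as rows, sample IDs as columns, numeric expression values) can be checked locally. The following is a minimal sketch of such a check; the specific rules it enforces and the function name are assumptions, and the format check actually performed by the server is not described in the text.

```python
import csv
import io

def parse_expression_csv(text):
    """Parse a CSV matrix with a header of sample IDs and one gene per row.

    Returns (sample_ids, {gene symbol: [values]}) or raises ValueError.
    The checks below are illustrative, not the server's actual rules.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if len(rows) < 2 or len(rows[0]) < 2:
        raise ValueError("need a header row and at least one gene row")
    samples = rows[0][1:]
    if len(set(samples)) != len(samples):
        raise ValueError("duplicate sample IDs in header")
    matrix = {}
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(samples) + 1:
            raise ValueError(f"line {line_no}: expected {len(samples)} values")
        gene = row[0].strip()
        if not gene or gene in matrix:
            raise ValueError(f"line {line_no}: missing or duplicate gene symbol")
        try:
            matrix[gene] = [float(v) for v in row[1:]]
        except ValueError:
            raise ValueError(f"line {line_no}: non-numeric value") from None
    return samples, matrix

# Toy input: two genes, two samples (values are made up)
text = "gene,S1,S2\nESR1,2.5,3.0\nAR,7.1,6.4\n"
sample_ids, matrix = parse_expression_csv(text)
```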
Results and Discussion
To demonstrate the functionality of the website, we have performed prediction on a test cohort with 26 publicly available TNBC samples and the results are displayed in Figure 4. Six colors were selected to represent each of the six TNBC subtypes. The table on the left shows the predicted subtype assigned to each sample, the correlation with the corresponding subtype centroid, and the

Snapshot of TNBC prediction outcome.
One of the key implementations is permutation-based
Conclusions
Our gene expression meta-analysis of TNBC, based on a large sample size, demonstrates not only the heterogeneity of TNBC but also that genomic data can be used to guide possible treatments and to identify patients for the design of clinical trials for TNBC.11 We developed the web-based TNBC subtyping tool for the research community. This software can be used by researchers to classify TNBC tumors into subtypes and provides the means to retrospectively analyze patient response to therapy. These retrospective studies will be critical to the design of future clinical trials that may eventually lead to biomarker discovery for patient selection. To ensure accurate subtype prediction, we implemented a percentile-based ER-positive filter to remove ER-positive samples, which can influence normalization and prediction results. In the future, integrated genomic analysis including DNA copy number, somatic mutation, epigenetic, and microRNA data will further improve our gene expression-based tool and help find the key “driver” components in each subtype, with the potential for novel drug discovery and more personalized treatment options for TNBC patients.
Availability and Requirements
Author Contributions
XC, JL, WHG, BDL designed and implemented the tool. XC, JL, WHG, BDL, JAB, YS, JAP read, wrote and approved the final manuscript.
Competing Interests
The authors declare that they have no competing interests.
Funding
This research was supported by NIH grants as follows: CA068485 (to XC); CA95131 (Specialized Program of Research Excellence in Breast Cancer); CA148375; CA105436 and CA070856 (to JAP); CA009385 (to JAB); American Cancer Society Grant #PF-10-226-01-TBG (to BDL); and Komen Foundation grant SAC110030 (to JAP).
Disclosures and Ethics
As a requirement of publication author(s) have provided to the publisher signed confirmation of compliance with legal and ethical obligations including but not limited to the following: authorship and contributorship, conflicts of interest, privacy and confidentiality and (where applicable) protection of human and animal research subjects. The authors have read and confirmed their agreement with the ICMJE authorship and conflict of interest criteria. The authors have also confirmed that this article is unique and not under consideration or published in any other publication, and that they have permission from rights holders to reproduce any copyrighted material. Any disclosures are made in this section. The external blind peer reviewers report no conflicts of interest.
