Abstract
Keywords
Introduction
The development of cancer requires multiple genetic alterations perturbing distinct cellular pathways. In human cancers, these alterations often arise owing to mutations in tumor suppressor genes and proto-oncogenes, which in turn trigger uncontrolled cell proliferation, survival and genomic instability. Consequently, the study of tumor suppressor proteins and proto-oncogenes and the cellular signaling networks deregulated by the corresponding mutant proteins has become a centerpiece of contemporary cancer research. In fact, investigations of their mode of action have pinpointed key mechanisms that protect humans against tumor development and thus provided rational foundations for preventing, detecting, and treating cancer.
Inactivation of tumor suppressor genes or the activation of oncogenes invariably trigger changes in gene expression programs. DNA microarrays 1 are in wide use as a method to quantify changes in global expression levels. Public microarray databases contain measurements of transcription programs in cells under thousands of different biological conditions and/or perturbations. One of the most prominent is NCBI Gene Expression Omnibus (GEO), 2 a curated repository containing microarray data in a standardized format. 3 This database therefore offers a rich resource of quantitative data on the behavior of gene expression changes in response to cancer gene mutations.
By specifically analyzing gene expression program changes associated with cancer gene activation or inactivation, we sought for signatures shared among distinct cancer genes listed in the census of cancer genes 4 and integrated the resultant data sets into logical networks of interactions.
Meta-analysis of cancer microarray data has been successfully applied by Rhodes et al to find a common gene-expression signature 5 in independent data sets from different cancer types. Ramaswamy and colleagues discovered a predictive signature 6 for the metastatic status of tumors from diverse origins and Creighton reported coordinate expression patterns of multiple oncogenic pathway signatures 7 in human prostate tumors. Our approach used measurements from cell culture experiments in which the expression of specific cancer-causing genes has been either induced or downregulated. This allowed us to unveil gene expression signatures common to cancer genes that have not been linked previously.
Methods
Selection and acquisition
As outlined in Figure 1, the first step in data acquisition was searching the 385 genes present in the cancer gene census in the NCBI GEO repository. 324 GEO entries contained one of these gene names in the title or abstract. The descriptive fields of the entries were duplicated into a local database. False positives were made apparent through an appropriate visualization of the entry description and subsequently removed. We selected experiments in which cancer genes were over-expressed, depleted by the application of small interfering RNAs or genetically eliminated by virtue of gene knock-out in mice. In addition, we included experiments in which dominant-negative forms of cancer genes were expressed. All of these 99 experiments were performed in either human, mouse or rat cell lines as described in the publication accompanying the experiment. If the measured values were given as raw intensities, samples for induced and control as well as their replicates were selected. Replicates and type of logarithm were selected for entries providing ratios. The definition database was made accessible from network-edges in the web-interface.

Flowchart of the data acquisition process.
Expression values were retrieved for the 607 defined samples. The probe identifiers were mapped to Entrez gene identifiers using the microarray platform description provided by GEO and the UniGene database.
8
78 data sets were successfully mapped, the remaining 21 did not contain valid identifiers. NCBI HomoloGene provided human homologues of mouse and rat genes. Measurements were available for 18885 genes, while there was a total of 19978 human genes in HomoloGene. The probe-level measurements (N = 9981226) were grouped by genes and replicates. Intensity values were grouped into induced/control to compute the log(ratio) as well as the
Clustering
Automatic classification and multiscale bootstrap resampling was performed using the R 9 package pvclust. 10 A matrix consisting of experiments as columns, genes as rows and gene-expression changes for each pair was used as input. Because not every gene was present in all of the microarray experiments, there are missing values in this matrix. An iterative method was used to reduce the fraction of missing values. In each step, the row or column containing the largest fraction of missing values was removed from the matrix. This procedure was repeated until no more rows or columns contained a fraction of missing values larger than the desired threshold.
The matrices ranging from 1% to 33% missing value thresholds (termed “namax” in the web-application) were computed. Each of them was separately clustered and the resulting dendrograms were made available through the web-application. Negative correlation distance of pairwise complete observation was used as the distance measure. The average linkage method was used in hierarchical clustering. The relative size of the bootstrap sample was increased from 50% to 140% in 10 steps. At each step, 1000 bootstrap replications were performed.
Interaction network analysis
In order to find the significant gene expression changes in our data sets, thresholds were set on the
The novelty of the gene interactions was determined by searching PubMed. Edges in the web-application link to the manually confirmed PubMed source sentences. The web-application can be controlled through a menu at the top, where the first item (“HELP”) provides information on usage.
Transcription factor analysis
The TRANSFAC database of transcription factors and their binding sites 13 was aquired from BIOBASE. All human genes annotated with known binding sites were extracted from the “GENE” table. The gene-factor pairs found by in-vivo ChIP-chip and ChIP-Seq experiments were extracted from the “FRAGMENT” table. Promoter sequences of all human genes from –5000 to +500 relative to the transcriptions start site were retrieved from the ENSEMBL database (version GRCh37) using the Perl API. The “match” program provided by BIOBASE was used to detect further binding sites in these sequences, using the non-redundant vertebrate profile, where matrices from the “MATRIX” table were selected with respect to minimize the rate of false positives. Matches exhibiting a matrix similarity scores above 0.9 were used. Finally, these three data-sets were pooled into one table containing 1458399 gene to binding-site pairs. Over-representation of transcription factor targets in gene sets was detected by a Fisher exact test.
Results and Discussion
Clustered experiment groups
Searching for microarray measurements based on the reported census of genes causally linked to cancer progression
4
resulted in 78 individual studies, performed on 56 cancer genes. Table 1 shows which cancer genes were present in which type of perturbation. The processed data matrix of these 78 experiments with expression levels for an average of 12517 genes contained 34% missing values. In order to find groups of experiments in which genes are similarly regulated, this data matrix was clustered. Groups of cancer genes along known classical pathways were detected with high significance in bootstrapped clustering, corroborating the methodology. The dendrograms in the web-application shows clusters above the 95% significance level highlighted by red rectangles. The numbers in red next to the dendrogram branches are the “Approximately Unbiased” (AU) values provided by pvclust, from which the
Cancer genes.
The largest group of genes repeatedly detected in clustering analysis (
In-depth analysis
The most significant cancer gene group that produced a common gene expression signature identified here included BRAF+, BRCA1–, NOTCH1–, HOXA9–and ERG+ and has been studied by different authors in different cell lines of which four were of human and one of mouse origin (Table 2). Moreover, three different kinds of microarray chips were used for the measurements. Therefore, the similarity observed is likely not based on experimental variations but rather reflects relevant changes in gene expression and commonalities following perturbations of the above-noted cancer genes. We note that there were 10866 genes common to all five experiments. Genes with average ratios below the mean of up-regulated genes or above the mean of down-regulated genes were excluded to obtain a set of 1296 regulated genes.
GEO Series in the most significant group of five experiments from clustering.
Table 3 lists the 20 genes exhibiting comparably small standard deviation along with their function and ratios from each of the five measurements. Most of these genes control growth directly or through signaling events. The three genes most highly up-regulated were previously shown to contribute to tumor development and include ubiquitin specific peptidase 2 (USP2), 14 duffy blood group chemokine receptor (DARC) 15 and C-C motif chemokine receptor 3 (CCR3). 16 The EED gene is a member of the Polycomb-group (PcG) family, which form multimeric protein complexes involved in maintaining the transcriptional repressive state of genes. The down-regulation of the EED gene on activation of the oncogene BRAF as well as on inactivation of the tumor suppressor BRCA1 leads to inactivation of gene silencing, pointing to a possible mechanism of neoplastic transformation mediated by these distinct cancer genes.
Top 20 most similar regulated genes in the group of five experiments.
The Oncomine database
17
was used to check for consistent differential expression in actual cancer tissue. Table 3 contains the number significant unique analyses showing up- or down-regulation in the 10% rank percentile,
From this list, five genes (IDI1, CHST1, ADNP2, EED, EDNRA) showed consistent differential expression in a majority of analyses represented in Oncomine (see Table 3). Thirteen genes (DDX50, USP2, NCAPD2, DARC, NUP37, PSMG1, NOL4, RPA2, PCBP1, HAT1, CD7, DEK, KDR) showed consistent differential expression in a minority of analyses. Only one gene (LRIT1) was inconsistent and one gene (CCR) did not show differential expression at all.
Checking for over-representation of transcription factors in the TRANSFAC database, we found that with the exception of three genes (CD7, DARC, EDNRA), all genes can be potentially regulated by the transcription factor NF-YA (
Networks
The expression value matrix for the analyzed cancer genes contained 9% missing values, as they were present in essentially all arrays used as a basis for this study. Inspection of the network of protein interactions (Fig. 2) showed that Mitogen-activated protein kinase kinase 1 (MAP2K1), Notch homolog 1 (NOTCH1), V-myc myelocytomatosis viral oncogene homolog (MYC), and Paired box 5 (PAX5) were the most highly interconnected genes, connecting each to more than six other nodes. This observation was constant at all combination of thresholds. For MYC and MAP2K1, PubMed articles confirmed many of the interactions, but only one was found for NOTCH1 and PAX5. This indicates novel links to PAX5 and NOTCH1, outlining a previous unrecognized importance for these potential cancer genes.

Logical network at
The expression value matrix was randomized and the resulting network compared to the one created from experimental data. In the randomized data network, node connectivity was decreased to 2.2 from 3.8 in experimental data. The number of predicted connections confirmed by publications in PubMed dropped from 16 in experimental to 4 in randomized data.
Looking at sub-networks revealed verifiable sequences of interactions between genes, as shown in Figure 3. This sub-network contained retinoblastoma 1 (RB1) (+), homeobox A9 (HOXA9) (–), Phosphatase and tensin homolog (PTEN) (–), PAX5 (–), and Transferrin receptor (TFRC) (–). These results therefore predict that RB1, a key regulator of entry into cell division, can be connected to TFRC, which regulates cellular uptake of iron, an element known to be required for cell division. This is confirmed by the measurement of a down-regulation of TFRC in the RB1 up-regulation experiment and can therefore serve as an internal validation of the prediction.

Sub-network of the above network.
Conclusions
Through analysis of a large set of gene expression data publicly available, we were able to identify a select class of cancer-causing genes whose activities converge to deregulate a common transcription program. This finding is surprising, as the products of the identified cancer-causing genes are known to function in distinct signaling pathways. However, tumor cell evolution proceeds via a process in which genetic changes confer one or another type of growth advantage. Therefore, it is conceivable that changes in the gene expression signature described here mediate a specific aspect of tumor cell evolution. Clearly, the methodology reported here allows to extract from the inundation of gene expression information key patterns of gene expression and to connect them to the deregulation of specific cancer-causing gene products.
Author Contributions
NF, IC and WK conceived the study. NF wrote the programs for the methodology. IC and PW annotated the cancer studies. NF and WK principally wrote the paper, with revisions and contributions from IC. All authors read and approved the final manuscript.
Funding
This work was supported in part by the Competence Center for Systems Physiology and Metabolic Disease.
Disclosures and Ethics
As a requirement of publication author(s) have provided to the publisher signed confirmation of compliance with legal and ethical obligations including but not limited to the following: authorship and contributorship, conflicts of interest, privacy and confidentiality and (where applicable) protection of human and animal research subjects. The authors have read and confirmed their agreement with the ICMJE authorship and conflict of interest criteria. The authors have also confirmed that this article is unique and not under consideration or published in any other publication, and that they have permission from rights holders to reproduce any copyrighted material. Any disclosures are made in this section. The external blind peer reviewers report no conflicts of interest.
