Abstract
Keywords
Introduction
Lung cancer is one of the most common cancers worldwide and the leading contributor to cancer deaths, having one of the lowest survival rates within 5 years after diagnosis.1,2 Lung carcinomas are usually classified as small-cell lung carcinomas (SCLC) and non-small cell lung carcinomas (NSCLC). NSCLC accounts for 80% of all lung cancer cases. The most common histological types of NSCLC are squamous cell carcinoma (SCC), adenocarcinoma (AC) and its subtype bronchioloalveolar carcinoma (BAC).
Tumor stage according to the TNM classification remains the strongest predictor of survival in lung cancer. The TNM staging system is based on tumor size, involvement of lymph nodes (nodal status) and presence or absence of metastases.3,4 However, it is not based on intrinsic biological differences between tumor cells and does not provide a sufficiently accurate prognosis.5 Between 25% and 30% of patients are diagnosed with early-stage (ie, stage I and II) disease and are treated primarily by surgical resection. However, 30%–55% of these patients develop recurrence within 5 years, indicating biological variation and heterogeneity among patients’ tumors.6 Therefore, markers that can accurately classify early-stage NSCLC patients into different prognostic groups may help in selecting patients who should receive specific therapies.
Recently, gene expression profiling has been used to identify patients with high risk of relapse.7–10
We investigated the gene expression signature of NSCLC by microarray analysis and report a gene expression pattern associated with subtype. We also determined, for stage Ib patients, the survival rates for a 1000-day cut-off together with metagenes potentially associated with survival.
Materials and Methods
Patients and Tumor Samples
The Ethical Committee of Tartu University approved the sample collection, which was based on signed informed consent from the patients. Normal samples were collected from tissues adjacent to the patients’ tumors and were confirmed to be non-cancerous by pathologists. The histological classification of the carcinomas was conducted according to the World Health Organization (WHO) classification of carcinomas.11–14 Eighty-five patients underwent surgical resection at Tartu University Hospital between November 2002 and December 2006, and their tumors were pathologically confirmed as NSCLC. Of the 85 tumors, 62 were SCCs and 23 were BCs and ACs. The patients were staged after surgery according to the guidelines of the American Joint Committee on Cancer 15 (Supplemental Table 1).
Statistical analysis of gene expression data
Quantile-normalized and log-transformed expression data were obtained from our previous study, 16 which used the Illumina Sentrix BeadChip (HumanWG-6_V2) to profile gene expression. This array provides genome-wide transcriptional coverage of well-characterized genes, gene candidates and splice variants, with a significant portion targeting well-established sequences of 18,072 genes. Our initial dataset consisted of 85 lung tumor samples and 21 adjacent cancer-free lung samples (Supplemental Table 1).
Patients who had received preoperative chemotherapy were excluded from the differential gene expression analyses, leaving 78 tumor samples and 20 adjacent control samples.
A moderated two-sample t-test from the R package LIMMA 17 was used to find differentially expressed genes between sample groups, using FDR-corrected
The differentially expressed genes and all samples were clustered hierarchically using Pearson correlation distance with average linkage and visualized using a heatmap. Pathway enrichment analyses were carried out using GeneCodis 2.0.18,19
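The text refers to FDR-corrected testing without naming the correction procedure. As a minimal sketch, assuming the Benjamini-Hochberg procedure (an assumption, not confirmed by the text), adjusted p-values can be computed as follows. The function name and the toy p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Assumption: the FDR correction in the text is Benjamini-Hochberg;
    the source does not state which procedure was used.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy p-values for five hypothetical genes.
raw = [0.001, 0.008, 0.039, 0.041, 0.27]
adj = benjamini_hochberg(raw)
significant = [i for i, q in enumerate(adj) if q < 0.05]
```

Genes whose adjusted value falls below the chosen FDR threshold would then be reported as differentially expressed; the 0.05 threshold used here is illustrative only.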
Bayesian analysis to assess risk genes in stage Ib patients
We restricted our analysis to stage Ib patients only, since preliminary analysis suggested that gene-expression patterns associated with survival differ between stages and the stage Ib group was the largest (48 patients altogether, but clinical data were available for only 46 of them). From the survival times we identified two distinct patient groups, a group with short survival (<1000 days, n = 20) and a group with long survival (>1000 days, n = 26) (Fig. 2A). To reduce the number of genes, 500 gene clusters, or “metagenes”, were formed from the 5000 genes with the highest variance among the 46 patients. The genes were clustered using complete-linkage hierarchical clustering,20,21 where the distance between two genes was defined as the correlation coefficient between gene-expression values. Each metagene was summarized by the mean of the genes in its cluster. Thus genes demonstrating similar variation between the subjects are summarized as a single “metagene”.
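The metagene construction described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the distance is taken as 1 minus the Pearson correlation (the text only says the distance is based on the correlation coefficient), and the function names and toy data are hypothetical. The study used 5000 genes and 500 clusters; the toy example uses four genes and two clusters.

```python
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (non-constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def build_metagenes(expr, n_top, n_clusters):
    """expr maps gene name -> list of expression values, one per patient."""
    # 1. Keep the n_top genes with the highest variance across patients.
    top = sorted(expr, key=lambda g: pvariance(expr[g]), reverse=True)[:n_top]
    # 2. Pairwise distance, ASSUMED here to be 1 - Pearson correlation.
    dist = {(a, b): 1.0 - pearson(expr[a], expr[b]) for a in top for b in top}
    # 3. Complete linkage: repeatedly merge the two clusters whose
    #    farthest pair of members is closest.
    clusters = [[g] for g in top]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(dist[a, b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    # 4. Summarize each cluster by the per-patient mean of its genes.
    n_patients = len(next(iter(expr.values())))
    metagenes = [
        [mean([expr[g][p] for g in c]) for p in range(n_patients)]
        for c in clusters
    ]
    return clusters, metagenes


# Toy data: five hypothetical genes measured in four patients.
toy = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.0, 4.5, 2.0],
    "g5": [5.0, 5.0, 5.0, 5.1],  # low variance, filtered out
}
clusters, metagenes = build_metagenes(toy, n_top=4, n_clusters=2)
```

The naive merge loop is adequate for a toy example; for thousands of genes a library implementation of hierarchical clustering would be the practical choice.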
The association between the metagenes and the two groups was analyzed with a sparse Bayesian probit model for binary response variables.22–24 According to the model, the probability of being in the short survival group is given by
The performance of the method was further evaluated by the ROC curve, constructed by plotting the true positive rate (sensitivity) against the false positive rate (1 − specificity) for different cut-off values in the range (0, 1). In addition to drawing the empirical curve, we report the AUC (area under the ROC curve) value together with the
Quantitative RT-PCR
To validate the gene expression levels detected with microarray analysis, qRT-PCR was performed for the four up-regulated genes:
Results and Discussion
Identification of two gene expression patterns and correlation with NSCLC subtypes
We identified 599 genes that were down-regulated and 402 genes that were up-regulated in NSCLC compared to normal lung tissue (Supplemental Table 3 and Supplemental Fig. 1). According to GeneCodis 2.0 analyses, the main up-regulated processes in cancer were related to mitosis, cell division, DNA replication, blood vessel development, keratinocyte differentiation and epidermis development (Supplemental Table 4). The number of down-regulated processes was much larger, including immune response, signal transduction, cell-to-cell adhesion, cell surface receptor linked signaling pathway, cell differentiation and others (Supplemental Table 5). To find differentially expressed genes between lung cancer subtypes, we used only tumor samples of different subtypes as sample groups. Because of the small sample size and similar histology 27 of the AC and BC samples, we analyzed these as one group. We identified 112 genes that were up-regulated and 101 genes that were down-regulated in AC/BC compared to SCC (Supplemental Fig. 2). Some of the genes showing the largest fold changes in our dataset belong to the keratin gene family, whose overexpression has been shown to be SCC-specific. 28
We also carried out analyses to identify genes that may distinguish NSCLC stages. Because most of our samples were early-stage (Ia and Ib) tumors and there was only a limited number of later-stage samples, we compared stage I tumor samples with all the others. We did not find any genes that were significantly overexpressed in either sample group after multiple testing correction. This may suggest that although the TNM staging system reflects the clinical status of a tumor, it may not be the best tool to assess the underlying biological properties of cancer caused by aberrant gene expression.
The expression of the four up-regulated and four down-regulated genes (see Supplemental Table 2) was confirmed using eight pairs of normal and tumor tissue samples from the microarray sample set and further validated using three sample pairs that had not previously been analyzed by microarray (Fig. 1A). The results were quantified using the 2^(−ΔΔCt) method and

Among the up-regulated genes we validated,
Identifying genes associated with survival in stage Ib patients
To identify prognostic markers for high-risk patients with early-stage disease, the Bayesian model was first applied to both Ia and Ib patients (results not shown). However, the model-based prediction into low and high survival risk groups was clearly less accurate for Ia patients than for Ib patients. This suggests that gene-expression profiles associated with survival are different between stages Ia and Ib. A possible explanation for why the model worked for stage Ib but not for stage Ia patients could be that the number of patients in stage Ib (n = 46) was larger than that in stage Ia (n = 14).
Since the largest group of samples in our dataset was from patients with tumor stage Ib, we next focused on this subset of patients. A sparse Bayesian regression model was constructed using all 46 stage Ib patients. The number of influential metagenes supported by the model was 4 (mode) with a 95% credible interval of 1–10. The top two metagenes (

).
Metagenes
Identifying stage-specific survival-associated genes would require fitting separate models or adopting a model that allows stage-specific gene effects. For example, the binary classification-tree approach 36 allows one to combine very different gene-expression profiles (leaves of the tree) into the same class. We used the cut-off value of 1000 days, since at that point there is a clear gap (757 days to 1225 days) in the survival times of Ib patients. As shown in Fig. 2A, there are 3 patients who died after the cut-off that may have a negative effect on the quality of the data, ie, if the censored times may turn out to be much longer. In Potti et al 2006, the analysis was based on a cohort consisting of NSCLC patients with a recurrence within 2.5 years and patients with no recurrence within 5 years. Their analysis also differs from ours in that their model considers both clinical data and gene data.
Predicting patient survival for stage Ib patients
The ability of the Bayesian model to correctly classify stage Ib patients into groups of short and long survival based on gene-expression profiles was evaluated using leave-one-out cross-validation and the ROC curve. The use of leave-one-out cross-validation provides an estimate of the ability of the model to predict survival of new patients (Fig. 2A). The cross-validation classification error was 33% (15 out of 46 were misclassified) for all patients, 29% (5/17) for the short survival group (<1000 days), and 34% (10/29) for the long survival group (>1000 days).
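The evaluation scheme described above (leave-one-out cross-validation followed by an ROC curve) can be sketched as follows. The sparse Bayesian probit model is not reproduced here; a nearest-centroid scorer is swapped in as a clearly labeled stand-in, and the function names, the toy data and the rule "classify as short survival when the score is at or above the cut-off" are all assumptions of this sketch.

```python
import math


def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    return [sum(col) / len(col) for col in zip(*rows)]


def fit(train_x, train_y):
    # STAND-IN classifier (not the paper's sparse Bayesian probit model):
    # one centroid per class, label 1 = short survival, 0 = long survival.
    short = [x for x, y in zip(train_x, train_y) if y == 1]
    long_ = [x for x, y in zip(train_x, train_y) if y == 0]
    return centroid(short), centroid(long_)


def predict(model, x):
    # Score in (0, 1): closer to the short-survival centroid -> higher score.
    c_short, c_long = model
    d_short, d_long = math.dist(x, c_short), math.dist(x, c_long)
    return d_long / (d_short + d_long)


def leave_one_out(features, labels):
    """Score each patient with a model fitted on all other patients."""
    scores = []
    for i in range(len(labels)):
        model = fit(features[:i] + features[i + 1:], labels[:i] + labels[i + 1:])
        scores.append(predict(model, features[i]))
    return scores


def roc_points(labels, scores):
    """(FPR, TPR) pairs obtained by lowering the cut-off over all scores."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for cut in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= cut and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= cut and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data: one hypothetical metagene value per patient.
features = [[2.0], [1.8], [2.2], [0.9], [0.1], [0.3], [0.2], [1.2]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

scores = leave_one_out(features, labels)
predicted = [1 if s >= 0.5 else 0 for s in scores]
error_rate = sum(p != y for p, y in zip(predicted, labels)) / len(labels)
area = auc(roc_points(labels, scores))
```

Because each patient is scored by a model that never saw that patient, the resulting error rate and AUC estimate performance on new patients rather than the fit to the training data.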
Kaplan-Meier analysis of the model-predicted risk groups for Ib patients was performed. The difference between the survival rates in the high and low risk groups was significant (

The empirical ROC curve (solid curve): the true positive rate (TPR) plotted as a function of the false positive rate (FPR) for different cut-off values. Jumps in the curve correspond to changes in the classification outcome of the patients due to the use of different cut-off values. For cut-off value 0 all patients are classified into the long survival group, FPR = 0 and TPR = 0. At the other extreme, ie, cut-off value 1, all patients are classified into the short survival group, TPR = 1 and FPR = 1. The area under the curve (AUC = 0.728,

Kaplan-Meier plot of the survival probability in the high and low risk groups predicted by the Bayesian model. The high risk group (short survival) consists of 24 patients and the low risk group (long survival) consists of 22 patients. Vertical drops indicate deaths and ticks on the solid lines are censored survival times. The survival rates of the two groups are significantly different (
Conclusion
In this paper we applied two statistical approaches, one designed to identify alternating gene-expression patterns associated with tumor subtypes, and another designed to find genes associated with the risk of short survival in NSCLC patients. The approaches complement each other: the first provides information on gene-expression variation between tumor subtypes but does not address the important issue of different survival outcomes among patients classified into a single tumor stage, while the second identifies genes associated with the risk of short survival in stage Ib patients, who formed the largest patient group. Our results may provide a reference for diagnosis and prognosis, as the genes most up-regulated in pattern I and the genes most down-regulated in pattern II may distinguish carcinogenesis progression for the gene-expression pattern that corresponds to subtypes.
Disclosure
This manuscript has been read and approved by all authors. This paper is unique and not under consideration by any other publication and has not been published elsewhere. The authors and peer reviewers report no conflicts of interest. The authors confirm that they have permission to reproduce any copyrighted material.
