Introduction
Cancer is a leading cause of death and accounts for 17% of mortality worldwide. According to a report from the International Agency for Research on Cancer (IARC), 9.5 million deaths and 18 million new cases of cancer were reported worldwide in 2018. Incidence and mortality rates are higher in men and in the developed world.1 While some types of cancer are treated based on biomarkers and specific genetic mutations,1,2 most cases are still treated by surgery, chemotherapy, and/or radiotherapy according to guidelines that integrate clinical, histopathological, therapy, imaging, and outcome data.
Accurate prediction of prognosis for the various cancer subtypes may improve the tailoring of therapy by allowing the expected outcome to be weighed against therapy choice, intensity, risk, side effects, and late complications.
In the last decade, large OMICs databases were created that contain data generated from thousands of cancer samples. The Cancer Genome Atlas (TCGA), a repository of genomic, epigenomic, transcriptomic, proteomic, and clinical data characterizing 33 types of tumors from over 20,000 patients, is considered one of the largest sources of cancer OMICs data. Many groups have used TCGA data with machine learning approaches to predict the prognosis of patients with various tumors, with varying levels of success.3-8
Random Forest 9 is a simple yet effective machine learning algorithm that has proved to be a successful predictor on structured data such as RNA expression profiles.10 It tends to have low overfitting and provides a simple feature importance score based on the Mean Decrease in Impurity (Gini importance). This allows refinement of prediction models and adds insight into the biological role of each feature in cancer development and prognosis.
Cancer outcome prediction using OMICS-related data has evolved over the last 2 decades, starting with the use of gene-expression microarrays.11,12 The accumulation of data from various OMICS technologies calls for the development of advanced cancer outcome prediction tools.
Here we describe a simple and robust prognosis prediction tool that applies the Random Forest algorithm to 5 tumor types from the TCGA database.
Methods
Data
RNA-seq FPKM-UQ normalized data for the cancer types of the TCGA projects were downloaded from the National Cancer Institute’s Genomic Data Commons (GDC) data portal. Clinical data were downloaded from the Firehose data portal. The samples in each project were divided into 2 groups: the first included samples from patients who were tumor-free for over 3 years (Tumor-Free group), and the second included samples from patients who succumbed to the disease at any time point (Deceased group). We only used projects in which the ratio between group size and the total number of samples was between 20% and 80% (Table 1). The models were validated using 2 datasets from the Clinical Proteomic Tumor Analysis Consortium v3 (CPTAC3): Clear Cell Renal Cell Carcinoma (CPTAC3-ccRCC) and Uterine Corpus Endometrial Carcinoma (CPTAC3-UCEC).
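The grouping rule and the cohort filter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the record fields (`vital_status`, `tumor_status`, `followup_days`) are hypothetical names, "over 3 years" is read as more than 1095 days, "succumbed to the disease" is read as deceased with tumor, and the ratio and the minimum of 30 samples are both computed over the labeled samples only.

```python
from collections import Counter

THREE_YEARS_DAYS = 3 * 365  # assumption: "over 3 years" means more than 1095 days


def assign_group(record):
    """Return 'Tumor-Free', 'Deceased', or None when the sample fits neither group."""
    if record["vital_status"] == "dead" and record["tumor_status"] == "with tumor":
        return "Deceased"  # succumbed to the disease at any time point
    if (record["vital_status"] == "alive"
            and record["tumor_status"] == "tumor free"
            and record["followup_days"] > THREE_YEARS_DAYS):
        return "Tumor-Free"
    return None


def project_is_usable(records, low=0.20, high=0.80, min_samples=30):
    """Apply the 20%-80% group-ratio rule and a minimum number of labeled samples."""
    counts = Counter(g for g in map(assign_group, records) if g is not None)
    total = sum(counts.values())
    if total < min_samples:
        return False
    # If the Deceased share is within [low, high], so is the Tumor-Free share.
    ratio = counts["Deceased"] / total
    return low <= ratio <= high


# Made-up cohort: 12 Deceased, 28 Tumor-Free, 10 with too short a follow-up.
records = (
    [{"vital_status": "dead", "tumor_status": "with tumor", "followup_days": 400}] * 12
    + [{"vital_status": "alive", "tumor_status": "tumor free", "followup_days": 1500}] * 28
    + [{"vital_status": "alive", "tumor_status": "tumor free", "followup_days": 200}] * 10
)
usable = project_is_usable(records)  # True: 12 of 40 labeled samples are Deceased
```

Samples that fit neither definition (for example, patients who are tumor-free but followed for less than 3 years) are left unlabeled in this sketch and do not count toward the ratio.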
Table 1. Sample summary for each TCGA project, showing the total number of samples in the cohort and the number of samples with RNA-seq data. The mean Area Under the Receiver Operating Characteristic Curve (AUC-ROC) over the last 500 models (500 to 1 features) was calculated for each project. Rows in bold are the models with a mean AUC-ROC above 0.8. The CPTAC3-ccRCC and CPTAC3-UCEC data were tested on the selected model with the minimal number of features for each project, and the corresponding AUC-ROC was calculated.
Software
We used Python 3.7.6 and its dependencies for the full data analysis. Random Forest classifiers were created using scikit-learn 0.23.2. Data parsing and analysis were done using pandas 1.1.1. Ingenuity Pathway Analysis was used for network enrichment assessments. The webapp was built with Flask 1.1.2 and Jinja2 2.11.2.
Random forest model training and testing
Dimension reduction for the model required several steps, as illustrated in Figure 1. The first RF model for each TCGA project was created using all 65 483 mRNA features. The model parameters were selected using the GridSearchCV module from scikit-learn, which tests all possible combinations from the provided list, as detailed in Supplemental Table 2. After the model training, the features were scored and sorted using the model’s property

Figure 1. Model generation workflow for each TCGA project.
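The workflow can be illustrated with a small, self-contained sketch on synthetic data. Several details are assumptions rather than the authors' settings: the parameter grid is hypothetical (the real one is in Supplemental Table 2), `roc_auc` is chosen here as the grid-search metric, features are ranked with scikit-learn's impurity-based `feature_importances_` attribute, and exactly one feature is dropped per round. The 70/30 train/test split follows the proportions given in the Discussion.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for an expression matrix (samples x features).
X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)  # 70% train, 30% test

# Hypothetical grid; GridSearchCV tries every combination with cross-validation.
param_grid = {"n_estimators": [25, 50], "max_depth": [None, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      scoring="roc_auc", cv=3)
search.fit(X_train, y_train)
best_params = search.best_params_

# Backward elimination: refit, record the test AUC-ROC, drop the weakest feature.
kept = np.arange(X.shape[1])  # column indices still in the model
auc_by_n_features = {}
while kept.size > 0:
    model = RandomForestClassifier(random_state=0, **best_params)
    model.fit(X_train[:, kept], y_train)
    scores = model.predict_proba(X_test[:, kept])[:, 1]
    auc_by_n_features[kept.size] = roc_auc_score(y_test, scores)
    # Impurity-based (Gini) importances, one value per retained feature.
    weakest = np.argmin(model.feature_importances_)
    kept = np.delete(kept, weakest)
```

After the loop, `auc_by_n_features` maps each feature count (30 down to 1 in this toy example) to the test AUC-ROC of the corresponding model, which is the kind of curve summarized per project in the Results.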
The features of the selected TCGA models were analyzed for enriched networks using Ingenuity Pathway Analysis (IPA) software. Only significant results (

Figure 2. AUC-ROC as a function of the number of features for 3 datasets: TCGA selected model data, CPTAC3-ccRCC, and CPTAC3-UCEC. The datasets were tested on each model and the AUC-ROC score was calculated. The blue line represents the average AUC-ROC over all 500 results of the TCGA dataset.
The code for the model-creation pipelines can be downloaded from https://github.com/omrin/surviveai.
Results
AUC-ROC mean of over 80% was achieved in 5 projects
Out of 26 RNA-Seq TCGA-tumor type projects, only 14 had the required ratio between group size and the total number of samples (20%-80%) and had a minimum of 30 samples.
An average AUC-ROC score was calculated for the last 500 models (feature counts ranging from 1 to 500). Out of the 15 cancer types, 5 tumor groups had an average AUC-ROC of over 80%: TCGA-LGG (low grade glioma), 0.92 AUC-ROC; TCGA-COAD (colon adenocarcinoma), 0.84 AUC-ROC; TCGA-SARC (sarcoma), 0.86 AUC-ROC; TCGA-CESC (cervical squamous cell carcinoma and endocervical adenocarcinoma), 0.8 AUC-ROC; and TCGA-KIRP (kidney renal papillary cell carcinoma), 0.88 AUC-ROC. Detailed results and statistics for each TCGA project can be found in Table 1.
The model with the fewest features whose AUC-ROC most closely matched the calculated average was selected (see Figure 2). Each selected model used dozens of features, from a minimum of 12 for TCGA-COAD to a maximum of 90 for TCGA-LGG.
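One way to express this selection rule in code is shown below. It is only a sketch of one possible reading: the `tolerance` parameter and the fallback to the single closest model are assumptions, and the scores in `example` are made up for illustration.

```python
from statistics import mean


def select_minimal_model(auc_by_n_features, tolerance=0.01):
    """Pick the smallest feature count whose AUC-ROC is within `tolerance` of the mean."""
    average = mean(auc_by_n_features.values())
    candidates = [n for n, auc in auc_by_n_features.items()
                  if abs(auc - average) <= tolerance]
    if not candidates:
        # Nothing within tolerance: fall back to the closest model,
        # preferring fewer features when distances are equal.
        return min(auc_by_n_features,
                   key=lambda n: (abs(auc_by_n_features[n] - average), n))
    return min(candidates)


# Made-up scores (feature count -> test AUC-ROC); the mean is about 0.807.
example = {1: 0.60, 5: 0.80, 12: 0.85, 50: 0.86, 200: 0.87, 500: 0.86}
selected = select_minimal_model(example)  # 5: the only model within 0.01 of the mean
```

A tighter or looser tolerance trades model size against how faithfully the selected model represents the average performance of the whole series.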
Prediction of the top 5 models highly correlates with sample tumor origin
The top 5 models were tested on the sample datasets of all 15 TCGA projects. AUC-ROC scores were calculated for each dataset from the predictions for each sample and the known outcomes (see heatmap in Table 3). As expected, the scores for the samples that were used to train the prediction models were high and close to 1. For the other datasets, the predictions showed almost no correlation with the true condition of the samples (scores near 0.5).
Table 3. Each dataset was tested on the top 5 models. The AUC-ROC score was calculated from the predictions for each dataset.
Red indicates an inverse prediction correlation, with intensity ranging from 0 to 0.5. Green indicates a direct prediction correlation, ranging from 0.5 to 1. White indicates no correlation.
Interestingly, a high negative correlation was found between the prediction models TCGA-CESC, TCGA-LGG, and TCGA-SARC and the predictions for the samples of the TCGA-READ project.
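The cross-testing above, and the meaning of scores below 0.5, can be illustrated with a short sketch. AUC-ROC is computed here from its pairwise definition (the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one), so a value near 0 means the model ranks the two groups in the opposite order. The model and dataset names are toy placeholders, and each "model" is assumed to be a plain function from a feature row to a score.

```python
def auc_roc(labels, scores):
    """AUC-ROC as the probability that a positive sample outscores a negative one."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    # Each positive/negative pair counts 1 for a win and 0.5 for a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def cross_project_auc(models, datasets):
    """Score every dataset with every model; returns {(model, dataset): AUC-ROC}."""
    return {
        (m_name, d_name): auc_roc(labels, [predict(row) for row in rows])
        for m_name, predict in models.items()
        for d_name, (rows, labels) in datasets.items()
    }


# Toy stand-ins: model_B ranks samples in exactly the opposite order to model_A.
models = {"model_A": lambda row: row[0], "model_B": lambda row: -row[0]}
datasets = {"cohort_1": ([[0.1], [0.4], [0.8], [0.9]], [0, 0, 1, 1])}
matrix = cross_project_auc(models, datasets)
# matrix[("model_A", "cohort_1")] is 1.0 and matrix[("model_B", "cohort_1")] is 0.0
```

In the heatmap legend's terms, `model_A` would appear green (direct correlation) and `model_B` red (inverse correlation) for this cohort.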
Correlation of outcome predictions of TCGA dataset analyses with the validation datasets within tumor types
In addition to validating the models trained on the TCGA training sets against the TCGA testing sets, we further validated them using 2 independent datasets to measure their robustness. We chose the CPTAC3-ccRCC renal tumor dataset to validate a model developed for a tumor of the same tissue of origin that had a high prediction score, and the CPTAC3-UCEC uterine tumor dataset to test a tumor type for which the specific prediction model had a lower score.
The AUC-ROC obtained by applying the TCGA-KIRP based model to the CPTAC3-ccRCC data was 0.86, very similar to the value of 0.88 for the TCGA-KIRP test group. TCGA renal cell tumors have 2 subtypes: kidney renal clear cell carcinoma (TCGA-KIRC) and kidney renal papillary cell carcinoma (TCGA-KIRP). We analyzed these subtypes separately, and the AUC-ROC scores for the test groups were 0.79 and 0.88, respectively. When we applied these 2 prediction models to the validation cohort CPTAC3-ccRCC, which contains clear cell renal cell tumors, the scores of both models were similar to those of their test groups: 0.77 and 0.86 AUC-ROC, respectively. The less efficient TCGA-UCEC model indeed gave a low prediction score, 0.63, for the CPTAC3-UCEC data. Surprisingly, the tissue-discordant TCGA-KIRP prediction model outperformed the tissue-concordant TCGA-UCEC model on the uterine CPTAC3-UCEC dataset, with an AUC-ROC of 0.663. Finally, we tested all the TCGA prediction models on the renal CPTAC3-ccRCC dataset. For most of the models, the results were below 0.7, except for the TCGA-SARC model, which scored 0.85. When we tested the uterine CPTAC3-UCEC dataset on all the TCGA prediction models, all the scores were very low except for the TCGA-SARC model, which scored 0.74.
TCGA-KIRP model accurately predicted the prognosis of CPTAC3-ccRCC samples, but on a different scale
We analyzed the RNA-seq data of the CPTAC3-ccRCC samples using the selected TCGA-KIRP final model (created using 42 features, as described in Table 2). The mean scores for the Deceased and Tumor-Free groups were significantly different, as shown in Figure 3; however, the scale of the scores differed between the 2 datasets. For the TCGA-KIRP samples, the model produced scores between 0.025 and 0.95, while the CPTAC3-ccRCC sample scores ranged from 0.18 to 0.425 (before normalization to 1). The AUC-ROC of the model for CPTAC3-ccRCC was 0.86, almost identical to that of the TCGA-KIRP testing set.

Figure 3. Mean scores from the TCGA-KIRP model for the Deceased and Tumor-Free groups of TCGA-KIRP and CPTAC3-ccRCC.
Pathway analysis revealed enrichment for cancer and cancer-related canonical pathways
We used the Ingenuity Pathway Analysis (IPA) to analyze the genes selected in the final model for enrichment of related pathways. The top pathways were those involved in basic functions related to tumorigenesis and organ development. For example, in the TCGA-KIRP tumor prognosis prediction model, the pathways of Cell Cycle, Connective Tissue Development and Function, and Renal and Urological System Development and Function were the most highly enriched with a
As expected, functional pathways related to the tissues of origin such as Renal and Urological System Development and Function, Reproductive System Disease, and Connective Tissue Disorders correlated to the primary tissue of the TCGA prediction models: kidney, cervix and glial cells, respectively.
Comparison of gene enrichment of 2 prediction models for the same samples cohort
Using the same set of TCGA-KIRP samples, we developed 2 separate models, one with 300 features and one with 42 features. The prediction scores of the 2 models were 0.85 and 0.86, respectively. The 42 genes that comprise the smaller prediction model are included in the 300-gene model. Analysis of the 300-feature TCGA-KIRP model with the Ingenuity Pathway Analysis software matched 7 networks that are significantly enriched for those genes (see Supplemental Table 3 for specific molecules in each network and

Shared features between top networks of TCGA-KIRP prediction models: (A) The top network from IPA prediction for the TCGA-KIRP 300 features model. That network is associated with Cancer, Organismal Injury and Abnormalities, Reproductive System Disease pathways. The gray nodes are the nodes from the model feature list (26 out of 35 network nodes,
DMBT1 16 (Deleted In Malignant Brain Tumors 1) is a tumor suppressor gene. Deletions in this gene play a role in the progression of many human cancers, including brain, lung, esophageal, gastric, and colorectal tumors. IL11, upregulated through activation of the KRT8-IL11 axis,17 promotes tumor metastasis and is predictive of a poor prognosis in renal cell carcinoma. It was also suggested as a potential therapeutic target in cancer treatment.18 HOXB6 (Homeobox B6) was found to play different roles in several cancer pathways,19,20 including in methylation-driven genes related to prognosis in renal cell carcinoma.21 TRIB3 (Tribbles pseudokinase 3) has many biological functions, and its high expression was correlated with both advanced tumor stage and unfavorable prognosis.22 High expression of TRIB3 in other cancer types, such as hepatocellular carcinoma and lung cancer, also correlated with poor survival rate.23,24 PIM1 is a proto-oncogene belonging to the Ser/Thr protein kinase family. It was recently found that its overexpression in human renal cell carcinoma tissues and cell lines positively correlated with disease progression.25 PIM1 was found to be involved in Smad2, Smad3, and c-Myc 26 phosphorylation and was suggested as a potential therapeutic target for renal cell carcinoma patients.
SurviveAI webapp
A free interactive web application based on the models was created using Flask 1.1.2. It enables physicians and researchers to obtain clinical predictions (for research purposes only) for multiple RNA-seq cancer samples. The easy-to-use interface allows users to enter specific gene lists with FPKM-UQ values for each gene and to obtain predicted survival scores for 5 cancer types: cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC), colon adenocarcinoma (COAD), kidney renal papillary cell carcinoma (KIRP), brain lower grade glioma (LGG), and sarcoma (SARC). The tool uses scikit-learn’s
SurviveAI webapp can be accessed at https://tinyurl.com/surviveai
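A gene list entered by a user has to be arranged in the same feature order the model was trained on before it can be scored. The helper below is a hypothetical sketch of that step, not code from the webapp: the function name, the error handling, and the expression values are all assumptions, and the three gene names are borrowed from the pathway discussion above purely for illustration.

```python
def build_feature_row(model_genes, sample_values):
    """Order a {gene: FPKM-UQ} mapping to match the model's training feature order."""
    missing = [g for g in model_genes if g not in sample_values]
    if missing:
        # Refuse to score rather than silently impute absent genes.
        raise KeyError(f"missing expression values for: {', '.join(missing)}")
    # Genes that the model does not use are ignored.
    return [float(sample_values[g]) for g in model_genes]


# Hypothetical model feature order and made-up FPKM-UQ values for one sample.
model_genes = ["IL11", "TRIB3", "PIM1"]
sample = {"PIM1": 30500, "IL11": 1200.5, "TRIB3": 88000, "UNUSED_GENE": 42}
row = build_feature_row(model_genes, sample)  # [1200.5, 88000.0, 30500.0]
```

Raising an error on missing genes is a deliberate choice in this sketch, since a tree-based model would otherwise produce a score from an arbitrary placeholder value.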
Discussion
Following the significant decrease in the price of high-throughput sequencing, projects like TCGA have generated vast amounts of data that enable machine learning. Usually, only specific cancer cohorts are used to create prediction models, and multiple sources of OMICs data are combined to enhance AUC-ROC-based predictions. A multi-OMICs prediction model is more costly and less useful for routine clinical use because of the larger number of methodologies needed. To keep the model user friendly and readily applicable, we based it on RNA-seq data only, which is affordable and accessible in clinical and research facilities. We used 70% of the samples in each TCGA project to train the prediction models and, to validate the predictions, tested them against the remaining 30% of the samples from the cohort, which were not used for training (test data). In addition, the models were tested against external datasets, CPTAC3-ccRCC and CPTAC3-UCEC. As expected, the models provided low prediction scores for the CPTAC3-UCEC samples, as none of the models were related to uterine cancer. Although KIRP (kidney renal papillary cell carcinoma) and ccRCC (clear cell renal cell carcinoma) are different subtypes of kidney cancer, the TCGA-KIRP model provided excellent predictions (AUC-ROC = 0.86) for the ccRCC dataset samples. Interestingly, the TCGA-SARC model provided about the same accuracy (AUC-ROC = 0.85) for this dataset, even though the 2 models (KIRP and SARC) do not share any features (see Table 2).
We highly recommend calibrating the models with a truth set of at least 10 to 20 samples before use, as RNA expression levels tend to be sensitive to batch effects.
For example, the CPTAC3-ccRCC Tumor-Free samples produced an average score of 0.73, while the scores of the TCGA-KIRP Tumor-Free samples were between 0.9 and 1 (Table 3).
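A simple form of such calibration is a linear rescaling learned from the raw scores of a local truth set. The sketch below uses min-max rescaling, which is an assumption; the source does not specify how scores were normalized. The truth-set values are made up, chosen only to span the raw CPTAC3-ccRCC range reported in the Results (0.18 to 0.425).

```python
def fit_min_max(truth_scores):
    """Learn a linear rescaling to [0, 1] from raw model scores of a local truth set."""
    low, high = min(truth_scores), max(truth_scores)
    if high == low:
        raise ValueError("truth-set scores must not all be identical")

    def rescale(score):
        # Clip so that new samples outside the truth-set range stay in [0, 1].
        return min(1.0, max(0.0, (score - low) / (high - low)))

    return rescale


# Made-up raw scores from a 10-sample truth set.
truth_set = [0.18, 0.21, 0.25, 0.30, 0.33, 0.36, 0.38, 0.40, 0.41, 0.425]
rescale = fit_min_max(truth_set)
calibrated = [rescale(s) for s in truth_set]  # now spans 0.0 to 1.0
```

Because the rescaling is monotonic within the truth-set range, it leaves the ranking of samples, and therefore AUC-ROC, unchanged; it only makes scores from a new batch comparable to a fixed cutoff.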
In this study, we show a novel method of machine learning-driven pathway discovery using the simple and robust technique of reverse feature elimination. In addition, the decision to use 2 distinct groups (Deceased and Tumor-Free) allowed us to identify critical genes and features that are important for progression prediction in some of the projects.
We checked all the projects available for analysis in the TCGA datasets and used only RNA-seq data for predictions, because such data are relatively inexpensive and simple to produce for clinical and research purposes. This allows other researchers to use the models, which are freely available online. The Random Forest model is simple and allows us to easily extract the most important features from the data.
In 4 out of 5 models, a significant portion of the models’ genes were part of cancer-related pathways. The molecules which were not included might be extensions of those networks or create another unknown network themselves. From a clinical perspective those genes might serve as new drug targets or biomarkers.
Supplemental Material
sj-xlsx-1-cix-10.1177_11769351221127875 – Supplemental material for SurviveAI: Long Term Survival Prediction of Cancer Patients Based on Somatic RNA-Seq Expression
Supplemental material, sj-xlsx-1-cix-10.1177_11769351221127875 for SurviveAI: Long Term Survival Prediction of Cancer Patients Based on Somatic RNA-Seq Expression by Omri Nayshool, Nitzan Kol, Elisheva Javaski, Ninette Amariglio and Gideon Rechavi in Cancer Informatics
