Abstract
Keywords
Introduction
Radiotherapy is a major cancer treatment, but the radiosensitivity of different tumors or even the same type of tumor in different patients varies widely. 1 Therefore, predicting the radiosensitivity of patients before radiation therapy, identifying underlying molecular signatures, and constructing their regulatory network have high clinical and oncological importance.
Radiosensitivity can be measured as the fraction of cells surviving a single 2 Gy dose of radiation (SF2), with high values indicating radiation resistance. While other methods are available to measure cellular radiation sensitivity in cell lines, SF2 is considered to be the gold standard and is supported by strong clinical evidence. 2
Variations in mRNA expression values (GE), copy numbers, and other genomic patterns are thought to be the main underlying factors for different radiation responses. Accumulating large amounts of these data provides an effective but challenging way to predict the radiosensitivity of tumor cells.
Torres-Roca et al predicted the radiosensitivity of 35 human cell lines in a NCI-60 panel using a linear classifier of expression values of tens of genes selected by the significance analysis of microarrays (SAM) method in 2005.
3
They developed a radiosensitivity index (RSI) as a biomarker of cellular radiosensitivity in 48 NCI-60 cancer cell lines in 2009. SF2 was the central criterion for both feature selection and final model training for RSI development. Ten of the selected ‘hub’ genes were then used to construct a linear prediction model of SF2.
4
Additionally, Tewari et al investigated the feasibility of integrating an
Besides GE data, copy number variation (CNV) and methylation (ME) data are also related to radiosensitivity. Work by Moelans et al indicated that allelic loss and amplification at the 8p11-12 breakpoint region are associated with poor radiotherapy responses, 7 while Zhu et al reported a pivotal role for DNA ME in tumor radiosensitivity. 8
Unfortunately, none of the individual types of genomic data thoroughly capture the complexity of the cancer genome or precisely pinpoint the cancer-driving mechanism.
9
Additionally, it has become increasingly clear that the integrative analysis of multi-omic data types can aid the search for potential “drivers” by uncovering genomic features dysregulated by multiple mechanisms.
10
More importantly, true oncogenic mechanisms are more visible when combining evidence across patterns of alterations in DNA CNV, ME, GE and mutational profiles.
11,12
A well-known example is the
No widely accepted threshold exists between radiation-sensitive or -resistant phenotypes and SF2 values from 0 to 1. Therefore, instead of roughly dividing cell lines into different groups by subjective cutoffs of SF2 values, it may be more useful to consider a regression issue for continuous variables of SF2. To this end, we propose a novel integrated multiple genomic data regression method for SF2 prediction, focusing on identifying signature genes for functional and genetic network analysis, rather than “hotspot” or “hot-loci” from GE, CNV, and ME data. 13,14
Studying the gene regulatory network (GRN) structure provides important insights into the mechanisms of complex diseases. 15,16 Several studies have shown that gene expression is influenced not only by the expression of other genes but also by CNV or other biological variations. 17 Therefore, it is also necessary to infer GRN using multi-genomic data. Correspondingly, the aims of this study were two-fold: 1) to identify signature genes strongly associated with radiosensitivity from fused multiple genomic data to further corroborate and expand the evidence of radiosensitivity-associated signature genes in the prediction of radioresponses; and 2) to uncover regulatory relationships among identified signatures using fused multiple genomic data, employing least absolute shrinkage and selection operator (LASSO) regression based on coordinate descent algorithms to construct GRN. Figure 1 shows the study outline.

The overall structure of this paper. Based on the fact that the current standard approaches rely on separate mono-genomics data analyses followed by manual integration, multiple genomic data fused regression approach (MGPLS) is proposed to identifying signature genes. MGPLS method can analysis all data types simultaneously using a single integrated regression model as well as eliminating noise effects. VIP: variable importance on projection; CV: cross-validation; UVE: uninformative variable elimination.
Materials and Methods
Datasets
GE, CNV, and ME data of NCI-60 cell lines were downloaded from (https://discover.nci.nih.gov/cellminer/loadDownload.do). GE data were collected via five platforms (Affymetrix: HG-U95, HG-U133, HG-U133 Plus 2.0, HG Exon 1.0 ST; and Agilent: Whole Human Genome Oligo array). 18 CNV data were collected via four different platforms (Agilent Human Genome CGH Microarray 44A, Nimblegen HG19 CGH 385K WG Tiling v2.0, Affymetrix GeneChip Human Mapping 500k Array Set, and Illumina Human1Mv1_C Beadchip). 19 ME data were collected using the Infinium HumanMethylation450 BeadChip Kit platform. The measured SF2 values of corresponding cell lines were collected from the study of Eschrich et al 4 and are listed in Table 1.
Measured and predicted SF2 values for NCI-60 cell lines
* Cell lines sorted by multiple genomic data fused partial least square deep regression (MGPLS).
Data preprocessing
Before training the dataset, we performed the following preprocessing on the downloaded data: Considering that there are only 3–4 samples per cancer type in the NCI-60 cell line panel, if the value of a gene is missing in one sample then it is absent from >25% of the samples of that cancer type. Therefore, these genes were removed from the analysis. Common genes in all GE, CNV, and ME datasets were identified. Because GEs and CNVs of genes that are strongly related to each other are more likely to contain less noise,
19
they were selected using Pearson correlation coefficients (cutoff, 0.5). GE, CNV, and ME data were respectively standardized with the Z-Score method for further regression analysis.
A multiple genomic data fused partial least square deep regression (MGPLS) method
Partial least squares (PLS) is a widely used algorithm for modeling relationships between sets of observed variables using latent variables. It comprises regression and classification tasks as well as dimension reducing and modeling. 20 Instead of identifying hyperplanes of minimum variance between the response and independent variables, it finds a linear regression model by projecting the predicted variables (regression or classification labels) and the observed variables (fused GE, CNV, or ME values of genes in our case) to a new lower space. 21 This is highly suited to the analysis of high-dimension, low-sample-size data in bioinformatics.
To integrate multiple genome data for improved regression performance, we proposed an MGPLS method:
Given the predictor matrix
where
MGPLS may be applied to cases where the aim is to describe the linear relationship between
based on the basic latent variable decomposition:
where
where
From Eq.(1) and Eq.(5), it is obvious that
where
The regression coefficient matrix
where, in our case, SF2 is the response variable. As a result, the regression model of SF2 is a combination of GE, CNV, and ME, which means they are integrated at the raw data level. MGPLS incorporates principal component analysis and LV extraction together so that it can simultaneously analyze multiple genomic data using a single integrated regression model.
The measurement of regression/prediction performance
The root mean square error (RMSE) was used to evaluate the accuracy of the radiosensitivity prediction model:
where
The identification of signature genes by optimizing SF2 regression accuracy
Because of the missing information on ME data (see

Flow chart of signature genes selection using GE and CNV data. MGPLS-UVE algorithm with 10 times of 6-fold cross-validation is employed to select the genes with the stably highest contribution to SF2 predicting model. There are 7622 variables at the beginning and 500 variables are left after MGPLS rough selection.
Support vector machine (SVM)
SVM is a supervised machine learning method used for classification and regression analysis. The key concept is to non-linearly map input vectors to a very high-dimension feature space
Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes Pathway (KEGG) analysis
The GO knowledgebase is the largest source of information worldwide regarding gene function. It provides a comprehensive, computational model of biological systems, ranging from the molecular to the organic level, across the multiplicity of species in the tree of life. 25
KEGG is a knowledgebase for the systematic analysis of gene functions at the molecular level in biological systems, from cells to organisms and ecosystems. It has been generated by genome sequencing and other high-throughput experimental technologies.
26
Both GO and KEGG pathway enrichment analyses for all signature genes were performed using
Sparse GRN inference based on the LASSO method
Let
where
Eq.(10) can be rewritten in a least square minimization problem as:
where
To obtain a sparse model and avoid overfitting, we added the L1 regularization term to Eq.(11) to make it a LASSO regression form as follows:
where
LASSO is a multivariate linear regression method. When there are many features and the number of samples is relatively small, LASSO can effectively avoid overfitting and obtain sparse solutions via an
There are two hypotheses in the model: There is no self-regulation, i.e., the diagonal elements of the A gene can be directly regulated only by CNV for the gene itself not for other genes, i.e., only diagonal elements of the
After obtaining adjacency matrices
With the exception of KEGG and GO, all analyses were performed using MATLAB codes. The corresponding MATLAB toolkit, MGPLS-UVE, can be freely downloaded from our website https://www.clickgenome.org/papers/MGPLS.html. Further details of LASSO, GRN inference, and other methods and algorithms are included in the Supplementary materials.
Results
Identified signature genes and their SF2 prediction performance
According to the RMSE values obtained by different numbers of genes shown in Figure S1a, 113 genes corresponding to the smallest RMSE were selected as signature genes. The smallest RMSE obtained by these 113 genes means this gene set has the highest prediction performance or closest relationship to SF2. The gene names, Entrez gene IDs, and other detailed information of these 113 genes are listed in Table S2. Five (
Using only the CNV values of the five CNV signatures and the GE values of the 108 GE signatures, the RMSE was optimized to 0.094 with MGPLS, a linear method. The corresponding predicted SF2s are listed in Table 1. The smallest difference between the measured and predicted SF2s of all 60 cell lines was 0.002 (CNS:U251), while the largest was 0.229 (LC:NCI-H23). The average error of the 60 cell lines was 0.075. Five cell lines (CNS:U251, OV:OVCAR-4, LE:CCRF-CEM, CNS:SNB-19, and RE:SN12C) had predicted errors <0.01 and the other 46 cell lines had predicted errors <0.1.
To improve their prediction accuracy, SVM with radial basis kernel function was explored to predict SF2 values with the CNV and GE of the 113 signature genes. The corresponding predicted SF2s are listed in Table 1. The RMSE value of the SVM prediction model was 0.0025. Twenty-two cell lines had differences between measured and predicted SF2s of 0. Only five cell lines had an absolute error >0.005 (LC:EKVX, OV:SK-OV-3, LC:NCI-H460, CO:HT29, and LC:NCI-H23).
GRN
The inference of GRN using these 113 genes was performed using the LASSO method to analyze regulatory relationships, as shown in Figure S2. Twenty-four genes with linkages >10 were selected as “hub” genes. Thus, these 24 “hub” genes can directly regulate the expression of at least 10 other genes. The GRN of the 24 “hub” genes is shown in Figure 3 and further details are listed in Table 2.

Gene regulatory network among 24 “Hub” genes. The color of a gene (circle or triangle nodes) matches the color of its arrows to identify regulatory relationships between these genes more efficiently. There are two types of arrows: sharp arrows indicate the promotion of expression and blunt arrows mean the inhibition of expression. In addition, there are 12 genes (triangular nodes) whose CNV have a significant promoting effect on their respective expression process. They are
Details of 24 “hub” genes
Half of the 24 “hub” genes uncovered by GRN inference showed strong correlations between their own GE and CNV values (Figure 3). Corresponding Pearson correlation coefficients between GEs and CNVs are listed in Table S3, of which 12 out of 24 genes are >0.5.
Discussion
Variations in GE are thought to be the main underlying factors for different radiation responses, and several studies have attempted to correlate the relationship between radiation response and GE. Because ME and CNV are two of the main factors regulating GE, it is very important to take them into consideration.
The integration of multi-genomics data can compensate for the noise impact of mono-genomics data. This concept has been widely used in clustering and cancer subtype classification. For example, Hoadley et al 27 used data on chromosome arm-level aneuploidy, DNA hypermethylation, mRNA and microRNA expression levels, and reverse-phase protein arrays to conduct comprehensive integrative molecular analyses of the Pan-Cancer Atlas to reclassify human tumors and provide future directions for exploring clinical prognosis in cancer treatment.
Additionally, because SF2 values range from 0 to 1 continuously, it is reasonable to predict the SF2 values of samples rather than to roughly classify them into sensitive or resistant groups. In this study, for the first time, we applied the concept of multi-genomic data integration analysis in regression issues using radiosensitivity prediction as an example.
In theory, all types of data can be processed simultaneously using a single integrated regression model. Indeed, ME values of more than 480,000 probes can be collected using the Illumina Infinium Human Methylation 450K BeadChip. However, available ME data for NCI-60 consists of fewer than 20,000 variables, representing the loss of 96% of useful ME information. Because raw ME data of the NCI-60 platform are missing and available preprocessed ME data are inadequate, the prediction performance is much worse with than without ME data (see Figure S1b). This was apparent from our prediction RMSE which was smaller using only GE and CNV data than using all GE, CNV, and ME data (0.094
In our previous work (Zhang et al, 2014), we built a nonlinear SF2 prediction model for the NCI-60 panel using only GE data of 19,162 genes.
28
The RMSE value was as low as only 0.011. To test whether multi-genomic data model could identify more essential signature genes, we herein used the same nonlinear method, SVM, to train a nonlinear SF2 model. The comparison results are summarized in Table 3. Clearly, regardless of whether a linear or nonlinear model is used, our RMSE values are notably smaller than those of Zhang et al (0.094
RMSE comparison of different models.
Previous work by Torres-Roca et al used the expression values of selected genes to predict SF2 values of NCI-60 cell lines (RMSE=0.20), then at a later date simplified this to a 10-hub-gene model to predict 12 independent cell lines (RMSE=0.38). 3,4 Limited by the techniques available at the time, however, only GE data and some of the 60 cell lines were used. Additionally, two different types of cross-validations were employed to train the model. A comparison of these studies and our own would be unfair, so we instead compared the results obtained from fused multiple genomic data with those obtained from mono-genomic data. Correspondingly, RMSEs obtained using only GE or CNV values of 113 signature genes and 24 “hub” genes are shown in Table 3. The RMSE obtained using fused multiple genomic data is the smallest, indicating these data should be used to achieve the highest prediction performance or closest relationship to SF2.
Because our data had only five CNV signatures among 113 genes, the predominance in the number of GE signatures resulted in the RMSE value of the GE-only model being just slightly worse than what was obtained using fused multiple genomic data. For the same reason, the RMSE value of the CNV-only model was much worse. The poor SF2 prediction performance of the “hub” genes is consistent with what we expected because they only number 24, which is one fifth the number of signature genes. Measured SF2 values and predicted values in different models for each cell line are shown in Figure 4.

The measured and predicted SF2s of the 60 cell lines obtained using current signature genes and other published models. The measured and predicted SF2s of the 60 cell lines obtained using current signature genes and other published models.
Overfitting is almost unavoidable for cases with overwhelming high variable dimensions but small sample sizes, and our case is typical. Therefore, we attempted to overcome this in the present study as follows: 1) rather than the attempting the leave-one-out method used widely in small-sample cases, we employed 10 6-fold cross-validations in the training process; 2) we used the PLS method to reduce the original high variable dimension to a much lower LV dimension; 3) two types of UVE were processed iteratively to remove uninformative genes step by step rather than removing them in one step; and 4) we used LASSO to infer the GRN of identified signature genes because this is good at overcoming the overfitting problem in high-dimension small-sample cases.
Despite these methods, the number of samples remained extremely small compared with the number of variables, so we could not ensure that the overfitting problem was completely overcome. Therefore, molecular function analysis and pathway analysis were performed to verify the essentiality of identified signatures.
From our results, we can identify the absolute values of the regression coefficients of five CNV signatures (
Several studies have found that some cellular functions enriched by 113 genes are closely associated with radiosensitivity
29,30
or cancers such as breast cancer, lung cancer, bladder cancer, and leukemia.
31
-37
. For example,
According to our KEGG and GO analysis of all 113 genes shown in Figures S3 and S4,
Conclusion
On the basis of the advantages of integrating multiple genomic data, we proposed a novel multiple genomic data fused partial least squares deep regression method (MGPLS), which we used to identify 113 signature genes closely related to radiosensitivity. We further inferred the GRN using GE and CNV data belonging to these signature genes. The joint regression method we propose provides a unified framework to analyze large-scale cancer genomic data. These findings provide a reliable quantitative reference for optimizing “personalized” treatment options, and might aid our understanding of cancer mechanisms.
Supplemental Material
Clean_Revised_Supplementary_material_(1) - A multiple genomic data fused SF2 prediction model, signature identification, and gene regulatory network inference for personalized radiotherapy
Clean_Revised_Supplementary_material_(1) for A multiple genomic data fused SF2 prediction model, signature identification, and gene regulatory network inference for personalized radiotherapy by Qi-en He, Yi-fan Tong, Zhou Ye, Li-xia Gao, Yi-zhi Zhang, Ling Wang and Kai Song in Technology in Cancer Research & Treatment
Footnotes
Authors’ Notes
Declaration of Conflicting Interests
Funding
Supplemental Material
Abbreviations
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
