Abstract
Keywords
Introduction
Copy number variants (CNVs) are chromosomal aberrations that result in an abnormal number of copies of specific DNA segments in comparison with a reference genome. Studies have reported that as much as 12% of the human genome varies in copy number. 1 It is believed that some CNVs have no obvious phenotypic consequence or are merely related to normal phenotypic variations, while others may be related to genomic disorders and susceptibility to disease. For example, the amplification of a DNA segment in a gene that promotes cell replication may cause the cell to begin dividing excessively, as usually happens in cancer cells.
The challenge of detecting CNVs has received a lot of attention, and several methods have been developed to infer CNVs from high-throughput array-based technologies, such as comparative genomic hybridization (CGH) and single nucleotide polymorphism arrays. These methods mostly rely on hidden Markov models (HMMs)2,3 and circular binary segmentation. 4 Another question of interest is the identification of CNVs associated with biological functions and complex human diseases. Procedures commonly used include univariate tests or simple linear regression models, with multiple testing correction, to relate the normalized intensity measurements to the outcomes of interest.3,5 A stochastic partitioning method for a multivariate model has been recently developed. 6 The model identifies sets of correlated gene expression levels and sets of chromosomal aberrations that jointly affect mRNA transcript abundances. A disadvantage of all such methods is that they do not infer copy number states. Indeed, the high noise level in the raw signal intensities may lead to the identification of a large number of false positives (FPs). 7 An approach widely used to address this problem is to perform the analysis in two steps, first by estimating the copy number states and then using those as the true states in a subsequent association analysis. However, using the estimated copy numbers as if they were the true states ignores the uncertainty in the estimation process and can introduce bias. Some methods that incorporate the uncertainty in copy number estimation into the association analysis have been proposed.8,9
Here, we consider a Bayesian hierarchical model that handles CNV detection and association analysis in a unified manner, by integrating array CGH and gene expression data collected on the same set of subjects. The framework takes advantage of a recently proposed measurement error model 10 that relates the gene expression levels to latent copy number states. In turn, the latent states are related to the observed surrogate CGH measurements via an HMM. The model incorporates a variable selection procedure with a prior distribution on the latent selection indicator that exploits dependencies across adjacent DNA segments. In this study, we investigate an alternative formulation of the spatially dependent variable selection prior, that is the basis of the measurement error model in Ref. 10, and show that it allows for increased flexibility, remarkably easy interpretation of the key parameters and major performance improvements. More specifically, the selection prior that we propose herein is based on a latent probit link; therefore, it can easily accommodate additional available covariate information to improve detection of significant associations. Model fitting and posterior inference are accomplished via Markov chain Monte Carlo (MCMC) stochastic search techniques. We explore the performance of this model in simulations and demonstrate an overall better performance of the model with the newly proposed prior. We also show an application to data from a genomic study on lung squamous cell carcinoma, where we identify potential candidates of associations between CNVs and the transcriptional activity of target genes. GO analyses of our findings reveal enrichments in genes that code for proteins involved in cancer.
The rest of the article is organized as follows: in Section 2, we introduce the integrative Bayesian model and its major components. In Section 3, we report the results from a simulation study and the case study. We conclude with some remarks in Section 4.
Methods
This section is organized as follows. In Section 2.1, we review the integrative framework that we follow in the manuscript. In Section 2.2, we introduce an improved prior model for gene–CGH associations, and in Section 2.3, we describe the model for analyzing copy number aberrations.
Integrative Bayesian hierarchical model
A Bayesian hierarchical model that integrates gene expression levels with CNVs has been recently proposed.
10
The model provides a unified approach for simultaneously inferring copy number states for all samples and identifying associations between sets of gene expression levels and copy number states. Let
ξim = 1 for copy number losses;
ξim = 2 for copy neutral states;
ξim = 3 for a single copy gain;
ξim = 4 for multiple copy gains.
We assume that, given the latent states, the observed CGH measurements contain no additional information on the observed gene expression levels. Furthermore, we assume independence of the gene expression measurements, conditional on the copy number states, and independence of the CGHs, given their latent states. Hence, we factorize the likelihood into two components as
where one component captures the latent structure underlying the CGH intensities and the other component models the association between the resulting copy number states and the gene expression levels. Such joint modeling reduces the bias that arises when the uncertainty in the CNV estimation process is ignored (ie, copy number calls are used as if they were the true states), by allowing for the simultaneous inference of CNVs and their association with gene expression. 10
Modeling the association between gene expression and CNVs
The model on
where μ1, …, μ
In model (2), we find, for each gene, a parsimonious set of CGH aberrations that most likely affect the gene expression levels. This can be seen as a variable selection problem. Let
with δ0(·) being a point mass at zero.12–15 The prior model is completed with conjugate distributions on the error precision,
A key feature in the variable selection construction above is the prior distribution on the latent selection indicator
with
where Φ indicates the c.d.f. of a standard normal distribution, and
with α0 and α1 being hyperparameters to be set. From equations (4) and (5), some major features of the novel prior can be recognized. First of all, the probability of selection at location
Modeling copy number aberrations in CGH data via HMM
The model on the CGH data in (1) is defined in terms of the emission probabilities of an underlying HMM. This choice is supported by the typically persistent state observed in copy number data, meaning that copy number losses or gains at a region are often associated with an increased probability of gains and losses at neighboring regions.18–20 We use a four-state HMM and assume that, conditional on the latent states, the CGH intensities are independent and normally distributed, with state-specific means and variances as
where η
with
Posterior inference
For posterior inference, we rely on an MCMC stochastic search algorithm.
10
Our primary interest lies in the estimation of the association matrix
Update
Update ξ using a Metropolis–Hastings algorithm by randomly choosing a column and proposing new states for a subset of its elements using the current values of the transition matrix.
Update the emission distribution parameters, η
Update the transition probability matrix,
Metropolis–Hastings stochastic search algorithms of this type have been used extensively in the Bayesian variable selection literature.10–15 The update on
Given the output of the MCMC, for each element of
Applications
Simulation study
In this section, we assess the performance of our model on simulated data. For comparison purposes, we follow the simulation scheme in Ref. 10, which reflects the understanding that single copy number aberrations typically affect segments of DNA, and that neighboring chromosomal locations are expected to share similar copy number states. In addition, transitions to the normal diploid state are more likely than transitions between different states of copy number aberration. Accordingly,
we set
we initialize the matrix ξ with all elements set to 2;
we select
we randomly select half of the remaining columns and, for each column, generate 10% of its positions according to the above transition matrix;
we generate the elements of matrix
we obtain the matrix of true associations,
we generate the non-zero regression coefficients as β ~
finally, we generate the gene expression levels as
For setting the hyperparameters, we follow the general guidelines in similar regression models for the specification of the priors on the parameters μ
We begin the analysis of the simulation results by focusing on the inference on
Simulated data: Results on specificity, sensitivity, number of false positives and false negatives, MCC, Bayesian
The results show that values of α1 in the range [1, 1.5] lead to excellent performances, compared with both the independent prior (ie, α1 = 0) and higher values of α1, particularly for the larger σε value. Overall, our results are consistently better than those obtained with competing models. 10 This can be seen by the low number of FP detections for most of the parameter values in the range considered. Also, for some values of α1, our prior (4) and (5) achieves perfect classification, with specificity and sensitivity equal to 1. To investigate the effect of the threshold on the PPIs on the selection results, in Figure 1, we report receiver operating characteristic (ROC)-type curves that display the FP counts versus the FN counts, calculated at a grid of equispaced thresholds in the interval [0.07, 1]. The plots clearly show that our results are satisfactory across different thresholds. They also highlight the consistently worse performance of the independent prior.

Simulated data: The false positive (FP) and false negative (FN) counts obtained by considering different thresholds on the marginal probabilities, for (A) σε = 0.1 and (B) σε = 0.5. Threshold values are calculated as a grid of equispaced points in the range [0.07, 1].
As for the inference on ξ, Table 2 reports the misclassification counts and corresponding percent rates obtained by considering, for each element of ξ the modal state attained at that genomic location over all MCMC iterations (after burn-in). The performances are consistently good. For the case of α1 = 0 and σε = 0.1, our estimated means and standard deviations were
Simulated data: Results on ξ as the number of misclassified copy number states for various values of α1.
Lung cancer study
We applied our Bayesian model to data from a study of lung squamous cell carcinoma, which we obtained from The Cancer Genome Atlas data portal (https://tcga-data.nci.nih.gov/tcga/). We used the level 2 (normalized signals) Agilent 415K array as the CGH data, and the Affymetrix HG-U133A array as the gene expression levels. We performed our analysis on the 131 samples that were available for both data types. We considered CGH probes belonging to chromosome 3, as it has been highly implicated in lung squamous cell carcinoma.23,24 We further reduced the complexity of the data by filtering out genes and CGH probes that had a relatively small coefficient of variation (smaller than 1.9 and 0.35, in absolute value, for genes and CGH probes, respectively). The resulting data consisted of
We ran our model using a setting similar to that adopted in the simulated example described in Section 3.1. The results we report below were obtained by setting α1 = 1 and α0 = 2.32 and by running the MCMC sampler with 500,000 iterations and a burn-in of 250,000. Figure 2 shows a heat-map of the highest PPIs of gene–CNV associations corresponding to the elements of the association matrix

Case study: Heatmap of the highest PPIs of gene–CNV associations, selected by looking at the elements of the association matrix

Case study: Frequencies of estimated gains and losses among the 131 samples for the 2,133 CGH probes considered in our analysis. Red horizontal lines correspond to the 0.25 and 0.45 thresholds on the frequencies of deletion and amplification, respectively.
Our findings identify potential candidates of associations between CNVs and the transcriptional activity of target genes. In order to assess whether the identified associations have biological relevance, we performed GO analyses on the lists of selected target genes and CGH probes, by using the database for annotation visualization and integrated discovery (DAVID) tool. 35 We report the detailed results of the analyses in the Supplementary Material. Figure 4 shows some of the results from the enrichment analysis of the list of selected target genes. More specifically, the upper box of the figure (labeled mRNA) reports the four most relevant molecular functions, together with the corresponding lists of target genes. In the lower box (labeled DNA), we report the lists of CGH probes that our model found to be associated with the target genes. The estimated associations between target genes and CNVs are marked by solid lines; whereas probes appearing in multiple lists are indicated by dashed lines. In Figure 5, we report similar summaries from the gene enrichment analysis of the selected CGH probes. Specifically, in this figure, the upper box shows the molecular functions enriched in the list of CGH probes, and the lower box reports the list of target genes that our model found to be associated with the CGH probes.

Case study: Schematic representation of a GO analysis of the target genes identified by our model, via thresholding the posterior probabilities of inclusion. The upper box (labeled mRNA) shows the four most enriched molecular functions together with the corresponding lists of target genes. The lower box (labeled DNA) reports the lists of CGH probes that our model found to be associated with the target genes.

Case study: Schematic representation of a GO analysis of the CGH probes identified by our model, via thresholding the posterior probabilities of inclusion. The upper box (labeled DNA) shows the four most enriched molecular functions together with the corresponding lists of CGH probes. The lower box (labeled mRNA) reports the lists of target genes that our model found to be associated with the CGH probes.
The results from the GO analyses highlight the enrichment of genes that code for proteins with binding function, cell surface binding, or an extracellular matrix constituent in the selected target genes (Fig. 4), and the enrichment of genes that code for proteins in the signal transduction machinery, mainly with kinase activity, in the selected CGH probes (Fig. 5). In both cases, we identified genes as members of the ephrin family or NTRK, which have been shown to be altered in another study on lung adenocarcinoma. 36 Ephrin receptors have been shown to have an important role in tumor growth and progression in many cancers, including lung carcinoma. 37 Another relevant protein from the GO analyses is PIK3CB, phosphatidylinositol-4,5-bisphosphate 3-kinase. The PI3K/AKT1 pathway has been shown to be altered in many cancer types, and often correlates with a more aggressive form of disease.38–41 We also found proteins of the matrix metalloproteinase family, which are often involved in the induction and promotion of cancer cell migration (MMP10 and ADAM23).42,43 Combining this observation with the finding of alterations in genes that code for members of the fibrinogen family (FGA, FGB, and FGG), and in genes that code for proteins with surface- and matrix-binding properties, we may hypothesize a dysregulation of pathways involved in the acquisition of a migratory phenotype. Extracellular matrix remodeling plays an important role in cancer progression since it can facilitate the migration and invasion of tumor cells. The genes we found to be altered may play an important role in this context. Such a hypothesis is interesting, but will require further experimental investigation. Similar findings exist in the general literature on lung cancer in both human and mouse studies.44–49 Finally, many of the genes we identified have been reported in the literature on lung cancer, for example ASCL1, HLA-DQA1, and PROM1 among the gene expression probes, and EPAH3, PRKCI, and EPHB1 among the identified CGH probes.50–54
Conclusions
In this study, we have considered a recently developed Bayesian hierarchical framework for the integration of gene expression levels with CGH array measurements, collected on the same subjects. The proposed measurement error model relates the gene expression levels to latent copy number states which, in turn, are related to the observed surrogate CGH measurements via an HMM. We have investigated an alternative formulation of the spatial variable selection prior for the gene–CGH associations. Our prior exploits dependencies across adjacent DNA segments and allows for increased modeling flexibility, which has been shown to result in easy interpretable model parameters for the purpose of prior elicitation, as well as improved performances and false discovery control on simulated data. Our HMM model considers four copy number aberration states, as commonly encountered in the literature.19,55,56 Once the HMM states are appropriately defined, our model can easily accommodate an additional state for the loss of both copies.3,18
We have presented an application to data from a genomic study on lung squamous cell carcinoma. Our model has identified potential candidates of associations between CNVs and the transcriptional activity of target genes. We have assessed the biological relevance of our findings through GO analyses. These have revealed enrichments in genes that code for proteins involved in cancer, such as those of the ephrin family, phosphatidylinositol-4,5-bisphosphate 3-kinase and matrix metalloproteinase family. Among these, some are already known to be involved in lung squamous cell carcinoma, while others are interesting potential candidates for further experimental validation.
The approach we present can be extended to the analysis of RNA-Seq gene expression values. In order to appropriately take into account the nature of such data, the priors and the algorithm for posterior inference will need to be modified to accommodate the count data and a Poisson regression model. This represents an interesting avenue for future work.
Author Contributions
Conceived and designed the experiments: No experiments. Analyzed the data: AC, MG, MV. Wrote the first draft of the manuscript: AC, MG, MV. Contributed to the writing of the manuscript: AC, MG, MV. Agree with manuscript results and conclusions: AC, MG, MV. Jointly developed the structure and arguments for the paper: AC, MG, MV. Made critical revisions and approved final version: AC, MG, MV. All authors reviewed and approved of the final manuscript.
Supplementary Data
GO Analyses
The supplementary material shows details on the GO analyses described in the paper.
