Introduction
Genome-wide DNA copy number (CN) data are an essential component of integrative cancer genome analyses aimed at identifying dysregulated pathways in cancer. 1 Regions and genes of interest in CN data have primarily been identified as consensus regions of alteration, with statistical support from tools such as Genomic Identification of Significant Targets in Cancer (GISTIC), which identify individual regions of recurrent CN alteration. 2 While useful, this approach neither addresses the impact of co-associations between distant genetic loci nor visualizes complex interactions clearly.
In contrast, Hi-C methodology maps physical interactions between chromosomes at specific loci, allowing the derivation of a matrix of chromosomal interactions of great utility in studies of 3-dimensional chromatin structure. 3 Visualizations of such a matrix can show hot spots of interactions between regions, whereas edge-node graphs and CIRCOS plots can become cluttered and nearly uninterpretable. Matrices can display more interactions per unit area in a clear fashion, and the matrix display of interactions shows interaction domains with row and column order preserved in the matrix. 4
Merging cancer CN data with a matrix-mapping approach similar to Hi-C analysis, we have developed methods to analyze and interactively display genome-wide interactions from a CN data set, with each matrix value representing the strength of the interaction between loci. Two current trends encourage investigations that benefit from this type of interaction data. One is the growing understanding of the topology and specificity of nuclear chromosome territories, and the other is the increasingly widespread use of whole genome sequencing in cancer genomics, which allows unprecedented precision in mapping structural variants and local CN. Particularly in cancers with extensive genome rearrangements, there is an unmet need for tools that facilitate the discovery of genomic aberrations that depend on aspects of higher order nuclear organization. Our goal has been to develop a method that precomputes and visualizes signed correlations between any 2 points in the genome using binned segmented CN values from a large set of cancer samples. We found that a recursive linear regression algorithm produces visually intuitive, interpretable results that are consistent with known aspects of chromosome structure and genome rearrangement and that can also rapidly reveal novel features.
Here, we report this new methodology, R package, and a suite of Web-based tools accessible to the scientific community for the exploration of complex CN data sets to generate hypotheses connecting CN phenomena and their underlying chromosome aberrations to the pathogenesis of various cancers. The package can also accept The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research to Generate Effective Treatments (TARGET) data, allowing for analysis of a broad range of cancers using existing large cohort studies. 5 We have also provided guidance for the importation of unpublished user data. The R package is highly accessible, combining vignettes, documentation, examples, and an animated tutorial.
As an example of the application of CNVScope, we chose a sufficiently large publicly available neuroblastoma (NBL) data set. This aggressive childhood cancer has known features, notably the pattern of clustered chromosomal breakpoints in the
Methods
Input matrix using NBL data set
From the GDC legacy archive, an NBL data set of 126 samples was obtained (see vignette). Data were aggregated into a binned sample matrix with 1 Mb bins. Each bin value corresponded to the average segmentation value for TCGA or mean relative coverage for TARGET for segments that overlapped the bin in that specific sample. Row names signified bin genomic position, whereas column names represented the sample identifiers. This input matrix was then used as the basis for the matrix of log
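The binning step described above can be sketched in Python as follows. This is a minimal illustration, not the CNVScope R implementation: it assumes each sample is supplied as a list of (start, end, value) segments on a single chromosome, that a bin's value is the unweighted mean of the values of all segments overlapping it, and that row names take a hypothetical chrom_start_end form. All function and variable names are invented for the example.

```python
BIN_SIZE = 1_000_000  # 1 Mb bins, as in the text


def bin_segments(segments, chrom_length, bin_size=BIN_SIZE):
    """Average the values of all segments overlapping each fixed-width bin.

    segments: list of (start, end, value) with half-open [start, end) coordinates.
    Returns one value per bin; bins with no overlapping segment are None.
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for start, end, value in segments:
        first = start // bin_size
        last = min((end - 1) // bin_size, n_bins - 1)
        for b in range(first, last + 1):
            sums[b] += value
            counts[b] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


def build_input_matrix(samples, chrom, chrom_length, bin_size=BIN_SIZE):
    """Return (row_names, column_names, matrix): bins as rows, samples as columns."""
    col_names = sorted(samples)
    columns = [bin_segments(samples[s], chrom_length, bin_size) for s in col_names]
    n_bins = len(columns[0]) if columns else 0
    # Row-name format is an assumption made for this sketch.
    row_names = [f"{chrom}_{b * bin_size + 1}_{(b + 1) * bin_size}" for b in range(n_bins)]
    matrix = [[col[b] for col in columns] for b in range(n_bins)]
    return row_names, col_names, matrix


# Toy data: two hypothetical samples on a 3 Mb chromosome.
toy = {
    "sample_A": [(0, 1_500_000, 0.2), (1_500_000, 3_000_000, -0.4)],
    "sample_B": [(0, 3_000_000, 0.0)],
}
rows, cols, mat = build_input_matrix(toy, "chr20", 3_000_000)
```

In the toy data, the middle bin of sample_A is overlapped by both segments, so its value is the mean of the two. A length-weighted mean would be an equally defensible choice, and the text does not say which one the package uses.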
Linear regression, postprocessing, and matrix set formation
From this input matrix, a matrix of negative ln

Workflow from GDC TARGET neuroblastoma CN data to finalized interchromosomal matrices used in the shiny application. Files are converted from GDC tab-delimited files with varying bin sizes into an input matrix of even 1 Mb bins and sample identifiers, and then into relationship metrics from linear regression (the negative log
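As a rough illustration of the relationship metric named in this section (a negative log P value carrying the sign of the correlation), the Python sketch below computes one signed value per pair of bins. It rests on several assumptions and is not the package's algorithm: the P value comes from a Fisher z approximation to the test of a Pearson correlation, standing in for the linear regression P value; the natural logarithm is used; and all function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant bin carries no association signal
    return sxy / math.sqrt(sxx * syy)


def signed_neg_log_p(x, y):
    """sign(r) * -ln(p), with p from the Fisher z approximation (an assumption)."""
    n = len(x)
    r = pearson_r(x, y)
    if n < 4 or r == 0.0:
        return 0.0
    r = max(min(r, 1 - 1e-12), -1 + 1e-12)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    p = max(p, 1e-300)  # avoid log(0) for very strong associations
    return math.copysign(-math.log(p), r)


def relationship_matrix(bin_rows):
    """bin_rows: one list of per-sample CN values per bin. Returns a symmetric matrix."""
    k = len(bin_rows)
    out = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i, k):
            out[i][j] = out[j][i] = signed_neg_log_p(bin_rows[i], bin_rows[j])
    return out


# Toy input: three bins measured in eight hypothetical samples.
bin_a = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
bin_b = [-0.1, -0.25, -0.2, -0.45, -0.5, -0.55, -0.75, -0.8]  # falls as bin_a rises
bin_c = [0.15, 0.1, 0.35, 0.3, 0.55, 0.5, 0.75, 0.7]  # rises with bin_a
rel = relationship_matrix([bin_a, bin_b, bin_c])
```

Here the bin_a/bin_b entry is negative and the bin_a/bin_c entry is positive, mirroring the anticorrelated and correlated blocks that the matrix view is meant to expose. Computing all pairs of 1 Mb bins across a genome is quadratic in the number of bins, which is consistent with the package vignettes describing this step as run on a high-performance computing system.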
Features
The CNVScope app allows the user to quickly identify hot spots and large features in a chromosomal interaction plot and provides a clear view of the samples contributing to each value in the matrix. Genes and expression transcript levels are identified at every combination of genomic loci. COSMIC census genes are also noted. The matrix data can also be explored using a gene search tool that provides coordinates based on ensembl-75 (hg19). With coordinates specified, users can then plot a view zoomed directly on their location pair of interest.
Controls
The application features a gene search tool to get exact gene positions, a saturation threshold slider to control the effect of outlier pixels, a heatmap height slider, dropdowns for chromosomes, and a plot button. We have provided the NBL data in complete form along with several clinical subsets. The users are also given a choice of relationship metric correlation sign*-log(
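One plausible reading of the saturation threshold control is symmetric clipping of matrix values before they are mapped to colors, so that a few extreme pixels do not compress the rest of the color scale. The Python sketch below illustrates that idea only; the function name is hypothetical, and the app's actual handling of the threshold may differ.

```python
def saturate(matrix, threshold):
    """Clip every value to [-threshold, threshold] before color mapping."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    return [[max(-threshold, min(threshold, v)) for v in row] for row in matrix]


# Toy signed relationship matrix with large diagonal values acting as outliers.
raw = [
    [250.0, -3.2, 0.5],
    [-3.2, 180.0, -40.0],
    [0.5, -40.0, 300.0],
]
clipped = saturate(raw, 30.0)
```

After clipping at 30, the diagonal and the strong negative entry sit at the ends of the color range, while the weaker off-diagonal values keep their original magnitudes and remain distinguishable.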
Package vignettes
The package vignettes detail the process to import GDC data with images of the requisite steps, perform the relationship mapping using a high-performance computing system, postprocess the matrix, and briefly visualize results. A power analysis vignette is also provided, which suggests a minimum sample size of 108 individual CN cases. The toolkit has also been demonstrated on several other cancer data sets, including bladder cancer, prostate cancer, acute myeloid leukemia, and melanoma. A brief demonstration of the toolkit on these data sets is provided within the GitHub package.
Specific Observations in NBL
To understand the information carried by the CNVScope main plot, it is useful to examine the whole genome plot arising from the 126-sample NBL data set (Figure 2A). The strong correlation signal (red) on the diagonal represents the high probability of CN correlation between adjacent segments, related to their chromosome topology. Note that the signal is not confined to the geometric diagonal but extends a variable distance from that line. The simplest example is the X chromosome, which, other than the small, clearly delineated discontinuities of the pseudoautosomal regions, appears as a rather uniform block because each sample originated from either an XX or an XY genotype. Most other chromosomes exhibit a more complex pattern, with principal blocks often demarcated at the centromeres, consistent with the known high frequency of whole chromosome arm rearrangements in cancer. For example, on the chr20 × chr20 plot, independent correlation blocks exist for the p and q arms, with breaks in correlation at the centromere. The 20p arm correlation block ends at 26 to 27 Mb, and the q arm block begins at 29 to 30 Mb, with jointseg calling boundaries at these loci, corresponding to the centromeric ends of the alignable sequence for each arm. Remarkably, CNVScope allows these boundaries to be readily discerned against a background of high correlation for the entirety of chr20, with only 15 of 126 (11.9%) samples showing chr20 arm-specific CN aberrations. On the other chromosomes, local decreases in

Figure 2. (A) A whole genome interaction view of neuroblastoma copy number (CN) associations (chr1-X). Boxed regions highlight chr2 (enlarged in Figure 3), chr11, and the negatively signed off-diagonal association of 11q and 17q. (B) The enlarged chr11-chr17 map illustrates the strong anticorrelated regions of 11q-17q. The lowest correlation point is highlighted (r = −0.482, Benjamini-Hochberg adjusted

Figure 3. (A) Intrachromosomal association plot for chromosome 2. The box highlights a distinct feature on the diagonal indicating narrowing of the region of local co-association, and white lines emanating from that region show a reduction in association from the MYCN locus across all loci on the chromosome. (B) Enlarged view of the
The G1/S-phase cyclin

In the whole genome plot, many off-diagonal regions show significant signal. In particular, 17q and 11q show strongly anticorrelated regions, visible in both the whole genome and the interchromosomal views as a large block (blue) (Figure 2B). Histograms confirm that these regions have 2 distinct, well-separated distributions, and a linear regression view of a single sample makes clear the downward trend driving the color coding (dark blue) displayed in the interchromosomal view (Figure 2C and D). These features allow the user to drill down from the relatively abstract view of the main plot to the detailed underlying data and appreciate that the genetic phenomenon flagged in CNVScope is the significant co-occurrence of 11q loss and 17q gain. This phenomenon has been previously reported in NBL.8,9 Remarkably, the anticorrelated portion of 11q is bounded by a jointseg edge at 71 to 72 Mb, indicating that 11q loss consistently begins telomeric to
Many studies have examined single gene-gene associations for CN amplifications and deletions, but CNVScope does this on a whole genome scale, providing a survey view and rapid access to significant associations while also allowing access to the primary data, gene annotations, and other data types that a user might wish to integrate with CN data.10-12 In conclusion, we have described the methodology, the detailed features, and the potential of CNVScope to highlight significant genomic events such as those we have described in NBL. We invite others to explore the regions and hot spots which may be related to functionally important aspects of NBL genome biology and to use CNVScope to explore other cancer genomics projects with available CN data.
Limitations
It is important to point out that CNVScope resolution is ultimately limited by the probe density of the input data and the bin size selected. A 1-Mb bin size was chosen for the visualization tool to allow swift and stable function of the application. We feel that this is a reasonable compromise between resolution and computational limitations. It is also consistent with many existing data sets. We also note that the toolkit allows for the use of custom data to generate the relationship matrix should users with sufficiently high-resolution data wish to create an extremely high-resolution view of a selected region. Smaller and larger bin sizes have been tested on the NBL data set (0.1 and 10 Mb). Both the function and the commands for this have been listed in the input matrix vignette. The main focus of this work is to facilitate the rapid analysis of CN associations in integrative cancer genomics studies through the visualization of a precomputed association matrix.
Supplemental Material
Supplemental material, Supplementary_material for CNVScope: Visually Exploring Copy Number Aberrations in Cancer Genomes by James LT Dalgleish, Yonghong Wang, Jack Zhu and Paul S Meltzer in Cancer Informatics
