Abstract
Introduction
Copy number variants (CNVs) are defined as DNA structural variants that result in either gain or loss of a chromosomal region, which can cause abnormal biological functions in the cell. In cancer, somatic copy number aberrations are frequently observed. 1 The amplification of oncogenes or the deletion of tumor suppressor genes identified in tumor cells can be used for understanding the progression of the disease and for predicting drug sensitivity. In the absence of a matched germline specimen, the identified alterations in a tumor sample cannot be classified as somatic, only the copy number state differs from normal. In the past decade, traditional genome-wide CNV detection methods used single nucleotide polymorphism (SNP) array and array comparative genome hybridization (aCGH),2–5 both of which required high-density probes and large sample sizes to detect small variants (less than 100 bps). Recently, whole exome sequencing (WES) was developed to target exonic regions for sequencing. This technology is primarily used for identifying single nucleotide variants and short indels but has been observed to be able to identify CNVs within the target regions as well. A recent study evaluated the performance of four well-known WES-based CNV detection tools and showed that none of the exome-based CNV detection methods can perform well in all situations. The authors provided a comprehensive and objective comparison to assist researchers in choosing the most suitable tools of WES data for their research needs. 6
In this study, we developed an algorithm RefCNV, which is the gene-based CNV detection for WES data. Instead of using a matched control, we use a collection of normal well-characterized controls as references to build a whole exome-based linear regression model to predict normal coverages using library sizes (total mapped reads) as a predictor and summarize the CNV predictions (deletion, normal, and amplification) at each exon within a gene to report gene-level CNVs. We compared CNVs of three genes (
Materials and Methods
Cell Culture and DNA Extraction
The DNA sample of HapMap cell line NA12878 was purchased from Coriell Institute for Medical Research, and all tumor cell lines were purchased from American Type Culture Collection. Cells were cultured using vendor-recommended conditions. DNA from freshly frozen cell pellets was extracted using the AllPrep DNA/RNA mini kit (Qiagen) according to the vendor instruction manual. DNA samples were quantitated by Qubit (ThermoFisher).
Digital PCR
Digital PCR (dPCR) assays were performed using a QX200 Droplet Digital PCR System (Bio-Rad). dPCR reactions consisted of 2x dPCR Supermix for Probes (no dUTP) (Bio-Rad), 900 nM final concentration target gene forward primer, 900 nM final concentration target gene reverse primer, 250 nM target gene probe, 20x Taqman human RNase P (
Whole Exome Sequencing
A total of 500 ng of genomic DNA was sheared to 150–200 bp by Covaris E220 sonication (Covaris). After AMPure XP beads (Beckman Coulter) cleanup, the samples were checked for correct size distribution using 2100 Bioanalyzer system (Agilent). For manual library preparation, the fragmented genomic DNA samples were processed with end-repair, dA addition, ligation of sequencing adaptors, and two rounds of six cycles preamplification using SureSelect XT Target Enrichment System for Illumina Paired-End Sequencing library construction kit (Agilent). Next, 750 ng of amplified DNA was hybridized with a biotinylated RNA bait set (SureSelectXT Human All Exon V5; Agilent) at 65 °C for 24 hours. The captured genomic DNA fragments were enriched by DynalMyOne Streptavidin T1 beads (ThermoFisher) and amplified with barcoded index-attached primers for 12 cycles. The AMPure XP-purified libraries were checked for size distribution (300–400 bp) using Agilent Bioanalyzer and quantified using Library Quantification Kit (Kapa Biosystems). For robotic library preparation, the same conditions and procedures were applied by using Sciclone G3 NGS Work station (Perkin Elmer). A pooled library made by mixing two final libraries at equal molar ratio were clustered at 11 pM per flow cell lane using the Illumina cBot prior to sequencing on an Illumina HiSeq 2000 platform (Illumina). Sequencing reactions were run using 2 x 100 paired-end mode. Demultiplexed FASTQ files were generated with Casava v1.8.2 configureBclToFastq.pl (provided by Illumina) from the. bcl files. The multiple FASTQfiles generated by this script were concatenated and primer trimmed using the ea-utilsfastq-mcf tool with the following options: “-l 30 -q 10 -u -P 33” to remove Illumina PCR and sequencing primers from the sequences. The trimmed sequences were mapped to human genome hg19 reference sequence using the Burrows-Wheeler Aligner v0.6.2 aln and sample mode in default settings. 19 The resulting SAM files were converted to BAM format, sorted, deduplicated, and indexed using samtools and Picard. 19 We also applied samtools to calculate the coverage as the number of total reads mapped in each defined capture regions. In addition, we also applied the principle component analysis (PCA) to investigate the coverage distribution based on the matrix of number of coverages for all 221,749 capture regions from whole genome.
CNV Detection Method
For a given exon
In the model,
where
where
Number of References
We examined the effect of the number of reference samples on the stability of the CNV estimates. For seven robotic reference samples, we tested RefCNV using subsamples between the size of three to six references and calculated the MSRs of genes
Results
Coverage Distributions, Scaling Factors, and Slope between Two Library Preparation Methods
To build a whole exome-based linear regression model as reference, the NA12878 hapmap sample was chosen because it is a reference genome in Genome in a Bottle Consortium and has been well characterized for structural variants. 21 A total of 18 replicates of WES were performed by two different library preparation methods: manual or robotic (nine samples for each method). As shown in Figure 1, the scatter plot of the first two principal components of all 18 references of NA12878 computed on the coverage distribution and three distinct clusters were found. The coverage distributions were different between two library preparation methods of robotic and manual. In addition, two robotic samples were clustered together on the bottom with a unique coverage pattern because these two samples were run on a different sequencing machine from all the other samples. For CNV detection using this read depth-based method, it is important to choose the references with the same library preparation method and run on the same sequencing machine.

Scatter plot of first two principal components (PCs) from PCA of all replicated reference NA12878 prepared by two library preparation methods (red: robotic; black: manual).
For constructing the distribution of MSRs for all genes with more than two exons from whole genome, we separately fitted the linear regression model in each exon from seven samples prepared by robotic method (we removed two samples that ran on different sequencing machines) and nine samples prepared by manual method. Among all the 215,676 exons, 212,644 exons (∼98.6%) are less than 500 bps. For those exon regions less than 500 bp, Supplementary Figure 1 shows the scatter plots of
Constructing MSRs and Selecting the Thresholds for CNV Detection Method
For each exon, the standardized residuals of each replicate were predicted by the LOOCV method and separately analyzed by library preparation methods (manual and robotic). The MSRs were calculated by taking the median of all standardized residuals from all capture regions covered by a single gene. For gene-based results, we only select genes that have more than two exons (80.1% genes have more than two exons from whole genome in our data sets). The MSRs of all manual and robotic replicates are summarized in Supplementary Figures 2 and 3. Our proposed method shows expected normal coverage distribution and considers coverage variability between controls. Under the assumption that all genes have a copy number status of 2, we used 2.5% and 97.5% of all MSRs as thresholds for deletion (equal or less than one copy) and amplification (more than two copies) as predicting a new sample by setting the type I error rate α = 0.05 for CNV detection method. In summary, the 2.5% and 97.5% quantile MSRs are -1.93 and 2.06 for manual replicates and -1.34 and 3.38 for robotic replicates, respectively.

Scatter plot of median standardized residuals (MSRs) and digital PCR of 13 cell lines on genes

Correlation between MSRs and dPCR of 13 cell lines for genes
RefCNV Prediction vs Digital PCR Experimental Results
Copy number variation of
Copy number variation predicted by RefCNV and corresponding dPCR results.
Genome-Wide Copy Number Association with DNA Copy Number Reported by CCLE
We then performed a genome-wide comparison between RefCNV-estimated MSR values and copy number values measured by SNP array in 10 cancer cell lines from the CCLE. 15 A total of 15,613 genes with both available CNV data from both RefCNV and CCLE were used for this analysis. Since RefCNV cannot directly estimate the exact copy number, we used Spearman's correlation to test the gene copy number association between MSR values and copy number values reported by CCLE. The average Spearman's correlation among 10 cell lines from whole genome is 0.82. Supplementary Figure 4 shows the scatter plot of our MSR values and copy numbers from CCLE separated by cell line.
We then performed the same genome-wide correlation with CCLE data using gene-based CNV estimations from three other methods CONTRA, ExomeCNV, and cn.MOPS. The average Spearman's correlation values among 10 cancer cell lines from whole genome were 0.67, 0.57, and 0.39 using CONTRA, ExomeCNV, and cn.MOPS, respectively. There were only 11,127 genes (57.1% from whole genome) reported by CONTRA because
Correlation between MSRs and Copy Number Measured by dPCR Using Difference Number of Reference Sets
Figure 3 shows the Spearman's correlation between predicted MSRs using all subsamples of size three to six robotic references and the dPCR results of genes
Discussion
With the reduced cost of new sequencing technologies, WES has become more affordable and replaced other traditional array-based approaches such as SNP array or aCGH for CNV analysis. In this study we developed an algorithm RefCNV to estimate gene-based CNV using WES data from a set of normal reference controls. By assuming the linear relationship between library size and coverage in each exon, we used the number of mapped reads as a predictor to estimate the coverage in each exon with a linear regression model, and the MSRs were applied to summarize the results from exons into gene-based estimates. First, we fit the linear regression model of coverage regressed on the library sizes from replicated controls. Second, the standardized residuals of all exons from each replicate were calculated by the prediction of coverages using the LOOCV procedure. Third, we summarized the results from exon-based into gene-based by taking the median of the standardized residuals from all exons covered by a gene to construct the expected normal coverage distribution. Finally, the prediction of gene-based CNVs is as follows: deletion, normal (diploid), and gain were identified by 2.5% and 97.5% (α = 0.05) quantile of all MSRs from the LOOCV procedure from controls using the same library preparation and sequencing method.
In this study, we showed RefCNV has the following novelties and advantages. (1) It does not require matched controls. Instead, RefCNV requires a set of normal reference data set, which can model the technical variability in coverage between samples to construct the normal coverage distribution. (2) We demonstrated that RefCNV prediction became more stable (less variation) as the number of reference samples was increased (more than six reference samples). There are still some limitations to our approach. RefCNV considers only genes with more than two exons, which retains 80.1% genes from the whole genome, but we kept more genes than CONTRA (57.1%) by setting
We observed that a few of RefCNV calls were discordant from copy number measured by dPCR in Table 1. Especially, copy number of three genes measured by dPCR in HOP-92 (A vs 1.89 for
We have implemented our algorithm in an R script RefCNV, which is available at GitHub from https://github.com/lunching/RefCNV. The program can take output of coverage generated by bedtools 23 (http://bedtools.readthedocs.org/en/latest) as input. We provided background information and tutorial to facilitate researchers when using our algorithm to predict gene-based CNV estimates.
To summarize, RefCNV allowed the identification of gene-based CNVs and paved the way for selecting replicated controls for CNV study. The RefCNV algorithm is empirically based and allows for local adjustment for differences in sequence read coverage across genomic regions. The method is based on establishing a set of normal reference controls, and we indicated an important issue to consider when assembling a reference set, which can be a complementary tool to existing methods for CNV prediction.
Author Contributions
Supervised the whole project: EP. Analyzed the data: LCC, EP. Wrote the first draft of the manuscript: LCC. Agree with manuscript results and conclusions: LCC, BD, CJL, HS, CC, PM, EP. Jointly developed the structure and arguments for the paper: BD, CJL. Performed partial statistical analysis: HS. Made critical revisions and approved final version: EP, CJL. Performed all the bench work: BD, CJL, CC, PM. All authors reviewed and approved of the final manuscript.
