Abstract
Keywords
Introduction
Regulation of nucleosomal localization is a complex process.1,2 To facilitate transcription, replication, and DNA repair, nucleosomes slide short distances and affect local nucleosomal reorganization. 3 The stabilization of sliding nucleosomes has been proposed to depend on the DNA sequence and may be affected by the histone modification profile of the nucleosomes.4,5 The density of nucleosomes in the vicinity may also determine the choices of DNA sequences a sliding nucleosome settles with.6,7 On the other hand, there are regions which have relatively inflexible nucleosomal occupancy due to sequence features. Sequences with high dA:dT density are nucleosome-depleted. 8 Positioned nucleosomes flanking the transcription start sites (TSSs) and constitutive heterochromatin serve as strong examples of sequence dependence of nucleosomal occupancy. 4 A periodicity of TT and AT at every 10th base pair facilitates the wrapping of DNA around the histone octamer. 9 Stiff sequences that have GC and AT inter-strand asymmetry provide less favorable sites for nucleosomes. 8 For maintaining the long constitutively nucleosome-rich regions distinct from nucleosome-depleted regions, the DNA sequence plays a key role through binding of boundary element proteins, typified by CCCTC-binding factor (CTCF).10,11 Nucleosome locations are neither fixed (like binding of sequence-specific transcription factor [TF] complexes) nor entirely random. Many TFs bind to specific cognate sequences which may be nucleosome-free or occupied by nucleosomes with activating or repressive histone modifications.12,13 However, the TF-chromatin interaction is also determined by the nucleosome-TFs interaction.14,15 Thus, TF-DNA interaction is potentially regulated by nucleosomal occupancy in many different ways: dense nucleosomal occupancy by nucleosomes with repressive histone modifications may render the DNA inaccessible, or the presence of activating histone modifications or simply an absence of nucleosomal occupancy at specific regions may promote TF-DNA interaction. Nucleosomes with activating histone modifications interact with and recruit positive regulators of transcription, whereas the histones with repressive marks serve as negative regulators of gene expression by either associating with and recruiting repressors of transcription or by hindering the access of DNA to positive regulators of transcription. 16 In addition, a balanced state of positive and negative regulators of transcription, all depending on bivalent modifications of histones as nucleosomal components, may maintain genes in a poised state of transcription. 16 However, high nucleosomal occupancy at TSSs rich in binding sites of transcription activators occurs in differentiation, suggesting that TF-binding sites are not unanimous determinants of gene expression.17,18 The cis-regulation of gene expression depends on TSS-flanking regulatory sequences. These TSS-flanking sequences determine transcription patterns of genes in a developmental stage-specific manner by being playgrounds of the epigenetic mechanisms that drive differentiation.17,18 A key set of these epigenetic mechanisms includes regulated changes in nucleosomal occupancy, histone modifications of nucleosomes, and the accessibility of transcription machinery to DNA. 19 Over two-thirds of all known genes have TSSs associated with GC-rich upstream regulatory regions which are nucleosome-poor and contain binding sites for TFs. 20 Histones, unlike TFs, are bound to DNA genome-wide without being restricted to the GC-rich TSS-flanking cis-regulatory elements, promoters, and TF-binding sites. Identification of histone-bound sequences genome-wide, typically by chromatin immunoprecipitation sequencing (ChIP-seq) on variously modified histones, cannot detect individual nucleosome-associated DNA sequences due to systemic features of the technique and the method. Histone-binding patterns derived from single molecules get pooled for alleles and replicated chromatids in single cell analyses. In the commonly performed analyses on pools of millions of cells, deep sequencing generates sequence reads which overlap and cluster in regions characterized as histone-bound peaks. A significant part of histone ChIP-seq reads, however, do not cluster to generate peaks but nevertheless demonstrate 2 facts: that there is significant intercellular heterogeneity in nucleosomal occupancy and that nucleosomal occupancy is not random but clustered in discrete sequences identified as peaks.21,22 The intercellular heterogeneity of epigenetic mechanisms thus seems to be patterned around certain signals that ensure that the specific types of histone modifications cluster at specific functional sites, such as TSSs with possible effects on transcription.
Transcription start sites and promoters with TF-binding sequences also display distinct nucleosomal patterns. 23 Various lines of research on nucleosomal positioning and histone occupancy have supported the notion that histone-DNA interactions are not governed by histone-specific DNA sequence motifs. While TF-binding site-rich sequences such as promoters are nucleosome-poor, the repeat-rich constitutive heterochromatin is often nucleosome-dense.19,24 Although nucleosomal positioning by DNA sequence motifs is safely ruled out except for a positive correlation with W/S motifs with repeats, 25 it is not known if DNA sequence motifs play a role in determining nucleosomal density.
Epigenetic mechanisms are important for maintenance of cell-type identity and are altered in cellular transformation.25–28 Also, disturbances in positioned nucleosomes could be caused by epigenetic changes such as cytosine methylation, activities of chromatin remodeling factors which can affect histone occupancy, and thus transcription and TSS usage. 28 Altered gene expression due to spatial disturbances in histone code and nucleosomal densities can contribute to tumorigenesis. Transcription start sites are sequences often associated with GC-richness, high CpG density, and association with CpG islands, as well as clusters of TF-binding motifs. Regulation of peri-TSS nucleosomal density by DNA sequence features and its consequence on gene expression remains unknown. We began with an assumption that histone-DNA interaction is not dependent on TF-binding site-like DNA sequence motifs. To test the validity of our assumption, we have performed a systematic analysis of DNA sequence motifs hidden in ENCODE DNase-seq and histone ChIP-seq datasets. Our approach relies on identification of common ChIP-seq Peak Regions (CPR) across multiple transformed cell lines or primary cell lines, and helps us decipher any subtle primary cell-specific and transformed cell-specific signature sequence motifs that underlie altered nucleosomal positioning in cellular transformation. Our analyses reveal that CPR in primary cells (p-CPR) contain contrasting sequence features compared with the CPR in transformed cells (t-CPR). A functional consequence of peri-TSS altered nucleosomal density seems to be aberrant in RNA Polymerase II activity in transformed cells immediately downstream of TSSs.
Methods
Cell lines and histone modification ChIP-seq datasets
Eleven histone modifications for which quality-controlled ChIP-seq datasets were available for both primary and transformed cells were selected from ENCODE (Supplemental Table S1). Processed narrow ChIP-seq peak data for each cell line were downloaded either in bigBed (for hg38) or bed (for hg19) formats. No prior filtering of the data was done before using them as inputs for CPR identification.
Cell lines and datasets of POLR2A ChIP-seq and RNA-seq
POLR2A ChIP-seq and RNA-seq data were not available for the same sets of cell lines as used for CPR identification based on histone modification ChIP-seq data. Disparate sets of quality-controlled RNA-seq and POLR2A ChIP-seq data from various primary and transformed cell lines were obtained from ENCODE (Supplemental Tables S6 and S7). All the RNA-seq data were from paired-end sequencing and available for both the strands as pairs of bigWig files for each cell line. POLR2A ChIP-seq peak data were available as single bigWig file per cell line.
Other datasets and publicly available resources
Transcription start site dataset (hg19 coordinates of Capped Analysis of Gene Expression [CAGE] peaks-verified regions) was downloaded from FANTOM database. Genome-wide GC content at 5 bp interval (gc5base.bw) was downloaded from University of California Santa Cruz (UCSC) Human Genome (hg38) Annotation database. The genomic coordinates, wherever required, were converted from hg19 to hg38 human genome assembly either through LiftOver tool in UCSC Genome database online or using a locally installed version of the same tool. Genome file for chromosome sizes and hg38 sequence (repeat-masked or unmasked) were downloaded from UCSC Genome database.
CPR identification and analyses of p-CPR and t-CPR
To find CPR between different numbers of cells lines within primary and transformed cells separately, histone modification-specific multiple intersects were performed using multiintersectBed function in Bedtools. The lengths of obtained CPR across different number of combinations of cells were filtered by retaining 100 bp or more.
To obtain distance distribution between CPR with p-CPR and t-CPR motifs, the closest distance between CPR from two sets was calculated using closestBed function in Bedtools. This distribution was compared against 10 iterations of distance distribution of CPR, pooled and randomly assigned to either p-CPR or t-CPR groups blindly. Significance of difference between the two distribution patterns was calculated by Kruskal-Wallis test in GraphPad Prism 7.
Motif discovery
FASTA sequences for the CPR were fetched from the hg38 build of Human genome through fastaFromBed function in Bedtools. Motif discovery was performed by subjecting these sequences of CPR to non-discriminative and discriminative motif search using Discriminative Regular Expression Motif Elicitation (DREME) program from Multiple Em for Motif Elicitation (MEME) suite29,30 command-line versions. The discriminative motif search was performed by using the command-line version of MEME suite application DREME. Discriminative Regular Expression Motif Elicitation performs a Fisher’s exact test on the difference of a motif occurrence between given sets of positive and negative sequences. By default, DREME applies an
For stringency, we considered only top 10 best discriminative motifs, and all the discriminative motifs reported here correspond to
For the discriminative motif search, the primary and transformed CPR FASTA sequences were used as negative files against transformed and primary CPR sequences, respectively.
Consensus Motif discovery
Consensus motif identification for each histone modification from the Position Weight Matrices (PWMs) of motifs obtained by DREME was carried out using STAMP tool (a web tool for exploring DNA-binding motif similarities).29–31 The PWMs obtained as STAMP output (PWMs of consensus motifs of all histone modifications) were further subjected to cell-type-specific consensus motif identification.
Motif prediction at CPR
For individual histone modification, the p-CPR and t-CPR motif prediction in CPR were carried out by performing Find Individual Motif Occurrences (FIMO) option in locally installed MEME suite by providing PWMs obtained from DREME as input (
Heat-maps and summary plots
Conversion of bed to bigWig was done using UCSC scripts bedItemOverlapCount and bedGraphToBigWig. The bigWigs were generated for CpG density genome-wide, CPR motifs and RNA-seq reads combined for all histone modifications.
The matrices for plotting heat-maps and summary plots were generated by using computeMatrix option in Deeptools. 32 The plots were generated by utilizing the plotHeatmap function for heat-map and the plotProfile for summary plots in Deeptools. RNA-seq read density and POLR2A occupancy plots were generated by aggregate option in bwtool to calculate signal at base level and visualized in GraphPad Prism 7.
Identification of CPR-TSSs and calculation of p-CPR or t-CPR motif distances from the TSSs
Using FIMO (MEME Suite), we first identified and filtered for further analysis a subset of CPR, which contained p-CPR motifs or t-CPR motifs (
All genome coordinate conversions were done using UCSC Genome browser LiftOver tool. After identifying the CPR-TSSs for each histone modification, the respective CPR-TSSs (.bed) files were concatenated and the entries were made unique to form a collective set of CPR-TSSs for all 8 histone modifications. Using Bedtools intersect function, the set of TSSs were identified (from the TSS dataset) excluding the CPR-TSSs and were termed as non-CPR-TSSs.
Repeat content analysis
Repeat contents in CPR-TSSs and non-CPR-TSSs upstream flanks (Supplemental Tables S4 and S5) were measured using RepeatMasker 33 by running with default options. All the CPR-TSSs undergoing a p-CPR to t-CPR motif shift for the different histone modifications were collated and bed coordinates were generated for the TSSs and up to 0.5 kb upstream locations. These coordinates were made unique and merged (using Bedtools mergeBed function) before extracting sequence (hg38) and subjecting to analysis by RepeatMasker.
Statistical analysis
All statistical analyses were performed in GraphPad Prism 7 on the data generated from analysis using different tools and OpenOffice Spreadsheet as mentioned.
Generation of shuffled genomic coordinate and mock sets of CPR
Random selection of permuted genomic regions of similar length and chromosomal profiles of the CPR were generated against each of the histone modifications by using Bedtools shuffleBed. To obtain mock set of CPR (t-CPR and p-CPR), 10 different rounds of randomization were carried out on regions pooled from CPR-threshold of primary and transformed groups for each histone modification using the sub-command random in Bedtools. The pooled sequences were randomly segregated proportionately into mock groups.
Results
p-CPR and t-CPR share common DNA sequence motifs
DNA binding of histones is not known to occur at defined DNA sequence motifs. However, if DNA-histone interaction in vivo (in the form of nucleosomal occupancy) depends on genomic DNA sequence, then histone occupancy would tend to recur consistently at such sequences across different cell types. Else, if histone occupancy solely depends on tissue type and differentiation, then histone occupancy would vary between tissue types and in cellular transformation. To test these possibilities in an unbiased manner, we began by collectively analyzing histone ChIP-seq data (for a panel of transformed and primary cell lines for 11 histone modifications; chosen from ENCODE database) and finding out CPR (Supplemental Table S1).
Using the multiple intersection feature in Bedtools, 34 we derived CPR (minimum 100 bp long) for each histone modification in panels of primary or transformed cells separately. For each histone modification, the number of CPR decreased exponentially with an initial increase in the number of cell lines across which commonality was calculated. For each histone modification, we subjectively chose a threshold number of cell lines (independently for the panels of transformed and primary cells) after which the increase in the number of cell lines did not have a major effect on the number of CPR (Figure 1A). The CPR for primary and transformed cells at these thresholds were called p-CPR and t-CPR, respectively. In contrast, at most of these thresholds, no common regions could be identified in shuffled genomic coordinates for any of the histone modifications (Figure 1A). This established that CPR are specifically present only in regions occupied by histones.

Identification of histone CPR and their co-occupancy. (A) For each histone modification, the number of CPR (100 bp and more) obtained after performing multiple intersections among the ChIP-seq peaks in primary and transformed cells separately was plotted (
The ChIP-seq data used in this analysis were derived from variously modified histones, some of which co-occupy the DNA as components of nucleosomes. Expectedly, we observed overlaps between CPR for the different histone modifications (Figure 1B). A Bland-Altman plot depicting the number of CPR shared between different histone modifications clearly showed that CPR for some pairs of histone modifications occur very differently between primary and transformed cells (Figure 1B and Supplemental Table S2).
The sequences of the p-CPR and t-CPR for each histone modification were subjected to a non-discriminative motif search. Unexpectedly, a consensus DNA sequence motif was found in CPR sequences for each histone modification in both primary and transformed cell types (Figure 2A). These motifs were called “non-discriminative CPR motifs.”

CPR exhibit DNA sequence motifs both with and without cell-type specificity. (A) Highly significant motifs were identified using DREME on sequences of CPR for primary and transformed cells in a non-discriminative mode (not shown) and top 10 motifs used as inputs for generating consensus motifs (using STAMP). The consensus motifs obtained in transformed and primary cell types are indicated with the names of respective histone modifications. Core consensus motifs were deduced from the histone modification-specific consensus motifs for transformed and primary cells (A). These core consensus motifs were strikingly similar between transformed and primary cells suggesting commonality of sequence features across all CPR independent of cell-type identity and the kind of histone modification. (B) To identify the presence of any cell-type specific motif in the primary and transformed CPR, discriminative motif search was performed by providing 1 set of CPR as background (negative sequence set) for the another (positive sequence set) in DREME. The consensus motifs in the transformed cells (t-CPR motifs) were relatively AT-rich for all histone modifications (except for H3K4me3) than the consensus motifs in primary cells (p-CPR motifs) that were relatively GC-rich. However, p-CPR motifs could not be identified in H3K4me3, H3K27ac, and H3K79me2. Core consensus for the discriminative motifs deduced using STAMP tool showed that indeed the t-CPR motifs were AT-rich and the p-CPR motifs were GC-rich. Unlike the non-discriminative CPR motifs which were highly similar between primary and transformed cells, the t-CPR and p-CPR motifs showed no similarity. All the motif identification were done using default significance values of DREME and STAMP. ChIP-seq indicates chromatin immunoprecipitation sequencing; CPR, ChIP-seq Peak Regions; DREME, Discriminative Regular Expression Motif Elicitation.
These findings suggested that the motifs in CPR are sites for preferential histone occupancy in a wide spectrum of primary and transformed cells. However, histone modification and nucleosomal occupancy are disturbed in cellular transformation. 35 So, we next wanted to know if there are additional sequence features not revealed by the above-mentioned motif search that are subtly different between p-CPR and t-CPR.
Discriminative motif search revealed subtle differences in t-CPR and p-CPR
The findings described above underscored remarkable sequence similarities between the DNA sequences of p-CPR and t-CPR for any particular histone modification across different primary and transformed cell types. Such a sequence similarity between two different sets of sequences (p-CPR and t-CPR) could arise due to (1) an overlap of p-CPR and t-CPR genomic coordinates or (2) presence of same motifs in non-overlapping p-CPR and t-CPR sequences. In the former case, the distance between p-CPR and t-CPR would be lower than the latter. We tested these possibilities by measuring the closest distances between the centers of p-CRPs and t-CPR with sorted genomic coordinates.
The distribution of closest distances between p-CPR and t-CPR was obtained and compared with a mock closest distance profile of the same set of CPR after randomization between primary and transformed groups (10 different rounds of randomization of all CPR into mock sets). Surprisingly, this closest distance distribution analysis revealed that the closest distance between p-CPR and t-CPR is higher than randomly expected (Supplemental Figure S1 and data not shown) (Kruskal-Wallis test,
The occurrence of p-CPR and t-CPR in distinct genomic regions raised some interesting possibilities. Are there subtle DNA sequence features that are unique to p-CPR or t-CPR? How is the CPR displacement relevant for gene expression in cellular transformation? Is cellular transformation associated with differential usage of CPR?
To identify any distinct p-CPR-specific and t-CPR-specific motifs, we performed a discriminative motif search by using 1 set as a background against the other. Results showed that for the CPR of H2AFZ, H3K4me1, H3K4me2, H3K9ac, H3K9me3, H3K27me3, H3K36me3, and H4K20me1, distinct p-CPR-specific and t-CPR-specific motifs exist. The t-CPR-specific motifs (t-CPR motifs) for all histone modifications were AT-rich (GC-poor) unlike the p-CPR-specific motifs (p-CPR motifs), which were GC-rich in nature. However, for the CPR of H3K4me3, H3K27ac, and H3K79me2 histone modifications, only t-CPR motifs could be discovered which were AT-rich for H3K27ac and H3K79me2 and GC-rich for H3K4me3 (no p-CPR motifs could be identified). This reinforced that the GC-richness of p-CPR motifs and AT-richness of t-CPR motifs is a specific phenomenon restricted to certain histone modifications only (Figure 2B). We concluded that although non-discriminative CPR motifs are associated with the occurrence of histone modifications in primary and transformed cells both, there are also different primary-specific and transformed-specific motifs for some histone modifications.
Unlike p-CPR motifs, the t-CPR motifs are not concentrated near CpG islands-associated TSSs
The high GC content of p-CPR motifs as compared with the t-CPR motifs led us to argue that (1) high levels of histone occupancy in GC-rich regions, such as CpG islands, are maintained in primary cells, and (2) this relationship between CPR and GC-rich regions gets disrupted in transformed cells such that both the locations of t-CPR and their underlying DNA sequence identifier motif become different from those of p-CPR.
For studying the changes in CPR and its effect on gene expression relevant in cellular transformation, we first narrowed down to a subset of CPR for each histone modification. The midpoints of these selected CPR were located within 1 kb of the nearest CpG island. These regions were either p-CPR motif-positive or t-CPR motif-positive. All the TSSs located within the coordinates of such CPR and associated CpG islands (CPR-TSSs) were filtered out for primary and transformed cells separately. For the 8 different histone modifications for which both p-CPR and t-CPR motifs were available, different numbers of CPR-TSSs were found to be shared between primary and transformed cells (detailed in the “Methods” section). The CPR-TSSs were important because they were strong candidates for transcriptional regulation by a CpG-rich sequence in cis and yet were exhibiting a GC-rich p-CPR to GC-poor t-CPR motif preference in cellular transformation.
To understand how differential histone occupancy at CPR motifs might affect the TSSs, the distance of p-CPR and t-CPR motifs from CPR-TSSs was calculated. Genomic location plots centered at the CPR-TSSs depicting the occurrence of p-CPR motifs and t-CPR motifs showed a striking concentration of p-CPR motifs closer to the TSSs, whereas the t-CPR motifs were located at a consistently larger distance. Without any exception, for all the 8 histone modifications, the GC-rich p-CPR motifs were located closer to the TSS than the GC-poor t-CPR motifs (Figure 3). In cases where multiple p-CPR and t-CPR motifs were present, we used the coordinates of the p-CPR and t-CPR motifs closest to the TSS for calculating the distances. These findings proved that many CpG island-associated TSSs contain p-CPR motifs and t-CPR motifs in flanking regions and that the transformed cells preferentially employ distant t-CPR motifs for histone occupancy over the proximal p-CPR motifs employed by the primary cells. These findings also indicated that the regions between p-CPR motifs and t-CPR motifs are transition zones for the GC and CpG contents.

p-CPR and t-CPR motifs have distinct localization with respect to the CPR-TSSs. For each of the 8 histone modifications as indicated, the locations of individual p-CPR and t-CPR motif with respect to the associated CPR-TSSs were plotted on the heat-maps (left flank). The graphs (right flank) show median distance of the motif-midpoint from the TSS midpoint (±interquartile range (IQR) values) for the same data as plotted in the heat-map to the left. The
ChIP-seq reads distribution recapitulate the patterns of p-CPR and t-CPR motif distribution around CPR-TSSs
The occurrence of p-CPR and t-CPR motifs was deduced from the analyses of ChIP-seq peaks derived from different independent experiments. To verify the robustness of these findings, we wanted to ensure that the post-alignment read-level ChIP-seq data were in agreement with the findings obtained from peak-level analyses. We mapped the read densities for the 8 histone modification ChIP-seq datasets from comparable pairs of transformed and primary cells36–38 in 5 kb flanks of CPR-TSSs. We compared histone modification datasets from A549 cell line with IMR90 cells and OCI-LY1 cell line with B-cells.38–40 Reinforcing our findings that a p-CPR to t-CPR motif shift occurs at CPR-TSSs, the results showed that the ChIP-seq read densities were stronger in the immediate flanks of the CPR-TSSs in normal cells and relatively distal in transformed cells (Supplemental Figures S2 and S3). As expected, for each histone modification, a distinct read density profile was observed, but the differences in the density of reads between primary and transformed cells relative to the CPR-TSSs were consistent. In comparison, the TSSs falling in CPR that are unique either to transformed or primary cells and not containing p-CPR or t-CPR motifs (non-CPR-TSSs) showed a different pattern of ChIP-seq read densities (Supplemental Figures S2 and S3 and Supplemental Table S3). Although the read densities at non-CPR-TSSs were weaker in transformed cells compared with the primary cells, the shift away from the TSSs as observed for the CPR-TSSs was not visible. Thus, by mapping ChIP-seq read densities in representative pairs of primary and transformed cells, we could reinforce the robustness of our findings.
Distinct sequence features mark flanking sequences of TSSs with t-CPR and p-CPR motifs
The presence of motifs of contrasting GC-richness near some CPR-TSSs suggested that in the flanks of these TSSs, the GC content is non-uniform. To understand the nature of GC-content heterogeneity, we first analyzed how the GC and CpG contents change in the region between p-CPR and t-CPR motifs. The GC-richness and CpG density were plotted for the regions between p-CPR motifs and t-CPR motifs (Supplemental Figure S4). It was observed that the GC content was high at the p-CPR motifs and decreased toward the t-CPR motifs. This corroborated our previous findings (Figure 3) that the p-CPR motifs are positioned closer to the TSSs in GC-rich regions unlike the t-CPR motifs. The region defined by p-CPR motifs and t-CPR motifs thus turned out to be a region within which the GC content reduces abruptly as the coordinates shift away from TSSs. This feature was also replicated when CpG density was plotted in the same regions (Supplemental Figure S4). The CpG density was higher than expected for the p-CPR motifs and dropped to expected levels at the GC-poor t-CPR motifs (Supplemental Figure S4). These findings showed that through a motif selectivity for GC-poor t-CPR motifs, the transformed cells employ CpG-poor regions for determining histone occupancy.
The p-CPR and t-CPR motifs were all identified in repeat-masked sequences. Transcriptional regulation by them at the linked TSSs was a tantalizing possibility. However, it remained unclear how repeats and repeats-associated recurring sequence motifs could affect these TSSs. To eliminate the probability that the possible regulation of TSSs by p-CPR and t-CPR motifs could be confounded by repeat sequences, we measured repeat content in immediate upstream regions of these TSSs. We found that the CPR-TSSs were located in regions (0.5 kb upstream flank) with unexpectedly low repeat content as compared with non-CPR-TSSs (Supplemental Table S4). A strand-specific analysis of CPR-TSSs (0.3 kb upstream flank) revealed that there is no strand-specific bias in the repeat contents of immediate upstream flank sequences and TSSs on both strands are equally repeat-poor (Supplemental Table S5). Together these results indicated that p-CPR and t-CPR motifs are major regulators of these TSSs and they potentially act differently between primary and transformed cells.
Differential H2AFZ occupancy at CPR and non-CPR-TSSs indicates a role of CPR motifs in transcription
The disturbances in histone occupancy at TSS flanks impact transcription. The presence of positioned nucleosomes, with H2AFZ as a component, defines the TSSs. 2 To understand if the CPR-TSSs employ altered transcription at the start sites, we focused on H2AFZ occupancy at CPR-TSSs and non-CPR-TSSs in primary (IMR90) and transformed (A549) cells. H2AFZ occupancy when mapped at read-levels for these cells revealed that the TSS-positioning peaks of H2AFZ are clearly maintained in IMR90 cells but lost in A549 cells. In IMR90 cells, the H2AFZ signal was concentrated downstream of TSS (approximately <100 bp) at non-CPR-TSSs. However, at CPR-TSSs, this downstream concentration of H2AFZ was at a larger distance (approximately >100 bp) (Figure 4). Interestingly, this pattern of H2AFZ read density upstream of CPR-TSSs when compared with that of A549 cells showed a difference that followed the same pattern as we had found for the distribution of p-CPR and t-CPR motifs (Figure 3). Like the p-CPR motifs, the H2AFZ reads were concentrated closer to the TSSs in IMR90 (primary) cells and distal in A549 (transformed) cells. At the non-CPR-TSSs, there was no such qualitative difference in H2AFZ occupancy. Hence, the A549 cells showed no H2AFZ enrichment distal from the non-CPR-TSSs compared with IMR90. The only difference between H2AFZ occupancy in IMR90 and A549 cells was that the latter showed a weaker occupancy of H2AFZ immediately downstream of TSSs (Figure 4). These findings collectively demonstrated that the CPR and non-CPR classification of the TSSs is justified because (1) the difference in histone occupancy at CPR-TSSs and non-CPR-TSSs is consistent across various ChIP-seq datasets, (2) CPR-TSSs and non-CPR-TSSs display distinct patterns of histone occupancy between primary and transformed cells, and (3) for a selected panel of datasets for H2AFZ, the distribution of p-CPR motifs and t-CPR motifs recapitulate the occurrence of actual H2AFZ-bound sequence reads (Figure 4).

Different patterns of H2AFZ occupancy in a representative primary and transformed cell pair at CPR-TSSs and non-CPR-TSSs suggest a role of CPR motifs in transcription. (A) and (B) CPR-TSSs on top (A) or bottom (B) strands exhibited a highly similar pattern of H2AFZ occupancy. The higher distance of H2AFZ read density peaks from the TSS midpoint for transformed cell A549 (red) than that of the primary cell IMR90 (blue) corroborates our findings that CPR motifs cluster closer to TSSs in primary cells than in transformed cells. (C) and (D) At non-CPR-TSSs, however, there was a strong difference in H2AFZ occupancy exclusively in a region immediately downstream (approximately <100 bp) of the TSSs. The IMR90 cells showed a strong enrichment of H2AFZ at the site where RNA Polymerase II pausing is expected. This H2AFZ enrichment downstream of TSSs was missing in A549 cells for both the top (C) and bottom (D) strands. No qualitative difference was observed for the H2AFZ occupancy for IMR90 and A549 away from the non-CPR-TSSs. The CPR-TSSs and non-CPR-TSSs thus exhibit different patterns of H2AFZ occupancy in normal cells—compare blue curves in (A) and (C) with blue curves in (B) and (D), respectively—and this difference is further strengthened in transformed cells—compare red curves in (A) and (C) with red curves in (B) and (D), respectively. H2AFZ occupancy at CPR-TSSs for A549 exhibits a noticeable strand-specific bias due to unknown reasons. CPR indicates ChIP-seq Peak Regions, TSSs, transcription start sites.
The analyses so far demonstrated that histone occupancy, including that of H2AFZ at CPR-TSSs and non-CPR-TSSs is different between primary and transformed cells. Next, we wanted to know how the altered H2AFZ occupancy at CPR-TSSs and non-CPR-TSSs translates into transcriptional disturbances in primary and transformed cells. We applied the CPR and non-CPR classification of the TSSs to test if these were associated with different transcription patterns and RNA Polymerase II (POLR2A) occupancy across available RNA-seq and POLR2A ChIP-seq datasets, respectively.
A p-CPR to t-CPR motif preference for H2AFZ in transformed cells coincides with deregulated transcription in TSS flanks
ChIP-seq Peak Regions-transcription start sites exhibit a shift of p-CPR motif to t-CPR motif for histone occupancy. Such a shift at the TSSs for positioned nucleosomes which consist of H2AFZ can potentially affect the POLR2A occupancy at CPR-TSSs and non-CPR-TSSs differently. We analyzed H2AFZ occupancy, POLR2A occupancy, and RNA-seq read density in datasets from a panel of primary and transformed cell lines (Supplemental Tables S6 and S7) to test these possibilities.
POLR2A occupancy was plotted in 0.5 kb flanks of CPR-TSSs as well as for a control set of TSSs that did not show any association with the CPR motifs. POLR2A occupancy exhibited similar pattern at CPR-TSSs in primary and transformed cells. Thus, by employing a p-CPR to t-CPR motif preference, the CPR-TSSs escaped the transformation-associated anomalous POLR2A recruitment (Figure 5A and B). A highly similar pattern of RNA-seq read density distribution in primary and transformed cells was observed in both these sets of coordinates (Figure 5C and D).

H2AFZ non-CPR-TSSs display distinct patterns of POLR2A and RNA-seq reads density in primary and transformed cells. (A) and (B) POLR2A ChIP-seq read density at all CPR-TSSs (A) and all non-CPR-TSSs (B) are highly similar. (C) and (D) RNA-seq read density in the same regions as plotted in (A) and (B), respectively, also shows a similar pattern between CPR (C) and non-CPR-TSSs (D) as well as primary and transformed cells, especially in immediate flanks where −1 and +1 positioned nucleosomes are expected to be located. (E) and (F) POLR2A occupancy at CPR-TSSs (E) and non-CPR-TSSs (F) specifically for H2AFZ, a positioned nucleosome component, showed striking dissimilarity that stands out of the pooled analysis of all 8 histone modifications (A) and (B). The POLR2A showed enrichment at approximately 100 to 200 bp flanks of H2AFZ CPR-TSSs (E) which increased to approximately 300 to 500 bp flanks of H2AFZ non-CPR-TSSs (F). At the H2AFZ non-CPR-TSSs, the primary and transformed cells showed the strongest difference in POLR2A occupancy. (G) and (H) In the same regions as in (E) and (F), RNA-seq reads were plotted. At H2AFZ CPR-TSSs (G), the RNA-seq reads recapitulated the pattern of POLR2A occupancy (E) in both primary and transformed cells. However, as shown in (H), RNA-seq reads in the H2AFZ non-CPR-TSSs showed good correlation only for transformed cells, not primary cells. Values plotted on
Next, we performed a targeted analysis of POLR2A occupancy and RNA-seq read density in CPR-TSSs and non-CPR-TSSs, specifically for H2AFZ (Figure 5E-H). POLR2A occupancy was highly similar between primary and transformed cells at H2AFZ CPR-TSSs (Figure 5E). On the other hand, H2AFZ non-CPR-TSSs showed highly different POLR2A occupancy (Figure 5F). The transformed cells showed strikingly low POLR2A occupancy both upstream and downstream of non-CPR-TSSs in the direction of transcription (red and orange samples in Figure 5E and F, respectively). The primary cells, on the other hand, showed a robust POLR2A recruitment at non-CPR-TSSs as expected with a subpeak of POLR2A read density depicting upstream binding, and a downstream subpeak depicting the pausing of POLR2A. Interestingly, this difference in POLR2A recruitment between primary and transformed cells was not observed in CPR-TSSs (blue sample in Figure 5E) and was restricted only to the non-CPR-TSSs (aqua sample in Figure 5F). The classification of TSSs as CPR-TSSs and non-CPR-TSSs thus proved to be instrumental in identifying groups of TSSs that maintain distinct histone occupancy in transformed and primary cells, thereby affecting POLR2A occupancy.
We argued that if different levels of POLR2A occupancy at H2AFZ-bound CPR-TSSs and non-CPR-TSSs are functionally relevant, then it would affect transcriptional activity in the flanks of these TSSs. The functional impact of different POLR2A occupancy at H2AFZ-bound CPR-TSSs and non-CPR-TSSs could thus be tested by analyzing RNA-seq datasets of different cell lines and correlating transcript levels with the POLR2A occupancy. We measured the RNA-seq read density at H2AFZ-bound CPR-TSSs and non-CPR-TSSs (same as the ones on which POLR2A occupancy was plotted). The CPR-TSSs in both primary and transformed cells showed RNA-seq reads distributed on both sides of the TSSs, as was expected from the POLR2A distribution in these regions (Figure 5G and H). In transformed cells, the RNA-seq and POLR2A occupancy showed a strong correlation in 1 kb flanks of CPR-TSSs (Supplemental Figure S5A and B) and non-CPR-TSSs (Supplemental Figure S5C and D). Clearly, the H2AFZ-bound CPR-TSSs maintained similar pattern of transcript levels in primary and transformed cells than the non-CPR-TSSs. This correlated RNA-seq and POLR2A occupancy in transformed cells were same as seen in primary cells for the CPR-TSSs only (Supplemental Figure S5E and F). The correlated RNA-seq and POLR2A occupancy in non-CPR-TSSs seen in transformed cells (Supplemental Figure S5C and D) were different from the generally uncorrelated pattern expected in primary cells (Supplemental Figure S5G and H).
These findings were consistent with the knowledge of RNA Polymerase II pausing immediately downstream of the TSSs. Because such a pausing of RNA Polymerase II is expected to be its non-productive retention, in primary cells the RNA-seq and POLR2A read density are expected to be uncorrelated. In transformed cells, however, this pattern was lost and POLR2A occupancy seemed to be coupled with transcription, and hence positively correlating with RNA-seq read density. These data indicated that unlike CPR-TSSs, the non-CPR-TSSs in transformed cells are prone to runaway transcriptional activity by RNA Polymerase II. At the CPR-TSSs, however, a p-CPR to t-CPR motif shift mitigates such aberrant transcriptional activity.
Discussion
It has been established that DNA sequence can affect positioning of individual nucleosomes. Within nucleosomes, histones co-occur in different combinations of post-translational modifications. These post-translational modifications of histones affect the regulation of nucleosomal as well as higher order chromatin structure and function. A large amount of data exist (typically histone modification ChIP-seq data) in the public domain that describes the association of variously modified histones with human genomic DNA in different types of cells. As the histones typically interact with DNA as components of nucleosomes, histone occupancy can indicate nucleosomal occupancy. Histone occupancy data from pools of cells, with each cell having its own unique histone occupancy profile, do not contain information about nucleosomal boundaries. However, if site variability of histone occupancy has a tendency such that certain genomic regions tend to have more histone occupancy than others in a population of cells, then this information is embedded in such histone modification ChIP-seq data. In this work, we have systematically mined the histone modification ChIP-seq data regardless of age and gender in an unbiased manner for primary and transformed cells and identified genomic regions that tend to have high histone occupancy and described their unexpected sequence features. Our comparative approach led us to discover that genomic regions with a high tendency of histone occupancy have functionally relevant subtle differences in sequence features between primary and transformed cells.
Nucleosomal occupancy has also been studied using DNase-seq. However, the footprint of DNase digestion depends on the binding of several different protein complexes to the DNA other than histones as well. This difference between DNase-seq and histone ChIP-seq datasets is strongly manifested in the narrow peaks. In the narrow peaks, the nucleosomes are present in low density, making the DNA accessible to DNA-binding proteins. Indeed, we have found (data not shown) that similar to CPR, non-discriminative motif-containing common peak regions can also be identified in DNase-seq datasets, albeit with a much lower frequency. These findings were obtained from 35 primary and 25 transformed cell lines (ENCODE).
The publicly available histone modification ChIP-seq data are at the level of peaks as well as reads, wherein each peak represents a central tendency of a cluster of reads. Traditionally, the ChIP-seq peaks for sequence-specific DNA-binding proteins such as TFs, not histones, are probed for central enrichment of sequence motifs. However, establishment and maintenance of chromatin conformation by sequence-specific binding factors also regulate the histone occupancy and modification profile of the target region. In such a condition, genomic regions with a strong tendency to be histone-bound will also harbor DNA sequence motifs for such chromatin organizer proteins. This is a tangible possibility because DNA sequence-specific binding of protein such as CTCF and YY1 can regulate occupancy of specifically modified histone modifications.41–45
Our non-candidate approach has involved ChIP-seq datasets from several primary and transformed cell lines available on ENCODE. Derivation of common peak regions from such a diverse group of multiple datasets from independent experiments makes our approach more robust and increases both statistical strength and biological relevance of our findings.
To eliminate the arbitrariness in selecting the threshold number of cell lines for calling of CPR, we calculated the commonly retained regions for different numbers of independent samples, with no prior filtering of the published datasets. To further strengthen the confidence in the CPR identification, we used ChIP-seq peaks as input for our analyses, where the reads are reported to cluster over and above the background in different samples. We also applied the same CPR identification protocol on a mock set of sequences obtained by randomly selecting genomic regions of similar length and chromosomal profiles as of the peaks used for actual CPR calling. Thus, we ensured that the CPR could be called only with actual peak sequences and not with a random set of sequences, thereby establishing the specificity of CPR. Also, by selecting a threshold number of cell lines such that comparable number of CPR are called between primary and transformed cells, we ensured that any downstream processing of the data is not affected by different CPR counts between the two sets.
Given that we were analyzing data from completely disparate sets of cell types, in which histone occupancy was expected to be widely different due to different lineages of differentiation and cellular transformation, identification of any CPR was unexpected. To the contrary, we identified thousands of recurrent sequences in multiple datasets. That these are over 0.1 kb long sequences again reinforces that the CPR are not accidentally short occurring sequences due to unidentified random probabilities. Thus, variously modified histones do gravitate toward these CPR and thus, in disparate datasets, exhibit a tendency to be associated with them.
Histone modifications and their occupancy undergo global deregulation in transformed cells. 28 Thus, we derived the CPR and performed downstream analyses in primary and transformed cells separately. The CPR, the different motifs, and sequence properties that we have reported are hence representatives of primary and transformed cell types and not of cell types, tissue types, or modes of cellular transformation. We have thus separated out the CPR findings that do not differentiate transformed and primary cells from those which are distinct between primary and transformed cells.
To identify common sequence motifs in CPR, the sequences of CPR for each histone modification were subjected to motif identification. The majority of CPR, independently discovered for 11 out of 11 histone modifications, had similar motifs between primary and transformed cells. The ChIP-seq peaks used as input in this analysis were all from the repeat-masked genome, thereby eliminating the possibility that the CPR motifs identified are repetitive sequences. These unexpected findings show that global histone occupancy for several histone modifications is primarily regulated by robust mechanisms that are universal for different tissue/cell types and remain unaltered in cellular transformation. However, these results did not rule out the role of additional subtler tissue/cell-type-specific or transformation-specific mechanisms. A prior classification of the samples into primary or transformed groups helped us to discover such hidden DNA sequence elements associated with histone occupancy in these 2 cell-type groups distinctly. The usage of similar sizes of sequences as motif-negative sets in a discriminative motif search led us to highly similar motifs in CPR of 8 out of 11 histone modifications. That the CPR for 2 modifications did not follow the same pattern and an additional third histone modification showed opposite patterns for only cell types suggesting indirectly that CPR of some histone modifications do not follow the discriminative motif pattern we have identified. Thus, although the p-CPR and t-CPR motifs are common to CPR of 8 histone modifications, they are not universal features of sequences that differentially regulate histone occupancy in transformation.
The CPR were first called at the level of individual histone modifications and then analyzed at the cell-type-specific level. At a single cell level, it is expected that CPR of co-occurring histone modifications (as histones bind to DNA as components of nucleosomes) will be overlapping regions. Although identifying such an overlap in data derived from a mass of cells is not possible, it is also expected that the regions with high nucleosomal occupancy will nonetheless recur. Expecting a physical overlap between CPR coordinates of numerous independent and diverse samples is prone to type II error. However, any location-independent sequence features common to CPR could be identified free from such systematic errors, as proven by our results. The most unexpected feature of the CPR is that they harbor two different types of motifs: (1) the universal CPR motifs present in most histone modifications and most cell types analyzed, and (2) a restrictive set of motifs that are present in the CPR but preferentially employed by either primary or transformed cells only. A combined synthesis of this finding is that while the universal CPR indiscriminately favor histone occupancy, the motifs specific for p-CPR and t-CPR fine-tune the CPR around the universal motifs differently in primary or transformed cells. The p-CPR motifs are GC-rich, and CpG hypermethylation in gene promoters is an often observed epigenetic feature of cancer cells. 46 Interestingly, the sequence-specific DNA-binding proteins that regulate histone occupancy (for example, CTCF) may lose DNA binding upon cytosine methylation. 41 Thus, the GC-rich p-CPR motifs are not preferred in transformed cells. The AT-richness of the t-CPR motifs could either be an alternate binding site for the histone occupancy regulating motif-specific DNA-binding protein or a simple consequence of unavailability of the GC-rich sequences in transformed cells.
While the cause of the motif-type preference switch in transformed cells is an area for further investigation, our results show that it has a functional consequence. Transcription start sites are some of the best characterized functional landmarks in human genome and so we chose to study the functional effects of a motif preference shift with respect to the TSSs. By analyzing motifs with respect to those TSSs that are associated in cis with both p-CPR and t-CPR motifs, we found a surprising consistency in the location of t-CPR motifs distal from the TSSs compared with the p-CPR upstream in the direction of transcription. The effects of shift in histone localization could directly affect transcriptional activity and even alternative start site usage. The features we have observed to be differentially associated with p-CPR and t-CPR motifs (namely, different GC and AT-contents) are also important determinants of histone occupancy. 47 These positioned nucleosomes limit divergent transcription47–50 and affect RNA Polymerase II occupancy and activity. 51 The differences in RNA-seq read pattern differences we observe between primary and transformed cells could thus be a result of differential histone occupancy.
The combined effect of p-/t-CPR motif shift at any TSS will be a combination of the profiles of the co-occurring histone modifications in a particular CPR and other factors governing the transcription at that locus. For example, motif shift for CPR of H2AFZ, a component of positioned nucleosomes at TSSs, could be more informative about any wobble in the TSS usage than that for other histone modifications, which are not components of positioned nucleosomes or are prevalent in gene bodies. Thus, our analyses of CPR-TSSs and non-CPR-TSSs specifically for H2AFZ CPR-TSSs have been instrumental in uncovering transcriptional disturbances in TSS flanks. Our results obtained from analyses of disparate sets of RNA-seq data show that H2AFZ occupancy at the flanks of CPR-TSSs is susceptible to disruption in cellular transformation leading to aberrant transcription. Transcription start sites that undergo a p-CPR motif to t-CPR motif shift of H2AFZ occupancy in transformed cells maintain TSS flank transcription normally. However, the CPR-TSSs that do not undergo p-CPR to t-CPR motif shift exhibit aberrant upstream as well as downstream transcription. The RNA-seq read prevalence pattern in the flanks of p- or t-CPR-TSSs is correlated with the POLR2A occupancy. However, whereas the non-CPR-TSSs, which did not exhibit the motif shift, the POLR2A occupancy and RNA-seq read abundance were uncorrelated. Collectively, the findings suggest that these TSSs are prone to divergent transcription between −1 and +1 nucleosomes, 48 and the antisense transcription toward −1 nucleosome is what we observe in our results. Interestingly, it seems that p-CPR to t-CPR motif transition is a mechanism that helps transformed cells to maintain same transcription pattern at the CPR-TSSs as the primary cells. The functional annotation of H2AFZ CPR-TSSs (as the target set) revealed specific enrichment of genes involved in RNA Polymerase II regulatory region sequence-specific DNA binding (GO:0000977) as compared with H2AFZ non-CPR-TSSs (as the background set) using Gene Ontology enRIchment anaLysis and visuaLizAtion (GOrilla) tool 29 (not shown).
The motif shift, although identified for 8 histone modifications here, was analyzed for functional consequence only for H2AFZ and restricted only to TSSs and transcription. ChIP-seq Peak Regions of other histone modifications, occurring at non-TSS regions, could have functional consequences beyond transcription. In addition, it will be informative to analyze what kinds of sequence-specific DNA-binding factors could interact with CPR, and thus, what roles do CPR play in chromatin organization and cellular fate in differentiation and transformation.
Conclusions
Our study reveals that the t-CPR motifs and p-CPR motifs function differently in normal and transformed cells with an impact on histone occupancy and transcription. This is an interesting epigenetic difference between normal and cancer cells not reported till date. Out of 8 histone modifications, we restricted the functional analysis of CPR only to H2AFZ. Our work sets the platform for future investigations as to why histone modifications have different CPR preferences in primary and transformed cells. The robustness of our findings lie in (1) the number and diversity of the panel of cell types chosen for this analysis and (2) that the patterns identified by us hold true across different sequencing and data-analysis platforms. Also, the CPR identified from ChIP-seq data at the level of DNA correlate well with unrelated datasets at the level of RNA-seq. Our work sets a precedence for analyzing and comparing seemingly unrelated datasets to identify patterns that cannot be otherwise propounded by a hypothesis-driven approach only. This work is thus of pertinent relevance to analytics of next-generation datasets, to fellow researchers in the field of epigenetics as well as cancer biology. An obvious line of work to emanate from these data is to identify the epigenetic functional nature of the CPR, something that relies upon experiments and is hence beyond the scope of the current piece of work.
Supplemental Material
Revised_Supplementary_tables_combined_xyz168872431018f_(1) – Supplemental material for Distinct DNA Sequence Preference for Histone Occupancy in Primary and Transformed Cells
Supplemental material, Revised_Supplementary_tables_combined_xyz168872431018f_(1) for Distinct DNA Sequence Preference for Histone Occupancy in Primary and Transformed Cells by Subhamoy Datta, Manthan Patel, Divyesh Patel and Umashankar Singh in Cancer Informatics
Supplemental Material
Supplementary_figures_combined_xyz16887d8ced21d_(3) – Supplemental material for Distinct DNA Sequence Preference for Histone Occupancy in Primary and Transformed Cells
Supplemental material, Supplementary_figures_combined_xyz16887d8ced21d_(3) for Distinct DNA Sequence Preference for Histone Occupancy in Primary and Transformed Cells by Subhamoy Datta, Manthan Patel, Divyesh Patel and Umashankar Singh in Cancer Informatics
Footnotes
Funding:
Declaration of Conflicting Interests:
Author Contributions
Supplemental Material
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
