Abstract
Introduction
Alternative RNA splicing (AS) is an essential post-transcriptional regulatory mechanism in practically all of the fundamental biological processes in complex organisms, such as gene regulation, metabolism, development, cell cycle control, immune response, signal transduction, and human diseases.1–7 By definition, AS targets only multi-exonic genes, which are present in multicellular organisms but rarely observed in unicellular organisms. The vast majority of the genes in unicellular genomes contain only one single exon.8,9 AS is thus arguably a characteristic of multicellular organisms, and an important source for functional diversity and evolutionary novelty.6,10–12 However, previous AS-related studies have been focused on coding genes. The implications of AS in the functions/regulations of noncoding genes have been underexplored. By analyzing the gene structures of long intergenic noncoding RNAs (lincRNAs) and comparing multi-exonic lincRNAs (MELs) and single-exonic lincRNAs (SELs), we might be able to gain some insights into this important issue.
In complex organisms, single-exonic coding genes and multi-exonic coding genes (“SECs” and “MECs,” respectively) differ significantly from each other in evolutionary history, gene regulation, and molecular function. Evolutionarily, SECs in mammals were reported to have relatively recent origins as compared with MECs. Most of the SECs were found only in Chordata, 13 and some found only in mammals. 14 This observation stands in interesting contrast to the notion that SECs are characteristic of the unicellular genomes. 13 In mammals, SECs were found to evolve faster than MECs, 13 and to have undergone lineage-specific expansions. 14 These two groups of genes also differ in terms of gene regulation. SECs tend to be lowly and tissue-specifically expressed as compared with MECs. 13 In addition, SECs were reported to be resistant to mis-transcription because they tended to avoid “fragile codons,” which when mutated could lead to nonsense-mediated decay. 15 Although the majority of human genes are multi-exonic, a number of functionally important genes, such as histones and the G protein-coupled receptors, are mostly composed of only one exon. 16 SECs and MECs also differ from each other in terms of biological function. SECs were reported to be enriched in functional categories such as transport and binding, cell envelope, protein translation, energy metabolism, amino acid biosynthesis, and other regulatory functions. 14
For noncoding genes, however, the above-mentioned differences may not apply. This is because noncoding RNAs are free from the selective constraints imposed on coding genes, such as the conservation of reading frame and functional protein domains. Indeed, noncoding genes have been reported to be evolving rapidly. 17 Furthermore, the exon boundaries of lincRNAs have been recently suggested to turn over rapidly, and be unimportant for the transcriptional regulations of these genes in mammals. 18 Interestingly, the tissue specificities of lincRNA expression were found to be well conserved in mammals 18 and tetrapods 19 despite the rapid turnover of lincRNA exon boundaries. These observations imply that splicing might play a less important role in generating the functional forms of lincRNA genes than in the case of coding genes. Notably, nevertheless, the majority of the annotated lincRNA genes include multiple exons, and a considerable proportion of the multi-exonic lincRNA genes are alternatively spliced. 20 The notion that splicing is unimportant for lincRNA functions implies that MELs and SELs might be biologically similar to each other. Furthermore, if splicing is biologically meaningless for lincRNAs, splicing of a lincRNA should occur stochastically (rather than actively regulated). In other words, the probability that an SEL is spliced should be similar to that of an MEL given similar lengths and sequence compositions. In addition, if splicing is functionally irrelevant, the sequences of alternatively spliced exons and constitutively spliced exons (“ASEs” and “CSEs,” respectively) of MELs should have been subject to similar levels of selective constraint, and evolved at similar rates. These propositions, however, have not been examined.
lincRNAs can regulate the transcriptions of coding genes through a number of different mechanisms, including transcriptional interference, activation of transcription factors/repressor proteins, recruitment of epigenetic modifiers, and induction of chromosomal looping.21,22 These are fairly different regulatory mechanisms, yet most of them require the interactions between lincRNAs and other biological molecules – DNA, RNA, and/or proteins. These interactions are associated with the biological features of lincRNAs. One good example is secondary structure. Properly folded lincRNAs may serve as scaffolds to orient heterogeneous biological molecules for interactions. 23 Secondary structure is in turn affected by other sequence features such as the length and G+C content of a lincRNA. To be sure, the primary sequence per se may also be important for the functions of lincRNAs. 23 Another biological feature of lincRNAs is proximity to coding genes. This feature is functionally relevant because the transcription of a noncoding RNA could interfere with the transcription of its nearby coding gene. 22 Meanwhile, the breadth of lincRNA gene expression is also functionally informative. Expression breadth is a traditional measurement of functional importance. Widely expressed genes are regarded as biologically more important, and evolve more slowly than narrowly expressed genes.24,25 To investigate the functional differences between SELs and MELs, it is necessary to compare these two groups of lincRNAs in view of these biological features.
In this study, we systematically compared Encyclopedia of DNA Elements (ENCODE)-annotated human SELs and MELs. Our results suggest that SELs and MELs diverge from each other in primary sequence conservation, exon/transcript length, proximity to coding genes, and expression breadth. These results imply that SELs and MELs represent two functionally distinct gene groups. Furthermore, SELs were found to be resistant to RNA splicing, and primate-conserved ASEs evolved more slowly than CSEs. We thus suggest that splicing plays an unknown yet influential role in the functions of lincRNA.
Materials and Methods
LincRNA Sequences and Identification of Homologous Sequences
A total of 13,249 human lincRNA genes, which included 22,531 transcripts and 71,864 exons, were retrieved from the ENCODE data portal at the University of California, Santa Cruz (UCSC) Genome Browser (https://genome.ucsc.edu/ENCODE). The retrieved sequences corresponded to the ENSEMBL Version 70 annotations. The human lincRNAs that were shorter than 80 nts were excluded to minimize the variations of the subsequent calculations of genetic distance. The potential nonhuman homologous exonic sequences were identified with reference to the human–nonhuman pairwise genome alignments from the UCSC Genome Browser. In cases where one human exon was aligned to multiple nonhuman genomic regions, only the exons that were in the correct exonic synteny were retained. The nonhuman sequences that were of low quality (ie, those contained “Ns”) or located in uncertain chromosomal locations (eg, ChrUn) were discarded. The exonic sequences that overlapped with either human or nonhuman coding genic regions were also removed. The nonhuman homologous sequences were examined for the presence of canonical exon boundaries (GU/AG or GC/AG), which were required to be located not farther than 10% of the human exon length from the human exon boundary according to the pairwise alignment. The homologous sequences lacking one or both exon boundaries were discarded. The exons of multi-exon genes were classified into CSEs and ASEs in cases of multiple transcript isoforms. CSEs were the exons that were always present in all the transcript isoforms of a gene. ASEs were those that were included in only some of the transcript isoforms of a gene.
We started our analysis with 13,249 ENSEMBL-annotated human lincRNA genes (Version 70), which corresponded to 22,531 transcripts and 71,864 exons. We then examined whether the orthologous exonic regions and the corresponding exon boundaries could be found separately in the genome of orangutan or rhesus macaque. We did not analyze the lincRNAs conserved in mouse because the number of human–mouse conserved lincRNAs was fairly small (only tens of genes), and the alignments were of low quality. If the ortholog of a human exon was absent in the compared genome, the corresponding transcript was discarded. (From the human–orangutan (H-O) orthologs, we removed the lincRNAs whose orthologs were present in the rhesus macaque to retain only hominoid-specific lincRNAs. The two datasets (H-O and human–macaque or “H-Ma”) therefore represent the lincRNAs that were hominoid-specific and primate-conserved, respectively. The H-O and H-Ma datasets included 1,664 and 4,332 genes, which in turn included 2,235 (2,025 multi-exonic and 210 single-exonic) and 6,198 (5,763 multi-exonic and 435 single-exonic) transcripts, respectively (Table 1). Similar procedures were applied to the mouse–rat comparison, yielding 1,069 MELs (including 2,418 CSEs and 821 ASEs) and 21 SELs.
The H-O and H-Ma conserved lincRNAs identified in this study.
Not Applicable.
Estimates of Sequence Conservation
The H-O and H-Ma genetic distances of orthologous exonic sequences were calculated by using the baseml module of PAML4 (with the HKY 85 substitution model) on the basis of the UCSC pairwise sequence alignments subject to the filters mentioned in the last section. We also retrieved 300-bp intergenic sequences 5 kb upstream and downstream of the studied lincRNAs, and calculated their H-O/H-Ma distances for comparison. The genetic distances of pure introns (the sequences that were annotated as introns in all of the transcript isoforms of a lincRNA according to the ENSEMBL annotations) were also computed. The lengths of the pure introns were also limited to 300 bp, which was close to the average exon length of 345 bp in the H-Ma dataset. The genetic distances of these non-exonic regions served as the “neutral reference.”
Estimates of the Distance between LincRNAs and Coding Genes/Enhancers
The nearest coding gene (or enhancer) from a lincRNA was identified with reference to the ENSEMBL human gene annotations. The distance between a lincRNA and a coding gene was defined as the smallest number of nucleotides among the four possible combinations of gene boundaries (5′ coding–3′ lincRNA, 5′ coding–5′ lincRNA, 3′ coding–3′ lincRNAs, and 3′ coding–5′ lincRNA). Similarly, the distance between a lincRNA and an enhancer was defined as the smallest number of nucleotides among the four possible combinations of lincRNA gene boundaries and the terminals of the enhancer of interest.
Exonic Expression Level and Breadth
The RNA-sequencing (RNA-seq) data of 16 human tissues (E-MTAB-513) were retrieved from the ArrayExpress website at https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-513/. In this dataset, each tissue was retrieved from a human individual. The transcriptome of each tissue was subject to both single-end and paired-end RNA-seq. The short reads were mapped to the human genome using STAR, 26 and the fragments per kilobase transcript per million mapped reads for the studied lincRNAs were derived using Cufflinks. 27 We then took the average of single- and paired-end results for each tissue as the expression level of a lincRNA. The expression breadth of a lincRNA was defined as the proportion of the 16 tissues where the lincRNA was expressed.
Prediction of the Secondary Structure of LincRNAs
The secondary structure of each lincRNA transcript was predicted by using RNAfold.
28
To evaluate the influences of transitional mutations on the folding energy of lincRNAs, each individual nucleotide of a lincRNA transcript was subject to transitional mutation
Prediction of Splicing Sites
The human lincRNA transcript sequences were submitted to the SplicePort web server 29 for identification of putative donor and acceptor sites with default parameters. The numbers of putative donor/acceptor sites were then normalized by the length of the transcript.
Results
Differential Sequence Evolution between SELS and MELS in Primates
We first examined whether the primary sequences of SELs and MELs were subject to different levels of selective constraint. A recent report indicated that mammal-conserved lincRNAs tended to be SELs. 18 We were thus interested to know whether SELs were more conserved than MELs for evolutionarily recent (H-O and H-Ma) lincRNAs. We also classified the exons of MELs into CSEs and ASEs to examine whether splicing was associated with the evolution of lincRNAs. For H-O lincRNAs, unexpectedly, CSEs and ASEs had slightly larger genetic distances than pure introns and intergenic regions. Meanwhile, the H-O genetic distances of SELs were approximately the same as those of the non-exonic regions (Fig. 1A). This observation seemed to suggest that SELs were selectively neutral, whereas the evolution of CSEs and ASEs was slightly accelerated.

The genetic distances in lincRNA exons between (A) human and orangutan; and (B) human and rhesus macaque.
In comparison, for H-Ma lincRNAs, ASEs and SELs were slightly more conserved than pure introns and intergenic regions, but CSEs were similar to the non-exonic regions in this regard (Fig. 1B). Of note, for H-O lincRNAs, SELs were more conserved than both CSEs and ASEs, although not more conserved than non-exonic regions. Yet for H-Ma lincRNAs, SELs were approximately as conserved as ASEs, and both of the exon groups were more conserved than CSEs and non-exonic regions. The results based on the H-Ma dataset appeared to suggest that ASEs and SELs were subject to weak selective constraint, while CSEs were selectively neutral. These results were different from what were derived from the H-O dataset, implying that the exons of hominoid-specific and primate-conserved lincRNAs had been evolving along different paths since the emergence of the hominoid lineage. Despite the differences between the results of H-O and H-Ma lincRNAs, one common theme was that SELs were more conserved than CSEs but not than ASEs (Fig. 1). The difference between Figures 1A and B appears to result mainly from the relative level of sequence conservation between lincRNAs and the reference neutral sequences. In Figure 1B (as compared with Fig. 1A), the genetic distances of lincRNA exons (ASEs, CSEs, and SELs) seem to have shifted downwards relative to those of intronic and intergenic regions. The possible reasons for this difference will be discussed later.
To examine whether the differences between SELs and MELs also apply to other mammalian species, we calculated the corresponding genetic distances between mouse and rat. As shown in Supplementary Figure 1, in the mouse–rat comparison, all of ASEs, CSEs, and SELs were more conserved than the neutral regions (introns and intergenic regions). ASEs were slightly more conserved than CSEs. Meanwhile, although SELs appeared to be slightly more conserved than ASEs and CSEs, the differences were statistically insignificant, probably because of the small sample size (there were only 21 SELs in this analysis). Overall, the results derived from the mouse–rat comparison (Supplementary Fig. 1) were similar to those obtained in the H-Ma comparison (Fig. 1B).
We also calculated the percentage of canonical splice sites (GT/GC-AG) that were conserved between the compared species for lincRNA exons. Since exon boundaries might have been shifted during evolution, we searched a sequence space of 10% of the exon length upstream and downstream from each exon boundary for potential homologous splice sites in the non-human genome with reference to pairwise alignments. Our results indicated that the homologous splice sites of nearly all (99.94% in H-O and 99.94% in H-Ma) of the human exon boundaries could be found in the search space in the compared genomes. Therefore, the exon boundaries of lincRNAs were highly conserved across primates in view of genomic sequence. This result was consistent with what had been previously reported. 18
SEL Transcripts were Significantly Longer and More Resistant to Splicing than MEL Transcripts
Next, we compared the lengths of CSEs, ASEs, and SELs. As shown in Figure 2A, SELs were seven to eight times and five to six times longer than MEL exons in the H-O and H-Ma dataset, respectively. In terms of “transcript length,” SELs were about twice (1.9–2.3) longer than MELs in both of the datasets (Fig. 2B). In other words, one single exon of an SEL was in general twice longer than the combined length of multiple exons of an MEL.

The Violin Plot of (A) exon length; and (B) transcript length of human lincRNAs.
We then asked why SELs were significantly longer than the MEL transcripts. One possibility was that SEL transcripts contained fewer putative splice sites than the primary transcripts of MELs. To examine this possibility, we used an integrated splice site prediction tool, SplicePort, 29 to identify putative donor and acceptor sites in the primary transcripts (exonic plus intronic regions) of MELs and SELs. Indeed, the length-normalized numbers of putative donor and acceptor sites were significantly smaller in SELs in both of the H-O and H-Ma datasets (Fig. 3). Even if we considered the absolute numbers of predicted splicing sites (ie, regardless of transcript length), SELs still had fewer donor and acceptor sites than MELs (Supplementary Fig. 2). This difference was remarkable considering that SELs were approximately twice longer than MELs (Fig. 2). Furthermore, this difference did not result from the difference in G+C content, for the G+C contents were very similar between SELs and MELs (median G+C%: H-O MEL – 44.0%; H-O SEL – 43.5%; H-Ma MEL – 43.0%; H-Ma SEL – 43.5%). This observation seemed to suggest that splicing sites were disfavored by selection in SELs as compared with in MELs.

The length normalized numbers of (A) putative acceptor sites; and (B) putative donor sites as predicted by the SplicePort program. Note: Statistical significance: ***
SELs were Physically Closer to the Nearest Coding Genes than MELs
LincRNAs have been found to be capable of affecting the transcription of their neighboring coding genes via transcriptional interference.30–33 The distance of a lincRNA from its neighboring coding gene therefore is an important feature. Interestingly, as shown in Figure 4A, SELs were significantly closer to coding genes than MELs. The median distance between an SEL and its closest coding gene was 3,503 bp for the H-O dataset, and 4,213 bp for the H-Ma dataset. In comparison, the corresponding distances for MELs were 10,772 bp and 16,339 bp, respectively. Of note, the SELs and MELs in the H-O dataset were closer to coding genes than their H-Ma counterparts. It appeared that more recent lincRNAs tended to be located in the neighborhood of coding genes. We thus conducted Spearman's correlation analyses between H-O (or H-Ma) genetic distance and the physical distance to nearest coding gene separately for SELs and MELs. As shown in Supplementary Figure 3, the correlation was statistically significant only for MELs in the H-O data-set. However, the effect size was fairly small (rho = 0.085). The distance from a lincRNA gene to the nearest coding gene thus did not seem to be significantly correlated with evolutionary age in general.

(A) The physical distances between lincRNAS and the closest coding genes. (B) the physical distances between linCRNAS and the closest enhancers.
We also examined whether SELs and MELs differed from each other in their distance to the nearest enhancer. Interestingly, SELs again were significantly closer to enhancers than MELs in both of the H-O and the H-Ma datasets (Fig. 4B). This observation further supports our hypothesis that SELs were functionally different from MELs.
SELs were More Widely Expressed than MELs
One indicator of lincRNA functionality is the breadth of gene expression across different tissues. We thus retrieved RNA-seq data from 16 human tissues, and examined the expression breadths of SELs and MELs separately. Interestingly as shown in Figure 5, SELs were more widely expressed than MELs in both of the datasets. Of note, hominoid-specific SELs were significantly more widely expressed in human tissues than primate-conserved SELs. Similar comments also apply to MELs. Note that the “hominoid-specific” and “primate-conserved” SELs/MELs were defined according to sequence alignability and the presence of homologous exon boundaries in the examined hominoid or primate genomes. Therefore, the “lineage specificity” of lincRNAs discussed here was supported by “genomic conservation” rather than RNA-seq data from multiple species. This is considered as a relaxed criterion because conservation of genomic sequences does not guarantee conservation of expression patterns. Interestingly, nevertheless, it has been recently reported that the tissue specificity of lincRNA expression could be conserved in mammals even without conserved exon boundaries.34,35 In brief, the results presented in Figure 5 should be interpreted as “the level of genomic sequence conservation of a lincRNA is associated with its expression breadth in human tissues.”

The proportions of human tissues (out of 16) where lincRNAs are expressed.
One potential concern of this analysis is that the expression of SELs might have been more likely to be detected by RNA-seq than that of MELs because the former were significantly longer. 36 Since lincRNAs are usually lowly expressed, short transcripts could be undetectable given insufficient sequencing depths. To address this issue, we retrieved SELs and MELs of lengths between 500 bp and 1,000 bp, and examined their expression breadths again. The results remained similar (Supplementary Fig. 4). Therefore, the observation that SELs were more widely expressed than MELs might not have resulted from the difference in length between the two gene groups.
Discussion
In this study, we compared several biological features of hominoid-specific and primate-conserved SELs and MELs. We found these two groups of lincRNAs to differ from each other in primary sequence conservation, exon length, propensity to splicing, distance to the nearest coding gene, and expression breadth. These two groups of lincRNA genes thus appear to have experienced different evolutionary histories and represent distinct biological subtypes.
The primary sequences of long noncoding RNAs were reported to be weakly selected. 20 For hominoid-specific lincRNAs, we found no evidence for negative selection on the primary sequences of lincRNA exons (Fig. 1A). Unexpectedly, the CSEs and ASEs of MELs both had slightly larger H-O genetic distances than pure introns and intergenic regions. There are two possible explanations for this observation. First, the mutation rates in CSEs and ASEs could be higher than those in pure introns and intergenic regions. Since transposable elements were suggested to be an important contributor of lincRNA exons, 37 and repetitive elements were known to evolve rapidly, 38 the elevated H-O genetic distances in CSEs and ASEs could have resulted from a larger proportion of repetitive elements in these exonic regions. We thus calculated the proportion of exonic/non-exonic sequences that overlapped with repetitive elements. As shown in Supplementary Figure 5, CSEs and ASEs actually tended not to overlap with repetitive elements. Therefore, the content of repetitive elements cannot explain the higher H-O genetic distances in ASEs and CSEs. Other genomic features such as recombination rate and mutation hotspot may not account for the elevated genetic distances in CSEs and ASEs, for the non-exonic regions (pure introns and intergenic regions) were retrieved from the genomic regions very close to these exons (see Materials and Methods). Second, CSEs and ASEs of MELs might have been positively selected in either the human or orangutan lineage. To explore this possibility, we extracted 1000 Genome-based single-nucleotide polymorphism (SNP) information for three representative populations (YRI, CEU, and JPT) from dbPSHP 39 as of June, 2014. If CSEs and ASEs in the H-O dataset have been subject to positive selection, the derived allele frequency should be higher in these two exonic regions than in the other regions. However, less than 0.2% of the studied exonic/non-exonic regions were found to contain SNPs. Therefore, we found no clear evidence for positive selection on the ASEs and CSEs of MELs in the H-O dataset. The reason for the larger H-O genetic distances in CSEs and ASEs remains unclear.
In comparison, for primate-conserved lincRNAs, the genetic distances in SELs and ASEs (but not CSEs) were slightly yet significantly smaller than those in non-exonic regions (Fig. 1B). This observation implied that the primary sequences of SELs and ASEs were weakly constrained, whereas CSEs were not. Of note, the rapid turnover of non-coding RNAs 17 suggested that the majority of the nucleotides in the exons were selectively neutral. The statistically significant deviations from neutrality of the sequences of SELs and ASEs observed in this study are thus particularly meaningful. The ~4% deviation (the %difference between the 0.069 H-Ma genetic distance in pure introns and 0.066 distance in SELs/ASEs) indicated that approximately 4% of the nucleotides in SELs and ASEs were selectively constrained. Of note, this selective constraint was not observed in the H-O dataset (Fig. 1A), which implied that hominoid–specific lincRNAs had strayed into a different evolutionary path from their more ancestral counterparts (primate-conserved lincRNAs). In view of the differences between Figures 1A and B, the hominoid–primate divergence in sequence evolution seemed to have occurred in both SELs and MELs.
Interestingly, in coding sequences, ASEs were reported to evolve more slowly at the RNA level but more rapidly at the protein level than CSEs.40,41 We speculate that RNA splicing plays an important role in both coding and primate-conserved lincRNA genes, therefore constraining sequence evolution at splicing signals in both gene types. And this constraint in turn has led to slightly reduced evolutionary rates in ASEs. However, this may not be true for hominoid-specific lincRNA genes. It is worth noting that the exon boundaries of lincRNA genes were reported to turnover rapidly, which suggested that exact splicing sites might be unimportant for the functions of lincRNAs.
18
However, we discovered that although the exon boundaries of human lincRNAs might not be found at the exact orthologous positions in the compared genomes, “backup” boundaries were nearby (within 10% of the exon length) in >99% of the cases. Furthermore, splicing patterns also change rapidly in coding genes,11,42 but this regulatory mechanism remains biologically important,
1
and is involved in critical processes such as functional pleiotropy
43
and adaptation.
10
In lieu of the selective constraint on ASEs (particularly in primate-conserved lincRNAs) and the biological differences between SELs and MELs, it is likely that splicing
It was previously suggested that the primary sequences of lincRNAs evolved rapidly because the functionally important secondary structures of lincRNAs could be preserved even if the primary sequences had changed.
17
We thus conducted a simulation study by
Conclusions
In this study, we provided evidence that SELs and MELs represented two biologically distinct gene groups. They differed in primary sequence evolution, exon/transcript length, proximity to nearest coding gene, and expression breadth. The differences in regulatory mechanism between the two gene types thus are worth further explorations. Notably, ASEs were found to be slightly more conserved than CSEs in primate-conserved lincRNAs. Furthermore, splicing appeared to be disfavored by selection in SELs as compared with in MELs. These results suggest that splicing might be relevant to the functionality of lincRNAs. The exact roles of splicing in lincRNA functions await future investigations.
Author Contributions
FCC conceived of and designed the study. CLP and HYL retrieved and analyzed the data. FCC interpreted the results and drafted the manuscript. All authors have read and approved the manuscript.
