Abstract
Keywords
Introduction
High-throughput sequencing technologies are being widely applied in biomedical research. Since the initial application, it has expedited tremendous advances in the characterization and quantification of genomes, epigenomes, and transcriptomes over the last few years. Next-generation sequencing (NGS) technology is free from many of the confines dictated by previous technologies, such as the bias due to the probe selection in array technology, cross-hybridization background, and signal saturation-induced detection dynamic range limitation.1,2 Moreover, this high-throughput technology produces large and complex datasets at single nucleotide resolution, and the cost is continuously dropping so that it offers the possibility of investigating the molecular biology genome widely in a far more precise and comprehensive manner as has been previously achieved.
RNA-seq is the set of experimental procedure that generates cDNA sequences derived from the entire RNA molecules, followed by library construction and massively parallel deep sequencing. Gene expression is known to be time-, cell-type-, and stimulus-dependent, and many loci are only expressed under very specific conditions. In fact, the genome-sequencing project has revealed numerous open reading frames encoding “hypothetical” genes, for which expression patterns are not established yet.3,4 RNA-seq allows quantifying the abundance level or relative changes of each transcript during defined developmental stages or under specific treatment conditions. Also, RNA-seq allows for analysis of the transcriptome in a rather unbiased way, with single base pair resolution, a tremendous dynamic detection range (>8,000 fold), and low background signals. 5 In contrast to hybridization-based technologies, it is not limited to the interrogation of selected probes on an array and can be also applied in species, for which the whole reference genome is not assembled yet.
RNA-seq is not only a tool for quantitative assessment of RNA but can also be exploratory. Only until recently, it was appreciated that 85% of the human genome can be transcribed, albeit only 3% of the genome encodes protein-coding genes. 6 Thus, RNA-seq has been instrumental to catalog the diversity of novel transcript species including long non-coding RNA, miRNA, siRNA, and other small RNA classes (eg, snRNA and piRNA) involved in regulation of RNA stability, protein translation, or the modulation of chroma-tin states.7,8 For instance, RNA-seq has been used to discover enhancer RNA, a class of short transcript directly transcribed from the enhancer region, which contributes to our knowledge of epigenetic gene regulation.9,10 In addition, RNA-seq can give information about transcriptional start sites, revealing alternative promoter usage, information about mRNA isoforms derived from alternative splicing, and premature transcription termination at the 3’ end, which is critical from mRNA stability.11–15 Most recently, RNA-seq was used to study biological problems including precisely locating regulatory elements.16,17 RNA-seq information can also identify allele-specific expression, disease-associated single nucleotide polymorphisms (SNP), and gene fusions contributing to our understanding about disease causal variants in cancer.18–21 Furthermore, RNA-seq can provide information about the transcription of endogenous retrotransposons and other parasitic repeat elements that may influence the transcription of neighboring genes or may result in somatic mosaicism in the brain. 22 Finally, single-cell RNA-seq analysis has been widely applied to study the cellular heterogeneity and diversity in stem cell biology and neuroscience.23–25
While RNA-seq technology is considered unbiased, it is important to note that the preparation and fragmentation of RNA and the library construction (which includes size selection) can be biased. 5 This bias may be undesired or unfavorable; for example, the use of oligo (dT) primers in the first strand synthesis enriches poly (A) mRNA, which is useful to study expression of most protein coding genes, but misses on canonical histones, 26 some histone variants, and subclasses of non-coding RNA. 27 Strand-specific sequencing retains the orientation of the original RNA transcript, which may be critical to identify antisense or non-coding RNA. 28
The interpretation of the NGS datasets requires sophisticated and powerful computational programs. The RNA-seq data generation is an ever-evolving process, which includes development in sequencing technology, experiment design, and algorithm development. Accompanied with this, computational tools with varying performances are emerging constantly. A wealth of mature tools exists to meet the basic requirements of RNA-seq data analysis, for instance, the quality assessment and reads mapping. Meanwhile, challenges remain that require comprehensive solutions, such as differential gene expression analysis, as well as the detection of fusion genes. Instead of describing each software, we outline in this study the available tools to perform the analysis of data preprocessing, differentially gene expression (DGE), alternative splicing, variants detection and allele-specific expression, pathway analysis, co-expression network analysis and highlight the essential principles of computational methods in the RNA-seq data analysis, describe the challenges associated with the RNA-seq application, and discuss examples that represent the most advances in the transcriptome characterization.
RNA-seq Workflow
An overview of a typical RNA-seq workflow is outlined in Figure 1. Three main sections are presented: the Experimental Biology, the Computational Biology, and the Systems Biology. The experimental part includes the methods’ choice of RNA collection, first strand synthesis, and library construction, resulting in millions of short reads from the NGS sequencer. Multiple platforms (Table 1) have been applied for the RNA-seq including sequencing-by-synthesis approach Illumina GA IIx and HiSeq,29,30 Applied Biosystems SOLiD, 31 Roche 454 Life Science, 32 semi-conductor technology-driven Ion torrent Personal Genome Machine, 33 single molecule real-time PCR machine PacBio, 34 and nanopore technology-driven portable device MinION, and PromethlON (http://allseq.com/blog/minion-and-promethion-oxford-nanopore-s-present-and-future).
Overiew of technical specifications of next generation sequencing platforms. *
Information based on company sources alone, data update to 2013-2014.
Cost only count the cost per run, does not include general purpose and library preparation equipment, annual maintenance agreements and extra sevices.
New compressed binary data format saves base and quality-value data in a 1byte: 1base ratio.

Overview of the typical RNA-seq pipeline. Three main sections are presented: the Experimental Biology, the computational Biology and the systems Biology. the pipeline starts from the experimental preparation and come with the work flow to the sequencing and analysis steps as the arrows point from step to step.
RNA preparation methods may vary for different kinds of sequencing platforms, RNA subtypes, and sequencing purposes. However, sample quality is always the determinant of acquiring qualified data and deriving biological insights from unbiased analysis. Poly-A-selection of sufficient mRNA is well used in a variety of whole-transcriptome analysis including gene expression, alternative splicing, and variations detection.35,36 While for single cell sequencing, molecular labeling, and random sequencing, the labeled molecules on Illumina platform can achieve remarkable mRNA capture efficiency. 36 With more RNA-seq applications in clinical samples, formalin-fixed paraffin-embedded (FFPE) tissue samples became invaluable recourse for transcriptomic studies. Ribo rRNA depletion is the preferred method for archival and long-aged FFPE samples. 37
The raw reads served as starting material of the second part, the computational biology. First, technical and biological contaminations were removed from preprocessing steps, followed by mapping the qualified reads to the genome or transcriptome. The mapped reads for each sample were subsequently indexed into gene-level, exon-level, or transcriptlevel to assess the abundance of each category depending on the experimental purpose. The summarized data were then assessed by statistical models of differentially expression gene list and alternative splicing events, or regulatory mechanisms were evaluated via integration analysis with other datasets such as epigenomic or proteomic data. Finally, pathway or network level analyses were implemented to gain biological insight through the systems biology approaches.
Data Preprocessing
Quality Assessment
Since RNA-seq is a complicated, multiple-step process involving sample preparation, fragmentation, purification, amplification, and sequencing, it is not straightforward to identify and quantify all RNA species from the reads sequenced. Hence, quality assessment is the first step of the bioinformatics pipeline of RNA-seq, and also, it is important as a step before analysis. Often, it is necessary to filter data, removing (trimming) low-quality sequences or bases adaptors, contaminations, or overrepresented sequences to ensure a coherent final result. An array of tools are available for this purpose with reads quality visualized graphically such as FastQC (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc), HTQC, 38 as listed in Table 2. Recently, more flexible and efficient preprocessing tools were developed: Trimmo-matic was developed to remove adapters and scan every read with a 4-base sliding window and trim the lower-scored bases and low-quality N bases to enhance the quality of reads 39 and HTSeq to depict the base calling and evaluate the base quality at position-based way as well as the overall read features. 40
selected list of packages and tools for RNA-seq data analysis.
Before alignment to the reference genome, RNA-seq data can be further preprocessed to meet expectations in the next sequencing mapping steps. There are multiple tools available for this purpose, for example, BBMerge from BBMap package (http://sourceforge.net/projects/bbmap/) merges paired reads based on overlap to create longer reads and creates an insert-size histogram. FLASH 41 combines paired-end reads that overlapped and converts them to single long reads. It is also a good practice to assess the RNA-seq data quality after the preprocessing procedure, and there are packages, for example, RSeQC package, to comprehensively evaluate the reads that will go to analysis. 42
Reads Mapping
Once high-quality data are obtained from preprocessing, the next step is to map the short reads to the reference genome or to assemble them into contigs and align them to the reference genome. This procedure refers to the classic bioinformatics problem of discovering the most reliable original sources of a large scale of short DNA sequences from the genome in a speed- and memory-efficient manner.43,44 There are many popular bioinformatics programs that can be used for this purpose, including ELAND (http://support.illu-mina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_ReferenceFiles.htm), SOAP, 45 SOAP2, 46 MAQ, 47 Bowtie, 48 BWA, 49 ZOOM, 50 STAR, 51 etc. Comparable analyses on real data have been done to assess most mapping tools. 52 These programs are typically suitable for reads that are not located at the poly (A) tails or exon-intron splicing junctions. Poly (A) tails can be easily identified by the presence of multiple As or Ts, and a partial junction library that contains the known junction sequence has been compiled to allow the alignment of difficult mapping reads.23,53 From a different point of view, the reads that locate at the exon-intron boundaries are helps with the determination of the alternative splicing pattern, where advent RNA-seq promotes the development of new generation of slice-alignment software such as BLAT,54,55 TopHat,56,57 GEM, 58 and MapSplice. 59
Another problem in reads mapping is that of polymorphisms, which occur when sequence reads align to multiple locations of the genome. Polymorphisms are especially common for the large and complex transcriptomes. For lower repetitive reads, one can employ the solution of assigning the reads to multiple locations proportionally based on the neighboring unique reads.31,53 However, for the short reads that have a very high copy number and repetitive sequences, polymorphism is still a great challenge. A longer read sequencer such as the Roche 454 or PacBio sequence analyzer might be required. Alternatively, there are bioinformatics solutions to extend the short pair-end reads into 200-500 bp fragments before deciding upon the multiple-aligned reads.60–62
Reads Counting
RNA-seq reads number that map to a gene is the measurement of the gene's expression level. After mapping the reads to the reference genome, counting the reads number that mapped to gene body will facilitate the next steps. Library preparation methods, such as whether the protocol is strand-specific, whether first read is on the same strand or opposite strands, are determinant factors for the counting of reads. One example of tools and packages for read counting from bam file is the multicov command in bedtools that takes a feature file (GFF) and read counts in certain regions, such as all exons of a gene. 63 By default, it counts reads on both strands within interested regions. But it can work in a strand specific manner if necessary. HTseq is a specialized utility for counting reads although speed lifting is necessary in the future. 40 However, it allows us to look for more fine-grained controls on read counting by setting different parameters. This is very useful, especially when a read overlaps more than one gene and we want to use customized strategy. Note that HTseq-counts assume that the RNA-seq data is strand-specific; it will only count those reads that were mapped to the strand that the feature is on. R packages include easyRNASeq, summarizeOverlaps and feature-Counts for reads counting. easyRNASeq hides the complex interplay of the required packages and thus can be easily used. summarizeOverlaps, which is a function in the Genomics-Ranges package in Bioconductor, and featureCounts, which have implemented, highly efficient chromsome hashing and feature blocking methods, are suitable for RNA-seq or genomic DNA sequencing data. Different tools and their different related parameters generate different reads’ numbers, and thus affect downstream analysis because they use different strategies to assign reads to features.
In addition, the gene model that hypothesizes the structure of transcripts produced by a gene also affects the analysis. Among multiple genome annotation databases, RefGene, Ensembl, and the UCSC annotation databases are the most popular ones. The choice of genome annotation directly affects gene expression estimation. Recently, Zhao and Zhang systematically characterized the impact of genome annotation datasets choice on read mapping and transcriptome quantification. 67 They found that the impact of a gene model on mapping of nonjunction reads is different from junction reads. The percentage of correct mapped nonjunction reads was much higher than that of the junction reads for all gene models. Surprisingly, although there are 21,958 common genes among RefGene, Ensembl, and UCSC annotation, only 16.3% of genes obtained identical quantification results. Approximately 28.1% of genes’ expression levels differed by ≥5% when using different annotation, and of those, the relative expression levels for 9.3% of genes differed by >;50%. This study revealed that the different gene definition of gene models frequently result in inconsistency in gene quantification.
Normalization
After getting the read counts, data normalization is one of the most crucial steps of data processing, and this process must be carefully considered, as it is essential to ensure accurate inference of gene expression and subsequent analyses thereof. There are multiple facets of the RNA-seq data to be taken into account including transcript size, GC-content, sequencing depth sequencing error rate, insert size, etc.68,69 Multiple normalization methods should be compared for the specific bias elimination of a dataset, which can be done by comparing their corresponding estimated performance parameters using measurement error models.69,70 Plenty of comparative analysis or integrative analysis concluded the best approach in different types of RNA-seq data analysis. For instance, quantile normalization can improve the mRNA-seq data quality including those from low amounts of RNA.71,72 R package EDASeq using within-lane normalization procedures followed by between-lane normalization can reduce GC-content bias. 73 Lowess normalization and quantile normalization worked well in microRNA-seq data normalization. 74 Further advancement of RNA-seq application calls for the development of effective statistical and computational methods for RNA-seq data normalization.
There are other bioinformatics challenges for the RNA-seq reads mapping, for example, reducing the errors in image analysis and base calling to enhance sequencing accuracy; removing low-quality reads; and the development of applicable approaches to store, retrieve, and process large datasets in a time- and energy-efficient manner.
Differential Gene Expression
The transcriptome is the complete set of transcripts in a cell or cell population, and transcriptome analysis provides information about the identity and quantity of all RNA molecules. An important application of RNA-seq is the comparison of transcriptomes across different developmental stages, across a disease state compared to normal cells, or specific experimental stimuli compared to physiologic conditions. This type of analysis requires identification of genes along with their isoforms and precise assessment of their abundance comparing two or multiple samples. It is essential for interpreting the functional elements of the genome and uncovering the molecular constitution providing important insights in the biological mechanisms of development and diseases.
After the step of preprocessing RNA-seq reads, it is an important question to reveal how the transcripts level differs across samples, known as DGE analysis. Numerous statistical methods have been developed that use read coverage to quantify transcript abundance since the microarray era.75,76 The RPKM (reads per kilo base per million mapped reads) is widely used method to account for expression and normalized read counts with respect to overall mapped read number and gene length.32,53 However, beside the read coverage, there are other factors that determine the estimated transcript abundance including sequencing depth, gene length, and isoforms abundance.72,77,78 Since the RPKM method handles all the RNA-seq reads almost equally, for example, without concern for isoforms, it has been criticized. RNA-Seq by Expectation Maximization (RSEM) is a newly developed software tool, which gives accurate estimates for gene and isoform expression levels and can be used even for species without a reference genome assembly. 79
Most algorithms to date for differential gene expression analysis apply simple count-based probability distributions (eg, Poisson distribution) followed by Fisher's exact test without accounting for biological variability among samples.32,53,80 While the technical variability of RNA-seq is extremely low compared with microarray data, 32 the biological variability could be significantly reduced by analyzing several replicates through a permutation-derived methods. 75 Serial analysis of gene expression has been developed for biological variability assessment, in which larger scale datasets are used so that an additional dispersion parameter can be estimated based on an extended Poisson distribution, allowing extensive molecular characterization capability.81,82
However, for most applications, a large number of replica may be too costly, and many developed methods have overcome the problem by modeling biological variability and measuring the significance with limited number of samples, applying pairwise or multiple group comparisons. 75 Several programs offer well-done solution for this purpose and have been applied in numerous studies for biomedical and clinical research. Examples of these programs are Cuffdiff from the Cufflinks package,7,83,84 DESeq, 31 DESeq2, 85 and EdgeR. 82 Since RNA-seq read counts are integer numbers that range from zero to millions and are highly skewed, many kinds of transformation algorithms have been applied to the counts so that the numbers can be fit to statistic distribution models for differential expression detection. For instance, Li et al developed PoissonSeq, a Poisson log-linear model for differential gene expression assay. 86 Approaches developed for microarray data analysis based on continuous distribution have been improved for RNA-seq counts. Excellent example is the voom function in the limma package, which offers a way to transform count data into Gaussian distributed data so that significance can be tested statistically.87–89 An extensive comparison to evaluate the performances of several DGE packages has been recently reported.90,91 However, to the best of our knowledge, there is no one-size-fits-all strategy. Also, space for refinement of existing pipelines exists to develop effective strategies for the following questions: how to uniform the reads coverage along the genome with the nucleotide composition variation; how to detect the “within-sample” variations without simply assuming that the underlying conditions or treatments affect all individual gene equally; how to improve current methods to detect differences in gene isoform preferences and abundance level in varying conditions; and how to account for the different probability in read coverage in long genes versus short genes since we can gain great sequencing depth nowadays.
Alternative Splicing
The biological complexity and genomic diversity are determined, to a large degree, by the alternative splicing events. 92 Alternative splicing shapes the control of numerous pivotal cellular processes, and abnormal splicing events are involved in 15%-50% of disease-causing mutations in human. 93 Compared with constitutive splicing, alternative splicing refers to the differential inclusion/exclusion of exons in the processed RNA product after splicing of a precursor RNA segment. 94 It is a crucial step in controlling the expression of ~95% of all multiexon genes, and an increasing number of diseases are found to be associated with the “wrong” splice sites usage, while the overall transcript abundance does not change.95,96 Spliceosomes, composited of intricate structures with RNA-RNA, protein-protein, and RNA-protein interactions, carry out the splicing reaction. 97 Splicing mechanism studies on model genes have deduced many regulatory principles including the role of negative intrinsic sites binding and positive enhancement of splicing sites selection in the formation of spliceosome assembly. 94
However, given the variety of cis-acting elements and trans-acting factors involved in splicing, either cooperatively or in a competing manner, the “code” for controlling alternative splicing needs still further deciphering using high-throughput approaches. 98 RNA-seq technology allows us to estimate alternative splicing events on genome-wide scales and in an unbiased manner. Deep surveying of alternative splicing by RNA-seq revealed unprecedented wealth of splice junctions and RNA-binding motifs and provides more reliable measurements compared with microarray technology.99–101 Furthermore, alternative splicing is tissue-specific, with hundreds of context-sensitive RNA features and tissue-dependent splicing regulatory elements, which generate thousands of combinations of alternative splicing events.102,103 In-depth of RNA-sequencing analysis yield a digital inventory of gene and mRNA isoform expression with tissue specificity and high sensitivity of single cells and provides a framework of understanding alternative splicing pattern on genome-wide scales.15,104
With the rapid accumulation of RNA-seq data, many methods and tools have been developed to infer alternative splicing events. These tools generally focus on either gapped alignment of short reads or de novo assembly and characterization of transcript models. Examples of these methods are MISO for identification and regulation of isoforms from CLIP-seq data and 105 SpliceMap, 106 SplitSeek, 107 spliceR, 108 and SplicingCompass 109 for detection of splice junctions and exon usage from pair-end RNA-seq. GLiMMPS provides a useful tool for elucidating the genetic variation of alternative splicing in humans and model organisms. 110 MATS is developed from a statistical method and used for detecting differential alternative splicing events from RNA-seq data. 111 rMATS is a statistical method for robust and flexible detection of genome-wide differential alternative splicing from paired or unpaired replicates. 112 ALEXA-Seq assesses the differential and alternative expression of the mRNA iso-forms after cataloging transcripts. 12 An integrative analysis approach constructed an exon co-splicing network based on distances combined with matrix correlations and found that the co-splicing network was distinct and complementary to the co-expression network, although they both possess scale-free properties.113,114
The field of alternative splicing analysis using RNA-seq data is still in its infancy and would benefit from new strategies. An extensive evaluation and comparison of the existing methods would be desirable, and to date, there is no general consensus regarding which method performs best under given conditions. We are expecting to see the novel, exploring methods to be developed in this flourishing field in the near future.
Variants detection and Allele-specific Expression
The main applications of RNA-seq analysis are novel gene identification, expression, and splicing analysis. However, RNA-seq data is also a useful by-product of sequence-based mutation analysis, though there are many limitations, such as highly differential coverage between different genes. Among many variants calling and annotation methods such as ANNOVAR, 115 SNPiR, 116 and SNiPlay3, 117 the best practical workflow provided by GATK may be still the best pipeline to identify mutations from RNA-seq data, although it is still far from perfect and under heavy development (http://gatk-forums.broadinstitute.org/discussion/3891/calling-variants-in-rnaseq). In GATK pipeline, the sequence reads are first mapped to the reference using STAR aligner (2-pass protocol) to produce a file in SAM/BAM format sorted by a coordinate. After marking and removing duplicates, GATK splits reads with N operators in the CIGAR strings into component reads and trims to remove any overhangs into splice junctions to reduce the occurrence of artifacts. The remaining steps are similar to DNA-seq variants calling, such as local alignment and haplotype variant call.
Heterozygous SNP, which means two different alleles in the same position in the DNA, may lead to the following: one of two alleles is highly transcribed into mRNA and another is lowly transcribed or even not transcribed at all. This is called as allele-specific expression (ASE). Both genetic and epigenetic determinants govern transcriptional activity at the different alleles of a gene in a non-haploid genome, and impairment of this highly regulated process can lead to disease.118,119 Whole genome DNA sequencing (WGS) allows identification of single nucleotide mutations or polymorphisms in the entire human genome. The expression state of the heterozygous loci can be investigated in the matched RNA-Seq and WGS sample from the same individual, and ASE activity can be identified to uncover the instances of allele silencing. 120 Though conceptual simple, there is still a challenge to identify ASE due to many problems, such as reads bias and lack of sophisticated statistical model. 121 Recently, Mayba et al developed a pipeline, MBASED to ASE detection, through aggregating information across multiple single nucleotide variation loci to obtain a gene-level ASE. 122 More sophisticated softwares are needed for ASE identification.
Beyond the Differentially Expressed Gene Lists
Creating lists of the differentially expressed genes is only the starting point of gaining biological insights into experimental systems, developmental stages, or specific disease scenarios. To understand the biologic context of differentially expressed genes, many advanced analyses have been working on gene ontology,123,124 gene sets, 125 network inference, and knowledge databases.126,127
Pathway Analysis
The interpretation of gene expression data is based on the function of individual genes as well as their role in pathways since genes work connectively in all biological processes. In addition, for some genes, a small expression change may be not significant at a single gene level, but minor changes of several genes may be relevant in a pathway and may have dramatic biological consequences. Thus, differentially expressed biological pathways provide better explanatory results than a long list of seemingly unrelated genes. 128
One traditional analysis works with a gene list of interest, identified with genomics methods or curated by biologists, and applies statistical methods, such as the Fisher Exact Test, on contingency tables to test for enrichment of each annotated gene set. 129 Such approaches can be applied to the differentially expressed gene list identified with RNA-seq data directly. Another class of analysis ranks all expressed genes according to metrics of expression difference and then uses Kolmogorov-Smirnov like tests to obtain enrichment significance. Gene set enrichment analysis (GSEA) is one such highly effective method that has been widely used in studying functional enrichment between two biological groups. 130
Many studies have adapted pathway analysis tools from microarray data analysis and developed new tools applicable to RNA-seq data. For example, a non-parametric competitive GSA approach named Gene Set Variation Analysis has been developed to fit RNA-seq data characteristics. Such analyses have given highly correlated results between microarrays and RNA-Seq sample sets of lymphoblastoids cell lines that have been profiled using both technologies. 131 SeqGSEA uses count data modeling with negative binomial distributions to score differential expression and then executes gene set enrichment analysis to achieve biological insights. In real applications, SeqGSEA detects more biologically meaningful gene sets without biases toward longer or more highly expressed genes. 132 GAGE is another method for pathway analysis that is applicable to both microarray and RNA-seq data. It is unaffected by sample sizes, experimental designs, assay platforms, or other types of heterogeneity. 133 GSAASeqSP offers a variety of statistical procedures by adapting and combining multiple gene-level and gene set-level statistics for RNA-seq count-based data. Such statistics include Weighted_KS, L2Norm, Mean, WeightedSigRatio, SigRatio, Geometric-Mean, TruncatedProduct, FisherMethod, MinP, and Rank-Sum. 134 GSAASeqSP is a powerful platform for investigating molecular differential activity within biological pathways.
The limitations of the gene set analysis methods developed for microarrays in the context of RNA-seq data have been comprehensively investigated. 128 Several frequently used RNA-seq normalization strategies were studied to examine the performance of multivariate tests. Data transformations were also investigated in an attempt to extend other approaches beyond microarray data analysis. It was found that the use of log counts when normalized for sequence depth is a good strategy for data transformation prior pathway analysis.
Previously, pathway analysis methods had been developed based on algorithms considering pathways as simple gene lists and ignoring pathway structure. Recently, methods have been developed that incorporate various aspects of pathway topology. For example, SPIA captures pathway topology through its scoring system, in which the positions and the interactions of the genes in the pathway are considered. 135 Accordingly, interacting differentially expressed gene pairs are preferentially weighted over two non-interacting genes. Similarly, TAPPA is a scoring method in which higher weights are automatically assigned to hub genes and interacting gene pairs. 136 DEAP identifies the most differentially expressed path to provide a refined focus for further biological exploration. 137 Accordingly, biological pathways are represented by directed graphs, where nodes are biological compounds and the edges represent catalytic or inhibitory regulatory.
Applying methods developed for microarray data analysis without considering specific data features of RNA-seq data may lead to biases. For example, long or highly expressed transcripts are more likely to be detected as differentially expressed than are the short and/or lowly expressed ones. By developing new statistical framework, the new problem of gene length bias and total reads number bias from RNA-seq could be well corrected. One good example is the GOseq package for gene ontology analysis. It considered the read counts bias by estimating the probability weighting function and used resampling strategy beyond the differentially expressed gene expression so that it can highlight GO categories more consistent with the known biology. 138 Development of good methods to correct the biases in pathway analysis brought by GC content, dinucleotide distribution, and other factors is challenging. 139
Although many pathway databases are available, highresolution annotation of such knowledge bases is still lacking. For example, >90% of the human genome is alternatively spliced and transcripts from the same gene may have distinct, even opposing functions. However, current knowledge bases only are curated at gene level. It is essential to also include knowledge about pathway-specific transcript activity. In addition, high-quality annotations for genes are still needed, although there are enormous numbers of annotations available in the public domain. 140 We expect to see more sophisticated data mining and machine learning algorithms applicable to RNA-seq data, especially those methods considering the gene in the context of its pathway.
Co-expression network analysis
Co-expression network analysis is an important complement to DGE analysis. A gene co-expression network is represented as an undirected graph, in which each node corresponds to a gene, and two nodes are linked if there is a significant co-expression relationship between them. Because co-expressed genes are often functionally related, controlled by the same set of transcriptional factors, or work together within same pathway, building co-expression networks can help to extract meaningful biological modules that are tightly associated within a specific biological process. 141
The co-expression network has been extensively studied since microarray era and such data have been examined using RNA-seq data with the emergence of NGS technology. Comparison studies between RNA-seq co-expression networks and microarray data-derived networks revealed that correlations from RNA-seq data are much higher due to the reason that RNA-seq data is of greater sensitivity and larger dynamic range. Although both co-expression networks show scale-free properties, there is low overlap between hub-like genes. This phenomenon can be explained by low correlation between microarray and RNA-seq data, especially for high- and low-transcript abundances. 142
Both sample size and reads’ depth affect the quality of RNA-seq-derived co-expression networks. 143 Larger sample sizes and greater read depth can increase the functional connectivity of the networks. The minimal suggested experimental criteria to obtain performance on par with microarrays are at least 20 samples with total number of reads greater than 10 million per sample. Meta-analysis across multiple data sets is a good solution to increase the relatively poor performance of individual co-expression networks. Aggregation across different experiments can improve performance significantly beyond that attained by even the largest individual co-expression networks in one experiment. However, thousands of samples from different conditions are necessary to obtain the “gold standard” co-expression networks.
The high quality of co-expression network by large meta-analysis promises the power of a functional genomics tool to biologists and clinicians. GeneFriends project team has constructed co-expression maps for human and mouse with RNA-seq datasets of 4,000 and 2,500 samples from different experiments, respectively. 144 This information can be used statistically, such as using a guilt by association approach to predict gene function, identifying and prioritizing novel candidate genes involved in biological processes. COXPRESdb is another database of RNAseq-based gene co-expression networks. 145 The co-expressed gene list in COXPRESdb provides a comparable view of orthologous genes among several species (human, mouse, rat, chicken, fly, zebra fish, nematode, monkey, dog, and yeast) and the numbers of common edges for all pairs of species.
Besides building gene co-expression networks under defined conditions, finding co-expression modules in one condition and then testing if these modules show different co-expression in other conditions can assist in understanding the regulatory change under disease conditions. Gene set co-expression analysis was proposed to test differential co-expression of known pathways through testing the changes in co-expression over all gene pairs in the pathway. 146 Based on theoretical analysis, a small highly co-expressed subnetwork was found to be a good indicator of disease onset or other biological process. This finding has been validated with real data and confirmed that this small set of genes clustered within a strongly correlated subnetwork is able to provide the significant warning signal just before onset of disease. 147 The current approach of building dynamic network biomarker is based on population data. It might be interesting to build a co-expression of network with time series data from same subject with self-correlation or synchronization,148,149 such that we can use it to predict disease onset for diagnosis and personalized medicine.
Interestingly, in biological systems, antagonistic and self-reinforcing co-expressed modules have been found in system stability and adaptability. 150 Algorithm have been designed to model this phenomenon, for instance, DICER, in which the expression profiles of genes within each module of the pair are correlated across all samples and the correlation between the two modules differ dramatically between the disease and normal samples. 151 Weighted gene co-expression network analysis is a powerful method to extract co-expressed groups of genes from large microarray data sets and has been successfully applied to RNA-seq data. It is suggested to remove genes whose read counts are consistently low and normalize the data with a variance-stabilizing transformation before calculating pairwise similarity of expression pattern. It can perform various aspects of weighted correction of co-expression network analysis including network construction, module detection, gene selection, calculations of topological properties, data simulation, visualization, and interfacing with external software. 152
As more and more RNA-seq data become publically available, there is a great need to develop new algorithms to formulate both the global and local characteristics of co-expression networks, especially those dynamic changes associated with biological processes. Much work still remains for the development of RNA-seq co-expression methodologies. So far, there have been few published statistical studies that have examined metrics for similarity of expression profiles with RNA-seq data. Si et al designed an algorithm to cluster genes by measuring the differential expression patterns across treatments using model-based statistical methods according to either Poisson or NB models for RNA-seq data, using the mean expression level as reference, bypassing treating RNA-seq data directly. 153 The co-expression networks built with different expression measurements, such as those using raw counts, RPKM, or variance-stabilizing transformation have low overlap. Therefore, the development of new metrics for co-expression network establishment is urgently needed. 142
Centrality and network flow have been successful for the identification of important genes and modules from co-expression networks. However, we lack a good way to formulate the network structure. Some network metric, such as percolation, is too simple to grasp network characteristics efficiently due to the dynamic nature of the biological processes. The lack of a model that represents the dynamic change of co-expression network at different time points limits our ability to observe biological system changes at a network level. 150
Systems Biology
High-throughput sequencing technologies are now routinely being applied to a wide range of topics in biology and medicine, allowing scientists to address important questions and reveal difficult discoveries that were impossible before. Advances in genome sequencing and data analysis are of critical roles, while the procedure for how to prepare samples selectively and how to generate qualified data requires sophisticated experimental design, which is essential part of systems biology (Fig. 2).

The STARR-seq pipeline and the corresponding ‘systems biology’ steps. The sonicated genomic DNA are PCR amplified and placed downstream of a minimal promoter in reporter vectors. The desired measurement are embedded in the genome. The reporter library is transfected into the cultured cell lines and Poly-A RNAs are isolated from the pool of total RNA. These steps are selectively to enrich the targets interested. After RNA-seq is performed, the reads are mapped to the reference genome and their enrichment over input are measured to reflect enhancer activity. The steps of systems biology including mathematics and computational biology analysis will help with the interpretation.
Concerning gene expression analysis, the integration datasets from diverse platforms in this Next-Generation Genomics era, including genomics, epigenomics, and proteomics with transcriptomics, is critical in the effort to understand complex biological systems. A wide scope of integrating analysis projects were well defined for a more complete picture of gene regulation such as the Roadmap Epigenomics Project, the ENCODE Project, and The Cancer Genome Atlas. 154 RNA-seq has been used in combination with transcription factor (TF) binding,155,156 histone modification,157,158 DNA methylation,159,160 genotyping data,161,162 and RNA interference. 163 In this study, we summarize two excellent examples to illustrate RNA-seq application in the frame of systems biology.
STARR-seq: Whole Genome Functional Readout of Enhancers
Enhancers are functional non-coding DNA sequences that can recruit TFs, physically interact with promoters, and regulate the timing and tissue specificity of gene expression.164–169 Despite their important roles during development, in response to stimuli and various diseases, a genomewide approach to identify functional enhancer regions is still lacking. Current high-throughput enhancer detection methods can be grouped into three categories: (1) identification of open chromatins, including deep sequencing of DNase hypersensitive sites (DHS-seq) 170 and formaldehyde-assisted isolation of regulatory elements sequencing (FAIRE-seq) 171 ; (2) chromatin immunoprecipitation followed by deep sequencing (ChIP-seq) on enhancer-associated histone modifications (H3K4me1, H2K27ac, H3K18ac, etc.)172–175; and (3) ChIP-seq on TFs or cofactors (p300, CEBPB, etc.). 176 However, the mapping of open chromatins and histone modifications usually lacks sufficient resolution and specificity to detect precise enhancer locations, and the binding of some specific TFs or cofactors can hardly cover all the active enhancers. Moreover, none of these methods can provide a quantitative measurement of enhancers’ activities. The traditional quantitative reporter assays, on the other hand, cannot be scaled up to a high-throughput genome-wide manner.177,178
To address this question, Arnold and colleagues developed a method, named self-transcribing active regulatory region sequencing (STARR-seq),
17
which quantitatively measures the activity of enhancers in the whole genome. They sheared
Computational analyses include mapping the STARR-seq data to the genome and examining their enrichment over input. From this, the authors identified 5,499 enhancers in Drosophila S2 cells and validated 77 in addition to 65 negative controls by luciferase assays. As a result, 81% of the predicted enhancers and 14% of negative controls showed enhancer activity. There was a strong linear correlation (
STARR-seq is a high-throughput application of the traditional enhancer reporter assay that directly and quantitatively assesses enhancers’ activities in a genome-wide manner. It complements existing enhancer detection methods based mainly on chromatin features. One of the limitations of STARR-seq, as the authors pointed out, is that it only assesses the potential enhancer ability of DNA sequences irrespective of the endogenous genomic context, such as DNA accessibility and histone modification.
Structural Genome (re)Annotation
The task of defining the complete set of transcripts is complicated because of the fact that transcriptomes are of high dynamic entities, which change in response to both of the intracellular signals and extracellular environment. In addition, expression level, allele expression, and alternative splicing events are involved in increasing the complexity of transcriptome defining with regard to the development stages, growth condition, or disease status.
Genomic studies including gene expression by microarray and chromatin feature assays by tiling array are based on genome annotations. However, the genome annotation is continuously being updated and even the current annotation is incomplete indicating that the previous studies might have missed important information or they are not precise enough to uncover the biological insight. Accumulating studies using RNA-seq to reveal the genome and transcriptome annotation structurally have been generating a more complete and more precise map to facilitate our understanding of the gene transcription. We highlight in this study examples that finely annotated transcriptional landscapes in a major invasive fungal pathogen with combined elegant experiment design and RNA-seq following comprehensive data analysis.
Outlook/Perspective
In this review, we have outlined major applications of the RNA-seq in biomedical research, highlighted the computational approach in data preprocessing, differential gene expression, alternative splicing, pathway analysis, and co-expression network, and presented examples to show how this technology can be applied in systems biology field to advance our understanding in genomic level. Since it is potent in investigating the transcriptome in a highly quantitative manner at single nucleotide resolution, complex disease diagnosis, and precision medicine, the rapidly accumulating genome sequence data allow researchers to address fundamental biological questions that were not even asked just a few years ago. Although many progresses have been made since the initial application of this technology, there are still more applications possible if further refinement is provided for each of the topics.
Single RNA-seq
RNA-seq in single cells has provided a new powerful approach to study complex biological processes, for instance, promoting advances in cancer studies starting from qualitative microscopic images to quantitative genomic datasets in recent year. 185 Single-cell genome and exome sequencing fueled the investigation of fundamental questions including resolving solid tumor heterogeneity, identifying stem cells, tracking cell lineages and population consumption, measuring mutation rates, and detecting fusion gene events.19,186–188 Although single-cell sequencing can provide far more accurate measurement, however, the challenges of the single-cell sequencing in cancer cells exist in the sequencing and data analysis steps beyond the cancer cell isolation. First, the two copies of DNA strands as the input material results in technical errors including insufficient coverage, difficulties in mutation calling, and false-positive error in heterogeneity characterization. Multiple datasets from different single-cell sequencing encompass even higher requirements for the post-sequencing comparison analysis. In the near future, we expect to see that the single-cell sequencing will be applied in much more new issues of cancer genomics study such as differentiate extensive biological complexity or extensive technical errors, rare cancer diagnosis, and early development stage tumor discovery.
Dual RNA-seq
Pathogen-host interactions study including the immune response of eukaryotic cells is another important battlefield, where RNA-seq plays a critical role. Transcriptomic analysis has predominantly focused on either the host or the pathogen, which requires the RNA molecule separation from the host or the pathogen at specific time point, prior to the high-throughput sequencing era. 189 Deeper understandings of the interaction process, identification of new virulence factors, immune response mechanism, and development of therapeutic approach will require the simultaneous analysis of interaction partners because the battle leads to a constantly changing environment and complex gene expression patterns. A “dual RNA-seq” approach allows to monitor the genes from both host and pathogen without RNA separation throughout the infection process. 190 It enables the study of dynamic response and interspecies gene regulatory networks in both the interaction partners from initial contact through to invasion and the final persistence of the pathogen or clearance by the host immune system with high level of accuracy and depth. Dual RNA-seq attempt studies are in widespread areas such as molecular and cellular biology,191,192 public health, 191 immune response in disease,193,194 and bacteria and plant interactions.195–197 As a discovery-from-data approach, computational process and storage of the high magnitude data are of great challenge recently, although project-specific packages have been developed.198–200 Computational modeling and algorithm design beyond the existing ones will facilitate greatly for answering emerging questions by ever-developing applications from NGS to nanopore sequencing and single-cell sequencing.
As the biological complexity, the challenges of development of computational methods also exist in multiple dimensions. We have to consider the particular situation and design experiment accordingly, and no single method or pipeline is optimal under all circumstances even in the same fields. In addition, with rapid accumulation of data in public repositories, new challenges arise from the urgent need to effectively integrate many different RNA-seq data-sets, as well as different levels omics data to study the biological complexity and ultimately facilitate the precision and personalized medicine.
Author contributions
Wrote the first draft of the manuscript: YH, SG, WZ. Contributed to the writing of the manuscript: YH, SG, KM, WZ. Agree with manuscript results and conclusions: YH, SG, KM, WZ, BZ. Jointly developed the structure and arguments for the paper: YH, SG. Made critical revisions and approved final version: YH, SG, WZ, BZ. All authors reviewed and approved of the final manuscript.
