Abstract
Introduction
Measuring the expression of genes in bacterial genomes has a very broad range of applications, from developing treatments for infections to creating synthetic genomes. Gene expression studies in bacteria have been used to study metabolic pathways, identify properties of mutants, and otherwise better understand the molecular processes in their genomes.1,2 The first step in this process is to quantify gene expression for all the genes expressed in a particular experiment.
For more than a decade, microarrays have been the main means for studying gene expression. However, microarray technology can only capture transcripts for which probes are designed, therefore limiting its applicability to known genes in well-studied strains of a species. Alternatively, the use of quantitative PCR (qPCR) allows one to quantify specific genes rather than all genes in the genome, although this technique is far more costly on the scale of the whole genome. Recent improvements in the quality, efficiency, and cost of second-generation sequencing have led to an explosion in experiments (known as RNA-seq) that directly capture and sequence RNA, which has been replacing microarray analysis in recent years. In a microarray experiment, differences between the reference and a novel strain might prevent hybridization to some probes on the microarray. In contrast, in an RNA-seq experiment, transcribed genes are sequenced and aligned to the genome. Alignment algorithms can tolerate many mismatches, thereby allowing sensitive measurement of gene expression even when the target genome has diverged from the reference. In addition, because the entire transcript is sequenced, RNA-seq data also reveals the operon structure of a genome.3,4
Since the introduction of RNA-seq technology,5,6 software methods have been developed to quantify gene expression 7 in a sample and to find differences in gene expression between multiple samples.7–9 However, the current family of tools for estimating gene expression has been developed with the goal of identifying the structure of eukaryotic genes. These tools focus much of their effort on finding intronic regions within a gene and on finding alternative splice variants, which are common in higher eukaryotes. On the contrary, bacterial genes do not have introns and are not alternatively spliced; thus, there is no need to look for splice variants when analyzing their transcripts.
Bacterial genomes are also very densely packed with genes, many of them overlapping one another. Previous RNA-seq software methods generally do not provide a means of dealing with genes that overlap because these are extremely rare in humans and other mammals (the main targets of most RNA-seq experiments). In contrast, approximately 90% of a typical bacterial genome codes for proteins. 10 A study of 220 prokaryotic genomes spanning a wide evolutionary range 10 revealed that 29% of all genes in these species overlap another gene on either the 5′ or the 3′ end. These overlaps range from just a few base pairs (bp) to well over 100 bp. Overlapping genes can occur on the same strand or on opposite strands; thus, strand-specific RNA-seq offers at best a partial solution. For RNA reads that map within the overlapping region, it may be impossible to determine which of the 2 genes yielded the read, thus producing a challenge in determining gene expression of each prokaryotic gene.
Further complicating the requirements for analysis, bacterial RNA-seq technology has at least one major difference from eukaryotic RNA sequencing protocols, due to the absence of polyadenylation. The long poly-A tail on eukaryotic transcripts can be used as a capture probe, but bacterial RNA-seq method must instead rely on random priming to capture transcripts. 11 Another challenge is that more than 80% of captured bacterial transcripts are ribosomal RNA (rRNA). Although methods have been developed for removing rRNA, a large amount of rRNA still appears in some RNA-seq experiments.
Due to differences between eukaryotic and bacterial genomes and between RNA-seq protocols, the existing programs for expression analysis often perform poorly or break down entirely when applied to RNA-seq data from bacterial genomes. Therefore, new bioinformatics methods are required to estimate levels of gene expression in bacterial RNA-seq data. Currently, no stand-alone tool exists for this purpose. Multiple bacterial RNA-seq projects have been published,12–20 but these have used ad hoc methods to quantify gene expression. All of these ad hoc methods first align input sequences (“reads”) to a reference genome using a next-generation sequence aligner such as Bowtie, MAQ, SOAP, BWA, ELAND, Novoalign, or other custom-built aligners.21–26 From these alignments, they count the number of reads mapped to each gene, usually normalizing counts by each gene's length. Reviews of some of these ad hoc approaches can be found in Guell et al 27 and Van Vliet. 28
One of the challenges faced by the standard alignment approach is that in every data set, some reads align to multiple places in the genome. It can be difficult and sometimes impossible to determine which of these multiple locations is the true source of the read, particularly if the source is repeated identically elsewhere in the genome. To avoid this problem, some previous methods simply discard multi-aligned reads. This strategy may significantly (and incorrectly) reduce the apparent expression levels of genes that contain repetitive sequences. A more serious problem arises for gene families, in which reads from any gene in the family may map equally well to all copies of the gene. Other methods of counting multi-aligned reads assign a fractional count to each location where a read maps or randomly assign a read to one of its multi-mapped locations. None of these provide a perfect solution, although, as we will show, the use of fractional read counts works well in practice.
While currently used ad hoc methods for estimating gene expression level in bacteria perform reasonably well, they are often not easy to use. Some require the user to run several software tools in succession, and the output of one program is sometimes not the correct input for the next tool, requiring ancillary programs to reformat the data. In this paper, we present a new program called EDGE-pro (Estimated Degree of Gene Expression in PROkaryots), the first stand-alone method specifically designed for estimating gene expression level in prokaryotic genomes. EDGE-pro is an efficient software system that provides solutions to the challenges mentioned above.
Methods
The EDGE-pro pipeline operates on four main inputs: the reference genome, a protein translation table (ptt) containing coordinates of protein coding genes in that genome, another table (rnt) containing coordinates of tRNA and rRNA genes, and the RNA-seq reads themselves. If the ptt and rnt tables are not available, they can be generated from the genome sequence separately, for example, by running Glimmer 29 to find protein-coding genes and running tRNAscan-SE 30 and RNAmmer 31 to find RNA genes.
The EDGE-pro pipeline consists of four mandatory steps: (1) the main mapping step, in which all reads are aligned to the reference genome; (2) a filtering step dedicated to multi-aligned reads; (3) computation of the depth of coverage of each base in the reference genome; and (4) conversion of raw coverage depth statistics into the RPKM value (reads per kilobase of gene per million reads mapped) for each gene.
Reads mapping
EDGE-pro maps reads to the reference genome using Bowtie2 32 with its default parameters, allowing up to 10 alignments for each read. Bowtie2 allows mismatches and small indels in the aligned reads, which gives it improved sensitivity over the mapping capability of its predecessors and especially over the ungapped aligners that have been used in some of the previous, ad hoc analysis methods for bacterial RNA-seq. The ability to allow indels is especially important when the reference genome is a different strain of bacterium from the source of the RNA, since even closely related strains often differ in many small insertions and deletions.
Filtering multi-mapped reads
In many cases, with up to 10 alignments per read available, some alignments are clearly better than others. For example, if 1 alignment contains 2 mismatches and a second alignment reports 10 mismatches, the first alignment is much more likely to be the correct one. On the other hand, if 1 alignment contains 1 mismatch and a second alignment contains 2 mismatches, the better match is not clear; for example, the second mismatch might be due to a sequencing error, in which case the 2 alignments are equally good.
The EDGE-pro pipeline filters reads that map to multiple locations based on the alignment score assigned by Bowtie2. Bowtie2 assigns a score of 0 to a perfect match and a negative score to each mismatch and indel. A mismatch gets a negative value between -6 and −1, with more negative values given to bases with higher quality values. The default penalty for an indel is −5 for opening a gap and −3 for each base in the gap. EDGE-pro considers multiple alignments using a threshold that is set based on the best-scoring read. If we let
For example, if the best scoring alignment has only 1 mismatch, then no alignments with any gap will be considered. However, if the highest scoring alignment has 5 mismatches with low quality values, its score will be −10 (assuming that each base gets penalized by score of −2). EDGE-pro will keep all alignments with a score of at least 1.15 · (−12) = −13.8, which allows for a 2-base indels (−5 for opening the gap and −3 for each of 2 bases in the gap). It is not clear which of these 2 alignments is better (5 mismatches at low quality bases or 1 2-bases indel); thus, EDGE-pro considers both alignments as good alignments. The value of 1.15 was determined empirically by looking at hundreds of situations manually to determine which alignments are “reasonable competition” to the highest scoring alignment.
Per-base coverage
After filtering multi-mapped reads, the EDGE-pro pipeline determines the read coverage for each position in the genome. Uniquely mapped reads are simple: for these, EDGE-pro increments its coverage counts by 1 for each position in the genome covered by the read. Multi-mapped reads in bacterial genomes often map to duplicate genes, which contain identical or near-identical sequences, and it is not possible to determine which of 2 duplicate genes is expressed. EDGE-pro offers 2 options to handle these reads. The default option is to divide these multi-mapped reads, giving partial credit to each location where they might map, as follows. For each multi-mapped read remaining after filtering, the coverage of each position that the read spans is incremented by 1/N, where N is the number of “good” alignments for that read.
This heuristic will give the correct answer if all copies of a duplicated gene are expressed equally. Suppose instead that we have 2 identical copies of the gene, 1 of which is expressed at an RPKM level of 10 and the other silent. EDGE-pro will undercount the expression level and will assign half the reads to the unexpressed copy, reporting both copies with an RPKM of 5. If the 2 genes are identical, then this represents the same quantity of transcripts as if only 1 copy were expressed, and, in this respect, EDGE-pro's behavior is still correct; however, it might have different implications depending on the physical location and nearby regulatory sequences for each of the 2 copies.
The second option that EDGE-pro provides to deal with multi-mapped reads is to randomly choose 1 of all the positions where the read mapped and assign a full count to this position and ignore all the other positions.
The pipeline up to this point computes the per-base coverage for all genes across the genome. Calculating coverage per base overcomes the problem of trying to determine the source of a read that maps in the overlapping portion of 2 genes. Moreover, per-base coverage leaves open possibilities for further analysis such as visualization of uneven coverage across a gene and improvements to gene annotation, particularly at the start and stop coordinates of annotated genes.
Average gene coverage and RPKM calculation
In the last step, EDGE-pro converts coverage per base to the RPKM value for each gene. Note that the current version of EDGE-pro provides RPKM rather than FPKM values, even if paired-end reads are used. This provides a count for each read rather than each fragment that maps to a gene, therefore giving 2 counts for a pair of reads that map to the same gene rather than giving just 1 as FPKM does. This may provide a small disadvantage if an experiment uses paired-end sequencing and if a significant fraction of the fragments only yield 1 rather than 2 reads. However, because bacterial RNA-seq analysis does not need to link together exons across splice junctions, paired-end sequencing does not provide as much of an advantage. For single-end sequencing, RPKM and FPKM are equivalent.
Because many bacterial genes overlap, even to the point that a gene may completely contain another gene on the opposite strand, the coverage in the overlapping regions should not be counted toward both genes. The easiest method to distribute the coverage in the overlapping region is to distribute it proportionally to the coverage in the nonoverlapping segments. This scheme should work on average, but it might be distorted due to nonuniform coverage of genes. To illustrate, Figure 1 shows a gene for which coverage is much higher at its 5′ end, skewing its average coverage. The relative coverage levels of genes 1 and 2 in the overlapping region, as shown in Figure 1, are better approximated by the coverage in the smaller windows (red boxes) adjacent to the overlapping region. The window size used here is a parameter that can be changed by a user. By default, EDGE-pro takes the window of length 100 bp, which is approximately the length of current Illumina reads.

Nonuniform coverage of genes using data from an experiment on
Figure 2 shows the regions used to determine coverage of the overlapping portion of a gene. After identifying the region of overlap, EDGE-pro adjusts the coordinates of the nonoverlapping parts by a pre-defined length in order to exclude the untranslated region (UTR) on each side of the overlap. Because the RefSeq annotation provides only the coordinates of the protein coding regions and because the precise coordinates of UTRs are generally unknown, this heuristic adjustment reduces the bias in determining the coverage of each gene in its UTR regions. The UTR length is a parameter that could be adjusted by a user and is set to 40 bp by default since most bacterial UTRs are approximately 30 to 40 bp long. The coverage of a gene is calculated as follows:
The average coverage of the nonoverlapping portion of a gene is computed as the number of reads per base covering that region. The average coverages of window 1 (W1), window2 (W2), and the overlapping segment (O1) are calculated. The ratio of these coverage values (W1/W2) is used to distribute the total coverage of the overlap regions, with each gene getting a proportional share of the coverage in the overlapping segment. The ratio W1/W2 is also used to determine which portion of coverage in UTR1 and UTR2 comes from gene 1 and gene 2, respectively.

Overlapping genes.
In Figure 2, the nonoverlapping part of each gene is longer than the window used to estimate coverage (100 bp by default). If the nonoverlapping part of only 1 gene is long enough, only this gene is used to determine distribution of overlap coverage between the 2 genes.
Figure 3 illustrates the algorithm for computing coverage when a gene is completely contained within the coordinates of a gene on the other strand. The average coverage in sampled regions on either side of the contained gene (window1 and window2 in Fig. 3) is used as the estimated coverage of the longer gene. If the total coverage in the overlapping part is similar to or lower than the overall coverage of the larger gene, then EDGE-pro assumes that gene 2 is not expressed. By default, the coverages are considered to be similar if the coverage of overlapping region is within 15% of the overall coverage of the larger gene. This parameter is adjustable by a user. If the overlap coverage is not lower than or similar to the overall coverage of the larger gene, the difference between the overlapping segment's coverage and the coverage of the larger gene is assigned to the smaller, contained gene.

Contained genes.
Once the average coverage of each gene is determined, we use the average coverage to estimate the number of reads that mapped to the gene:
Next, we compute the RPKM for each gene, which normalizes the expression level with respect to gene length and the total number of reads mapped. To avoid counting low-level background noise as transcription, we set the RPKM value to 0 if the average coverage of a gene is less than 3 (a parameter that can be adjusted by the user). Otherwise, we calculate RPKM as
The EDGE-pro pipeline outputs for each chromosome and plasmid a report that contains the identity and coordinates of each gene, its average coverage C, the number of reads
Results
We tested the performance of EDGE-pro on RNA-seq data generated from the bacterial pathogen
We downloaded RNA-seq data collected from 2 samples of
Reads used in the differential expression study.
For each pair of replicate samples, we simply averaged the RPKM values of the 2 samples and used the averages for comparison between the wild type and mutant strains, similarly to the method of Chaudhuri et al.
15
We defined upregulated and downregulated genes as those whose expression level increased or decreased at least 4-fold. Note that no conclusions about significance can be drawn from this comparison; our goal was simply to compare the computational analysis produced by EDGE-pro to previously published results on the same data. The differential expression in the Chaudhuri study was computed with DeSeq,
6
which computed a
We identified 20 genes that were downregulated in the mutant strain,
Downregulated genes in the mutant.
Supporting this hypothesis, the fold-change for 2 of these genes,
Thus, the differences between results presented in Chaudhuri et al and our results are mostly due to different thresholds used: the
Another possible explanation for the small difference between these results is that the mutant samples contained a much higher proportion of rRNA reads (see Table 1), which EDGE-pro discards before computing RPKM values. Chaudhuri et al's analysis included these reads, which would have altered both the absolute and relative RPKM values for some genes. We also note that Chaudhuri et al used a different (commercial) alignment program, Novoalign, which may have changed some of the statistics.
We found that 11 genes were upregulated in the
Upregulated genes in the mutant.
The EDGE-pro pipeline only provides the expression level of each gene and does not compute differential expression statistics. The simple heuristic that we used here to compare our results with those presented in Chaudhuri et al 15 shows that even a simple heuristic applied to the EDGE-pro output matches the results produced by previously published, less automated ad hoc methods for bacterial RNA-seq analysis. Researchers who want to compute differential expression can easily feed the output of EDGE-pro into a separate program (eg, DeSeq) 6 designed for that task.
Computational Requirements
We measured time and memory requirements for EDGE-pro on a four-core 2.1 GHz AMD Opteron server with 512GB of RAM. To provide requirements relevant to current RNA-seq technologies, we used 101-bp reads for the timing experiments. We ran EDGE-pro on 2 samples containing 30 million RNA-seq reads from
Time and memory requirements.
Conclusion
Quantifying gene expression levels is the critical first step in analyzing RNA-seq experiments. Although several software systems have been developed for eukaryotic RNA-seq experiments, these algorithms do not perform well on bacterial organisms due to multiple differences between the experimental protocols and due to differences in gene structure and gene density. EDGE-pro is the first system to integrate, in a single software pipeline, a series of alignment and quantification methods specifically designed for prokaryotic genomes. The software includes algorithms specifically tailored for the overlapping gene regions commonly found in prokaryotic genomes.
Abbreviations
EDGE-pro, Estimated Degree of Gene Expression of PROkaryotic organisms; RNA, ribonucleic acid; rRNA, ribosomal RNA; tRNA, transfer RNA; RPKM, reads per kilobase of gene per million reads mapped; ptt, protein translation table.
Author Contributions
TM designed and implemented most of the software. DW designed and implemented some parts of the software. TM and SLS wrote the manuscript. All authors reviewed and approved of the final manuscript.
Funding
This work was supported in part by the National Institutes of Health under grants R01-LM006845S, R01-HG006677, and R01-HG006102 to S.L.S.
Competing Interests
Authors disclose no potential conflicts of interest.
Disclosures and Ethics
As a requirement of publication author(s) have provided to the publisher signed confirmation of compliance with legal and ethical obligations including but not limited to the following: authorship and contributorship, conflicts of interest, privacy and confidentiality and (where applicable) protection of human and animal research subjects. The authors have read and confirmed their agreement with the ICMJE authorship and conflict of interest criteria. The authors have also confirmed that this article is unique and not under consideration or published in any other publication, and that they have permission from rights holders to reproduce any copyrighted material. Any disclosures are made in this section. The external blind peer reviewers report no conflicts of interest.
