Abstract
Keywords
Introduction
It is the widespread nature of the arrangement of genome architecture that means two adjacent protein-coding genes have coding sequences that partially or entirely overlap. 1 This phenomenon has been observed in viruses,2–4 prokaryotes5–9 and eukaryotes.10–13 For example, it has been reported that about one-third of all protein-coding genes across completely sequenced bacterial genomes are such overlapping genes (OGs). 14 As to the role of this configuration, OGs are usually assumed to be potentially involved in the regulation of gene expression1,13,15,16 or improvement of genome compaction.1,13,16,17
In our previous works, we demonstrated that OGs could be used as rare genomic markers for bacterial phylogeny inference.18,19 A previous study shows that gene content might change when gene order is altered too much. 20 As phylogenetic markers, it is evident that OGs do not evolve as slowly as gene content, because of their widespread arrangement in prokaryotic genomes and mutation at a universal rate. However, OGs evolve more conservatively than gene order, because the linkage between two OGs may be preserved for functional constraints.11,14,21–23 Our idea is simple and intuitive that two closely related bacterial genomes sequenced completely are compared using a similarity metric, which measures the proportion of the shared OGs between the two genomes. Our results and the results of others show that there are indeed some phylogenetic signals within those orthologous OGs among closely related bacterial genomes.24–28 However, it is obvious that OGs as phylogenetic markers may be inappropriate for comparing a set of too closely related bacterial genomes, such as complete strain genomes of one species, because it is possible that no evolutionary events of OGs would have occurred in a short time span. In addition, for too distantly related bacterial genomes, the OGs method used for phylogeny inference may also show poor performance, as a few orthologous OGs may be identified.
Another drawback of OGs as phylogenetic markers for bacterial phylogenomic analysis in our previous works is that current similarity metrics consider only the presence or absence of one pair of OGs and ignore the relationship with neighborhoods. Thus, the method is usually less sensitive to capturing inconsistent evolutionary events, such as intra-chromosome/genome translocations, gains of OGs from foreign genomes or species, and gene losses.18,19 To solve the problem caused by gene losses, here we use another type of genomic feature called locally collinear blocks (LCBs). Each LCB, also known as a collinear genomic region, is a homologous region of sequence shared by two or more genomes. 29 Clearly, LCBs from different genomes can be used as orthologous regions, which are likely to contain multiple conserved genes even in their regulatory regions when they are sufficiently large. In some previous studies, LCBs have proved to be reliable genomic features for phylogenomic analysis among closely related genomes on the whole-genome scale.30,31 By combining OGs with LCBs, additional constraint is built into computations of the similarity between two genomes. In this way, we aim to partly, if not completely, mitigate the above-mentioned drawback of OGs as phylogenetic markers and thus infer phylogeny more accurately.
To test this hypothesis, we studied the phylogenetic relationships of 88 Enterobacteriale genomes of the class
Materials and Methods
Genome data
As a case study, the dataset of this study consists of 88 Enterobacteriales genomes of
The Robinson–Foulds topological distances between different types of phylogenies.
Orthologous genes were identified using the approach of BBH by setting the parameters with
Identification of LCBs among multiple genomes and reconstruction of phylogenies
Each LCB, also known as a collinear genomic region, is a homologous conserved block of sequence among different species. 29 As in our previous study, 30 the core LCBs, which are the set of collinear regions shared by all the species in the study, were used to reconstruct the evolutionary phylogeny. The whole-genome phylogeny inferred from collinear genomic segments was called LCBs phylogeny. First, LCBs were indentified using Sibelia version 3.0.6,32,33 which can efficiently find LCBs among a large number of microbial genomes without alignment. Chromosome sequences were compared with parameters “-s loose -q -g -v -t tmp –gff -m 100.” Second, for each LCB identified by Sibelia, multiple sequences alignment was performed using MAFFT (Multiple sequence Alignment based on Fast Fourier Transform) version 7.164 with parameters “–auto <data>,” 34 and ambiguously aligned regions were removed from the alignment using trimAl version 1.4 with default parameters. 35 Those trimmed alignments were converted to the Multiple Alignment Format (MAF) to get treated LCBs. Third, the core LCBs shared by all the studied genomes were assembled into a concatenated supermatrix. The maximum likelihood (ML) tree was inferred from the data matrices using FastTree version 2.1.8 36 with default Jukes-Cantor + CAT (classifying sites into CATegories) model.19,36 Local Shimodaira–Hasegawa (SH)-like support was assessed using SH test with 1,000 bootstrap replicates, and the support values are given as names for the internal nodes.
Identification of OGs, orthologous OG pairs, and reconstruction of phylogenies
The phylogeny inferred from OGs was called OGs phylogeny.18,19,27 OGs are defined as adjacent genes whose coding sequences are shared with each other. OGs were identified from each genome annotation using in-house Perl (a programming language) scripts (Supplementary File 5). All genes annotated as “unknown,” “hypothetical” or “putative,” which may be misannotated in the genomes downloaded from the NCBI,
27
were removed for more reliable analysis. Considering only protein-coding genes, the putative orthologous genes between two genomes were determined using the approach of bidirectional best hit (BBH). In accordance with previous studies,18,27 we tested two types of parameter settings for the NCBI BLAST (the Basic Local Alignment Search Tool) program
37
: (1)
The distance between two genomes is defined by Luo et al, which is as follows:
Combining OGs and LCBs and reconstruction of phylogenies
By combining the features of OGs and collinear genome region, a phylogeny called OGs–LCBs phylogeny was inferred. First, pairwise comparison was performed to identify pairwise LCBs between any two genomes from the studied genomes using Sibelia version 3.0.6.32,33 The parameters were set as “-s loose -q -g -v -t tmp –gff -m 100.” Second, the orthologous OG pairs, identified in the process of reconstructing the OGs phylogeny described earlier, were selected as follows: if all genes of one orthologous OG pair were completely along one pairwise LCB between these two genomes, this OG pair was selected and called a collinear orthologous OG pair. Then, the distance between two genomes is defined as shown in Equation 2 using collinear orthologous OG pairs instead of the orthologous OG pairs used by Luo et al.18,19 The distance matrix was generated according to this modified definition, and both the NJ and UPGMA phylogeny trees were inferred using the PHYLIP version 3.69.40,41
The distance between two genomes is defined as:
Reconstruction of phylogeny with 16S rRNA
To test the usefulness of our method, a standard 16S rRNA phylogeny was reconstructed. All the OGs phylogeny, LCBs phylogeny, and OGs–LCBs phylogeny were compared with this 16S rRNA phylogeny to quantitatively measure the similarity with the Robinson–Foulds topological distance (Table 1 and Supplementary Table 2). 42 The 16S rRNA genes of these 88 genomes downloaded from the NCBI may be annotated using different methods. In order to reduce the inference of differences caused by different annotation methods, rRNAs were re-annotated with our local annotation pipeline, in which Infernal release 1.1 43 was used to annotate possible noncoding RNAs accompanying the Rfam database (release 10.1). 44 Based on the re-annotated 16S rRNA gene, a standard 16S rRNA phylogeny for the 88 Enterobacteriale genomes was reconstructed as follows. First, multiple sequences alignment was performed on the 16S rRNA sequences using MAFFT version 7.164 with the parameters “–auto <data<.” 34 Then, ambiguously aligned regions were trimmed using trimAl version 1.4 with default parameters. 35 FastTree version 2.1.8 was performed36,45 using the default Jukes-Cantor + CAT model to construct an ML tree. 46 Using the SH test with 1,000 bootstrap replicates, local SH-like support was assessed, and the support values are given as names for the internal nodes. Most species belonging to one genus according to the taxonomy were well clustered in the 16S rRNA phylogeny (Fig. 1).

The phylogeny of 88 Enterobacteriales genomes inferred using 16S rRNA. The maximum likelihood tree was constructed, called 16S rRNA phylogeny, using the 16s rRNA gene. The 16s rRNA phylogeny was inferred from this single gene using FastTree version 2.1.8 36 with default Jukes-cantor + CAT model. Local SH-like support was assessed using the SH test with 1,000 bootstrap replicates, and the support values are given as names for the internal nodes. Species are denoted with their taxa names in the NCBI, and the corresponding genera are indicated in the square brackets. Species in the same genus are colored with the same color. Those genera with only one member in the study were colored with black.
Results and Discussion
The LCBS phylogeny for 88 Enterobacteriales genomes
As a type of phylogenomic marker, LCBs have been proven useful for phylogenomic analysis among closely related genomes or intraspecific prokaryotic genomes on the whole-genome scale.30,31 The phylogeny was constructed as described in the “Materials and methods” section to explore the performance of LCBs in phylogeny reconstruction among genomes in the Enterobacterials. For the 88 Enterobacteriales genomes, four-core LCBs with the total length of 1,392 bp were identified and comprised 0.02%–0.26% of the lengths of the studied genomes (Supplementary Tables 3 and 4). The phylogeny was constructed using the four-core LCBs and was called the LCBs phylogeny. To quantitatively measure the similarity of the LCBs phylogeny and the 16S rRNA phylogeny, the Robinson–Foulds topological distance was calculated (Table 1). By comparing these two types of phylogenies, we found that many species were not well clustered according to their genera in our LCBs phylogeny, but they were well grouped in the 16S rRNA phylogeny. Especially for the genera

Whole-genome phylogeny of 88 Enterobacteriales genomes inferred using LCBs. The maximum likelihood tree was constructed, called LCBs phylogeny, using the core LCBS shared by the 88 Enterobacteriales genomes. The LCBS phylogeny was inferred from the core LCBS using FastTree version 2.1.8 36 with default Jukes-Cantor + CAT model. Local SH-like support was assessed using the SH test with 1,000 bootstrap replicates, and the support values are given as names for the internal nodes. Species are denoted with their taxa names in the NCBI, and the corresponding genera are indicated in the square brackets. Species in the same genus are colored with the same color. Those genera with only one member in the study were colored with black.
The OGs phylogeny for 88 Enterobacteriales genomes
In addition, another type of phylogeny for the 88 Enterobacteriales genomes was inferred based on OGs, which was called OGs phylogeny. As described in the “Materials and methods” section, two types of parameter settings were tested to identify orthologous protein-coding genes. Therefore, two types of OGs phylogeny were reconstructed with the NJ method using the PHYLIP software package (Supplementary Fig. 1 and Fig. 3). Comparing these two OGs phylogenies with the standard 16S rRNA phylogeny using Robinson–Foulds topological distance, we found that the OGs phylogeny with the second parameter setting was more similar to the 16S rRNA phylogeny (Supplementary Table 2). Furthermore, the second OGs phylogeny (Fig. 3) showed more consistency with taxonomy than the first one. Thus, we opted for the second parameter setting and the second OGs phylogeny was used in the remainder of our study. The OG information used for phylogeny inference with the second parameter setting was shown in Supplementary Table 5 (Workbooks 1 and 3). Similar to the 16S rRNA phylogeny, most of the genera in our OGs phylogeny were well grouped, except the ant endosymbionts. The seven species of ant endosymbionts were divided into three groups, which is inconsistent with the topology of the 16S rRNA phylogeny. We examined the genome sizes of the 88 genomes. Strikingly, the genome sizes of ant endosymbionts were much smaller than most other species (Supplementary Table 1). Previous studies have shown that species of ant endosymbionts have experienced substantial gene losses.47–50 As Luo et al stated, 18 the OGs method might be less sensitive to inconsistent evolutionary events, such as a lot of reduction (gene losses). These results indicate that the genomic feature of OGs can contain useful phylogenomic signals for inferring the evolutionary histories of closely related genomes. However, inconsistent evolutionary events probably reduce the phylogenomic signals of OGs, leading to some conflicts or confusion of the evolutionary relationships. Based on the same distance matrix, UPGMA trees were also reconstructed with the UPGMA method using the PHYLIP. Similar results were also found in the UPGMA trees (Supplementary Figs. 2 and 3 and Supplementary Table 2).

Whole-genome phylogeny of 88 Enterobacteriales genomes based on OGs with the NJ method. The NJ tree was constructed, called OGs phylogeny, based on OGs for the 88 Enterobacteriales genomes. Orthologous genes were identified using the approach of BBH by setting the parameters with
As our results show, both OGs and LCBs used for the phylogeny inference of the 88 Enterobacteriales organisms revealed some merits and drawbacks. In seeking to maximize the extraction of genomic information on the whole-genome scale, we assume that the analysis of combining OGs and LCBs should reveal a comprehensive history of closely related bacterial organisms.
The OGs-LCBs phylogeny for 88 Enterobacteriales genomes
To test our hypothesis, pairwise comparison was performed between any two genomes from the 88 Enterobacteriales genomes to identify LCBs of pairwise genomes. Collinear orthologous OG pairs were identified according to the “Materials and methods” section. Combining the two types of genomic features (OGs and LCBs), we constructed the OGs-LCBs phylogeny using orthologous OGs within their collinear genomic regions. The NJ tree (Fig. 4) was inferred with the NJ method using the PHYLIP, based on the distance matrix produced according to the methods (Supplementary Table 5: Workbook 2 and 3). Similarly, the UPGMA trees were also reconstructed with UPGMA method using the PHYLIP (Supplementary Fig. 4). According to the measurement of Robinson–Foulds topological distance, our OGs-LCBs phylogeny seems to be the most similar to the 16S rRNA phylogeny (Table 1). Consistent with the 16S rRNA phylogeny, almost all species in our OGs-LCBs phylogeny were clustered into groups according to their genera. In contrast to the OGs phylogeny, the species of ant endosymbionts, having experienced gene losses in their evolutionary histories,47–50 were better grouped in our OGs-LCBs phylogeny (Fig. 4). This pattern suggests that combining both the OGs and LCBs features, to a certain extent, can reduce the effect of inconsistent evolutionary events and increase the robustness of the OGs methods for phylogenomic analysis. When the loss of genes had occurred within these genomes, smaller numbers of orthologous OGs pairs were identified, which might reduce the accuracy of their positions in the phylogenies when comparing with other genomes. However, if we use orthologous OGs pairs conditional on their collinear genomic regions rather than their whole genomes, the influence of gene loss on the accuracy of phylogeny inference may be mitigated. In addition, we also observed that

Whole-genome phylogeny of 88 Enterobacteriales genomes based on both OGs and LCBs with the NJ method. The NJ tree was inferred for the 88 Enterobacteriales genomes called OGs-LCBs phylogeny, combing two types of genomic features OGs and LCBs. Orthologous genes were identified using the approach of BBH by setting the parameters with
There are many methods designed for the reconstruction of phylogenies among species. 30 A recent study by Facey et al introduced a comparative genomics approach to reconstruct a phylogeny of similar bacterial species, which compiled a dataset of loci shared by 9% of bacteria under study. 55 All these previous studies are generally separated into two categories: those based on the core genes or features, similar to the LCBs phylogeny, and those based on the variable gene content or other genomic features or metabolic pathways, such as the OGs phylogeny and OGs-LCBs phylogeny. In this study, we provide some evaluation of using only LCBs, only OGs, and the combination of OGs and LCBs in reconstructing phylogenies. All the results suggest that the reconstruction of phylogeny combining both OGs and LCBs should be reliable for phylogenomic analysis among closely related bacterial genomes. However, there are some limitations for our OGs-LCBs method. For example, currently, we have analyzed only the complete genomes, and further improvement may be needed to analyze the draft genomes.
Conclusions
In this work, we attempt to combine two types of genomic features (OGs and LCBs) for the phylogenomic analysis of closely related bacterial genomes. As a case study, we analyzed the phylogenetic relationship of 88 species in the order Enterobacteriales. Three different types of phylogenies were constructed based on OGs and LCBs. Our results demonstrated that by combining OGs with LCBs, more accurate phylogenetic signals were detected, which enabled us to construct more precise and robust phylogenies of the 88 genomes. The OGs method for phylogenomic analysis is usually less sensitive to some inconsistent evolutionary events, such as gene losses. Interestingly, combining OGs and LCBs as phylogenetic markers may reduce, to some extent, the influence of gene loss on phylogeny inference. In the OGs-LCBs phylogeny of 88 Enterobacteriales genomes, the species of ant endosymbionts, which have experienced substantial gene losses, were better clustered than in the OGs phylogeny. Thus, mining the phylogenetic signals of OGs, together with collinear genome regions, should be an effective approach to increasing the robustness of the OGs methods for phylogenomic analysis of closely related bacterial organisms.
Author Contributions
Designed the study: YCZ, KL. Performed the bioinformatics analyses: YCZ. Wrote the first draft of the manuscript: YCZ, KL. Contributed to the writing of the manuscript: YCZ, KL. Agreed with manuscript results and conclusions: YCZ, KL. Jointly developed the structure and arguments for the paper: YCZ, KL. Made critical revisions and approved the final version: YCZ, KL. Both the authors reviewed and approved the final manuscript.
