Abstract
Introduction
Over the last decade, phylogenomics has been used to infer the global phylogenetic tree and reconstruct the evolutionary relationships among the major assemblages of eukaryote lineages.1–10 The success of such analyses rests on the construction of a large number of sequence alignments and the concatenation of multiple single-gene alignments into one supermatrix. This laborious and time-consuming enterprise involves sequence annotation at the genomic or transcriptomic level, construction of single-gene alignments, identification of ortholog sequences, and the final inference of multigene phylogenetic trees. The workload increases further as new generations of efficient sequencing techniques are developed and the amount of available sequence data grows rapidly.11,12 Construction of sequence alignments is laborious because it requires, for instance, collection of relevant data, manual adjustment of indels, and removal of ambiguously aligned sites. Hence, curated alignments (hereafter referred to as seed alignments) are natural starting points for phylogenomic inferences and for inclusion of newly sequenced genomes or transcriptomes.
In phylogenomic inferences, a critical step is the identification of ortholog gene copies, ie, genes that have been vertically inherited from a single origin. In contrast, paralogous gene copies generated by gene duplications within a genome may have diverged substantially by acquiring new functions. Inclusion of paralogous gene copies in sequence alignments can therefore seriously mislead the phylogenetic inferences of species.13,14 There is no simple way to distinguish orthologs from paralogs, and often several different approaches are used.15–17 Most commonly, however, a phylogenetic tree is used to detect paralogs. If homologous sequences from a given species are scattered into different clades of a phylogenetic tree with robust statistical support (eg, >70% bootstrap support 3,18), this can indicate the presence of paralog copies that should be removed before the concatenation of genes. Phylogenetic methods are preferred over simple pairwise clustering strategies or similarity searches such as BLAST because they include models of sequence evolution that better accommodate the evolutionary history of the sequences.
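The tree-based heuristic described above can be illustrated with a minimal Python sketch. It assumes that the clades of a single-gene tree have already been extracted as pairs of bootstrap support and tip labels. The function name `flag_putative_paralogs`, the input layout, and the tip-to-species mapping are hypothetical and are not part of BIR; the 70% threshold follows the example value given in the text.

```python
from itertools import combinations


def flag_putative_paralogs(clades, species_of, min_support=70.0):
    """Return {species: set of tip labels} for species whose sequences
    fall into two or more disjoint, well-supported clades.

    clades     -- iterable of (bootstrap_support, set_of_tip_labels)
    species_of -- dict mapping each tip label to its species name
    """
    # Keep only clades whose support exceeds the threshold.
    supported = [frozenset(tips) for support, tips in clades
                 if support > min_support]

    flagged = {}
    for clade_a, clade_b in combinations(supported, 2):
        # Nested or overlapping clades are not independent groupings.
        if clade_a & clade_b:
            continue
        species_a = {species_of[tip] for tip in clade_a}
        species_b = {species_of[tip] for tip in clade_b}
        # A species present in both clades has copies scattered in the tree.
        for species in species_a & species_b:
            copies = {tip for tip in clade_a | clade_b
                      if species_of[tip] == species}
            flagged.setdefault(species, set()).update(copies)
    return flagged
```

A flagged species is only a candidate for manual inspection; the sketch does not decide which copy is the ortholog.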
Phylogenomic analyses are often done on data that are based on hundreds of different single-gene alignments.2,5,6,8,9,19 Hence, there is an obvious need for an efficient process to construct alignments and perform single-gene phylogenetic inferences for paralog detection and definition of ortholog sequences.
In this paper, we present a bioinformatics pipeline called BIR (
Method
BIR is implemented on the Lifeportal web portal
BIR is written in Perl v5.8 and implemented on the web-based Lifeportal bioinformatics service at the University of Oslo. Initially, the user provides two ZIP-compressed files: 1) a file containing the query files, which typically consist of non-annotated sequences generated in genome and transcriptome sequencing projects, and 2) a file containing all the seed alignments. Both the query sequences and the seed alignments should be in FASTA format.
Extending seed alignments in five steps
The procedure to generate extended single-gene alignments is divided into five steps (Fig. 1). First, the query files are matched against the seed alignments using the BLAST algorithm. 20 Based on the BLAST search and the quality criteria set by the user (ie, query coverage percentage, subject coverage percentage, identity percentage, score, and e-value), sequences from the query files are added to the best-matching seed alignments. Using the same approach, the user can increase the probability of detecting hidden paralogs in the seed alignments and query sequences by incorporating sequences from other available genomes representing all the major groups of eukaryotes (see Table 1 for detailed information about these genomes). The user can define the maximum number of sequences to be added from each of the selected genomes. BLAST result files often contain multiple high-scoring pair (HSP) sequences that describe regions of similarity between query and hit sequences. In contrast, BIR uses a combination of alignment length, identity percentage, score, and e-value statistics to calculate sequence similarity.
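The first step can be sketched in Python as a filter over BLAST hits followed by a best-match assignment. This is an illustration under stated assumptions, not BIR's Perl implementation: the `Hit` and `Cutoffs` records, the default threshold values, and the ranking rule (highest score, then lowest e-value) are all hypothetical, since the text does not specify how the statistics are combined.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query_id: str
    seed_alignment: str   # seed alignment containing the matched subject
    identity: float       # percent identity
    query_cov: float      # percent of the query covered
    subject_cov: float    # percent of the subject covered
    score: float          # BLAST bit score
    evalue: float


@dataclass
class Cutoffs:
    # Hypothetical defaults; in BIR these criteria are set by the user.
    min_identity: float = 50.0
    min_query_cov: float = 50.0
    min_subject_cov: float = 50.0
    min_score: float = 50.0
    max_evalue: float = 1e-10


def passes(hit, cutoffs):
    """True if the hit satisfies every user-defined quality criterion."""
    return (hit.identity >= cutoffs.min_identity
            and hit.query_cov >= cutoffs.min_query_cov
            and hit.subject_cov >= cutoffs.min_subject_cov
            and hit.score >= cutoffs.min_score
            and hit.evalue <= cutoffs.max_evalue)


def assign_queries(hits, cutoffs):
    """Map each query to the seed alignment of its best passing hit."""
    best = {}
    for hit in hits:
        if not passes(hit, cutoffs):
            continue
        current = best.get(hit.query_id)
        # Assumed ranking: higher score wins, ties broken by lower e-value.
        if current is None or (hit.score, -hit.evalue) > (current.score, -current.evalue):
            best[hit.query_id] = hit
    return {query: hit.seed_alignment for query, hit in best.items()}
```

Queries with no hit passing all cutoffs are simply absent from the returned mapping, so they are not added to any seed alignment.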

Figure 1. Overview of the steps in the BIR pipeline. 1) The user provides a zipped file with the query sequences and another zipped file with the seed alignments. The sequences and alignments should be in FASTA format. Additionally, protein sequences from completely sequenced genomes (Table 1) can be added. Sequences from query files and selected reference genomes are added to the seed alignments with the highest match using BLAST. 2) The modified seed alignments can be realigned using MAFFT. 3) Gblocks or trimAl can be used for removal of ambiguously aligned regions. 4) Phylogenetic trees can be inferred with FastTree or RAxML. 5) Paralog prediction is done by the COCO-CL program. Putative paralogs are circled with a dashed line. The resulting phylogenetic trees can then be further assessed and interpreted using any tree-viewing software.
Table 1. Completely sequenced genomes from the eukaryotic supergroups available in the BIR pipeline.
SAR = Stramenopila, Alveolata, Rhizaria. #AA = Number of protein sequences in each genome.
In the second step, the user can either realign all sequences in the seed alignments (implemented by choosing progressive or iterative methods in MAFFT 26) or preserve the original seed alignment and only add the newly identified sequences. In the third step, the main task is to remove ambiguously aligned characters. This can be done by using either the Gblocks 27 or the trimAl programs. 28 Gblocks and trimAl parameters can be modified by the user; parameters for the removal of columns can be set to conservative (strict) or liberal (relaxed). In the fourth step, phylogenetic trees for each single-gene alignment are inferred by RAxML 24 or FastTree. 29 For RAxML, the user can select the evolutionary model and define the number of pseudo-replicates for bootstrapping analyses. Only the “-f a” option is implemented in the BIR pipeline, but other options can be used in a separate installation of RAxML on Lifeportal. In the last step of the pipeline, orthologous groups of sequences are predicted by hierarchical clustering with COCO-CL. 30 This algorithm requires a similarity distance matrix that is calculated separately using ClustalW. 31 The pipeline provides alignments for users who want to use other bioinformatic tools or tree-inference programs, such as PhyloBayes, 23 RAxML, 24 and MrBayes 22 (also available on Lifeportal under the Bioportal Phylogeny 25 Tools section).
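For readers who want to reproduce steps two to four outside Lifeportal, the following Python sketch assembles plausible command lines for MAFFT, trimAl, and FastTree or RAxML. It only builds the argument lists and does not run anything. The function name, file-naming scheme, model choices, random seeds, and the mapping of "strict" and "relaxed" onto trimAl's `-strict` and `-gappyout` modes are assumptions made for illustration; the source does not state the exact invocations BIR uses, apart from RAxML's "-f a" option.

```python
def build_step_commands(alignment, prefix, seq_type="protein",
                        trimming="strict", tree_tool="fasttree",
                        bootstraps=100):
    """Return a list of (argv, stdout_file) pairs for steps 2 to 4.

    stdout_file is the file that should capture standard output,
    or None when the tool writes its own output files.
    """
    if seq_type not in ("protein", "nucleotide"):
        raise ValueError("seq_type must be 'protein' or 'nucleotide'")

    aligned = prefix + ".aligned.fasta"
    trimmed = prefix + ".trimmed.fasta"
    steps = []

    # Step 2: realign all sequences; MAFFT prints the alignment to stdout.
    # --auto lets MAFFT choose between progressive and iterative strategies.
    steps.append((["mafft", "--auto", alignment], aligned))

    # Step 3: remove ambiguously aligned columns with trimAl.
    trim_flag = "-strict" if trimming == "strict" else "-gappyout"
    steps.append((["trimal", "-in", aligned, "-out", trimmed, trim_flag], None))

    # Step 4: infer a tree with FastTree or RAxML.
    if tree_tool == "fasttree":
        argv = ["FastTree"]
        if seq_type == "nucleotide":
            argv.append("-nt")
        argv.append(trimmed)
        steps.append((argv, prefix + ".tree"))
    else:
        model = "PROTGAMMALG" if seq_type == "protein" else "GTRGAMMA"
        # "-f a": rapid bootstrapping plus best-scoring ML tree search.
        # prefix doubles as the run name and must not contain a path.
        steps.append((["raxmlHPC", "-f", "a", "-m", model,
                       "-p", "12345", "-x", "12345",
                       "-#", str(bootstraps),
                       "-s", trimmed, "-n", prefix], None))
    return steps
```

Each pair could be passed to `subprocess.run`, redirecting standard output to the named file where one is given.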
Results and Discussions
Fast and easy addition of sequences to seed alignments
BIR allows fast and easy screening of large numbers of sequences against custom-defined seed alignments using BLAST. 20 Sequences with similarities higher than the user-defined cutoffs are automatically added to the seed alignments. Additionally, homologous amino acid sequences from representative species of all eukaryotic supergroups (for details, see Table 1) can optionally be added to the seed alignments to better recover hidden paralog sequences in the input data. New sequences are aligned to the seed alignment using MAFFT. 26 The Gblocks 27 or trimAl 28 programs are included for the visualization and removal of ambiguously aligned sites. In the final steps of the pipeline, phylogenies for each single-gene alignment are generated by FastTree 29 or RAxML. 24 These phylogenetic trees, together with the prediction of orthologs by COCO-CL, 30 provide the user with ample information to select true ortholog sequences. Upon completion of data processing, the download section of Lifeportal provides several output files. These include log files, result files, alignment files, and tree files. The generated files can be downloaded and interpreted manually, and tree files can be visualized separately by using one of the many available graphical programs such as FigTree 32 and TreeView.33,34 For ease of visualization, sequences added to seed alignments from either the queries or the genomes, as well as the predicted paralogs from COCO-CL, are marked with *_Q, *_G, and *_C, respectively.
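When post-processing the downloaded alignment or tree files, the origin markers can be read back from the sequence labels. The Python sketch below assumes that the markers appear as literal suffixes at the end of a label and that more than one marker may be stacked (for example, a query sequence that is also a predicted paralog); both points are assumptions, as the text only gives the patterns *_Q, *_G, and *_C. The function name `classify_label` is hypothetical.

```python
# Suffix markers as described in the text, mapped to their meaning.
MARKERS = {
    "_Q": "query",
    "_G": "genome",
    "_C": "predicted_paralog",
}


def classify_label(label):
    """Split a sequence label into its base name and a set of origins.

    Trailing markers are stripped repeatedly, so 'x_Q_C' yields
    ('x', {'query', 'predicted_paralog'}).
    """
    origins = set()
    base = label
    changed = True
    while changed:
        changed = False
        for suffix, meaning in MARKERS.items():
            if base.endswith(suffix):
                origins.add(meaning)
                base = base[:-len(suffix)]
                changed = True
    return base, origins
```

A seed sequence whose original name happens to end in one of these suffixes would be misclassified, so the result should be checked against the seed alignment where names are ambiguous.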
Unique aspects of BIR
Some of the steps in the BIR pipeline are similar to those of other bioinformatics applications such as PyPhy, 35 PhyloGena, 36 Hal, 37 and bioinformatics services such as phylogeny.fr 38 and PALM. 39 However, BIR is unique in providing sequence screening using seed alignments to generate gene alignments. Also, it is the only program that adds amino acid sequences from representative eukaryote genomes that belong to all eukaryotic supergroups. Other programs and pipelines such as PyPhy, PhyloGena, and Hal are stand-alone tools that require the installation of third-party software and databases prior to use. Hal is currently only available as a command line program without a graphical user interface, while online programs such as phylogeny.fr and PALM are designed for phylogenetic inference and selection detection, and they have strict limitations on the number and length of the sequences. In contrast, BIR is a web-based bioinformatics service installed on a high-performance computing cluster, thus avoiding installations on local computational resources. Because the query files and the alignment files can contain either nucleotide or protein sequences, the user has the added flexibility to use either type of data. However, the quality of the generated alignments is, in general, dependent on how conserved the input seed alignments are. BIR is linked to several other bioinformatics applications on Lifeportal for upstream and downstream data processing such as contig assembly, annotation, statistical analyses, and phylogenetic inferences. Hence, BIR can easily be combined with many other relevant applications in many fields of genomics and evolutionary biology.
Performance of the BIR pipeline – creating alignments for phylogenomic analyses
We developed a test case to demonstrate the usefulness and speed of analysis performed by the BIR program. As a starting point, we used the 124 single-gene alignments for phylogenomic analysis of the genus named
Conclusion
BIR provides a simple, fast, and user-friendly web-based pipeline installed on a high-performance computing resource. The pipeline can create a massive number of alignments that are highly useful for sequence annotation and the identification of paralogs. Hence, it can be used in many different bioinformatics disciplines, including key steps in phylogenomic analyses and other comparative and functional studies.
Author Contributions
All programming, analysis, and implementation on Lifeportal, drafting, and writing the manuscript: SK, AKK, RSN. Contributed to preparing and analyzing the test case: AKK, RSN. Contributed to testing the programs: XZ, SZ, RSN. Contributed to implementation on Lifeportal: SK, KM. Conceived, designed, and wrote the manuscript: KST. All authors reviewed and approved the final manuscript.
Supplementary File
Table S1
Information about the single-gene alignments used in the test case.
Table S2
Sequences randomly extracted from the single-gene alignments.
Table S3
Results from COCO-CL.
