Abstract
Introduction
Phylogenomics is being intensively used in the “Post Genomic Era”. Many different phylogenetic trees have been published on the basis of different models of sequence evolution, 1 applying different parameter settings and algorithms. Although the rDNA,2–4 COI2,5,6 and other single genes have been extremely valuable for phylogenetic studies, single-gene phylogeny has its limitations. 7 Therefore, phylogenomics approaches, corroborated by the use of more representative phylogenetic markers, will allow, in principle, a more reliable and representative inference into the tree of life.
Nowadays, one has the option to concatenate multiple gene sequences to construct trees on the genomic level, such as “genome trees” or also called “supermatrix trees”, possessing more phylogenetic signals making them less susceptible to the stochastic errors than those built from a single gene.
The other option is the construction of the “supertree” that involves the concatenation of a set of trees.8,10 However, there are fundamental differences between the ways in which phylogenomic approaches integrate the phylogenetic information. Dutilh et al in 2007 11 systematically compared alternative methodologies such as gene content, superalignment, super-distance (to construct a supermatrix) and supertree approaches using various algorithms and tree-building methods on the Fungi, the eukaryotic clade with the largest number of sequenced genomes. The phylogenomic trees reproduced many of the clades in accordance with the current taxonomic views. Superalignment (supermatrix) and supertrees reproduced better target fungal phylogeny but they were not a guarantee for a successful phylogenomic tree. 11
Phylogenomics involving the use of entire genomes to infer a species tree has become the
There are more than 200,000 named species of unicellular eukaryotes that can be classified as Protozoa, of which approximately 10,000 are parasites, but only a small number are sufficiently important to be mentioned on the pages of
Difficulty is encountered in seeking a consensus taxonomic definition of several kingdoms. In particular, protozoan species are loosely characterized; deciding whether a species belongs to Protozoa or not is based on morphology and biological properties and also partially by the fact of not belonging to another kingdom.18,19,29 They are not a coherent phylogenetic group like other kingdoms or candidate kingdoms are, eg, stramenopila. Thus, to determine if a species belongs to Protozoa we do not just look at one characteristic (such as the multiparted, tubular flagellar hairs of stramenopila) and decide that. Instead, several criteria have to be met. 30
Protozoa are phylogenetically connected to other eukaryotic groups and several eukaryotic groups are most likely derived from Protozoa.
31
A molecular similarity has been established between alveolate (Protozoa) and stramenopiles.
32
In corroboration, residual plastids in the malarial parasite (
Illnesses caused by parasitic Protozoa are a major cause of disease worldwide, but because they are concentrated in low socioeconomic parts of the world, they receive relatively little attention from the pharmaceutical industry. Of the ten diseases targeted as research priorities by the World Health Organization's Special Program for Research and Training in Tropical Diseases (http://www.who.int/tdr), four are caused by protozoan parasites (malaria, leishmaniasis, Chagas disease and African trypanosomiasis). These diseases and other less dangerous ones (eg, amoebiasis and trichomoniasis) are having an alarming increase in cases, which are refractory to the main treatment. Treatment failure has potentially a multifactorial origin where drug resistance stands out as one of its major concerns.38–42
Our understanding of the phylogenetic position of protozoan within eukaryotes, as well as their relationships, is mainly based on ribosomal DNA analysis.2–4 At first, this process seemed relatively straightforward: trees generated from a single gene, most commonly a SSU rRNA, 43 appeared to provide a basic structure for the topology of eukaryotes, although many branches of the tree remained controversial. The 1990s was a period of deconstruction of this theory since several protein coding gene trees revealed serious discrepancies.44,45 Currently, a hypothesis for the tree of eukaryotes resembles the tree presented by Keeling in 2005. 36 The tree is a hypothesis composed from the various types of data, including molecular phylogenies and other molecular characteristics, as well as morphological and biochemical evidence. Five ‘supergroups’ were shown, each consisting of a diversity of eukaryotes, most of which were microbial (mostly protists and algae).
Chaudhary in 2005
46
showed results on available sequencing data. Phylogenetic analysis defined five supergroups of eukaryotes: i) the plant and red/green algal lineage; ii) a clade comprised of animals, fungi, slime molds and amoebozoans named Unikonta eukaryotes which contained the species included in our study (
In this study, we explore the use of multiple genes to present a phylogenomics-based study among Protozoa and its relationship with other very close taxonomic species which are considered as mitochondrial or plastid protozoan according to RefSeq (http://www.ncbi.nlm.nih.gov/RefSeq) taxonomic classification.
Material and Methods
Selection and preparation of marker gene families
We based our methodology on the study by Ciccarelli et al (2006) 8 for the selection and construction of the species tree. Thirty-one universal orthologous (UO) genes showing 1:1 orthologous relationships were used (Supplementary Table S1). Those UO were originally identified by Ciccarelli et al (2006) 8 showing the following characteristics: i) to be present in all complete genomes available at Genbank until 2006, ii) not to be involved in horizontal transfer, and iii) to be good ones for phylogenomic studies. As those 31 UO have a direct correspondence in the protozoan genome data available at RefSeq, they were mapped to the referred data using (a) the best blast hits (e-value < 1-e50), and (b) manual verification of the annotation (the RefSeq annotation of the best hits needed to match the UO annotation) (Supplementary Table S2). Once mapped, the protozoan protein sequences corresponding to those 31 UO were downloaded in fasta format from RefSeq (ftp://ftp.ncbi.nih.gov/refseq/release/protozoa/) and then aligned using Mafft v5.86152–54 with default parameters.
Selection and preparation of marker gene families for protozoan species trees
Our study used the data (protozoan species) that were obtained based on the taxonomic classification provided in the Release Catalogue of RefSeq (RefSeq-release 35-05/04/2009) for Protozoa. This was done as a first tentative to include the different genus and species of Protozoa available in public databases. The full names of species used in our analysis are listed in Supplementary Table S3.
Hidden Markov Models (HMM) profiles 55 were constructed for the 31 aligned UO set, then this database (HMM profiles obtained from the alignment of the best hits of UO from protozoan sequences available at RefSeq) served as a seed to search for more UO hits in protozoan sequences available at Genbank and RefSeq. The HMM profiles used as a seed were created (hmmbuild) and calibrated (hmmcalibrate) and the searches (hmmpfam) were done with e-value “1e-5” as cut-off using HMMER version 2.3.2. All protozoan sequences (74 complete and draft genomes) available at RefSeq (ftp://ftp.ncbi.nih.gov/refseq/release/protozoa/) (RefSeq-release 35-05/04/2009) and Genbank (http://www.ncbi.nlm.nih.gov/genbank/) (NCBI-Flat File Release 172.0-06/15/2009) were used as a target database.
The best hits (e-value < 1e-5) of the HMMER (hmmpfam) search were added to the multiple alignment that originated the UO HMM profile, then a new multiple alignment was constructed (Mafft v5.861)52–54 containing: a) the original protozoan sequences from RefSeq that originated the UO HMM profile, plus b) the best hmmpfam hits of the UO HMM profiles obtained with protozoan sequences from Genbank and RefSeq. Those new multiple alignments (entire or trimmed) were used: i) to be concatenated and build a supermatrix tree, or ii) to build individual trees, then those trees were concatenated to obtain a supertree (Supplementary Table S3).
Concatenating multiple alignments to build a supermatrix tree
A supermatrix tree was obtained using concatenated multiple alignments, either entire or trimmed:
Entire concatenated alignments (M1): The individual alignments were concatenated using an in-house perl script, resulting in a global supermatrix of 21,260 positions in a total of 74 species (43 Protozoa plus 31 mitochondrial or plastid Protozoa). The 31 plastid or mitochondrial protozoan genomes were included in the study based on the taxonomic classification provided in the Release Catalogue of RefSeq (RefSeq-release 35-05/04/2009).
Trimmed concatenated alignments (M2): The individual alignments were trimmed using TrimA1 v1.2 56 (http://trimal.cgenomics.org/) aiming to obtain their most conserved blocks. Those extracted conserved blocks were concatenated using an in-house perl script, resulting in a global supermatrix of 12,807 positions in a total of 74 protozoan species. Positions in the alignment with gaps in more than 10% of the sequences were trimmed with TrimA1.56,57
The resulting supermatrix of M1 or M2 was used to generate separate trees with Phym1 2.4.458,59 using 100 bootstrap replicates and JTT, elected as the best evolutionary model. Each individual alignment was tested for the best evolutionary model using Modelgenerator 0.85. Several models were selected by Modelgenerator 0.85; however, because it is not simple to use multiple models in a single (concatenated) alignment, we decided to adopt JTT that was also the model adopted in the phylogenomics studies of Ciccarelli et al (2006). 8 JTT assumed that there were two classes of sites, one class being invariable and the other class being free to change. 8
The resulting clades C1, C2, and C3 obtained from the supermatrix tree of M1 or M2 were used to generate the individual trees Ct1, Ct2, and Ct3 with Phym1 2.4.458,59 using 100 bootstrap replicates and the evolutionary models (Supplementary Table S4) obtained with Modelgenerator 0.85.
Building individual trees to obtain a supertree
We also used either entire or trimmed individual alignments to build phylogenetic trees, then we concatenated them to build a supertree: the 31 individual total alignments (M3) and the 31 individual trimmed alignments (M4) were used to construct individual trees with Phym1 2.4.4 using 100 bootstrap replicates and the matrices of the evolutionary models (Supplementary Table S4) obtained with Modelgenerator 0.85. 60 The resulting trees of M3 were concatenated to obtain supertrees with Clann 3.1.3. 9 The same procedure was adopted for M4 (trimmed alignments).
Test of phylogenetic signal
The content and distribution of the phylogenetic signal of the 64 alignments (2 concatenated and 62 individual) were analyzed. Two statistical approaches, the PTP test and g1 statistics were employed to achieve a measure of the overall signal content. The PTP test (PTP—permutation test probability or permutation tail probability test) 61 was implemented in PAUP* (Phylogenetic Analysis Using Parsimony) version 4.0b10 62 and it was executed with heuristic search.63,64 G1 statistics was calculated from the characters using the RandTrees function in PAUP 65
Topological test
The Kishino and Hasegawa tests (KH tests) 66 were performed in PAUP* 4.0b10 62 to assess differences between the most parsimonious trees resulting from the analysis of the full dataset.
Null distribution of the test statistic was simulated using 100 bootstrap replicates likelihoods (full dataset) obtained with Phym1 2.4.4 of the following eight groups: i) Complete total tree (M1), ii) Complete trimmed tree (M2), iii) M1-C1, iv) M2-C1, v) M1-C2, vi) M2-C2, vii) M1-C3 and viii) M2-C3.
Kishino-Hasegawa test 66 assumes a null hypothesis where the expected difference in the optimality score between alternative phylogenies is zero. 67 This requires that the topologies under comparison must be specified a priori and without reference to the data used for the test. However, nearly all uses of these tests involve comparing alternative topologies to the optimal topology estimated from the data. This application guarantees that the null expectation of difference will always be larger than zero and does not violate any assumption of a normal distribution of differences in optimality scores between topologies. 68
Results and Discussion
The vast majority of eukaryotic diversity is Protozoa and most of the sequenced protozoan species are parasites. 46 It is important to establish the position of protozoan within the eukaryotic group; despite the taxonomic classification rules are not completely clear. Nowadays, the majority of phylogenies cannot be considered as complete information as they use only single or ribosomal genes.
The evolutionary analysis of the data included several steps. First, we separately evaluated the presence or absence of a significantly structured phylogenetic signal for each data set, and second, we separately compared the trees signalized by the KH test as being “the best” and the trees obtained with Phym1 2.4.4 from the entire concatenated alignments (M1) and the trimmed concatenated alignments (M2).
In order to evaluate the statistical significance of any particular branching pattern, statistical tests were performed. Statistical tests in phylogenetics allow the assessment of the degree of confidence in any given tree topology being the true topology or the most consistent that we accept as true. Thus, statistical tests are responsible for the mutual development between the abilities to estimate better trees and to create more realistic models of evolution. 69
This was accomplished by conducting exhaustive parsimony searches on each alignment using PAUP* 4.0b262,70 and comparing the resulting g1 statistics of tree-length distribution skewness with critical values published by Hillis and Huelsenbeck in 1992. 65 Negatively skewed distributions indicate the presence of trees shorter than what would be expected by chance. As another measure of the phylogenetic robustness of the data, a bootstrap analysis of the combined data set was conducted using 1,000 replicate branch-and-bound parsimony searches. Another popular statistical test that was used in the phylogenetic analysis was the PTP test, which is designed to address whether there is any true phylogenetic signal in any given dataset (alignment). 69
The PTP test indicates that: A) the length value of the most parsimonious tree based on the trimmed concatenated alignments (78,498 steps,

Permutation Tail Probability Test (PTP) of the concatenated alignments. A) PTP test of the trimmed concatenated alignments of the protozoan UO. (Number of replicates = 1,000, search = heuristic). The gray arrow indicates the most parsimonious tree (TMP = 78,498).
The g1 statistics also indicates that there is phylogenetic signal in the data used in the analysis regarding the tree-length skewness because: A) the tree-length distribution based on the trimmed concatenated alignments showed left skewness (G1 = –0.57,

The g1 statistics of the concatenated alignments. A) The g1 statistics of the trimmed concatenated alignments of the protozoan UO. (Number of replicates = 1,000,000). The gray arrow indicates the most parsimonious tree (TMP = 103,691). B) The g1 statistics of the total concatenated alignments of the protozoan UO. (Number of replicates = 1,000,000). The gray arrow indicates the most parsimonious tree (TMP = 111,342).
The results of the PTP test for concatenated alignments are (I) Total Characters: i)
Results of the PTP test for the concatenated alignments of the universal orthologs.
Results of the PTP test for the 31 single alignments of the universal orthologs.
By several approaches, our data (concatenated alignments) showed to be more reliable or informative than single gene phylogenies: A) the two statistical approaches, PTP test and g1 statistics used to achieve a measure of the overall signal content, indicated that the molecular data related to the 64 alignments have phylogenetic signal. B) The use of concatenated alignments offers more Parsimony-Informative Characters in comparison to the use of separated single genes.
On the strategy to build phylogenetic relationships our study was indeed based on UO originally described by Ciccarelli's work.
8
This choice was made because we believe that this approach is really nice and useful, especially when using genomes with partial or unfinished sequencing. While the strategy is pretty much the same, more data were used in our study. Ciccarelli's tree
8
present only six species of Protozoa (
Figure 3 shows the comparison of the trees M1 and M2, using the total (M1) and trimmed (M2) (using TrimA1) concatenated alignments. Both methodologies showed very similar topologies; however, the tree obtained with trimmed alignments showed higher bootstrap values. Also, in Figure 3 (at the right side) are presented the three sub-trees belonging to each of the three clades of the original M2 tree: M2-C1 (Clade 1 of the M2 tree), M2-C2 (Clade 2 of the M2 tree) and M2-C3 (Clade 3 of the M2 tree). Each presented higher bootstrap values when compared to the entire M2 tree. The trees signalized by the KH test as being “the best” (for scores see Supplementary File S3) and the trees M1 and M2 (and their sub-trees M1-C1, M2-C1, M1-C2, M2-C2, M1-C3 and M2-C3) obtained with Phym1 2.4.4 (for log1k see Table 3) presented the same topologies and they are shown in the Supplementary Figure S1 Group A-D, respectively.

Phylogenomic supermatrix trees of protozoan species using the total and trimmed (using TrimA1) alignments. Supermatrices of 21,260 (M1) and 12,807 (M2) positions were used respectively in 74 protozoan species. Maximum likelihood tree was constructed with Phym1 2.4.4, JTT as evolutionary model and bootstrap 100. The resulting clades of M2: M2-C1 in red, M2-C2 in blue and M2-C3 in black with the evolutionary models obtained with Modelgenerator 0.85 were used to construct three individual trees: M2-Ct1 (Blosum62), M2-Ct2 (RtREV) and M2-Ct3 (WAG).
Kishino-Hasegawa test results.
The bootstrap values of the M2 tree (containing the three clades: C1, C2 and C3) are weak, and it is probably because not all taxons had the sequences of the 31 UO available at public databases. We tested the possibility to obtain reliable species trees using less than 31 UO, then showed that using at least 80% (25/31) for each taxon provide reliable inferences (most bootstraps equal or higher than 80). The C1, C2, and C3 clades of the M2 tree were re-analysed as separated trees (M2-C1, M2-C2 and M2-C3) (Fig. 3), then better bootstrap values were obtained. The M2-C1 tree showed bootstrap values above 80, except for the values 49, 60, 61, and 72; and the M2-C2 tree also showed bootstrap values above 80, except for the values 43, 63, and 67. Unfortunately bootstrap values could not be enhanced even better either using new/better alignment or more taxa. This weak bootstrap values behaviors has also been noted in Ciccarelli (2006); 8 Dutilh (2007) 11 and Hartmann (2008) 71 using similar data and/or methodology.
The species trees—M1 and M2—presented three major clades. Each clade was related to one of the following groups of data: i) 26 species presenting at least 80% of UO (25/31) in their concatenated alignments called C1, ii) 12 species presenting between 50%–79% (15–24/31) of UO called C2, and iii) 36 species presenting less than 50% (1–14/31) of UO called C3.
C1 showed excavates represented by kinetoplastids, trichomonads and diplomonads. Kinetoplastids, a group of uncertain affinity,45,72,73 was characterized by the presence of a monophyly formed by the paraphyletic groups
Ciccarelli et al (2006)
8
showed that despite a highly resolved and robust tree, they could not exclude a few uncertainties in tree topology due to biased species sampling or Long Branch Attraction (LBA). This LBA was suggested to account for the placement of Diplomonadida (
Animals and their unicellular relatives (together termed ‘Holozoa’) show a strong affinity with Fungi (together termed ‘opisthokonts’) providing strong evidence in a relationship with Protozoa (eg,
The analysis of multiple nuclear genes strongly supports the sharing of a common ancestry of choanoflagellates represented by
In our phylogenomic trees, C2 was formed by four groups: i) rhodophyta:
C3 was formed by the protozoan euglenozoa/kinetoplastida:
Our phylogenomic analysis showed, as expected, the phylogenetic relationships and the monophyly of Protozoa (Fig. 3) that is in good agreement with previous studies.36,46,47,75,76,80 Unfortunately, because of the use of several incomplete genomes in this study, C3 phylogeny is not reliable, mainly because it is formed by a group of species containing less than 50% (1–14/31) of the 31 UO found. This could be also interpreted as the concatenated alignment having too many gaps (in this case a gap would be a missing UO or whole protein, not only a missing amino acid or nucleotide sequence), thus contributing to the low robustness of the clade or tree. We hypothesize that each missing UO can be treated as a gap in the concatenated alignment, then the more gaps (or less UO) the less reliable trees. It appears logical to expect an inverse relationship between the proportion of gapped sites in an alignment and the accuracy of the inferred phylogeny,71,81,82 particularly if the gaps are not treated as reflective of distinct evolutionary events, 83 and thus, containing distinct phylogenetic signal.
The supertree approach did not work in our hands because when Clann 3.1.3 was used for the supertree reconstruction (trees’ concatenation), only a Neighbor-Joining tree (the initial step to produce a supertree) was obtained. The next two steps (heuristic search or the bootstrapping) were not executed because the software showed an error message informing that the individual trees used as inputs could not be concatenated into a one supertree. Nevertheless, Supplementary Figure S2 shows the Neighbor-Joining supertree for the individual total alignments (M3) and for the individual trimmed alignments (M4) and what could be appreciated is the fact that they presented similar topologies when compared to the supermatrix tree (M1 and M2) in Figure 3.
While most of the publicly available eukaryote genome sequence data were obtained with Sanger technology (medium to low coverage), second and third generation sequencing technologies will be probably used to sequence a larger number of species but at low coverage; hence the approach to use partial sequences from several genes (especially UO) appears to be a good option for future phylogenomics-based studies.
Conclusions
We have presented a phylogenomics-based overview for Protozoa. Relationships between protozoan groups are in agreement with previous studies, supporting monophyly. On the other hand, phylogenetic information inferred from C3 is not reliable due to incomplete information (missing UO in those genomes), suggesting that the use of less than 15 UO for phylogenomic reconstruction is not reliable. The inclusion of more data (UO) is necessary to obtain a robust tree in C3. Our phylogenomics-based methodology using a supermatrix approach proved to be reliable with protozoan genome data, suggesting that (a) the more UO used the better, and (b) that the use of the entire UO sequence or just a conserved block of it produce similar reliable results. The highest bootstrap values were obtained when the trees of the clades C1, C2 and C3 were constructed separately. The results of the supertree were obtained only for the Neighbor-Joining tree and were not conclusive. Finally, we need to further investigate if this methodology could be extrapolated or reproduced to other taxonomic groups.
Disclosure
This manuscript has been read and approved by all authors. This paper is unique and it is not under consideration by any other publication and has not been published elsewhere. The authors and peer reviewers report no conflict of interest. The authors confirm that they have permission to reproduce any copyrighted material.
Abbreviations
Universal Orthologs;
Hidden Markov Models;
Kishino and Hasegawa test.
