Abstract
Keywords
Introduction
The genetic information encoded in the genome sequence contains the blueprint for an organism's potential development, physiology and activity. This information can only be fully comprehended in the light of the evolutionary events acting on the genome (duplication, gains and gene losses, nucleotide substitutions, genome recombination), reflected in changes in the sequence, structure and function of the gene products (nucleic acids and proteins) and ultimately in the organism's biological complexity.
The recent availability of the complete genome sequences of a large number of model organisms means we can now begin to unravel the mechanisms involved in the evolution of the genomes and their implications for the study of biological systems. At the same time, theoretical advances in biological information representation and management have revolutionized the way experimental information is collected, stored and exploited. Ontologies, such as Gene Ontology (GO) or Sequence Ontology (SO), 1 provide a formal representation of the data for automatic, high-throughput data parsing by computers. These ontologies are being exploited in new information management systems to allow large-scale data mining, pattern discovery and knowledge inference.
The vast number and complexity of the events shaping genomes means that a complete understanding of evolution at the genomic level is not currently feasible. At the lowest level, point mutations affect individual nucleotides. At a higher level, large chromosomal segments undergo duplication, lateral transfer, inversion, transposition, deletion and insertion. Ultimately, whole genomes are involved in processes of hybridization, polyploidization and endosymbiosis, often leading to rapid speciation (for a review see 2 ).
Several databases dedicated to homologous gene families from vertebrates and microbial organisms have recently been developed for use in comparative genomics projects (for example,3,4). At present, the genomic context of a specific gene can be easily displayed using different user-friendly databases5,6 and the evolutionary dynamics of gene clustering can be accurately inspected. In our study, the main objective is the characterization and study of the evolutionary histories of the chordate proteome, in particular the detection of genomic events and automatic functional searches. We make use of formal descriptions of biological data, together with recent developments concerning automated reliable protein sequence alignment and accurate phylogenetic reconstruction. These approaches have been combined in a multi-agent, expert system for the construction of evolutionary histories to facilitate the automatic definition of the important genetic events shaping a given protein. Here we present the computational strategies that we have developed and the first steps towards our final goal, in the form of a novel database: the chordate proteome history database. This database provides phylogenies for the chordate proteomes, reconstructed using a gene-based approach in which the same high quality phylogenetic pipeline is applied to each individual gene in a given genome. Genomic events, at the gene level or at the protein domain level, were detected automatically and localised on the gene phylogenies and on the genomic sequence, wherever possible. We focused on the orthologous relationships between sequences from 14 species:
Materials and Methods
General Features
The chordate proteome history database is deployed
Data Model
As its name suggests, I.O.D.A does not rely on a relational database model, but on a more accurate and flexible model structured by an ontology. The ontology used in the chordate database focuses on the specific evolutionary concepts manipulated in the laboratory. More specifically, we use an approach based on mathematical first-order logic named
Phylogeny Construction and Event Detection
All the phylogenetic trees present in the database were built automatically using the software platform FIGENIX10,11 driven by the DAGOBAH expert system. 9 DAGOBAH is a multi-agent system in which specific agents have been developed for genetic event detection and verification. The phylogenetic trees were automatically analysed by a Java API: PhyloPattern. 12
Identification of Vertebrate Homologs and Construction of a Multiple Sequence Alignment
The 19,837 human proteins defined by the Human Protein Initiative (http://expasy.org/sprot/hpi/) were used in this analysis. For each protein, database searches of the Swissprot and Ensembl databases 13 were performed using the BlastP program. Multiple alignments of complete sequences (MACS) were then constructed using the MAFFT program, 14 containing up to 500 full-length protein sequences. The quality of the MACS was then validated using the NorMD objective function, and unrelated sequences were excluded using the LEON program. 15
Once a high quality MACS was obtained, the next step was to extract structural/functional information related to the protein family from the public databases. This was done using the in-house BIRD data retrieval system, and covered a wide range of information, from taxonomic data and functional descriptions (protein definition, EC number, GO, pFAM, Interpro) to sequence features, such as structural domains and active site residues. The retrieved data was integrated in the multiple alignment, together with a number of
Construction of an Accurate Phylogenetic Tree
Based on the main FIGENIX phylogeny pipeline, a new phylogeny pipeline was specifically developed to initiate phylogenetic studies from MACSIMS alignment files. In this pipeline, the alignment was intelligently cut to detect alignment areas associated with specific protein domains and repeats. For each domain, a phylogenetic tree was built and used for the study of domain architecture events.
In addition, a gene-level phylogeny was produced. All alignment areas associated with the domains in the protein query (the one that initiated the alignment) were concatenated and the resulting alignment was used for tree building. The gene phylogenies were used to study gene losses/gains and horizontal gene transfers and to compile duplication events and orthology and paralogy relationships.
Functional Data
From all the homolog pages in I.O.D.A, the user can search functional data from: GO, 1 KEGG, 17 ArrayExpress, 18 String, 19 and QuickGO. 20 To do this, I.O.D.A converts Ensembl references to Uniprot references, which are all indexed in these databases. 21 To extract the functional data, these references are then sent to the web services associated with each of these databases. I.O.D.A presents the functional data, either directly on the web pages or through a link to these sites.
New Protein Domain Architecture Events, Localization on the Chordate Species Tree and Verification at Genome Level
An apomorphic protein can be formed by any of five kinds of events detected by a dedicated DAGOBAH agent:
The general strategy for domain event detection involved a nine-step process driven by the DAGOBAH multi-agents system:
Domain-annotated protein alignments built from a query protein are used to outsource phylogeny tree construction (domain trees and protein trees) to the FIGENIX pipeline. The Mirkin parsimony algorithm
22
is used on each tree produced to infer ancestral domain architectures on internal nodes. Unfortunately, no efficient algorithm is currently available to infer the order of ancestral domain architectures. The query's domain architecture is divided into a list of consecutive domain pairs. We note that two artificial domains (without any associated phylogenetic tree) are added at the tips, in order to study events occurring at the beginning and end of the protein. For example, for a protein with three domains A, B and C, our process studies each of the four pairs Ideally, a phylogenetic pattern consistent with the event should be found on each domain tree of a pair, which strengthens the event hypothesis. Nevertheless, events found only on one tree of the domain pair are considered as valid, but weaker, candidates. Two patterns are applied on the domain pair trees with our API: PhyloPattern,
12
one for deletion events and one for other events (Fig. 1). A pattern is a triple, ie, a well-supported ancestral node with two children: a plesiomorphic node corresponding to a domain architecture close to the ancestral one, an apomorphic node corresponding to the derived domain architecture.
A virtual example of an event leading to a novel domain architecture. The pattern associated with a deletion event candidate is an ancestral node with the two domains of the pair and other domains (denoted DL) located between them, an apomorphic child node with the two domains of the pair whose subtree contains the query sequence, and a plesiomorphic child node with the two domains of the pair whose representative sequence contains the DL domain list. The pattern associated with other event candidates is an ancestral node with one of the two domains of the pair, ie, the one for which the tree receives the pattern, an apomorphic child node with the two domains of the pair, and a plesiomorphic child node with one of the two domains of the pair, ie, the same one as in the ancestral node. We note that when we refer to a node's domains, we mean the inferred architecture for an internal node or the known architecture for a leaf. The choice of apomorphic and plesiomorphic representative sequences is very important for the subsequent steps. It will influence the reliability of event type determination (step 6), and also the reliability of event verification at the genomic level (step 8). Thus the process chooses the least remotely derived sequences, ie, the sequences with the domain architecture closest to the ancestral one and with the shortest branch to the ancestral node. These sequences are assumed to have accumulated fewer mutations and recombinations than the others. For the choice of apomorphic sequence, there is an exception to this last rule: when an event candidate's pattern is relevant for the two trees of the domain pair, we choose a sequence that belongs to the two apomorphic subtrees found in the two trees of the domain pair (for deletion candidates, we choose the query). When the criteria are not sufficient to choose between plesiomorphic sequences, the sequence closest to the apomorphic sequence species in the species tree is chosen. In the studied query sequence, the domain pair A B or a pair with a virtual tip A A-after is shown in bold.

To produce “definitive” conclusions, the process confronts each individual event candidate (produced by the study of all the domain pairs in the query sequence architecture) with all the others, through an expert system, applying logical rules. We cannot give details of all the specific rules here, but their aim is to group some individual events or to remove some ambiguity, whenever possible. Non-grouped events are identically conserved. As an example of grouped event candidates, given an apomorphic domain architecture A B C D, the process could identify two insertion candidates by studying the A, B pair and the C, D pair, but they are probably linked to a single event, the insertion of B and C between A and D.
In addition, we can see that the “shuffling/insertion” ambiguity in Table 1 could also be removed if, for example, the plesiomorphic architecture was A D when the process studied the A domain tree. In this case, the shuffling hypothesis is eliminated.
When event candidates are confirmed, the next step is to try to verify them at the genome level, by trying to find an alignment break position between two DNA segments, one associated with the representative apomorphic protein and the other with the representative plesiomorphic protein. DNA segments are extracted between concerned domains using Ensembl online access, and a Blast (tblastn) search is then performed on the DNA regions associated with the proteins, using the domain's amino acid segments as a query. Overlapping of Blast high similarity pairs is managed to extract the most significant area.
BlastZ 23 is then used to align the two segments and to detect the alignment break position that should be the recombination point. More details of this process can be found in. 9
We note that for many events we find no such position, because the divergence date between the apomorphic and plesiomorphic species is often too far distant, and many other accumulated events have since masked the recombination event. When this information is found, it is supplied to the I.O.D.A user in an “Expert comment” field.
In “ideal” studies, we identify two close recombination points on the common apomorphic sequence found on the two domain trees that show the event. If the same position is identified based on two different plesiomorphic sequences, then the event hypothesis is very strongly supported. However, these cases are very infrequent in the database.
We introduced this final step to detect, Wrong propagation of domain architecture in the MACSIMS alignments (used to initiate our studies). The artefact detection agent re-predicts the apomorphic and plesiomorphic sequence domain architectures from the Pfam database to verify them. Sequencing, assembling or gene prediction errors in the genomes used in this study. This agent is able to detect frequent artefacts resulting from the use of a gene isoform as an apomorphic one, although the plesiomorphic variant still exists, or reciprocally as a plesiomorphic one when the apomorphic variant exists.
Phylogroups and Gene Loss/Gain Study
The OrthoMCL algorithm was used to create groups of orthologous sequences from the same set of species as used for the phylogeny reconstruction. Phylogroups were created by clustering the groups using ortholog information obtained by the phylogenetic analyses. The “gene loss/gain” module is based on the phylogroup analysis. Gene gain and loss events were identified using the PARS algorithm. 22 As gene transfer in chordates is unlikely, a gene gain was assumed to occur only once. An event that occurred more than once was thus assumed to signal an artefact. The PARS algorithm minimizes the gain and loss events. For example, when an ortholog is frequently absent on a given tree, the algorithm predicts several gains. These cases should be considered as putative artefacts, possibly due to problems with the sequencing/assembly process.
Rules for Event Validation
If orthologs are recorded absent only on the leaves (except for the well-annotated genome: human and mouse), the expert system (a DAGOBAH agent) will not confirm the loss, which might be due to an annotation artefact or unfinished sequences. If the orthologs are recorded absent higher up the tree, or if all the orthologs are also absent in daughter branches, then DAGOBAH will valid the loss events.
External Access
To facilitate access to all the data contained in the chordate database, I.O.D.A entries can be easily linked to and from external pages using the URL: http://ioda.univ-provence.fr/IodaSite/Site.jsp?id=XXXXX, where XXXXX can be replaced by any reference or keyword searchable on the I.O.DA site (eg, P35125, which is a Uniprot reference). In this way, other databases focused on specific themes can include additional evolutionary information in their data.
Results and Discussion
The data in the chordate proteome history database are divided into two subprojects. The first subproject includes phylogenies, new architecture and duplication events. The second one is dedicated to chordate phylogroups analysis.
Phylogenetic Data
As we were interested in the evolution of the human proteome, the scope of the phylogenetic analyses was limited to the chordate, focusing exclusively on well-annotated genome species. The phylogenetic analysis was assumed to be robust for small families, as all the homologous sequences should be present, forming reliable ortholog and paralog groups. Based on the phylogenetic tree, genetic events that affect different protein characteristics were investigated, including orthology/paralogy and domain architecture.
Functional data for the different orthologous groups were collected from the GO, 1 KEGG, 17 ArrayExpress, 18 STRING 19 databases, and links are provided to the Ensembl, 21 Uniprot and Pfam 24 databases and the NCBI taxonomy. 25
Phylogroups
The phylogroup analysis is used as a filter and provides information about the size of the gene families, about potential gene loss in a given lineage, and finally about the appearance or gain of a novel gene family. Phylogroups are in fact OrthoMCL 7 ortholog groups that we overclusterized using orthology relationships offered by the automatic analysis of phylogenetic trees produced. OrthoMCL clustering can lead to artefact groups made up of fast-evolving orthologs. We correct this artefact by clustering the groups using ortholog information obtained by the phylogenetic analyses. Thus several OrthoMCL groups can be integrated in a single group, denoted “phylogroup”. The phylogroups can then be used to perform functional analyses as described for phylogenetic analyses.
In addition, the phylogroups are exploited in the evolutionary analyses for the detection of events such as gene loss and gene appearance. Gene appearance can be the result of various scenarios, 26 eg, (i) pseudo-appearance due to duplication followed by high rates of mutation, (ii) gene rearrangement leading to different domain architectures in the orthologs, (iii) horizontal gene transfer (only a few examples in the chordates) or (iv) de novo genes.
Our phylogroup approach and the associated gene loss and gain results offer a number of advantages over other published ortholog databases that use clustering: (i) the ortholog group is corrected by the phylogeny and (ii) we include expert rules to give greater confidence to the ortholog loss/gain events.
Database Access and Web Interface Features
Browsing and Querying
The chordate proteome history database is publicly accessible at http://ioda.univ-provence.fr. The database is organized in two interconnected projects: (i) domain events and phylogenies and (ii) chordate phylogroups and gene loss/gain. The two subprojects are linked: the corresponding phylogroup can be accessed from a gene's phylogeny study page, and conversely, the domain events and phylogeny studies can be accessed from the phylogroup page.
The database can be browsed using the “search” window by entering various queries, eg, (i) the human protein name, using Ensembl or Uniprot identifiers, (ii) the Ensembl identifier for nonhuman species, (iii) key words, (iv) one or more domain names, (v) partial domain names or (vii) a combination of these key words. We note that numbered information and user guidelines are provided in wiki pages.
Phylogenies and Domain Event Searches in the Phylogeny Subproject
The phylogenetic reconstructions for each gene are available and can be retrieved directly from queries. The phylogeny subproject can be searched for events leading to new domain architectures, ie, caused by the loss or gain of a domain or domain shuffling. Figure 2 shows an example of a search using the UniProtKB/Swiss-Prot accession number P35125 as a query. By clicking on the search window (entry page), two results pages are available; phylogenetic study and events studies (see Fig. 2).
The chordate proteome history database entry page.
The phylogenetic study results page includes the phylogenetic trees, the ortholog list with the functional links, the paralog list and the list of homologs if the phylogenetic analysis results in some weakly supported nodes.
The events study results page includes links to each possible type of domain architecture evolution, ie, domain shuffling, domain insertion or deletion inside the sequence and domain loss or gain at the N- or C-terminus. For example, in the case of P35125, a domain shuffling event was detected (Fig. 3). By clicking on the “Shuffling events” tab and selecting a specific shuffling event, the user gains access to two information pages: “from Tree” and “Event pattern” (Fig. 4). The “from Tree” tab shows the phylogenetic tree used to deduce the event, together with the domain organization of the leaf sequences. In addition, the branch on which the event is assumed to occur is identified. The “Event pattern” tab provides more details about the domain organization of the apomorphic (derived) and the plesiomorphic (similar to the ancestral) representative sequences.
Domain structure organization. Tree and event pattern pages.

Phylogroup Subproject
By clicking on the “Chordate phylogroups and gene loss/gain” and “Studies” menus, the study box shows the ortholog distribution on the different species under investigation. The “Group statistics” menu gives the user an overall view of the group distribution, the sequence number and the number of events, while the “Groups” menu gives the list of all ortholog groups. The tree box shows the species tree where the gene appears and when it is lost. The ortholog box provides the list of all the orthologs, and the functions of the ortholog sequences can be easily retrieved by clicking on the functional request button (see below:
A Case Study: The Example of Gulo Gene Analysis from Phylogroup Data and Phylogenies
The Detection of gene gain and loss in phylogroups.
New Protein Domain Architecture: The Example of Shuffling Events
A number of shuffling domain exchanges could be evidenced by using the database (as described above in the case of P35125). To summarize, 1943 shuffling domains were reported in the current version of the database. These 1943 shuffling events exclude all putative artefacts and could be assigned as relevant shuffling events with no ambiguity. In the field of new gene origination, these shuffling events are of prime importance for users, as the creation of new proteins/function could be carried out by bringing different domains together. 2
Conclusions and Perspectives
In summary, the chordate proteome history database combines ortholog clustering, phylogeny and automatic functional link searches with automatic detection of important genomic events at the gene or protein domain levels. We are focusing on new enhancements for the medium-term including: (i) detection of other evolutionary events to achieve a more overall view of the genomic changes (eg, pseudogenization), (ii) introduction of other chordate genomes thanks to the current growing number of genomes sequenced and improved quality (structural and functional annotation) of the present genomes and (iii) development of other databases focusing on different kingdoms (eg, fungi) under the I.O.D.A. umbrella.
Author Contributions
Conceived and designed the experiments: AL, JP, PP, PG. Analysed the data: AL, JP, JD, JDT, OP, PP, PG. Wrote the first draft of the manuscript: AL, JDT, PP, PG. Contributed to the writing of the manuscript: AL, JDT, PP, PG. All authors reviewed and approved of the final manuscript.
Competing Interests
Author(s) disclose no potential conflicts of interest.
Funding
This research was supported by the ANR EvolHHuPro (ANR-07-BLAN-0054-01).
