Abstract
Introduction
Bioinformatics is of paramount importance for protein analysis in proteomic studies. Proteomics generates huge amounts of data that must be interpreted to answer biological questions. It is used at three steps of proteomic studies (Fig. 1): (i) identification of proteins by peptide mass mapping or peptide sequencing; (ii) prediction of their sub-cellular localization, (iii) prediction of their function. Identification of proteins requires measurement of mass/charge ratios of tryptic peptides or of ions resulting from peptide fragmentation and comparison to sequence databases. This part of the work is done with devoted software such as ProteinProspector (http://prospector.ucsf.edu/), Mascot (http://www.matrixscience.com/search_form_select.html), and Profound (http://prowl.rockefeller.edu/prowl-cgi/profound.exe) that can be used online. Of course, reliability of such identifications depends on the quality of both sequences deposited in databases and of structural annotation of genomes. But many efforts have been done to improve it, especially with the setting of Unigene at NCBI which is defined as “An Organized View of the Transcriptome” (http://www.ncbi.nlm.nih.gov/sites/entrez?db=unigene). Analysis of protein sub-cellular localization and function is more puzzling. For

Prediction of Sub-Cellular Localization of Proteins as a Valuable Tool to Assess the Quality of Sub-Cellular Proteomics
The creation and maintenance of cellular structures relies on the regulated expression and spatial targeting of proteins. In addition, determining the sub-cellular localization of a protein is an important first step towards understanding its function. Proteomics is one of the large-scale methods used to identify proteins, made possible by the sequencing of whole genomes. The homogenized cells are fractionated, most commonly through centrifugation. With proper purification and fractionation techniques, the contents of a particular fraction will correspond to a particular organelle. However, the approach is sensitive to contaminations. Purity control mainly relies on tests that should be positive for proteins expected in the purified fraction, and negative for proteins of all other cell compartments.
Plant cell wall is a particular difficult compartment since it is not surrounded by a membrane, and contains anionic carbohydrates that can bind basic proteins from other compartments. The reliability of protein profiling for a compartment like the cell wall thus strongly depends on the quality of the extraction protocol. There are two ways to obtain proteins from the cell wall compartment:4–5 (i) non-destructive methods using the culture medium of cell cultures, or the extraction of extracellular fluids by an infiltration/centrifugation procedure; (ii) destructive methods involving the rupture of the cells and the separation of the insoluble cell wall fraction. Since plant cell walls are mainly built up with highly dense polysaccharides, this property can be used to purify them by centrifugation through high density solutions. Several methods have been used to prepare enriched cell wall fractions and to verify their purity (Table 1). On the one hand, enzymology, immunology, and microscopy have been used to estimate the purity of the isolated fraction. On the other hand, bioinformatic analyses of the sequences of identified proteins allowed the prediction of their sub-cellular localization using PSORT (http://psort.ims.u-tokyo.ac.jp/form.html) 6 and TargetP (http://www.cbs.dtu.dk/services/TargetP/).7,8 In a few cases, sub-cellular localization could not be reliably predicted. The ratio of the number of predicted secreted proteins to the total number of proteins identified has been calculated. This ratio can also be used as a purity control. It can be seen that classical methods of control support a high purity for most of the fractions. However, the concrete results of the bioinformatic predictions show a different picture with much lower degree of purity in most cases, suggesting that many fractions were not pure enough. On the contrary, although such controls have not been performed in the case of preparation of proteins from the hypocotyl cell wall fraction, 1 it is remarkable that the level of purity of the cell wall protein fraction is one of the highest as calculated from bioinformatic predictions. It should be noted that the use of biochemical or immunological markers as purity criteria have produced a number of publications claiming that many well-known intracellular proteins are also secreted without a signal peptide. 9 It is true that not all secreted proteins contain a signal peptide and alternative secretion pathways exist in both prokaryotes and eukaryotes. However, only a reduced number of proteins are secreted without a signal peptide in eukaryotes. 10
These comparisons show that the classical methods used to test for the purity of sub-cellular compartments are not conclusive for proteomic studies. Indeed, the sensitivity of mass spectrometry is much higher than that of enzymatic or immunological tests using specific markers. As a consequence, the characterization and prediction of the intrinsic signals that target proteins to the correct subcellular compartment has become a major task in bioinformatics. Although not all signals for protein sorting in cell compartments are described, bioinformatics can help in predicting subcellular localization of proteins thus contributing to the quality control of proteomic strategies (Fig. 1). In particular, sorting signals for vacuoles are of several types and probably not all are known. 11 In addition, non classical pathway for protein secretion should be taken into account. 10
Using Functional Domains as Efficient Tools for Annotation of Proteins
With regard to protein function and due to automatic annotation of proteins on the basis of BLAST searches (http://blast.ncbi.nlm.nih.gov/Blast.cgi),
12
there are many mistakes in databases on the principle of the children game called the Chinese whispers. Even if functional domains such as InterPro, PFAM or PROSITE are now indicated in the description of protein sequences in most databases, the names proposed for proteins are often incorrect because they result from BLAST searches rather than from the presence of functional domains. Actually, BLAST results can rely on partial sequence homology as shown in the case of the family of 11 leucine-rich repeat extensins (LRXs)
13
as LRXs and PEXs. Query of the NCBI Entrez Protein database (http://www.ncbi.nlm.nih.gov/sites/entrez?db=Protein) results in 14 accession numbers using the following key words: leucine-rich repeat AND extensin AND Arabidopsis. The same functional annotation was found at TAIR (http://arabidopsis.org/index.jsp) and TIGR (http://www.tigr.org/tdb/e2k1/ath1/) whereas only 6 proteins were given related names such as leucine-rich repeat/extensin or extensin-like at MIPS (http://mips.gsf.de/proj/plant/jsf/index.jsp) (Table 2). A detailed analysis of the information available in databases shows that the appropriate functional domains are listed in the description of the proteins (Table 1, supplementary data). However, the names assigned to the proteins are not correct at NCBI, TAIR, and TIGR in three cases (At2g19780, At4g06744, and At4g29240) since these names were given according to BLAST results. As shown for At2g19780 in Figure 2, significant identity was found with an LRX protein encoded by

Number of
Functional domains found using InterProScan and PROSITE in
Other problems result from different annotations despite the presence of identical functional domains. This is the case for the extensin gene family which comprises 19 members. 14 Three examples are given in Table 2, supplementary data. At1g21310 gets the name of its mutant (RSH for Root Shoot Hypocotyl Defective) whereas At1g26240 is annotated as “proline-rich extensin-like family protein” and At1g26250 as “proline-rich extensin, putative.” On the contrary, the same name can be attributed to proteins that belong to different families. The “Pollen Ole e1 allergen and extensin family” contains two types of proteins (Table 2, supplementary data): proteins that only contain the IPR006041 domain (Pollen Ole e1 allergen and extensin), and proteins that contain both this domain and a proline-rich region profile (PS50099). Only the MIPS database names At3g33790 a “putative proline-rich protein.”
Although functional domains are correctly listed in description of proteins, annotation may only take into account a profile with a high probability of occurrence, i.e. of poor significance. This is the case of At2g15770 annotated as “glycine-rich protein” although the score obtained for PS50315 (glycine-rich region) is very low (Table 2, supplementary data). A higher score was obtained for IPR008972 (cupredoxin) and the gene has been annotated as AtEn23 by experts. 15 On the contrary, At1g15825 has been annotated as “hydroxyproline-rich glycoprotein family protein” although the only identified structural domain is PS50099 (proline-rich region). A “BLAST 2 sequences” reveals a very poor matching to the true extensin mentioned in databases (At3g19020) (data not shown). Moreover, the prediction for subcellular localization is “cytoplasm” using PSORT with a score of 0.450 and “other” using TargetP with a score of 0.602. A detailed analysis of the protein sequence reveals that at present the only possible annotation is “proline-rich protein”.
The last case will concern a protein that has been annotated as an “extensin-like protein” as inferred from sequence comparison to a protein sequence deduced from a cDNA of

Building of the ProtAnnDB Dedicated to Collection of Predictions of Sub-Cellular Localization and Functional Domains
The
Sub-cellular localization of proteins can be predicted using several programs on line. TargetP, SignalP (http://www.cbs.dtu.dk/services/SignalP/),
10
and Predotar (http://urgi.versailles.inra.fr/predotar/predotar.html)
18
predict N-terminal targeting sequences. TMHMM (http://www.cbs.dtu.dk/services/TMHMM-2.0/) predicts transmembrane domains. Since all these programs use different algorithms to make predictions, it is necessary to compare the results. There is no such tool available on line apart from Aramemnon (http://aramemnon.botanik.uni-koeln.de/), a database describing plant putative membrane proteins based on the comparison of results from different programs predicting transmembrane spanning domains and membrane-anchoring through GPI anchors, prenylation, and myristoylation.
19
Several of these programs were selected in

Bioinformatics tools for prediction of protein sub-cellular localization. Protein sequences are from TAIR8 (http://www.arabidopsis.org/index.jsp).
Prediction of functional domains can be achieved using different programs. InterProScan (http://www.ebi.ac.uk/Tools/InterProScan/) collects information from different prediction programs and proposes classification of proteins in superfamilies.
20
Pfam (http://pfam.sanger.ac.uk/) comprises 9318 protein families.
21
PROSITE (http://www.expasy.org/prosite/) comprises 717 patterns and 795 profiles.
22
PANTHER (http://www.pantherdb.org/) classifies genes by their functions, using published experimental evidence and evolutionary relationships. Three of these programs were chosen for building

Bioinformatics tools for prediction of cell wall protein functional domains. Protein sequences are from TAIR8 (http://www.arabidopsis.org/index.jsp). Links to NCBI RefSeq are provided for each protein (http://www.ncbi.nlm.nih.gov/RefSeq/). Examples of cell wall-related gene families annotated by experts are listed in Table 4.
In all cases, use of several programs based on different rationales is required to improve the quality of the prediction of sub-cellular location or function of a protein. However, some doubtful cases cannot be resolved. Finally, relevant literature, especially if it includes experimental data, is the best source of information. Databases annotated by experts in the field also provide checked information and often allow use of unified nomenclature. This information is available either in research articles or on line. It has been included in
Cell wall-related protein families of
Two different formats have been designed in


Conclusion
Proteomic studies generate huge amounts of data consisting in lists of genes. Answering biological questions using these results is a great challenge. It is usually done collecting protein annotations from databases. However, protein annotation in databases can be incomplete or misleading since they are mainly inherited from results of sequence comparisons and do not usually take into account the presence of functional domains. Moreover, the available databases do not allow finding predictions of sub-cellular localization of proteins easily.
Methods
Authors’ Contributions
HSC made the scripts to build up
