Abstract
Keywords
Introduction
Determining the structure and function of proteins remains a central goal of biological and medical research, given their importance as major components of living organisms. Proteins are large, complex organic compounds consisting of amino acids joined by peptide bonds. The resulting polypeptide chain is called the primary structure; it can fold into a complicated three-dimensional (3-D) native structure, which largely determines the protein's function. It is therefore essential to know a protein's three-dimensional structure in order to infer its function. With recent advances in large genome sequencing projects, the need to determine these structures keeps growing. The number of known protein sequences continues to grow much faster than the number of solved structures deposited in the PDB,a and a major current concern is to bridge this ever-widening gap between sequences and structures. The amino acid sequence (primary or 1-D structure) contains sufficient information to specify the three-dimensional structure.1 However, structure determination from a known sequence is not straightforward: purely experimental techniques such as X-ray crystallography and Nuclear Magnetic Resonance spectroscopy are laborious, expensive, very time-consuming and sometimes even inapplicable. In silico methods thus present alternative approaches that reduce cost, time and effort. Many methods have therefore been rigorously explored to perform the essential intermediate step on the way to predicting this structure, namely predicting the secondary (2-D) structure from the primary structure. The secondary structure (SS) consists of local folding regularities maintained by hydrogen bonds. Three classes (patterns) characterize the SS: α-helices, β-sheets (extended strands) and coils (the remaining non-regular conformations).
Given a protein sequence, the Protein Secondary Structure Prediction (PSSP) problem is to predict whether each amino acid also known as residue is in either α-helix (H), β-sheet (E) or coil (C) state. From the point of view of pattern recognition, this can be seen as a 3-class discrimination problem.
Brookhaven Protein Data Bank of solved structures, available at http://www.rcsb.org/pdb/.
Since the 1970s, many approaches for PSSP have been developed. The first efforts predicted the SS solely from the amino acid sequence, using simple linear statistics and expert rules that took into account only the physicochemical properties of single amino acids.2–4 The average three-state per-residue score
Currently, the performance of the best PSSP methods in terms of Q3 score is between 75% and 80%, depending on the training and test datasets. All the recent above-mentioned methods use homology as the key factor in determining the SS, and consensus to improve the prediction quality. It therefore makes sense to tackle the PSSP problem using these two factors. The concept of combining classifiers to benefit from their strengths and thereby improve performance has long been established. However, choosing an effective combiner can be difficult, in addition to the problem of generating a set of classifiers with reasonably good individual performance and independent (unbiased) predictions.
In this study, we investigate how performance can be enhanced by combining k-Nearest Neighbors, Artificial Neural Networks and Multi-class Support Vector Machines, incorporating position-specific scoring matrix (PSSM) profiles, to predict the SS of globular proteins. To do so, two variants of the majority voting rule are used: Simple Majority Voting (SMV), which counts the votes and allocates a queried residue to the class that gains the majority of votes, and Weighted Majority Voting (WMV), which weights each vote by the corresponding classifier's prediction accuracy. If two or more classes gain the same number of votes (a conflicting decision), we use two strategies. The first is the traditional SMV scheme, which chooses one of the tied classes arbitrarily (SMV); the second, which we named "Influenced Majority Vote" (IMV) in our experiments, assigns the class predicted by the best classifier in the ensemble. We also used a heuristic-based filter to refine the predictions obtained by each scheme of the proposed ensemble method.
The remainder of this paper is organized into three sections starting with section 2 which describes in detail the general framework of the proposed system. The experimental results are summarized in section 3, followed by a conclusion and an overview of future work drawn in section 4.
Materials and Methods
This section describes the general framework used to achieve the goal of the analysis described in this article. It first introduces the PSSP problem, presents the datasets used, and then details the methods and algorithms used to implement and test the proposed system for predicting the secondary structures of globular proteins.
Data encoding and PSSP problem formulation
Within known protein structures, the most frequent secondary structures are α-helices and β-sheets. Besides these two common structures, six other, rarer types of structure are further proposed by the Dictionary of Secondary Structures of Proteins program (DSSP),30 which are explicitly
The primary structure, described as a sequence of amino acids in the polypeptide chain, can be represented as a string over the finite alphabet of the 20 standard amino acid one-letter codes. Let
Each amino acid in a sequence of length n belongs to one of the three structural conformations H, E or C. A large number of methods for establishing this mapping have been developed, yet perfect discrimination of the three classes (H, E, C) remains a challenge, and a computational method that can effectively translate the sequence into an accurate structure is still needed. The major supervised-learning methods have adopted a windowing technique for generating training and testing sequences. A window consisting of a fixed number of amino acids centred on the residue to be predicted (the queried residue) is used to incorporate the influence of the neighbors into the prediction. To generate the different input sequences, the window slides over the protein sequence, one amino acid at a time. The output value represents the predicted secondary structure of the queried residue. Many studies have demonstrated that the window size influences the quality of the prediction. Qian and Sejnowski7 varied the window size from 1 to 21 and empirically found 13 to be the most effective. In this study, we used the same window size.
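As a concrete illustration, the windowing scheme above can be sketched as follows. The padding symbol and the function names are our own choices, not taken from the original implementation:

```python
# Sketch of the sliding-window encoding described above (window size 13).
# Residues near the chain ends are padded with a null symbol "X" so that
# every residue gets a full-length window.

def sliding_windows(sequence, labels, window=13):
    """Yield one (window, label) training pair per residue."""
    half = window // 2
    padded = "X" * half + sequence + "X" * half
    for i, label in enumerate(labels):
        yield padded[i:i + window], label

# Toy fragment with an invented H/E/C annotation:
pairs = list(sliding_windows("MKTAYIAK", "CCHHHHCC"))
```

Each residue thus yields one fixed-length example whose label is the structural state of the central position.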
PSSM generation
Multiple alignments can produce position-specific profiles, which provide crucial information about structure. Using position-specific profiles as inputs for structure prediction can improve prediction accuracy by 5%–10% over the sequence alone. Position-specific profiles are mainly generated by position-specific iterated BLAST (PSI-BLAST) searches34 and Hidden Markov Models.35 The multiple alignment is converted into a profile: for each position, the vector of amino acid frequencies is calculated from the alignment. Here, we performed PSI-BLAST to obtain the profile for each sequence, setting the parameter
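As a hedged sketch of how such a profile is typically used downstream: an ASCII PSSM (for example, as emitted by BLAST+ `psiblast` with `-out_ascii_pssm`; the exact PSI-BLAST invocation used in this study is not stated here) holds one row of 20 log-odds scores per residue, and PSSP methods commonly squash these unbounded scores into (0, 1) with a logistic function before feeding them to a classifier. The scores below are invented illustration values.

```python
import math

def scale(score):
    """Map a PSSM log-odds score into (0, 1) via the logistic function."""
    return 1.0 / (1.0 + math.exp(-score))

row = [-3, -1, 0, 2, 5]                     # fragment of one PSSM row (invented)
encoded = [round(scale(s), 3) for s in row]
```

This keeps all classifier inputs on a common bounded scale regardless of how extreme an individual log-odds score is.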
Training and test datasets
Many datasets have been used in PSSP experiments, and different prediction accuracies have been reported. Predicting the structure of small proteins is naturally easier and faster than predicting that of large proteins. In this study, we train and test the proposed system on sequences of different sizes using two widely used datasets. The first is the dataset of 126 globular protein chains proposed by Rost and Sander,8 referred to as the RS126 dataset. It was extracted from the initial RS130 dataset (excluding the four membrane protein chains 1
Ensemble method
In this section, we first describe the proposed ensemble method, starting with the ensemble members, and then introduce the three voting combination schemes used in our experiments: SMV, IMV and WMV. The heuristic-based filter used to refine the predictions is also described in this section.
K-Nearest Neighbor classifier
The k-Nearest Neighbor (k-NN) algorithm classifies an example by assigning it the label most frequently represented among its k nearest training examples.
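A minimal k-NN sketch over small feature vectors; the data points, the distance measure (squared Euclidean) and the choice of k are illustrative, not those used in the study:

```python
from collections import Counter

def knn_predict(query, train, k=3):
    """train: list of (vector, label); return the majority label of the k nearest."""
    nearest = sorted(train, key=lambda p: sum((a - b) ** 2
                                              for a, b in zip(query, p[0])))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((0.0, 0.0), "H"), ((0.1, 0.0), "H"), ((1.0, 1.0), "E"),
         ((0.9, 1.0), "E"), ((0.0, 1.0), "C")]
label = knn_predict((0.05, 0.1), train)   # two of the three nearest are "H"
```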
Feed-forward Neural Networks
The Multi-Layer Perceptron (MLP) and the Radial Basis Function Neural Network (RBFNN) are the most commonly used feed-forward Neural Network models in computational biology, and for many years they produced the most accurate SS predictions. The MLP is an improvement of the Perceptron39 that includes one or more transition layers known as hidden layers. The units in the input layer are connected to units in the hidden layers, which in turn are connected to units in the output layer; each connection is associated with a weight. An MLP unit takes a number of real-valued inputs and generates a single real-valued output, according to an activation (transfer) function applied to the weighted sum of the outputs of the units in the preceding layer. The most commonly used activation function in this network is the sigmoid function.40 The learning algorithm can be expressed using the generalized Delta rule and back-propagation gradient descent.41 In the RBFNN, each hidden unit implements a radially activated function whose output is inversely proportional to the distance from the unit's center. In pattern classification applications, the most used radially activated function is the Gaussian.42,43 The Gaussian centers critically influence the performance of the RBFNN. Poggio and Girosi43 showed that using all the training data as centers may lead to network over-fitting when the number of data points becomes too large, and that the gradient descent approach used to update the RBFNN centers moves the centers towards the majority of the data. To avoid these situations, they suggested using a clustering algorithm to position the centers.
For both NNs, the global error can be defined either as a sum-of-squares error function or as a log-likelihood cost function.
The MLP in this study consists of an input layer, a hidden layer and an output layer. It is trained using the standard back-propagation algorithm, and the hidden- and output-layer units have sigmoid activation functions. For the RBFNN, we adopted the QuickRBF package,44 which uses an efficient least-mean-square-error method, based on the Cholesky decomposition, to determine the weights associated with the links between the hidden layer and the output layer. It is capable of delivering the same level of prediction accuracy as the SVM classifier.
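One forward pass through a tiny MLP of the kind described above, a weighted sum followed by a sigmoid at the hidden and output layers, can be sketched as follows. The weights are arbitrary illustration values, not trained ones:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_hidden, w_out):
    """Propagate input x through one hidden layer and one output layer."""
    hidden = [sigmoid(sum(wi * xi for wi, xi in zip(w, x))) for w in w_hidden]
    return [sigmoid(sum(wi * hi for wi, hi in zip(w, hidden))) for w in w_out]

x = [0.2, 0.7]                                   # two input features
w_hidden = [[0.5, -0.3], [0.8, 0.1]]             # two hidden units
w_out = [[1.0, -1.0], [0.3, 0.6], [-0.5, 0.9]]   # three outputs: H, E, C scores
scores = forward(x, w_hidden, w_out)
```

Training with back-propagation then adjusts `w_hidden` and `w_out` to reduce the chosen cost function.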
Support vector machines and multi-class support vector machines
Recently, the Support Vector Machine (SVM) technique has been applied successfully in many applications, and the highest accuracies to date have been achieved by approaches using it. It was designed by Vladimir Vapnik and his colleagues45,46 as a direct implementation of the Structural Risk Minimisation (SRM) inductive principle,47 and was originally intended for binary (two-class) classification. SVMs look for the optimal separating hyperplane between two classes by maximizing the margin between them. Basically, SVMs can only solve binary classification problems; they treat the linear case (optimal hyperplane) directly, and the non-linear case by introducing a kernel satisfying Mercer's conditions.48
In the non-linear case, non-negative slack variables ζ
where
The function κ is called the kernel. The output function for a class
with b = (bj)1≤j≤Q and w = (wj)1≤j≤Q. Thus a hyperplane is associated with each class, and the classification function consists of assigning an example
In the SVM formulation (bi-class case), the training algorithm amounts to finding the optimal values of the pairs (w, b) for a given choice of the kernel κ and a soft-margin constant
where δ is the Kronecker symbol.
LIBSVM
LIBSVM software37 supports various SVM formulations for classification, regression, and distribution estimation. In this study, C-Support Vector Classification (C-SVC)45,46 using the one-against-one (pairwise coupling) technique, which constructs one SVM for each pair of classes, has been used. For a problem with Q classes,
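The one-against-one construction can be sketched as follows: for Q classes, one binary classifier is trained per pair of classes (Q(Q−1)/2 in total), and each casts one vote per example. The "classifiers" here are stand-in decisions, not real SVMs:

```python
from collections import Counter
from itertools import combinations

classes = ["H", "E", "C"]
pairs = list(combinations(classes, 2))      # 3 pairwise classifiers for Q = 3

def predict(binary_decisions):
    """binary_decisions maps each (a, b) pair to the class that pair's SVM chose."""
    votes = Counter(binary_decisions[pair] for pair in pairs)
    return votes.most_common(1)[0][0]

# Illustrative pairwise outcomes for one residue:
decisions = {("H", "E"): "H", ("H", "C"): "H", ("E", "C"): "C"}
```

The residue is assigned the class that wins the most pairwise contests.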
Majority voting combination rule
Various studies have used majority voting for classifier combination. In this study, two variants of the majority vote have been tested: Simple Majority Voting (SMV) and Weighted Majority Voting (WMV), introduced in detail in this section.
Let us consider χ a set of
where
The predicted class
If two or more classes gain the same vote (conflicting decision), two strategies are used. The first strategy chooses one of the classes arbitrarily (SMV). The second strategy that we named in our experiments “Influenced Majority Vote” (IMV) chooses the class predicted by the best classifier in the ensemble.
Certain classifiers may be more competent than others. Weighting the decisions of the classifiers by their prediction accuracies reinforces the decisions of the competent classifiers, giving them more importance in the vote, and may consequently improve the overall performance beyond what can be obtained by SMV (where all classifiers have identical weights). In WMV, each vote is weighted by the prediction accuracy of the corresponding classifier, denoted here Acc. The count of total votes for a class ck in (12) can then be redefined as:
The class receiving the greatest total weight is chosen.
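The three voting schemes can be sketched as follows. `preds` holds one label per classifier, ordered so that index 0 is the best classifier; `acc` holds each classifier's accuracy. All numbers are illustrative, not measured values from the study:

```python
from collections import Counter

def smv(preds):
    """Simple majority vote; ties resolved arbitrarily (first encountered)."""
    return Counter(preds).most_common(1)[0][0]

def imv(preds):
    """Influenced majority vote: ties resolved by the best classifier (index 0)."""
    counts = Counter(preds)
    top = counts.most_common(1)[0][1]
    tied = [c for c, n in counts.items() if n == top]
    return preds[0] if len(tied) > 1 else tied[0]

def wmv(preds, acc):
    """Weighted majority vote: each vote weighted by classifier accuracy."""
    weights = Counter()
    for p, a in zip(preds, acc):
        weights[p] += a
    return weights.most_common(1)[0][0]

preds = ["H", "E", "H", "E", "C", "C"]      # a three-way tie under plain counting
acc = [0.78, 0.75, 0.74, 0.72, 0.70, 0.69]
```

On this tied example, IMV defers to the best classifier (which voted "H"), and WMV also selects "H" because the two "H" voters carry the largest total weight.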
The three voting schemes are evaluated in this study using different prediction accuracy measures.
Refining the predictions
In addition to combining classifiers, we apply a filter based on explicit rules to "clean up" the predictions obtained by the ensemble method. The filter makes the predictions more realistic by excluding physically unlikely structures. Many PSSP methods filter or smooth the predictions using different strategies before assessing prediction accuracy.8,12,56 The main condition for obtaining a coherent structure is that the shortest run of the consecutive state H must be three residues, and two for the consecutive state E. Filtering out all isolated helix and strand residues can improve the performance of the method. We conceived a heuristic-based filter for the predictions obtained by the ensemble method: we used the filter proposed by Salamov and Solovyev12 and added the following rules: [EHE] → [EEE], [HEH] → [HHH], [HCH] → [HHH], [ECE] → [EEE], [HEEH] → [HHHH]; all helices of length one or two are converted to coils, and all strands of length one are converted to coils.
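The rules above can be sketched as string rewrites on the predicted H/E/C sequence. The rule ordering and the use of simple substring replacement are our own simplifications; the published filter12 may differ in detail:

```python
import re

def smooth(pred):
    """Apply the pattern rules, then remove too-short helix and strand runs."""
    rules = [("EHE", "EEE"), ("HEH", "HHH"), ("HCH", "HHH"),
             ("ECE", "EEE"), ("HEEH", "HHHH")]
    for pat, rep in rules:
        pred = pred.replace(pat, rep)
    # Helices shorter than 3 residues and strands shorter than 2 become coil.
    pred = re.sub(r"(?<!H)H{1,2}(?!H)", lambda m: "C" * len(m.group()), pred)
    pred = re.sub(r"(?<!E)E(?!E)", "C", pred)
    return pred
```

For example, `smooth("CCEHECCHCCEEC")` first rewrites the EHE pattern to EEE and then converts the isolated helix residue to coil.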
Prediction accuracy assessment
In this article, the experimental results are reported with several methods for evaluating the performance of both the ensemble method and the selected ensemble members on the RS126 and CB513 datasets. The methods used are Cross-Validation (section 2.3), the Confusion Matrix and three accuracy measures. A description of these methods is given in the following subsections.
Confusion matrix
In order to compute the statistics, a matrix of size 3 × 3 named confusion matrix (contingency table) has been used,
Prediction accuracy measurements
Several standard accuracy measures have been suggested in the literature. The most popular is the three-state overall per-residue accuracy, known as Q3.
Per-residue prediction accuracy
For each type of secondary structure the per-residue accuracy can also be calculated, for example:
with
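These per-residue measures can be made concrete with a small sketch that derives Q3 and the per-state accuracies from a 3 × 3 confusion matrix. The counts are invented for the example:

```python
# M[i][j] counts residues observed in state i and predicted as state j,
# with states ordered H, E, C.
states = ["H", "E", "C"]
M = [[60, 5, 15],    # observed H
     [8, 40, 12],    # observed E
     [10, 10, 80]]   # observed C

total = sum(sum(row) for row in M)
q3 = sum(M[i][i] for i in range(3)) / total           # overall accuracy
per_state = {s: M[i][i] / sum(M[i]) for i, s in enumerate(states)}
```

Here Q3 is the fraction of all residues on the matrix diagonal, while each per-state accuracy divides the correctly predicted residues of that state by the number observed in it.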
The Pearson's Matthews correlation coefficients for a particular state
With
The Segment OVerlap (SOV) measure, proposed firstly by Rost et al9 and modified by Zemla et al,59 is based on the average overlap between the observed and the predicted segments.
where,
Here,
Experimental Results and Discussion
In this section, we report the results of our experiments obtained using seven-fold cross-validation on both the RS126 and CB513 datasets. We first evaluate the six individual classifiers with and without refining their outputs by applying the proposed filter, and then evaluate whether, and by how much, voting can improve the prediction accuracy. Moreover, a comparison is made between the three voting schemes with and without filtering the predictions. The experiments were conducted in C on a Linux platform. For the M-SVMs, the use of a proper kernel function with optimal parameters can achieve high prediction accuracy. However, there are no effective theoretical methods for model selection, and optimizing the kernel parameters may become quite difficult due to computing time. In this study, we used the Gaussian radial basis function (RBF) kernel for C-SVC, the M-SVM of Weston and Watkins (M-SVMWW) and the M-SVM of Crammer and Singer (M-SVMCS). This kernel choice is motivated by the fact that the RBF kernel is more suitable for complex classification problems. The parameters γ and
Prediction accuracies of the six individual classifiers
The six selected classifiers have been evaluated individually so as to estimate their performance in predicting the three conformational states. Table 1 shows that the best individual classifier for RS126 dataset is M-SVM
Performance comparison of the six individual classifiers for RS126 dataset.
Performance comparison of the six individual classifiers for RS126 dataset after applying the filter to the predictions.
The results obtained on the CB513 dataset are reported in Tables 3 and 4. Both QuickRBF, M-SVM
Performance comparison of the six individual classifiers for CB513 dataset.
Performance comparison of the six individual classifiers for CB513 dataset after applying the filter to the predictions.
Generally, coils are predicted quite well, helices moderately well, and strands rather poorly. Our analysis shows the same trend, and the results on both RS126 and CB513 confirm it: the coil predictions have high accuracy compared with both the β-sheet and the α-helix predictions. The results also show that each classifier has its particular strengths and weaknesses in predicting the structural conformations. The high number of predicted coils here is mainly due to the fact that shorter secondary structure elements are harder to predict.
Prediction accuracies of the ensemble method using the three voting schemes
Tables 5 and 7 report the results obtained by the three voting schemes on RS126 and CB513 without refining the obtained outputs. We can see that Influenced Majority Voting (IMV) gives better results than Simple Majority Voting (SMV), but the results given by Weighted Majority Voting (WMV) are the best. By combining the classifiers' predictions, the SOV increased significantly, which means that the quality of segment prediction improved. The best

The Q3 (A) and SOV (B) scores for the three voting schemes on the RS126 dataset. QH/E/C and SOVH/E/C are respectively the predicted Q and SOV scores for each conformational state (helix, strand and coil).

The Q3 (A) and SOV (B) scores for the three voting schemes on the CB513 dataset. QH/E/C and SOVH/E/C are respectively the predicted Q and SOV scores for each conformational state (helix, strand and coil).
Performance comparison of the three combination schemes for RS126 dataset.
Performance comparison of the three combination schemes after applying the filter to the predictions for RS126 dataset.
Performance comparison of the three combination schemes for CB513 dataset.
Performance comparison of the three combination schemes after applying the filter to the predictions for CB513 dataset.
Figures 1 and 2 show that the
All the experimental results reported above show that the proposed system exploits the prediction power of the selected classifiers, but it is also influenced by the worst classifiers. In all cases, the proposed system works more effectively than the individual classifiers. In this study, we have presented a real case demonstrating the advantage of combining classifiers. Our results, as shown in all the tables, demonstrate substantially higher accuracy for the M-SVMs compared with MLP, k-NN and C-SVC. The several tests conducted provide evidence that combining highly accurate classifiers improves performance significantly and that majority voting is a robust way to combine powerful classifiers. Figure 3 summarizes the performance of the individual classifiers and of the three schemes of the proposed ensemble method, with and without refining the predictions.

Comparison of prediction accuracies (y axis) between the three combination schemes and the individual classifiers (x axis) with and without applying the filter to the predictions on both RS126 and CB513 datasets.
The aim of combining multiple classifiers is to obtain better performance than any individual classifier. Here, the experimental evaluation on the two standard datasets RS126 and CB513 shows that the proposed framework indeed improves the prediction quality. The results reported here are highly encouraging, since they reveal that the proposed framework is capable of producing higher Q3 and SOV scores than PHD, DSC, PREDATOR, NNSSP and Jpred, as well as previously developed SVM-based methods, and scores similar to those of the current prominent PSSP methods such as PSIPRED, SSpro and SAM-T99sec. It is difficult to compare our results exactly against those reported in recently published studies, because it would not be fair to compare only the absolute
Conclusion
Ensemble classifier combination has been extensively studied in the last decade and has been shown to improve performance in diverse applications. Protein secondary structure prediction is an important step towards predicting the tertiary structure of proteins, and hence their function; an accurate prediction is therefore strongly required. The effectiveness of the methods used depends crucially on their parameters, and a problem that arises in this study is the selection of ideal parameters for each predictor. We believe that at least three optimization processes matter for prediction quality in PSSP: the encoding scheme, the sliding-window size and parameter optimization. The interesting point emerging from our study is that when multiple classifiers are combined, a gain in performance is not always guaranteed, especially when poorly performing classifiers are included. The voting results show that when lower-quality classifiers are added, the better classifiers are steadily drowned out. The gain in performance is therefore more pronounced when better-performing classifiers are included, while including poorly performing classifiers decreases the prediction accuracy. Consequently, the relative merits of keeping the best and eliminating the weakest classifiers can be considered. This study also shows that SVMs remain the major competitor of ANNs in the field of machine learning. The prediction results show that β—
Abbreviations
ANN: Artificial Neural Network;
BLAST: Basic Local Alignment Search Tool;
BLOSUM: BLOck SUbstitution Matrix;
GOR: Garnier-Osguthorpe-Robson;
MLP: Multi-Layer Perceptron;
M-SVM: Multi-class Support Vector Machine;
IMV: Influenced Majority Voting;
PSI-BLAST: Position-Specific Iterated BLAST;
RBFNN: Radial Basis Function Neural Network;
SMV: Simple Majority Voting;
SVM: Support Vector Machine;
WMV: Weighted Majority Voting.
Disclosures
Author(s) have provided signed confirmations to the publisher of their compliance with all applicable legal and ethical obligations in respect to declaration of conflicts of interest, funding, authorship and contributorship, and compliance with ethical requirements in respect to treatment of human and animal test subjects. If this article contains identifiable human subject(s) author(s) were required to supply signed patient consent prior to publication. Author(s) have confirmed that the published article is unique and not under consideration nor published by any other publication and that they have consent to reproduce any copyrighted material. The peer reviewers declared no conflicts of interest.
Subset of the structures in the PDB that does not contain (highly) homologous sequences, available at http://swift.cmbi.kun.nl/swift/pdbsel/.
Structural Classification of Proteins, available at http://scop.mrc-lmb.cam.ac.uk/scop/.
