Abstract
Keywords
Introduction
With the drastic increase in the amount of data that must be processed in very short periods of time, new problems have appeared. Chromosome classification, spam filtering, selecting which advertisement to show a person on a web page, recognition of human activities and protein structure prediction are a few applications that involve immense amounts of high-dimensional data.1,2 Sometimes the dimension and/or the number of data samples is so large that storing the dataset in a computer becomes impossible. This problem is addressed by large-scale classification learning, which aims to find a function relating the data to their corresponding class labels for an amount of data that cannot be stored in a modern computer's memory.3 The main constraint is the amount of time an algorithm takes to obtain an accurate result, rather than the number of samples to process.4
A typical problem that support vector machines (SVMs) have to face while working with a large dataset is that learning algorithms are typically quadratic and require several scans of a dataset. Three common strategies can be used to reduce this practical complexity:3,4
1. Solving several smaller problems by working on subsets of the training data instead of the complete large dataset.
2. Parallelizing the learning algorithm.
3. Designing a less complex algorithm that gives an approximate solution with equivalent or superior performance.
This work presents a novel approach to solving large-scale learning problems by designing a less complex algorithm to train a large-scale SVM. Our approach uses a combination of Kernel-Adatron (KA) and some state-of-the-art evolutionary algorithms (EAs), to solve, principally, protein structure prediction (PSP) and other large-scale learning problems. 5 The obtained algorithm works with small sub-problems, has low computational complexity and is easy to implement; in addition to providing accurate generalization results, such methodology is also highly parallelizable.
Support vector machines
Since the SVM algorithm was first introduced by Vladimir Vapnik in 1995, it has been one of the most popular methods for classification because of its simple model, its use of kernel functions and the convexity of the function to optimize (it has only a global minimum).6
SVM's characteristics make it more appealing for classification problems with high precision requirements than other models such as the multilayer perceptron, radial basis function network, Hopfield network, etc.7–9 Many large-scale training algorithms have been proposed for SVMs, with the main idea of minimizing a regularized risk function R and maximizing the margin of separation between classes (Fig. 1) by solving Equation 1.
A binary dataset is composed of positive and negative training samples (x_i, y_i), i = 1, …, m, with labels y_i ∈ {−1, +1}. In its primal form, the problem is stated as minimizing the regularized risk

min_w R(w) = (λ/2)‖w‖² + (1/m) Σ_{i=1}^{m} max(0, 1 − y_i⟨w, x_i⟩),   (1)

where λ > 0 controls the trade-off between the margin and the training error. The equivalent dual formulation is

max_α Σ_{i=1}^{m} α_i − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} α_i α_j y_i y_j K(x_i, x_j),   subject to 0 ≤ α_i ≤ C,   (2)

where the α_i are the Lagrange multipliers and K(·, ·) is a kernel function.
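As a hedged illustration of the primal objective in Equation 1, the regularized hinge-loss risk can be evaluated directly. The sketch below assumes a linear SVM with regularization strength λ (`lam`); all names are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def primal_objective(w, X, y, lam=0.1):
    """Regularized hinge-loss risk of a linear SVM (Equation 1 style).

    w   : weight vector
    X   : (m, d) matrix of training samples
    y   : labels in {-1, +1}
    lam : assumed regularization strength (illustrative symbol)
    """
    margins = y * (X @ w)                   # y_i <w, x_i>
    hinge = np.maximum(0.0, 1.0 - margins)  # hinge loss per sample
    return 0.5 * lam * np.dot(w, w) + hinge.mean()

# A perfectly separating w with margin >= 1 leaves only the regularizer:
X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, -1.0])
w = np.array([1.0, 0.0])
obj = primal_objective(w, X, y)  # hinge terms vanish; 0.5 * 0.1 * 1 = 0.05
```

Minimizing this objective trades off the norm of w (the inverse margin) against the average violation of the margin constraints.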
The dual formulation has the same optimal values as the primal, but the main advantage of this representation is the use of the “kernel trick” (see Fig. 2). Since SVMs can only classify data in a linearly separable feature space, the role of the kernel function is to induce such a feature space by implicitly mapping the training data into a higher-dimensional space where the data are linearly separable.14,15 There are two main approaches for large-scale SVM training algorithms: those that solve the primal SVM formulation, shown in Equation 1, by a gradient-based method (primal estimated sub-gradient solver for SVM, careful quasi-Newton stochastic gradient descent, forward looking sub-gradient, etc.) and those that solve the dual formulation of Equation 2 by quadratic programming (QP) methods (SVM for multivariate performance measure, library for large linear classification, bundle method for risk minimization, etc.).4,11,16,17 There are options that do not fall into these categories, such as the optimized cutting plane algorithm (OCA), which uses an improved cutting plane technique and is based on the work of SVM for multivariate performance measure (SVMperf) and the bundle method for risk minimization. OCA converges fast compared to methods like stochastic gradient descent and the primal estimated sub-gradient solver for SVM (Pegasos), has shown good classification results and offers computational sublinear scaling.13
Nevertheless, the use of a QP solver to solve a linear constraint problem (where each linear constraint is a cutting plane) makes it a complex approach to implement, even if the number of constraints is drastically lower than the data dimensionality. Gradient-based methods tend to be fast algorithms (especially those that use stochastic gradient descent) and have good generalization capabilities. However, they are highly dependent on the step size to obtain a good speed of convergence. If the step size is not chosen carefully or lacks an adjustment criterion, convergence can be slow.4
The dual QP methods can handle kernels easily and can converge quickly by combining them with other optimization techniques. The main disadvantage of these methods is the computational complexity of the quadratic programming solvers and the fact that they are more difficult to implement than a gradient descent method or an EA.4,18–21
Datasets that are not linearly separable may be separated by a hyperplane in higher dimensions after applying the kernel trick.
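The identity behind the kernel trick can be checked numerically: a kernel evaluates an inner product in a higher-dimensional feature space without ever constructing that space explicitly. A minimal sketch with a degree-2 polynomial kernel (chosen only for illustration; the approach in this paper mainly uses a linear kernel):

```python
import numpy as np

def poly_kernel(a, b):
    """Homogeneous degree-2 polynomial kernel k(a, b) = <a, b>^2."""
    return np.dot(a, b) ** 2

def phi(x):
    """Explicit feature map into R^3 that induces the same inner product."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])
# <phi(a), phi(b)> equals k(a, b): the kernel computes the
# higher-dimensional inner product directly from the inputs.
lhs = np.dot(phi(a), phi(b))
rhs = poly_kernel(a, b)
```

Data that are not linearly separable in the input space (e.g. XOR-like patterns) can become separable under such a map, which is exactly what the figure depicts.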
In the past years, several evolutionary computation-based training algorithms for SVMs have been proposed.22–25 These algorithms solve the dual formulation (Equation 2), tend to be easy to implement and have shown good results for small amounts of data. The disadvantage of their implementation is their computational complexity, which is dominated by the kernel evaluations and grows at least quadratically with the number of training samples.
Evolutionary algorithms
EAs are global optimization methods that scale well to higher dimensional problems. They are robust with respect to noisy evaluation functions, and can be implemented and parallelized with relative ease. 26 Even when premature convergence to a local extremum may occur, it has been proven that an algorithm that is “not quite good” or “poor” at optimization can be excellent at generalization for a large-scale learning task. 4
This work presents a series of parallelized algorithms based on the KA algorithm as fitness function combined with Artificial Bee Colony (ABC), micro-Artificial Bee Colony (μABC), Differential Evolution (DE) and Particle Swarm Optimization (PSO), in order to solve the SVM learning problem. The EA algorithms combined with KA were chosen based on good results shown in other areas, their exploration and exploitation capabilities, and low computational complexity.27–33
Large-scale training algorithms for SVMs using EAs are a promising field that has not been well explored. Although parallelization is a highly desirable approach to the large-scale classification problem, most large-scale SVM training algorithms do not take it into consideration to obtain better results in a shorter amount of time. This is in part because testing complex parallel applications to guarantee correct behavior is challenging; moreover, in scenarios where inherent data dependencies exist, a task cannot be partitioned because of its sequential constraints, making parallelization less convenient.3,4 One of the main goals in parallelizing an EA is to reduce the search time. This is a very important aspect for some classes of problems with firm requirements on search time, such as dynamic optimization problems and real-time planning.34
Protein structure prediction
A protein structure (PS) is the three-dimensional arrangement of atoms in a protein molecule.35
These structures arise because particular sequences of amino acids in polypeptide chains fold to generate, from linear chains, compact domains with specific 3D structures (Fig. 3). The folded domains can serve as modules for building up large assemblies such as virus particles or muscle fibers, or they can provide specific catalytic or binding sites, as found in enzymes or proteins that carry oxygen or regulate the function of DNA. PSP predicts the three-dimensional structure of a protein by using its primary structure, the amino-acid sequence, to predict its folding and its secondary, tertiary and quaternary structure.36,37 This makes PSP an essential tool in proteomics, since the molecular function of a protein depends on its three-dimensional structure, which is often unknown.
Left: Amino-acid sequence of a protein. Right: A representation of a three-dimensional structure of a protein.
In the past 50 years there has been enormous growth in the available information regarding genomic sequences, to the point that the pace is difficult to follow. At present, more protein coding sequences are known than their three-dimensional structures. Protein folding is a large-scale problem because 20 different amino acids can generate such a large number of combinations, and there are also many ways for different amino-acid sequences to generate similar structural domains in proteins. 35 It has been suggested that many proteins contain enough information in their amino-acid sequences to determine their three-dimensional structure, making possible the prediction of new three-dimensional structures from an amino-acid sequence since it is known that sequence similarity does confer structural similarity.38,39 Furthermore, to understand the biological function of proteins it is necessary to deduce or predict the three-dimensional structure from the amino-acid sequence, since their functional properties depend upon their structures. If the predictions are accurate enough, the gap between the growing amount of sequence information and their corresponding structures can be diminished.
PSP is, overall, an optimization problem where each amino acid can be characterized by several structural features. A good prediction of these features helps to obtain better models for the 3D-PSP problem. These features can be predicted as classification/regression problems, where the goal is to determine the shape (known as the fold) that a given amino-acid sequence will adopt. The problem can take two possible directions.40
The sequence may adopt a new fold, or bear resemblance to an existing fold in some protein structure database:
If two sequences share evolutionary ancestry, they are called homologous and the structure for the query protein can be built by choosing the structure of the known homologous sequence as a template. If no template structure is found for the query protein, the structure must be built from scratch.
Many methods have been developed to assign folds to a protein coding sequence.41 These methods can be divided into three groups: sequence-structure homology recognition methods, threading methods and machine-learning-based methods. Sequence-structure homology and threading methods align the target sequence onto known structural templates and calculate their sequence-structure compatibilities, using, for example, environment-specific substitution tables or pseudo-energy-based functions to calculate whether a template can be a fold of a target sequence.42,43 Sequence-structure homology methods (like FUGUE and 3DPSSM) fail when two proteins are structurally similar but share little in the way of sequence homology.44,45 Threading methods (such as THREADER) depend on data derived from solved structures, but the number of proteins whose structure has been solved is much smaller than the number of proteins that have been sequenced.46 Machine-learning-based methods for protein fold recognition, like the approach presented in this paper, see the problem as a fold classification problem, where a classifier is built using a dataset with feature sequences of proteins with a known structure. The classifier can assign a structure-based label to an unknown protein (one that has not yet been solved).
In recent years, a number of different SVM-based methods have been developed, producing better results than those obtained by pairwise sequence comparisons.40,43,47–49 These algorithms have made improvements in the detection of homologous structures with low levels of sequence similarity (remote homology detection).
Most of the state of the art for PSP classification is not focused on large-scale data, and even if some approaches have shown good results in small-scale PSP classification, most use versions of SVM reliant on kernel functions or neural networks; these do not scale well as the dimension and/or the number of data to classify grows.48,50–54 Because of this, some approaches tend to select an optimized feature subset with a moderate number of samples to improve the generalization performance of the SVM instead of using the complete dataset. This reduces the amount of data to compute, making it more practical to process with the original SVM approach, but also more time-consuming, since the dataset needs to be selectively preprocessed.55 These methods might be good for small or medium amounts of data, but protein folding, because of its combinatorial nature, can generate an immense amount of data to process. This is where an algorithm especially designed for large-scale data is needed. Sequencing projects are fast at producing protein coding sequences, but only a small portion of protein coding sequences have experimentally solved 3D structures. This is due to the expensive and time-consuming laboratory methods involved, such as X-ray crystallography and nuclear magnetic resonance (NMR).41 This problem is becoming more pressing as the number of known protein coding sequences expands as a result of genome and other sequencing projects.54 Because of this, tools that can predict PS rapidly and accurately, like the one presented in this paper, are needed. The full potential of genome projects will be realized only once we discover and understand the functions of these new proteins. This understanding will be facilitated by structural information for all or almost all proteins.
Methods
The kernel adatron algorithm
The Adaptive Perceptron algorithm (or Adatron) was first introduced by J. K. Anlauf and M. Biehl in 1989 for training linear classifiers.56 This algorithm was proposed as a method for calculating the largest margin classifier. The Adatron is used for online learning of perceptrons and guarantees convergence to an optimal solution, when one exists.57
In 1998, T. Friess et al. proposed the KA algorithm. Basically, the KA algorithm is an adaptation of the Adatron algorithm for classification with kernels in high-dimensional spaces.5 It combines the simplicity of implementation of the Adatron with an SVM's capability of working in nonlinear feature spaces to construct a large margin hyperplane using online learning.15
An advantage of KA algorithm is the use of gradient ascent instead of quadratic programming, which is easier to implement and significantly faster to calculate.
To implement the KA algorithm, it is necessary to calculate the dot product of each training sample with the current solution in feature space, expressed through the kernel as

z_i = Σ_{j=1}^{m} α_j y_j K(x_i, x_j),

which yields the margin γ_i = y_i z_i for sample i. To update the multipliers, a change in α_i is proposed as

δα_i = η(1 − γ_i),

where η is the step size and δα_i is the proposed update, applied subject to the non-negativity constraint

α_i ← max(0, α_i + δα_i),

where the maximum with zero keeps each multiplier feasible (α_i ≥ 0).
The pseudocode is described briefly in Algorithm 1.
Kernel Adatron Algorithm.
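As a sketch of Algorithm 1, the KA update described above can be written in a few lines. The step size and iteration count are illustrative, not the paper's settings, and the toy data and helper names are assumptions for the example:

```python
import numpy as np

def kernel_adatron(K, y, eta=0.1, iters=500):
    """Kernel Adatron sketch: gradient ascent on the alpha multipliers.

    K    : precomputed kernel matrix, K[i, j] = k(x_i, x_j)
    y    : labels in {-1, +1}
    eta  : step size; alphas are clipped at zero after each update.
    """
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(iters):
        for i in range(n):
            # margin z_i = sum_j alpha_j y_j k(x_i, x_j)
            z = np.dot(alpha * y, K[i])
            delta = eta * (1.0 - y[i] * z)      # delta alpha_i
            alpha[i] = max(0.0, alpha[i] + delta)
    return alpha

def predict(alpha, y, K_row):
    """Sign of the kernel expansion for one test point."""
    return np.sign(np.dot(alpha * y, K_row))

# Linearly separable toy data with a linear kernel:
X = np.array([[2.0], [1.5], [-1.5], [-2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = X @ X.T
alpha = kernel_adatron(K, y)
```

After training, all four toy samples are classified with the correct sign, and only non-negative multipliers remain, as the max-with-zero clipping guarantees.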
Evolutionary algorithms
Evolutionary computing is a subfield of artificial intelligence that includes a range of problem-solving techniques based on principles of biological evolution. The principles for using evolutive processes to solve optimization problems originated in the 1950s. 58
EAs are optimization methods within evolutionary computing that apply models based on biological evolution. In an EA, a population of possible solutions is composed of individuals that can be compared according to their aptitude for improving the population; the most qualified candidates are those that obtain better results under a fitness function evaluation. The population evolves through iterations in which a series of operations (reproduction, mutation, recombination or selection) is applied to its individuals; from these operations, a new set of potentially better solutions is generated. How the population evolves the possible solutions, and how it chooses the new global best solutions, is inherent to each EA.59
A swarm intelligence algorithm is based on swarms that occur in nature; PSO and ABC are two prominent swarm algorithms. There is a debate on whether swarm intelligence-based algorithms are EAs or not, but since one of the inventors of PSO refers to it as an EA, and swarm intelligence algorithms are executed in the same general way as EAs, by evolving a population of candidate problem solutions that improves with each iteration, we consider swarm intelligence to be an EA.59,60
As mentioned before, the KA algorithm requires the α multipliers to be adjusted iteratively until the margin is maximized; in the proposed approach, this search for the multipliers is carried out by the EAs.
The diagram explains the basic idea behind the algorithm described in this paper.
Artificial bee colony algorithm
The ABC algorithm was first introduced by Karaboga in 2005.64
This algorithm is based on honey bee foraging behavior. The bees are divided into three classes:
Employed: a bee with a food source.
Onlookers: bees that watch the dances of employed bees and choose food sources depending on the dances.
Scouts: employed bees that abandon their food source to find a new one.
Each food source is equivalent to a possible solution to the optimization problem and, as in nature, individuals are more likely to be attracted to sources with a larger amount of food (a better result obtained by the fitness function). For each food source, only one employed bee is assigned, and when it abandons its food source it becomes a scout. The number of the onlooker bees is also equal to the number of solutions in the population.
Initially, the ABC algorithm generates a random population of SN food sources x_i, i = 1, …, SN. Each employed bee searches near its source by perturbing a single randomly chosen dimension j with the help of a random neighbor x_k (k ≠ i):

v_ij = x_ij + φ_ij (x_ij − x_kj),

where φ_ij is a random number in [−1, 1]. If the amount of nectar of v_i (the value obtained by the fitness function) is greater than that of the old source, the employed bee takes it as its new food source.

Once the positions of the employed bees have been updated, the information is shared with the onlooker bees. Onlooker bees choose their food sources based on a probability proportional to fitness,

p_i = fit_i / Σ_{n=1}^{SN} fit_n.

A new potential food source is then generated around the chosen source in the same way as in the employed phase. If a position cannot be improved after a predefined number of trials (the limit parameter), the food source is abandoned and its employed bee becomes a scout, which is assigned a new random source within the search bounds.
The population of bees evolves through iterations and only the best bee is kept unaltered, whereas the rest of the bees are reinitialized with modifications based on the food source with the best fitness.
After the employed and onlooker phases have been completed (in the same way as in the ABC algorithm) the population is ranked according to its fitness values. The bee with the best fitness remains in its food source, while the second best fitness is moved to a position near to the best one in order to facilitate a local search. The bee with the worst position is initialized to a random position to avoid premature convergence.
Artificial Bee Colony Algorithm.
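A minimal, hedged sketch of the ABC loop, with employed, onlooker and scout phases, minimizing a toy objective. Parameter names (`sn`, `limit`) follow the standard ABC description, not the paper's tuned values:

```python
import random

def abc_minimize(f, dim, bounds, sn=10, limit=20, cycles=200, seed=1):
    """Artificial Bee Colony sketch: minimize f over [lo, hi]^dim."""
    rng = random.Random(seed)
    lo, hi = bounds
    foods = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(sn)]
    vals = [f(x) for x in foods]
    trials = [0] * sn

    def fitness(v):                      # standard ABC fitness transform
        return 1.0 / (1.0 + v) if v >= 0 else 1.0 + abs(v)

    def neighbor(i):                     # v_ij = x_ij + phi * (x_ij - x_kj)
        k = rng.choice([j for j in range(sn) if j != i])
        j = rng.randrange(dim)
        v = foods[i][:]
        phi = rng.uniform(-1.0, 1.0)
        v[j] = min(max(v[j] + phi * (v[j] - foods[k][j]), lo), hi)
        return v

    def try_improve(i):                  # greedy replacement of a source
        v = neighbor(i)
        fv = f(v)
        if fv < vals[i]:
            foods[i], vals[i], trials[i] = v, fv, 0
        else:
            trials[i] += 1

    for _ in range(cycles):
        for i in range(sn):              # employed phase
            try_improve(i)
        fits = [fitness(v) for v in vals]
        total = sum(fits)
        for _ in range(sn):              # onlooker phase (roulette wheel)
            r, acc, i = rng.uniform(0, total), 0.0, 0
            for i, ft in enumerate(fits):
                acc += ft
                if acc >= r:
                    break
            try_improve(i)
        for i in range(sn):              # scout phase
            if trials[i] > limit:
                foods[i] = [rng.uniform(lo, hi) for _ in range(dim)]
                vals[i], trials[i] = f(foods[i]), 0
    best = min(range(sn), key=lambda i: vals[i])
    return foods[best], vals[best]

# Sphere function: minimum 0 at the origin.
best_x, best_v = abc_minimize(lambda x: sum(t * t for t in x), 2, (-5, 5))
```

Within a couple of hundred cycles the colony concentrates around the sphere function's minimum at the origin.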
The value of φ is drawn uniformly at random from [−1, 1] for each perturbed dimension.
Micro Artificial Bee Colony Algorithm.
Differential evolution
Differential Evolution Algorithm.
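Since the DE update equations are not reproduced in the text here, the following is a standard DE/rand/1/bin sketch for reference; `F` (differential weight) and `CR` (crossover rate) carry illustrative values, not the paper's tuned settings:

```python
import random

def de_minimize(f, dim, bounds, np_=20, F=0.5, CR=0.9, gens=200, seed=1):
    """DE/rand/1/bin sketch: minimize f over [lo, hi]^dim."""
    rng = random.Random(seed)
    lo, hi = bounds
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(np_)]
    vals = [f(x) for x in pop]
    for _ in range(gens):
        for i in range(np_):
            # three distinct random vectors, none equal to the target
            a, b, c = rng.sample([j for j in range(np_) if j != i], 3)
            jrand = rng.randrange(dim)       # force one mutated component
            trial = []
            for j in range(dim):
                if j == jrand or rng.random() < CR:
                    v = pop[a][j] + F * (pop[b][j] - pop[c][j])
                else:
                    v = pop[i][j]
                trial.append(min(max(v, lo), hi))
            fv = f(trial)
            if fv <= vals[i]:                # greedy one-to-one selection
                pop[i], vals[i] = trial, fv
    best = min(range(np_), key=lambda i: vals[i])
    return pop[best], vals[best]

best_x, best_v = de_minimize(lambda x: sum(t * t for t in x), 2, (-5, 5))
```

The greedy one-to-one selection is what makes DE robust: a trial vector only replaces its target when it is at least as good.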
Particle swarm optimization
The PSO algorithm was first introduced by Kennedy and Eberhart in 1995.66
This algorithm exploits a population of potential solutions. The population of solutions is called a swarm and each individual in the swarm is called a particle. A swarm is defined as a set of N particles, each with a position x_i and a velocity v_i in the search space. At every iteration, the velocity of particle i is updated as

v_i ← ω v_i + c_1 φ_1 (p_i − x_i) + c_2 φ_2 (g − x_i),

where φ1 and φ2 are random variables uniformly distributed within [0,1]; ω is the inertia weight, c_1 and c_2 are the acceleration coefficients, p_i is the best position found so far by the particle and g is the best position found by the swarm. The position update is applied by Equation 17 based on the new velocity and the current position:

x_i ← x_i + v_i.   (17)
Particle Swarm Optimization.
To counter the uncontrolled growth in the magnitude of the velocities (the swarm explosion effect), it is often necessary to clamp the velocity at desirable levels, preventing particles from taking extremely large steps from their current positions.67
Although the use of a maximum velocity threshold improves the performance, by controlling the swarm explosions, without the inertia weight the swarm would not be able to concentrate its particles around the most promising solutions in the last phase of the optimization procedures. 67
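The velocity clamping and inertia weight discussed above can be combined in a short PSO sketch (all parameter values are illustrative, not the paper's settings):

```python
import random

def pso_minimize(f, dim, bounds, n=20, w=0.7, c1=1.5, c2=1.5,
                 iters=200, seed=1):
    """PSO sketch with inertia weight w and velocity clamping.
    phi1/phi2 are the uniform [0, 1] random factors from the text."""
    rng = random.Random(seed)
    lo, hi = bounds
    vmax = 0.2 * (hi - lo)           # clamp to curb the swarm explosion
    xs = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n)]
    vs = [[0.0] * dim for _ in range(n)]
    pbest = [x[:] for x in xs]       # per-particle best positions
    pval = [f(x) for x in xs]
    g = min(range(n), key=lambda i: pval[i])
    gbest, gval = pbest[g][:], pval[g]
    for _ in range(iters):
        for i in range(n):
            for j in range(dim):
                phi1, phi2 = rng.random(), rng.random()
                vs[i][j] = (w * vs[i][j]
                            + c1 * phi1 * (pbest[i][j] - xs[i][j])
                            + c2 * phi2 * (gbest[j] - xs[i][j]))
                vs[i][j] = min(max(vs[i][j], -vmax), vmax)   # clamping
                xs[i][j] = min(max(xs[i][j] + vs[i][j], lo), hi)
            fx = f(xs[i])
            if fx < pval[i]:
                pbest[i], pval[i] = xs[i][:], fx
                if fx < gval:
                    gbest, gval = xs[i][:], fx
    return gbest, gval

best_x, best_v = pso_minimize(lambda x: sum(t * t for t in x), 2, (-5, 5))
```

The inertia term w lets the swarm shift from exploration (large w) to exploitation (small w) around the best-known positions.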
Kernel adatron trained with evolutionary algorithms
The basic idea behind the proposed algorithms is to use a “divide and conquer” strategy, where each individual in the population of the EA (vector in DE, particle in PSO, food source in ABC and μABC) is seen as a sub-process, in this case a thread (Fig. 5), that will solve a part of the whole problem. Once each sub-process reaches a result, it is compared to the results of its peers to improve future results.
A thread is a component of a process. Multiple threads can exist within the same process; they are executed concurrently and share resources, such as memory.
DE, PSO, ABC and μABC are easily parallelized because each individual can be evaluated independently. The only phases in which the algorithms require communication between their individuals are those that involve mutation and the selection of the fittest individual. Also, the process of obtaining the kernel matrix can easily be parallelized by dividing it into several subtasks. For this approach, a linear kernel is used (represented by the dot product ⟨x_i, x_j⟩ between samples).
On each variant of the proposed algorithm, each individual in the population encodes a candidate vector of α multipliers that is evaluated by the KA-based fitness function.
The value
The main problem of KA is calculating the full kernel matrix, whose cost in time and memory grows quadratically with the number of training samples.
Fitness Function.
The number of data used by the fitness function, nvals, need not increase drastically with an increase in the number of training samples in the dataset. The fitness function complexity therefore depends on the fixed subset size nvals rather than on the full number of training samples.
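A hedged sketch of this idea: score a candidate α vector on only nvals randomly chosen samples, so one evaluation touches the pairwise kernel values of the subset alone, never the full kernel matrix. The exact fitness used in the paper is not reproduced here; a hinge-style margin penalty stands in for it, and all names are illustrative:

```python
import numpy as np

def subset_fitness(alpha, X, y, idx):
    """KA-style fitness sketch on a random subset.

    alpha : candidate multipliers, one per chosen sample
    idx   : indices of the nvals randomly chosen training samples;
            only their pairwise linear-kernel values are needed, so
            the cost depends on nvals, not the full training-set size.
    """
    Xs, ys = X[idx], y[idx]
    K = Xs @ Xs.T                      # linear kernel on the subset only
    margins = ys * (K @ (alpha * ys))  # y_i * sum_j alpha_j y_j k(x_i, x_j)
    # Penalize samples whose margin falls short of 1 (KA's target margin).
    return np.maximum(0.0, 1.0 - margins).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))         # stand-in training set
y = np.where(X[:, 0] > 0, 1.0, -1.0)
nvals = 50
idx = rng.choice(len(X), size=nvals, replace=False)
alpha = np.zeros(nvals)                # all-zero alphas give margin 0
fit = subset_fitness(alpha, X, y, idx) # every hinge term is 1, so fit = nvals
```

An EA individual would carry `alpha` as its genome and be ranked by this score, lower being better in this minimization form.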
Interdisciplinary computing and complex biosystems protein structure prediction benchmarks repository
The Interdisciplinary Computing and Complex BioSystems Protein Structure Prediction (ICOS PSP) benchmarks repository contains datasets suitable for testing classification algorithms based on real data.69,70 The dataset is based on PSP, aiming to predict the three-dimensional structures of amino-acid chains based on several structural features. The features are extracted by using a window of size Ω on the amino-acid chain to predict the Coordination Number (CN) of the residue at the center of the window.
Binning is the simplest method to discretize a continuous-valued attribute by creating a specified number of bins. The bins can be created by uniform frequency or uniform length. In both cases, the arity (the number of bins) must be specified in advance.
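Both binning schemes can be sketched briefly; the function names are illustrative, and the repository's exact discretization procedure may differ in edge handling:

```python
import numpy as np

def uniform_length_bins(values, arity):
    """Equal-width bins: split the value range into `arity` equal intervals."""
    lo, hi = values.min(), values.max()
    edges = np.linspace(lo, hi, arity + 1)
    # interior edges only; clip keeps the maximum inside the last bin
    return np.clip(np.digitize(values, edges[1:-1]), 0, arity - 1)

def uniform_frequency_bins(values, arity):
    """Equal-frequency bins: each bin receives (roughly) the same count."""
    qs = np.quantile(values, np.linspace(0, 1, arity + 1))
    return np.clip(np.digitize(values, qs[1:-1]), 0, arity - 1)

vals = np.array([0.0, 1.0, 2.0, 3.0, 10.0, 11.0])
# Equal width over [0, 11] with arity 2 splits at 5.5 -> [0, 0, 0, 0, 1, 1]
by_length = uniform_length_bins(vals, 2)
# Equal frequency with arity 2 splits at the median 2.5 -> [0, 0, 0, 1, 1, 1]
by_freq = uniform_frequency_bins(vals, 2)
```

The example makes the difference visible: with skewed data, equal-width bins can leave some bins almost empty, while equal-frequency bins balance the counts.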
The ICOS PSP benchmarks repository is available at http://ico2s.org/datasets/psp_benchmark.html.
The description of the dataset is available at http://ico2s.org/datasets/psp/motivation.html.
Results and Discussion
Brief description of large-scale datasets. Density denotes the average percentage of non-zero features of the data vectors.
Brief description of the ICOS PSP dataset.
From the PSP dataset, only the subsets discretized with uniform length and uniform frequency, with window sizes ranging from 7 to 9, were used for training and generalization, because of their density and dimensionality. The Astro-Ph dataset is focused on classifying abstracts of scientific papers from Physics ArXiv.74 The Aut-Avn and Real-Sim classification datasets come from a collection of UseNet articles from four discussion groups: simulated auto racing, simulated aviation, real autos and real aviation. CCAT and C11 are obtained from the Reuters RCV1 collection and address the problem of separating corporate-related articles.3 The Worm dataset focuses on classifying worm RNA splices.13
The experiments were performed on an Intel® Core i7-3770 machine with 16 GB of RAM and the Fedora Linux 20 operating system. The code was written in C++ using POSIX Threads and Armadillo.75 For the implementation of the algorithms, the Armadillo random number generator was used; the C++ random number generator was computationally more expensive and increased the execution time drastically.
The datasets can be obtained at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/data-sets/binary.html and http://users.cecs.anu.edu.au/~xzhang/data/
Intel and Intel Core are trademarks of Intel Corporation.
Fedora is a trademark of Red Hat, Inc.
For more information about POSIX: http://pubs.opengroup.org/onlinepubs/9699919799/
For the experiments done in this section, our approach is compared against algorithms like OCA and SVMperf.
Something to be taken into account is that it is much easier to implement and parallelize EA algorithms than to implement or parallelize the QP solvers used by OCA and SVMperf.
For the EA fitness function, a linear kernel was used in all the algorithms since it gave the best results in the generalization tests. Several tests were made using a radial basis function kernel. In general, the results showed a slight increase in the training accuracy (not sufficient to compete with the other approaches in the training phase), the generalization accuracy decreased slightly and the processing time increased because of the extra operations that had to be performed to calculate the kernel. Because of this, only the results obtained with the linear kernel are shown.
Prior to the tests, a subset of 4000 training samples was randomly extracted from each dataset and normalized for binary training classification and cross-validation. Because of hardware limitations, the amount of training samples used for each dataset is not large-scale, so that it could be stored in the computer's memory. However, since the KA algorithm possesses the guarantees given by support vector theory and, as explained later in this section, scales well with increases in the amount of data and dimensionality, the algorithm can easily be used with a larger amount of data without problems.68
The dimensionality and density of the datasets can be seen in Tables 1 and 2. The generalization accuracy was obtained by applying a 10-fold cross-validation to each dataset. To test the training accuracy of each algorithm, the SVM was trained using 3600 training samples per run, which represents 90% of the 4000-sample subset, with the remaining 400 samples held out in each fold.
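The 10-fold split described above can be sketched as index partitions (an unstratified sketch under assumed names; the paper's exact splitting procedure may differ):

```python
import numpy as np

def kfold_indices(n, k=10, seed=0):
    """Yield (train, test) index arrays for k-fold cross-validation.
    Each fold holds out n/k samples and trains on the rest."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

splits = list(kfold_indices(4000, 10))
# Each of the 10 folds trains on 3600 samples and tests on 400,
# matching the 90/10 proportion used in the experiments.
```

Averaging the per-fold test accuracies then gives the reported generalization accuracy.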
The values used to train the SVM with each EA were obtained by running PSO on each variant of the algorithm to determine the optimal values. This is not to be confused with the PSO variant that uses KA to classify data. The following values were used by the EAs while using the large-scale datasets:
The μABC version used:
The ABC version used:
The DE algorithm used:
The PSO algorithm used:
For the PSP dataset, the following values were, used by the EAs:
The μABC version used:
The ABC version used:
The DE algorithm used:
The PSO algorithm used:
As stated in the section The Kernel Adatron Algorithm, the KA algorithm has appealing advantages such as the simplicity of implementation of the Adatron and the capability of working in high-dimensional feature spaces to construct a large-margin hyperplane. But the main concern when implementing the original KA approach is working with the kernel matrix, since its computational complexity is quadratic in the number of training samples: for n samples, the full matrix requires computing and storing n² kernel values.
The approach proposed in this paper always uses, per iteration, a subset of randomly chosen training samples with a much smaller fixed size, and it is independent of the number of training samples
Computational complexity of the algorithms.
Results obtained from the Friedman test were the sum of squares (SS), mean squares (MS), degrees of freedom (df), the χ² value and the p-value.
Mean rank obtained from the Friedman test for each solver.
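For reference, the Friedman statistic underlying these comparisons can be computed directly. The scores below are illustrative stand-ins, not the paper's measurements; `scipy.stats.friedmanchisquare` computes the same statistic:

```python
import numpy as np

def friedman_statistic(scores):
    """Friedman chi-squared statistic for k solvers over n datasets.
    `scores` has shape (n, k); higher is better; assumes no ties."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    ranks = scores.argsort(axis=1).argsort(axis=1) + 1  # rank 1 = worst
    R = ranks.sum(axis=0)                               # rank sum per solver
    return 12.0 / (n * k * (k + 1)) * np.sum(R ** 2) - 3.0 * n * (k + 1)

# Illustrative accuracies for four solvers on seven datasets:
solver_a = [95.1, 94.8, 93.2, 92.9, 96.0, 94.1, 90.2]
solver_b = [95.0, 94.9, 93.0, 92.7, 95.8, 94.0, 90.0]
solver_c = [96.2, 95.9, 94.5, 94.1, 97.1, 95.3, 91.5]
solver_d = [94.0, 93.8, 92.1, 91.8, 94.9, 93.0, 89.1]

stat = friedman_statistic(np.array([solver_a, solver_b, solver_c, solver_d]).T)
# With k = 4 solvers there are k - 1 = 3 degrees of freedom, so stat is
# compared against the 0.05 critical value 7.815 used in the text.
significant = stat > 7.815
```

When the statistic exceeds the critical value, a post-hoc test (such as the honest significant difference used below) identifies which solver pairs differ.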
Results from the Astro-Ph dataset. The best global results are underlined and the best results obtained by our approach are written in bold letters.
Results from the Aut-Avn dataset.
Results from the C11 dataset.
Results from the CCAT dataset.
Results from the RCV1 dataset.
Results from the Real-Sim dataset.
Results from the Worm dataset.
Training accuracy results for PSP uniform frequency subsets.
Training time results for PSP uniform frequency subsets.
Cross-validation accuracy results for PSP uniform frequency subsets.
Training accuracy results for PSP uniform length subsets.
Training time results for PSP uniform length subsets.
Cross-validation accuracy results for PSP uniform length subsets.
As can be seen from the ROC curves in Figures 6A to 6G and in Table 19, the generalization performances of the classifiers shown in this paper are very similar (the curves overlap each other), with excellent values for the area under the curve (AUC), ranging from 0.9160 to 0.9891. The ROC curves for the PSP dataset (Table 21 and Figs. 7A to 7F) gave good values for AUC, ranging from 0.8087 to 0.8345. Even though the generalization accuracies obtained on the Worm dataset are not as good as those of the other generalization tests, it gave the best AUC result compared to the other ROC curves.
ROC curves obtained from large-scale datasets.
ROC curves obtained from the ICOS PSP dataset.
ROC curve areas obtained from large-scale datasets.

To detect differences between solvers across multiple test attempts, Matlab's implementation of the Friedman test was applied to the results of the four solvers that gave the best generalization results (DE, PSO, SVM
Friedman test applied to the datasets shown in Table 1: χ² = 5.9, a value smaller than 7.815 (the critical χ² value with 3 degrees of freedom at the 0.05 significance level), so no statistically significant difference between the solvers was detected. Friedman test applied to the PSP dataset: χ² = 13.4, a value greater than 7.815, so a statistically significant difference between the solvers does exist.
From the test we obtained an honest significant difference of 1.17; when this value is compared to the results presented in Table 5B it is apparent that there is a statistically significant difference between SVM
ROC curve significance level obtained from large-scale datasets.
ROC curve areas obtained from ICOS PSP dataset.
ROC curve significance level obtained from ICOS PSP dataset (where UF is uniform frequency and UL is uniform length).
Conclusions
We developed a simple-to-implement method for classifying sparse, large-scale datasets using parallelism with four EAs. As can be seen in the results, the approach also works for classifying not-so-sparse data in very short amounts of time without increasing the complexity of the algorithm. Even though the approach did not give good results in the training phase, it gave good generalization results in competitive or smaller amounts of time compared with those obtained by algorithms such as KA, OCA and SVMperf.
Comparing the four EAs using variants proposed, it is easy to notice that the
Future work includes a multiclass version of this approach, an implementation of the algorithm that can run in computer clusters, and improvements to the accuracy of the training capability of the algorithms.
Author Contributions
Conceived and proposed the initial idea of the work: NAD. Conceived and designed the experiments: NAD, AAG, JM, ALF, CLF. Analyzed the data: NAD, AAG, JM, ALF, AYA. Wrote the first draft of the manuscript: NAD, AAG, AYA, CLF. Contributed to the writing of the manuscript: NAD, AAG, AYA, CLF. Agree with manuscript results and conclusions: NAD, AAG, CLF, AYA, JM, ALF. Jointly developed the structure and arguments for the paper: NAD, AAG, CLF, AYA, JM, ALF. Made critical revisions and approved final version: NAD, AAG, CLF, AYA, JM, ALF. All authors reviewed and approved of the final manuscript.
