Abstract
1. Introduction
I
From a computational perspective, the two common methods of clustering of orthologous genes are three-way triangle relative reciprocal similarity and graph clustering used in NCBI clusters of orthologous groups (COG) and orthoMCL works (respectively in Wolf et al., 2012, and Li et al., 2003). Both of the aforementioned methods are well represented in the literature as used in orthologous protein classifications. The development of additional mathematical tools and theory for analysis of related sequences is an area of burgeoning interest. Examples may be found in the literature, such as in Kim and Lee (2006) and Viswanath and Madabhushi (2012).
We propose a new approach based upon a hypothesis that under evolutionary constraint, descendant sequences from a common ancestor share a mathematical subspace, as opposed to ambient space. In particular, we hypothesize that member nucleotide or peptide sequences lie within low dimensions within the ambient space of all sequences. Therefore, the subspace dimensions are a reflection of mutation and time elapsed since the speciation or duplication within the cluster.
In order to investigate the hypothesis, several sets of experiments and estimates of the
2. Methods
The following section contains an overview of the methods used to estimate the dimensional rank of the subspaces of COG sequences and the principle angles distribution. The population sample data used in this study come primarily from the first
2.1. Subspace considerations
In this work, it is preferred to obtain a reasonable estimate of the dimensions of COG subspaces at hand, since many of the common methods assume constant equal subspace dimensions in the models. The fundamental aspects considered toward estimating the rank of the subspace included review and analysis of the principal angles and singular values distributions.
2.1.1. Rank estimation of a single orthologous group
Let a set of unaligned nucleotide sequences form a matrix
where σj is the
2.1.2. Distribution of principle angles
The principle angles between orthologous subspaces is defined
The angular distribution was estimated by computing a subspace basis estimate for each COG and then computing the principal angles from the above equation.
2.1.3. Orthologous sequences subspace basis
Let the
where
2.1.4. Sequence alignment
Prior to performing the subspace analysis, the randomly selected cluster nucleotide sequences are aligned to ensure best common basis representation. This alignment corresponds to a relative energy minimization argument and is also a best effort attempt to allow for mutations, which may include insertions and/or deletions. The multiple sequence alignment used in this study was the MAFFT code in Katoh and Standley (2013) and Chenna et al. (2003), compiled and optimized for use on the Intel i7 processor under 3.12.x baseline Linux kernel. The goal is to perform the alignment for each
2.2. Probability of error estimates
In order to estimate the statistical significance of the experimental outcomes, an approximation is employed to compute the probability of the experimental data and classification outcomes. Consider the probability of randomly classifying
Let the random vector
where probability of success is
2.3. Subspace dimensions in the model tree
In order to investigate the relationship of the subspace dimensions to the phylogenetic history, a simple randomized binary evolutionary tree was developed. Since the orthologous trees are inferred (predicted), the analysis here uses simulations of a binary randomized evolutionary tree, where each speciation branch is a random outcome of an
2.3.1. Simple random binary mutation tree model
In order to investigate the vector properties and singular values, a mathematical construction is used for the
where
An example of a simple binary model tree is shown in Figure 1. It should be noted that in reality, phylogenetic trees are generally not perfect, binary, or symmetric, due to incomplete data, missing evolutionary paths, differing branch lengths, or other reasons.

Symmetric balanced binary model tree.
If each species branch is comprised of the sum of all speciation and mutation vectors in the path from the common ancestor, it is then possible to simulate and solve for the singular value decomposition (SVD) or eigenvalue decomposition (EVD) of the random binary tree of depth
For example, compute 2(2+1)−1=7 speciation vectors, which also include the root common ancestor in counting. Let
where
It should be noted that by the above definitions,
To simplify the analysis, it is assumed that the statistical expected magnitude of the speciation vector distance of mutation is directly proportional to the elapsed time period between the speciation node of the tree from the parent to child. In addition, a simple constant rate
where
where
2.3.2. Stochastic phylogenetic tree model
In order to investigate the secondary hypothesis, a simplified statistical binary tree whose branches are comprised of random variables (RV) for the elapsed time was developed. Although not used here, a more extensive model may allow (1) each speciation event to be an m-fold generalization of the binary tree, which may also be an empty branch, and (2) the lengths of the speciation branches would have unequal lengths with random variable outcomes.
Using the foregoing simple mathematical model and expectation results, it is noted that the expectation of any pair of vector dot products from two randomly created binary trees
In the above simple binary mutation tree treatment, • case I orthologous genes of differing common ancestors lie within independent subspaces if and only if the common ancestor vectors are uncorrelated. The trees do not intersect in expectation. • caseII recent descendants of common ancestors may still retain nonvanishing similarity from inherited common ancestor vector components. The trees intersect.
2.4. Experimental approach
In order to test the subspace mutation tree relationship hypothesis, we seek to cluster sequences from the known COG database. Toward this end, two randomized experiments were conducted to test the primary hypothesis using the COG database and sets of thirty trials each containing unknown mixtures of sequences belonging to
The approach is as follows:
(1) Estimate the subspace dimensions of the first (2) Estimate inter-subspace principle angle distribution using Equation (2.2) and the following density counting functional,
Figure 2c shows the angle distribution of the first group of (3) Perform random selection for the experiment. (a) Draw random nucleotide sequences set (b) Construct the combined matrix data stack
where the rows of (c) Randomize the order of rows of (4) Align (5) Compute the misclassification rate and compare to null probability, (a) for each experimental sample and algorithm output set (b) Compute the (6) Go to random selection above and repeat

Estimates of subspace rank and distribution of principle angles. COG, Clusters of orthologous groups.

Results for M = 2 and M = 3.
2.5. Detailed algorithm
Samples were drawn at random uniformly from the first
3. Results
In this section, the average subspace dimensions are obtained from the rank estimates of the first
3.1. Estimation of parameters
The estimated rank for the first
3.2. Experimental objectives
The experimental data sequence analysis methods used here are designed to evaluate the subspace algorithm hypothesis for COGs of proteins or DNA nucleotide sequences with presumed known truth from the NIH COG database. The objectives include consideration of membership, similarity, principle angles, ensemble distribution, and/or other aspects.
It is important that the experimental design allows for both false positives and false negatives, since a primary objective is to estimate the error rate and the statistical significance of the experimental results. Plots of the
3.3. Computational considerations
The data matrix of interest is a three-dimensional tensor of the form and size 1000–4000 (nucleotide base pairs)×50 – 100 (orthologous genus species taxon of interest)×2 – 200 (orthologous protein groups).
Several observations regarding the challenges of existing methods in COG are included below.
(1) Direct application of linear algebra is complicated by mutation events, such as insertion and deletion. However, preliminary results suggest potential for a hybrid model using subspace techniques combined with traditional methods published and demonstrated in the NIH COG series of Makarova et al. (2007).
(2) Real-world data sets are often incomplete, contain noise, and have other outliers due to errors and uncertainties and potential false positives and false negatives. These aspects generally complicate the process of modeling and inference, which may be significant. The experiments discussed in this study do not contain known nonmember outlying sequences. Nonmember outliers may be addressed in a future effort.
3.4. Experiments
Randomized simulations were performed using two and three clusters for
3.5. Statistical significance
Preliminary review of the estimated experimental errors and probabilities suggests that the orthologous sequences are indeed members of a common subspace, since the observed correct classification outcomes (with shown error rates) would occur due to random selection on average with probability estimate
4. Discussion
In our small orthologous population sample study, we observe that the common ancestors are sufficiently uncorrelated as to significantly preserve the inherited vector components that allow the LSA (Zappella et al., 2011) algorithm to perform the subspace classification of descendants relative to their common ancestors. A larger study would be needed to ascertain the limiting degree of applicability of the observed distinctiveness of the common ancestor orthologous sequences.
4.1. Supplemental results for simple model tree
This section includes results from simulation and analysis of the simple binary model tree. The characteristics of rank, tree depth, and number of nodes are consistent with the main hypothesis and are included here for reference. It should be noted that operations such as dot products and projections are generally performed after alignment. Also, the
4.1.1. Idealized model tree
The simple randomized binary mutation tree was generated for mutation rates

Simple binary mutation tree simulations.
4.1.2. Singular values versus binary tree depth
Simulations using the random binary tree evolution model were run over a range of depths. For each complete balanced random tree, the rank was estimated using Equation (2.1), as shown in Figures 2a and 5a. The dependence of estimated rank versus binary tree depth is noted in Figure 5b.

Estimation of group rank.
4.1.3. Singular values versus mutation rates
The simulated results for the simple binary mutation tree are shown in Figures 4a and b. The rank of the tree for the given mutation may be estimated at the 90% level in the cumulative sum of the singular values normalized for the two cases here,
Since the mutation time lengths are constant in this simple model, it is expected that a better model would allow for random mutation time.
4.1.4. Subspace relationships in the simple model tree
The dimensions of the subspaces are observed to be dependent upon similarities among the sequences. In particular, it is observed that the mutation rate product with the number of sequence replication is proportional to the estimated rank dimensions of the binary tree. This is consistent, since a tree with very high rate of mutations would produce species that were all significantly different, and the full rank of the sequence matrix would be required for the subspace representation. In addition, a perfect binary tree shows increasing rank versus depth, as illustrated in the simulations presented.
The subspace nature of orthologous groups in general is believed to have a connection within the molecular aspects of divergent evolutionary processes. In particular, the simple mathematical model binary tree simulations indicate descendants from two uncorrelated unique common ancestors are statistically unlikely to intersect, and therefore, the subspaces of binary trees are not expected to intersect. It is important to note that statistically, the mathematical model allows exceptions and variance about the mean, which is consistent with exceptions that are expected in nature and biology. However, the reader should be mindful that the binary tree is generally too simple for most orthologous sequence relationships, and is primarily used here as a means for analysis.
4.1.5. Model tree simulations
Based upon the results in Figure 1, the estimated dimensional rank of the subspace
In addition, the simulation data are consistent with the following observation: rank estimates for the perfect complete binary tree are dependent upon the rate of mutation and the elapsed time, which give rise to the length of the branches. The larger the rate of mutations per generation for a given unit of time, the more quickly descendants diverge from the common ancestor, and the more significant the members of the branches become in the singular value decomposition, thereby increasing the rank approximation value [Eq. (2.1)]. In summary, the observed results support the hypothesis relating the mutation events in the phylogenic tree to the subspace constraint.
5. Conclusions
The experimental results are consistent with the hypothesis that similar genes of various related organisms lie within relatively small subspaces. Randomized experimental simulations were performed using two and three clusters for
6. Appendix
A brief overview and discussion of subspace clustering and related theory is presented below with references from the literature.
6.1. The subspace theoretic problem
(1) determine the number of subspaces
(2) determine the set of dimensions
(3) find an orthonormal basis for each subspace
(4) collect the data points belonging to the same subspace into the same cluster.
6.2. Subspace segmentation methods
Subspace clustering may be viewed as multiple independent nonintersecting (except at zero) low-rank spaces amenable to singular value decomposition or principle components in linear algebra. The technique has shown success in areas such as face recognition and motion detection.
Methods in the literature (Vidal, 2010) generally fall into four categories:
• algebraic methods—generalized principle component analysis (GPCA) and reduced row echelon form (RREF); • sparsity methods—sparse subspace clustering (SSC) and low rank representation (LRR); • local neighborhood methods—local subspace affinity (LSA), nearness to local subspace (NLS), and spectral curvature clustering (SCC); • iterative and statistical methods—random sample consensus (RANSAC) and agglomerative lossy compression (ALC).
Most of the methods incur significant numerical computational complexity, owing to the number of unknown quantities of the set characteristics, such as dimensions, bases, and members, for general problem domains. The sparsity methods often require the most computational cost relative to the other methods.
References for solving Problem 1 in the literature include sparsity methods (Eldar and Mishali, 2009; Elhamifar and Vidal, 2009; Elhamifar and Vidal, 2010); algebraic methods (Vidal et al. 2005; Tron and Vidal, 2007); iterative and statistical methods (Kanatani and Sugaya, 2003; Aldroubi et al., 2009; Tseng, 2000; Fischler and Bolles, 1981; Silva and Costeira, 2008; Aldroubi and Zaringhalam, 2009); and spectral clustering methods (Lauer and Schnorr, 2009; Chen and Lerman, 2009). These methods and applications on subspace clustering are reviewed and discussed in Vidal (2010).
Footnotes
Acknowledgment
The research of Ali Sekmen and Tim Wallace was supported in part by NASA Grant NNX12AI14A.
Author Disclosure Statement
The authors declare that there are no known conflicts of interests.
