Abstract
Introduction
It is of great importance for biologists to analyze and understand the structure and function of a large volume of genomic DNA sequences.1,2 However, it is very difficult to obtain biological information directly from large DNA sequences. 3 Determination of sequence similarity is one of the major steps in computational phylogenetic studies. During evolution, both nucleotide mutation and gene rearrangement occurred. One of the major tasks of computational biologists is to develop novel mathematical descriptors for similarity analysis. 1 DNA clustering is an important technology that automatically finds inherent relationships in large-scale collections of DNA sequences. However, the quality of DNA clustering resulting from this technique may still need to be significantly improved. The comparison between the DNA sequences of different species helps determine the phylogenetic relationship among various species. 2
In recent years, various approaches have been proposed for generating DNA sequence information.1–9 In 1990, Jeffrey 4 proposed a method known as Chaos Game Representation (CGR) of gene structures. The Chaos Game is essentially an iterated function scheme, similar to that used in the representation of the Sierpinski triangle obtained by plotting a sequence of points generated in a somewhat random fashion.4,10 CGR is a novel holistic approach that provides a visual image of a DNA sequence that is quite different from the traditional linear arrangement of nucleotides. CGR of DNA sequences has received widespread attention for searching global patterns in long DNA sequences.10–12 Sequence alignment has been frequently used as a powerful tool for comparing two DNA sequences. 13 However, with the divergence of species over time, subsequence rearrangements occurring during evolution make sequence alignment similarity scores less reliable. 1 Alignment-free methods for more efficient comparison and classification of DNA sequences than sequence alignment were developed in the past decade. 6 Graphical representations of DNA sequences are also powerful alignment-free tools for the analysis of DNA sequences. 1 The first three-dimensional (3D) geometric representation for DNA sequences was presented by Hamori and Ruskin. 14 Later, 2D,15–20 3D,21–25 4D, 26 5D, 27 and 6D 28 representations of DNA sequences were developed. These methodologies represent DNA as matrices that were associated with the selected geometrical objects, as well as vectors that were composed of invariants of matrices that were used to compare DNA sequences. Qi et al. 1 introduced a novel method based on a graph theory to represent
DNA sequences mathematically for similarity analysis, in which they set up a weighted directed graph, whose adjacency matrix yields a representative vector. Two distance measurements for representative vectors are then defined to assess the similarity/dissimilarity of DNA sequences. However, to keep one-to-one mapping between a DNA sequence and its corresponding weighted directed multigraph, the weight of an arc between two nucleotides must retain a sufficient number of significant digits for large DNA sequences to increase its computational complexity. Hou et al. 2 proposed a new graphical representation method that adopted the telecommunication Coded Mark Inversion coding to represent four nucleotides, namely, A, G, C, and T. Their approach considers not only the sequences’ structure but also the chemical structure of the DNA sequence. Jeong et al. 3 proposed a noble encoding approach for measuring the degree of similarity/dissimilarity between different species. Their approach preserves the physiochemical properties, positional information, and the codon usage bias of nucleotides. Recently, Thankachan et al. 7 presented ALFRED, an alignment-free distance computation method, which solves the generalized common substring search problem via exact computation. Pizzi 8 presented MissMax, an exact algorithm for the computation of the longest common substring with mismatches between each suffix of two sequences. Kumar et al. 9 introduced a 36-dimensional Periodicity Count Value that represented a particular nucleotide sequence by adapting a stochastic model. However, most alignment-free methods may lose the structural and functional information of DNA sequences because these mainly utilize feature extractions. Therefore, these may not fully reflect the actual differences among DNA sequences. Most of these techniques utilize genes that consist of several hundreds to thousands of bases to illustrate the effectiveness of their methods. However, for a large volume of genomic DNA sequences that comprises tens of thousands to millions of bases, the problem is computational complexity.
In most cases, it is necessary to identify a proper way to represent DNA sequences. In this paper, we construct a novel mathematical descriptor based on the characterization of complex networks. In particular, for each DNA sequence, a complex network is created and used to generate a 60-dimensional characterization vector that in turn would be employed in the analysis of the mitochondrial DNA sequence phylogenetic relationships among nine species. The present approach is capable of assessing tens to hundreds of thousands of bases of DNA sequences.
Complex Network Construction for DNA Sequence
The central dogma of molecular biology states that genetic information is transcribed from the DNA to the messenger RNA, which in turn in used in protein translation. A code of three nucleotide codes for an amino acid and various combinations are used to specify each of the 20 different amino acids occurring in living organisms.
Our method is mainly based on the central dogma. In addition, the similarity between two sequences based on alignment is completely based on the ordering of nucleotides. The basic idea of our model is to first construct the complex networks of a DNA sequence, which in turn are used in generating a characterization vector and integrated into a clustering algorithm for the analysis of phylogenetic relationships among various species.
To clarify the definition of different complex networks for a DNA sequence, we take the sequence shown in Figure 1 as an example. In accordance with the central dogma and considering computational complexity, for each DNA sequence, we will construct five complex networks called dinucleotide (trinucleotide, tetranucleotide, pentanucleotide, and hexanucleotide) cis sequence networks. The vertices of these cis sequence network are dinucleotides (trinucleotides, tetra-nucleotides, pentanucleotides, and hexanucleotides), and the edges of those networks are dinucleotide (trinucleotide, tetra-nucleotide, pentanucleotide, and hexanucleotide) cis sequence pairs. The vertex numbers of these cis sequence networks are 2
4
= 16, 3
4
= 81, 4
4
= 256, 5
4
= 625, and 6
4
= 1,296, respectively. For the example sequence in Figure 1, the five complex network construction processes are illustrated in Figure 2. In Figure 2, in constructing a cis sequences network, each line has two cis sequences that represent two vertices between which there exists an edge. For example, in Figure 2A, cis sequence TG and CC are two vertices with an edge between TG and CC. Figure 3A–E shows the five nucleotide cis sequence networks of the example sequence that were laid out using Cytoscape 3.3.0.
Example sequence. Examples of nucleotide cis sequence network constructions. (A) Dinucleotide cis sequences network construction, Five nucleotide cis sequence networks of the example sequence. (A) Dinucleotide cis sequences network, 


Characterization Vector Construction for Complex Networks
Measurements of the topology of complex networks are essential for their characterization, analysis, classification, modeling, and validation. In a review, Costa et al. 31 surveyed the characterization of the main existing complex networks. The characterizations included distance-based measurements, clustering coefficients, assortativity, entropies, centrality, subgraphs, spectral analysis, community-based measurements, hierarchical measurements, and fractal dimensions.
Fifteen global characterizations of complex network.
In addition, for a large volume of DNA sequences, the characterizations of each of the five nucleotide cis sequence networks revealed a short characteristic path length, one connected component, and low diameter and density, and each of the networks displayed common characteristics of a biological network such as scale free and small words, thereby suggesting that the inferred networks were not random and presented features of complex networks. The five nucleotide cis sequence networks of the example sequence shown in Figure 1 do not present the characteristic of scale-free small words [for example, Figure 1E i just a path] because the example sequence is short.
Materials
Length of the mitochondrial DNA sequences of nine species.
From above, five nucleotide cis sequence complex networks and 60-dimensional vector associated with each of the nine species’ mitochondrial DNA sequences were generated (Supplementary material).
Distance Measurements for Similarity Calculations
Each of the DNA sequences was mapped to a characterization vector in the 60-dimensional linear space. The comparison between DNA sequences is now reduced to the comparison between these 60-dimensional vectors. We applied two standard measurements: one is the distance between two 60-dimensional vectors and the other is the similarity/dissimilarity among the different species. For two species
We present the first similarity/dissimilarity matrix based on the Euclidean distance between two vectors
The second similarity/dissimilarity matrix is based on the cosine of the angle included between vectors
We will use different similarity/dissimilarity metrics for numerical analysis and determine phylogenetic analysis to show the evolutionary proximity and distance among different species. We further analyze the similarity/dissimilarity of the resulting complex networks in terms of the local characterizations of topological coefficients.
Results and Discussion
Similarity analysis of global characterizations
Numerical representation of biomolecular sequences facilitates application of conventional pattern recognition tools and techniques such as various classifiers and clustering techniques to group and classify sequences. 9 By our method, each of the five nucleotide cis sequence networks can be represented by a 60-dimensional vector using 15 global characterizations, which can then be used in calculating the similarity/dissimilarity matrix using the two measurements of defining distances between two characterization vectors as well as in a clustering algorithm for the analysis of phylogenetic relationships of nine species based on their mitochondrial DNA sequences. We evaluated and applied the proposed characterizations of complex networks model similarity measure to the Matlab clustering to perform clustering. Using the similarity/dissimilarity matrix, we generate the phylogenic tree using the functions “pdist”, “linkage”, and “dendrogram” in Matlab 7.12.0 (R2011a).
Upper triangular part of similarity/dissimilarity matrix based on the Euclidean distance of vectors.
Upper triangular part of similarity/dissimilarity matrix based on the cosine of a vector-included angle.

The phylogenic tree of nine species using the upper triangular part of similarity/dissimilarity matrix based on Euclidean distance of vectors.

The phylogenic tree of nine species using the upper triangular part of similarity/dissimilarity matrix based on cosine of vector-included angle.

Topological coefficients of pentanucleotide cis sequence networks of nine species. (A)
Similarity analysis of local characterizations
We analyze the similarity/dissimilarity of the local characteriza-tions 33 of these complex networks. We consider the topological coefficients of pentanucleotide cis sequence networks of nine species as an example (Fig. 6).
From the topological coefficients of pentanucleotide cis sequence networks of nine species, the more similar species pairs include
Conclusions
We present a new method for the characterization of a complex networks model for alignment-free DNA sequence similarity analysis. This new approach is based on a code of three cis nucleotides in a gene that could code for an amino acid. The similarity/dissimilarity matrix for the nine complete mitochondrial DNA sequences belonging to different species is built, and the elements of the similarity matrix are used to construct phylogenic tree. When the mitochondrial DNA sequences were compared between different species, the results were in agreement with the actual situation.
Author Contributions
JZ conceived and designed the experiments and was a major contributor in writing the manuscript. PYZ and THZ analyzed the data and the result. All authors reviewed and approved of the final manuscript.
Supplementary Material
Characterizations of five complex cis sequence network of nine species.
