Abstract
Introduction
Datasets emerging from different fields such as biology, neuroscience, engineering, social science, economics,
To understand the formation and evolution of real-world networks, various generative models have been proposed to generate synthetic networks that follow the non-trivial topological properties of real-world networks [7,13,35,37,57,59]. For example, the Watts-Strogatz model [59] synthesizes small-world networks with small average path length and high clustering coefficient, and the Barabási–Albert model [7] generates scale-free networks with a long-tail (power-law) degree distribution. In addition to degree distribution, clustering and path lengths, other structural properties, such as modularity, assortativity and special eigenvalues, are also supported in newer generative models [3,35].
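The growth mechanisms of these classical models can be illustrated with minimal pure-Python sketches. These are simplified illustrations, not the reference implementations (libraries such as networkx provide production versions); the function names and edge-set representation are ours.

```python
import random

def erdos_renyi(n, p, rng=random.Random(0)):
    """G(n, p): every node pair is joined independently with probability p."""
    return {(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p}

def watts_strogatz(n, k, beta, rng=random.Random(0)):
    """Ring lattice of degree k whose edges are rewired with probability beta,
    yielding small average path length while retaining high clustering."""
    edges = set()
    for i in range(n):
        for offset in range(1, k // 2 + 1):
            j = (i + offset) % n
            if rng.random() < beta:  # rewire this edge to a random endpoint
                j = rng.randrange(n)
                while j == i or (min(i, j), max(i, j)) in edges:
                    j = rng.randrange(n)
            edges.add((min(i, j), max(i, j)))
    return edges

def barabasi_albert(n, m, rng=random.Random(0)):
    """Preferential attachment: each new node links to m nodes chosen with
    probability proportional to degree, yielding a power-law degree tail."""
    targets = list(range(m))  # seed nodes
    repeated = []             # node list weighted by current degree
    edges = set()
    for source in range(m, n):
        for t in set(targets):
            edges.add((min(source, t), max(source, t)))
        repeated.extend(targets)
        repeated.extend([source] * m)
        targets = [rng.choice(repeated) for _ in range(m)]
    return edges
```

Note the Barabási–Albert sketch draws the m targets with replacement, so a new node occasionally attaches to fewer than m distinct neighbours; production implementations deduplicate this draw.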
Despite the progress made in proposing many generative models, there is currently no universal generative model applicable to all applications. Therefore, prior to network generation, we must perform the non-trivial task of choosing the appropriate generative model for a particular application (also called model selection). Since we want the model that is most representative of the real network, model selection involves deep analysis of the properties of the given network (called the target network), on the basis of which the most appropriate model is chosen. Essentially, model selection evaluates a library of candidate generative models and predicts which is the most appropriate for generating complex network instances similar to the real network. There are many applications of model selection, including network sampling [22,34,36,54], simulation of network dynamics [12,42,46] and summarization [2,39,50].
In order to perform effective model selection, we require a similarity measure to compare networks across their topological properties, such as average path length, transitivity, clustering coefficient and modularity.
Related work
In the literature, several network model selection (or network classification) methods are available; most of them are based on graphlet-counting features [26,41,48] or on combinations of local and global features of network topology [2,4,43] for selecting the best generative model. Other methods have also been developed for the model selection problem [20,27].
To measure the structural similarities between two networks, various quantitative measures have been reported [4,10,17,40,48–51,62]. Graph isomorphism is one of the classical approaches to comparing two graphs. Two graphs are said to be isomorphic if they have identical topology. Some variants of this approach have also been proposed, including subgraph isomorphism and maximum common subgraphs [11]. Several isomorphism-inspired methods based on counting the number of spanning trees [28] or computing node-edge similarity scores are also available [62]. These methods are computationally expensive and not suitable for large complex networks. Various approaches utilize graphlet counts as a measure of network similarity [26,29,48], as well as distance measures for network comparison in which networks are represented as feature vectors that summarize the network topology [2,11,35,39,50].
In the analysis of complex networks, a size-independent similarity metric plays an important role in the evaluation of the network generation models [7,37,59,60], evaluation of sampling methods [22,34,36,54], study of epidemic dynamics [12,42,46], generative model selection [26,41,43], biological networks comparison [48,49], network classification and clustering [2,16,26,41,43], and anomaly and discontinuity detection [10,45].
Materials and methods
In this section we describe our model, formally named 'TripletFit', for complex networks. The model consists of two important processes: (a) model selection and (b) structural network similarity. Model selection is based on the structural similarity between two complex networks. For a given target (real) network instance, the proposed method chooses, among six generative models, the one that can generate a synthetic network similar to the target network. The selection of the best model is based on the embedded feature space of the target network and of various synthetic networks generated by the different generative models. In the proposed method, we develop a network distance metric utilizing topological features to separate the various types of network classes. The simplest choice would be to compute the Euclidean distance between the topological feature vectors of two networks. Instead, we utilize the triplet neural network architecture to learn a better distance metric [24]; the goal is to learn a transformation from the original feature space to an embedded feature space in which the Euclidean distance between similar networks is smaller than the distance between dissimilar networks. We then quantify the structural similarity between two networks by computing the Euclidean distance between the corresponding topological feature vectors in the transformed space.
Dataset description:
In this study, we have taken a gold standard data set 1
Downloaded from

The proposed methodology for model selection.
Figure 1 shows a high-level view of the model selection process. The methodology is configurable at several decision points, such as the set of considered network features and the candidate generative models. The steps for constructing the final network classifier are described here:
We take 1000 network instances of different sizes and densities (using different parameters) for each candidate network generative model from RND, capturing their growth mechanisms and generation processes. These network instances form the dataset for training the triplet neural network.
We extract the topological features (clustering coefficient, modularity, degree distribution and the other measures described in the network features section) from each network instance.
Construct triplets from the labeled dataset, each comprising a positive, a negative and an anchor sample, where the positive and anchor samples are of the same class (i.e., the same generative model) and the negative sample is of a different class. These triplets are used to learn the distance metric: we train the triplet neural network on them. The trained triplet neural network transforms the feature space into another feature space (called the embedding feature space) in which the distance between the positive and anchor samples is smaller than the distance between the negative and anchor samples. This new embedding feature space forms the dataset for learning a network classifier.
The labeled embedding feature dataset forms the training and test data for a supervised learning algorithm. The learning algorithm will return a network classifier which can predict the class (the best generative model) for a given network instance.
The same topological features used in Step 2 are also extracted from the real-world (target) network. This feature vector is then passed through the triplet neural network trained in Step 3 to generate the embedding feature vector of the target network, which is used as the input to the classifier learned in Step 4.
The learned network classifier is a customized model selector for finding the model that fits the target network. It gets the topological features of the target network as the input and returns the most compatible generative model.
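The triplet construction in Step 3 can be sketched as follows. This is a minimal pure-Python illustration; the helper name and the dictionary layout of the labeled dataset are ours, not the paper's.

```python
import random

def sample_triplets(dataset, n_triplets, rng=random.Random(0)):
    """dataset: dict mapping a class label (generative model) to a list of
    feature vectors.  Returns (anchor, positive, negative) triplets in which
    anchor and positive share a class and negative comes from another class."""
    labels = list(dataset)
    triplets = []
    for _ in range(n_triplets):
        pos_label = rng.choice(labels)
        neg_label = rng.choice([l for l in labels if l != pos_label])
        anchor, positive = rng.sample(dataset[pos_label], 2)  # two distinct same-class samples
        negative = rng.choice(dataset[neg_label])
        triplets.append((anchor, positive, negative))
    return triplets
```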
Besides network model selection, the proposed method also estimates the structural similarities of networks. To measure the structural similarity between two networks, we extract the topological features of the two given networks as in Step 2. We then pass both feature vectors through the triplet neural network trained in Step 3 and generate embedding feature vectors for both networks. Finally, we compute the Euclidean distance between the two embedding feature vectors, representing the similarity between the two networks. The distances between similar networks are smaller than those between dissimilar networks. Section 4.3 describes the computation of structural similarities between the network instances generated in Step 1. Detailed descriptions of the steps outlined above are presented below:
Network features
The process of model selection, as described in Fig. 1, utilizes network topological features in the second and fifth steps. Various features are defined in the network literature to quantify the topological properties of a network. We consider only some well-known and frequently studied measures that are relevant to our study. A diverse set of local and global network features was utilized to construct the feature vectors. A brief description of the calculated measures is as follows:
Network features are not standardized, i.e., there is no universal best set of features for networks; other features, such as shrinking diameter, densification, vulnerability, network resilience and the rich-club phenomenon, could also be considered.
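A feature vector of the kind described above can be sketched in pure Python over an edge list. The particular features chosen here (density, degree statistics, average clustering) are an illustrative subset, not the paper's exact feature set.

```python
from collections import defaultdict

def feature_vector(edges):
    """Summarize a graph (a set of (u, v) edge pairs) by a few topological
    features: density, mean degree, degree variance and average clustering."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    n = len(adj)
    degrees = [len(adj[v]) for v in adj]
    mean_deg = sum(degrees) / n
    var_deg = sum((d - mean_deg) ** 2 for d in degrees) / n
    density = 2 * len(edges) / (n * (n - 1))

    def clustering(v):
        # fraction of a node's neighbour pairs that are themselves linked
        nbrs = list(adj[v])
        k = len(nbrs)
        if k < 2:
            return 0.0
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if nbrs[j] in adj[nbrs[i]])
        return 2 * links / (k * (k - 1))

    avg_clust = sum(clustering(v) for v in adj) / n
    return [density, mean_deg, var_deg, avg_clust]
```

For a triangle graph, for example, `feature_vector({(0, 1), (1, 2), (0, 2)})` yields density 1.0 and average clustering 1.0.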
Triplet neural network
The triplet neural network is a deep learning model that aims to learn useful representations of data through distance comparisons [24]. Recently, the triplet network architecture has been successfully applied in many computer vision tasks [14,32,52,58]. In the past few years, deep learning models have been widely exploited to solve various machine learning tasks. Deep learning automates the extraction of high-level, meaningful, complex abstractions as data representations (features) through a hierarchical learning approach. The notion of hierarchical features stems from neuroscientific discoveries about the visual cortex that indicate a hierarchy of cells, with successively higher-level cells firing for more complex visual patterns [5,8,9,23,61].

A schematic representation of the triplet neural network architecture, which consists of three feedforward neural networks of the same instance with shared parameters.
The triplet network, as shown in Fig. 2, is a model inspired by the Siamese network; it comprises three feedforward networks of the same instance with shared parameters. The network produces two intermediate values when three samples are fed to it. These intermediate values are the distances between the embedded representation of the anchor sample and those of the positive and negative samples.
We train this network for a 2-class classification problem using the triplet input (anchor, positive, negative).
Here, we briefly describe the implementation details of the triplet neural network. A triplet neural network consists of three feedforward neural networks with shared weights, followed by two layers. The advanced activation function ELU (Exponential Linear Unit) is applied between two consecutive layers. The network configuration (ordered from input to output) consists of layer dimensions
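The comparator stage on top of the three shared networks can be written down directly. Following the softmax formulation of Hoffer and Ailon [24], the two anchor distances are passed through a softmax and the loss drives the anchor-positive distance toward zero. This is a minimal sketch that assumes the embeddings have already been produced by the shared feedforward networks; it is not the paper's training code.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def triplet_softmax_loss(emb_anchor, emb_pos, emb_neg):
    """Softmax over the two anchor distances: the loss is the squared
    probability assigned to the positive distance, which approaches 0 when
    the anchor is much closer to the positive than to the negative sample."""
    d_pos = euclidean(emb_anchor, emb_pos)
    d_neg = euclidean(emb_anchor, emb_neg)
    p_pos = math.exp(d_pos) / (math.exp(d_pos) + math.exp(d_neg))
    return p_pos ** 2
```

When the two distances are equal the loss is 0.25; it shrinks as the negative sample moves away from the anchor relative to the positive one.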
Network models
Among several existing network generative models, we have selected six important ones: Kronecker Graphs (KG) [35], the Forest Fire model (FF) [37], the Barabási–Albert model (BA) [7], the Watts–Strogatz model (WS) [59], Erdős–Rényi (ER) [19] and Random Power Law (RP) [57]. The selected models are state-of-the-art methods of network generation. Existing model selection methods have ignored some new and important generative models such as Kronecker Graphs and Forest Fire [4]. All the models are briefly described in the study [4]. The characteristics of the above six generative models are as follows:
Results
In this section, we evaluate our proposed method for network model selection and structural network similarity. We utilize the RND to evaluate our model against existing methods. The proposed model is evaluated by first transforming the network dataset into a feature dataset using the topological features mentioned in Section 3.3. Thus, each of the network instances generated by the six generative models, and each of the many real-world networks, corresponds to one feature-vector representation in the feature dataset. In previous methods, the size (number of nodes) and/or density of the target network is considered in the generation of training data [26,41,43]. In our methodology, the size and density of the target network are not considered in the generation of the training data. In the proposed method, a triplet neural network is utilized to find the best network distance metric, one capable of separating networks of different classes.
Evaluation of network distance metric
The TripletFit method is primarily based on the learning of a network distance metric. The distance metric learning problem is concerned with learning a distance function that can separate networks of different generative models. In this study, we choose the Euclidean distance (

Visualization of the embedded feature set in two-dimensional space using the t-SNE method.
In this section, we show that the learned distance function is capable of separating the different generative-model classes. To this end, we visualize the high-dimensional embedded feature dataset using
As shown in Fig. 3, each generative-model class is clearly separated in the embedding feature space. Hence, we utilize the embedded feature dataset for both model selection (network classification) and network similarity (network comparison).
First, we evaluate TripletFit for model selection on the synthetic labeled dataset, which is constructed from network instances of the known generative models mentioned in Section 3.6. To the best of our knowledge, these are the most widely used generative models in network generation applications, and they also cover a wide range of network structures.
Each generative model offers a set of parameters for tuning the synthesized networks so that they follow the properties of real (target) networks. The number of nodes (network size) is an important parameter of the considered models. Unlike other network model selection methods [2,26,41,43,48], we select all the parameters randomly. The size of the network is chosen randomly from 1000 to 5000 nodes, and the other network parameters are also chosen randomly from the parameter ranges described in the study [4]. Compared with the size of the real (target) network, we generate network instances of smaller sizes for training the proposed method. This feature increases the efficiency and performance of our proposed method.
The generative model selection is treated as a network classification problem. In this study, we developed a supervised learning approach for network model selection, utilizing an Artificial Neural Network (ANN) model [25,30] as a classifier to select an appropriate generative model for a given real network. For training the ANN model, we utilized the embedded feature dataset generated in Step 3 of Section 3.2. We performed 10-fold cross-validation to evaluate our proposed method, where the whole dataset is randomly divided into 10 equal subsets. Of the 10 subsets, a single subset is retained as the test set, and the remaining subsets are used as training data. This process is repeated 10 times. We construct the test sets in such a way that every iteration contains an equal number of network instances (
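The fold construction described above, with each test fold holding an equal number of instances per generative model, can be sketched as follows. The helper name and data layout are illustrative, not the paper's code.

```python
import random

def stratified_folds(labeled, n_folds=10, rng=random.Random(0)):
    """labeled: list of (feature_vector, class_label) pairs.  Returns n_folds
    disjoint test sets, each containing an (approximately) equal number of
    instances of every class, for the 10-fold protocol described above."""
    by_class = {}
    for item in labeled:
        by_class.setdefault(item[1], []).append(item)
    folds = [[] for _ in range(n_folds)]
    for items in by_class.values():
        rng.shuffle(items)
        for i, item in enumerate(items):  # deal items round-robin per class
            folds[i % n_folds].append(item)
    return folds
```

Each fold in turn serves as the test set while the remaining folds form the training data.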
Results of TripletFit method on synthetic network dataset
Aliakbary et al. [4] compared their proposed method (ModelFit) with various existing model selection methods: GMSCN [43], SVMFit [30,47], RGF [49], AvgBased and Naïve-Manhattan. GMSCN utilizes the LADTree decision-tree classifier on RND. SVMFit is an SVM-based classification method. AvgBased is a distance-based classifier that considers the average distance of the given network to neighboring networks. The RGF method is another distance-based method, utilizing graphlet-count features. Finally, the Naïve-Manhattan distance can be defined as the pure Manhattan distance over the network features, where all the network features share equal weights during the distance calculation. More details about these methods can be found in the research paper [4]. We compared our proposed method with the other methods reported in the work of Aliakbary et al. [4]; a comparative summary of the results is shown in Fig. 4.

The accuracy comparison of different network selection methods.
We also evaluate the robustness of our proposed method by introducing varying noise levels (noise = 5%, 10%, 15%, 20% and 25%) into the test-case networks.

The robustness of TripletFit with respect to different levels of noise.
In this section, we measure the structural similarities between different synthetic and real-world complex networks. To measure the structural network similarity between two networks, we compute the Euclidean distance between the embedded feature vectors of the corresponding networks.

Structural similarities between synthetic networks.
We employ the Euclidean distance as a structural network similarity measure for comparing real networks. To measure the structural similarity between two networks, we compute the Euclidean distance between the embedded features of the two networks. The question is: is the Euclidean distance capable of describing the structural similarities in real network data? To evaluate this, we compute the Euclidean distance between every pair of embedded feature vectors of the 6000 network instances generated by the six generative models, as described in Step 4 of Section 3.2. We then plot a heatmap of the resulting distance matrix.
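The pairwise comparison behind such a heatmap reduces to an all-pairs distance matrix over the embedded vectors, which can be sketched as:

```python
import math

def distance_matrix(embeddings):
    """All-pairs Euclidean distances between embedded feature vectors; the
    resulting symmetric matrix is what a similarity heatmap visualizes."""
    n = len(embeddings)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2
                              for a, b in zip(embeddings[i], embeddings[j])))
            dist[i][j] = dist[j][i] = d  # distance is symmetric
    return dist
```

Blocks of small distances along the diagonal then correspond to instances of the same generative model.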
The assortativity coefficient has been shown to depend on network size [38]. Hence, we carried out an extensive analysis of this issue. Figures 7 to 12 in the Appendix show boxplots of assortativity over various network-size ranges for the different network types. We confirm that for a few network types, particularly the Barabási–Albert model, assortativity increases slightly with network size. The major implication of this observation is that the proposed approach should not be used across a very large range of network sizes. Note, however, that the dependencies are rather weak within a reasonable range of network sizes, so the proposed model can be used effectively, as demonstrated in our study. Further, we carried out an additional experiment in which we removed the assortativity feature from our model; the results show that this feature is not critical to the overall strategy of the proposed network comparison methodology. Thus, users who intend to reuse our model over a very large range of network sizes are advised to remove the assortativity feature; the cost of doing so, in terms of model performance, is not very high.
Case study
To assess the effectiveness of the proposed method on real networks, we applied TripletFit to several real networks. The real network instances and the results of TripletFit on these networks are described below:
“p2p-Gnutella08”, Gnutella is a small peer-to-peer file-sharing network with about
“Email-URV”, Email-URV is the network of email interchanges between members of the University Rovira i Virgili of Tarragona, Spain. This dataset was collected by Guimera
“email-Enron”, email-Enron is a communication network covering all the email communication of the Federal Energy Regulatory Commission during its investigation. In this network, nodes represent email addresses and an edge represents an email interchange between two addresses
“cit-HepPh”, cit-HepPh is an arXiv High Energy Physics Phenomenology paper citation network, which covers all the citations within a dataset of
“cit-HepTh”, cit-HepTh is an arXiv High Energy Physics Theory paper citation network, which covers all the citations within a dataset of
“ca-HepPh”, ca-HepPh is an arXiv High Energy Physics Phenomenology collaboration (co-authorship) network, which covers scientific collaborations between authors of papers submitted to the High Energy Physics – Phenomenology category within a dataset of
“ca-HepTh”, ca-HepTh is an arXiv High Energy Physics Theory collaboration (co-authorship) network, which covers scientific collaborations between authors of papers submitted to the High Energy Physics – Theory category within a dataset of
Table 2 shows the results after applying TripletFit to the real networks described above. As Table 2 shows, the various real networks, which were selected from a wide range of sizes, densities and domains, are assigned to different network models by the classifier. This indicates that no single generative model dominates in the proposed method for real networks; different models are suggested for different network structures. The case study also verifies that no single generative model is sufficient for synthesizing networks similar to all real networks, and that we should find the best model fitting the target network in each application. As a result, generative model selection is an important stage before generating network instances. Accordingly, for the seven chosen real networks, the appropriate generative model is selected and shown in the last column of Table 2.
Real networks and the selected generative model using network model selection method
As discussed above, we can also utilize the network similarity method for model selection. To this end, we compute the Euclidean distance between the embedded features of real networks and generated networks (synthesized by the generative network models). Table 3 shows the average Euclidean distance between the embedded features of a real network and all instances of a particular generative model. The smallest average distance indicates the strongest structural similarity between the real network and the corresponding generative model.
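Selecting a model by smallest average distance, as just described, can be sketched as follows; the helper name and the dictionary layout are ours.

```python
import math

def select_model(real_emb, model_instances):
    """model_instances: dict mapping a generative-model name to the list of
    embedded feature vectors of its synthetic instances.  Returns the model
    whose instances lie closest, on average, to the embedded real network."""
    def dist(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    avg = {name: sum(dist(real_emb, e) for e in embs) / len(embs)
           for name, embs in model_instances.items()}
    return min(avg, key=avg.get)  # smallest average distance wins
```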
Real networks and the selected generative model using structural network similarity method
In this paper, we proposed a novel method for network model selection and network similarity. In this method, we investigated a network distance metric for comparing the topological properties (topology) of complex networks. In simple words, for a given real network, we find the matching generative model using network-topology-based features. The computational strategy first generates instances from a selected group of generative models; second, triplets of feature vectors characterizing the network properties are constructed from this set of instances; and third, these features are embedded and fed as input to the classifier, whose other input comes from the real network data (see Fig. 1). The classifier then provides the appropriate generative model that fits the real network. The second objective is to estimate network similarity using the Euclidean distance measure. The results of our analysis on seven real networks show the usefulness of the proposed method (see Tables 2 and 3). We have seen in our study that the model selected through model selection (the classification scheme) and the model selected through network similarity (the Euclidean distance measure) agreed completely (see Tables 2 and 3).
Figure 5 shows the sensitivity of the proposed method to perturbations in the network topology, introduced as various noise levels (or variability, i.e., noise = 5%, 10%, 15%, 20%, 25%) in the test-case networks. The noise was introduced by rewiring a given fraction of the network edges between randomly selected pairs of nodes. We observe that the classification performance of the proposed methodology is unaffected up to 10% noise, after which it drops considerably. In other words, the proposed method is not very sensitive to perturbations in the network topology of the test set, and it is independent of the size and density of the test-case network (e.g., a real network). Therefore, variability is clearly accounted for in the proposed methodology.
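The edge-rewiring perturbation described above can be sketched as follows. This is a minimal interpretation, assuming noise means replacing a fraction of edges with edges between randomly selected node pairs; the helper name is ours.

```python
import random

def rewire_noise(edges, fraction, n_nodes, rng=random.Random(0)):
    """Perturb a network by removing a given fraction of its edges and adding
    the same number of edges between randomly selected node pairs.  The edge
    count is preserved; the topology is not."""
    edges = set(edges)
    n_rewire = int(fraction * len(edges))
    for old in rng.sample(sorted(edges), n_rewire):
        edges.remove(old)
        while True:  # draw a replacement edge that does not already exist
            u, v = rng.randrange(n_nodes), rng.randrange(n_nodes)
            new = (min(u, v), max(u, v))
            if u != v and new not in edges:
                edges.add(new)
                break
    return edges
```

At 10% noise, for instance, 90% of the original edges survive, which matches the regime in which the classifier's performance is reported to be stable.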
In our methodology, the size and density of the target network are not considered in the generation of the training data. Size and density independence is an important feature of our method. It enables the classifier to learn from a dataset of generated networks with different sizes and densities, possibly smaller than the target network. For example, given a very large network instance as the target network, we can prepare the dataset of generated networks at a smaller size than the target. This decreases the time for network generation and feature extraction considerably. The proposed method outperforms various existing methods, which highlights the effectiveness of deep learning architectures in learning a distance metric for the topological comparison of complex networks. Our proposed method, TripletFit, outperforms the state-of-the-art methods with respect to accuracy and noise tolerance.
The benefits of our study may help biologists mimic real-world networks, providing further insight into network topology and properties. The large-scale generation of next-generation sequencing data and its annotation may further demand fast and reliable model selection software tools; the present study is a step in that direction.
Author contributions
K.V.S. and L.V. conceived and designed the study. K.V.S. and A.K.V. implemented the algorithm and prepared the figures of the numerical results. K.V.S., A.K.V. and L.V. analyzed and interpreted the results, and wrote the manuscript. All the authors have read and approved the final manuscript.
Competing interests
We declare that there are no competing interests for this work.
