Abstract
Keywords
Introduction
The estimation of gene regulatory networks is the reverse engineering for inferring network structures from different gene expressed products such as transcribed RNA or translated proteins. 1 Many approaches in previous studies were proposed for learning gene regulatory networks. Bayesian networks (BNs) reveal directed acyclic graph structures of networks, in which nodes represent random variables and directed edges indicate causal probabilistic conditional independencies, 2 and therefore, they have often been applied to gene expression data for inferring causal gene regulatory networks.3-6 Considering the fact that BNs are unable to reflect feedback loops existing in real biological networks, the dynamic Bayesian networks were proposed and applied to gene expression data for inferring causal interaction relationship of genes.7-10 However, the primary practical problem with the BNs is their computational complexity. It has been demonstrated that learning BNs from data is an NP-complete problem. Algorithms for learning BNs from data include 2 main components: a scoring metric and a search procedure. The search procedure is used to identify network with high scores, and the article by Chickering 11 shows that the search problem of identifying a BN is NP-complete. However, simple correlation matrices have been applied in several fields for analyzing the correlation of variables and used as another instrument to extract correlation patterns between genes that are presented in gene expression data. 12 Many researchers have proposed using correlation matrices for the analysis of gene interaction networks.13-19
The Gaussian graphical models (GGMs), unlike BNs, define undirected graph structures of networks that represent the conditional dependence between variables. The GGM has been widely applied to gene expression data for analyzing gene interactions. Several techniques were reported for GGM model parameter selection; among them are the standard greedy stepwise forward selection20,21 and the improved model selection approach.
22
Using the fact that a gene regulatory network is typically sparse, a lasso-based method
23
was introduced to improve the accuracy by shrinking the nonzero values that might just be noise in the precision matrix representing a GGM. An increasing number of algorithms have since been proposed to estimate the precision matrix, such as gradient directed regularization for sparse Gaussian concentration graphs,
24
neighborhood selection with the Lasso,
25
the penalized likelihood method,26,27 the stability approach to regularization selection for GGM,
28
the novel Bayesian method for building the GGMs,29,30 and the joint graphical lasso.
31
However, these algorithms for analyzing gene interactions focus on relatively small data sets that include a small part of the genes. As a result, some potentially highly significant gene interactions in a large-scale study could be omitted by those small-scale algorithms. Also, these approaches are computationally costly, making them ineffective to be applied to large data sets, such as genome-wide gene expression data sets. Other recent methods on inference of GGM are as follows: an integrated statistical framework based on the graph lasso which is applied to learning gene networks under single-nucleotide-polymorphism perturbations using eQTL data sets was developed.
32
A novel regression-based method was proposed to obtain asymptotically normal estimation of a large GGM,
33
and it provides both
In this study, an effective network learning model that integrates traditional GGM with the Monte Carlo method (MCGGM) was developed for learning a global network from genome-wide gene expression data. Monte Carlo Gaussian graphical model was applied and verified on a relatively small, real data set of RNA-seq gene expression levels, and the estimated results indicate its strong ability of identifying the interactions with high conditional dependences. Monte Carlo Gaussian graphical model was then applied to a genome-wide data set of RNA-seq gene expression levels. The results validate the ability and reliability of the approach in identifying strong conditional dependences among genes. The contributions of this study are (1) the proposed MCGGM algorithm speeds up the GGM in estimation of large-scale data set and makes it feasible to infer genetic networks under the framework of GGM at a genome-wide scale and (2) the estimated interactions in genome-wide expression data sets provide insights for biologists to explore the complicated molecular interactions.
Model and Methods
A GGM is characterized through the precision matrix, rather than the covariance matrix, of the random variables involved.
Given gene expression data
where S is the sample covariance matrix,
To apply graphical lasso to infer the graphical model, 1 important issue is to choose the optimal penalty parameter
The Monte Carlo method (MC) refers to a series of statistical methods that are essentially used to find solutions to computationally expensive problems.40,41 The core of this method is to use stochastic sampling techniques to solve intractable problems that are too complicated to deal with analytically. 42 The Monte Carlo method typically includes 2 major components: (1) random sampling and estimation and (2) estimation integration. Random sampling is used to run an estimation, and the estimates from multiple runs will then be integrated to improve the estimation. This technique has been used in developing algorithms for solving different problems in multiple fields, including computational biology, 43 applied statistics, 44 and artificial intelligence. 45 All the algorithms with MC share the concept of using random sampling to compute a solution to a given problem. 46
An integrated approach that uses the Monte Carlo random sampling technique is introduced to obtain large-scale GGMs. First, the large set of genes is randomly divided into subsets of equal size. In this way, the large data set is divided into multiple small subsets, making it feasible to directly apply the Gaussian graphical lasso to those subsets for analyzing gene-gene relationships. However, if a large set of genes is simply divided and estimated like this, only some isolated subnetworks are obtained. Some strong gene interactions that connect the subnetworks could be omitted and go undetected. To increase the probability of each gene pair having an opportunity to be sampled in the same subset, the process of random sampling (with replacement) and estimation in the first step will be repeated. Researchers expect that each pair of genes has some opportunity to be in the same subset so that the dependence of each pair of genes may be estimated. Monte Carlo Gaussian graphical model sets a threshold to terminate the iterations when the probability of any pair of genes not being sampled in at least 1 subset is less than or equal to the threshold. The estimated subnetworks are then integrated into a larger network that includes all genes under consideration. During the iterations, each random sampling is tracked and the number of pairs of genes sampled in a subset is recorded in a matrix. Each estimated subnetwork along with the edge weight corresponding to each pair of genes is recorded. The average edge weight of a pair of genes is considered the estimated strength of their dependency. Owing to noise in the data as well as sampling and rounding errors, it is inevitable that there are “noisy” values in the estimates. Because the focus is on finding those genes that are highly conditionally dependent, edges with small weights are filtered if the weights are less than a threshold calculated based on the SD of the edge weights or less than a threshold calculated based on the estimated global network.
Figure 1 illustrates the flowchart of MCGGM. Considering a gene to be a random variable, let
Randomly partition
2. Repeat step 1
3. Each
4. Obtain the final estimated edge matrix by integrating the
where
5. The last step of noise reduction is applied to

Flowchart of the proposed MCGGM approach for learning global genetic networks on genome-wide gene expression data set. GGM indicates Gaussian graphical model; MCGGM, Monte Carlo Gaussian graphical model.
To ensure a high confidence of the integrated network, the number of sampling rounds
In this study,
Results and Analysis
Results on the small data set
The integrated model MCGGM was tested and verified using 2 RNA-seq gene expression data sets. The first one is a small gene expression data set. The genes of the data set were collected from the common genes in 15 types of specific cancer pathway maps from the Kyoto Encyclopedia of Genes and Genomes (KEGG).
47
The RNA-seq expression levels of the genes were retrieved from The Cancer Genome Atlas (TCGA).
48
The data set of RNA-seq expression level was cleaned in 2 steps: (1) removal of the genes whose expression levels are zeros across all samples. It is possible that these genes did express but their levels were so low and were not picked up by RNA-seq technology. (2) Transformation of the expression data with log function. All 0s in the data set were replaced with 1s before transformation. Eventually, a cleaned data matrix that includes 515 samples and 430 genes was obtained. Both traditional GGM and MCGGM were then applied to the data set to obtain 2 edge matrices
Analysis with confusion matrix and Jaccard coefficient
Confusion matrix and receiver operating characteristics (ROC) are useful tools to organize and visualize the performance of classifiers. 49 In this study, nonzero values in an edge matrix are classified as positive and zero values as negative. An ROC curve was constructed to choose an appropriate threshold to filter the noise in the estimated edge matrix. Nonzero values in the estimated matrix represent edges in the corresponding undirected graph, indicating potential interactions in the gene interaction network; however, it does not mean that all the estimated nonzero values indicate actual gene interactions. Some nonzero values in the estimated matrix might be noise called false positives (FPs). Inevitably, some noises (FPs) are present in the estimated gene interaction network. To find a tradeoff threshold so that the estimated matrix includes the true positives (TPs) as much as possible and the FPs as little as possible, the mean and corresponding SD of the nonzero values in the estimated edge matrix were computed. In the estimated edge matrix, most estimated values fall in the range between the value (mean − 3*SD) and the value (mean + 3*SD). The term (mean − 3*SD) represents the difference between the mean value and 3-time SD, and the term (mean + 3*SD) is the sum between the mean value and 3-time SD. To filter the small nonzero values, the values mean − 3*SD, mean − 2*SD, mean − SD, and mean were chosen as thresholds, and based on the estimated results, other discrete values 0.1, 0.08, 0.06, 0.04,0.02, and 0.01 were also chosen as thresholds to observe how the corresponding true positive rates (TPRs) and false positive rates (FPRs) change. The estimated matrix is expected to include more TPs and fewer FPs; the expected threshold is the one with a high TPR and a relatively low FPR in the ROC curve.
With the threshold list (0.1, 0.08, 0.06, 0.04, 0.02, 0.01, mean, mean − SD, mean − 2*SD, mean − 3*SD), the ROC curve reflecting the ratio of TPR and FPR of the thresholds is given in Figure 2 below. The red curve in Figure 2A reflects the performance of the MCGGM. To clearly observe the close markers in the red ROC curve, the markers in Figure 2A are enlarged and the last 4 marks are highlighted with different color strips in Figure 2B. From Figure 2B, clearly, the threshold (mean − 3*SD) indicates a high TPR, and therefore, it was selected to filter the noises in the estimated networks.

(A) ROC curve with different thresholds in the estimated networks. The red curve reflects the performance of the MCGGM. The blue dash line represents the no skill ROC. (B) Zoom-in view of the ROC curve in (A) for FPR < 0.06. The different color strips indicate the corresponding thresholds shown in the top left corner in (B). FPR indicates false positive rate; MCGGM, Monte Carlo Gaussian graphical model; ROC, receiver operating characteristics; TPR, true positive rate.
Unlike TPR, which only focuses on the ratio of correctly identified gene interactions and all actual interactions, Jaccard coefficient
50
takes FPs into account to measure the performance of MCGGM. To focus on the interactions indicating strong conditional dependence, the edge matrices
The estimated TPR and Jaccard coefficient with different proportion of edges.
TP: true positive—if the value of the interaction is positive (nonzero) in
TPR: true-positive rate =
FPR: false-positive rate =
TNR: true-negative rate =
From Table 1, first, the MCGGM method correctly identified 4074 edges of all the 4149 edges in
Analysis with correlation coefficient of strong gene interactions in
and
A stronger indicator of the MCGGM approaching the ground truth is the agreement between the ranks of the strong interactions in
The correlation

Correlation coefficient analysis of gene interactions in
Analysis of the missing strong gene interactions in
Owing to the nature of the random sampling involved in MCGGM, it cannot be guaranteed that all pairs of interacting genes are sampled in a subset. Therefore, this leaves the possibility that some gene pairs might be strongly conditionally dependent in
Identify the missing gene interactions in
Table 2 shows that a total of 6 among the top 500 interactions in

Comparison of the interactions of missing gene pairs. (A) The interactions of missing gene pairs in
For the sake of easy and clear observation, these gene pairs and their interactions were magnified and marked with different colors. Obviously, the direct interactions between gene pairs PLCG2 and FLT3, PDGFB and NOTCH4, SHH and GLI1, E2F2 and CCNA2, SHH and E2F3, and PGR and BIRC5 of
To further explore a reasonable explanation why those direct interactions are not identified in
The aforementioned analysis results indicate the following facts. First, the proposed MCGGM has a strong ability of and high reliability in correctly identifying gene interactions, especially for strong conditional dependences. Second, there is a small probability that few direct interactions of gene pairs might not be identified because they were never sampled in 1 group during the estimation process. However, this possibility has an acceptable probability, even if a few direct interactions are occasionally missing; actually, the results reveal that their indirect interactions may be found with high probability in a short path through only a few genes (sometimes 1 gene), and this further reduces the probability of losing strongly conditionally dependent interactions among genes. Also, not all the missing gene pairs have high conditional dependences. Considering such factors, the actual probability of missing high conditional dependences is far less than the set threshold (0.01 in this study). Third, the MCGGM method infers some FPs in the estimated results. Considering high TPR and the biological context, some FPs involved in the estimated network are tolerated and accepted.
Results on Genome-Wide Gene Data sets
Fifteen genome-wide data sets of RNA-seq expression levels corresponding to 15 types of specific human cancer from TCGA were collected and processed, respectively. The same cleaned methods were applied to the genome-wide data sets. As a result, 15 cleaned data matrices are obtained and shown in Table 3.
Genome-wide data sets of the 15 types of specific human cancers.
Abbreviation: TCGA, The Cancer Genome Atlas.
In this study,
The summary of subedge matrices.
Abbreviations: BLCA, bladder urothelial carcinoma; BLGG/LGG, brain lower grade glioma; BRCA, breast invasive carcinoma; COAD, colon adenocarcinoma; KIRC, kidney renal clear cell carcinoma; KIRP, kidney renal papillary cell carcinoma; LIHC, liver hepato cellular carcinoma; LUAD, lung adenocarcinoma; LUSC, lung squamous cell carcinoma; cell, pancreatic adenocarcinoma; PRAD, prostate adenocarcinoma; SKCM, skin cutaneous melanoma; STAD, stomach adenocarcinoma; THCA, thyroid carcinoma; UCEC, uterine corpus endometrial carcinoma.
Analysis of the common interactions in the extracted subedge matrices
For observing interaction patterns and further analysis, those common interactions between genes which connect to at least 3 other genes (3 scores) were visualized in Figure 5. The gradient color ranges from blue to red, illustrating the frequency of the edge’s appearance across the estimated cancer networks; blue indicates the edge is shared among 8 networks and red indicates the edge is common to all 15 networks. The thickness of an edge represents the median weight of the edge in the crossed cancer networks. From Figure 5, some genes are connected to form different cliques in which many genes are from the same family with a within-group homogeneity, or the expressed products of the genes might function as a biologically significant module in molecular networks; some genes interact with multiple genes in cliques to form highly connected clusters. The genes involved in these interactions potentially play important biological roles. For example, the family of genes C1QA, C1QB, and C1QC that are present as strongly interactive in all 15 estimated networks can be found in Figure 5. C1QA, C1QB, and C1QC are protein-coding genes that encode A polypeptide chain, B polypeptide chain, and C polypeptide chain of serum complement subcomponent C1q, respectively. 51 The structure of C1q indicates that those 3 genes (C1QA, C1QB, and C1QC) are highly connected. 52 In addition, the function of the C1q complex which plays a key role in initiating the classical complement system 53 also indicates strong interdependence of the 3 genes. Their high conditional dependences have been captured in the form of a clique consisting of the C1QA, C1QB, and C1QC genes in genetic networks estimated using their RNA-seq expression levels. This provides further evidence for the ability of the MCGGM to reliably identify high conditional dependence in genome-wide expression data sets.

Three-score common interactions in at least 8 cancer networks. The genes in the interactions connect with at least 3 other genes. The color of an edge indicates the frequency of the edge crossing the estimated cancer networks. The thickness of an edge represents the median weight of the edge in the crossed cancer networks.
In addition, to explore the common subnetwork that shows strong connections and is present in multiple estimated networks, the largest component from the common interactions was extracted and visualized as shown in Figure 6A. The numbers shown on the edges indicate exactly the number of crossing cancer networks. For close observation, the part included in the black rectangle in Figure 6A was further enlarged as shown in Figure 6B. The results clearly show the interactions within the 3 gene families and the interaction among those 3 gene families, such as the interaction among the ribosomal protein (RP) gene family in the green dotted curve, the eukaryotic initiation factors (EIF) gene family in the yellow dotted rectangle, and the eukaryotic elongation factors (EEF) gene family in the gray rectangle.

(A) The largest component of the common interactions. Similarly, the color of an edge indicates the number of cancer networks which the edge crosses. The thickness of an edge represents the median weight of the edge in the cancer networks, indicating the degree of conditional dependence between 2 genes. (B) Zoom-in view of the part included in a black rectangle in Figure 6A. The main family genes are highlighted with the dash curve or rectangles.
It is reasonable that there are some interactions within RP, EIF, and EEF gene families as well as the interactions between those gene families. It has been revealed that those 3 gene families are involved in protein synthesis and play critical roles in eukaryotic translation.54,55 Also, some studies indicated that some of the genes involved in the estimated interactions play certain important roles in different human cancers. Some evidence has indicated the misregulation of EIF3 gene is associated with cancers and its progression.56,56-58 In this study, several EIF3 genes (EIF3D, EIF3L, and EIF3K) involved in highly conditional dependent interactions are highlighted in Figure 6B. The overexpression of EIF3D was demonstrated to promote the development of gallbladder cancer by stabilizing GRK2 and activating phosphatidylinositol 3-kinase-AKT signaling pathway. 59 Also, the overexpression of EIF3D was reported to be related to the lung adenocarcinoma. 60 In addition, it was indicated that the expression levels of EIF3D, EIF3L, and EIF3K were highly associated with mutant status of gliomas. 61 EEF1A has been demonstrated to have a translation-independent role in various biological processes, such as in senescence, oncogenic transformation, cell proliferation, apoptosis, and degradation,62-65 and its overexpression has been reported in multiple human cancers, including melanomas, pancreas, breast, lung, prostate, and colon.66-71 EEF1A was indicated to interact with P53 and P73 and inhibit p53-, p73-, and chemotherapy-induced apoptosis. 72 In fact, the evidence strongly supports that the estimated interactions among the RP genes, EIF genes, and EEF genes in multiple tumors are associated with human cancers with high probability. The common interactions centering on the RP gene family, EIF gene family, and EEF gene family in Figure 6B were identified in the estimated global cancer network, and this further testifies to the ability of MCGGM in estimating gene interaction network.
Time analysis on genome-wide gene data set
To estimate the running time of the graphical lasso on a high-dimensional gene expression data set, the graphical lasso was applied to a real data set of RNA-seq expression levels. The running time was collected through 10 experiments of different sample sizes. Figure 7 below illustrates how the running time increases along with the increasing number of genes involved in the experiments. The results indicate the time increased exponentially with the increasing genes involved in the experiment. Based on the curve of running time in the experiment, the predicted time for finishing the estimation is more than 10 years if graphical lasso is directly applied to a genome-wide gene expression data set of 16 000 genes, assuming the space complexity is not a concern. However, on the same sever, the proposed MCGGM method completed the estimation of the network of 16 000 genes in approximately 64 hours. The result indicates the MCGGM effectively speeds up the estimation of GGM in learning global genetic network at a genome-wide scale.

Running time of the graphical lasso with the increasing number of gene variables. The point on the curve indicates the average time. The running time was collected through 10 experiments for each size indicated in the x-axis.
Discussion
Undoubtedly, a global network including all genes provides additional information for the analysis of human diseases and will be more helpful for biologists to acquire insights into the genetic interactions than a subnetwork. Owing to the nature of Monte Carlo sampling, the sampled subdatasets are independent, and therefore, the estimated subnetworks are also independent. The MCGGM model can also be effectively deployed on a parallel computing platform to infer global networks.
The proposed MCGGM is based on GGM and speeds up the estimation of GGM in learning global genetic networks at a genome-wide scale. The challenge is that MCGGM introduces “false positive” interactions between genes during the estimation. Although the threshold was set to filter the “false positive” gene interactions, some “false positive” gene interactions could not be eliminated from the estimated networks. However, moderate FPs may be tolerated. Also, because of the intrinsic attribute of random sampling, there is a small probability that some gene interactions might not be estimated because those gene pairs might not be sampled in the same subset. But the possibility of losing highly dependent gene interactions can be reduced by setting up a statistically acceptable threshold. The estimated results show that, even if the interactions between some gene pairs are never estimated directly, if they are highly conditionally dependent, their undirected interactions can be identified via a short path with a high probability. Actually, the probability of losing highly dependent gene interactions will be far less than the statistically acceptable threshold.
Despite the existing challenges and minor limitations, the proposed method has been proven to be efficient in learning global gene networks on genome-wide data sets.
Conclusions
The proposed MCGGM model integrates traditional GGM and Monte Carlo simulation technique to make learning a global genetic network from genome-wide data set practical. The integrated model MCGGM was tested and verified using several RNA-seq gene expression data sets. The results demonstrate that MCGGM is an efficient and robust model to be deployed to learn global genetic networks from genome-wide gene expression data sets. The estimated interactions in genome-wide expression data sets provide insights for biologists to explore the complicated molecular interactions and also shed light on exploring new mechanisms of pathways which are involved in different biological activities.
