Abstract
Introduction
Cancer is a complex and heterogeneous genetic disease. Decades of molecular genetic research have identified a number of susceptibility genes underlying the genesis of different types of cancer. 1 It is anticipated that cancer can involve 5–10% of human genes. 2 However, experimentally validated cancer genes currently cover only 1% of human genes, suggesting that hundreds to thousands of cancer genes remain to be identified. Drugs that target known mutated cancer genes have brought dramatic therapeutic advances and have substantially improved and prolonged the lives of cancer patients. 3 Owing to the extreme heterogeneity and complexity of cancer, there is a pressing need to develop individualized treatments for cancer patients. However, drug development is a costly, complex, and time-consuming process. 4 Nevertheless, large amounts of biomedical data and findings provide unprecedented opportunities to explore associations among different types of cancers, drugs, and genes. Systematic analyses of these cancer-specific associations can help reveal hidden associations between cancer types and related genes and drugs.
During the last decade, network-based computational approaches have gained popularity and have become a new paradigm for investigating associations among drugs, diseases, and genes. Applications of these approaches include drug repositioning,5,6 disease gene prioritization,7–9 and identification of disease relationships.10,11 The majority of these approaches focus on relationships between only two categories (eg, associations between genes and diseases). For instance, a human disease–drug network was created based on genomic expression profiles collected from the public GEO database. In total, 170,027 interactions between diseases and drugs were considered significant, including 645 disease–disease, 5,008 disease–drug, and 164,374 drug–drug associations. 12 These expression-based associations among diseases and drugs could serve as future research directions. Bauer-Mehren et al. 13 developed a comprehensive disease–gene association network by integrating associations from several sources that cover different biomedical aspects of diseases. The results indicate a highly shared genetic origin of human diseases. Functional modules were also detected in several Mendelian disorders as well as in common diseases. To systematically analyze drug–disease–gene relationships, Daminelli et al. 14 proposed a network-based approach to predict novel drug–gene and drug–disease associations by completing incomplete bicliques in the network. This approach holds great potential for drug repositioning and discovery of novel associations. However, these approaches are not comprehensive and are limited to only certain associations between drugs, genes, and diseases (ie, drug–disease and drug–gene associations). A network-based investigation considering all pair-wise associations among these entities is necessary to understand the complexity of existing associations and to infer novel associations within the context of the whole knowledge base.
Network-based computational approaches enable us to analyze heterogeneous networks such as drug–disease–gene networks by decomposing them into small subnetworks, called network motifs (NMs). 15 NMs are statistically significant recurring structural patterns found more often in real networks than would be expected in random networks with the same network topologies. They are the smallest basic functional and evolutionarily conserved units in biological networks. The hypothesis is that the NMs of a network are significant sub-patterns that form its backbone and thus focus attention on a small portion of the network's many nodes (eg, drugs, diseases, and genes). These NMs could also aggregate into large modules that perform specific functions through associations among a large number of NMs.
In this paper, we constructed a heterogeneous cancer–drug–gene network from public literature knowledge and investigated the underlying association relationships using network-based systems biology approaches. First, we developed a domain pattern-driven approach to construct an integrated cancer–drug–gene network extracted from the Semantic MEDLINE Database. Second, we proposed a network-based computational approach to mine this integrated heterogeneous network. Significant NMs were detected and evaluated for their potential biological meanings. We demonstrate that these NMs have the potential to help prioritize disease genes and propose novel drug targets. The analysis of such a cancer-focused network involving cancer–drug and cancer–gene associations allows researchers to evaluate the specific relationships between individual cancers in more detail. We believe that such approaches will facilitate the formulation of novel research hypotheses, which is critical for translational medicine research.
Methods
To comprehensively investigate the integrated cancer–drug–gene network formed by associations available in Semantic MEDLINE, we proposed the following two-step computational framework: (1) extraction and optimization of the cancer–drug–gene network from Semantic MEDLINE and (2) network topology analysis of this heterogeneous network at two levels: statistics and degree distribution of high-confidence association networks, and distinct pattern detection at the NM level. In this section, we first describe the steps to extract association network data from the MEDLINE database, followed by a description of the proposed network-based approach to investigate this heterogeneous drug–disease–gene association network. Figure 1 illustrates the steps of the proposed approach.

Figure 1. Overview of the network-based computational framework for an integrated cancer–drug–disease network.
Data Sources and Preprocessing
Semantic MEDLINE in RDF
For this research, we used biomedical research findings extracted from MEDLINE literature as our knowledge base. MEDLINE 16 contains more than 19 million references to published articles in the biomedical fields. We first downloaded the Semantic MEDLINE Database, 17 which is a National Library of Medicine (NLM)-supported database that contains different biomedical entities and their relationships extracted from MEDLINE abstracts using natural language processing methods. Semantic MEDLINE provides comprehensive resources with structured annotations using Unified Medical Language System (UMLS) terms and properties. It currently contains more than 56 million relations extracted from MEDLINE articles. In our previous research, we reorganized these relations into six different Resource Description Framework (RDF) graphs based on the semantic types of the associated concepts. 18 Based on the source and target concepts and their semantic groups, we extracted 843k disease–disease, 111k disease–gene, 1,277k disease–drug, 248k drug–gene, 1,900k drug–drug, and 49k gene–gene associations. Table 1 shows some basic statistics of these six groups of associations.
Table 1. Statistics of the six extracted association groups.
Cancer Relevant Relation Extraction
From the six graphs above, we further extracted those associations that are related to cancer terms. We used “Neoplastic Process” (NEOP) as the semantic type to extract the cancer disease relevant terms. NEOP is defined as a sub-type of disease or syndrome in UMLS semantic type. The associations involving NEOP were extracted and used for downstream network-based analyses.
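The filtering step described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout `(source, source_type, target, target_type)`, the type codes other than `neop`, and the example rows are all hypothetical, and the sketch keeps any association in which at least one endpoint has the NEOP semantic type.

```python
# Hypothetical association records: (source, source_type, target, target_type).
# Only the "neop" code comes from the text; the other codes and all rows
# are made-up stand-ins for entries in the six association graphs.
associations = [
    ("malignant neoplasm of prostate", "neop", "GENE_A", "gene"),
    ("leukemia", "neop", "DRUG_X", "drug"),
    ("DISEASE_Q", "dsyn", "GENE_B", "gene"),
    ("DRUG_X", "drug", "GENE_C", "gene"),
    ("leukemia", "neop", "neuroendocrine tumors", "neop"),
]


def involves_cancer(record, cancer_type="neop"):
    """True if at least one endpoint of the association is a cancer term."""
    _, source_type, _, target_type = record
    return cancer_type in (source_type, target_type)


# Keep only associations that touch at least one NEOP term.
cancer_associations = [r for r in associations if involves_cancer(r)]

for record in cancer_associations:
    print(record)
```

In this toy input, three of the five records involve a NEOP term and are retained for the downstream network analysis.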
Network Motif Analysis
The six different types of associations among cancers, drugs, and genes were integrated into a heterogeneous cancer–drug–gene network. In this network, nodes represent biomedical entities (ie, cancer terms, drugs, or genes), and edges represent associations between two nodes (eg, an association between a drug and a gene). In this paper, we focused on three-node NM identification for this network since larger NMs (number of nodes >3) are composed of three-node NMs in most cases. 19
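To make the idea of typed three-node patterns concrete, the following minimal sketch enumerates connected three-node subnetworks in a small toy network and groups them by pattern. It is not the motif-detection tool used in the paper: the node names, the toy edges, and the assumption that associations are undirected are all illustrative. For undirected three-node subgraphs, the sorted node types together with the sorted edge types are enough to tell typed patterns apart.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy network: node -> entity type, plus undirected associations.
node_type = {"C1": "cancer", "C2": "cancer", "C3": "cancer",
             "D1": "drug", "G1": "gene"}
edges = [("C1", "C2"), ("C1", "G1"), ("C2", "G1"), ("C2", "D1"), ("C1", "C3")]


def build_adjacency(edge_list):
    """Undirected adjacency sets built from an edge list."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def pattern_of(triple, adj, types):
    """Label shared by isomorphic typed triples; None if not connected."""
    present = [(a, b) for a, b in combinations(triple, 2) if b in adj[a]]
    if len(present) < 2:  # three nodes need at least two edges to be connected
        return None
    edge_types = tuple(sorted("-".join(sorted((types[a], types[b])))
                              for a, b in present))
    node_types = tuple(sorted(types[n] for n in triple))
    return node_types, edge_types


def count_patterns(edge_list, types):
    """Count occurrences of every connected typed three-node pattern."""
    adj = build_adjacency(edge_list)
    counts = Counter()
    # Brute-force enumeration: fine for a toy example, too slow for real data.
    for triple in combinations(sorted(adj), 3):
        label = pattern_of(triple, adj, types)
        if label is not None:
            counts[label] += 1
    return counts


counts = count_patterns(edges, node_type)
for label, n in counts.items():
    print(n, label)
```

Brute-force enumeration over all node triples is chosen here only for readability; dedicated motif-detection tools use far more efficient subgraph sampling or enumeration.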
All connected subnetworks containing three nodes in the interaction network were collated into isomorphic patterns, and the number of times each pattern occurred was counted. Under the default setting of the algorithm, a pattern was considered to be an NM if it occurred at least five times and significantly more often than in randomized networks. Statistical significance was tested by generating 1,000 randomized networks and computing the fraction of randomized networks in which the pattern appeared at least as often as in the interaction network. 19
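The significance test described above can be sketched as an empirical p-value over pattern counts from randomized networks. This is a minimal illustration, not the paper's implementation: the random counts are made up, the threshold `alpha=0.05` is an assumption (the text only states the minimum of five occurrences), and the z-score is included because it is commonly reported alongside the p-value, not because the source confirms its use.

```python
from statistics import mean, pstdev


def empirical_p_value(observed, random_counts):
    """Fraction of randomized networks whose count is >= the observed count."""
    return sum(c >= observed for c in random_counts) / len(random_counts)


def z_score(observed, random_counts):
    """Standardized excess of the observed count (assumed, not from the text)."""
    sd = pstdev(random_counts)  # population standard deviation
    if sd == 0:
        return float("inf") if observed > mean(random_counts) else 0.0
    return (observed - mean(random_counts)) / sd


def is_motif(observed, random_counts, min_occurrences=5, alpha=0.05):
    """Pattern occurs at least min_occurrences times and is significant.

    alpha is a hypothetical threshold chosen for illustration only.
    """
    p_value = empirical_p_value(observed, random_counts)
    return observed >= min_occurrences and p_value < alpha


# Hypothetical counts of one pattern in 1,000 randomized networks.
random_counts = [3, 4, 5, 4] * 250

print(empirical_p_value(12, random_counts))
print(z_score(12, random_counts))
print(is_motif(12, random_counts))
```

How the randomized networks are generated (for example, by degree-preserving edge swaps) is left out of this sketch because the text does not specify it.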
Construction of the Core Cancer Association Network
It has been shown that in gene regulatory networks, for each NM, the majority of matches overlap and aggregate into homologous motif clusters. 21 Many of these motif clusters largely overlap with modules of known biological processes within the gene regulatory network. 22 The clusters of overlapping matches of these motifs aggregate into a superstructure that forms the backbone of the network and is assumed to play a central role in defining the global topological organization. Similarly, we aggregated matches of the significant NMs described above into a core cancer–drug–gene network. In this core network, we investigated the degree distributions of the different types of nodes. Nodes with a significantly larger number of links than other nodes are called hub nodes, which are critical to the flow of information throughout the entire network.
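The aggregation step can be sketched as taking the union of all edges that appear in motif matches and then computing node degrees in the resulting core network. The motif instances and edges below are hypothetical, and ranking nodes by degree is only one simple way to surface hub candidates; the text does not give a specific hub threshold.

```python
from collections import Counter
from itertools import combinations

# Hypothetical undirected associations in the full network.
network_edges = {frozenset(e) for e in [
    ("C1", "C2"), ("C1", "G1"), ("C2", "G1"), ("C1", "G2"), ("C2", "G2"),
    ("C1", "C3"), ("C3", "G1"), ("C1", "D1"), ("C2", "D1"), ("D2", "G9"),
]}

# Hypothetical matches of significant NMs, each a triple of node names.
motif_instances = [
    ("C1", "C2", "G1"),
    ("C1", "C2", "G2"),
    ("C1", "C3", "G1"),
    ("C1", "C2", "D1"),
]


def core_network(instances, all_edges):
    """Union of the network edges that occur inside any motif match."""
    core = set()
    for triple in instances:
        for pair in combinations(triple, 2):
            edge = frozenset(pair)
            if edge in all_edges:
                core.add(edge)
    return core


def degrees(edge_set):
    """Number of links per node in an undirected edge set."""
    deg = Counter()
    for edge in edge_set:
        for node in edge:
            deg[node] += 1
    return deg


core = core_network(motif_instances, network_edges)
deg = degrees(core)

# Nodes ranked by degree; the top entries are hub candidates.
for node, d in deg.most_common(3):
    print(node, d)
```

Edges that never take part in a motif match, such as the `D2`–`G9` association in this toy input, are left out of the core network.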
Results
An Integrated Cancer–Drug–Gene Network Reconstructed from Semantic MEDLINE
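We constructed a cancer–drug–gene network in two steps, as described in Methods: (1) extraction of the six association groups from Semantic MEDLINE and (2) restriction of these groups to associations involving at least one cancer term.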
Table 2. Statistics of the six extracted association groups with at least one cancer term involved.
Network Topology Analysis of the Core Drug–Disease–Gene Network
The NM analysis was performed on the integrated cancer–drug–gene network obtained above. As the network contains thousands of associations among 1,711 cancer terms, 1,704 drugs, and 2,551 genes (Table 2), it is too complex for direct visualization. We overcame this problem by identifying enriched NMs and interpreting them through an enhanced visualization. In this heterogeneous network, which consists of 16,028 associations among 5,966 entities (cancers, drugs, and genes), eight significant NMs were identified. Table 3 presents detailed statistics on these NMs.
Table 3. Statistics of significant NMs.
Based on the NMs identified in the analysis, we constructed a core cancer–drug–gene network aggregated from significant NM instances. We then investigated the degree distribution of the different types of entities in this network. Figure 2 shows the degree distributions of cancer, drug, and gene nodes in the core network. All three distributions follow a power law, indicating that the networks related to the different node types are scale free. The majority of the nodes in the network have only a few (fewer than 10) links, but a small number of nodes have a large number of links. Such distributions have been observed in many studies of biological networks. 24 Our analysis demonstrates that the scale-free structure still holds in an integrated network consisting of heterogeneous associations. The hub nodes (ie, the nodes having a large number of links) can point scientists toward future research directions.

Figure 2. Degree distribution of three biomedical entities: cancer term, drug, and gene.
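A rough way to check whether a degree distribution looks like a power law is to fit a straight line to the degree histogram on log–log axes. The sketch below does this with ordinary least squares on made-up degrees. The text does not state how the power-law behavior was assessed, so this is an assumption, and a log–log regression is only a coarse diagnostic rather than a rigorous test.

```python
import math
from collections import Counter


def degree_histogram(degree_list):
    """Map each degree value to the number of nodes with that degree."""
    return Counter(degree_list)


def loglog_slope(histogram):
    """Least-squares slope of log(count) against log(degree).

    For a power law P(k) ~ k^(-gamma), the slope approximates -gamma.
    Requires at least two distinct positive degree values.
    """
    xs = [math.log(k) for k in histogram]
    ys = [math.log(n) for n in histogram.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical node degrees constructed so that count(k) = 64 / k^2.
node_degrees = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1

histogram = degree_histogram(node_degrees)
slope = loglog_slope(histogram)
print(round(slope, 6))
```

In practice the same check would be run separately on the cancer, drug, and gene nodes, since the text reports one distribution per node type.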
Local Network Structure: From Network to NM
The eight significant NM patterns in Table 3 have strong biological meanings and could suggest future directions to scientists in their research fields. One example is NM 7 (Table 3), in which two cancer terms that are associated with each other are also associated with one common gene. This indicates that diseases reported as associated in the literature are more likely to share the same associated disease genes. To further investigate the relationships highlighted by NM 7, we extracted all associations among the 75 cancer terms and 848 genes in NM 7. In total, there are 907 disease–disease and 2,713 disease–gene associations (Fig. 3A) in this subnetwork, suggesting that diseases that are associated with each other are more likely to be associated with a group of common disease genes. For instance, in Figure 3B, “malignant neoplasm of prostate” shares its 253 associated genes with a list of cancer-related terms, such as “neuroendocrine tumors” and “leukemia.” Specifically, five leukemia-related terms were directly associated with “malignant neoplasm of prostate.” Similar findings have been reported in other studies demonstrating that the same functional modules/pathways are affected in both diseases. 25 Only 25 genes are associated with “leukemia” in the literature. Such information will help scientists generate testable hypotheses about the possible roles of these genes in future leukemia research. The detailed associations in Figure 3 are presented in Supplemental File 1.

Figure 3. Subnetworks extracted from NM 7.
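The NM 7 pattern (two associated cancer terms that share a gene) can be extracted with a simple set intersection, as sketched below. The cancer terms are taken from the example in the text, but the gene identifiers and every association in this toy input are placeholders, not findings from the paper.

```python
# Hypothetical cancer–cancer associations (undirected pairs).
cancer_cancer = {
    frozenset(("malignant neoplasm of prostate", "leukemia")),
    frozenset(("malignant neoplasm of prostate", "neuroendocrine tumors")),
}

# Hypothetical cancer -> associated genes (placeholder gene identifiers).
cancer_gene = {
    "malignant neoplasm of prostate": {"GENE_A", "GENE_B", "GENE_C"},
    "leukemia": {"GENE_A", "GENE_C", "GENE_E"},
    "neuroendocrine tumors": {"GENE_B", "GENE_D"},
}


def nm7_instances(cancer_pairs, gene_map):
    """List (cancer, cancer, gene) triples where both cancers share the gene."""
    instances = []
    for pair in cancer_pairs:
        first, second = sorted(pair)
        shared = gene_map.get(first, set()) & gene_map.get(second, set())
        for gene in shared:
            instances.append((first, second, gene))
    return sorted(instances)


instances = nm7_instances(cancer_cancer, cancer_gene)
for cancer_a, cancer_b, gene in instances:
    print(f"{cancer_a} ~ {cancer_b} share {gene}")
```

Collecting the cancer–cancer and cancer–gene edges of all such triples gives the kind of subnetwork summarized in Figure 3.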
Similarly, NM 8 suggests another association pattern between diseases and drugs, in which two diseases that are associated with each other are associated with the same drug. Suthram et al. 11 showed that diseases with significant correlations based on mRNA gene expression data also share common drugs. This NM supports the hypothesis that similar diseases can be treated by the same drugs, allowing us to generate hypotheses about new uses of existing drugs. A three-disease motif (NM 1) was also identified in this heterogeneous network. This NM is also a very common motif pattern in disease networks and gene regulatory networks,26,27 which indicates that NM detection analysis of heterogeneous networks can identify significant NMs, including those NM patterns that exist in a single type of association.
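One way to turn the NM 8 pattern into drug-repositioning hypotheses is to look for "incomplete" instances: two associated cancers where a drug is linked to one but not yet to the other. The sketch below illustrates this idea only; it is not a procedure described in the paper, all names and associations are hypothetical, and any candidate it produces would be a hypothesis to test rather than a result.

```python
# Hypothetical cancer–cancer associations and cancer -> associated drugs.
cancer_cancer = [("cancer_A", "cancer_B"), ("cancer_B", "cancer_C")]
cancer_drug = {
    "cancer_A": {"drug_X", "drug_Y"},
    "cancer_B": {"drug_X"},
    "cancer_C": set(),
}


def repositioning_candidates(cancer_pairs, drug_map):
    """(drug, cancer) pairs that would complete an NM 8 style triangle."""
    candidates = set()
    for first, second in cancer_pairs:
        # Check both directions of each undirected cancer–cancer association.
        for source, target in ((first, second), (second, first)):
            missing = drug_map.get(source, set()) - drug_map.get(target, set())
            for drug in missing:
                candidates.add((drug, target))
    return sorted(candidates)


candidates = repositioning_candidates(cancer_cancer, cancer_drug)
for drug, cancer in candidates:
    print(f"hypothesis: {drug} may be relevant to {cancer}")
```

In this toy input, `cancer_A` and `cancer_B` already share `drug_X`, which is a complete NM 8 instance, while `drug_Y` for `cancer_B` and `drug_X` for `cancer_C` are proposed as candidates.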
Conclusions
In this paper, we proposed a network-based computational framework to investigate an integrated heterogeneous network extracted from MEDLINE literature, including associations among three major entity categories: cancer, drug, and gene. Eight significant NMs were identified and considered as the backbone of the entire network. The potential biological meanings of each NM were further investigated. The results demonstrated that the proposed approach holds the potential to prioritize disease genes for different types of cancer and to propose novel drug targets within the context of the entire knowledge base. We believe that such analyses can facilitate the process of inferring novel relationships between cancers, drugs, and genes. One future direction is to develop module-based approaches to understand associations between different biomedical entities. Graph-theoretic topology analysis of heterogeneous networks can also be applied in future studies. Pathway-level information could also be integrated.
Author Contributions
Conceived and designed the experiments: YZ, CT. Analyzed the data: YZ, CT. Wrote the first draft of the manuscript: YZ, CT. Contributed to the writing of the manuscript: YZ, CT. Agree with manuscript results and conclusions: YZ, CT. Jointly developed the structure and arguments for the paper: YZ, CT. Made critical revisions and approved the final version: YZ, CT. Both authors reviewed and approved the final manuscript.
