Abstract
Keywords
Introduction
The American Cancer Society estimates that 233,000 out of 855,220 new cases of cancer among men in the United States will be prostate cancer and that prostate cancer will cause approximately 29,480 deaths, making it the second deadliest cancer for men. 1 Treatment options for prostate cancer include surveillance, removal of the prostate and surrounding tissue, radiation therapy, hormonal therapy including removal of the testicles or suppression of testosterone production, stabilization of bone to limit metastases, and chemotherapeutic or immunotherapeutic agents. 2 Removal of the prostate often results in significant morbidity, including urinary and sexual dysfunction 3 or potentially fecal incontinence. 4 Hormonal treatment of prostate cancer, although standard, has been shown to significantly decrease quality of life in the domains of mental and general health and activity and energy. 5 Chemotherapy and immunotherapy are generally used for recurrent prostate cancer. A list of drugs used for treatment and palliation of prostate cancer is included in Table 1.
Table 1. Standard drugs for prostate cancer.
With the high impact of prostate cancer in the United States and around the world, the continued development of effective therapeutic options is of utmost importance. However, the average cost for bringing a new drug to the market has been estimated to be nearly $1 billion in the US. 6 The whole discovery process requires years of development and experimentation, including costly and time-consuming clinical trials. Thus, the development of an efficient and accurate informatics system for drug repurposing, which can leverage the literature without significant manual effort, is needed. We propose to use semantic predications extracted from the literature to expedite drug discovery and potentially to reduce development time and cost.
In this paper, we report on a system built on natural language processing (NLP) that can find potential prostate cancer drugs based on the knowledge contained within the biomedical literature. Specifically, the system extracts all relevant semantic predications from SemMedDB 7 (a database of semantic relationships generated by SemRep 8 ) and identifies candidate prostate cancer drugs based on proposed pathway schemas and manual filtering by a physician. Using this approach, our methodology discovers potential prostate cancer drugs that are supported by evidence in the biomedical literature.
Background
This study leverages several publicly available NLP tools that have been developed at the National Library of Medicine (NLM), including the Unified Medical Language System (UMLS), SemRep, and SemMedDB.
UMLS
The UMLS provides biomedical domain knowledge for researchers and includes the Metathesaurus, Semantic Network, and SPECIALIST Lexicon. 9 The Metathesaurus integrates concepts from over 100 vocabularies, classifications, and coding systems into one structure. The Semantic Network provides a hierarchy of semantic types assigned to Metathesaurus concepts as well as relationships between those semantic types. The SPECIALIST Lexicon 10 includes lexical information (such as part-of-speech, morphology, and object structure of verbs) to support NLP systems.
SemRep
SemRep is an NLP application that extracts semantic predications from the biomedical research literature. The system relies on all components of the UMLS. For underspecified syntactic analysis, the SPECIALIST Lexicon provides input to the MedPost part-of-speech tagger 11 and subsequent syntactic rules. MetaMap 12 is used to map noun phrases in the syntactic structure to Metathesaurus concepts, and indicator rules map syntactic components to relationships in an extended version of the Semantic Network.
Each semantic predication, a subject–PREDICATE–object triple, consists of a semantic relationship from the extended version of the Semantic Network as a predicate and arguments from the Metathesaurus concepts. SemRep predicates cover genetic etiology of disease (eg, ASSOCIATED_WITH, CAUSES), substance interactions (eg, INTERACTS_WITH, STIMULATES), clinical medicine (eg, TREATS, DIAGNOSES), and pharmacogenomics (eg, AFFECTS, AUGMENTS). 13 For example, SemRep interprets the biomedical text in (1) as the semantic predication in (2), identifying the word “linked” as an indicator of the semantic relationship ASSOCIATED_WITH:
Extracellular matrix associated protein
CYR61 ASSOCIATED_WITH Malignant neoplasm of prostate (MNP).
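To make the triple structure concrete, the following minimal Python sketch represents predication (2) as a small record. The class and field names are illustrative assumptions and are not part of SemRep or the UMLS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predication:
    """A subject-PREDICATE-object triple (field names are hypothetical)."""
    subject: str    # argument: a Metathesaurus concept
    predicate: str  # relationship from the extended Semantic Network
    obj: str        # argument: a Metathesaurus concept

    def __str__(self) -> str:
        return f"{self.subject} {self.predicate} {self.obj}"


# Predication (2) from the text.
example = Predication("CYR61", "ASSOCIATED_WITH", "Malignant neoplasm of prostate")
print(example)
```

Keeping the predicate as a separate field is what later allows predications to be selected by relationship type rather than by simple co-occurrence of the two concepts.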
SemMedDB
All MEDLINE citations have been processed with SemRep, and the extracted predications are stored in a database, SemMedDB. 7 The version of SemMedDB used for this study is based on citations published as of September 30, 2013. The database maintains links from each predication to its source sentence along with the citation identifier (PMID). It also includes positional information regarding arguments and predicates in a given sentence as well as the distance between an argument and its indicator. We have recently exploited SemMedDB as a structured knowledge resource for discovering drug–drug interactions in clinical data. 14
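The link from a predication back to its source sentence and PMID can be pictured with a small relational sketch. The two-table layout, the table and column names, and the placeholder sentence and PMID below are assumptions made for illustration; they are not the actual SemMedDB schema or data.

```python
import sqlite3

# Hypothetical two-table layout: each predication points at one source sentence.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sentence (
        sentence_id INTEGER PRIMARY KEY,
        pmid        TEXT NOT NULL,
        text        TEXT NOT NULL
    );
    CREATE TABLE predication (
        predication_id INTEGER PRIMARY KEY,
        sentence_id    INTEGER NOT NULL REFERENCES sentence(sentence_id),
        subject        TEXT NOT NULL,
        predicate      TEXT NOT NULL,
        object         TEXT NOT NULL
    );
    """
)

# Placeholder row values; not real citation data.
conn.execute(
    "INSERT INTO sentence VALUES (1, 'PMID-PLACEHOLDER', 'placeholder source sentence')"
)
conn.execute(
    "INSERT INTO predication VALUES "
    "(1, 1, 'CYR61', 'ASSOCIATED_WITH', 'Malignant neoplasm of prostate')"
)


def source_of(conn, subject, predicate, obj):
    """Return (pmid, sentence text) pairs that support the given predication."""
    rows = conn.execute(
        """
        SELECT s.pmid, s.text
        FROM predication AS p
        JOIN sentence AS s ON s.sentence_id = p.sentence_id
        WHERE p.subject = ? AND p.predicate = ? AND p.object = ?
        """,
        (subject, predicate, obj),
    )
    return rows.fetchall()


evidence = source_of(conn, "CYR61", "ASSOCIATED_WITH", "Malignant neoplasm of prostate")
print(evidence)
```

A lookup of this kind is what lets a reviewer move from a candidate predication to the sentence that produced it.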
Discovery patterns
In the earlier work, 14 we used discovery patterns to identify pairs of drugs that have a shared association with specific genes and biological functions, suggesting that the drugs interact. The patterns we used take the form
Related work
Other authors have used a number of techniques to extract cancer-related information from biomedical resources, leveraging both the literature and structured data sources. For example, Chun et al. developed a maximum entropy-based named entity recognizer and a topic-classified relation recognizer to extract information from MEDLINE abstracts on prostate cancer. 19 They had biologists annotate a corpus consisting of gene and prostate cancer relations to train the machine learning tools. Epstein used statistical association rules primarily applied to co-occurring words in MEDLINE citations to explore how text mining can be exploited to reduce cost and enhance effectiveness in cancer research. They provide examples in several areas, which include designing therapeutic strategies, clinical trial design, and targeted drug efficacy for different cancer subtypes. 20 Deng et al. developed a statistical method to select prostate cancer biomarkers from mass spectrometry and microarray datasets; the authors then used text mining from Online Mendelian Inheritance in Man (OMIM) to validate results. 21 Finally, Lu et al. used an order-prediction model to predict cancer drug indications based on chemical–chemical interactions. 22
Methods
Our approach (Fig. 1) included four basic components: (1) identifying possible UMLS concepts (with MetaMap) related to prostate cancer, (2) extracting all semantic predications relevant to prostate cancer concepts as well as the genes and drugs that are in a relationship with those concepts from SemMedDB, (3) discovering all possible cancer drugs based on combinations of semantic predications according to pathway schemas, and (4) providing potential unknown prostate cancer drugs after human review and exclusion of known drugs. These components are achieved through a series of steps detailed below.

Prostate cancer concepts are found from the UMLS using MetaMap. SemRep extracts semantic predications from the MEDLINE database and stores them in SemMedDB. Predications from SemMedDB are found containing the prostate cancer concepts as objects and genes as subjects and more predications are found that contain drugs as subjects and genes as objects. Additional predications are selected that contain genes as both subject and object. These predications are lined up in either the
Step 1: Prostate cancer concept extraction. We retrieved relevant prostate cancer concepts from the UMLS Metathesaurus. Two concepts were found and used for this study: C0376358: prostate cancer (MNP) [neoplastic process] and C0600139: prostate cancer (prostate carcinoma) [neoplastic process]. Note that numbers starting with a “C” are concept unique identifiers in the UMLS Metathesaurus, and their corresponding semantic types (eg, neoplastic process) are given in square brackets.
Step 2: Semantic predication extraction from SemMedDB. We extracted three types of predications from SemMedDB: gene–cancer (ie, predications with a gene as the subject and a cancer concept as the object), gene–gene, and drug–gene. We first found all predications describing an influence between a gene and one of the prostate cancer UMLS concepts (Step 1). Specifically, predications having a gene as the subject, one of the prostate cancer concepts as the object, and one of six restricted predicate types (AFFECTS, ASSOCIATED_WITH, AUGMENTS, CAUSES, DISRUPTS, and PREDISPOSES) were extracted as gene–cancer predications. Additionally, drug–gene predications were extracted by finding those that contained a drug as the subject and a gene as the object with any of the following predicates: INHIBITS, STIMULATES, or INTERACTS_WITH. We also extracted gene–gene predications. These were required to have a gene as both the subject and the object and STIMULATES, INHIBITS, or INTERACTS_WITH as the predicate.
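The selection rules in Step 2 can be sketched as a small classifier over predications. This is a simplified illustration, not the code used in the study: the record layout and the coarse type labels ("drug", "gene", "cancer") are assumptions standing in for UMLS semantic types and the two prostate cancer concepts, and GENE_X is a made-up gene used only to show a rejected predication.

```python
from typing import NamedTuple, Optional


class Predication(NamedTuple):
    subject: str
    subject_type: str  # simplified label: "drug", "gene", or "cancer"
    predicate: str
    obj: str
    obj_type: str      # simplified label: "drug", "gene", or "cancer"


# Predicate sets as listed in Step 2.
GENE_CANCER_PREDICATES = {
    "AFFECTS", "ASSOCIATED_WITH", "AUGMENTS", "CAUSES", "DISRUPTS", "PREDISPOSES",
}
INTERACTION_PREDICATES = {"INHIBITS", "STIMULATES", "INTERACTS_WITH"}


def classify(p: Predication) -> Optional[str]:
    """Return 'gene-cancer', 'gene-gene', 'drug-gene', or None if p is not kept."""
    if (p.subject_type == "gene" and p.obj_type == "cancer"
            and p.predicate in GENE_CANCER_PREDICATES):
        return "gene-cancer"
    if p.obj_type == "gene" and p.predicate in INTERACTION_PREDICATES:
        if p.subject_type == "gene":
            return "gene-gene"
        if p.subject_type == "drug":
            return "drug-gene"
    return None


candidates = [
    Predication("quercetin", "drug", "INHIBITS", "FAS", "gene"),
    Predication("FAS", "gene", "STIMULATES", "NFkappaB", "gene"),
    Predication("AR", "gene", "ASSOCIATED_WITH", "MNP", "cancer"),
    # Hypothetical predication with a predicate outside the allowed set.
    Predication("GENE_X", "gene", "COEXISTS_WITH", "MNP", "cancer"),
]
labels = [classify(p) for p in candidates]
print(labels)
```

Restricting both the argument types and the predicate is what separates this filter from a co-occurrence search, which would keep all four rows.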
Step 3: Prostate cancer discovery pathways (Fig. 2)

Step 4: Physician selection of semantic predications. We first retrieved the MEDLINE sentences that produced drug candidates based on DGC and DGGC pathways from SemMedDB. One author (MJC, a physician) then selected the most promising candidates from the semantic predications matching each of the pathways. The selection considered the logical implications of the combination of predications. For instance, if the gene in a DGC pathway contributed to prostate cancer, the drug would need to reduce the abundance or activity of the gene. For the non-specific predicates INTERACTS_WITH and ASSOCIATED_WITH, the actual nature of the interaction or association needed to be ascertained from the abstract or full text article. Consideration was also given to the validity of the component predications relative to their source sentence.
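The logical check described in Step 4 can be written down as a simple rule, shown here as a hedged sketch rather than the procedure the reviewer actually followed. The function name and the boolean `gene_promotes_cancer` (the reviewer's judgment about the gene's role, which the text says must come from the source article for non-specific predicates) are assumptions for illustration.

```python
from typing import Optional


def dgc_is_therapeutic(drug_gene_predicate: str,
                       gene_promotes_cancer: Optional[bool]) -> Optional[bool]:
    """Apply the direction rule to one drug-gene-cancer pathway.

    Returns True or False when the rule can decide, and None when the
    direction must be ascertained manually from the source text.
    """
    # Non-specific drug-gene predicate, or gene role not yet determined.
    if drug_gene_predicate == "INTERACTS_WITH" or gene_promotes_cancer is None:
        return None
    if drug_gene_predicate == "INHIBITS":
        # Inhibiting a cancer-promoting gene is the plausible direction.
        return gene_promotes_cancer
    if drug_gene_predicate == "STIMULATES":
        # Stimulating a gene that decreases cancer progression is plausible.
        return not gene_promotes_cancer
    raise ValueError(f"unexpected drug-gene predicate: {drug_gene_predicate}")


print(dgc_is_therapeutic("INHIBITS", True))
print(dgc_is_therapeutic("INTERACTS_WITH", True))
```

Returning None rather than guessing mirrors the manual step: pathways with an undetermined direction are the ones that need the abstract or full text.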

The resulting drug candidates and their mechanism of action in treating prostate cancer are represented schematically.
Results
Drugs discovered through DGC pathway schema
Step 2 of our method resulted in 6511 predications containing 853 drug terms, 1107 gene terms, and 2 cancer terms. The breakdown for each type of predication is given in Table 2.
Table 2. Counts of predications and unique subjects, predicates, and objects for each type of predication.
Using the DGC pathway schema (Step 3i), we found 18 potential prostate cancer drugs and 3 drugs with some established usage (Table 3). For a gene that promotes the growth or impact of cancer, the example drug is inhibitory, whereas for a gene that decreases cancer progression, the drug is stimulatory. Note that ASSOCIATED_WITH can indicate either a promoting or a decreasing effect and requires exploration of the source text. For example, FAS is pro-apoptotic, and so in this case the association with prostate cancer is a decreasing effect that suggests therapeutic potential. Many drugs share the same pathway, for example, No. 1–4, No. 5–6, No. 7–12, No. 13–16, and No. 17–19 (Table 3). In the first example, simvastatin inhibits the gene
Table 3. Resulting drug candidates through DGC pathway.
Drugs discovered through Drug→Gene1→Gene2→Cancer (DGGC) pathway schema
Applying the DGGC pathway schema (Step 3ii) to our predication set and the subsequent physician selection of semantic predications (Step 4) yielded two unknown drug candidates (Sch-23390 and quercetin) and the known prostate cancer drug dexamethasone (Table 4). In the pathway to cancer for the compound quercetin (Table 4, No. 3), FAS stimulates NFkappaB, which is further described in the source (PMID: 15289496) as an inflammatory response instead of a proapoptotic signal, and activation of NFkappaB is then associated with prostate cancer progression. Therefore, inhibition of FAS by quercetin might reduce prostate cancer progression.
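The chaining behind the two schemas can be illustrated with a toy join over predication triples, where a path is formed whenever the object of one predication is the subject of the next. This is a minimal sketch under stated assumptions: the function names are hypothetical, the triples are the handful mentioned in this paper rather than the study's full predication set, and the physician review of Step 4 is not modeled.

```python
from collections import defaultdict

# Toy triples (subject, predicate, object) taken from examples in the text.
drug_gene = [("quercetin", "INHIBITS", "FAS"),
             ("tanshinone II A", "INHIBITS", "AR")]
gene_gene = [("FAS", "STIMULATES", "NFkappaB")]
gene_cancer = [("NFkappaB", "ASSOCIATED_WITH", "MNP"),
               ("AR", "ASSOCIATED_WITH", "MNP")]


def index_by_subject(triples):
    """Group triples by their subject so that chains can be followed quickly."""
    index = defaultdict(list)
    for triple in triples:
        index[triple[0]].append(triple)
    return index


def dgc_paths(drug_gene, gene_cancer):
    """Drug -> Gene -> Cancer: the drug's target gene is linked to the cancer."""
    gc_by_gene = index_by_subject(gene_cancer)
    return [(dg, gc) for dg in drug_gene for gc in gc_by_gene.get(dg[2], [])]


def dggc_paths(drug_gene, gene_gene, gene_cancer):
    """Drug -> Gene1 -> Gene2 -> Cancer: one intermediate gene-gene step."""
    gg_by_gene = index_by_subject(gene_gene)
    gc_by_gene = index_by_subject(gene_cancer)
    return [(dg, gg, gc)
            for dg in drug_gene
            for gg in gg_by_gene.get(dg[2], [])
            for gc in gc_by_gene.get(gg[2], [])]


dgc = dgc_paths(drug_gene, gene_cancer)
dggc = dggc_paths(drug_gene, gene_gene, gene_cancer)
print(dgc)
print(dggc)
```

Because each predication may come from a different citation, a join of this kind is how knowledge from separate papers ends up in a single candidate pathway.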
Table 4. Resulting drug candidates discovered through DGGC pathway.
Literature evidence for cancer drugs generated from DGC and DGGC pathway schemas
Table 5 lists example predications, with their source sentences, from those that resulted in selected pathways. The source of each sentence, including the PMID and whether it came from the title or abstract, was also extracted. The underlined words in the sentences correspond to the subjects and objects in the predications. Bold and italic words in the sentences indicate the relationships (predicates) between two biomedical concepts. Predicates (eg, STIMULATES) in the semantic predications can be generated from verbs (eg, induce, promote) or nouns (eg, induction, upregulation, stimulation). All biomedical concepts were mapped to UMLS concepts. For example, NFkappaB was mapped to the gene
Table 5. Sentence citations for selected drug–gene, gene–gene, and gene–cancer semantic predications.
Discussion
Our method of identifying cancer drugs from the biomedical literature is novel in that it makes use of knowledge from the entire MEDLINE database (via semantic predications extracted by SemRep). Moreover, we designed two different pathway schemas to allow linking of knowledge from different citations and potentially even different fields of biomedical science. This preliminary work is not intended to provide an exhaustive list of candidate prostate cancer drugs, but it provides a significant starting point for future exploration.
Clinical implications
Each of our pathway schemas yielded both drugs already used for prostate cancer therapy and drugs not currently associated with its treatment. One of the known drugs, dexamethasone, is part of standard combined therapy for certain prostate cancer patients, whereas ketoconazole and paclitaxel are less common in standard protocols but appear in studies of experimental treatment. In general, the drugs not currently used are obvious candidates because they are standard or experimental treatments for other cancers; for instance, simvastatin has been investigated for pancreatic cancer, 23 leukemia, 24 and lung cancer. 25 Tamoxifen is a somewhat unexpected candidate since it is an estrogen receptor antagonist, but it has been suggested in the literature that it may inhibit prostate cell proliferation. 26 Adriamycin is included in the resulting therapeutic candidates and has already been investigated for use in prostate cancer, although clinical trial results have been controversial, suggesting that its activity is limited. 27
Advantages of SemMedDB predications in finding unknown cancer drugs
Our methodology uses semantic predications extracted from all of MEDLINE. In addition to providing broad access to biomedical knowledge in the literature, SemRep predications identify the nature of the relationships between entities, going beyond techniques that use concept co-occurrence. The semantic predications are not only machine readable and computable, but they are also human readable and intuitive. In our method, we are able to take advantage of this by specifying predicates and semantic types of subjects and objects. This is an essential component to the construction of our pathway schemas that significantly facilitates the automatic generation of meaningful candidate pathways.
Drug discovery guidance
Our method facilitates the search for new prostate cancer drugs by focusing on likely candidates that already have supporting evidence in the literature, and it provides not only a candidate list but also a specific mechanism of action. This supports the preclinical investigation necessary before clinical trials may be considered. The method has the potential to find candidates that may not have been considered, since the semantic predications are derived from any of the journals included in MEDLINE, which are not limited to cancer research but come from a wide range of biomedical research fields.
Evaluation of semantic predications
SemRep output has been evaluated several times for recall and precision. Recall has been estimated at approximately 0.60. 17, 28 In previous work identifying drug–drug interactions using semantic predications, 14 we undertook a formal linguistic evaluation for three predication types: gene–drug, drug–gene, and gene–function. The overall precision was 0.60 and varied slightly for each type (0.61 for drug–gene, 0.65 for gene–drug, and 0.54 for gene–function).
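A rough, back-of-the-envelope calculation shows why manual review of chained predications matters. The sketch below is illustrative arithmetic only and rests on two loud assumptions that the evaluations above do not establish: that errors in the component predications are independent, and that gene–gene and gene–cancer predications have the overall precision of 0.60 (these two types were not evaluated).

```python
import math


def path_precision(link_precisions):
    """Chance that every link in a pathway is correct, assuming independence."""
    return math.prod(link_precisions)


DRUG_GENE = 0.61       # reported above for drug-gene predications
ASSUMED_OTHER = 0.60   # ASSUMPTION: overall precision reused for unevaluated types

dgc = path_precision([DRUG_GENE, ASSUMED_OTHER])                   # drug-gene, gene-cancer
dggc = path_precision([DRUG_GENE, ASSUMED_OTHER, ASSUMED_OTHER])   # adds a gene-gene link
print(round(dgc, 3), round(dggc, 3))
```

Under these assumptions, the estimate falls with each added link, which is consistent with checking each component predication against its source sentence before accepting a pathway.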
Identification of known prostate cancer targets
Our results are limited in several ways. One limitation is that a physician manually reviewed only a relatively small, randomized subset of candidates. Through this process, we were able to identify drug–gene and gene–cancer pairs (eg, tanshinone II A INHIBITS AR, AR ASSOCIATED_WITH MNP) by looking for specific known targets (prostate cancer-specific androgen receptor and androgen synthesis pathways).
However, many complete pairs still did not appear in our filtered set; typically, only the drug–gene predication occurred (or less commonly we found only the gene–cancer relationship). There are two major reasons for these missed relationships, both due to decisions made when post-processing the extracted predications.
SemRep is not always able to resolve ambiguous gene/protein names, for example, Steroid 17-alpha-monooxygenase versus
Another post-processing step that reduced results was keeping only specific drugs and genes, while removing relationships in which one of the arguments was a class of drugs (eg, anthracyclines or estrogen antagonists) or proteins (eg, HSP90 heat-shock proteins). Results containing drug classes would likely be nearly as useful as specific compounds. On the other hand, including specific drug–gene and gene–cancer relationships along with gene families would increase recall and provide more candidates but would also significantly increase noise and decrease precision.
Limitations and future work
One limitation to this work is that we depend on previous evaluations of SemRep predications and these evaluations did not include all of our predication types, specifically gene–gene or gene–cancer predications. Although these types are similar to those included in evaluations and relatively consistent within other similar types, an evaluation on these specific predication types may provide additional validation of our methodology.
Our Step 4, physician selection, limits the number of potential pathways analyzed because, instead of giving equal consideration to every predication, selection is limited to a human-readable number of component predications and is subject to individual bias. Machine learning or similar predictive techniques may be able to simulate the selection process given prior selections as training data. This in turn may increase the number of candidates that can be considered computationally and reduce the number that needs to be considered by humans as a last step.
An essential part of this physician selection was distinguishing whether the cancer genes within the predications were likely to have a “driver” or “passenger” role. This need arose in part from the underspecified nature of SemRep predications, especially in the case of the predicate ASSOCIATED_WITH. Because this relationship can either indicate a promoting or decreasing effect, further clarification was gathered from the source text.
One potentially significant concern with our approach is that the compounds extracted by SemRep come from the 2006 version of the UMLS, which was used to avoid the increased ambiguity of the 2012 version, so we are not able to consider potential drugs that were added to the newer version. Even the 2012 version may leave out a considerable number of potential drugs, and using another source for chemical compounds might increase the number of drug–gene assertions extracted.
Just as this approach is an extension of our previous discovery of potential drug–drug interactions, it too can be easily extended to consider other cancers as well as different diseases, conditions, and syndromes. In addition, more levels of gene–gene interactions can be added, extending the schemas to
Conclusion
We present a method to identify potential prostate cancer drugs that takes advantage of the wealth of biomedical literature knowledge contained in the MEDLINE database. In our study, we identified 18 potential prostate cancer drugs that have not previously been used for prostate cancer. Our methodology was also able to identify three substances that have already been used in prostate cancer treatment.
Author Contributions
Conceived the concepts: RZ, MJC. Analyzed the data: RZ, MJC. Wrote the first draft of the manuscript: RZ, MJC. Contributed to the writing of the manuscript: RZ, MJC, MF, HK, TCR, SP, GBM. Agree with manuscript results and conclusions: RZ, MJC, MF, HK, TCR, SP, GBM. Jointly developed the structure and arguments for the paper: RZ, MJC, MF, HK, TCR, SP, GBM. Made critical revisions and approved final version: RZ, MJC, MF, HK, TCR, SP, GBM. All authors reviewed and approved of the final manuscript.
