Abstract
Keywords
Introduction
Anticancer drug treatments normally have a higher risk of adverse events 1 compared with noncancer drug treatments. For example, in chemotherapy, one of the most commonly used methods for treating cancer, agents control cells that divide rapidly, 2 which is the major characteristic of cancer cells. However, many chemotherapy agents can produce severe toxicities,3–6 such as gastrointestinal toxicity, 7 cardiovascular toxicity,8,9 and nephrotoxicity (renal toxicity).10,11 The toxicity effect leads to a wide range of side effects,12,13 such as decreased production of blood cells, suppression of the normal immune system, hair loss, and bleeding.
To better understand the adverse effects of various cancer drugs, several clinical trial studies have focused on analyzing adverse event patterns. For example, Larrar et al. 14 discovered severe hematological side effects from rituximab treatments in children when the drug is used to treat autoimmune diseases. Norden et al. 15 studied the toxicity of bevacizumab when treating patients with recurrent malignant gliomas, finding common adverse events such as nausea and vomiting and severe events such as hemorrhage, proteinuria, and thromboembolic complications. Andritsos et al. 16 discovered that higher doses of lenalidomide cause life-threatening toxicity in patients with chronic lymphocytic leukemia. Furthermore, several studies have analyzed a specific type of adverse event, such as the intravitreal toxicity of bevacizumab studied by Manzano et al. 17 However, most of these clinical studies focused on a specific drug or type of event. Due to the complexity of clinical trials and the challenges in recruiting patients, few studies have been designed to systematically analyze significant adverse events across a large number of cancer therapy agents. In this study, we aim to address this gap by developing a new large-scale, data-driven informatics method to help investigators systematically explore and analyze adverse events in cancer treatments. This systematic analysis complements existing analysis methods for clinical research and offers many potential clinical applications, such as detecting significant side effects of drugs for postmarketing monitoring and providing evidence for comparing cancer therapies across different drugs.
In our exploratory study, we developed a method to extract and formalize adverse event data from multiple clinical trial reports for cross-trial analysis. An informatics pipeline was designed and built for clinical investigators to systematically analyze the adverse events using existing clinical study outcomes. To demonstrate our method, we studied the adverse events of 30 cancer therapy agents using data that were automatically extracted from clinical trial reports on ClinicalTrials.gov. The adverse event results were identified and integrated from trial reports. Using the data, we summarized the prevalence of adverse events across different study drugs. We conducted an analysis to compare and rank the adverse event incidences using the extracted data. The results show that the method provided an effective way to discover significant adverse event outliers associated with cancer therapies.
Methods
We first collected data on 186,339 clinical trials from ClinicalTrials.gov. We then conducted a study using the 1602 cancer clinical trials that targeted 30 common cancer drugs such as bevacizumab, imatinib, lenalidomide, and pemetrexed. The selected cancer drugs included the top eight most commonly used chemotherapies; the complete drug list is summarized in Table 3. A parser was developed to traverse and extract data from the clinical trial reports. From these reports, we extracted the clinical trial title, target condition, recruitment location, and adverse events. To recognize medical concepts and standardize terminologies in the text reports, the extracted data elements were mapped to the Unified Medical Language System (UMLS).18
For example,
Results
Summary of Data Elements
Table 1 summarizes the study data in the Hadoop data warehouse. The adverse event data table, which stores 12,922 distinct adverse events, is the main focus of this study. This data table contains the event name, UMLS concept of the event, number of affected subjects, number of at-risk subjects, and event type (eg, serious event and nonserious event). Other data tables, including 1602 clinical trial descriptions, 30 selected cancer drugs, and 1989 cancer disease conditions, are linked to the adverse event data table through their unique trial reference keys. The Hadoop data warehouse not only stores the adverse event data in a structured format but also provides parallel access to the data elements for data mining analysis.
Table 1. Data extraction and summary.
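The fields listed for the adverse event data table can be pictured as one record per reported event. The following is only a minimal sketch: the class name, field names, and the placeholder trial key and concept identifier are assumptions made for illustration, not the schema actually used in the data warehouse.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdverseEventRecord:
    """One adverse event row; all field names are hypothetical."""
    trial_key: str      # unique trial reference key linking to the other tables
    drug: str           # study drug of the trial
    event_name: str     # event name as reported
    umls_concept: str   # UMLS concept the event name was mapped to
    affected: int       # number of affected subjects
    at_risk: int        # number of at-risk subjects
    serious: bool       # True for a serious event, False for a nonserious one

    @property
    def incidence(self) -> float:
        """Share of at-risk subjects who experienced the event."""
        return self.affected / self.at_risk if self.at_risk else 0.0


# Placeholder values only; they do not come from any real trial report.
example = AdverseEventRecord(
    trial_key="TRIAL-0001",
    drug="degarelix",
    event_name="Nausea",
    umls_concept="C0000000",
    affected=5,
    at_risk=100,
    serious=False,
)
print(f"{example.event_name}: {example.incidence:.2%}")
```

Keeping the affected and at-risk counts separate, rather than storing only a rate, allows counts to be pooled across trials later.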
High Prevalence and Incidence Adverse Events
Understanding the prevalence and incidence of adverse events can provide a useful reference for conducting clinical studies and monitoring the postmarketing of toxic drug effects.20,21 Table 2 shows the top 30 adverse events according to the ranking of trial prevalence. The
Table 2. Top 30 ranking of adverse events based on the prevalence analysis.
Table 3. Average incidence rate of adverse event per cancer drug.
To the best of our knowledge, large-scale systematic analyses of the prevalence and incidence of adverse events are lacking. Therefore, our study complements existing toxicity research for cancer drugs, providing a baseline for understanding the common events. A high prevalence means that the event is common across trials of different drugs. For example, nausea was the top adverse event among trials, with a very high prevalence at 82.77%, followed by fatigue at 77.34%, vomiting at 75.97%, constipation at 72%, and cough at 63%. All the top five adverse events had a prevalence of greater than 60% among trials and an incidence rate of greater than 10% among patients. We also calculated the incidence rates of all the adverse events. A high incidence indicates that the risk of observing the event in a patient is high. For example, among the top 30 high-prevalence events, alopecia (hair loss) had the highest incidence rate at 26.43%, which is higher than that of nausea at 23.17%, even though the prevalence of alopecia among trials was about half that of nausea. This indicates that if patients are exposed to drugs that cause alopecia, the likelihood of observing alopecia is high.
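The distinction between prevalence among trials and incidence among patients can be illustrated with a short sketch. It assumes that prevalence is the fraction of trials reporting an event and that incidence is the pooled ratio of affected to at-risk subjects in those trials; the exact formulas used in the study are not reproduced here, and the rows below are invented toy data.

```python
from collections import defaultdict

# (trial, event, affected, at_risk): invented toy rows, not study data
rows = [
    ("T1", "nausea", 20, 100),
    ("T1", "alopecia", 30, 100),
    ("T2", "nausea", 10, 50),
    ("T3", "nausea", 5, 50),
    ("T3", "fatigue", 10, 50),
]


def prevalence_and_incidence(rows):
    """Return {event: (trial prevalence, pooled incidence)}."""
    all_trials = {trial for trial, _, _, _ in rows}
    trials_with_event = defaultdict(set)
    affected = defaultdict(int)
    at_risk = defaultdict(int)
    for trial, event, n_affected, n_at_risk in rows:
        trials_with_event[event].add(trial)
        affected[event] += n_affected
        at_risk[event] += n_at_risk
    return {
        event: (
            len(trials_with_event[event]) / len(all_trials),  # share of trials
            affected[event] / at_risk[event],                 # share of patients
        )
        for event in trials_with_event
    }


stats = prevalence_and_incidence(rows)
for event, (prevalence, incidence) in sorted(stats.items()):
    print(f"{event}: prevalence={prevalence:.2%}, incidence={incidence:.2%}")
```

In the toy data, alopecia appears in fewer trials than nausea but affects a larger share of the exposed patients, which mirrors the pattern described above.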
Average Adverse Event Incidence Rate per Cancer Drug
To compare adverse event risks across different cancer drugs, we compared the summarized incidence rates of the 30 selected drugs. Table 3 shows the event incidence rate for each of the 30 selected drugs, ranked by incidence rate. There was a significant difference among the cancer drugs: the event incidence rates ranged from 12.41% for vorinostat to 3.20% for lenalidomide. A higher incidence rate for a drug indicates that adverse events are more likely to be observed when the drug is used on a patient. The analysis is a summarized estimate of the total risk of adverse events when administering a drug to a patient. For example, when designing chemotherapy treatment for a patient with breast cancer, a doctor may use a combination of capecitabine and cyclophosphamide. If adverse events are an important factor to consider for the treatment, 22 eg, when treating a weak patient, a higher dose of capecitabine could be combined with a lower dose of cyclophosphamide, because the average event incidence rate of capecitabine was more than 40% lower than that of cyclophosphamide. Combining high- and low-toxicity drugs to create a therapy could lead to a better-tolerated treatment plan. 23 To design a better treatment strategy, a clinical investigator may need to compare a specific adverse event across several different drugs. In the next section, we discuss specific adverse events and analyze their potential risks.
Association Analysis between Drugs and Adverse Events
To compare the association of an adverse event across different drugs, we used the Apriori 24 association mining method to extract significant drug-event pairs from the clinical trial reports. We excluded low-quality adverse event cases in which fewer than 5 patients were at risk and the affected rate was 100%. These trials are usually small and contribute little statistical power to the analysis. Figure 1A–D shows the ranking of four adverse events across 30 cancer drugs based on the incidence rate of the events. For a given adverse event, the incidence differs significantly across the 30 cancer drugs. For example, in Figure 1A and B, we can see that degarelix had far fewer nausea and insomnia events than other drugs. Degarelix is a hormonal therapy normally used for prostate cancer treatment, and it is well tolerated. Comparing the nausea events between degarelix (5.53%) and vorinostat (38.19%), vorinostat was significantly more likely (6.9 times) to cause nausea. Some cancer drugs generally have higher incidences of adverse events. For example, cyclophosphamide had a high incidence of insomnia (15.27%) and neutropenia (29.16%) events, and it was also among the top three for nausea (31.34%) and myalgia (16.35%). Comparing the variation in adverse event incidence among cancer drugs is crucial for designing personalized therapy for patients. For example, studies25,26 found that when a cancer drug causes neutropenia, the patient is very likely to develop bacteremia. Therefore, if a cancer drug has a high likelihood of inducing neutropenia (eg, cyclophosphamide and cisplatin, Fig. 1D), the treatment plan should consider the use of colony-stimulating factors 26 and antibiotics. 25
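The exclusion rule and the nausea comparison above can be sketched in a few lines. This is an illustrative sketch only: it does not implement Apriori mining, the function names are hypothetical, and the exclusion rule is coded exactly as worded (both conditions must hold).

```python
def is_low_quality(affected: int, at_risk: int, min_at_risk: int = 5) -> bool:
    """Exclusion rule as worded in the text: fewer than 5 at-risk
    patients AND a 100% affected rate."""
    return at_risk < min_at_risk and affected == at_risk


def incidence_ratio(rate_a: float, rate_b: float) -> float:
    """How many times higher rate_a is than rate_b."""
    return rate_a / rate_b


# Nausea incidence rates reported in the text (percent).
vorinostat_nausea = 38.19
degarelix_nausea = 5.53

ratio = incidence_ratio(vorinostat_nausea, degarelix_nausea)
print(f"vorinostat vs degarelix nausea ratio: {ratio:.1f}")  # prints 6.9

# Toy cases for the exclusion rule (invented counts).
print(is_low_quality(affected=3, at_risk=3))    # small trial, everyone affected
print(is_low_quality(affected=3, at_risk=30))   # larger trial, kept
```

The ratio reproduces the 6.9-fold difference quoted for nausea between vorinostat and degarelix.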

Figure 1. Comparison of adverse event associations across different cancer drugs. (A) nausea; (B) insomnia; (C) myalgia; (D) neutropenia.
The results of our work show that, for a given adverse event, the differences in incidence rates across drugs can be significant. Our method provides an effective way to compare adverse events across multiple drugs by systematically combining evidence from multiple trials.
Detecting Significant Drug–Event Association Outliers
The previous analysis helped us rank and compare adverse event incidence across different cancer drugs, and we determined that the variance of adverse event incidence could be significant. Based on this observation, we hypothesized that some drugs could be statistically associated with certain adverse events when compared with other drugs. Here, we explored a method to visualize and identify significant outliers of drugs that could cause an adverse event. In the data we extracted, we found that a given drug could be evaluated in many trials. We grouped together trials that tested the same drug and then compared them with other drug groups. To visualize and compare the outlier groups of drugs, the boxplot 27 method was used to examine adverse events across different drugs. The boxplot shows the statistical results of an adverse event within a drug group, including the mean and median; the 75th, 25th, 95th, and 5th percentiles; and the maximum and minimum values of the incidence. The boxplot provides an effective and intuitive way to estimate the variation and dispersion of drug adverse events. In this study, when the boxplot showed a potential outlier, we further calculated the statistical significance of the outlier using Grubbs’ test.
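The two steps described above, summarizing each group as a boxplot and then testing a suspected outlier, can be sketched with the Python standard library. This is a minimal sketch under stated assumptions: the incidence values are invented, the function names are hypothetical, and only the Grubbs statistic G = max|x - mean| / s is computed. Deciding significance additionally requires a critical value derived from a Student's t quantile, which the standard library does not provide and which is therefore left out.

```python
import statistics


def boxplot_summary(values):
    """Mean, median, quartiles, and extremes of a group of incidence values."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "mean": statistics.fmean(values),
    }


def grubbs_statistic(values):
    """Return (suspect value, G) where G = max|x - mean| / sample std dev."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    suspect = max(values, key=lambda v: abs(v - mean))
    return suspect, abs(suspect - mean) / sd


# Invented per-drug incidence rates (percent) for one adverse event.
incidence_by_drug = [2.0, 3.0, 2.5, 3.5, 2.0, 3.0, 20.0]

summary = boxplot_summary(incidence_by_drug)
suspect, g = grubbs_statistic(incidence_by_drug)
print(summary)
print(f"suspected outlier: {suspect}, Grubbs G = {g:.2f}")
```

Grubbs' test assumes approximately normally distributed values and examines one suspected outlier at a time, so the boxplot serves as the screening step and the test as the confirmation step.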
Figure 2 shows four examples of drug-adverse event association outliers. The first example (Fig. 2A) shows that axitinib had a high possibility of inducing hypertension in cancer patients. The event distribution of axitinib was significantly higher than that of other drugs. Previous studies28,29 have analyzed the impact of axitinib on blood pressure. Using our data, we found that the association significance was

Figure 2. Adverse event outliers when comparing cancer drugs. (A) hypertension; (B) deep vein thrombosis; (C) muscle spasms; (D) paronychia.
Discussion
In this study, we proposed a large-scale, systematic approach to analyze and compare the adverse effects of cancer drugs. Using clinical trial reports to extract adverse events from clinical studies, we showed that integrating large amounts of clinical trial data can effectively detect significant adverse events from cancer drugs. Clinical trials are the gold standard for evaluating the safety of drugs, and clinical trial results are valuable resources for clinical research and practice. However, conducting clinical trial studies is expensive and slow: a typical clinical trial can cost millions of dollars over five years.33,34 Sometimes, even after a trial is completed, clinical investigators still face the challenge of not having enough statistical power to support the analysis of drug toxicity and adverse events. A common way to address this problem is to use meta-analysis 35 to enhance the statistical power.
Meta-analysis aims to aggregate data from multiple clinical trials to test a hypothesis. In a meta-analysis, the investigator combines the results from multiple clinical trials to conduct a statistical analysis, which can provide more information for evaluating drug toxicity. For example, Silva et al. 36 combined 18 trials to analyze statin-related adverse events. They manually reviewed the data on the 18 trials and applied Fisher's test to find significant adverse events across trials. Meta-analysis can improve estimation of the effect and reduce the uncertainty of clinical studies. However, meta-analysis is not immune to human bias, 37 and the method can be applied only to a limited number of trials, because it is a labor-intensive process that requires a high level of domain expertise. In this study, we proposed a new, data-driven approach to complement the analysis of adverse events for clinical research. By systematically integrating large numbers of clinical trial reports, we could summarize the prevalence of adverse events across different cancer drugs. We conducted exploratory studies to compare and rank the incidences of adverse events using the extracted data. In the “Results” section, we demonstrated that the method can effectively discover significant adverse event outliers for cancer drugs.
This is an exploratory study, and our method and analysis can be improved in several ways: (1) We did not use the extracted placebo results, which could have been used to establish a baseline standard to discover adverse event outliers. (2) We excluded therapies that use multiple drugs, which helped to reduce noise in the signal. However, the combined effect of drugs is an important topic, and analyzing drug combination therapies is left for future work. (3) About 2% of the adverse event report data contains complex elements that cannot be directly mapped to UMLS concepts, such as “infection without neutropenia, nasal pharynx” and “late radiotherapy toxicity: subcutaneous tissue (within radiotherapy field)”. For these cases, if part of the string can be mapped, we use the first recognizable string as the adverse event, such as
Conclusion
We proposed a method to support the outlier detection of adverse events in cancer clinical trials. We used a data-driven approach to synthesize clinical trial results that studied cancer therapy agents. Among the 186,339 retrieved clinical trial records, we focused on 1602 cancer trials that studied 30 cancer drugs. From the trial data, 12,922 distinct adverse events were extracted. We conducted a systematic analysis to rank all 12,922 adverse events based on their prevalence in trials, such as nausea (82.77%), fatigue (77.34%), and vomiting (75.97%). To demonstrate the effect of finding significant adverse events among cancer drugs, we used the boxplot method to visualize adverse event outliers across different drugs. We showed that by systematically integrating clinical trial reports, significant adverse event outliers associated with cancer drugs can be detected. The method is demonstrated by detecting the following four statistically significant adverse event cases: the association of the drug axitinib with hypertension (Grubbs’ test,
Author Contributions
Conceived and designed the experiments: JL. Analyzed the data: JL, RAC. Wrote the first draft of the manuscript: JL, RAC. Contributed to the writing of the manuscript: JL, RAC. Agree with manuscript results and conclusions: JL, RAC. Jointly developed the structure and arguments for the paper: JL, RAC. Made critical revisions and approved final version: JL, RAC. Both authors reviewed and approved of the final manuscript.
The corresponding author JL had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
