Introduction
What is big data?
“Big data” is a term that was introduced in the 1990s to describe data sets too large to be handled with common software. In 2016, it was defined as information assets characterized by high volume, velocity, and variety that require specific technology and analytic methods for their transformation into use. 1 In addition to the three attributes of volume, velocity, and variety, some have suggested that for big data to be effective, attributes including quality, veracity, and value need to be added as well.2,3 Big data reveals health patterns and promises to provide solutions that have previously been out of society’s grasp; however, the murkiness of international laws, questions of data ownership, public ignorance, and privacy and security concerns are slowing the progress that could otherwise be achieved through the use of big data. In this descriptive review, we highlight the roles of big data, the changing research paradigm, and easy access to research participation via the Internet, fueled by the need for quick answers.
Universally, data volume has increased, with the collection rate doubling every 40 months since the 1980s. 4
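As a rough illustration of what a 40-month doubling time implies, the sketch below models data volume as smooth exponential growth. The function name `growth_factor` and the assumption that the doubling applies continuously to total volume are illustrative and are not taken from the cited source.

```python
DOUBLING_MONTHS = 40  # doubling time quoted in the text


def growth_factor(months, doubling_months=DOUBLING_MONTHS):
    """Multiplicative growth after `months`, assuming smooth exponential doubling."""
    return 2 ** (months / doubling_months)


# Implied growth over one year and over one decade under this assumption.
annual = growth_factor(12)
decade = growth_factor(120)
print(f"per year: x{annual:.2f}, per decade: x{decade:.0f}")
```

Under this simplifying assumption, a 40-month doubling time corresponds to roughly 23% growth per year and an eightfold increase per decade.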
The big data age, starting in 2002, has generated increasing amounts of alphanumeric data; in addition, social media has generated large amounts of data in the form of audio and images. Internet-based devices including smartphones and computers, wearable electronics, the Internet of things (IoT), electronic health records (EHRs), insurance websites, and mobile health all generate terabytes of data. Less obvious sources include clickstream data, machine-to-machine data processing, geospatial data, audio and video inputs, and unstructured text. In general, the total volume of data generated can only be estimated. For example, the typical personal computer in the year 2000 held 10 gigabytes of storage; recently,
Besides being statistically powerful and complex, data need to be available in real time so that they can be analyzed and used immediately. Big data has immense volume and dynamic, diverse characteristics, and requires special management technologies including software, infrastructure, and skills. Big data shows trends in shopping, crime statistics, weather patterns, disease outbreaks, and so on. In recognition of the power of big data to effect change, the United Nations (UN) Global Working Group on big data was created under the UN Statistical Commission in 2014. Its vision was to use big data technologies in the UN global platform to create a global statistical community for data sharing and economic benefit. 8
Methods
We aimed to write a descriptive review to inform physicians about the use of big data (biological, biometric, and electronic health records) in both the commercial and research fields. PubMed-based searches were performed and, since many of the topics were outside the scope of this database, general Internet searches using the Google search engine were also performed. Searching for “Big data and volume and velocity and variety” in the PubMed database returned 45 articles in English. Papers were deemed appropriate by the consensus of at least two authors. A PubMed search for “artificial intelligence in clinical decision support” returned two relevant review articles, and the addition of “randomized control trials” returned 11 randomized control studies, of which only one was relevant. For non-PubMed-indexed scholarly articles, two authors determined relevance by the frequency with which the paper was cited or accessed online. As some content was intended to be informative rather than conclusive, commercial websites, such as those dealing with DNA testing for ancestry, were accessed. The Food and Drug Administration (FDA) website was accessed when searching for the “oldest biobank,” which revealed the HIV registry. Landmark trials were selected for changes in research design and use of big data mining.
Big data in medicine
The major fields predicted to increasingly use big data by 2025 include astronomy, social media (

Big data in medicine.
Before the data can be converted to digital form, biological specimens need to be processed and preserved. In the past, biospecimen preservation standards varied by organization. In 2005, in an effort to standardize biospecimen preservation, the National Cancer Institute contributed to the creation of the Office of Biobanking and Biospecimen Research (OBBR) and the annual Biospecimen Research Network Symposia. 11 In 2009, with international support, the first biobank-specific quality standard was published, and it has since been applied to many biobanks. Biobanking has evolved with regulatory pressures and advances in medical and computational information technology, and it is a crucial enterprise for the biological sciences. One of the longest existing biobanks is the University of California at San Francisco AIDS specimen bank, which has functioned for the past 30 years. 12
All biobanks share the need for significant resources to manage, analyze, and use the information in a timely manner. 13 Commercial biobanks include multinational companies that collect biological specimens from subjects for verification of ancestry. Subjects pay for a DNA analysis kit, collect the sample themselves, and mail it to the company, where it is analyzed and stored. Depending on the applicable legislation, the company can then sell the data to third parties for research.
The shifting paradigm in medical research
The clinical research paradigm has changed to match the needs of an increasingly older population. This has been fueled by large-scale biological data harvesting (biobanks), which is developed, analyzed, and managed with cheaper computing technology (big data) and supported by greater flexibility in study design and in the relationships between industry, government regulators, and academics. With easy access to information via the Internet, citizen science has allowed many non-scientists to participate in research. 14 Biological specimens collected via Internet-based projects may be sold to third parties for research; these may serve as data from healthy controls or from people with a specific medical condition.
Historical precedent and its difficulties
In the past, drug development may have started with serendipity. 15 After the Second World War, the therapeutic research approach became long and expensive. The initial step was the search for possible therapies, followed by in vitro and in vivo testing in multiple phases: the first phase for safety, the second for efficacy, and the third to compare the treatment to the existing standard of care. In addition, hurdles for new drugs included FDA approval, randomized control trials (RCTs), and finally post-release studies. In some unfortunate cases, once the drug was released to the market, rare but serious adverse events would bankrupt the company, and patients who needed the therapy would still not have effective treatment choices. This was particularly hard for patients suffering from rare diseases, where a small population required a large investment of money and time, making a repeat study less attractive to industry. For patients who had limited life spans, the long process put beneficial therapies out of reach. Understanding this need, when there was an urgency for rapid treatments, the FDA worked to expedite the release of new drugs, such as new medications to treat HIV during its epidemic.16,17
In oncology, the historical approach to research and development (R&D) of a new drug, proceeding through the usual phases to RCTs, has been expensive. In 2018, pharmaceutical companies invested approximately 50 billion dollars in R&D for a 3% probability of success for individual projects. That probability, despite the investment of financial and human effort, is too low for patients who may not have any treatment options. 18
Changes in research
Changes in study design
At present, a more purposeful and organized approach is being used: the responsible cause is determined first and serves as the starting point for subsequent therapy.
After completion of the Human Genome Project, technology for pinpointing mutations advanced. 19 Broad sweeps of the human genome with more than 3000 genome-wide association studies (GWAS) have examined about 1800 diseases. 20 Following GWAS or quantitative trait locus (QTL) determination, microarray data allowed identification of candidate genes of interest. 21 To correlate allelic variants with disease, patient and control data from large biobanks are compared. If a variant allele occurs at a significantly higher frequency in those with the disease, that variant can be targeted for therapy.
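The comparison of allele frequencies between patients and controls described above can be sketched as a 2×2 association test. The example below is a minimal illustration using a Pearson chi-square test with one degree of freedom and an odds ratio; the function name `allele_association` and the allele counts are hypothetical and are not drawn from any study cited here. Genome-wide studies apply much stricter significance thresholds than a single test would, because very many variants are tested at once.

```python
import math


def allele_association(case_alt, case_ref, control_alt, control_ref):
    """Chi-square test (1 degree of freedom) and odds ratio for a 2x2 allele table."""
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    # Pearson chi-square statistic for a 2x2 table, without continuity correction.
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c)
    return chi2, p_value, odds_ratio


# Hypothetical counts: variant allele on 300 of 1000 case chromosomes
# and on 200 of 1000 control chromosomes.
chi2, p_value, odds_ratio = allele_association(300, 700, 200, 800)
print(f"chi2={chi2:.2f}, p={p_value:.2e}, OR={odds_ratio:.2f}")
```

With these made-up counts the variant is enriched in cases (odds ratio about 1.7), which is the kind of signal that would mark it as a candidate for further study.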
In a tumor, once a driver mutation that promotes abnormal growth is identified, therapy targeting the specific genetic alteration can be attempted. 22 In the presence of multiple mutations, driver mutations are differentiated from bystander or passenger mutations, as tumors may have a heterogeneous molecular signature.
Pharmacogenomics is the foundation for precision medicine, which is now being clinically practiced in oncology and is being adapted in other fields. The introduction of molecular pathological epidemiology (MPE) allows the identification of new biomarkers using big data to select therapy23,24 (Table 1). Based on an individual’s cellular genetics, drugs that target the desired mutation can be studied and effective doses determined, which can result in safe and efficient treatments.
Examples of trials using big data and new research designs.
AI: artificial intelligence.
Big data technology allows large cohorts of biological specimens to be collected and the resulting data to be stored, managed, and analyzed. At the point of analysis, machine learning algorithms (a subset of artificial intelligence (AI)) can generate further output data that may be different from the initial input data. AI can create knowledge from big data25,26 (Table 1). For example, Beck et al., 25 using a computational pathology model with AI in breast cancer specimens, found previously unknown morphologic features to be predictive of negative outcomes.
Rapid learning health care (RLHC) models using AI may discover data of varying quality, which need to be compared to validated data sets to be truly meaningful. 29 Subsequently, the information extracted can be processed into decision support systems (DSS), which are software applications that can eventually put knowledge-driven health care into practice.
AI can be classified into knowledge-based or data-driven AI. Knowledge-based AI starts with information entered by humans to solve a query in a domain of expertise formalized by the software. Data-driven AI starts with large amounts of data generated by human activity to make a prediction. Data-driven AI needs big data and, with inexpensive computing, is a promising economic choice.30,31
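The distinction between the two classes can be illustrated with a deliberately small sketch. In the knowledge-based version a human supplies the decision rule; in the data-driven version the rule is derived from labeled examples. The function names, the measurement values, and the threshold are all hypothetical and carry no clinical meaning.

```python
def knowledge_based_flag(value, expert_threshold):
    """Knowledge-based: the cutoff is entered by a human expert."""
    return value >= expert_threshold


def learn_threshold(values, labels):
    """Data-driven: choose the cutoff that classifies the most examples correctly."""
    best_threshold, best_correct = None, -1
    for candidate in sorted(set(values)):
        correct = sum((v >= candidate) == y for v, y in zip(values, labels))
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold


# Hypothetical measurements with known outcomes (True = condition present).
values = [90, 100, 110, 130, 140, 150]
labels = [False, False, False, True, True, True]

learned = learn_threshold(values, labels)
print(knowledge_based_flag(135, expert_threshold=126))  # rule supplied by a person
print(knowledge_based_flag(135, expert_threshold=learned))  # rule learned from data
```

The data-driven version improves only as more labeled examples become available, which is why this class of AI depends on big data.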
The combination of AI and DSS is clinically powerful and can improve health care delivery. For example, in a small study of 12 patients with type 1 diabetes, using AI and DSS allowed for quicker changes in therapy, rather than the patients waiting for their next caregiver appointment, without an increase in adverse events. 32
New study designs
With new technology for diagnosing, managing, and treating diseases, modifying the RCT design became essential. Master clinical trial protocols, platform trials, basket/bucket designs, and umbrella designs have been developed over the last decade. 33
Sub-trials are usually designed as early phase, single-arm studies with one or two stages and an option of stopping early if the study is considered futile. The study design is based on determining tumor pathophysiology/activity and matching the target mutation with a hypothesized treatment. Analogous to a screening test, a responsive sub-study would require a larger confirmatory study. For example, although rare cancers are uncommon on an individual basis, the total sum of these cases makes “rare cancers” the fourth largest category of cancer in the United States and Europe. 34 These cancers are challenging to diagnose and treat and have a worse 5-year survival rate than common cancers. One option to help these patients would be to make them eligible for a clinical trial based on the genetic dysregulation of the tumor rather than organ histology.
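The early-stopping option in a two-stage, single-arm sub-trial can be made concrete with a short calculation. The sketch below assumes a simple futility rule (stop after the first stage if the number of responders is at or below a cutoff) and a binomial model for responses; the stage size, cutoff, and response rates are hypothetical and do not come from any trial discussed here.

```python
from math import comb


def prob_early_stop(n_stage1, max_responses_to_stop, true_response_rate):
    """P(X <= cutoff) for X ~ Binomial(n_stage1, p): chance of stopping for futility."""
    p = true_response_rate
    return sum(
        comb(n_stage1, k) * p**k * (1 - p) ** (n_stage1 - k)
        for k in range(max_responses_to_stop + 1)
    )


# Hypothetical rule: enroll 10 patients, stop if 1 or fewer respond.
for rate in (0.10, 0.30):
    print(f"true response rate {rate:.0%}: P(stop early) = {prob_early_stop(10, 1, rate):.3f}")
```

Under these assumed numbers, an inactive drug (10% response rate) is stopped early about 74% of the time, while a drug with a 30% response rate is stopped early about 15% of the time, which is the trade-off such designs are meant to balance.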
Drugs have been studied for a signature driver mutation rather than for an organ-specific disease. With enough information about the molecular definitions of the targets, the focus on the site of origin of the cancer is diminishing. For example, the study drug larotrectinib was noted to have significant sustained antitumor activity in patients with 17 types of tropomyosin receptor kinase fusion–positive cancers, regardless of the age of the patient or the tumor site of origin.35,36 This landmark drug was the first to be FDA approved for tumors with a specific mutation rather than for a disease.
Basket trials may also test off-label use of a drug in patients who have the same genomic alteration for which the drug was initially approved, or it could test a repurposed drug. 37
Even if a traditional RCT is planned, using AI to match data sets and run various configurations can help determine possible therapy choices and can reduce time and investment outlay. In the end, this could speed up the process of drug testing and result in quicker arrival at the RCT stage.
Real-world evidence
Real-world evidence (RWE) is information obtained from routine clinical practice, and it has increased with the use of the EHR. RWE in digital format can be significantly furthered by big data. Clinical practice guidelines that use RWE-based insights include those of the National Comprehensive Cancer Network. In addition, the American Society of Clinical Oncology suggests using RWE as a complement to randomized controlled trials. 40 Big data in RWE allows for more rapid evaluation of therapy in the clinical setting, which is a key element in the cost of R&D of drugs. The 21st Century Cures Act (signed into law 13 December 2016) resulted in the FDA creating a framework for evaluating the potential use of RWE to help support the approval of a new indication for a drug, or to help support post-approval study requirements. 41 Focusing on EHR data, industry is starting to show interest in a new pathway to drug approvals. One example is the use of natural language processing and machine learning systems to provide observational clinical studies of adequate quality to justify approval of a new indication for a drug. Another example is the use of AI technology to identify the effect of comorbidities on therapy outcomes and to identify subgroups within a single disease entity, all of which will enhance personalized medicine. RWE data that are collected include demographics, family history, lifestyle, and genetics, and can be used to predict the probability of future disease. Once a therapy is marketed, RWE along with RCTs could help meet FDA requirements more quickly, to get the therapy to the patient or to compare drugs. A recently published study used RWE to compare cardiovascular outcomes between the therapies studied in the Cardiovascular Outcome Study of Linagliptin versus Glimepiride in Type 2 Diabetes (CAROLINA) trial (Patorno et al.; 27 see Table 1).
Big data: technology and security
Computing technology has become cheaper, which allows for the extensive use of big data. Big data technology can be categorized by its function: either operational or analytic (Table 2). Both systems have specific advantages, formats, data forms, and computer network capabilities (Figure 2).
Big data technology with examples of systems in use.
NoSQL: non-structured query language; MPP: massively parallel processing; SQL: structured query language.

Big data security.
Big data security should include measures and tools that guard big data at all points: data collection, transfer, analysis, storage, and processing. This includes the security needed to protect massive amounts of dynamic data and faster processing such as massively parallel processing systems. The risk to data may be theft, loss, or corruption, whether through human error, inadequate technology (for example, a server crash), or malicious intent. Loss of privacy of health-related information adds to the need for greater security and exposes the organizations involved to financial losses, fines, and litigation.
Processes to prevent data loss and corruption need to be in place at each access point; for example, during data collection, incoming threats need to be intercepted. Security measures include encrypting data at input and output points, allowing only partial data volume transfers and analysis to occur, separating storage compartments in cloud computing, and limiting access with firewalls and other filters. 45 For example, blockchain technology is a security tool that can authenticate users, track data access, and, due to its decentralized nature, limit data volume retrieval. 46 Standardizing big data security continues to be an area where further research and development are required. A review of 804 scholarly papers on big data analytics, conducted to identify challenges, found data security to be a major challenge in managing a large volume of sensitive personal health data. 47
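The idea of tracking data access in a tamper-evident way can be sketched with a simple hash chain, the core data structure behind blockchain. The example below is a single-machine simplification: it is not a distributed ledger, it does no user authentication, and the function names, user names, and actions are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(user, action, prev_hash):
    """SHA-256 over a canonical JSON encoding of one log entry."""
    body = {"user": user, "action": action, "prev_hash": prev_hash}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(chain, user, action):
    """Append an access record that is linked to the hash of the previous record."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    chain.append({
        "user": user,
        "action": action,
        "prev_hash": prev_hash,
        "hash": _digest(user, action, prev_hash),
    })
    return chain


def verify_chain(chain):
    """Return True only if no record has been altered, removed, or reordered."""
    prev_hash = GENESIS_HASH
    for entry in chain:
        expected = _digest(entry["user"], entry["action"], entry["prev_hash"])
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True


access_log = []
append_entry(access_log, "analyst_01", "read cohort A")
append_entry(access_log, "analyst_02", "export subset of cohort A")
print(verify_chain(access_log))   # intact log
access_log[0]["action"] = "read cohort B"
print(verify_chain(access_log))   # edited log no longer verifies
```

Because each record embeds the hash of the one before it, changing any earlier record invalidates every later link, which is what makes the access history auditable.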
Concerns
With changes in the scientific method, difficulties are to be expected. Examples of big data with non-traditional research techniques and negative consequences are listed in Table 3. These include preemptive release of drugs to the market, as in the Bellini trial; loss of privacy of the relatives of criminals who underwent ancestry determination; and questions of ownership of data. Whether the developing research systems will justify the trust invested in them by altruistic participants, patients, and physicians remains to be seen. Government regulators are included in the struggle, as a shifting legal framework could challenge everyone involved (Table 3).
Weaknesses and consequences faced by big data in the changing research landscape.
RCTs: randomized control trials; FDA: Food and Drug Administration; AI: artificial intelligence.
Changing cultural context and the physician
All hospitals collect biological specimens as part of their routine workflow, an example being routine blood tests. In an ideal world, many doctors would like to do some research; in the real world, however, research is performed by a minority of physicians. A survey of physicians across two hospitals in Australia found physicians interested in having biobanks in hospitals; 64 however, large biobanks may be more efficient and financially viable. Rather than discounting routinely collected specimens, ways to capture this potential resource should be explored. One option is to explore how to close the gap between those who routinely prepare the specimens, those who store them, and those who use the information for research. One such project,
Correlations between genetics and disease, and connections that were not obvious in the past, can become visible as the data set increases in size. Instead of starting with people who have the disease, testing the new drug in them in an RCT, and then waiting to determine post-marketing study outcomes, large data collections of genetic and demographic information (including family history, lifestyle, etc.) can be used to show the risk of disease in a population and predict whether risk modification can prevent illness. The shift toward prevention rather than cure may get a big boost from big data. In those with the disease, cellular specifics (receptors, cytokines, along with gene variants) can predict what sites to target (increasing or decreasing effects) in order to develop therapies that are personalized for that subset of patients with the same disease.
The growth of the Internet over the last 20 years and creation of open access to scientific literature has resulted in the availability of unlimited medical information to patients. 66 It has led to the direct use of products and practices by the general public, at times eliminating the need for the clinician’s input. Lack of transparency has created an inconsistently safe environment, and this is especially true among those who participate in social media research. Minimally invasive activities like mailing a saliva swab for genetic testing, while done for reasons of curiosity like determining one’s ancestry, contribute to the collection and sale of large amounts of genetic information to third parties. The loss of privacy is a clear risk outlined in the several pages of online consent that most subjects will probably not read.67,68 There are collections of large data banks with more than a million biospecimens in many private organizations. In the past, medical big data may have seemed more aspirational than practical with both physicians and the general public unaware of its risks and benefits.
For physicians, researchers, and the general public, the flexibility to find answers rapidly is more vital for our well-being today than ever before. For example, in the coronavirus disease 2019 (COVID-19) pandemic, the FDA has engaged directly with more than 100 test developers since the end of January 2020. This unprecedented FDA policy aims to make rapid and widespread testing available. According to the policy update, responsibility for the tests, including those by commercial manufacturers, is being shared with state governments, and these laboratories are not required to pursue emergency use authorization (EUA) with the FDA. 69
An example of big data with an alternate research paradigm using public participation in the COVID-19 pandemic could be as follows: direct-to-consumer marketing of a quantifiable antibody home test for COVID-19. The FDA is working with the Gates Foundation to produce a self-test kit for COVID-19 as a nasopharyngeal swab. 70 If a biobank registry is subsequently created for COVID-19, it would provide us with tremendous information, including, but not limited to, an accurate mortality rate and identification of those who have high antibody levels. The identification of participants with high antibody levels may then allow them to donate antibodies to those at risk for worse outcomes.
Limitations of the article
The article covers various aspects of data and medical research and is limited to being a relevant analysis of the literature rather than an exhaustive review. The most cited or electronically accessed articles have been used as references. The many aspects of data, from collection to security, depend on rapidly changing technology. Information that had physical restrictions and was located in controlled physical premises has migrated to the cloud with digital transformation. In addition, dynamic factors like enterprise mobility or even the current COVID-19 lockdown have changed the way people work. A comprehensive review and in-depth analysis would be out of the scope of a review article.
Final thoughts
The increasing use of big data and AI with heterogeneous large data sets for analysis and predictive medicine may result in more contributions from physicians, patients, and citizen-scientists without having to go down the path of an expensive RCT. The formative pressures between altruistic public participants, government regulators, Internet-using patients in search of cures, clinicians who refer patients, and industries seeking to reduce cost, all supported by cheaper technology, will determine the direction of how new therapies are tried out for use. Increased government interest and funding in this aspect is noted with programs like the “All of Us initiative.” 71 At present, pressing needs in the COVID-19 pandemic force flexibility between all interested parties to conduct investigations and find answers quickly.
Conclusion
Personalized health care is expanding rapidly with more clues for cures than ever before. Each solution presented brings its own set of problems, which in turn need new solutions. Collaboration across silos, such as government agencies, commercial manufacturers, researchers, and the public, needs to be flexible to help the greatest number of patients. Big data and biobanks are tools needed for basic research, which, if successful, may lead to new therapies and clinical trials, and ultimately to new cures. Data that are collected, analyzed, and managed still need to be converted into insight with the goal of “first do no harm.” All involved must share the common goal of data security and transparency to continue to build public trust.
