Abstract

Advancements in technology are shifting the ways that biomedical data are collected, managed, and used. The pervasiveness of connected devices is expanding the types of information that are defined as ‘health data.’ Additionally, cloud-based mechanisms for data collection and distribution are shifting biomedical research away from traditional infrastructure towards a more distributed and interconnected ecosystem. This shift provides an opportunity for us to reimagine the roles of scientists and participants in health research, with the potential to more meaningfully engage in partnership across the research process. At the same time, these emerging practices present a potential to expose research participants to unanticipated and unintended consequences. Social norms and policy can help to mitigate these risks, but their development is often slow relative to the pace of technological advances and, as such, they can become reactive rather than prospective. As an alternative, the integrated development of data governance structures within technological advancements supports their effective implementation, evaluation and evolution in a manner that can balance the benefits and risks of biomedical research in a decentralized ecosystem.

Keywords

Decentralization, cloud computing, biomedical research, governance

This article is a part of special theme on Health Data Ecosystem. To see a full list of all articles in this special theme, please click here: https://journals.sagepub.com/page/bds/collections/health_data_ecosystem

A trend toward decentralization

“Decentralization can be seen as a strategy of governance, prompted by external or domestic pressures to facilitate transfers of power closer to those who are most affected by the exercise of power. This understanding of decentralization implies that in analyzing any such initiative, it is critical to identify the interests of the major actors involved …” (Agrawal and Ostrom, 2001).

There are unique characteristics of digital health data that influence the kind of governance structures that can be used to effectively oversee its use. Data do not carry intellectual property rights in most jurisdictions, making management approaches based on data ownership very complex. In fact, digital data often exists in multiple places where copying and redistributing data have minimal transactional costs. As such, we cannot easily reapply governance approaches that have been effective in other fields. Additionally, many pre-existing models for sharing digital objects from other domains are heavily influenced by copyrights in software and culture and do not necessarily translate. Because property licenses focus on the copying and distribution of files rather than their use, this approach is flawed for personal health data as it would be easy to comply with the terms of a license (not-copying) while violating the privacy desires of a data donor (re-identifying). Further complicating matters, data is increasingly considered a commercial asset in the private sector. Emerging data protection efforts therefore must emphasize governance mechanisms that responsibly mediate the flow of and access to these data, and control for participant rights, such as the right to be “forgotten.”

In the life sciences, research has typically been conducted in academic institutions, which provide a concentration of research resources and expertise within a single physical location. As the scale and complexity of biomedical data have evolved, many projects now require integration of resources across multiple sites. This movement toward decentralized research systems has the potential to disperse power in a way that can provide new values to both researchers and participants. Recent technological advances, including cloud infrastructure and modern web application programming interfaces, provide an opportunity to reimagine how we might extend aspects of decentralization more broadly across the biomedical research community. These advances, used as the basis for a variety of technological platforms, lay the groundwork for distributed teams of researchers to work on common data or compute infrastructure and shared research questions. Similarly, for certain types of biomedical research, participants may no longer have to come into a physical clinic in order to engage with a research study. As with any new system, these changes have provided and will continue to provide great promise, while also revealing unintended consequences for each actor in the ecosystem.

Based on our own experiences with the decentralized management of health data across dozens of research programs (Allaway et al., 2018; anonymized for peer review; Dobbyn et al., 2018; Guinney et al., 2015; Logsdon et al., 2019; Sweeney et al., 2018; Trister et al., 2017), we have observed several key ways in which this shift to decentralization has impacted the interactions and expectations of the primary stakeholders in the research ecosystem. Here, we will use three of these programs as case studies to exemplify the impact of decentralization on three stakeholders: research participants, primary researchers (those who design a research study), secondary researchers (those not directly involved in the study design nor data collection efforts), and the power changes that decentralization brings to their interactions with one another.

While a gross simplification, identifying these three broad actors allows us to discuss two major components of biomedical research that are potentially affected by decentralization, and a third that is less often thought about but is becoming increasingly important. The first two interactions that have already been affected by decentralization are: (1) participant recruitment and data generation as an interaction between participants and primary researchers; and (2) distributed data analysis as an interaction between primary and secondary researchers (usually computationally focused). A third relationship, one that is rarely a focus of most governance systems or consent processes, is the indirect relationship between participants and secondary researchers.

A more decentralized research ecosystem will most certainly change the way participant and primary researcher interact with one another. Traditional human research embraces the term “subject” to refer to participant, with attendant power implications, and has largely relied on in-person discussions between researcher and participant codified, in part, by an informed consent process. Remote digital health studies (Moore et al., 2017), such as those that have leveraged ResearchKit and ResearchStack frameworks, are changing at least some of the power structures around research participation (see case study 2). In these cases, the interactions between researcher and participant can occur outside of the formal clinical setting—in the context of a participant’s daily life. Digital consent processes are beginning to fill in for the aforementioned physical informed consent document, but expectations may be different for both sides of the agreement when no physical interactions take place and a different medium is used for communication (Doerr et al., 2017). Additionally, digital communication can be used to support broader participant choice. With remote communication, participants are not directly exposed to the interests of the clinical researchers and may feel more comfortable refusing or disengaging from a research study—and the engagement rates reflect this (McConnell et al., 2017). They may also enjoy more granular opportunity to guide their interactions with the study. In our own work, an early example provided participants with the opportunity to self-guide how their research data would be distributed to researchers and found that the majority of participants were willing for their data to be shared broadly for multiple research uses (Bot et al., 2016). While we show one mechanism by which this broad sharing is carried out (Wilbanks and Friend, 2016), how it should be done with more sensitive data is less established. 
Continued work is necessary to understand how participants and researchers may effectively use these emerging digital communication tools to meaningfully engage throughout the research process.

Although they provide many potential benefits to improve participant–researcher interactions, technologies that support communication and data collection outside of the research setting also have the potential to further compromise a participant’s sense of privacy. Indeed, research studies are now expecting more frequent data collection from participants within the context of their daily lives. This level of engagement will shift the risk–benefit equation for many participants, as there is more potential that they will expose their own life choices in the context of the research study. Transparency and joint assessment of these types of research studies will be necessary in order to understand how well they support both actors. One possible solution is to integrate research from community-based practices into the mainstream of biomedical research, because we know that community ties can be a very powerful element in engagement and knowledge transmission. Some of these practices will be assessed with initiatives such as the All of Us Research Program (AoURP) in the United States (see case study 3). This work will help us to understand how and when transparency and openness benefit the research participant and the research, and how and when they can create harms and risks.

Another interaction that is affected by decentralization is the one between primary researcher and secondary researcher. Bolstered by funder-guided data sharing mandates and other open science initiatives, there is an increasing trend for research data to be broadly distributed for reuse by secondary researchers. Traditionally, this relationship is heavily negotiated around attribution and can be subject to power imbalances related to seniority, resources, and power. This has been somewhat ameliorated by the development of decentralized data sharing platforms for which pre-developed terms and rules can guide both primary and secondary researchers toward best practices and standards for data sharing and reuse. These platforms support sharing under conditions that outline a clear, predefined understanding of what data use to expect and what credit to receive. Additionally, active sharing of data and resources lessens the likelihood that these secondary researchers are seen as parasitic, as they have been described by some in the biomedical research community (Longo and Drazen, 2016).

The third, more indirect, relationship that is being affected by decentralization is that of the participant and secondary researchers. Since secondary researchers do not have direct contact with the participants whose data they are analyzing, this research is often not governed by traditional ethics review and has minimal oversight (Metcalf and Crawford, 2016). While direct consequences are less apparent, there still exist potential harms to participants or groups of participants as a result of secondary analysis of their data as the (re-)use of data collected for another reason may not have been considered during ethical review of the original study. There are also potential harms to the scientific community as secondary researchers may make false observations due to insufficient understanding of the context that influences appropriate analyses of each data set (Garrison, 2012). These are a set of practices that have gone largely unstudied. Here too, thoughtful governance may help to minimize harms while promoting valuable secondary use of data.

As many of these emerging opportunities are dependent on technological platforms to mediate communication and data handling, there is an important emerging role for technology providers to serve as “gatekeepers” that guide the flow of data across the research ecosystem. This is a shift from the gatekeepers of traditional biomedical research, which included highly trusted public institutions such as research funders, academic institutions, and hospitals. In this new paradigm, the gatekeeping role can often be performed by a technology platform with unknown experience in human health data management, immature data governance policies, and an undisclosed set of values guiding decision-making. Transparency is essential in order to understand how these players are managing the distribution of resources and protection of human data.

Use of technology platforms to collect or manage biomedical research data can also raise privacy concerns. Pervasive technologies like web platforms, smartphones, or wearable sensors are increasingly used to collect health and disease data directly from research participants (Schmitz et al., 2018). The data collected by many technology companies do not belong to the user, but are instead governed by the terms of service of each individual platform or service that is collecting data. As such, use of these platforms can represent serious privacy risks that are not always apparent to participants or researchers. The promise of unprecedented scale often masks unexplored issues caused by the integration of these non-traditional stakeholders into the health data ecosystem. These digital representations of our research participants have tremendous potential to elucidate patterns of human activity, in the health space and beyond, but may come with significant consequences. Governance and regulatory frameworks that can help monitor and facilitate the use of these new data collection methods will be essential to safeguard against privacy risks while still facilitating appropriate use within the health sector.

Decentralization case studies

Three case studies are provided to exemplify the advancement of decentralization in biomedical research and to demonstrate effective mechanisms to promote strong relationships between participants, primary researchers, and secondary researchers. They are provided here not as solutions but as case studies from which we can learn how interactions between these actors can and will continue to change.

Case study 1: The Digital Mammography Challenge

The Digital Mammography DREAM Challenge was a computational competition that invited the global machine-learning and biomedical imaging community to contribute models to address the high false-positive rates in breast cancer screening (Trister et al., 2017). Reducing these false-positive rates is particularly important because false positives can cause unnecessary stress for patients and are extremely costly to the healthcare system as a whole. Sage Bionetworks has been working with the DREAM community to run crowd-sourced challenges since 2012. These challenges are one way in which valuable data can be actively shared by primary researchers with a motivated community of secondary researchers in order to answer an important scientific question or set of questions.

In order to address breast cancer screening false-positive rates, a large quantity of highly sensitive data was necessary, including over 1.7 million digital mammogram images. Usually, these sensitive data are protected by limiting availability to a small set of primary researchers. Sage and a team of collaborators engineered a new system, “model-to-data” (Guinney and Saez-Rodriguez, 2018), using cloud computing and container technology to support analyses of these data without distribution. This “model-to-data” approach allowed over 400 teams of researchers to train their algorithms on sensitive data without actually downloading the full set of mammography images, thus helping to mitigate privacy risks. The approach both devolves greater power to access data and reserves the power to protect those data against privacy attacks. The establishment of a public benchmark describing the performance of algorithms to detect cancer from mammography images was enabled by first allowing decentralized competition between scientists and then facilitating centralized cooperation among top-performing competitors.
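The model-to-data pattern described above can be illustrated with a minimal Python sketch. It is not the challenge's actual infrastructure, which relied on cloud computing and containers; the `DataSteward` class, its `evaluate` method, the `threshold_model` submission, and the toy records are all hypothetical names introduced here for illustration. The sketch shows only the core idea: the submitted model travels to the data, and aggregate scores, rather than the records themselves, travel back.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A record is (features, label). In the challenge these would be images;
# here they are toy numeric features (an assumption for illustration).
Record = Tuple[List[float], int]
Model = Callable[[List[float]], int]


@dataclass
class DataSteward:
    """Hypothetical gatekeeper that holds sensitive records privately."""
    _records: List[Record] = field(default_factory=list, repr=False)

    def evaluate(self, model: Model) -> Dict[str, float]:
        """Run a submitted model next to the data; return only aggregates."""
        if not self._records:
            raise ValueError("no records to evaluate against")
        predictions = [(model(x), y) for x, y in self._records]
        correct = sum(1 for pred, y in predictions if pred == y)
        negatives = sum(1 for _, y in predictions if y == 0)
        false_pos = sum(1 for pred, y in predictions if pred == 1 and y == 0)
        return {
            "n": len(predictions),
            "accuracy": correct / len(predictions),
            "false_positive_rate": false_pos / negatives if negatives else 0.0,
        }


def threshold_model(features: List[float]) -> int:
    """Toy submission: flag a case when the first feature exceeds 0.5."""
    return 1 if features[0] > 0.5 else 0


# The steward keeps the records; the researcher only ever sees the report.
steward = DataSteward([([0.9], 1), ([0.7], 0), ([0.2], 0), ([0.1], 0)])
report = steward.evaluate(threshold_model)
print(report)
```

In this sketch the researcher never receives `_records`; a real deployment would additionally need sandboxing so that a submitted model cannot exfiltrate data through its outputs, which this example does not attempt.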

Beyond the scientific findings that have come and will come from this computational challenge, the challenge also provides a proof of concept that this type of computational framework can be used to support distributed analysis of large and highly sensitive data. While this type of infrastructure provides a mechanism to allow many secondary researchers to access sensitive data, access is managed by a central gatekeeper (in this case, Sage Bionetworks, a nonprofit research organization) with a history of ethical practice in human data protection.

Case study 2: The mPower Parkinson Disease study

The emergence and ubiquity of smartphones and connected devices have allowed the relationship between research participant and primary researcher to shift drastically in recent years. In March 2015, Apple announced a framework called ResearchKit, which would enable researchers to build iPhone applications to run remote research studies that could leverage sensors on a participant’s smartphone. Along with the framework, five research studies were launched simultaneously. One of those studies, mPower, was designed as an observational study to evaluate the feasibility of collecting frequent information about the daily fluctuations in symptoms and symptom severity as related to a self-reported diagnosis of Parkinson disease (PD). Participants were asked to complete a series of short surveys as well as repeated active tasks that quantified symptom severity by leveraging the sensors of the participant’s smartphone. Enrollment was open to individuals diagnosed with PD as well as anyone interested in participating as a control.

Potential participants downloaded the mobile application and self-navigated through eligibility criteria and through an interactive electronic consent process. After completing the consent process, passing a short comprehension quiz, and electronically signing the informed consent form, participants had to make an active choice to either share their data only with the mPower study team and partners or to share their data more broadly with qualified researchers worldwide. In the first six months of the mPower study, 75% of participants who enrolled in the study opted to share their data broadly with qualified researchers (Bot et al., 2016).
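As a rough illustration of how a platform might honor the sharing choice described above, the following Python sketch records each participant's scope and releases only the records whose owners opted into broad sharing. It is a hypothetical model, not the mPower implementation: `SharingScope`, `Participant`, and `releasable_ids` are names invented here, and the three-person cohort is toy data.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class SharingScope(Enum):
    """The two choices offered at the end of the consent process."""
    STUDY_TEAM_ONLY = "study_team_only"              # study team and partners
    QUALIFIED_RESEARCHERS = "qualified_researchers"  # broad sharing


@dataclass(frozen=True)
class Participant:
    participant_id: str
    consented: bool       # completed consent, quiz, and signature
    scope: SharingScope   # the participant's active sharing choice


def releasable_ids(cohort: List[Participant]) -> List[str]:
    """Return IDs whose data may go to qualified researchers."""
    return [
        p.participant_id
        for p in cohort
        if p.consented and p.scope is SharingScope.QUALIFIED_RESEARCHERS
    ]


# Toy cohort (hypothetical data, for illustration only).
cohort = [
    Participant("p1", True, SharingScope.QUALIFIED_RESEARCHERS),
    Participant("p2", True, SharingScope.STUDY_TEAM_ONLY),
    Participant("p3", False, SharingScope.QUALIFIED_RESEARCHERS),
]
shared = releasable_ids(cohort)
print(shared)
```

Storing the choice as an explicit, per-participant field is one way to make the platform's obligation checkable at the point where data flows out.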

The qualified researcher process (Wilbanks and Friend, 2016) is designed to balance participants’ privacy with their desire for optimal data use as well as to emphasize return of information, such that there is transparency into how data is being used and by whom. Because the self-reported outcomes and active task data collected in mPower are relatively low-risk, we felt this type of solution for data distribution with as few restrictions as possible was appropriate. As of the writing of this publication, over 180 qualified researchers have accessed the mPower data (mPower research community), as well as a number who have accessed the data by participating in a more structured computational challenge (Parkinsons Disease Digital Biomarker DREAM Challenge). This case is an illustrative example of the tensions between devolving the power to make choices to research participants and the need for platforms to fulfill those choices when they involve redistribution, as well as a potential case for how to govern those platforms.

Case study 3: The All of Us Research Program (AoURP)

The AoURP is a longitudinal cohort medical research study run by the United States National Institutes of Health. AoURP aims to enroll at least 1,000,000 residents, which entails collecting their blood, urine, medical records, surveys, and digital health data as well as linking in data related to social determinants of health, environment, and more. The study represents an interesting mega-project in decentralization.

First, there is no primary researcher, as the entire project was designed as a resource for secondary researchers. This represents a devolution of power away from the primary awardees, whose impacts were explicitly seen when the vast Kaiser Permanente system exited the program because it would not be a “scientific partner” in the structure but instead simply a recruiter of participants (Kolata, 2018). The AoURP data access structures are not yet final, but anticipate “data passports” or “library cards” being issued to researchers on a standardized basis, deviating from traditional access methods in which both the user and the uses are subjectively reviewed for merit and appropriateness. Instead, the data passport would be a distributed approach to validating and authenticating a user’s identity and bona fide good-standing within the research community (Cabili et al., 2018).

Second, the program is guided by a set of core values developed through a blue-ribbon process led by the White House in 2015. These values require that money and attention be devoted to security, privacy, participant engagement and centricity, enrollment of populations under-represented in traditional biomedical research, and more. The core values do not so much devolve power as provide a forcing function to reorient money and attention to issues that most clinical studies do not need to address. But they represent an interesting element of program governance that supports the other elements of power devolution.

Third, research participants and ethicists have power at multiple levels and stages of the program’s administration. Participants and ethicists sit on governing bodies such as the Steering Committee, Executive Committee, Institutional Review Board, Resource Access Board, and more, and can often call back to the core values noted above to question and influence decisions. Participants also have the right to download their donated data and take it to parties outside the study, breaking some of the traditional power of the data collector to prevent replication of the data outside a study’s context.

Active and iterative evaluation

The multi-faceted biomedical research ecosystem will never be effectively managed by a one-size-fits-all solution to data governance and management. However, it is also clear that this community needs a deliberate, prospective assessment to identify effective governance mechanisms. These will be best developed through active design, development, and testing within the context of active research programs, where the community can learn from and build on the realities of implementation. Meaningful governance must account for: (a) flow into a platform, (b) interactions within communities on the platform, and (c) flow out of the platform. This implicates a wide variety of tools including informed consent, data use agreements and a dizzying array of contracts, plus the design and technology required to make them useful. These tools can be leveraged as building blocks to orchestrate data collection and access depending on the sensitivity of the data in question and the extent to which the data will be made available.

The continued evaluation of appropriate governance models within specific research projects will help to identify potential solutions that may then be generalized. The interactions between major actors as well as specific case studies help to illustrate potential solutions for specific instances of decentralization. We must continue to empirically assess how best to develop these solutions and share our lessons learned, because the technologies and advances in data collection methods will not wait.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

Agrawal and Ostrom (2001) Collective action, property rights, and decentralization in resource use in India and Nepal. Politics & Society 29(4): 485–514. DOI: 10.1177/0032329201029004002.

Allaway R, Angus SP, Beauchamp RL, et al. (2018) Traditional and systems biology based drug discovery for the rare tumor syndrome neurofibromatosis type 2. Lebedeva IV (ed.) PLOS ONE 13(6). Public Library of Science (PLoS): e0197350. DOI: 10.1371/journal.pone.0197350.

Bot BM, Suver C, Neto EC, et al. (2016) The mPower study, Parkinson disease mobile data collected using ResearchKit. Scientific Data 3. Springer Nature: 160011. DOI: 10.1038/sdata.2016.11.

Cabili MN, Carey K, Dyke SOM, et al. (2018) Simplifying research access to genomics and health data with Library Cards. Scientific Data 5: 180039. DOI: 10.1038/sdata.2018.39.

Dobbyn, Huckins, Boocock, et al. (2018) Landscape of Conditional eQTL in Dorsolateral Prefrontal Cortex and Co-localization with Schizophrenia GWAS. The American Journal of Human Genetics 102(6): 1169–1184. DOI: 10.1016/j.ajhg.2018.04.011.

Doerr, Maguire Truong, Bot, et al. (2017) Formative evaluation of participant experience with mobile econsent in the App-Mediated Parkinson mPower Study: A mixed methods study. JMIR mHealth and uHealth 5(2): e14. DOI: 10.2196/mhealth.6521.

Garrison (2012) Genomic justice for Native Americans. Science, Technology, & Human Values 38(2): 201–223. DOI: 10.1177/0162243912470009.

Guinney, Dienstmann, Wang, et al. (2015) The consensus molecular subtypes of colorectal cancer. Nature Medicine 21(11): 1350–1356. DOI: 10.1038/nm.3967.

Guinney and Saez-Rodriguez (2018) Alternative models for sharing confidential biomedical data. Nature Biotechnology 36(5): 391–392. DOI: 10.1038/nbt.4128.

Kolata G (2018) The struggle to build a massive ‘Biobank’ of patient data. The New York Times, 19 March, 18.

Logsdon B, Perumal TM, Swarup V, et al. (2019) Meta-analysis of the human brain transcriptome identifies heterogeneity across human AD coexpression modules robust to sample collection and methodological approach. Cold Spring Harbor Laboratory. DOI: 10.1101/510420.

Longo and Drazen (2016) Data sharing. New England Journal of Medicine 374(3): 276–277. DOI: 10.1056/nejme1516564.

McConnell, Shcherbina, Pavlovic, et al. (2017) Feasibility of obtaining measures of lifestyle from a smartphone App. JAMA Cardiology 2(1): 67. DOI: 10.1001/jamacardio.2016.4395.

Metcalf and Crawford (2016) Where are human subjects in Big Data research? The emerging ethics divide. Big Data & Society 3(1): 205395171665021. DOI: 10.1177/2053951716650211.

Moore, Tassé A-M, Thorogood, et al. (2017) Consent processes for mobile App mediated research: Systematic review. JMIR mHealth and uHealth 5(8): e126. DOI: 10.2196/mhealth.7014.

mPower research community, as part of the mPower Public Researcher Portal (2016). Available at: https://www.synapse.org/#!Synapse:syn4993293/wiki/392026 (accessed 1 March 2019).

Parkinsons Disease Digital Biomarker DREAM Challenge (2017). Available at: https://www.synapse.org/#!Synapse:syn8717496 (accessed 1 March 2019).

Schmitz, Howe, Armstrong, et al. (2018) Leveraging mobile health applications for biomedical research and citizen science: A scoping review. Journal of the American Medical Informatics Association 25(12): 1685–1695. DOI: 10.1093/jamia/ocy130.

Sweeney, Perumal, Henao, et al. (2018) A community approach to mortality prediction in sepsis via gene expression analysis. Nature Communications 9(1): 694. DOI: 10.1038/s41467-018-03078-2.

Trister, Buist DSM and Lee (2017) Will machine learning tip the balance in breast cancer screening? JAMA Oncology 3(11): 1463–1464. DOI: 10.1001/jamaoncol.2017.0473.

Wilbanks J and Friend SH (2016) First, design for data sharing. Nature Biotechnology 34(4). Springer Nature: 377–379. DOI: 10.1038/nbt.3516.