Abstract
Introduction
The COVID-19 pandemic has created tremendous political will to bolster pandemic preparedness. As such, this is an opportune time to ensure that investments and technology adoption policy are geared to prevent not just the next pandemic, but all future pandemics. Metagenomic shotgun sequencing is a set of methods that extracts genetic sequence data directly from environmental samples. While metagenomic sequencing is limited in various ways, and other technologies could plausibly be adopted instead, it offers a disease-agnostic approach to monitoring, detecting, and characterizing pathogens and variants, and many disparate groups are working toward or promoting this future pathway.1-3
In this article we discuss policy obstacles to the establishment of a universal, scalable, One Health-upholding, 4 and pathogen-agnostic monitoring system, along with relevant technical and operational issues. While policy planning under uncertainty is always difficult, and concrete plans are premature, strategic thinking can convert technological and policy uncertainties into specific questions, which can then be addressed by the relevant academic, policy, and professional communities.
As with any transitional planning, we identify the starting point and the destination, and then consider transitional challenges. Accordingly, in this article we start with an overview of the current state of biosurveillance, then consider certain representative future systems, and finally describe some common challenges. We use the term “widespread metagenomic monitoring” (WMGM) to refer to future sequencing-based systems to detect infectious diseases and potential pandemic risks. WMGM is distinct from the current and often disjointed biosurveillance efforts and from other visions that are either not pathogen agnostic or are more narrowly focused on specific geographies or sources. We identify the following issues as needing policy solutions:
Suboptimal use and high prices
Privacy and data abuse
Peacetime usefulness
Enabling crisis response
If not adequately addressed, we expect these issues to massively delay or even prevent the implementation of a system to detect pandemic risks.
Present State of Biosurveillance
In the United States, a complex set of programs exists where state-level control over some biosurveillance activities competes with multiple national programs. Meanwhile in many low-income countries, regional and global cooperation, often funded by international partners, is more common. Globally, current approaches track a limited number of patients using tests specific to a single disease and, for the most part, known diseases only.
Geographic Heterogeneity
The practice of surveillance varies greatly around the world. In the United States, not only does the Centers for Disease Control and Prevention (CDC) run several disease-specific and syndromic biosurveillance programs, but the Department of Homeland Security's Countering Weapons of Mass Destruction Office runs both the BioWatch Program and the National Biosurveillance Integration Center. 5 Separate systems like the US Department of Agriculture's Animal and Plant Health Inspection Service are in place for agricultural and livestock disease monitoring. 6 These government systems tend to have limited data sharing between each other, or with other systems internationally. But even when open and widely used systems such as the Electronic Surveillance System for the Early Notification of Community-Based Epidemics (ESSENCE) are used, 7 public health officials more often flag outbreaks of notifiable diseases via doctor diagnoses, rather than via syndromic or other monitoring methods.
Meanwhile, many low-income countries have, at best, partial coverage of the population for basic health services. Where governments or health departments in the affected areas have the capacity to gather data on prevalence, they do so, but they often do not even aggregate extant data. Outbreaks are reported to the World Health Organization (WHO) when they are identified, and limited real-time analysis capacity exists, although the Africa Centres for Disease Control and Prevention (Africa CDC) and others are starting to address this gap.8,9 At the same time, these countries are collaborating through the use of open-source tools like IDseq for analysis of metagenomic sequencing data. 10 The open nature of these systems enables faster analysis and increased operational resilience. We note a trend toward the increasing modularity of various nodes of biosurveillance.
Genomic Data Gathering Paradigms
In the current paradigm, a given node that gathers disease data points is the same as, or is highly coupled to, the node that performs the analysis of such data. This vertical approach is often tied to a specific jurisdiction or data-gathering method, as seen among US agencies mentioned earlier. The vertical integration of analysis into gathering has meant that it is at best awkward—and at worst impossible—to aggregate data between different systems for a more comprehensive disease landscape. This is a key issue in current global infectious disease monitoring (Table 1).33,34
Table 1. Current Popular Genomic Data-Gathering Paradigms
Funding Infrastructure and Payment Systems
Funding for biosurveillance has always been uneven, with costs borne largely by high-income countries but with inconsistencies even there.33,35 In some high-income countries, costs for tests even during a pandemic are often borne by consumers. The resulting implicit discrimination against lower-income and neglected communities has important direct impacts and also leads to insufficient and biased data, both of which undermine surveillance efforts. The parts of surveillance that tend to maintain funding are for lower-risk issues like foodborne pathogens and rare reportable diseases, rather than robust infrastructure for detecting future outbreaks. In low-income countries, there is also a constant battle to maintain funding for surveillance systems, which can seem superfluous until they are vital (Table 2).
Table 2. Summary of Policy Shortcomings in Current Biosurveillance Efforts
Note: Shortcomings exist in almost every area of the system, including gathering, analyzing, storing, and reporting data.
Potential Metagenomic Monitoring Futures
Accurate long-term planning is challenging, and even more so when a plan is predicated on major technological progress. Thus any vision for WMGM must remain tentative and flexible. However, to get to WMGM responsibly and with maximized biosecurity benefits, there needs to be a common understanding of qualities we expect to see in a high-investment scenario.
Gather
An expansive future WMGM system collects data from many nucleic acid sequencing data sources in a coherent set of formats. Other data-types are still available, but given the extent and rapidity of sequencing data, they are largely ancillary. The geographic coverage of nucleic acid data sources feeding into the system is extensive and global, and the cost per sample is minimal. Clinical use of metagenomic sequencing is routine and nearly universal for any suspected respiratory, urinary, and other infections, displacing disease-specific polymerase chain reaction (PCR)-, antigen-, or CRISPR (clustered regularly interspaced short palindromic repeats)-based tests. Similar to the use of syndromic surveillance today, subsets of this medical data are used for biosurveillance. Beyond clinical use for diagnosis, sampling and sequencing capacity is deployed directly for biosurveillance. This encompasses high-risk “sentinel” populations and civic-minded volunteers, 41 as well as agricultural and wilderness ecosystems, 42 built environments, 43 and urban wastewater, often at a neighborhood level. One potential example is a Nucleic Acid Observatory that monitors wastewater and waterways. 44 The rapid gathering of data by this decentralized network is routine and automated, wherever possible, enabling temporal trends to be quickly identified.
Analyze
Analysis is possible in both a centralized and decentralized fashion. Local analysis includes diagnostics in clinical settings that replace and supersede (current) vertically integrated monitoring systems; the details of such systems are important but not our focus.
The data collection points tended by farmers and animal biosurveillance researchers, wastewater investigators, healthcare providers, and others can obtain genomic health insights more rapidly, cheaply, and of higher quality than they can generate themselves. The routinely collected data is also analyzed by academic, local government, and international biosurveillance experts, in near real time. These centers have strong links to political entities responsible for public health and biosecurity. Separating analysis from data gathering also encourages data standards and systems that enable subsequent analysis and broader sharing.
Other data sources for WMGM, such as indicators from other forms of data analysis (eg, internet search data) are integrated. Analysis of data streams in both clinical and public health applications may use a variety of publicly available and/or open-source software, which allows for continually improving and diverse ecosystems of analysis and prediction for clinical applications, national and international public health early warning, and research.
Scale
All else being equal, a WMGM program seeking to maximize public health benefits (1) maximizes sampling density in space, in time, and in sequencing depth, and (2) minimizes the time between nucleic acid sampling and data analysis (ideally sampling everywhere and often, with instant analysis). Achieving increased sampling density in space and time at a sufficiently low cost, and with results on sufficiently fast timelines, implies a future with substantial increases in automation at all steps of data acquisition and in the decentralization of nucleic acid sequencing. A key component of a future system is therefore field-deployable nucleic acid sequencing machines that perform sample collection, sample preparation, and sequencing autonomously at an extremely low cost. To enable this scale of ubiquity, data formats, information protocols, and downstream analysis are carefully designed to derive maximum insight from the integration of these diverse data streams while protecting against abuse of data collection at such an intensive scale.
Store
Data and metadata collected by the distributed sequencing network are at least partly public, but systems also account for societal preferences regarding privacy. Maintaining privacy is necessary to earn social licensing and trust in storing potentially identifying information. Data from these systems are housed in publicly funded repositories and are used for both real-time monitoring and research. These systems provide sufficiently granular monitoring to afford transformational biosecurity benefits, and sufficiently strong privacy protections via a combination of (1) high levels of information secured by operational security and/or (2) statistically or cryptographically deidentified public representations of data streams for monitoring activities that guarantee individual privacy.45-47
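One way to read "statistically deidentified public representations" is as aggregate counts released with calibrated noise. The minimal sketch below uses the Laplace mechanism from differential privacy as an illustrative assumption; the text does not specify which technique is intended, and the function names, the epsilon value, and the neighborhood labels are all hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution centered at zero.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_counts(counts, epsilon, rng=None):
    """Return noisy per-area counts using Laplace noise of scale 1/epsilon.

    Assumption: each individual contributes to at most one count, so the
    sensitivity of the whole table is 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    scale = 1.0 / epsilon
    return {area: count + laplace_noise(scale, rng)
            for area, count in counts.items()}


# Hypothetical weekly detections of one pathogen in three sewersheds.
raw = {"neighborhood_a": 12, "neighborhood_b": 0, "neighborhood_c": 3}
released = private_counts(raw, epsilon=0.5, rng=random.Random(0))
```

Smaller values of epsilon add more noise and give stronger privacy at the cost of less granular monitoring, which is the same tradeoff between biosecurity benefit and privacy that the paragraph above describes.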
Reporting and Usefulness
The distribution and prevalence of diseases is routinely reported in a standardized fashion. Worrying clusters, mutations, and novel crossovers are flagged to both local public health officials and international infectious disease monitoring organizations. The timeliness, sensitivity, and specificity of monitoring approaches are well characterized and a range of candidate threat profiles have been identified, enabling well-calibrated, predetermined but flexible response plans to be activated quickly and with accountability. Funding for the system is supported politically as cost-effective medical infrastructure and as a crucial global warning and response system.
Feasibility
The expansive, yet anticipated, future system will be impossible to realize with current technologies and systems. Current policy and legal structures are also insufficient. In the coming years, any advances in this direction will involve restrictive tradeoffs between coverage, depth, cost, usefulness, and privacy. Fortunately in the longer term, it seems possible to find solutions that make minimal compromises on each front.
With almost certain decreases in sequencing costs and increases in compute capabilities, both real and perceived threats to genomic privacy are potential limiting factors. To address this concern, statistical or cryptographic privacy in data collection is important, as gathering will not be carried out by a single entity. Additionally, privacy-preserving representations can lower barriers to data sharing between entities. Compressed, privacy-preserving representations of sequence data may also limit “information hazards” associated with gaining a deeper understanding of natural variation in the genomes of environmental organisms. 48
WMGM depends on numerous technological advances, changes to policy, and new systems. None of these are simple, but some of them, or at least their direct antecedents, are already being built.
Way Points and Obstacles in a Transition
Near-term applications of metagenomic sequencing in biological monitoring foreshadow longer-term futures for wider deployment. Crucially, near-term proposals differ significantly not only in how sampling is accomplished, but also in the read length and speed of the underlying sequencing technology.
For example, Shean and Greninger 49 propose a near-term future resting on widespread deployment of clinical sampling. In their vision, metagenomic sequencing has increased analytic sensitivity (achieved through deeper sequencing, that is, sequencing a higher proportion of the nucleic acid molecules in the sample, more slowly) such that data can be used reliably and cost-effectively for diagnosis of infectious disease and determination of antimicrobial sensitivity. They also suggest using these methods for outbreak clustering and transmission tracking. In this vision, it seems that at least some preliminary analysis will occur “on machine” (ie, automatically and locally). Ideally, this would be expanded to collecting more than just the immediately clinically relevant data to increase the usefulness of each data point. This increase in metadata would require trustworthy privacy mechanisms that allow for use of the data for WMGM and advantages for reservoir and other monitoring.
Another near-term possibility is the Nucleic Acid Observatory, 44 which proposed ongoing wastewater and watershed sampling across the United States to find sequences that recently emerged or are increasing in frequency, indicating a potential new pathogen or other notable events.
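The idea of flagging sequences that are increasing in frequency can be illustrated with a minimal sketch. It assumes, hypothetically, that each sequence (or k-mer) has a time series of read counts and that the total reads per sample are known. The function names, the pseudocount, and the thresholds are illustrative choices and are not taken from the Nucleic Acid Observatory proposal.

```python
import math


def growth_rate(counts, totals, pseudocount=0.5):
    """Least-squares slope of log relative abundance against sample index."""
    if len(counts) != len(totals) or len(counts) < 2:
        raise ValueError("need two or more paired observations")
    # The pseudocount keeps the logarithm defined when a count is zero.
    ys = [math.log((c + pseudocount) / t) for c, t in zip(counts, totals)]
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    return sxy / sxx


def flag_growing(series, totals, min_rate=0.3, min_final_count=5):
    """Return ids whose log abundance rises by at least min_rate per sample."""
    flagged = []
    for seq_id, counts in series.items():
        # Require some reads in the latest sample to avoid flagging noise.
        if counts[-1] < min_final_count:
            continue
        if growth_rate(counts, totals) >= min_rate:
            flagged.append(seq_id)
    return flagged


# Hypothetical read counts over five consecutive wastewater samples.
totals = [1_000_000] * 5
series = {
    "stable_kmer": [40, 38, 42, 41, 39],
    "emerging_kmer": [0, 1, 3, 9, 27],
}
flagged = flag_growing(series, totals)
```

In this toy input only the roughly tripling series is flagged. A real system would also need to handle sequencing noise, varying sampling intervals, and multiple testing across very large numbers of candidate sequences.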
Critical Technological Advances
For a technical review of current metagenomic techniques and their use in biosurveillance, refer to Ko et al 1 and Simner et al. 3 In this article, we include both metatranscriptomics (analyzing collective RNA transcriptomes, specifically) and viral metagenomics in the definition of metagenomic sequencing. While current surveillance efforts focus on culture- or PCR-based methods, recent advances have been made in using metagenomics for the surveillance of both viruses and microbes, and in sequencing both DNA and RNA. Metagenomics, and metatranscriptomics even more so, are still limited in 2 respects. First, extraction techniques differ depending on the organisms expected to be in the sample. Second, and more important, analysis is difficult because assembled metagenomes are still highly fragmented and hard to compare using current algorithms. Table 3 outlines possible critical advances, including extraction protocol, genome assembly and characterization, and privacy and storage considerations. We propose that efforts on the technologies outlined, in addition to identifying other technologies, will contribute to a WMGM future.
Table 3. Critical Technological Advances
Policy Planning Under Technological Uncertainty
It is possible that the most valuable policy in the long term is to do more technical research. However, the extent to which this might be true can be evaluated in discussions such as this article. Whether metagenomic sequencing-based biosurveillance is technically viable will become apparent in the coming years, but much more neglected is the consideration of policy and systemic concerns. We identify 4 current drawbacks that, if unaddressed, will delay or prevent the implementation of a system to detect pandemic risks. Unfortunately, changes in policy move more slowly than advances in technology, which too often leads to both slow adoption and locking in subpar methods. This could happen, for example, if PCR tests are adopted as a standard or if clinical guidelines indicate that sequencing is appropriate only after other testing is performed. Even after sequencing is demonstrated to be comparably inexpensive and rapid, it might remain reserved for unusual cases. For reasons such as these, it is crucial to flag the needs of future systems and current drawbacks now.
Suboptimal Use and High Prices
The value of metagenomic sequencing will be limited if publicly beneficial uses of metagenomic monitoring are impossible due to patents, the collection of nonpublic data, or a lack of academic and clinical incentive to participate in broadly beneficial applications. A metagenomic monitoring system could fail to be adopted if providing or accessing data is overly unattractive or difficult. To reduce the likelihood of private capture and fragmentation of data, one option is to accelerate data publication and provision of data at the earliest point, so that clinicians and scientists provide public data as immediately as possible. Because academic incentives push against prepublication data dissemination, and commercial incentives push for closed information systems, policymakers should promote immediate data availability.
Burdens of system change also have the potential to delay adoption, which can result in the paradigm of using first paper tests, then PCR or culture-based tests, and only then using metagenomic sequencing. Varied sources will remain critical in the coming decade, but as metagenomic sequencing declines in price enough to be negligible, it should supplant PCR, lateral flow, or other testing, not just supplement them.
An issue related to suboptimal use is costs. The idiosyncratic nature of the US health system will pose additional challenges related to reimbursements and universal clinical access, but even internationally there will be challenges. Capital investment in biosurveillance may be difficult, especially in lower-income countries and less well-populated areas, but international subsidies, tax credits, and other incentive schemes may be useful. Because metagenomic monitoring is particularly important in areas that otherwise may have less access to care, addressing disparities in access to both clinical and biosurveillance sequencing will be essential.
Privacy and Data Abuse
Abuse of private genetic information is likely to occur as the use of sequencing proliferates, whether via metagenomic sequencing or otherwise. For this reason, near-term focus on addressing privacy concerns is crucial. These concerns will become increasingly salient as metagenomic monitoring becomes ubiquitous. Metagenomic samples may initially include human DNA, which is clearly identifiable, but even after human DNA is removed, microbiotic signatures are potentially personally identifiable. 108 It is unclear if there are legal restrictions on analysis of sewage and similar sources, but the data are potentially predictive of otherwise personal information, so public discussion of privacy tradeoffs and preventing misuse is important.
Technological privacy solutions need to be adapted to the specific goals of metagenomic monitoring and need standardization to enable usage. Legal structures that allow for public use of healthcare data, as well as policy approaches for developing standards and encouraging or mandating compliance and data sharing, will be essential.
Closely related to the concern about abuse of personal data is the concern that wider availability of genomic data could affect biosecurity. However, it is unclear if widespread monitoring significantly increases availability of these data compared with other applications of already increasingly available metagenomic technologies.
Beyond these legitimate concerns, new technologies are often the subject of suspicion and misinformation. Clear public rules and enforcement can help address public mistrust. The messaging about the promise of such technology, and the rules to prevent misuse, should be emphasized earlier rather than later. An example is the Health Insurance Portability and Accountability Act of 1996, 109 which is seen as too restrictive; likely as a result, few claims of misuse of medical data in the United States have emerged. At the same time, fully private data would not allow for infectious disease monitoring, so that specific approach is unlikely to be viable. For this reason, initial deployments in countries with higher institutional trust may be preferred. Alternatively, less privacy-invasive initial use cases, such as metagenomic analysis of wastewater, might circumvent many concerns.
Peacetime Usefulness
The most important applications for health security are preventing and responding to crises; however, systems that are useful only during a crisis are likely to lack funding or be unavailable when a crisis occurs.35,110 These challenges are compounded for new technologies. For this reason, it is vital to ensure that metagenomic technologies are used even when there is no crisis. Thankfully, studies show a wide variety of nascent uses, including early cancer screening and precision medicine.111-114
Clinical applications for diagnosis of infections are crucial for identifying pathogens based on symptoms, ruling out the possibility of infection as a cause, and identifying antibiotic resistances in a given bacterial infection, as well as contact tracing and identifying routes of transmission. At present, metagenomic sequencing is rarely used for any of these, which could change if adoption increases. Advancing new uses depends on the availability of sequencing, its speed, and its reliability. Changing clinical practice is also challenging, and in addition to demonstrating advantages, care should be taken to ensure that the new systems will benefit clinicians and that cost differences are minimized or compensated for. Similarly, building transmission tracing systems for routine outbreaks can provide continuing value to public health workers.
Metagenomic analysis of wastewater is also valuable for routine public health monitoring. Benefits may include the identification of variants of seasonal influenza or SARS-CoV-2 or the locations of foodborne pathogen outbreaks, potentially even before clinical detection. Emphasizing these routine benefits will help ensure that the systems are maintained.
Despite the seeming convergence of interests between public health and metagenomic sequencing, identifying places where the routine uses of sequencing are misaligned with crisis uses is also critical to ensure systems are not built myopically. For example, short-term uses of pathogen metagenomics focused on viral metagenomic sequencing might compete with more valuable technology, such as parallel host transcriptomics, which can maximize clinical information per sample.
Enabling Crisis Response
The value of metagenomic sequencing for biosurveillance lies in its ability to function as a warning system for imminent outbreaks. However, a warning is useful only if it enables a response. Government willingness to respond requires a trustworthy warning signal across the possible scenarios where response is needed. The value of such a response depends on the speed of warning signals and subsequent interventions. Building a system that provides alerts within hours instead of days is superfluous if a response takes weeks. Similarly, the value of a system is limited by the accuracy of the warning signal. False alarms both reduce willingness to respond and make the system more expensive to use.
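The effect of accuracy on the usefulness of a warning can be made concrete with Bayes' rule. The minimal sketch below computes the probability that an alert reflects a real outbreak. The prior, sensitivity, and false-positive rates are invented for illustration and are not estimates from this article.

```python
def alert_ppv(prior, sensitivity, false_positive_rate):
    """Probability that an alert corresponds to a real outbreak (Bayes' rule)."""
    for name, p in (("prior", prior),
                    ("sensitivity", sensitivity),
                    ("false_positive_rate", false_positive_rate)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    true_alerts = prior * sensitivity
    false_alerts = (1.0 - prior) * false_positive_rate
    if true_alerts + false_alerts == 0:
        raise ValueError("no alerts are possible with these inputs")
    return true_alerts / (true_alerts + false_alerts)


# Hypothetical weekly monitoring where an outbreak begins in 1 week per 1,000.
ppv_loose = alert_ppv(prior=0.001, sensitivity=0.95, false_positive_rate=0.05)
ppv_strict = alert_ppv(prior=0.001, sensitivity=0.95, false_positive_rate=0.001)
```

Under these assumed numbers, roughly 2% of alerts from the looser system would be real, compared with roughly 49% from the stricter one. Because outbreaks are rare, even a small false-positive rate can dominate the alerts that decision makers see.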
Tradeoffs may exist in such a system. For example, distributed analysis allows for faster independent confirmation of an incipient outbreak but may lead to more false positives. Similarly, nonpublic government analysis could be more inclusive of otherwise private data but may be less trusted. In either case, governments need to plan responses when a warning signal is detected, regardless of the source.
Key questions about the use of metagenomic sequencing for biosurveillance extend far beyond the scope of this article. One of the most vital questions is how analysis will be translated into policy, and by whom. Governments, nongovernmental organizations, multilateral institutions, and academics are all capable of analysis, but their analyses will enable different types of response. Different response options will require different types of interactions between the systems and government and international planning. And perhaps most important, different funding models for the biosurveillance systems may be needed. An excellent system that is then defunded is far worse than a modest system that can be maintained.
Solutions Seeking Implementations
In this decade, we expect that at a minimum the challenges described earlier in the sections on suboptimal use and high prices, as well as enabling crisis response, need to be substantively addressed. This will enable a system as integrated and encompassing as we describe to have an adequate foundation.
If buy-in from clinicians, hospitals, government bodies, academics, and other data gatherers is insufficient and if data from metagenomic sequencing are not readily available or easily stored and analyzed, we will never move past the current paradigm of disconnected biosurveillance systems. A WMGM future must also have enough financial, political, and operational support to make the transition.
Historically, supporting a biosurveillance system across many groups has often required that an institution or protocol manages the sharing and storing of data. Entities of comparable scale are the US National Center for Biotechnology Information or European Bioinformatics Institute, which require primary research authors to upload their genomic data. However, before a government-run data-sharing system is commonplace and regulated, a third party may need to be developed.
Conclusion
In the coming years, next-generation sequencing and more widespread use of clinical and environmental sequencing for biosurveillance are likely. Drawbacks of such biosurveillance systems include data privacy and long-term viability of funding, which are unlikely to be fully remedied before deployment and require further attention. The question we address in this article is what policy issues are likely to arise in the coming years.
Among the most important policy issues are market failures in several forms. These include private capture of the market in ways that make widespread use expensive or that fragment the data and analysis ecosystem, as well as potential abuse of data and privacy concerns. Relatedly, it is possible that the system could become economically nonviable during nonpandemic times and lose its funding. Important in a different way is how planning and response activities are able to capitalize on these systems and data.
To address these questions and concerns, a variety of projects seem useful, and are best led by different groups. Until various price levels and market penetration are reached, research or expert forecasting of timelines is useful for planning. Building public or interoperable data systems and standards will be important for industry groups and government or nongovernment agencies. Policy planning to ensure that payment systems or regulations do not lock in current or near-term technologies is also needed. Of course, none of these will supplant the technological advances that are needed, but each will help unlock their potential. The most important tasks, however, must be started now, because if problems are addressed post-hoc instead of preemptively, much of the biosecurity potential of next-generation sequencing will be unnecessarily delayed, or lost.
