Introduction
Data science is popularly understood as the mode of scientific inquiry whose business is the study of extracting value from data. Observing it in action, one witnesses a curious form of emptiness: research and knowledge production in data science take the shape of an unceasing search for novel attachments, a consistent refreshing of sociotechnical associations. Data science, in other words, has no data of its own, but proclaims its relevance to all who do. Consider, for instance, the imaginary of data science’s
The purpose of this paper is to draw attention to the set of practices deployed in the course of making the data sciences fulsome with worldly material (e.g., data) from which value may be extracted—practices that we collectively label prospecting.
The theoretical insights on prospecting presented here are informed by reflections on earlier empirical studies of the policies and practices driving data-intensive scientific research in the US (e.g., Borgman et al., 2009; Edwards et al., 2013; Ribes and Lee, 2010) and, to a greater extent, by more recent inquiries into the institutionalization of data science in the United States during the second decade of the 21st century. In the latter case, our team of four researchers conducted a three-year sociotechnical study of a new national umbrella organization for data science dubbed the Big Data Regional Innovation Hubs program (henceforth BDHubs), an initiative founded and funded by the National Science Foundation (NSF) in 2015.
Our ethnographic work with BDHubs focused on its establishment as a consortium-building effort, and using participant-observation and interviewing methods we set out to grasp the scope of BDHubs’ work. In particular, we focused on the myriad “scalar devices” (Ribes, 2014) used by this distributed organization to understand itself and to build active regional constituencies. We were also keenly attentive to the valuation practices that BDHubs leadership—including Executive Directors and Principal Investigators—deployed in selecting, prioritizing, and executing on engagements in line with their mandate to “foster regional, cross-sector collaborations and multi-sector projects to foster innovation with Big Data” that would function as a critical pathway to “building and sustaining a successful national Big Data innovation ecosystem” (NITRD, 2016: 34). The primary goal was to identify and characterize the activities through which the Hubs worked to constitute themselves as key intermediaries that could foster regional data science innovation that promised to address a range of scientific and societal “grand challenges.”
Methodologically, we relied on a mix of participant-observation, semi-structured interviews with key actors, and thematic analysis (Nowell et al., 2017) of a range of internal and publicly available working and policy documents pertaining to BDHubs’ genesis and ongoing activities. Data collection took the form of ethnographic field notes, transcribed interviews, documents produced on site, and meeting transcripts, which we analyzed using the constant comparison method (Dye et al., 2000). We regularly attended and participated in Hub-specific workshops, meetings, and seminars, including leadership calls, steering committee meetings, topic-specific community calls, and annual all-hands meetings. We also engaged in a number of cross-hub initiatives such as joint calls with NSF program staff and regional Hubs leadership, the monthly All-Hubs Cyberinfrastructure Working Group, international collaborations with the European Big Data Value Association, and National Data Challenges around transportation safety and safe drinking water. Together, these calls and meetings enabled our identification and prioritization of additional fieldwork opportunities and provided a basis for identifying key interlocutors—including Executive Directors and Principal Investigators from each of the four Hubs—with whom we subsequently conducted in-depth semi-structured ethnographic interviews.
In light of the empirical basis of this analysis, however, our objective here is not so much to provide an ethnographic recounting of the BDHubs as such, but rather to deploy insights we gleaned during our study of this initiative as a means of furnishing a broader understanding of data science as an emergent universal(izing) science, with particular emphasis placed on prospecting—an empirical and conceptual notion that we will flesh out over the coming sections—which we argue is an enabling force driving the broader datafication of science and society (Cukier and Mayer-Schoenberger, 2013).
In the following sections, we theoretically elaborate (Vaughan, 1992) three constitutive dynamics of prospecting in order to explicate its role in the structuring and centralizing of data science. First is the notion that data science is intentionally “emptied” of domain affiliation and commitment (Ribes et al., 2019), built upon the presumption that significant content, data, or applications will take place in conjunction with or mediate across specific domains. Second, this “domain-agnostic” positionality of data science serves as an ordering force, progressively reconfiguring an expanding scope of data and resources to be made amenable to data scientific techniques and analytic conventions, all the while simultaneously exposing new sites of disorder (Berg and Timmermans, 2000). Third, this work of ordering positions data science as the consumer of the data, resources, and even epistemologies of the domains with which it engages. In so doing, the practices of prospecting are centralized as a vital mediating activity—rendering the disordered as ordered, the siloed as shared, and facilitating the movement of knowledge and technique between domains such that data resources may be more seamlessly reused (Gregory et al., 2019) in subsequent analyses, perhaps with altogether different questions in mind than what motivated their initial creation or development.
Two concepts are especially relevant to the analysis that follows and are discussed in greater detail below. The first is Ribes et al.’s (2019; see also Ribes, 2019) notion of “domain logics.” According to these authors, the “logic of domains” describes a style of organizing in computationally intensive science wherein a “domain,” or collective of expertise, is engaged or studied by a second party that generally conceives of itself as “domain independent,” that is, as in possession of a set of generalizable tools or agnostic expertise that can intermediate between domains toward computational advancements and/or interventions in the domain itself. The second concept, which we take up in the latter part of the paper, is Michel Serres’ (1982) metaphor of the “parasite” (Brown, 2002), wherein he describes a fundamental relationship between the object of inquiry and the knowledge that might be produced from it. However, before unpacking the utility and centrality of these two ideas for our own conceptualization of prospecting, we turn to our first theoretical elaboration: the emptying out of domain specificity in data science, and its attendant hunger for connection.
The emptiness and hunger of data science
For a data scientist oriented toward the use of already connected data, reconfiguring, generalizing, and otherwise rendering a new data set amenable to use is naturally preceded by an assessment of the scope, character, and availability of that data (Borgman, 2015; Zuiderwijk et al., 2012). Prospectors in data science therefore navigate a territory of institutions, individuals, and technological concerns, mapping available data, discovering new potential domains for engagement, and assessing the balance between initializing work and expected value of the analyzed data set (Gregory et al., 2019). A change in technology or tools might make a given data set more or less available, ready at hand, or amenable to re-use, but the initial assessment is both formative of what the end research will consist of as well as indicative of the perceptions, assumptions, and capacity of the assessing researcher (Shen, 2018).
Case in point: the 2017 National Transportation Data Challenge, a BDHubs-led initiative that aimed to contribute to the international “Vision Zero” strategy of eliminating traffic fatalities on highways. When considering the problem of traffic accidents from a data scientific perspective, the data scientists involved in this endeavor reached out to researchers and practitioners in government, commercial, and academic organizations to discover what data and computational resources were out there, and in what form. The data scientists were then able to evaluate the data according to their own needs (Is there good metadata? Is it consistently structured? How difficult or expensive would it be to gain access?) and engage with the various domains producing that data in order to better understand it and, ultimately, to apply it to research into the causes of accidents and possible solutions for avoiding highway deaths. We observed actors testing various sources of data in initial analyses to gauge their suitability for answering their questions, all of which took place before formal analysis of the data began. It is this process of selecting, testing, and evaluating available data that structures what the results of that analysis would look like, while remaining relatively invisible in the final product.
The notion of prospecting is thus vital to understanding how the field of data science
We use the term prospecting to define this work insofar as it invokes the notion of unexplored territory that may yield some value once it is better understood, mapped, and ultimately targeted for infrastructural development. Data builds upon prior data and exposes new opportunities. Prospecting in this sense is analogous to developments in the field of geology, where salt domes, initially sought primarily because they could readily be mapped with existing seismic methods, led to the identification of many-faulted zones such as the San Joaquin Valley. These were later found to be incredibly valuable as sites for oil extraction (Bowker, 1994). In both data science prospecting and geological mapping, the concerns of the field were shaped both by the availability of data and by its perceived value and importance. This style (in Hacking’s (1990) sense of the term) of seeking out new data moves a given resource toward being “data science ready” even as it excludes other data for a variety of reasons (cf. Crombie, 1995). Much like the drive to discover and mine gold or oil, we perceive a similar goal of discovering, mapping, and rendering available an ever-growing preponderance of data resources as a characteristic of data science.
Indeed, “Big Data” and data science more generally are increasingly defined and modulated by the metaphors deployed to describe and understand them. Puschmann and Burgess (2014) describe how Big Data is being discursively shaped and understood through metaphorical comparisons to a force of nature that needs to be grappled with, managed, and controlled, and once so controlled, as a resource to be consumed for nourishment. For those working directly in data science, the concepts of “Big Data” and “data science” are abstract, distant, and contested, but work nevertheless continues to take place. Prospecting thus serves as a sieve of meaning in a contested space, enabling the practicalities of working with (big) data, allowing for a selection of meaning amidst uncertainty, and aligning the daily work of data science with its expansive sociotechnical understandings.
Unlike seemingly similar metaphors, however—such as “data extractivism” (Sadowski, 2019) and “data colonialism” (Thatcher et al., 2016)—our own notion of prospecting has a slightly different focus. Rather than thinking about the value of data per se, as enacted at the point of analysis, we look to earlier moments in data journeys (Leonelli and Tempini, 2020) in order to highlight the prospecting work that serves as a precondition for value extraction. Moreover, the nature of prospecting work is consequential primarily in its temporal bounding and selectivity, as the entire field of available data sets and processing tools cannot be prospected at once. Both the connections leveraged and built, and the knowledge produced in discovering and working with a “new” data resource, point toward the priorities and practices of data scientists engaged in prospecting. Together, they work toward shaping the contours of their interactions with extant “domains” according to a particular style of organizing data-intensive work, a phenomenon to which we now turn our attention.
Prospecting as praxis in the logic of domains
Undergirding the work of prospecting is what Ribes et al. (2019) refer to as the “logic of domains.”
Prospecting presupposes a quantity of data that is currently intractable because of its size, complexity, poor documentation, and/or siloing, yet bears significant potential for generating social, economic, and epistemic value once better managed and understood (Hey et al., 2009). As the philosopher of science Ian Hacking (1982: 280) wrote in his work on the history of statistics, “[e]numeration demands kinds of things or people to count,” and counting in turn “is hungry for categories.” So too can it be said that data science demands kinds of things to analyze, and analysis is hungry for categories of domains (e.g., biology, geology, chemistry, etc.) and for domains’ own categories, which can be worked upon. Thus, in the ongoing discovery of new analytic tools, techniques, and applications, more unanalyzed data, domains, and opportunities for development are revealed. In this way, we consider data science in broad terms to be
But rendering those resources amenable to extraction requires that they first be made visible (e.g., Brighenti, 2007), necessitating the work of reaching into domains to discover resources; to structure the criteria by which data-intensive analysis might be leveraged to refine data scientific praxis; to produce new knowledge within as well as across domains; and ultimately to produce order out of disorder. These actions are propelled by the logic of domains’ core disposition that data scientific knowledges and practices developed in one domain or setting are capable of being made broadly applicable across many different instances of data-intensive work (Ribes et al., 2019). Universality has occupied a central place in the historical evolution of the logic of domains, having “been defined and architecturally materialized as an absence of specificity” (Ribes et al., 2019: 290). And so where this “emptiness” is a central characteristic of the
Data-intensive research in fact implicitly and actively acknowledges and operates according to a presumption that the analytic tools developed for managing and reasoning about very large data sets are to some extent agnostic to their initial context of production and are applicable to a heterogeneous set of other domains. For instance, machine learning algorithms produced for facial recognition might be applicable to the identification of tumors (Kourou et al., 2015), predictive tools developed for epidemiology might be effective in studying consumer behavior (Goel et al., 2010), and genomic tools developed in a human context are similarly useful for agricultural research (McCarthy et al., 2006; Upadhyaya et al., 2011). It is for this reason that data science can be cast as a
From these two propositions—that data science is a general field, but one that through its application is potentially relevant to any domain—emerges its
The (dis)ordering powers of prospecting
Data science seeks out ever more forms of data in order to universalize. As Hey et al. (2009) argue in
The very mention of “systematic” and “scalable” approaches to winnowing, curating, publishing, or processing data points to the fact that the work of universalizing in data science—that is, of opening up domain resources, and of making them available to engagement by data scientific approaches—is in fact a normative process of ordering (in our case, of the disarray of data resources) according to certain (data scientific) conventions. In formatting these resources accordingly, there emerges a recursivity wherein the discovery of new resources reveals them to be in a state of disorder.
Sociologists of science Marc Berg and Stefan Timmermans (2000) attend to this phenomenon in their analysis of standardization efforts in the medical domain, where they discuss universality as both order emerging from disorder and the progressive recognition of new spaces of disorder resulting therefrom—what they call “Orders and their Others”: “The production of universality follows a clear temporal pattern: disorder preexists and precedes the emergence of order. The phoenix of universality rises from the ashes of local chaos” (33). Order, in this sense, is a form of stability, of attachment to a localized form of the universal. “Achieving universality,” the authors go on to say, entails “the erasure of local varieties, the gradual grouping and transforming of what used to be dissimilar under the same category” (Berg and Timmermans, 2000), a progressive process of “investing in forms” (Thévenot, 1984).
Data science applications reach toward a generalizable science of data analysis, where “questions are informed by basic science, but they raise additional issues that can be addressed only by a new science discipline focused specifically on its applications—a discipline that integrates physical, biogeochemical, engineering, and human processes” (Hey et al., 2009: 14). Some, in grappling with Big Data, see a pressing need for such universality:

There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science … An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed … there is still not a unified evaluation standard and benchmark to balance the computing efficiency of big data with rigorous mathematical methods. (Chen et al., 2014: 202)
Data itself is also problematized:

As the prior step to data analysis, data must be well-constructed. However, considering [the] variety of data sets in Big Data problems, it is still a big challenge for us to purpose efficient representation, access, and analysis of unstructured or semi-structured data in the further researches. (Chen and Zhang, 2014: 5)
However, moves toward this universality, in the mode of Berg and Timmermans, similarly reproduce and expose its necessary Other: “The duality, futurity, and disparity of Big Data, along with its various conceptualizations among practitioners, make it unlikely for a consensus view to emerge” (Ekbia et al., 2015). While universality might be a teleological goal of universalizing efforts, it is likely to remain remote, with even basic principles, definitions, and concepts showing local variation that is itself resistant to a universal approach. The universality of data science remains ever remote, with competing universals producing disordered noise even when other forms of locality are erased:

Approaches to best practices in data management across disciplines and organizations are complex and often in contrast to one another, and may require years to change; therefore, the incongruence of these approaches continues to be an impediment to the complex science of today. (Tenopir et al., 2015: 18)
Even among those working to further data science, we see a perceived disorder in local variation and condition. Whether framed as a significant barrier to the progress of science or as a problem simply needing greater agreement and systematization, there is a consistent imagining of a field of work fundamentally in the business of rendering data more available, more consistent, and more amenable to broad analysis. The work of ordering data for data science must begin with an understanding of that space before efforts intended to erase local variation are possible, followed by the hoped-for reintroduction of that locality in order to apply the results of the generalized analysis. While this is a fairly sparse picture of a more complex space, these steps characterize our view of prospecting as a vital, underrepresented activity that acts as a necessary precondition for a universal(izing) field of data science that may operate in some way across academic and sectoral domains.
Moving forward, we discuss the BDHubs organization as an example of some dynamics of “prospecting in action” (cf. Latour, 1987), paying special attention to its enactment of ordering in the form of negotiated avenues of coordination and communication from which further data scientific work might be made possible. In doing so, prospecting is revealed as a transitional process: a necessary but partial step toward the effective use of data, one that produces the knowledge about that data necessary for engagement with the domains but does not enact an engagement in and of itself.
Prospecting in action: The Big Data hubs and spokes
We undertake here to characterize the ordering activities of BDHubs as avenues for coordination and communication that are at once non-neutral, in that they are directed toward a particular image of what institutionalized data science might look like, and specifically contentless, in their lack of prior commitment to any one domain. So while we might argue that these institutionalizing efforts prioritize a form of neutrality and agnosticism with respect to particular domains or technological infrastructures, it is important to keep in mind that this is a mode of directed ordering of a perceived disordered space, with all the attendant conflicts, tensions, and frictions that can be expected of pursuing one particular vision of “order” in that space.
As we discussed in the “Introduction” section, the BDHubs initiative was founded and funded by the NSF in 2015 to serve as a national umbrella organization for data science writ large. By way of historical context, its roots can be traced to the 2012 US Big Data Initiative, an executive directive spearheaded by the White House Office of Science and Technology Policy (OSTP) under which six major funding agencies dedicated a combined $200m in new commitments toward data scientific goals. A press release announcing the investment identified a national data science strategy that “[aimed] to make the most of the fast-growing volume of digital data … [and] greatly improve the tools and techniques needed to access, organize, and glean discoveries from huge volumes of digital data” (OSTP, 2012). Notable in this press release, and echoing Hey et al.’s (2009) aforementioned viewpoints, is the problematic of a growing preponderance of data that could readily respond to “tools and techniques to access, organize, and glean discoveries” (OSTP, 2012), and where the volume of data itself is framed as both a problem and an opportunity—a resource at once underleveraged and untamed.
“Big Data increasingly includes information provided by increasingly diverse sources, of varying reliability. Uncertainty, errors, and missing values are endemic, and must be managed” (Jagadish et al., 2014: 91). Data itself is a problematic source of disorder, with data scientists serving as a vital step in rendering the data amenable to analysis and in moving it from a state of localized variation toward a more singular universality of data in both representation and quality. “In today’s complex world, it often takes multiple experts from different domains to really understand what is going on. A Big Data analysis system must support input from multiple human experts, and shared exploration of results” (Jagadish et al., 2014: 93). Domain logics persist in the problematic of Big Data, with the notion that those generalizable skills possessed by the scientist of data are most effective in conjunction with a close attachment to a domain, and that domain experts similarly benefit from the application of those skills to their area of expertise. In the BDHubs, particularly in its early phases, we saw a consistent emergence of the work of alignment and attachment of the skills, tools, and resources generalizable to data analysis with the domains.
Following from the instruction and agreements established to support Big Data at the national scale, and with a particular emphasis on bridging across the academy, industry, and the public sector, a cornerstone of the Big Data Initiative was a series of workshops and design charrettes that brought together representatives from these different sectors who sought to identify and characterize the forthcoming challenges for Big Data. It was here that NSF leadership and program staff sought to assess, understand, and in many ways render the widely claimed, broadly applied field of data science in some way tractable to development under a cohesive funding effort. A primary objective of these events was thus to understand the state of the field, with science policy advisors noting that “[by] improving our ability to extract knowledge and insights from large and complex collections of digital data, the initiative promises to help solve some of the Nation’s most pressing challenges” (OSTP, 2012). Data science was presented as broadly useful, generally applicable, and fundamentally concerned with untapped value in disparate data, waiting to be extracted.
In reports and presentations drawn from these workshops, we see initial prospecting efforts working to establish an agenda and mode of work that would persist throughout the process of developing, funding, and rolling out the BDHubs consortium. Organized around the loosely defined notion of “partnership,” the workshops primarily recruited attendees through existing social networks and familiarity with research work undertaken by the organizers. “Why are you here?,” one slide asked: “You have made some connection about Big Data with OSTP, the Big Data Senior Steering Group and/or one of the agencies involved” (Iacono, 2013). Workshop attendees were then charged with a set of four tasks: “Fact finding: Collect data and information … Idea finding: Listen for new ideas, models, partnerships, etc … Partner finding: Search for your Big Data ‘soulmates’ … Solution finding: Discern promising ideas that can be applied and that would make a difference” (Iacono, 2013). Note that each of these “charges” for action during the workshop describes activities that loosely fit under the umbrella of prospecting: discovery of social connection; provision of access to data; mapping and understanding existing social, organizational, and institutional ties; and discovering those basic questions that might be readily answered.
Another notable finding from these workshops was a shared understanding among many participants that the location of future value for data science is “elsewhere,” but nevertheless available to be leveraged—only in a different, and perhaps more systematized, way than it currently was. For example, at a 2012 workshop that gathered members of the academic community, government, industry, and nonprofit organizations to discuss implementing the aforementioned national strategic Big Data plan, David Logsdon from the US technology trade association TechAmerica exclaimed:

Big data has the potential to transform government and society itself. Hidden in an immense volume, variety and velocity of data that is produced today is new info, facts, relationships, indicators and pointers that either could not be practically discovered in the past or simply did not exist before. This new information, effectively captured, managed and analyzed has the power to enhance profoundly the effectiveness of government. Although the impact of big data will be transformational, the path to effectively harnessing it does not require government agencies to start from scratch. Rather, government can build on the capabilities and technologies it already has in place. Success in capturing the transformation lies in leveraging the skills and experiences of our business and mission leaders rather than creating a universal big data architecture. (Transcribed from a presentation delivered for TechAmerica)
More to the point, though, it was out of these design charrettes and workshops—themselves a prospecting activity—that the goal of developing what at the time were referred to as “big data coalitions” first surfaced. In tandem with official Requests for Information and informal solicitations from potential partners, a blueprint for what would eventually become the BDHubs model began to take shape. The NSF funded four Big Data Hubs in late 2015, with one Hub designated for each of the Northeast, South, Midwest, and West regions of the country, with regions determined by US census population. The justification for regional Hubs, versus a single national entity, was to foster more face-to-face interactions within each region as a means of stimulating innovative data science research and development projects and partnerships. While some of the eventual goals of the workshop (in particular, solution finding) might be more readily defined as core data scientific work rather than the prospecting that creates its initial landscape, this nevertheless points to a teleology of prospecting as oriented toward finding that data from which value can be extracted. Prospecting is a means to an end rather than an end in itself—the mechanism by which data sharing and secondary use might be initiated—and is intimately concerned with the untapped value of data outside of its initial context of creation. In turn, it is also “endlessly hungry” in that its very activities of finding and arranging resources reveal further horizons of additional potential targets for prospecting.
Our own ethnographic work with BDHubs began shortly after these initial activities and was focused on the establishment of this novel organizational entity as a new instantiation of “cyberinfrastructure” at the NSF (cf. Ribes and Lee, 2010). Over the course of analyzing the initial imaginative work, planning, and realization of the BDHubs consortium, we came to identify prospecting as a central element of the Hubs’ work: first oriented toward assessing, understanding, and working with the field of data science—as was the case prior to BDHubs being funded—and later as the initiating activity for a variety of data science research projects, applications, and pedagogical endeavors carried out by BDHubs’ leadership and their research constituencies. This was especially evident in their collective attempts to position themselves as intermediaries—sitting between the core data science disciplines and the domain sciences—and as facilitators of those activities we label prospecting: outreach to domains, capturing domain epistemology, and rendering domain data available for secondary analysis and re-use. The process of making these resources available is thus revealed as a mode of ordering data science through various kinds of formalizing procedures.
The work of the BDHubs, in their consortia-building efforts, is the work of producing a specific version of order (i.e., an avenue for coordination and communication), localized to knowledge produced by the NSF through a series of workshops and design charrettes that uncovered disorder within their work. It is through the work of prospecting, of producing knowledge about the site of action with an eye toward in some way forming a lasting attachment to that site, that both a national consortium and a generalized discipline of data science might operate, and might identify further sources of disorder. We return to Berg and Timmermans (2000), here, who remind us that order and its Other are in fact “two sides of the coin in a double sense: not only does the one come into being only with the other – it also cannot survive without it” (52).
In the case of BDHubs, seeking to know, accession, and order the diversity of data (and other domain resources) reveals deeper problems and challenges for the work of data science: the very act of trying to frame particular activities and resources exposes new overflows that beg further framing. This work of discovering, extracting, and leveraging the value of data is what makes prospecting fundamental to the practice of data science: prospecting names the work of locating data resources ripe for value extraction. These activities bear consequence for how the BDHubs consortium came to be, and they reveal the heterogeneous interdisciplinary knowledges, work practices, and sociotechnical analyses we include under the label prospecting. Such work is a fundamental instrument of data science, and tracing its effects is essential to understanding questions of interest, the epistemology and application of data scientific work, and the broader structures of science policy as funders seek to engage with novel areas of scientific inquiry. 4
As the consortia-building efforts of the BDHubs emerge from an apparent or perceived lack of coordination among the multiplicity of instances of what might be data science, so too is their existence predicated on the notion that there is still further to reach, more participants with whom to engage, more resources to make available. The Other of the BDHubs—like the Other of data that may yet be analyzed, rendered universal, made available—both gives rise to its current form and provides the necessary means for its existence. Prospecting behaviors reach out to the Other, expose new forms of disorder, and position the coordinative entity accordingly. However, this positioning work, both in the BDHubs and across data science as an emergent field, bears consequence in the domains: in how they structure their data, and in how that data is assessed, curated, and made available.
As of this writing the BDHubs are still in process, having recently been renewed for two years of further funding. While the core consortium-building efforts remain the same, the role of the “Spoke” projects, each of which is a traditional research project leveraging data science in heterogeneous domains, and the final sustainable organizational plan remain contested and uncertain. The Hubs themselves are in a phase of prospecting their own future. As the Hubs form stable partnerships and build upon successful activities, they become more ordered internally while exposing new avenues for future work: new forms of disorder. They in essence prospect the broad and contested space of data science to further become themselves. In the next section, we deploy the notion of a fundamental relationship between a knowledge creator and their object of study, which Serres (1982) refers to as “the parasite,” to explore how prospecting is not only generative of new engagements between data science and the domains, but also highly consequential for the domains themselves.
Positioning data science: Moving downstream
Having discussed the “emptiness” of data science relative to the domains with which it engages, and the nature of data science as simultaneously producing order among heterogeneous resources and data while creating and exposing further areas of disorder in the process, we now move to our third theoretical elaboration: how data science positions itself as both consumer of data and, in effect, arbiter of interoperability. Here, we turn to Michel Serres’ (1982) notion of the parasite: Knowledge parasites the world, parasites objects, systems, black boxes, and laboratories. It is a general undertaking of pumping out and capturing of information. If, one day, the parasite invented the exchange of material for logical at his host’s table, and vice versa, he also invented science and theory the same day. What would all knowledge be without this asymmetrical, crossed exchange? This irreversible capture. (210)
This relationship is key to our understanding of data science and prospecting not in that it displaces prior relationships, but rather in the nature of data science as consistently working on data outside of the initial context of its creation—data science bears the same relationship to secondary use of data as other sciences bear to the data they initially create. Data science takes as its object data produced by other sciences, and thus is both a source of disorder—for example, in siloed data being consistently problematized for its inability to move outside its initial domain—and a producer of order in its position as “downstream” of other sciences. Brown (2002), commenting on Serres, describes the nature of the parasitic relationship in a way especially evocative of the nature of an emergent science of data: The parasite does not seek to establish property rights, they merely exploit all such efforts at enclosure and create a vector where everything flows towards them. In the chain or cascade of parasites that opens up in every white space, the position of power is always found in she or he who comes last. From this position, one may parasitise all the others. (93)
Operating as a generalizable tool across many domains, data-intensive analytic techniques seek to take on this downstream role, sometimes placing themselves as the “last word” in analytics, but more likely just expanding the capacity of scientists to act (a more reasonable definition of power). The fundamental relationship Serres characterizes as the parasite plays a dual role in the system. It makes communication possible across domains by acting as the means of intermediation. But it also necessarily disrupts the message. This disruption is transformative—data is re-described, reformatted, and processed according to downstream analytics, and this transformed data becomes more useful in its availability to the generalized methods of data science. The more data is collected, accessed, organized, and rendered unto order, the greater the capacity of those scientists of data to approach heterogeneous domains, to apply their analysis to social, scientific, and commercial “problems”—in short, more data in ordered, accessible form confers greater power to act in the world, but, vitally, that data is ordered and accessed according to the practices of the data science that so ordered it. As such, as the downstream consumer of data, data science arbitrates and mediates the notion of reusable data, and data becomes characterized as disordered (siloed, ill-described, underlinked) according to the general set of tools employed by data science, rather than endogenously according to the conventions of the domain from which it originated.
Alongside and in sync with data scientists themselves, the work of the BDHubs consortium is in understanding, mapping, and assessing the resources potentially available to the institution of data science. The level of coordination of these activities is somewhat unimportant, though—informed by the notion and
That approach, despite its heterogeneity of specific instantiation, is itself operating on domain logics, on the notion that the tools of large-scale data analysis, informed by domain application, are generically useful and productive, i.e. there exists a general set of skills oriented around data itself that might operate across a broad variety of domains. Thus, prospecting is vital to the
Data science currently sits as a novel downstream point of data production initiated with instrumented observation of the world, proclaiming the capacity to reason on data
Conclusion
Data science as a discrete field of study is characterized by a move toward a generalizable, relatively universal set of tools, skills, and knowledge that can be applied to analyze data drawn from a wide range of potential domains. In the work of the BDHubs, we have characterized that generality of technique and specificity of application as a form of “emptiness” that spurs a drive toward attachment to a variety of domains, sectors, and institutions. Rather than pejorative, emptiness in this context is a
Emptiness here is not a lack of theory, or epistemology (i.e. statistical, computational), but rather a structured, intentional means that aims to generate universality and generalizability in technique and technology. In creating a domain agnostic BDHubs initiative, the NSF participated in an intentional emptying of domain affiliation, leaving the individual regional Hubs much space for flexibility and adjustment as they went about their work of facilitating data scientists in their work of prospecting domains.
Much as work toward standardization of data and interoperability creates an increasingly universal and generalizable world of data available to data science, so too does a domain-agnostic approach to the institutionalization of the data sciences format increasingly broad swathes of domains as available to data scientific engagement. While the prospecting work that we have observed as central to this activity is only a first step in pre-formatting domains and collaborations, the work of establishing stable, long-term relationships between a field of study characterized by its reach toward an approach generalizable across scientific domains and the domains it engages is an ongoing process with no apparent end. Order produces and is produced from its Other.
The emptiness of the BDHubs is structurally established, paralleling the character of data science itself as it emerges as a cohesive discipline with increasing support for its institutionalization. There will be more data to prospect, more disorder to be found and managed, another step further down the parasitic chain to move, and this space will grow as more prospecting work is done. Any organization that takes the institutionalization of data science as its object of work will likely bear substantial similarities in approach, agnosticism, and a focus on prospecting work. Thus, prospecting becomes a fundamental initial step in forming new collaborations, rendering data amenable to analysis outside of the context of its initial creation, and in propagating the rapidly developing set of tools, resources, and approaches that are closely coupled to a science that takes data itself as an object of study.
