Introduction
Data science is popularly understood as the mode of scientific inquiry whose business is the study of extracting value from data. Observing it in action, one witnesses a curious form of emptiness: research and knowledge production in data science take the shape of an unceasing search for novel attachments, a consistent refreshing of sociotechnical associations. Data science, in other words, has no data of its own, but proclaims its relevance to all who do. Consider, for instance, the imaginary of data science’s
The purpose of this paper is to draw attention to the set of practices deployed in the course of making the data sciences fulsome with worldly material (e.g., data) from which value may be extracted—practices that we collectively label prospecting.
The theoretical insights on prospecting presented here are informed by reflections on earlier empirical studies of the policies and practices driving data-intensive scientific research in the US (e.g., Borgman et al., 2009; Edwards et al., 2013; Ribes and Lee, 2010) and, to a greater extent, by more recent inquiries into the institutionalization of data science in the United States during the second decade of the 21st century. In the latter case, our team of four researchers conducted a three-year sociotechnical study of a new national umbrella organization for data science dubbed the Big Data Regional Innovation Hubs program (henceforth BDHubs), an initiative founded and funded by the National Science Foundation (NSF) in 2015.
Our ethnographic work with BDHubs focused on its establishment as a consortium-building effort, and using participant-observation and interviewing methods we set out to grasp the scope of BDHubs’ work. In particular, we focused on the myriad “scalar devices” (Ribes, 2014) used by this distributed organization to understand itself and to build active regional constituencies. We were also keenly attentive to the valuation practices that BDHubs leadership—including Executive Directors and Principal Investigators—deployed in selecting, prioritizing, and executing on engagements in line with their mandate to “foster regional, cross-sector collaborations and multi-sector projects to foster innovation with Big Data” that would function as a critical pathway to “building and sustaining a successful national Big Data innovation ecosystem” (NITRD, 2016: 34). The primary goal was to identify and characterize the activities through which the Hubs worked to constitute themselves as key intermediaries that could foster regional data science innovation that promised to address a range of scientific and societal “grand challenges.”
Methodologically, we relied on a mix of participant-observation, semi-structured interviews with key actors, and thematic analysis (Nowell et al., 2017) of a range of internal and publicly available working and policy documents pertaining to BDHubs’ genesis and ongoing activities. Data collection took the form of ethnographic field notes, transcribed interviews, documents produced on site, and meeting transcripts, which we analyzed using the constant comparison method (Dye et al., 2000). We regularly attended and participated in Hub-specific workshops, meetings, and seminars, including leadership calls, steering committee meetings, topic-specific community calls, and annual all-hands meetings. We also engaged in a number of cross-hub initiatives such as joint calls with NSF program staff and regional Hubs leadership, the monthly All-Hubs Cyberinfrastructure Working Group, international collaborations with the European Big Data Value Association, and National Data Challenges around transportation safety and safe drinking water. Together, these calls and meetings enabled our identification and prioritization of additional fieldwork opportunities and provided a basis for identifying key interlocutors—including Executive Directors and Principal Investigators from each of the four Hubs—with whom we subsequently conducted in-depth semi-structured ethnographic interviews.
In light of the empirical basis of this analysis, however, our objective here is not so much to provide an ethnographic recounting of the BDHubs as such, but rather to deploy insights we gleaned during our study of this initiative as a means of furnishing a broader understanding of data science as an emergent universal(izing) science, with particular emphasis placed on prospecting—an empirical and conceptual notion that we will flesh out over the coming sections—which we argue is an enabling force driving the broader datafication of science and society (Cukier and Mayer-Schoenberger, 2013).
In the following sections, we theoretically elaborate (Vaughan, 1992) three constitutive dynamics of prospecting in order to explicate its role in the structuring and centralizing of data science. First is the notion that data science is intentionally “emptied” of domain affiliation and commitment (Ribes et al., 2019), built upon the presumption that significant content, data, or applications will take place in conjunction with or mediate across specific domains. Second, this “domain-agnostic” positionality of data science serves as an ordering force, progressively reconfiguring an expanding scope of data and resources to be made amenable to data scientific techniques and analytic conventions, all the while simultaneously exposing new sites of disorder (Berg and Timmermans, 2000). Third, this work of ordering positions data science as the consumer of the data, resources, and even epistemologies of the domains with which it engages. In so doing, the practices of prospecting are centralized as a vital mediating activity—rendering the disordered as ordered, the siloed as shared, and facilitating the movement of knowledge and technique between domains such that data resources may be more seamlessly reused (Gregory et al., 2019) in subsequent analyses, perhaps with altogether different questions in mind than what motivated their initial creation or development.
Two concepts are especially relevant to the analysis that follows and are discussed in greater detail below. The first is Ribes et al.’s (2019; see also Ribes, 2019) notion of “domain logics.” According to these authors, the “logic of domains” describes a style of organizing in computationally intensive science wherein a “domain,” or collective of expertise, is engaged or studied by a second party that generally conceives of itself as “domain independent,” that is, as in possession of a set of generalizable tools or agnostic expertise that can intermediate between domains toward computational advancements and/or interventions in the domain itself. The second concept, which we take up in the latter part of the paper, is Michel Serres’ (1982) metaphor of the “parasite” (Brown, 2002), wherein he describes a fundamental relationship between the object of inquiry and the knowledge that might be produced from it. However, before unpacking the utility and centrality of these two ideas for our own conceptualization of prospecting, we turn to our first theoretical elaboration: the emptying out of domain specificity in data science, and its attendant hunger for connection.
The emptiness and hunger of data science
For a data scientist oriented toward the use of already connected data, reconfiguring, generalizing, and otherwise rendering a new data set amenable to use is naturally preceded by an assessment of the scope, character, and availability of that data (Borgman, 2015; Zuiderwijk et al., 2012). Prospectors in data science therefore navigate a territory of institutions, individuals, and technological concerns, mapping available data, discovering new potential domains for engagement, and assessing the balance between initializing work and expected value of the analyzed data set (Gregory et al., 2019). A change in technology or tools might make a given data set more or less available, ready at hand, or amenable to re-use, but the initial assessment is both formative of what the end research will consist of as well as indicative of the perceptions, assumptions, and capacity of the assessing researcher (Shen, 2018).
Case in point: the 2017 National Transportation Data Challenge, a BDHubs-led initiative that aimed to contribute to the international “Vision Zero” strategy of eliminating traffic fatalities on highways. When considering the problem of traffic accidents from a data scientific perspective, the data scientists involved in this endeavor reached out to researchers and practitioners in government, commercial, and academic organizations to discover what data and computational resources were out there, and in what form. The data scientists were then able to evaluate the data according to their own needs (Is there good metadata? Is it consistently structured? How difficult or expensive would it be to gain access?) and engage with the various domains producing that data in order to better understand it and, ultimately, to apply it to research into the causes of accidents and possible solutions for avoiding highway deaths. We observed actors testing various sources of data in initial analyses to gauge their suitability for answering their questions, all of which took place before formal analysis of the data began. It is this process of selecting, testing, and evaluating available data that structures what the results of that analysis would look like, while remaining relatively invisible in the final product.
The notion of prospecting is thus vital to understanding how the field of data science
We use the term prospecting to define this work insofar as it invokes the notion of unexplored territory that may yield some value once it is better understood, mapped, and ultimately targeted for infrastructural development. Data builds upon prior data and exposes new opportunities. Prospecting in this sense is analogous to developments in the field of geology, where salt domes, initially sought primarily because they could readily be mapped with existing seismic methods, led to the identification of many-faulted zones such as the San Joaquin Valley. These were later found to be incredibly valuable as sites for oil extraction (Bowker, 1994). In both data science prospecting and geological mapping, the concerns of the field were shaped both by the availability of data and by its perceived value and importance. This style (in Hacking’s (1990) sense of the term) of seeking out new data moves a given resource toward being “data science ready” even as it excludes other data for a variety of reasons (cf. Crombie, 1995). Much like the drive to discover and mine gold or oil, we perceive a similar goal of discovering, mapping, and rendering available an ever-growing preponderance of data resources as a characteristic of data science.
Indeed, “Big Data” and data science more generally are increasingly defined and modulated by the metaphors deployed to describe and understand them. Puschmann and Burgess (2014) describe how Big Data is being discursively shaped and understood through metaphorical comparisons to a force of nature that needs to be grappled with, managed, and controlled, and once so controlled, as a resource to be consumed for nourishment. For those working directly in data science, the concepts of “Big Data” and “data science” are abstract, distant, and contested, but work nevertheless continues to take place. Prospecting thus serves as a sieve of meaning in a contested space, enabling the practicalities of working with (big) data, allowing for a selection of meaning amidst uncertainty, and aligning the daily work of data science with its expansive sociotechnical understandings.
Unlike seemingly similar metaphors, however—such as “data extractivism” (Sadowski, 2019) and “data colonialism” (Thatcher et al., 2016)—our own notion of prospecting has a slightly different focus. Rather than thinking about the value of data per se, as enacted at the point of analysis, we look to earlier moments in data journeys (Leonelli and Tempini, 2020) in order to highlight the prospecting work that serves as a precondition for value extraction. Moreover, the nature of prospecting work is consequential primarily in its temporal bounding and selectivity, as the entire field of available data sets and processing tools cannot be prospected at once. Both the connections leveraged and built, and the knowledge produced in discovering and working with a “new” data resource, point toward the priorities and practices of data scientists engaged in prospecting. Together, they work toward shaping the contours of their interactions with extant “domains” according to a particular style of organizing data-intensive work, a phenomenon to which we now turn our attention.
Prospecting as praxis in the logic of domains
Undergirding the work of prospecting is what Ribes et al. (2019) refer to as the “logic of domains.”
Prospecting presupposes a quantity of data that is currently intractable because of its size, complexity, poor documentation, and/or siloing, yet bears significant potential for generating social, economic, and epistemic value once better managed and understood (Hey et al., 2009). As the philosopher of science Ian Hacking (1982: 280) wrote in his work on the history of statistics, “[e]numeration demands kinds of things or people to count,” and counting in turn “is hungry for categories.” So too can it be said that data science demands kinds of things to analyze, and analysis is hungry for categories of domains (e.g., biology, geology, chemistry, etc.) and for domains’ own categories, which can be worked upon. Thus, in the ongoing discovery of new analytic tools, techniques, and applications, more unanalyzed data, domains, and opportunities for development are revealed. In this way, we consider data science in broad terms to be
But rendering those resources amenable to extraction requires that they first be made visible (e.g., Brighenti, 2007), necessitating the work of reaching into domains to discover resources; to structure the criteria by which data-intensive analysis might be leveraged to refine data scientific praxis; to produce new knowledge within as well as across domains; and ultimately to produce order out of disorder. These actions are propelled by the logic of domains’ core disposition that data scientific knowledges and practices developed in one domain or setting are capable of being made broadly applicable across many different instances of data-intensive work (Ribes et al., 2019). Universality has occupied a central place in the historical evolution of the logic of domains, having “been defined and architecturally materialized as an absence of specificity” (Ribes et al., 2019: 290). And so where this “emptiness” is a central characteristic of the
Data-intensive research in fact implicitly and actively acknowledges and operates according to a presumption that the analytic tools developed for managing and reasoning about very large data sets are to some extent agnostic to their initial context of production and are applicable to a heterogeneous set of other domains. For instance, machine learning algorithms produced for facial recognition might be applicable to the identification of tumors (Kourou et al., 2015), predictive tools developed for epidemiology might be effective in studying consumer behavior (Goel et al., 2010), and genomic tools developed in a human context are similarly useful for agricultural research (McCarthy et al., 2006; Upadhyaya et al., 2011). It is for this reason that data science can be cast as a
From these two propositions—that data science is a general field, but one that through its application is potentially relevant to any domain—emerges its
The (dis)ordering powers of prospecting
Data science seeks out ever more forms of data in order to universalize. As Hey et al. (2009) argue in
The very mention of “systematic” and “scalable” approaches to winnowing, curating, publishing, or processing data points to the fact that the work of universalizing in data science—that is, of opening up domain resources, and of making them available to engagement by data scientific approaches—is in fact a normative process of ordering (in our case, of the disarray of data resources) according to certain (data scientific) conventions. In formatting these resources accordingly, there emerges a recursivity wherein the discovery of new resources reveals them to be in a state of disorder.
Sociologists of science Marc Berg and Stefan Timmermans (2000) attend to this phenomenon in their analysis of standardization efforts in the medical domain, where they discuss universality as both order emerging from disorder and the progressive recognition of new spaces of disorder resulting therefrom—what they call “Orders and their Others”: “The production of universality follows a clear temporal pattern: disorder preexists and precedes the emergence of order. The phoenix of universality rises from the ashes of local chaos” (33). Order, in this sense, is a form of stability, of attachment to a localized form of the universal. “Achieving universality,” the authors go on to say, entails “the erasure of local varieties, the gradual grouping and transforming of what used to be dissimilar under the same category” (Berg and Timmermans, 2000), a progressive process of “investing in forms” (Thévenot, 1984).
Data science applications reach toward a generalizable science of data analysis, where “questions are informed by basic science, but they raise additional issues that can be addressed only by a new science discipline focused specifically on its applications—a discipline that integrates physical, biogeochemical, engineering, and human processes” (Hey et al., 2009: 14). Some, in grappling with Big Data, see a pressing need for such universality:

There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science … An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed … there is still not a unified evaluation standard and benchmark to balance the computing efficiency of big data with rigorous mathematical methods. (Chen et al., 2014: 202)
Data itself is also problematized:

As the prior step to data analysis, data must be well-constructed. However, considering [the] variety of data sets in Big Data problems, it is still a big challenge for us to purpose efficient representation, access, and analysis of unstructured or semi-structured data in the further researches. (Chen and Zhang, 2014: 5)
However, moves toward this universality, in the mode of Berg and Timmermans, similarly reproduce and expose its necessary Other: “The duality, futurity, and disparity of Big Data, along with its various conceptualizations among practitioners, make it unlikely for a consensus view to emerge” (Ekbia et al., 2015). While universality might be a teleological goal of universalizing efforts, it is likely to remain remote, with even basic principles, definitions, and concepts showing local variation that is itself resistant to a universal approach. The universality of data science remains ever remote, with competing universals producing disordered noise even when other forms of locality are erased:

Approaches to best practices in data management across disciplines and organizations are complex and often in contrast to one another, and may require years to change; therefore, the incongruence of these approaches continues to be an impediment to the complex science of today. (Tenopir et al., 2015: 18)
Even among those working to further data science, we see a perceived disorder in local variation and condition. Whether framed as a significant barrier to the progress of science or as a problem simply needing greater agreement and systematization, there is a consistent imagining of a field of work fundamentally in the business of rendering data more available, more consistent, and more amenable to broad analysis. The work of ordering data for data science must begin with an understanding of that space before efforts intended to erase local variation are possible, followed by the hoped-for reintroduction of that locality in order to apply the results of the generalized analysis. While this is a fairly sparse picture of a more complex space, these steps characterize our view of prospecting as a vital, underrepresented activity that acts as a necessary precondition for a universal(izing) field of data science that may operate in some way across academic and sectoral domains.
Moving forward, we discuss the BDHubs organization as an example of some dynamics of “prospecting in action” (cf. Latour, 1987), paying special attention to its enactment of ordering in the form of negotiated avenues of coordination and communication from which further data scientific work might be made possible. In doing so, prospecting is revealed as a transitional process: a necessary but partial step toward the effective use of data, one that produces the knowledge about that data necessary for engagement with the domains but does not enact an engagement in and of itself.
Prospecting in action: The Big Data hubs and spokes
We undertake here to characterize the ordering activities of BDHubs as avenues for coordination and communication that are at once non-neutral, in that they are directed toward a particular image of what institutionalized data science might look like, and specifically contentless, in their lack of prior commitment to any one domain. So while we might argue that these institutionalizing efforts prioritize a form of neutrality and agnosticism with respect to particular domains or technological infrastructures, it is important to keep in mind that this is a mode of directed ordering of a perceived disordered space, with all the attendant conflicts, tensions, and frictions that can be expected of pursuing one particular vision of “order” in that space.
As we discussed in the “Introduction” section, the BDHubs initiative was founded and funded by the NSF in 2015 to serve as a national umbrella organization for data science writ large. By way of historical context, its roots can be traced to the 2012 US Big Data Initiative, an executive directive spearheaded by the White House Office of Science and Technology Policy (OSTP) under which six major funding agencies dedicated a combined $200m in new commitments toward data scientific goals. A press release announcing the investment identified a national data science strategy that “[aimed] to make the most of the fast-growing volume of digital data … [and] greatly improve the tools and techniques needed to access, organize, and glean discoveries from huge volumes of digital data” (OSTP, 2012). Notable in this press release, and echoing Hey et al.’s (2009) aforementioned viewpoints, is the problematic of a growing preponderance of data that could readily respond to “tools and techniques to access, organize, and glean discoveries” (OSTP, 2012), and where the volume of data itself is framed as both a problem and an opportunity—a resource at once underleveraged and untamed.
“Big Data increasingly includes information provided by increasingly diverse sources, of varying reliability. Uncertainty, errors, and missing values are endemic, and must be managed” (Jagadish et al., 2014: 91). Data itself is a problematic source of disorder, with data scientists serving as a vital step in rendering the data amenable to analysis and in moving it from a state of localized variation toward a more singular universality of data in both representation and quality. “In today’s complex world, it often takes multiple experts from different domains to really understand what is going on. A Big Data analysis system must support input from multiple human experts, and shared exploration of results” (Jagadish et al., 2014: 93). Domain logics persist in the problematic of Big Data, with the notion that those generalizable skills possessed by the scientist of data are most effective in conjunction with a close attachment to a domain, and that domain experts similarly benefit from the application of those skills to their area of expertise. In the BDHubs, particularly in its early phases, we saw a consistent emergence of the work of alignment and attachment of the skills, tools, and resources generalizable to data analysis with the domains.
Following from the instruction and agreements established to support Big Data at the national scale, and with a particular emphasis on bridging across the academy, industry, and the public sector, a cornerstone of the Big Data Initiative was a series of workshops and design charrettes that brought together representatives from these different sectors who sought to identify and characterize the forthcoming challenges for Big Data. It was here that NSF leadership and program staff sought to assess, understand, and in many ways render the widely claimed, broadly applied field of data science in some way tractable to development under a cohesive funding effort. A primary objective of these events was thus to understand the state of the field, with science policy advisors noting that “[by] improving our ability to extract knowledge and insights from large and complex collections of digital data, the initiative promises to help solve some of the Nation’s most pressing challenges” (OSTP, 2012). Data science was presented as broadly useful, generally applicable, and fundamentally concerned with untapped value in disparate data, waiting to be extracted.
In reports and presentations drawn from these workshops, we see initial prospecting efforts working to establish an agenda and mode of work that would persist throughout the process of developing, funding, and rolling out the BDHubs consortium. Organized around the loosely defined notion of “partnership,” the workshops primarily recruited attendees through existing social networks and familiarity with research work undertaken by the organizers. “Why are you here?,” one slide asked: “You have made some connection about Big Data with OSTP, the Big Data Senior Steering Group and/or one of the agencies involved” (Iacono, 2013). Workshop attendees were then charged with a set of four tasks: “Fact finding: Collect data and information … Idea finding: Listen for new ideas, models, partnerships, etc … Partner finding: Search for your Big Data ‘soulmates’ … Solution finding: Discern promising ideas that can be applied and that would make a difference” (Iacono, 2013). Note that each of these “charges” for action during the workshop describes activities that loosely fit under the umbrella of prospecting: discovery of social connection; provision of access to data; mapping and understanding existing social, organizational, and institutional ties; and discovering those basic questions that might be readily answered.
Another notable finding from these workshops was a shared understanding among many participants that the location of future value for data science is “elsewhere,” but nevertheless available to be leveraged—only in a different, and perhaps more systematized, way than it currently was. For example, at a 2012 workshop that gathered members of the academic community, government, industry, and nonprofit organizations to discuss implementing the aforementioned national strategic Big Data plan, David Logsdon from the US technology trade association TechAmerica exclaimed:

Big data has the potential to transform government and society itself. Hidden in an immense volume, variety and velocity of data that is produced today is new info, facts, relationships, indicators and pointers that either could not be practically discovered in the past or simply did not exist before. This new information, effectively captured, managed and analyzed has the power to enhance profoundly the effectiveness of government. Although the impact of big data will be transformational, the path to effectively harnessing it does not require government agencies to start from scratch. Rather, government can build on the capabilities and technologies it already has in place. Success in capturing the transformation lies in leveraging the skills and experiences of our business and mission leaders rather than creating a universal big data architecture. (Transcribed from a presentation delivered for TechAmerica)
More to the point, though, it was out of these design charrettes and workshops—themselves a prospecting activity—that the goal of developing what at the time were referred to as “big data coalitions” first surfaced. In tandem with official Requests for Information and informal solicitations from potential partners, a blueprint for what would eventually become the BDHubs model began to take shape. The NSF funded four Big Data Hubs in late 2015, with one Hub designated for each of the Northeast, South, Midwest, and West regions of the country, with regions determined by US census population. The justification for regional Hubs, versus a single national entity, was to foster more face-to-face interactions within each region as a means of stimulating innovative data science research and development projects and partnerships. While some of the eventual goals of the workshop (in particular, solution finding) might be more readily defined as core data scientific work rather than the prospecting that creates its initial landscape, this nevertheless points to a teleology of prospecting as oriented toward finding that data from which value can be extracted. Prospecting is a means to an end rather than an end in itself—the mechanism by which data sharing and secondary use might be initiated—and is intimately concerned with the untapped value of data outside of its initial context of creation. In turn, it is also “endlessly hungry” in that its very activities of finding and arranging resources reveal further horizons of additional potential targets for prospecting.
Our own ethnographic work with BDHubs began shortly after these initial activities and was focused on the establishment of this novel organizational entity as a new instantiation of “cyberinfrastructure” at the NSF (cf. Ribes and Lee, 2010). Over the course of analyzing the initial imaginative work, planning, and realization of the BDHubs consortium, we came to identify prospecting as a central element of the Hubs’ work: first oriented toward assessing, understanding, and working with the field of data science—as was the case prior to BDHubs being funded—and later as the initiating activity for a variety of data science research projects, applications, and pedagogical endeavors carried out by BDHubs’ leadership and their research constituencies. This was especially evident in their collective attempts to position themselves as intermediaries—sitting between the core data science disciplines and the domain sciences—and as facilitators of those activities we label prospecting: outreach to domains, capturing domain epistemology, and rendering domain data available for secondary analysis and re-use. The process of making these resources available is thus revealed as a mode of ordering data science through various kinds of formalizing procedures.
The work of the BDHubs, in their consortia-building efforts, is the work of producing a specific version of order (i.e., an avenue for coordination and communication), localized to knowledge produced by the NSF through a series of workshops and design charrettes that uncovered disorder within their work. It is through the work of prospecting, of producing knowledge about the site of action with an eye toward in some way forming a lasting attachment to that site, that both a national consortium and a generalized discipline of data science might operate, and might identify further sources of disorder. We return to Berg and Timmermans (2000), here, who remind us that order and its Other are in fact “two sides of the coin in a double sense: not only does the one come into being only with the other – it also cannot survive without it” (52).
In the case of BDHubs, seeking to know, accession, and order the diversity of data (and other domain resources) reveals deeper problems and challenges for the work of data science: the very act of trying to frame particular activities and resources exposes new overflows that beg further framing. This work of discovering, extracting, and leveraging the value of data is what makes prospecting fundamental to the practice of data science: prospecting names the work of locating data resources ripe for value extraction. These activities bear consequence for how the BDHubs consortium came to be, and they reveal the heterogeneous interdisciplinary knowledges, work practices, and sociotechnical analyses we include under the label prospecting. Such work is a fundamental instrument of data science, and tracing its effects is essential to understanding questions of interest, the epistemology and application of data scientific work, and the broader structures of science policy as funders seek to engage with novel areas of scientific inquiry. 4
As the consortia-building efforts of the BDHubs emerge from an apparent or perceived lack of coordination among the multiplicity of instances of what might be data science, so too is their existence predicated on the notion that there is still further to reach, more participants with whom to engage, more resources to make available. The Other of the BDHubs—like the Other of data that may yet be analyzed, rendered universal, made available—both gives rise to its current form and provides the necessary means for its existence. Prospecting behaviors reach out to the Other, expose new forms of disorder, and position the coordinative entity accordingly. However, this positioning work, both in the BDHubs and across data science as an emergent field, bears consequence in the domains: in how they structure their data, and in how that data is assessed, curated, and made available.
As of this writing the BDHubs are still in process, having recently been renewed for two years of further funding. While the core consortium-building efforts remain the same, the role of the “Spoke” projects, each of which is a traditional research project leveraging data science in heterogeneous domains, and the final sustainable organizational plan remain contested and uncertain. The Hubs themselves are in a phase of prospecting their own future. As the Hubs form stable partnerships and build upon successful activities, they become more ordered internally while exposing new avenues for future work: new forms of disorder. They in essence prospect the broad and contested space of data science to further become themselves. In the next section, we deploy the notion of a fundamental relationship between a knowledge creator and their object of study, which Serres (1982) refers to as “the parasite,” to explore how prospecting is not only generative of new engagements between data science and the domains, but also highly consequential for the domains themselves.
Positioning data science: Moving downstream
Having discussed the “emptiness” of data science relative to the domains with which it engages, and the nature of data science as simultaneously producing order among heterogeneous resources and data while creating and exposing further areas of disorder in the process, we now move to our third theoretical elaboration: how data science positions itself as both consumer of data and, in effect, arbiter of interoperability. Here, we turn to Michel Serres’ (1982) notion of the parasite: Knowledge parasites the world, parasites objects, systems, black boxes, and laboratories. It is a general undertaking of pumping out and capturing of information. If, one day, the parasite invented the exchange of material for logical at his host’s table, and vice versa, he also invented science and theory the same day. What would all knowledge be without this asymmetrical, crossed exchange? This irreversible capture. (210)
This relationship is key to our understanding of data science and prospecting not in that it displaces prior relationships, but rather in the nature of data science as consistently working on data outside of the initial context of its creation—data science bears the same relationship to secondary use of data as other sciences bear to the data they initially create. Data science takes as its object data produced by other sciences, and thus is both a source of disorder—for example, in siloed data being consistently problematized for its inability to move outside its initial domain—and a producer of order in its position as “downstream” of other sciences. Brown (2002), commenting on Serres, describes the nature of the parasitic relationship in a way especially evocative of the nature of an emergent science of data: The parasite does not seek to establish property rights, they merely exploit all such efforts at enclosure and create a vector where everything flows towards them. In the chain or cascade of parasites that opens up in every white space, the position of power is always found in she or he who comes last. From this position, one may parasitise all the others. (93)
Operating as a generalizable tool across many domains, data-intensive analytic techniques seek to take on this downstream role, sometimes placing themselves as the “last word” in analytics, but more likely just expanding the capacity of scientists to act (a more reasonable definition of power). The fundamental relationship Serres characterizes as the parasite plays a dual role in the system. It makes communication possible across domains by acting as the means of intermediation. But it also necessarily disrupts the message. This disruption is transformative—data is re-described, reformatted, and processed according to downstream analytics, and this transformed data becomes more useful in its availability to the generalized methods of data science. The more data is collected, accessed, organized, and rendered unto order, the greater the capacity of those scientists of data to approach heterogeneous domains, to apply their analysis to social, scientific, and commercial “problems”—in short, more data in ordered, accessible form confers greater power to act in the world, but, vitally, that data is ordered and accessed according to the practices of the data science that so ordered it. As such, as the downstream consumer of data, data science arbitrates and mediates the notion of reusable data, and data becomes characterized as disordered (siloed, ill-described, underlinked) according to the general set of tools employed by data science, rather than endogenously according to the conventions of the domain from which it originated.
Alongside and in sync with data scientists themselves, the work of the BDHubs consortium is in understanding, mapping, and assessing the resources potentially available to the institution of data science. The level of coordination of these activities is somewhat unimportant, though—informed by the notion and
That approach, despite its heterogeneity of specific instantiation, is itself operating on domain logics, on the notion that the tools of large-scale data analysis, informed by domain application, are generically useful and productive, i.e. there exists a general set of skills oriented around data itself that might operate across a broad variety of domains. Thus, prospecting is vital to the
Data science currently sits as a novel downstream point of data production initiated with instrumented observation of the world, proclaiming the capacity to reason on data
Conclusion
Data science as a discrete field of study is characterized by a move toward a generalizable, relatively universal set of tools, skills, and knowledge that can be applied to analyze data drawn from a wide range of potential domains. In the work of the BDHubs, we have characterized that generality of technique and specificity of application as a form of “emptiness” that spurs a drive toward attachment to a variety of domains, sectors, and institutions. Rather than pejorative, emptiness in this context is a
Emptiness here is not a lack of theory, or epistemology (i.e. statistical, computational), but rather a structured, intentional means that aims to generate universality and generalizability in technique and technology. In creating a domain agnostic BDHubs initiative, the NSF participated in an intentional emptying of domain affiliation, leaving the individual regional Hubs much space for flexibility and adjustment as they went about their work of facilitating data scientists in their work of prospecting domains.
Much as work toward standardization of data and interoperability creates an increasingly universal and generalizable world of data available to data science, so too does a domain-agnostic approach to the institutionalization of the data sciences format increasingly broad swathes of domains as available to data scientific engagement. While the prospecting work that we have observed as central to this activity is only a first step in pre-formatting domains and collaborations, the work of establishing stable, long-term relationships between a field of study characterized by its reach toward an approach generalizable across scientific domains and the domains it engages is an ongoing process with no apparent end. Order produces and is produced from its Other.
The emptiness of the BDHubs is structurally established, paralleling the character of data science itself as it emerges as a cohesive discipline with increasing support for its institutionalization. There will be more data to prospect, more disorder to be found and managed, another step further down the parasitic chain to move, and this space will grow as more prospecting work is done. Any organization that takes the institutionalization of data science as its object of work will likely bear substantial similarities in approach, agnosticism, and a focus on prospecting work. Thus, prospecting becomes a fundamental initial step in forming new collaborations, rendering data amenable to analysis outside of the context of its initial creation, and in propagating the rapidly developing set of tools, resources, and approaches that are closely coupled to a science that takes data itself as an object of study.
