Introduction
Machine learning (ML) systems are rapidly being developed and deployed in a variety of socially consequential domains. Yet, there is a growing abundance of examples of how these systems are failing people of color (Noble, 2018; Benjamin, 2019), women (Bolukbasi et al., 2016), LGBT+ communities (Scheuerman et al., 2019), people with disabilities (Hutchinson et al., 2020; Trewin, 2018), and the working class and those in poverty (Eubanks, 2018). Many of these failures have been the direct result of underrepresentation, misrepresentation, or the complete lack of representation of these groups in the data upon which these systems are built (Paullada et al., 2020).
In response to these failures of algorithmic systems, a proliferation of algorithmic fairness interventions has emerged in recent years that hinge on balancing representation of different demographic groups within training datasets—the data used for algorithms to “learn” associations (Shankar et al., 2017; Merler et al., 2019; Yang et al., 2020). While interventions of this sort play a non-trivial role in achieving recently advanced technical definitions of algorithmic fairness (e.g. Hardt et al., 2016), failures of data-driven systems are not located exclusively at the level of who is represented or under-represented in the dataset. For example, deficiencies are often tied to unstated assumptions underlying the dataset or to the schemas used to encode particular types of harmful classifications (e.g. Ọnụọha, 2016; Denton et al., 2020; Scheuerman et al., 2020). Solutions oriented around balancing representation of sociodemographic groups within ML datasets often reflect a focus on “fairness” at the wrong level of abstraction (Selbst et al., 2019). Worse, data collection efforts aimed at increasing the representation of marginalized groups within training data are often executed through exploitative or extractive mechanisms (Chutel, 2018; Solon, 2019).
In contrast to the significant efforts that have focused on statistical properties of training datasets, comparatively little attention has been paid to the various modes of their constitution; that is, how and why these datasets have been created, what and whose values influence the choices of data to collect, and the contextual and contingent conditions of their creation. To meaningfully understand how and why ML systems fail marginalized communities, we need to write critical histories of our present in which ML datasets are understood both as infrastructural and genealogical objects of inquiry. These histories ought to be attentive to the politics and standpoints of the designers who construct ontologies and category dictionaries, the gig-economy annotators who categorize data instances into ontologies, and the data subjects whose likenesses or utterances are documented and absorbed into the dataset, often without their knowledge, consent, or compensation.
In this article, we highlight key moments of a critical history of ML datasets by focusing on the popular dataset, ImageNet (Deng et al., 2009). ImageNet is widely recognized as having far-reaching impacts within the field of ML and artificial intelligence (AI). In this sense, ImageNet operates as more than just a dataset, but as a discursive object that has power and influence far beyond its role within a localized computer vision system. Our primary aim in analyzing ImageNet is not to univocally question the legitimacy of the dataset itself as an object of research and use. Rather, our analysis reveals the assumptions, norms, and values inherent in the modes of dataset construction as a way to reflect on their limits and show how datasets have not only a material form as informational infrastructure but also a temporal dimension as historically situated artifacts.
We analyze discourses which shaped ImageNet, focusing on three problems: the importance of data; meaning and the computational construction of understanding; and the strategic choices regarding the visibility (and invisibility) of labor. We conclude with the implications of pursuing a critical history of ML datasets operating in relation to modes of power and contestability.
Data infrastructure
Data as a concept or object has been written about at length within science and technology studies (e.g. Gitelman, 2013). We narrow our focus to the peculiar object of the ML dataset, i.e. a collection of data instances, collected and curated for the purpose of developing ML algorithms. Datasets form the background conditions upon which ML research and development operates: they structure how ML practitioners frame and approach problems, inform how progress is defined and tracked within research communities, and create the grounds upon which algorithmic techniques are developed, tested, and ultimately deployed in industry contexts (Denton et al., 2020). In short, datasets form the critical information infrastructure underpinning ML research and development, as well as a critical base upon which algorithmic decision-making operates. We use the term infrastructure in a broad sense, echoing definitions from infrastructure studies (Bowker and Star, 2000; Bowker et al., 2010; Larkin, 2013), to encompass the conceptual and material tools that enable different forms of knowledge, work, and scientific practice.
The defining of infrastructure is what Larkin terms a “categorizing moment” (2013, 330), one that shines a particular lens on an object of inquiry. In our work, the framework of infrastructure foregrounds two central properties. First, distinct from being understood simply as an inert object, data infrastructure creates the background conditions—the environment—upon which ML research and development operates. As Larkin states: “What distinguishes infrastructures from technologies is that they are objects that create the grounds on which other objects operate, and when they do so they operate as systems” (2013, 329). Star complicates this view, suggesting that while infrastructure serves as an invisible background for other types of work, it can also easily become a barrier upon breakdown or when unable to serve certain types of work and action (1999, 380). Second, this environment has been built—ideated and labored upon by individuals located within particular socially, historically, geographically, and institutionally situated contexts.
We posit that the relationship between ML practitioners and the data infrastructure that supports their work can be characterized by a trajectory of naturalization (Bowker and Star, 2000, 294–295): as ML datasets become increasingly familiar and relied upon within daily routines, the contingencies of dataset creation are eroded in a manner that ultimately renders the constitutive elements of their formation invisible. Dataset naturalization is hastened by the current norms of dataset documentation within AI research and practice, which render certain conditions of creation invisible from the start. For example, publications accompanying new datasets under-specify the decisions that go into collection, curation, and annotation (Geiger et al., 2020; Scheuerman et al., 2020), and data labor often goes undocumented (Irani and Silberman, 2013; Gray and Suri, 2019; Miceli et al., 2020). Moreover, data work is often carried out by annotators who are not co-located in the same geography or culture as the ML practitioners, which can serve to further distance data labor from its outputs (Gray and Suri, 2019). The more naturalized ML datasets become, the more likely they are to be treated as value-neutral scientific artifacts and unquestioningly adopted by ML practitioners. In this manner, they come to resemble the black boxes of laboratory science (Latour, 1987).
Moreover, despite the foundational role data plays, data work is rarely considered foundational. Guidance or advice on how to construct ML datasets occupies little to no space in ML textbooks and curricula (e.g. Goodfellow et al., 2016). Data work is heavily under-incentivized: most attention is paid to algorithmic developments (Jo and Gebru, 2020; Hutchinson et al., 2021; Sambasivan et al., 2021), and publications that focus solely on dataset creation tend to be devalued within traditional peer-review processes (Heinzerling, 2019). Consequently, data practices themselves operate as unquestioned, unchallenged routines, that is, as naturalized infrastructure.
In what follows, we begin to examine the ethical and political genealogy of ImageNet; that is, the factors at play as well as those taken for granted in its construction. As articulated in the analysis below, we argue that the material conditions of emergence at work in the processes of building such datasets, in addition to the shared background of discourses and practices brought by the AI researchers and dataset curators—those who collect, scrape, and collate different pieces of ML datasets—are rich grounds for critical inquiry. Towards this end, the genealogical method can be employed as a mechanism of denaturalizing data infrastructure by providing a pathway to account for, and identify, the discourses and practices of curators over time and the contingent epistemic assumptions, choices, and decisions that impacted the production of ImageNet itself.
The archeology of ImageNet and the genealogical method
Crawford and Paglen (2019) begin to examine some of these questions regarding dataset construction in an illuminating project, excavating.ai, which they describe as an “archaeology of datasets.” Their archeology probes the “person” subcategory of the ImageNet hierarchy, revealing the depreciative and morally inflected categorizations embedded within it.
While we build upon and extend Crawford and Paglen’s work, our project is methodologically distinct from both the archeology of discourses in its Foucauldian formulation (Foucault, 1972) and the method of excavation of political images presented by Crawford and Paglen. Foucault’s archeology is broadly concerned with the positionality of statements, images, and discourses, specifying the boundaries or limits between what can be said, done, or thought within a given epistemic and historical context. Crawford and Paglen’s political archeology excavates the hidden motivations, assumptions, and values; in short, the patterns of meaning which populate, distort, and subvert the use of the dataset. The promise of their archeology holds that once we bring these hidden assumptions and motivations to light, we will finally be in a position to extricate the internal bias operative in the classification and labeling practices of a given dataset. Our view is that these accounts of archeology—while crucial for the development of our own project—are in fact insufficient for denaturalizing and challenging data infrastructure as it is currently established.
At this point, we must ask what other interpretive methodologies might help us conceptualize our treatment of datasets. Plasek (2016) suggests that “we need [to write] histories of the datasets themselves” by identifying the ways in which these entities were formed, maintained, altered, and utilized. On this view, writing a history of datasets would include not only an account of the hidden subjective elements (assumptions, motivations, values) which shaped their constitution but also the objective or material components which form dataset infrastructure. Our work advances a research program that we call the genealogy of data applied to ML (Denton et al., 2020). We find this methodology especially apt for AI, where the discourse of progress as continuous development 1 and technical solutionism pervade the field.
According to Foucault (1977), genealogy is an interpretive method that traces and identifies the temporal conditions of emergence, formation, and transformation of practices, discourses, and concepts within given historical contexts with the purpose of identifying the material conditions and manifestation of various modes of subjection. It does so by unveiling strategies (means towards ends) and resistances to these strategies that are sometimes removed from the actors’ own explicit intentions. Its focus is on irreducible forms of resistance, the formation of minor discourses and practices in tension with the dominant one(s). In particular, genealogies put forward an account of how modes of power came to define particular kinds of actors whose possibilities for action are conditioned in particular ways (Koopman, 2019).
As an interpretive method, our genealogy of data is not reducible to excavating the hermeneutic ground, that is, the hidden meanings related to the values, assumptions, and motivations of particular actors; nor is it reducible to analyzing discourses and practices through the relation between signified and signifier. Rather, one focus of the genealogical analysis is the strategic emergence of the discursive events operative in a given dataset infrastructure. Archeology and genealogy are each, in their own way, an attempt to re-constitute historical events, seeking to uncover events verified from their respective spheres of jurisdiction: archeology is defined by the historical regularity of the sayable and its limits, revealed by epistemological modes of exclusion; genealogy constitutes knowledge within a historical field of power relations, examining the various modes at play in the formation of data subjects 2 .
In this paper, we focus on the formation of subject positions and data roles, and their deployment in a network of power relations that have resulted in (1) the formation of the conditions of possibility—such as the technological affordances of novel crowdworking platforms and the demands of academic publishing and funding—operative in the practices and discourses of dataset construction, and (2) the transformation of subject positions and data roles/points, which constitute and constrain the production of explicit or implicit meaning within a given data infrastructure. We do not deny the crucial role that subjective values, assumptions, and intentions play in the construction of the dataset itself. On the contrary, a genealogy of datasets involves an analysis of the positionality of subjective values, assumptions, and intentions in the economy of the dataset. This means that not all subjective values, assumptions, and intentions are equal; rather, they have different weight and different effects in the construction and deployment of the dataset itself.
In contrast to Foucault, our genealogy of data is not an amoral endeavor but a historically inflected ethics of knowledge formation within datasets, with measurable outcomes. Genealogy for us is a specific interpretive and analytical mode of inquiry that seeks to produce technical standards of analysis and practice which are themselves open to contestation. Our objective is to think of new ways to translate between conceptual problems and technical solutions around transparency and accountability, considering datasets alongside the power relations that they create and replicate. In this paper, our genealogy retraces key moments operative in the discourses surrounding ImageNet, including talks and related documentation which shaped and justified the constitution of ImageNet as a benchmark dataset. In this respect, genealogy functions as a method of interpretation of historically constituted discursive strategies in which dataset creators 3 function less as biographical beings and more as discursive figures or discursive knots 4 which create and implement the discursive and infrastructural conditions of dataset construction.
ImageNet and the emergence of deep learning
We now turn to the ImageNet dataset and its associated ImageNet Large Scale Visual Recognition Challenge, offering a short history of both. We then critically engage with the discourses—both historical and contemporary—that have formed and transformed around it (Fairclough, 2013). Namely, we focus on texts associated with the dataset’s creation, including original publications, talks given by its creators, and wider discussions within computer vision about the dataset and its practices. Table 1 outlines the texts upon which we relied. We selected sources primarily from the ImageNet creators, but also from other major ML researchers who built upon the ImageNet work. Notably, we do not include texts produced by data labelers, mostly because these do not readily exist in the historical record, but also because we focused primarily on texts which were afforded institutional weight by virtue of being produced by ML researchers. We find that discourses around ImageNet revolve around three critical themes: the importance of data; the computational construction of meaning and understanding; and the visibility (and invisibility) of labor.
Table 1. Description of sources used for discourse analysis. Sources in the top pane are from Li and other creators of ImageNet, or reporting on them; those in the bottom pane are subsequent works by other authors.
We analyze ImageNet primarily because of its outsized influence on the whole field of ML and because of the claim—made by many deep learning researchers themselves—that ImageNet irreversibly altered the direction of the field (e.g. Gershgorn, 2017). Moreover, ImageNet is representative of dominant data practices, and therefore of the incentive structures, institutions, and work practices, in computer vision research. We applaud the ImageNet creators for being forthcoming with their documentation and details about their processes, which is not the case with many other researchers in their subfield. In the same vein, we address the creators of ImageNet—including Fei-Fei Li, Jia Deng, and Olga Russakovsky—as discursive figures who convey insight into those institutions and practices. To the extent that our research efforts can be deemed a “critique” of ImageNet, our work is in service of a constructive effort to help identify some of the oversights associated with its emergence and development. Our objective is to contribute to the transformation of benchmark datasets and their associated practices of collection and use. In doing so, we aim to contribute to the ongoing development of standards and norms of algorithmic fairness and accountability.
A short history of ImageNet
According to the creators, a team of researchers spread across Princeton University and Stanford University, the ImageNet dataset was developed to support research and development into visual object recognition (Deng et al., 2009; Russakovsky et al., 2015). Consisting of over 14 million images organized into about 20 thousand categories, at the time of its creation it was one of the largest human-annotated image datasets ever developed, orders of magnitude larger than its predecessors (ImageNet’s primary predecessor, PASCAL VOC (Everingham et al., 2010), had 20 categories and 19,737 images). Senior faculty at Princeton discouraged one of its creators—Fei-Fei Li 5—from pursuing the project; the task would be too ambitious for a junior professor, they said. When she applied for federal funding to help finance the undertaking, her proposals were rejected, with commenters saying the proposal’s only redeeming feature was that she was a woman (Gershgorn, 2017). Despite these initial setbacks, several key affordances came together to make ImageNet possible.
First, the rise of digital image sharing during the decade prior to ImageNet’s creation, coupled with search engines capable of indexing images on the web, provided a mechanism for the ImageNet creators to construct a massive dataset of high-resolution images based on simple text-based queries. The categorical structure for the dataset is derived from WordNet 6 , a large database of English words organized into a hierarchy based on semantic relationships, developed by cognitive psychologist George A. Miller in the 1980s. The keywords for each concept were used to scrape web search engines for images associated with that concept. Previous datasets, such as TinyImages (Torralba et al., 2008), had leveraged this same method of data collection and categorical structure. ImageNet, however, relied on the newly formed Amazon Mechanical Turk (AMT) crowdworker platform to add a final step of human annotation to the data collection process. After tens of thousands of candidate images were collected for each WordNet concept, each image was reviewed by a set of human annotators from AMT who were instructed to confirm the presence or absence of the given concept in the image. Images retrieved from the web queries associated with a particular category were only included in the final dataset if sufficient inter-annotator agreement was obtained during the manual annotation process.
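To make this filtering step concrete, the sketch below illustrates the verification logic in miniature. It is a toy under stated assumptions, not the creators’ actual pipeline: ImageNet determined the required number of agreements dynamically per category, whereas here a fixed agreement fraction and hypothetical image names stand in.

```python
# A minimal sketch of ImageNet-style label verification (illustrative only):
# candidate images retrieved from web queries for a WordNet concept are kept
# only if enough annotators agree the concept is present.

def keep_image(votes: list[bool], min_agreement: float = 0.8) -> bool:
    """Retain an image if the fraction of 'present' votes meets the threshold."""
    return sum(votes) / len(votes) >= min_agreement

# Hypothetical verification votes from five annotators for three candidates.
candidates = {
    "img_001.jpg": [True, True, True, True, False],    # kept (0.8 agreement)
    "img_002.jpg": [True, False, False, True, False],  # dropped (0.4)
    "img_003.jpg": [True, True, True, True, True],     # kept (1.0)
}
verified = [img for img, votes in candidates.items() if keep_image(votes)]
print(verified)  # ['img_001.jpg', 'img_003.jpg']
```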
In 2010, the ImageNet Large Scale Visual Recognition Challenge was established as a yearly competition that would run until 2017. The competition focused on two tasks: image classification (assigning images to pre-specified categories) and object localization and detection (specifying the precise location of objects within an image), and leveraged a subset of the overall dataset: about 1.5 million images organized into one thousand categories. In 2012, Alex Krizhevsky, along with his colleagues at the University of Toronto, won the ImageNet Challenge with a neural network-driven ML model that outperformed all other competitors by a previously unimaginable margin (Krizhevsky et al., 2012). This marked the first use of neural networks in the challenge and is often credited with sparking the resurgence of neural networks (under a new moniker—deep learning) as a dominant ML paradigm (Dotan and Milli, 2020). While deep learning models had experienced several breakthrough increases in model accuracy in speech recognition, computer vision, and other domains prior to 2012, the ImageNet win transformed the field in a way these previous successes had not.
“The unreasonable effectiveness of data”
ImageNet was developed to be a training resource and benchmarking tool for computer vision practitioners. Yet, the impacts of the dataset’s creation extend far beyond the materiality of the dataset and the subfield of computer vision. In this section, we examine how ImageNet shifted an entire discipline’s relationship to data, solidifying big data as a central pillar of AI research.
Prior to ImageNet, computer vision practitioners had been constrained to algorithms that required only small sets of images. These methods tended to rely on hand-crafted image features 7 and often leveraged domain knowledge about object-part relations or prior knowledge extracted from other datasets in order to perform a particular task. In a retrospective talk on ImageNet, Fei-Fei Li remarks that her early-2000s work in object recognition focused on learning from a small set of examples using Bayesian methods, an approach now known as “few-shot learning” (Fei-Fei, 2019; Fei-Fei et al., 2004). The focus on these methods, she remarks, was born of necessity, due to the paucity of image data on the web and the high cost of digital cameras.
For the first few years that the ImageNet Challenge ran, the submissions were similar to the traditional methods that had been successful in the low-data regime. Krizhevsky’s 2012 submission represented the first departure from this trend—a departure that would rapidly be solidified as the norm. Although the method leveraged by Krizhevsky et al. was not new, his team was the first to use it in the ImageNet Challenge. The procedure involves an ML model with 60 million parameters—all of which were “learned” from the data—something that would have been inconceivable with smaller image datasets. To fit so many parameters, the scale of the data needs to be orders of magnitude larger than anything computer vision researchers had seen prior to this moment.
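The 60-million-parameter figure can be checked directly; below is a minimal sketch, assuming PyTorch and torchvision are installed (torchvision’s AlexNet closely follows the Krizhevsky et al. architecture).

```python
import torchvision.models as models

# Instantiate the architecture only; no pretrained weights are downloaded.
alexnet = models.alexnet(weights=None)

# Sum the sizes of all learnable tensors (weights and biases).
n_params = sum(p.numel() for p in alexnet.parameters())
print(f"{n_params:,}")  # roughly 61 million learnable parameters
```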
In the years since Krizhevsky’s ImageNet win, the need for more data has become axiomatic within ML circles, so much so that several well-cited papers within computer vision discuss the “unreasonable effectiveness of data” 8 in modern ML. Within computer vision, “The Unreasonable Effectiveness of Noisy Data for Fine-Grained Recognition,” co-authored by Fei-Fei Li with researchers at Stanford University and Google, discusses how images gathered from the web, even without carefully cultivated annotations and labels, can still produce very good results for the image classification task (Krause et al., 2016). Another paper examines how, in the deep learning era in which new methods have seen massive growth in model size and number of parameters, new datasets have not kept pace. The authors (a collaboration of Google and Carnegie Mellon computer scientists) created an internal Google dataset of 300 million images—called JFT-300M—that is vaguely said to be sourced from Google Image Search (Sun et al., 2017). They demonstrate that with this new dataset they are able to attain state-of-the-art results on an image classification task. In sum, in the short time between the AlexNet success in 2012 and the JFT-300M publication in 2017, we have seen a thorough commitment to ever-growing datasets and a complete commitment to the deep learning paradigm.
The implications of the ImageNet mode of work have been felt all over AI research. The creation of parallel “ImageNets of x” has become the standard mode of doing research in the field of computer vision and beyond (Figure 1). The architects of MusicNet (Thickstun et al., 2017) directly trace the inspiration for their dataset—a collection of 330 classical music recordings with over one million annotations indicating the precise timing of each note—to ImageNet. The same can be said of ShapeNet, which contains a large repository of 3D Computer-Aided Design (CAD) models of a variety of shapes and, like ImageNet, is organized under the WordNet taxonomy (Chang et al., 2015). It appears to be a point of pride and a moral success when a field has matured to the point of obtaining a large-scale annotated dataset; as one recent blog post boldly put it, “NLP’s [Natural Language Processing’s] ImageNet moment has arrived,” describing the value of these types of datasets for a number of subtasks in natural language processing, the subfield of AI which deals with human language (Ruder, 2018).

Computational construction of meaning and understanding
Central to ImageNet’s epistemology is the assumed existence of an underlying and universal organization of the visual world into clearly demarcated concepts. We can trace the origins of this idea back to WordNet. Li’s understanding of the underlying aims of WordNet—to organize hundreds of thousands of English words into a massive ontology—directly inspired the creation of ImageNet. She describes a key formative encounter with a linguist and president of the WordNet Consortium during her time as a junior faculty member at Princeton University. The linguist, Christiane Fellbaum, alerted Li to an abandoned side project aimed at finding an illustrative picture for every concept in the WordNet hierarchy. This encounter sparked the idea for ImageNet: instead of illustrating WordNet concepts via a representative example, the WordNet hierarchy could be coupled with the wealth of data afforded by the rise of internet search engines to construct a massive ontology of images that, in Li’s words, would “map out the entire world of objects” (Gershgorn, 2017).
The relationship between images and their meanings is complicated, mediated, and contextual, with entire fields—such as art history and media studies, structural semiotics, and symbolic hermeneutics—dedicated to the study of the relation between objects and the representation of images. Yet, the documentation and publications accompanying ImageNet do little work to motivate, justify, or analyze the relationship between WordNet categories and their accompanying images. Through this omission, the ImageNet creators signal a presumably self-evident relationship between WordNet nouns and the visual world. The act of recognition within this epistemology—whether by human or machine—is one of identification or verification of an underlying truth of what an image depicts.
Evidence of this underlying assumption can be gleaned from the original ImageNet publication (Deng et al., 2009). For example, when motivating the need for human verification of the candidate labels, the ImageNet creators describe the image search results as having a mere 10% accuracy. The notion of accuracy operative in this statement, and the framing of the labeling process as one of “verification,” suggest the existence of an underlying truth about the presence or absence of a concept in an image. Furthermore, the implementation details of the crowd-sourced task frame it as one requiring little reflexivity or deliberation. Rather, an imperative of speed and an absence of rules for interpretation suggest that naming is something that naturally manifests from a glance or, in the case of concepts requiring specialized knowledge, a quick scan of a Wikipedia page.
Broader discourses surrounding ImageNet further suggest a problematization of object recognition rooted in a decontextualized, non-situated, physicalist account of human vision. For example, Li cites the existence of entire brain regions devoted to object recognition, and the ability of children to recognize tens of thousands of categories at a young age without explicit teaching, as evidence of the innate and foundational human capacity for object recognition (Fei-Fei, 2012). Both of these claims are used to motivate generalized object recognition as a foundational computer vision task. By focusing on innate faculties and neural structures, these narratives frame the human process of object recognition as disembodied and decontextualized information processing, rather than a situated and contextual process of making sense of and describing the visual world. The reliance on a child as the canonical viewer further substantiates an amoral perspective—a child does not judge, interpret, or deliberate upon the scene but simply sees the world as self-evident.
In another motivating narrative, Li describes a study her team conducted which found that subjects could describe the contents of an image after viewing it for only a fraction of a second (Fei-Fei et al. (2007), described in Fei-Fei (2012)). When showing an image description produced by one of the undergraduate subjects, Li emphasizes—to much laughter from the audience—that the student is “not special,” the implication being that anyone could do it. By emphasizing the innate and universal human capacity for visual intelligence, and remaining silent about how the lived experiences of a viewer, and the context of viewing, shift and alter how one sees the world, these narratives establish a decontextualized account of human vision. This framing suggests, moreover, an interchangeability on the part of the viewer: not only do we all have the capacity for sight, but none of us are “special”—we all see the same way.
Visible and invisible labor
One of the major challenges of constructing ImageNet, according to Li’s account, was the verification of the vast amounts of data gathered from internet search engines. University undergraduates were initially enlisted for this task, but this approach was quickly abandoned due to cost: collaborator Jia Deng calculated that it would take 19 years to label the entire dataset that way. Undergraduate labor is, notably, also subject to interruption, contingent as it is on external factors such as funding, the timing of the school year, and training.
After abandoning the prohibitively costly undergraduate labeling effort, the ImageNet creators pondered a machine-in-the-loop approach, whereby an algorithm would sort through the massive troves of data and minimize the overall human labeling effort. This approach was also abandoned, with the realization that the quality of the labels would be limited by the capabilities of the machines at the time of construction. This, as Li describes, would contradict the stated goal of constructing a “gold standard” dataset (Fei-Fei, 2017)—a standard that should ultimately be set by humans.
These two failed labeling attempts inform our understanding of how the ImageNet creators conceptualized the labeling task. While humans—rather than machines—are regarded as critical for obtaining high-quality labels, the constraints on which humans should define the “gold standard” are posed largely in terms of cost and time efficiency. In other words, the ImageNet creators sought a techno-social configuration that would place humans in the position to speedily perform basic tasks of image recognition without interruption and at a low cost. The AMT crowdworking platform was, in Li’s words, a “tool that could scale, that we could not possibly dream of by hiring Princeton undergrads” (Gershgorn, 2017). On this new platform, anyone could construct a “Human Intelligence Task” to be completed by workers on the platform, who would be paid for each item they completed. This arrangement quickly solved the image annotation problem by allowing the work to be broken down and distributed across 49 thousand workers from 167 countries.
Li describes the new affordances offered by MTurkers as a “Godsend” (Fei-Fei, 2019). But despite being framed as a “divine” solution to a technical problem, the workers are not acknowledged, named individually as contributors, or positioned as active stakeholders in the construction and design of ImageNet. The ImageNet creators do not disclose how much the annotators were paid. They do not discuss which countries had the largest number of annotators, nor do they discuss any demographic characteristics of their annotators. This silence is, we contend, structural; MTurkers are not, in the economy of the dataset, actual individual contributors. They are utilized as a generic human intelligence resource capable of executing the requested tasks of labeling images on the AMT platform. This is premised on the idea that all humans have the innate capacity to recognize images in the same way—an approach to vision that erases lived experience from the formation of meaning. This is also evident in the manner in which disagreements are resolved: a majority rule is applied in spaces of conflict (Russakovsky et al., 2015). This objectivist and universal formulation of vision creates the infrastructural conditions for MTurkers, as a collective subject, to exert their fast, low-cost, basic capacity of sight in a non-contextual manner.
The functional roteness, infrastructural devaluation, and abstractness of the annotation task are not lost on ImageNet’s creators. In a slide deck from 2010, they ask themselves if they are “exploiting chained prisoners” with this work, accompanied by a piece of cartoon clip art of a fatigued prisoner in a ball and chain (Figure 2). To defend their use of crowdsourcing, they present a set of statistics attributed to Panos Ipeirotis, Professor in the NYU Stern School of Business, wherein a study of MTurkers demonstrates that a majority do this work in addition to full-time employment to earn extra cash, and most earn less than $100 a week. This is further justified by a slide that superimposes the ImageNet logo on a time-series graph showing the Gallup Index of Investor Optimism rising between the end of 2008—the depths of the global economic crisis—and 2009 (Figure 2b). Although we interpret this image as facetious, the implication is that this recovery was driven by ImageNet and crowdwork more generally.

Discussion
The texts analyzed above offer several different implications for the work of computer vision and ML more broadly. These texts reveal what we call strategies of meaning inherent to dataset production, which are not homogeneous, linear, and unidirectional but rather contingent, conflicted, and constrained by institutional norms of cost efficiency. To put it differently, our inquiry reveals two distinct but related ideological formations operating at the discursive level of dataset creators. The first refers to the accumulation of data as an end in itself; the second involves labor and the production of a de-humanized collective subject treated as a resource.
Regarding data, the web has long been a venue to be “mined.” With data seen as a natural resource to be extracted, the “data as oil” metaphor has been persistent within both corporate and academic contexts (Puschmann and Burgess, 2014; Stark and Hoffmann, 2019). Moreover, the focus on the mass accumulation of data from the web as part of new scientific methodologies is not unique to computer vision (Van Dijck, 2014; Kitchin, 2014).
What we highlight, however, is that the drive for data accumulation in ML practice has resulted in the deepening of the ideological formation of data as a discursive configuration, a mode of collective assent within the ML community, in the construction of datasets. Our concept of an ideological formation of data can be defined as the system of moral beliefs and epistemological propositions which posits that the more data is accumulated, the better and more accurate the techno-scientific instruments will be. The ideological formation of data, we contend, cannot subsist in the same way Plato’s ideas do, as free-standing abstractions; it persists only insofar as it is enacted and reinforced through the material practices of dataset construction.
The drive to collect more data can be motivated by several distinct but connected factors. First, the ideological formation of data holds that one can and will obtain better predictive accuracy with more data. This claim is related to, but distinct from, the claims made by “Big Data” boosters (and their critics): that theorization about mechanisms and causation is meaningless, that all science ought to proceed inductively rather than deductively, and that statistical significance loses its power under a Big Data regime in scientific research (Anderson, 2008; Kitchin, 2014). In this framing, the technical rationale for more data stems from a belief that the ML model requires sufficient examples to learn from, so that it can predict more accurately on data it has not seen before. Under the deep learning paradigm, extensive data is necessary to make the model work at all, which makes the desire for more data all the more acute.
Moreover, discursive concerns about fairness, accountability, transparency, and explainability are often reduced to concerns about sufficient data examples. When failures are attributed to the underrepresentation of a marginalized population within a dataset, solutions are subsumed under a logic of accumulation, the underlying presumption being that larger and more diverse datasets will, in the limit, converge to the (mythical) unbiased dataset. For example, recent papers critiquing image classification datasets envision solutions whereby imagery is collected from around the world (DeVries et al., 2019), and in particular from the developing world (Shankar et al., 2017).
This specific ideological formation is at the root of a recent Twitter feud between Yann LeCun and Timnit Gebru, in which LeCun framed the issue of a vision model “whitening” images of Barack Obama and Alexandria Ocasio-Cortez in terms of a restrictive and purely technical claim of “data poverty,” whereas Gebru pointed to the broader critiques of computer vision technologies and their uses issued by critical race and technology scholarship (Johnson, 2020). A major consequence of casting the problem as one of unrepresentative data is that the entities who already sit on massive caches of data and computing power will be the only ones who can make models more “fair,” and therefore the only ones well-suited and equipped to engage in the work of critique.
The ideological formation of data is also implicated in the constrained relationship between images and concepts that ImageNet relies on. As Li has described, to see an image is to understand the contents of the image—to be able to craft a meaningful narrative that describes what the image depicts (Fei-Fei, 2012). In this sense, visual intelligence requires not only low-level pattern recognition but meaning-making in relation to the visual world. Yet, the formulation of object recognition put forward by ImageNet leaves the subjective, situated, and contextualized nature of meaning-making unacknowledged and unaccounted for. Instead, it constrains the relationship between images and concepts to function under a non-mediated, transparent, and self-evident schema that can be revealed only through sufficient data and human computation.

While Li regularly reminds us that sight is a universal human capacity, there is of course no universal way of making sense of, naming, and describing the visual world. Indeed, far from indexing some sort of mythical universal representation for each WordNet concept, ImageNet represents a very particular way of “seeing” and naming the visual world—one shaped by a myriad of sociotechnical processes, including the digital photography and image sharing practices of the late 2000s, the functionality of web search engines, the imperatives for speed and non-reflexivity designed into the dataset curation process, and the situated subjectivities of the annotators themselves. It is a view that associates “bikinis” with women, “sports” with men, “trout” with fishing trophies, and “lobsters” with dinner (Malevé, 2019; Prabhu and Birhane, 2020). By failing to account for the particularities of this view—particularities that largely reflect a white, Western, male gaze—and wielding a naturalistic rhetoric in popular scientific discourse, the subjective nature of meaning formation and the presence of acts of unreflective interpretation are obfuscated and hidden from view. In this sense, we understand ImageNet, and the myriad image datasets that have followed in its wake, as a technical instantiation of Haraway’s (1988) “god trick,” the view that “sees everything from nowhere” (1988, 581). This parallel is exemplified by the logo of Li’s Vision Lab, an unsettling singular machine eye with a sclera made of sky and an iris made of a camera lens (Figure 3).

Ultimately, the universalist and objectivist approach to the human capacity of vision is at best incomplete and at worst a formula for significant failure in terms of social and ethical costs. The ImageNet creators have, in part, recognized these costs and taken steps to redress the objectionable content uncovered in the data. For example, five days after Crawford and Paglen released ImageNet Roulette—a face classification app trained on the problematic categories from ImageNet’s “Person” subclass—the ImageNet creators removed “about half of the images of people from its site.” While this response is commendable, we note that a deeper interrogation of the ontological assumptions baked into ImageNet is still lacking.

We now turn to the ideological formation of labor. The presumed self-evidence of the relationship between images and concepts formulated the problem of label verification in a manner that positioned anonymous, interchangeable AMT workers as an appropriate solution. The discursive reality of the ideological formation of data is not only normative but finds its infrastructural implementation in the AMT piecework model. Framing label verification as an act that requires little reflective judgement not only suggests that anyone can participate, but that annotators are interchangeable, because they share the same innate faculty of seeing objects and because they exercise vision in the same way. Framing human vision as a capacity separate from reflective judgement positions the unqualified gaze of the AMT worker as an epistemological and technical device which can establish an unmediated relation between our field of vision, images, and concepts. Of course, the unqualified gaze, the lack of rules for interpretation, and the imperatives for speed do not strip the process of subjective meaning, bias, and discriminatory classifications, but systematically leave these aspects unaccounted for.
Moreover, we have seen that ImageNet’s creators frame the MTurk workers as “heroes,” without whom the dataset could not have come into being. The hero’s existence, however, is not defined historically by their technical capacity to complete anonymous tasks, but by actions that have the purpose of transforming the human condition. In reality, AMT workers are not only unnamable as individual beings but, most importantly from the standpoint of requesters and, in particular, the dataset creators, interchangeable: a generic, fungible resource of human intelligence.
If the “cutting edge” of AI is credited to the heroism of its named progenitors, it is only because of the labor of annotators who exist beyond the halls of academia and industry. Artist Mimi Ọnụọha turns this construction on its head by rendering the mundane, crowded home work spaces of crowdsourced image annotators as sites of heroism, illustrating rumpled couches, crowded kitchen tables, and home office desks in bright colors with dramatic quotes such as “Heroes emerge only in times of great need!” (Figure 4).

Conclusion
This work conceptualizes ML datasets as a type of informational infrastructure, and motivates genealogy as a method of examining the histories and modes of constitution of ML datasets. We historicize ImageNet as an exemplar, utilizing critical discourse analysis of major texts around ImageNet’s creation and impact. We find that discourses around ImageNet and other large computer vision datasets generally revolve around three themes: the aggregation and accumulation of more data, the computational construction of meaning, and the rendering of certain types of data labor invisible. We find in these discourses a dual ideological formation: the first around the accumulation of data, and the second around the disembodied, decontextualized nature of annotation work.
Our exercise in tracing this critical history is not critique for critique’s sake. On the contrary, we understand our work as crucial to developing holistic, forward-looking frameworks for data practitioners to reflexively analyze elements of their data pipeline, many of which must be questioned and clarified before gathering a byte of data. In this sense, we echo the call of others who have done significant work around model and data transparency (Gebru et al., 2018; Mitchell et al., 2019) and accountability in dataset development (Hutchinson et al., 2021). Our work here is not meant to suggest comprehensive solutions, but to highlight moments in which reflexive interventions can be made in dataset development, including, but not limited to, the initial ideation/design stage, the creation and collection stages, and the subsequent maintenance and storage stages. Although out of scope for this article, there is a need for more work which highlights the importance of data work for ML model development (Hutchinson et al., 2021; Sambasivan et al., 2021).
