Introduction
Synthetic data is data that is the output of a computational process, such as a simulation, “rather than being generated by actual events” (Dilmegani, 2020). It can be used to train machine learning models. A facial recognition model, for example, can be trained on a synthetic dataset consisting solely of images of procedurally generated 3D models of faces captured in a simulation (Wood et al., 2021). Despite being in its infancy, synthetic data is cast by proponents as the cure for nearly all the problems associated with the machine learning approach to artificial intelligence (AI), from labor costs to privacy and bias concerns (Savage, 2023). The consulting firm Gartner suggests that by 2030, synthetic data will outnumber conventional data in the training of machine learning models (Linden, 2021). Critics reply that synthetic data serves best to augment conventional data rather than to replace it (Feremanga, 2024). In any case, as of 2024, synthetic data is the basis of a growing subsector of the AI industry, with over 120 firms producing synthetic data commodities around the world (Devaux, 2022), numerous synthetic datasets freely available online for applications such as computer vision (Tsirikoglou et al., 2017), and growing interest from big tech in synthetic data as a technique for creating training data for large language models (Wiggers and Coldewey, 2024) and computer vision applications (Krzus, 2022).
However, the nascent synthetic data industry must confront a fundamental technical issue: the “reality gap” (Tremblay et al., 2018), or synthetic-to-real “domain transfer” (Nikolenko, 2021: vi). This is the issue of whether “models trained on synthetic data will work when applied to real-world data” (Nikolenko, 2021: vi). The reality gap must be bridged if synthetic data is to achieve its goals. It is often described as the phenomenon which distinguishes synthetic data from conventional data. One Google employee even maintains a newsletter devoted to synthetic data called “The Reality Gap” (Shtylenko, 2022).
We argue that the reality gap is an epistemological and political economic issue as well as a technical one. We mount this argument via a historical comparison, which shows that the problem of the reality gap has long plagued the technology of simulation. According to Nikolenko (2021), synthetic data can be traced back to early computer vision research in the 1960s (140).
We suggest that synthetic data has a longer prehistory, one rooted in the development of simulation technology stretching back to the mid-twentieth century.
The reality gap is an epistemological issue as well as a technical one. But it is also a political economic issue. Rather than constituting a simple fix for the big problems facing data-intensive capitalism, synthetic data complicates the existing means of producing and utilizing data, adding new layers of technological mediation and labor. We suggest that synthetic data and the simulation approach are not likely to replace the surveillance and capture of user data (Steinhoff, 2022), but will rather become implicated with it in diverse ways. The drawing together of AI, synthetic data and simulation suggests the emergence of an alternative “stack” of technical systems and social processes for the production of AI systems (Bratton, 2016). The emergence of this simulation-synthesis stack demonstrates that the political economy of AI must take account of the proliferation of new technical means for creating data.
The paper begins by reviewing the critical literature on synthetic data and the political economy of AI. In the subsequent section we offer a brief technical introduction to machine learning before discussing simulation and the notion of the reality gap. After examining three regimes of simulation, we discuss how our analysis of this prehistory can be utilized to make sense of the contemporary AI industry.
Political economy of AI and synthetic data
Building on the foundation of critical data studies, which showed that big data is a material, spatial phenomenon with social, political and ecological implications (Boyd and Crawford, 2012; Dalton and Thatcher, 2014; Kitchin, 2021), the political economy of AI asserts the fundamental importance of studying AI as a capitalist industry. Such research asserts that AI is not a disinterested scientific endeavor but rather a project led by a handful of firms with distinct interests in surplus value generation through the sale of AI and data commodities, intercapitalist competition and the increased efficiency of, and control over, labor through discipline and the development of automation technologies (Dyer-Witheford et al., 2019; Pasquinelli, 2023; Sadowski, 2019; Steinhoff, 2021; Van der Vlist et al., 2024).
Political economy of AI research may be divided into three strands focused on labor, infrastructure, and data. Research in the first category has considered both the “hidden” data work, often outsourced to those in the global south (Muldoon et al., 2024; Tubaro et al., 2020) and the prestigious work of data scientists (Steinhoff, 2021). Research in the second category has revealed the materiality of the AI industry, examining cloud platforms (van der Vlist et al., 2024), semiconductor chips (Rella, 2023) and foundation models (Luitse and Denkena, 2021). In the last category, research has focused on the importance of data to the development of AI, from the creation of training datasets (Engdahl, 2024) to the organization of data-based competitions (Hind et al., 2024; Luitse et al., 2024) to the emergence of synthetic data (Steinhoff, 2022). This article contributes to the final category.
As the above authors have shown, AI is a general purpose technology (GPT) and, more precisely, a general purpose technology premised on the continuous capture and processing of data.
We propose that the political economy of AI, after the appearance of synthetic data, must also include the means of data synthesis, such as simulation technology, within its purview. There is limited political economy research on simulation, with most work found in science and technology studies (Edwards, 2013; Galison, 1996; Sundberg, 2009), the philosophy of computation (Korenhof et al., 2021; Lenhard, 2019) and media studies (Bogard, 2006; Chun, 2018; Dippel and Warnke, 2017), oriented primarily towards epistemological, and sometimes ontological, questions. Recent work on the political economy of virtual reality and the metaverse is adjacent to our research here, but is not directly concerned with the questions we pursue (Egliston and Carter, 2022).
Simulation and the reality gap
Synthetic data and the simulation approach to generating it appeared in the context of the rapid growth and commercialization of machine learning since 2010. Machine learning is the use of learning algorithms to generate other algorithms, called models, which represent patterns discerned across large datasets. Models can then be deployed to analyze or make predictions on new data. Models are assessed in terms of their “generalization ability” (Alpaydin, 2016: 40), that is, their ability to function on data not included in their training dataset. A facial recognition model should be able to recognize faces other than those it was trained on; in other words, it should develop a general model of what a face consists of.

The most prevalent approach to machine learning, supervised learning, requires that data be labeled by human annotators, a “time-consuming and expensive task” (Tremblay et al., 2018). The labels function as a “supervisor” and provide the learning system with the basis on which to form its general model: for a facial recognition system, one might label training data binarily as either ‘face’ or ‘not-face’. The problem becomes more complex as the domain expands. Data must also be painstakingly segmented and labeled, so that in every captured frame each tree, person, vehicle and so on is demarcated and annotated (Kniazieva, 2022). Such “AI data work”, as Muldoon et al. (2024) term it, is labor-intensive and routinely outsourced to the global south through digital platforms.

Simulations, as a means of synthesizing data, are one approach to solving the problem of data work. Simulation can be defined in terms of modeling. As an influential early textbook on the topic puts it, simulation is: the process of designing a model of a real system and conducting experiments with this model for the purpose either of understanding the behaviour of the system or of evaluating various strategies … for the operation of the system (Shannon, 1975: 2).
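Before turning to how simulation enters the picture, the mechanics of supervised learning described above can be made concrete in a few lines of code. The following minimal sketch uses the scikit-learn library with invented feature vectors standing in for annotated images; it is an illustration of labels acting as the “supervisor” and of generalization to unseen data, not a description of any system discussed in this paper.

```python
# Minimal sketch of supervised learning: human-provided labels act as the
# "supervisor". Feature vectors and labels are invented stand-ins for images.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Pretend each image has been reduced to a 16-dimensional feature vector and
# annotated by a human labeler as "face" (1) or "not-face" (0).
features = rng.normal(size=(200, 16))
labels = (features[:, 0] + features[:, 1] > 0).astype(int)  # toy labeling rule

model = LogisticRegression().fit(features, labels)

# Generalization ability: performance on data not included in the training set.
new_features = rng.normal(size=(50, 16))
new_labels = (new_features[:, 0] + new_features[:, 1] > 0).astype(int)
print("accuracy on unseen data:", model.score(new_features, new_labels))
```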
In the context of synthetic data, simulations model real phenomena, but are used to generate training data rather than run experiments. A simulation may simply consist of a virtual space for capturing images of 3D models or it may be an interactive video game-like environment. For example, the Singaporean company CVEDIA has trained computer vision models for tracking endangered animals with trail cameras by capturing images of 3D models of them in a simulation (CVEDIA, n.d.-a). On the other hand, NVIDIA has developed Isaac Sim, a virtual environment with simulated physics in which data for the control of robots can be generated by interacting agents (Omotuyi et al., 2024). In either case, synthetic data is synthetic insofar as it is the output of a computational process which is a model, rather than a recording, of a real-world phenomenon. While all data are, in some sense, synthetic insofar as they require labor for their selection, collection, preparation and analysis, synthetic data is distinguished from conventional data insofar as it lacks a direct real-world referent (Offenhuber, 2024). 3
As mentioned above, synthetic data is economically appealing because it promises to minimize the costs of data collection and offers means of de-biasing data and overcoming privacy issues. It also provides a means of circumventing data work. Data synthesized in simulations are automatically labeled down to a precise pixel level, since all of the 3D elements of the simulation are discrete objects (Nikolenko, 2021: v). Within a simulation, the boundaries of those objects are strictly determinable; there is no need to hire anyone to label the data. This would seem to solve perhaps the biggest issue with supervised learning. However, it generates a new problem.
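Why labeling comes “for free” inside a simulation can be illustrated schematically: because every object placed in a virtual scene has a known identity and extent, a pixel-perfect segmentation mask can be written out alongside the rendered image. The toy “renderer” below is our own illustration, with invented objects and colors, and does not correspond to any particular synthetic data pipeline.

```python
# Toy illustration of automatic labeling in a simulation: each object placed in
# the scene has a known ID, so a pixel-level segmentation mask can be produced
# alongside the rendered image with no human annotation.
import numpy as np

H, W = 64, 64
image = np.zeros((H, W, 3), dtype=np.uint8)   # the rendered "photograph"
mask = np.zeros((H, W), dtype=np.uint8)       # the label, generated for free

# Hypothetical scene description: (object_id, top, left, size, colour)
scene = [(1, 10, 10, 20, (200, 50, 50)),   # object class 1, e.g. "person"
         (2, 35, 30, 15, (50, 200, 50))]   # object class 2, e.g. "tree"

for obj_id, top, left, size, colour in scene:
    image[top:top + size, left:left + size] = colour   # draw the object
    mask[top:top + size, left:left + size] = obj_id    # exact pixel-level label

# Every pixel is labeled down to the boundary of each discrete object.
print({int(c): int((mask == c).sum()) for c in np.unique(mask)})
```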
Simulation-borne synthetic data does not overcome machine learning's problem of generalization ability; rather, it expands it. If synthetic data are used to train a model, not only must the model generalize from its training data to new data; it must also generalize from computationally generated data to conventional data, “bridging the reality gap” (Tremblay et al., 2018). For instance, an autonomous vehicle trained on data synthesized in a simulated urban environment must be able to function when deployed on real streets (Kamel et al., 2021).
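The reality gap can be treated as a measurable quantity: the drop in performance when a model trained on synthetic data is evaluated against conventional data. The sketch below, with entirely invented distributions, shows a classifier whose accuracy falls when the “real” data are systematically shifted away from the “synthetic” data it was trained on.

```python
# Sketch of the reality gap as a synthetic-to-real generalization problem: a
# classifier trained on simulated data is evaluated on "real" data drawn from a
# shifted distribution. All distributions here are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, domain_shift):
    """Two classes separated along the first feature; `domain_shift` models a
    systematic difference between the simulated and the real domain."""
    x0 = rng.normal(size=(n, 8))
    x0[:, 0] += -1.0 + domain_shift
    x1 = rng.normal(size=(n, 8))
    x1[:, 0] += +1.0 + domain_shift
    return np.vstack([x0, x1]), np.array([0] * n + [1] * n)

x_syn, y_syn = make_data(500, domain_shift=0.0)    # synthetic training data
x_real, y_real = make_data(500, domain_shift=1.5)  # "real" data, shifted domain

model = LogisticRegression().fit(x_syn, y_syn)
print("accuracy on held-out synthetic data:", model.score(*make_data(500, 0.0)))
print("accuracy on real data (the gap):    ", model.score(x_real, y_real))
```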
The reality gap presents a new economic concern for AI production because highly detailed and realistic simulations are expensive and difficult to make. The requisite labor of building an adequately detailed simulation must be balanced against the potential savings to be achieved through the use of synthetic data. As Tremblay et al. (2018) note, “the primary selling point of synthetic data … that arbitrarily large amounts of labeled data are available essentially for free” can be undercut by the substantial “expense required to generate photorealistic quality” (1082), because sophisticated simulations require “artistic design of the environment or prior real data” (Prakash et al., 2019: 7249). To simulate an urban environment, for instance, an artist must either design a city, or a detailed model must somehow be extracted from existing data. Regardless, even the best simulations are far from a 1-to-1 replication of the real world. As we demonstrate below, simulations always require the selection of a subset of phenomena for modeling. Thus, data synthesis via simulation remains a non-trivial task, as a body of technical literature demonstrates (Alkhalifah et al., 2022; Tobin et al., 2017). To concretize the problem of the reality gap, let us consider two techniques for bridging it.
The first is the refinement of synthetic images toward photorealism, so that they more closely resemble conventional data; the second, often termed domain randomization, aims to increase a model's generalization ability through the random introduction of rendered objects into scenes (Tremblay et al., 2018). The first approach obviously relies on conventional data. However, so too does the second: the inserted objects are situated on top of real-world images from the popular Flickr 8k dataset. 4 These techniques for bridging the reality gap thus suggest that while the simulation approach to synthetic data attempts to create data without recording the real world, conventional data remains essential. This dynamic, we suggest, has a historical precedent throughout the development of simulation technology.
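To make the logic of the second technique concrete, the sketch below composites a rendered object onto a background photograph at a random position, with the bounding-box label produced automatically by the compositing step itself. The arrays here are random placeholders rather than actual Flickr images or rendered 3D models, so this is an illustration of the general idea rather than of any particular published pipeline.

```python
# Schematic of compositing-based data synthesis: a rendered object crop is
# pasted onto a real background photograph at a random position, and the
# bounding-box label falls out of the compositing step. Arrays are placeholders.
import numpy as np

rng = np.random.default_rng(0)

def composite(background, obj):
    """Paste `obj` onto `background` at a random location; return image + box."""
    h, w, _ = background.shape
    oh, ow, _ = obj.shape
    top = rng.integers(0, h - oh)
    left = rng.integers(0, w - ow)
    out = background.copy()
    out[top:top + oh, left:left + ow] = obj
    bbox = (left, top, left + ow, top + oh)  # label generated automatically
    return out, bbox

background = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)  # stand-in for a real photo
rendered_car = rng.integers(0, 256, size=(48, 64, 3), dtype=np.uint8)  # stand-in for a rendered 3D model

training_image, label = composite(background, rendered_car)
print("bounding box for class 'car':", label)
```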
Method
We follow the approach of Hu's (2015) prehistory of the cloud, an examination of the material conditions and social relations which formed the basis for the emergence of cloud computing. Here we aim, less ambitiously perhaps, to examine some historical periods which anticipate the central problem of synthetic data. To do so we employ the method of “technography” (Bucher, 2018; Hind et al., 2022), described as “a detailed examination of the material aspects of technology by directly reading various publicly available documents generated by and related to technical systems” (van der Vlist et al., 2024: 4). Offering a more historical approach, this paper is nonetheless interested in understanding the “specific functionalities or aspects” of simulation and synthetic data, while “maintaining a critical stance toward promotional language [and] industry jargon found in…materials” relating to them (van der Vlist et al., 2024: 4).
Archival research was conducted at two locations: the Ferranti Collection (1948–1979) at the University of Manchester, UK, and the Brian W. Hollocks Collection (1961–1982) held at NC State University Libraries, USA. The Ferranti Collection includes documents related to the Ferranti computer company, responsible for manufacturing some of the earliest digital computers on which various firms relied for their simulation work. The Brian W. Hollocks Collection, accessed remotely, contains various documents related to early simulation work in the UK. Data were also collected via interviews conducted by one of the authors, as part of an ongoing study, with data scientists, 3D artists, CEOs and CTOs at synthetic data companies located in Europe and North America.
This research required analysis of various texts, ranging from technical reports and academic papers across a wide range of disciplines (e.g. statistics, operations research, simulation, and computer science), sales records and operators' handbooks, to more personal, reflective texts on the early days of simulation. Our aim was to evidence how various attempts have been made to tackle the reality gap from the 1950s to the present day. To demonstrate this, we now outline three historical regimes of simulation (Table 1).
Table 1. A summary of simulation regimes.
Regime 1: statistical
During the Second World War, scientists and engineers led by Stanislaw Ulam and John von Neumann were involved in the construction of nuclear weapons as part of the infamous Manhattan Project. As part of this work, they devised a procedure for using random numbers to substitute for detailed knowledge of stochastic processes which exceeded the analytic capabilities of existing technology. Although the team was based at Los Alamos Laboratory near Santa Fe, New Mexico, USA, the procedure came to be known as Monte Carlo simulation, in reference to the famous casino in Monaco.
Monte Carlo involves replacing the analytic solution of a problem with a model defined by a number of key parameters, into which random numbers are inserted over many iterations. The notion is that, from the set of all iterations, inductive knowledge, or at least informed estimates, can be obtained about the phenomenon under investigation (Halton, 1970). Galison (1996) suggests that for its advocates, Monte Carlo simulation went quickly from being “a numerical calculation scheme” to an “alternate reality—in some cases a preferred one—on which ‘experimentation’ could be conducted” (119). 5 While the Monte Carlo method was quickly adapted for use on the early computers of the 1950s, it does not necessarily require a computer and was commonly “performed on electromechanical calculators by a host of operators … in the 1950s” (Nance and Sargent, 2002: 163). It should thus be noted, as Nance and Sargent (2002) put it, that the “use of simulation precedes computers, either analog or digital” (162). The essential characteristic of simulation is not its means of implementation or hardware, but its creation of an “artificial world” (Galison, 1996: 120).
Of course, the nature of such an artificial world varies with the medium used to create it. But in any case, a simulation purports to tell the experimenter something about the real world, and thus Monte Carlo “represents an attempt to model nature through direct simulation of the essential dynamics of the system in question” (Bielajew, 2021: 1). This involves the setting of parameters. As one of the earliest formulations of Monte Carlo puts it: one must “assume that the probability of each possible event is given” so that one can then “play a great number of games of chance, with chances corresponding to the assumed probability distributions” (Metropolis and Ulam, 1949: 337). But how is one to know the appropriate probability distributions? The authors go on to elaborate that a simulation, over all its iterations, can be conceived of as a “space” in which a given “process takes place [as] the collection of all possible chains of events” (Metropolis and Ulam, 1949: 337). They assert that while the “general properties of such a phase space have been considered … much work remains to be done on the specific properties of such spaces, each corresponding to a given physical problem” (Metropolis and Ulam, 1949: 337). In other words, the specific parameters must be set for each specific problem, to reflect the real-world properties of the process in question, if the simulation is to generate useful data. A recent textbook on Monte Carlo reiterates the point: one must start with models of the processes involved (e.g., the laws of gravity) and often one must make assumptions (e.g., there is a Gaussian distribution of speeds about the posted speed limit by cars on a certain highway). The [Monte Carlo] estimates developed are only as good as the underlying models and assumptions (Dunn and Shultis, 2022: 13).
In other words, the artificial world of the simulation must correspond in certain important respects to the real world; the essential dynamics of the real-world system at hand must be reducible to dynamics amenable to representation in the parameters and probability distributions of the simulation.
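The dependence of Monte Carlo on assumed distributions can be seen in even a minimal example. Following the textbook illustration quoted above, the sketch below estimates the share of cars exceeding a threshold speed by sampling from an assumed Gaussian distribution of speeds around a posted limit; the parameters are assumptions we have invented for illustration, and the estimate is only as good as they are.

```python
# Minimal Monte Carlo sketch: estimate the share of cars exceeding 120 km/h,
# assuming (as the analyst must) that speeds are Gaussian around the posted
# limit. The mean and standard deviation are assumptions, not measurements,
# and the estimate inherits whatever error they contain.
import numpy as np

rng = np.random.default_rng(0)

posted_limit = 110.0      # km/h (assumed parameter)
speed_std = 12.0          # km/h (assumed parameter)
n_iterations = 100_000    # number of "games of chance" played

sampled_speeds = rng.normal(loc=posted_limit, scale=speed_std, size=n_iterations)
estimate = np.mean(sampled_speeds > 120.0)

print(f"estimated share of cars over 120 km/h: {estimate:.3f}")
```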
Regime 2: discrete-event
A second regime of simulation was made possible with the development of digital computers in the immediate post-Second World War period. At this time, Monte Carlo was performed on the new computers, but “progressively gave way to more involved bespoke models of real systems” (Hollocks, 2008: 131). These were called “discrete-event” simulations, able to model closed environments. In 1957, the influential cybernetician Stafford Beer established a Department of Operational Research and Cybernetics (known as “Cybor House”) at United Steel, a steelmaker in South Yorkshire, UK, with the task of applying new computational approaches to steelmaking (Hind, 2024). Various projects were devised and carried out at Cybor House, the most important of which was the General Steelplant Program (GSP), a program designed to simulate the production process from start to finish (Hollocks, 2008). The GSP was intended to streamline the Monte Carlo method, reducing both time and labor costs. It represented an attempt to overcome the context-specificity of each simulation problem, as discussed in the previous section, by integrating the shared problems of industrial production into a reusable framework implemented in a semi-automated software suite. The first edition of the GSP Handbook opens by asserting that the Monte Carlo method: consists of describing the plant in terms of a mathematical model, in which the variates representing the uncontrolled factors in the plant behaviour are initiated by random variables. Such a model gives rise to a problem of finding the distribution of some function of the plant condition which represents, in some sense, its behaviour. (Tocher et al., 1961: 1)
The authors highlight a central problem with Monte Carlo: the execution of sampling. More precisely, the analysis of “the structure of the plant to be simulated” followed by its description “in mathematical terms”, the manipulation of the “equations to suitable forms” and the use of these “to generate the sampling values” (Tocher et al., 1961: 2). The solution to this was to “discretize” all component parts and processes of the plant.
In doing this the structure of the plant first needed to be conceived as a series of “machines” through which production moved. KD Tocher (1960), who succeeded Beer as director of Cybor House, held that “the first step in writing a simulation is to prepare a flow diagram” (62). Accidentally echoing Deleuze and Guattari's machinic vocabulary, the flow diagram rendered the plant as a series of machines coupled to other machines, through which materials and work passed.
The next stage was to discretize independent steel plant “activities”, understood as “groups of machines in certain specified states” (Tocher et al., 1961: 7). Ensuring that activities, such as the stripping of ingot molds (Figure 1), were appropriately discretized was critical, especially in being able to capture the necessary “states” the machines might assume at different stages in the steelmaking process (Tocher et al., 1961: 7). Here, the GSP needed to be able to understand how a steel plant moved between such states over the course of the production process.

Figure 1. Simplified flow diagram of activities: Acid Bessemer Steel-Making Plant. Notably, the flow is composed of a series of discrete events.
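The discrete-event logic captured in such flow diagrams can be sketched in schematic form: a clock jumps from one event to the next, and the state of each “machine” changes only at those events, with nothing represented in between. The toy two-stage plant below, with invented timing distributions and no machine capacity constraints, is our own illustration of that logic, not a reconstruction of the GSP.

```python
# Schematic discrete-event simulation: a toy two-stage plant in which ingots
# pass through a furnace and then a rolling mill. The clock jumps from one
# event to the next; machine states change only at events. Timing distributions
# are invented, and machine capacity constraints are ignored for brevity.
import heapq
import random

random.seed(0)
events = []  # priority queue of (time, event_name, ingot_id)

def schedule(time, name, ingot):
    heapq.heappush(events, (time, name, ingot))

# Arrivals of five ingots at the furnace, at random intervals (assumed).
t = 0.0
for ingot in range(5):
    t += random.expovariate(1.0)          # assumed arrival process
    schedule(t, "furnace_start", ingot)

while events:
    time, name, ingot = heapq.heappop(events)
    if name == "furnace_start":
        schedule(time + random.uniform(2, 4), "mill_start", ingot)   # assumed furnace time
    elif name == "mill_start":
        schedule(time + random.uniform(1, 2), "finished", ingot)     # assumed rolling time
    else:
        print(f"t={time:5.2f}  ingot {ingot} finished")
```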
From general steel to general simulation
Specific requirements had to be met to make computer-based simulation economically viable. Firstly, the GSP needed to run as quickly as a bespoke, “tailor-made” simulation program (Tocher et al., 1961: 4). Such generalizability was first pursued via the assembly of a library of subroutines. However, this was rated as “not … particularly successful” by the GSP creators, because it did not solve the laborious task of “analysis of the works” or, in other words, of modeling each individual factory (Tocher et al., 1961: 3). It was left, therefore, to the GSP to find a “generalised expression for the structure of any [steel] plant” (Tocher et al., 1961: 4). As work on the GSP continued, it became clear that it was not simply a General Steelplant Program but a general simulation program, applicable beyond the steel industry.
Yet, for all its advances over the Monte Carlo approach, the GSP faced a similar reality gap problem. As Tocher and Owen (2008 [1960]) bemoaned, the requisite analysis of plant structure remained an obstacle since the “practical success of a simulation depends on the accuracy and validity of the sampling distributions used to describe the processes” (150). Although the discrete-event approach reduced simulation to a chain of machines in discrete states, the functioning of those machines could not be assumed a priori: “Data of sufficient accuracy to produce the correct sampling distributions are not usually available from works records, and a large volume of special observations are then needed, with the consequent labour of analysis” (Tocher and Owen, 2008 [1960]: 150). The possibilities of simulation via the GSP were further limited by the scarcity of data about system failures.
The GSP needed to be able to simulate how breakdowns occur. However, only endemic breakdowns could be simulated since they would be discovered in the analysis of the plant structure. Excluding breakdowns which occur infrequently (whether they had great consequences or not) was “of great practical importance, since the data for finding the statistical laws covering infrequent occurrences is very hard to get” (Tocher et al., 1961: 9). The ambitions of the GSP were thus constrained by the unavailability of data on what would today be called “edge cases” or data points which fall outside a normal distribution (CVEDIA, n.d.-b).
In its early days the GSP did not involve computer displays. Engineers had no choice but to present information to controllers “obtained from teleprinter output and posted up on a display board” (Mellor and Tocher, 1963: 133). Decisions taken by the controllers would then be fed back into a Ferranti Pegasus computer for further rounds of simulation. As the GSP evolved, it was linked up to analog physical displays which emulated the apparatuses being simulated. These were not computer graphics displays, but mechanical interfaces which might, for instance, reproduce a particular control panel (Figure 2). Such physical simulations aimed to place a user within the simulated system in a rudimentary way by visually representing some elements of it. This was a bridge to the next regime of simulation, one that would re-present the problem of the reality gap in a whole new form.

Figure 2. A “typewriter-type device being used as an interactive terminal” for a simulation exercise in the late 1960s.
Regime 3: visual-interactive
A new regime of simulation was initiated when Robert D. Hurrion (1976) produced a PhD dissertation at the University of Warwick proposing Visual Interactive Simulation (VIS), employing color graphics and limited interactivity. Hurrion was at the time working on the simulation of manufacturing processes. He found that a generic model did not work because there tended to be a human scheduler who had control over the process and intervened as needed on the basis of “rules … [which] were frequently difficult to encapsulate in the simulation” (Bell and O’Keefe, 1987: 110). He thus decided to create a simulation program which could visually represent the system at hand, so that the scheduler could see it and intervene as needed. Hurrion would go on, with funding from manufacturers including Rolls Royce and Imperial Chemical Industries, to develop a software package called VISION, later marketed as SEE WHY, which was subsequently used to produce a graphical version of GSP known as FORSSIGHT in 1982 (Bell and O’Keefe, 1987; Hollocks, 1983). We turn now to this simulation application, and the VIS approach more broadly: “the video game approach to simulation” (Bell and O’Keefe, 1987: 115).

With the visual-interactive regime, simulation became a more tangible alternate reality. But, in this regime, the necessity of data takes on a new form concurrent with the new visual mode of simulation. As early researchers put it, VIS demands “added data requirements” (Kirkpatrick and Bell, 1989: 142). Not only do parameters for key statistical properties of the domain need to be set, but the appearance of certain entities in that domain must be represented via computer graphics. This stands in contrast to the Monte Carlo method, in which the “geometry of the environment … plays little role except to define the local environment of objects interacting at a given place at a given time” (Bielajew, 2021: 1). Here the appearance of, and interfaces to, relevant objects become central concerns. First we consider the property of visuality, then we turn to interactivity.
Visuality
The shrinking size of components allowed vast increases in computational power in the 1980s such that it became feasible to render graphics as well as text. The so-called “microcomputers” drove a wave of enthusiasm for increased production speed at United Steel: “The graphics facilities of FORSSIGHT enable the visual displays, previously requiring special physical or electronic construction and control, to be generated quickly and easily” (Hollocks, 1983: 338). One computer could be programmed to visualize any simulation rather than having to build a bespoke physical display. Up until this point, as mentioned previously, simulations were dependent on printed paper outputs: analysis could only take place after a simulation was run. VIS thus put a “window … in the side of the simulation black box” (Hollocks, 1983: 338).
A new problem arose, however, concerning an economy of graphical fidelity to the real world. It became important for simulations to have “adequate realism in the display” (Hollocks, 1983: 337). What was deemed an “adequate” level of realism was, of course, relative to the usability of hard copy outputs and mechanical interfaces. However, despite the notable benefits of graphical simulation, practitioners noted that “visual displays were time consuming, and in some cases expensive to produce, and interactive studies took considerable time” (Hollocks, 1983: 336). For simulation designers, graphical outputs needed to be convincing. They would utilize new capacities for the dynamic display of “shapes and colors, together with alpha-numeric information” which made possible the “construction of clear, easily understood mimic diagrams” (Hollocks, 1983: 337). A mimic diagram was “a fixed background and a set of dynamic icons” (Kiernan, 1991: 11), or an animated graphical representation of a certain domain. At the present moment, when computer graphics are near-indistinguishable from reality, it can be difficult to appreciate the novelty of such visuals at the time.
FORSSIGHT produced simulations which were graphically inferior to contemporary video game systems, such as the Famicom, released in Japan in 1983 (Figure 3). The limits of these visuals motivated research on the introduction of realism into mimic diagrams; purely iconic simulations such as those produced with FORSSIGHT quickly became inadequate. Such work details the construction of dynamic icons, but insists that for “visual realism” to be introduced, icons must be combined with backgrounds consisting of “scanned images from plant photographs or schematic diagrams” (Kiernan, 1991: 10). While gauges might be graphically simulated, this was deemed insufficiently realistic unless they were presented on top of a photograph of the actual pumphouse, with the simulated gauges located in relation to their real-world counterparts. The visuals generated via programming could not adequately capture the real world within the simulation, so digitized analog photographs typically had to stand in for it.

Use of Visual Interactive Simulation (VIS) to model flight departures at an airport terminal.
Mimic diagrams encapsulate a central problem posed by VIS: picture definition. This consists of two elements. Problem formulation: “the picture must display ‘solutions’ to the problem that are useful to the user”; and conceptual validation: “the picture must correctly represent the system to the user” (Kirkpatrick and Bell, 1989: 141). In both cases, data which are not required for purely statistical simulations become necessary. Data are also required about the transitions between discrete states. Kirkpatrick and Bell (1989) describe how a VIS simulation of a train depot required animation of trains moving from one track to another, rather than “simply appearing” in a new location (142). While VIS may employ a discrete-event logic, it must represent the analog transitions which are abstracted away by discrete-event simulation. As one comparative study of a statistical batch-processing simulation versus a visual simulation concludes: “The VIS model … required more detailed data to appear to be accurate, which increased the demands on data acquisition … it was necessary to simulate a finer level of detail in many areas” (Kirkpatrick and Bell, 1989: 148). Visuals not only convey information about a simulation's operations and results, but also about how it can be interacted with.
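The “finer level of detail” demanded by visual simulation can be illustrated with the train example: a discrete-event record only states where the train is at one event and at the next, while the display must compute positions for every intermediate frame. The coordinates and times below are invented for illustration.

```python
# Illustration of the extra data a visual simulation needs: a discrete-event
# model records only the before/after states, while the display must show the
# train moving between them. Positions and times here are invented.
import numpy as np

# Discrete-event record: (time, (x, y) position of the train)
event_before = (10.0, np.array([0.0, 5.0]))   # on track A
event_after  = (14.0, np.array([30.0, 8.0]))  # on track B

def position_at(t):
    """Linearly interpolate the train's on-screen position between events."""
    (t0, p0), (t1, p1) = event_before, event_after
    alpha = np.clip((t - t0) / (t1 - t0), 0.0, 1.0)
    return (1 - alpha) * p0 + alpha * p1

# Frames the display must draw, none of which exist in the event record.
for frame_time in np.linspace(10.0, 14.0, 5):
    print(f"t={frame_time:4.1f}  position={position_at(frame_time)}")
```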
Interactivity
The analog interfaces described in the discrete-event regime section were the first form of interactivity in simulation. These displays enabled operators to see something of what a simulation was doing as it ran. Hurrion's VIS went further. As Hollocks (2006) describes, it: integrated the graphics display with the simulation model such that the graphics display was updated by the model as it ran, that is, the graphics are “concurrent” with the model … The principal USA development path in such graphics at the time … adopted a “replay” approach. In this, a simulation run produces an output data file detailing the sequence of events in the run; this is then the source for a second program that generates graphics from the file … There is no interaction with the model (Hollocks, 2006: 1390).
Bell and O’Keefe (1987) understood this as a difference between “animation” on the one hand, and full “visual interaction” on the other. Whilst an animation might allow the user to interact with a graphical representation of the simulation during a replay (e.g. zooming), it did not allow intervention in the simulation itself or its parameters.
Interactivity does not apply only within the set parameters of the model, but to those parameters as well. According to Sohnle et al. (1973) interactivity refers to the user being able to “not only observe, verify, and record data, but also to interrupt the simulation so that parameter values may be changed, or the model structure modified, or progress be retraced or redirected” (146). In Hurrion's (1986) words, interactivity involves “respecifying the model” or modifying its parameters (285). This is an operation, as explained above, which requires data about the domain to be simulated. Respecifying is only a simple task for the simplest domains which can be precisely specified, increasing in difficulty as the phenomena to be modeled become more complex.
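Interactivity in this stronger sense, interrupting a run in order to respecify the model, can be sketched as a simulation loop whose parameters are changed partway through. The toy queue below and its parameters are hypothetical, standing in schematically for the kind of user intervention Sohnle et al. and Hurrion describe.

```python
# Sketch of visual-interactive simulation's defining feature: a run can be
# interrupted and its parameters respecified before it resumes. The "model"
# here is a trivial queue; the interruption is simulated by a scheduled change.
import random

random.seed(0)
params = {"service_rate": 1.0}   # respecifiable parameter
queue_length = 0

for step in range(10):
    # Stand-in for a user interrupting the run at step 5 to change a parameter.
    if step == 5:
        params["service_rate"] = 2.0   # the model is "respecified" mid-run
        print("-- user intervention: service_rate respecified to 2.0 --")

    arrivals = random.randint(0, 2)
    served = min(queue_length + arrivals, round(params["service_rate"]))
    queue_length = queue_length + arrivals - served
    print(f"step {step}: arrivals={arrivals} served={served} queue={queue_length}")
```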
Discussion
Across the three regimes of simulation, real-world data are required to create a simulation such that it can then serve as an alternative to real-world data. The virtual worlds of simulations are not produced from nothing; they are built out of selections from, and observations of, the real-world systems they are meant to model.
Synthetic data producers tend to be committed to big data aspirations of exhaustivity (Kitchin, 2021: 62). The idea with synthetic data is that whatever data does not exist can be created (Jacobsen, 2023). Exemplifying this position, the synthetic data company CVEDIA (2021) announced that it had “officially solved the domain adaptation [reality] gap using its proprietary synthetic data pipeline” called CVEDIA-RT. This required designing over “30,000 3D models for many types of objects, including clothing and exotic animals to buildings and ships” (CVEDIA, n.d.-b). Once complete, CVEDIA-RT would “allow AI technologies to scale without the burdens of data collection and labeling” (CVEDIA, n.d.-b), resulting in a “resilient AI with zero data” (CVEDIA, 2021). We do not know how these models were created, but we recall that for the GSP's creators at United Steel, edge cases were hard to model. CVEDIA ambitiously proposes to create examples of all possible edge cases, despite their rarity. Notably, the company has never followed up on the pronouncement made in 2021, and the synthetic data market has yet to be conquered by its simulation platform. It seems safe to conclude that the reality gap has yet to be bridged.
Furthermore, we contend that as the ambitions of contemporary synthetic data producers increase and simulations scale up, the data required to create simulations are likely to increase in quantity and detail. In other words, the reality gap will only grow. The three regimes of simulation discussed above demonstrate how simulations have increased in their ambitions over time, modeling an ever-wider range of phenomena with increasing fidelity to the real world. Recall that the earliest simulations did not involve computer graphics, or even computers at all. This dynamic continues today. NVIDIA has partnered with the German automobile simulation firm dSPACE to create physically accurate models of vehicles for the Drive Sim platform, simulating “suspension, tires, brakes—all the way to the full vehicle powertrain and its interaction with the electronic control units that power actions such as steering, braking and acceleration” (Burke, 2020). This trajectory points to why the reality gap is not only a technical and epistemological problem, but a political economic one.
It is a political economic issue because it points to a possible trajectory for the AI industry, one differing from the industrialization or “the becoming mundane of AI” (Van der Vlist et al., 2024: 13) in which an increasingly standardized cloud-based machine learning stack is expanding. The constantly increasing demand for computational power, with its requisite energy demand, is now widely recognized as a consequence of this expansion (Lawson, 2024). But the AI industry faces an uncertain horizon in terms of the data it requires to train machine learning models. The rise of the simulation approach to synthetic data shows that alongside the intensification and expansion of existing data surveillance practices, the AI industry is pursuing qualitatively novel technical means for creating data, adding layers to the stack of AI. Advanced simulations are computationally intense and will no doubt contribute to steeply rising energy demands for the AI industry. But they may also end up reconfiguring the infrastructural basis of the industry, leading to an alternate stack. NVIDIA, which became the world's most valuable company in June 2024 at $3.34tn (Robins-Early, 2024), at least, appears to think so. In a keynote talk at the 2024 NVIDIA GPU Technology Conference, CEO Jensen Huang (2024) positioned the Omniverse simulation platform, rather than AI, as the “soul” of the company, tied to the generation of synthetic data, the operation of digital twins, the design of robotics and other automation technologies, as well as scientific research. “Big AI”, as van der Vlist et al. (2024) term it, might look decidedly different with a chip firm leading the way, rather than web services-turned-cloud providers.
As simulations grow in ambition they expand the reality gap, and thus the domain data required to create them also grows. As the example of dSPACE and NVIDIA shows, a team of data scientists cannot simulate vehicles at a fine-grained level without assistance from automotive engineers. This inclusion of domain expertise marks a shift away from the agnostic conception of contemporary AI and data science pervasive throughout industry, in which it figures as a meta-discipline of sorts that can be applied to any domain without knowledge of it (Ribes et al., 2019). Instead, we see a return to something more like the “expert systems” approach dominant in AI during the 1980s (Woolgar, 1985). It remains to be seen how big tech will obtain this expertise and what new social relations and technical configurations such efforts will impart to the AI industry. We suggest that such new relations and configurations might be glimpsed by attending to the following aspects of the synthetic data industry.
Firstly, it remains unclear where synthetic data will truly “take off”. Medicine, finance and autonomous vehicles are all lively areas for synthetic data applications (Devaux, 2022), but as of yet none can be definitively pointed to as the place where synthetic data has matured, despite bold claims made by some industry players (Wayve, 2024). The specificity of a particular domain could shape the future of synthetic data if the technology takes off there. Secondly, synthesizing data requires new forms of labor. For instance, beyond the domain experts discussed above, the simulation approach requires artists who can create virtual environments and objects (Steinhoff, 2023). The influence of such workers over the nature of synthetic data and the models trained on it is currently unexplored. Thirdly, synthetic data proponents argue that it will “democratize” access to data (Ebert, 2023), although this word is frequently invoked in the AI industry with little substance. It is more likely, we contend, that the value of simulations will accrue chiefly to the large firms which own the substantial fixed capital required for their development. However, it remains theoretically possible that data synthesis could contribute to a rupture within the oligopoly of data-intensive capitalism.
Conclusion
We conclude with a final epistemological-political economic reflection. The necessity of simulation being a selection from the manifold of the real entails that models trained on synthetic data produced in a simulation will reflect that selection process. As big tech moves into synthetic data, the question is raised of how the exigencies of hyperscale capital valorization will impinge upon the requisite process of selection and the qualities of the simulations produced. Acemoglu and Johnson (2023) have demonstrated that capitalist industry has generally been content with what they call “so-so automation”: the introduction of machines which lower labor costs to some degree but achieve only marginal gains in productivity and often reduce the quality of work. This, they argue, is a major driver behind increasing inequality and the falling labor share of value in most of the Global North since the 1960s. Will the AI industry accept a “so-so simulation”, and what might its implications be?
