Introduction
Synthetic data is data that is the output of a computational process, such as a simulation, “rather than being generated by actual events” (Dilmegani, 2020). It can be used to train machine learning models. A facial recognition model, for example, can be trained on a synthetic dataset consisting solely of images of procedurally generated 3D models of faces captured in a simulation (Wood et al., 2021). Despite being in its infancy, synthetic data is cast by proponents as the cure for nearly all the problems associated with the machine learning approach to artificial intelligence (AI), from labor costs to privacy and bias concerns (Savage, 2023). The consulting firm Gartner suggests that by 2030, synthetic data will outnumber conventional data in the training of machine learning models (Linden, 2021). Critics reply that synthetic data serves best to augment conventional data rather than to replace it (Feremanga, 2024). In any case, as of 2024, synthetic data is the basis of a growing subsector of the AI industry, with over 120 firms producing synthetic data commodities around the world (Devaux, 2022), numerous synthetic datasets freely available online for applications such as computer vision (Tsirikoglou et al., 2017), and growing interest from big tech in synthetic data as a technique for creating training data for large language models (Wiggers and Coldewey, 2024) and computer vision applications (Krzus, 2022).
However, the nascent synthetic data industry must confront a fundamental technical issue: the “reality gap” (Tremblay et al., 2018), or synthetic-to-real “domain transfer” (Nikolenko, 2021: vi). This is the issue of whether “models trained on synthetic data will work when applied to real-world data” (Nikolenko, 2021: vi). The reality gap must be bridged if synthetic data is to achieve its goals. It is often described as the phenomenon which distinguishes synthetic data from conventional data. One Google employee even maintains a newsletter devoted to synthetic data called “The Reality Gap” (Shtylenko, 2022).
We argue that the reality gap is an epistemological and political economic issue as well as a technical one. We mount this argument via a historical comparison, which shows that the problem of the reality gap has long plagued the technology of simulation. According to Nikolenko (2021), synthetic data can be traced back to early computer vision research in the 1960s (140).
We suggest that synthetic data has a longer prehistory, one rooted in the development of simulation technology stretching back to the mid-twentieth century.
The reality gap is an epistemological issue as well as a technical one. But it is also a political economic issue. Rather than constituting a simple fix for the big problems facing data-intensive capitalism, synthetic data complicates the existing means of producing and utilizing data, adding new layers of technological mediation and labor. We suggest that synthetic data and the simulation approach are not likely to replace the surveillance and capture of user data (Steinhoff, 2022), but will rather become implicated with it in diverse ways. The drawing together of AI, synthetic data and simulation suggests the emergence of an alternative “stack” of technical systems and social processes for the production of AI systems (Bratton, 2016). The emergence of this simulation-synthesis stack demonstrates that the political economy of AI must take account of the proliferation of new technical means for creating data.
The paper begins by reviewing the critical literature on synthetic data and the political economy of AI. In the subsequent section we offer a brief technical introduction to machine learning before discussing simulation and the notion of the reality gap. After examining three regimes of simulation, we discuss how our analysis of this prehistory can be utilized to make sense of the contemporary AI industry.
Political economy of AI and synthetic data
Building on the foundation of critical data studies, which showed that big data is a material, spatial phenomenon with social, political and ecological implications (Boyd and Crawford, 2012; Dalton and Thatcher, 2014; Kitchin, 2021), the political economy of AI asserts the fundamental importance of studying AI as a capitalist industry. Such research asserts that AI is not a disinterested scientific endeavor but rather a project led by a handful of firms with distinct interests in surplus value generation through the sale of AI and data commodities, intercapitalist competition and the increased efficiency of, and control over, labor through discipline and the development of automation technologies (Dyer-Witheford et al., 2019; Pasquinelli, 2023; Sadowski, 2019; Steinhoff, 2021; Van der Vlist et al., 2024).
Political economy of AI research may be divided into three strands focused on labor, infrastructure, and data. Research in the first category has considered both the “hidden” data work, often outsourced to those in the global south (Muldoon et al., 2024; Tubaro et al., 2020) and the prestigious work of data scientists (Steinhoff, 2021). Research in the second category has revealed the materiality of the AI industry, examining cloud platforms (van der Vlist et al., 2024), semiconductor chips (Rella, 2023) and foundation models (Luitse and Denkena, 2021). In the last category, research has focused on the importance of data to the development of AI, from the creation of training datasets (Engdahl, 2024) to the organization of data-based competitions (Hind et al., 2024; Luitse et al., 2024) to the emergence of synthetic data (Steinhoff, 2022). This article contributes to the final category.
As the above authors have shown, AI is a general purpose technology (GPT) and, more precisely, a general purpose technology premised on the continuous capture and processing of data.
We propose that the political economy of AI, after the appearance of synthetic data, must also include the means of data synthesis, such as simulation technology, within its purview. There is limited political economy research on simulation, with most work found in science and technology studies (Edwards, 2013; Galison, 1996; Sundberg, 2009), the philosophy of computation (Korenhof et al., 2021; Lenhard, 2019) and media studies (Bogard, 2006; Chun, 2018; Dippel and Warnke, 2017), oriented primarily towards epistemological, and sometimes ontological, questions. Recent work on the political economy of virtual reality and the metaverse is adjacent to our research here, but is not directly concerned with the questions we pursue (Egliston and Carter, 2022).
Simulation and the reality gap
Synthetic data and the simulation approach to generating it appeared in the context of the rapid growth and commercialization of machine learning since 2010. Machine learning is the use of learning algorithms to generate other algorithms, called models, which represent patterns discerned across large datasets. Models can then be deployed to analyze or make predictions on new data. Models are assessed in terms of their “generalization ability” (Alpaydin, 2016: 40), that is, their ability to function on data not included in their training dataset. A facial recognition model should be able to recognize faces other than those it was trained on; in other words, it should develop a general model of what a face consists of.

The most prevalent approach to machine learning, supervised learning, requires that data be labeled by human annotators, a “time-consuming and expensive task” (Tremblay et al., 2018). The labels function as a “supervisor” and provide the learning system with the basis on which to form its general model: for a facial recognition system, one might label training data binarily as either ‘face’ or ‘not-face’. The problem becomes more complex as the domain expands. Data must also be painstakingly segmented and labeled, so that in every captured frame each tree, person, vehicle and so on is demarcated and annotated (Kniazieva, 2022). Such “AI data work”, as Muldoon et al. (2024) term it, is labor-intensive and routinely outsourced to the global south through digital platforms.

Simulations, as a means of synthesizing data, are one approach to solving the problem of data work. Simulation can be defined in terms of modeling. As an influential early textbook on the topic puts it, simulation is: the process of designing a model of a real system and conducting experiments with this model for the purpose either of understanding the behaviour of the system or of evaluating various strategies … for the operation of the system (Shannon, 1975: 2).
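Before turning to how simulation enters the picture, the mechanics of supervised learning described above can be made concrete in a few lines of code. The following minimal sketch uses the scikit-learn library with invented feature vectors standing in for annotated images; it is an illustration of labels acting as the “supervisor” and of generalization to unseen data, not a description of any system discussed in this paper.

```python
# Minimal sketch of supervised learning: human-provided labels act as the
# "supervisor". Feature vectors and labels are invented stand-ins for images.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Pretend each image has been reduced to a 16-dimensional feature vector and
# annotated by a human labeler as "face" (1) or "not-face" (0).
features = rng.normal(size=(200, 16))
labels = (features[:, 0] + features[:, 1] > 0).astype(int)  # toy labeling rule

model = LogisticRegression().fit(features, labels)

# Generalization ability: performance on data not included in the training set.
new_features = rng.normal(size=(50, 16))
new_labels = (new_features[:, 0] + new_features[:, 1] > 0).astype(int)
print("accuracy on unseen data:", model.score(new_features, new_labels))
```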
In the context of synthetic data, simulations model real phenomena, but are used to generate training data rather than run experiments. A simulation may simply consist of a virtual space for capturing images of 3D models or it may be an interactive video game-like environment. For example, the Singaporean company CVEDIA has trained computer vision models for tracking endangered animals with trail cameras by capturing images of 3D models of them in a simulation (CVEDIA, n.d.-a). On the other hand, NVIDIA has developed Isaac Sim, a virtual environment with simulated physics in which data for the control of robots can be generated by interacting agents (Omotuyi et al., 2024). In either case, synthetic data is synthetic insofar as it is the output of a computational process which is a model, rather than a recording, of a real-world phenomenon. While all data are, in some sense, synthetic insofar as they require labor for their selection, collection, preparation and analysis, synthetic data is distinguished from conventional data insofar as it lacks a direct real-world referent (Offenhuber, 2024). 3
As mentioned above, synthetic data is economically appealing because it promises to minimize the costs of data collection and offers means of de-biasing data and overcoming privacy issues. It also provides a means of circumventing data work. Data synthesized in simulations are automatically labeled down to a precise pixel level, since all of the 3D elements of the simulation are discrete objects (Nikolenko, 2021: v). Within a simulation, the boundaries of those objects are strictly determinable; there is no need to hire anyone to label the data. This would seem to solve perhaps the biggest issue with supervised learning. However, it generates a new problem.
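Why labeling comes “for free” inside a simulation can be illustrated schematically: because every object placed in a virtual scene has a known identity and extent, a pixel-perfect segmentation mask can be written out alongside the rendered image. The toy “renderer” below is our own illustration, with invented objects and colors, and does not correspond to any particular synthetic data pipeline.

```python
# Toy illustration of automatic labeling in a simulation: each object placed in
# the scene has a known ID, so a pixel-level segmentation mask can be produced
# alongside the rendered image with no human annotation.
import numpy as np

H, W = 64, 64
image = np.zeros((H, W, 3), dtype=np.uint8)   # the rendered "photograph"
mask = np.zeros((H, W), dtype=np.uint8)       # the label, generated for free

# Hypothetical scene description: (object_id, top, left, size, colour)
scene = [(1, 10, 10, 20, (200, 50, 50)),   # object class 1, e.g. "person"
         (2, 35, 30, 15, (50, 200, 50))]   # object class 2, e.g. "tree"

for obj_id, top, left, size, colour in scene:
    image[top:top + size, left:left + size] = colour   # draw the object
    mask[top:top + size, left:left + size] = obj_id    # exact pixel-level label

# Every pixel is labeled down to the boundary of each discrete object.
print({int(c): int((mask == c).sum()) for c in np.unique(mask)})
```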
Simulation-borne synthetic data does not overcome machine learning's problem of generalization ability; rather, it expands it. If synthetic data are used to train a model, not only must the model generalize from its training data to new data; it must also generalize from computationally generated data to conventional data, “bridging the reality gap” (Tremblay et al., 2018). For instance, an autonomous vehicle trained on data synthesized in a simulated urban environment must be able to function when deployed on real streets (Kamel et al., 2021).
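The reality gap can be treated as a measurable quantity: the drop in performance when a model trained on synthetic data is evaluated against conventional data. The sketch below, with entirely invented distributions, shows a classifier whose accuracy falls when the “real” data are systematically shifted away from the “synthetic” data it was trained on.

```python
# Sketch of the reality gap as a synthetic-to-real generalization problem: a
# classifier trained on simulated data is evaluated on "real" data drawn from a
# shifted distribution. All distributions here are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, domain_shift):
    """Two classes separated along the first feature; `domain_shift` models a
    systematic difference between the simulated and the real domain."""
    x0 = rng.normal(size=(n, 8))
    x0[:, 0] += -1.0 + domain_shift
    x1 = rng.normal(size=(n, 8))
    x1[:, 0] += +1.0 + domain_shift
    return np.vstack([x0, x1]), np.array([0] * n + [1] * n)

x_syn, y_syn = make_data(500, domain_shift=0.0)    # synthetic training data
x_real, y_real = make_data(500, domain_shift=1.5)  # "real" data, shifted domain

model = LogisticRegression().fit(x_syn, y_syn)
print("accuracy on held-out synthetic data:", model.score(*make_data(500, 0.0)))
print("accuracy on real data (the gap):    ", model.score(x_real, y_real))
```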
The reality gap presents a new economic concern for AI production because highly detailed and realistic simulations are expensive and difficult to make. The requisite labor of building an adequately detailed simulation must be balanced against the potential savings to be achieved through the use of synthetic data. As Tremblay et al. (2018) note, “the primary selling point of synthetic data … that arbitrarily large amounts of labeled data are available essentially for free” can be undercut by the substantial “expense required to generate photorealistic quality” (1082), because sophisticated simulations require “artistic design of the environment or prior real data” (Prakash et al., 2019: 7249). To simulate an urban environment, for instance, an artist must either design a city, or a detailed model must somehow be extracted from existing data. Regardless, even the best simulations are far from a 1-to-1 replication of the real world. As we demonstrate below, simulations always require the selection of a subset of phenomena for modeling. Thus, data synthesis via simulation remains a non-trivial task, as a body of technical literature demonstrates (Alkhalifah et al., 2022; Tobin et al., 2017). To concretize the problem of the reality gap, let us consider two techniques for bridging it.
The first is the refinement of synthetic images toward photorealism, so that they more closely resemble conventional data; the second, often termed domain randomization, aims to increase a model's generalization ability through the random introduction of rendered objects into scenes (Tremblay et al., 2018). The first approach obviously relies on conventional data. However, so too does the second: the inserted objects are situated on top of real-world images from the popular Flickr 8k dataset. 4 These techniques for bridging the reality gap thus suggest that while the simulation approach to synthetic data attempts to create data without recording the real world, conventional data remains essential. This dynamic, we suggest, has a historical precedent throughout the development of simulation technology.
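To make the logic of the second technique concrete, the sketch below composites a rendered object onto a background photograph at a random position, with the bounding-box label produced automatically by the compositing step itself. The arrays here are random placeholders rather than actual Flickr images or rendered 3D models, so this is an illustration of the general idea rather than of any particular published pipeline.

```python
# Schematic of compositing-based data synthesis: a rendered object crop is
# pasted onto a real background photograph at a random position, and the
# bounding-box label falls out of the compositing step. Arrays are placeholders.
import numpy as np

rng = np.random.default_rng(0)

def composite(background, obj):
    """Paste `obj` onto `background` at a random location; return image + box."""
    h, w, _ = background.shape
    oh, ow, _ = obj.shape
    top = rng.integers(0, h - oh)
    left = rng.integers(0, w - ow)
    out = background.copy()
    out[top:top + oh, left:left + ow] = obj
    bbox = (left, top, left + ow, top + oh)  # label generated automatically
    return out, bbox

background = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)  # stand-in for a real photo
rendered_car = rng.integers(0, 256, size=(48, 64, 3), dtype=np.uint8)  # stand-in for a rendered 3D model

training_image, label = composite(background, rendered_car)
print("bounding box for class 'car':", label)
```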
Method
We follow the approach of Hu's (2015) prehistory of the cloud, an examination of the material conditions and social relations which formed the basis for the emergence of cloud computing. Here we aim, less ambitiously perhaps, to examine some historical periods which anticipate the central problem of synthetic data. To do so we employ the method of “technography” (Bucher, 2018; Hind et al., 2022), described as “a detailed examination of the material aspects of technology by directly reading various publicly available documents generated by and related to technical systems” (van der Vlist et al., 2024: 4). Offering a more historical approach, this paper is nonetheless interested in understanding the “specific functionalities or aspects” of simulation and synthetic data, while “maintaining a critical stance toward promotional language [and] industry jargon found in…materials” relating to them (van der Vlist et al., 2024: 4).
Archival research was conducted at two locations: the Ferranti Collection (1948–1979) at the University of Manchester, UK, and the Brian W. Hollocks Collection (1961–1982) held at NC State University Libraries, USA. The Ferranti Collection includes documents related to the Ferranti computer company, responsible for manufacturing some of the earliest digital computers on which various firms relied for their simulation work. The Brian W. Hollocks Collection, accessed remotely, contains various documents related to early simulation work in the UK. Data were also collected via interviews conducted by one of the authors, as part of an ongoing study, with data scientists, 3D artists, CEOs and CTOs at synthetic data companies located in Europe and North America.
This research required analysis of various texts, ranging from technical reports and academic papers across a wide range of disciplines (e.g. statistics, operations research, simulation, and computer science), sales records and operators' handbooks, to more personal, reflective texts on the early days of simulation. Our aim was to evidence how various attempts have been made to tackle the reality gap from the 1950s to the present day. To demonstrate this, we now outline three historical regimes of simulation (Table 1).
Table 1. A summary of simulation regimes.
Regime 1: statistical
During the Second World War, scientists and engineers led by Stanislaw Ulam and John von Neumann were involved in the construction of nuclear weapons as part of the infamous Manhattan Project. As part of this work, they devised a procedure for using random numbers to substitute for detailed knowledge of stochastic processes which exceeded the analytic capabilities of existing technology. Although the team was based at Los Alamos Laboratory near Santa Fe, New Mexico, USA, the procedure came to be known as Monte Carlo simulation, in reference to the famous casino in Monaco.
Monte Carlo involves replacing the analytic solution of a problem with a model defined by a number of key parameters, into which random numbers are inserted over many iterations. The notion is that, from the set of all iterations, inductive knowledge, or at least informed estimates, can be obtained about the phenomenon under investigation (Halton, 1970). Galison (1996) suggests that for its advocates, Monte Carlo simulation went quickly from being “a numerical calculation scheme” to an “alternate reality—in some cases a preferred one—on which ‘experimentation’ could be conducted” (119). 5 While the Monte Carlo method was quickly adapted for use on the early computers of the 1950s, it does not necessarily require a computer and was commonly “performed on electromechanical calculators by a host of operators … in the 1950s” (Nance and Sargent, 2002: 163). It should thus be noted, as Nance and Sargent (2002) put it, that the “use of simulation precedes computers, either analog or digital” (162). The essential characteristic of simulation is not its means of implementation or hardware, but its creation of an “artificial world” (Galison, 1996: 120).
Of course, the nature of such an artificial world varies with the medium used to create it. But in any case, a simulation purports to tell the experimenter something about the real world, and thus Monte Carlo “represents an attempt to model nature through direct simulation of the essential dynamics of the system in question” (Bielajew, 2021: 1). This involves the setting of parameters. As one of the earliest formulations of Monte Carlo puts it: one must “assume that the probability of each possible event is given” so that one can then “play a great number of games of chance, with chances corresponding to the assumed probability distributions” (Metropolis and Ulam, 1949: 337). But how is one to know the appropriate probability distributions? The authors go on to elaborate that a simulation, over all its iterations, can be conceived of as a “space” in which a given “process takes place [as] the collection of all possible chains of events” (Metropolis and Ulam, 1949: 337). They assert that while the “general properties of such a phase space have been considered … much work remains to be done on the specific properties of such spaces, each corresponding to a given physical problem” (Metropolis and Ulam, 1949: 337). In other words, the specific parameters must be set for each specific problem, to reflect the real-world properties of the process in question, if the simulation is to generate useful data. A recent textbook on Monte Carlo reiterates the point: one must start with models of the processes involved (e.g., the laws of gravity) and often one must make assumptions (e.g., there is a Gaussian distribution of speeds about the posted speed limit by cars on a certain highway). The [Monte Carlo] estimates developed are only as good as the underlying models and assumptions (Dunn and Shultis, 2022: 13).
In other words, the artificial world of the simulation must correspond in certain important respects to the real world; the essential dynamics of the real-world system at hand must be reducible to dynamics amenable to representation in the parameters and probability distributions of the simulation.
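The dependence of Monte Carlo on assumed distributions can be seen in even a minimal example. Following the textbook illustration quoted above, the sketch below estimates the share of cars exceeding a threshold speed by sampling from an assumed Gaussian distribution of speeds around a posted limit; the parameters are assumptions we have invented for illustration, and the estimate is only as good as they are.

```python
# Minimal Monte Carlo sketch: estimate the share of cars exceeding 120 km/h,
# assuming (as the analyst must) that speeds are Gaussian around the posted
# limit. The mean and standard deviation are assumptions, not measurements,
# and the estimate inherits whatever error they contain.
import numpy as np

rng = np.random.default_rng(0)

posted_limit = 110.0      # km/h (assumed parameter)
speed_std = 12.0          # km/h (assumed parameter)
n_iterations = 100_000    # number of "games of chance" played

sampled_speeds = rng.normal(loc=posted_limit, scale=speed_std, size=n_iterations)
estimate = np.mean(sampled_speeds > 120.0)

print(f"estimated share of cars over 120 km/h: {estimate:.3f}")
```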
Regime 2: discrete-event
A second regime of simulation was made possible with the development of digital computers in the immediate post-Second World War period. At this time, Monte Carlo was performed on the new computers, but “progressively gave way to more involved bespoke models of real systems” (Hollocks, 2008: 131). These were called “discrete-event” simulations, able to model closed environments. In 1957, the influential cybernetician Stafford Beer established a Department of Operational Research and Cybernetics (known as “Cybor House”) at United Steel, a steelmaker in South Yorkshire, UK, with the task of applying new computational approaches to steelmaking (Hind, 2024). Various projects were devised and carried out at Cybor House, the most important of which was the General Steelplant Program (GSP), a program designed to simulate the production process from start to finish (Hollocks, 2008). The GSP was intended to streamline the Monte Carlo method, reducing both time and labor costs. It represented an attempt to overcome the context-specificity of each simulation problem, as discussed in the previous section, by integrating the shared problems of industrial production into a reusable framework implemented in a semi-automated software suite. The first edition of the GSP Handbook opens by asserting that the Monte Carlo method: consists of describing the plant in terms of a mathematical model, in which the variates representing the uncontrolled factors in the plant behaviour are initiated by random variables. Such a model gives rise to a problem of finding the distribution of some function of the plant condition which represents, in some sense, its behaviour. (Tocher et al., 1961: 1)
The authors highlight a central problem with Monte Carlo: the execution of sampling. More precisely, the analysis of “the structure of the plant to be simulated” followed by its description “in mathematical terms”, the manipulation of the “equations to suitable forms” and the use of these “to generate the sampling values” (Tocher et al., 1961: 2). The solution to this was to “discretize” all component parts and processes of the plant.
In doing this the structure of the plant first needed to be conceived as a series of “machines” through which production moved. KD Tocher (1960), who succeeded Beer as director of Cybor House, held that “the first step in writing a simulation is to prepare a flow diagram” (62). Accidentally echoing Deleuze and Guattari's machinic vocabulary, the flow diagram rendered the plant as a series of machines coupled to other machines, through which materials and work passed.
The next stage was to discretize independent steel plant “activities”, understood as “groups of machines in certain specified states” (Tocher et al., 1961: 7). Ensuring that activities, such as the stripping of ingot molds (Figure 1), were appropriately discretized was critical, especially in being able to capture the necessary “states” the machines might assume at different stages in the steelmaking process (Tocher et al., 1961: 7). Here, the GSP needed to be able to understand how a steel plant moved between such states over the course of the production process.

Figure 1. Simplified flow diagram of activities: Acid Bessemer Steel-Making Plant. Notably, the flow is composed of a series of discrete events.
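The discrete-event logic captured in such flow diagrams can be sketched in schematic form: a clock jumps from one event to the next, and the state of each “machine” changes only at those events, with nothing represented in between. The toy two-stage plant below, with invented timing distributions and no machine capacity constraints, is our own illustration of that logic, not a reconstruction of the GSP.

```python
# Schematic discrete-event simulation: a toy two-stage plant in which ingots
# pass through a furnace and then a rolling mill. The clock jumps from one
# event to the next; machine states change only at events. Timing distributions
# are invented, and machine capacity constraints are ignored for brevity.
import heapq
import random

random.seed(0)
events = []  # priority queue of (time, event_name, ingot_id)

def schedule(time, name, ingot):
    heapq.heappush(events, (time, name, ingot))

# Arrivals of five ingots at the furnace, at random intervals (assumed).
t = 0.0
for ingot in range(5):
    t += random.expovariate(1.0)          # assumed arrival process
    schedule(t, "furnace_start", ingot)

while events:
    time, name, ingot = heapq.heappop(events)
    if name == "furnace_start":
        schedule(time + random.uniform(2, 4), "mill_start", ingot)   # assumed furnace time
    elif name == "mill_start":
        schedule(time + random.uniform(1, 2), "finished", ingot)     # assumed rolling time
    else:
        print(f"t={time:5.2f}  ingot {ingot} finished")
```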
From general steel to general simulation
Specific requirements had to be met to make computer-based simulation economically viable. Firstly, the GSP needed to run as quickly as a bespoke, “tailor-made” simulation program (Tocher et al., 1961: 4). Such generalizability was first pursued via the assembly of a library of subroutines. However, this was rated as “not … particularly successful” by the GSP creators, because it did not solve the laborious task of “analysis of the works” or, in other words, of modeling each individual factory (Tocher et al., 1961: 3). It was left, therefore, to the GSP to find a “generalised expression for the structure of any [steel] plant” (Tocher et al., 1961: 4). As work on the GSP continued, it became clear that it was not simply a General Steelplant Program but a general simulation program, applicable beyond the steel industry.
Yet, for all its advances over the Monte Carlo approach, the GSP faced a similar reality gap problem. As Tocher and Owen (2008 [1960]) bemoaned, the requisite analysis of plant structure remained an obstacle since the “practical success of a simulation depends on the accuracy and validity of the sampling distributions used to describe the processes” (150). Although the discrete-event approach reduced simulation to a chain of machines in discrete states, the functioning of those machines could not be assumed a priori: “Data of sufficient accuracy to produce the correct sampling distributions are not usually available from works records, and a large volume of special observations are then needed, with the consequent labour of analysis” (Tocher and Owen, 2008 [1960]: 150). The possibilities of simulation via the GSP were further limited by the scarcity of data about system failures.
The GSP needed to be able to simulate how breakdowns occur. However, only endemic breakdowns could be simulated since they would be discovered in the analysis of the plant structure. Excluding breakdowns which occur infrequently (whether they had great consequences or not) was “of great practical importance, since the data for finding the statistical laws covering infrequent occurrences is very hard to get” (Tocher et al., 1961: 9). The ambitions of the GSP were thus constrained by the unavailability of data on what would today be called “edge cases” or data points which fall outside a normal distribution (CVEDIA, n.d.-b).
In its early days the GSP did not involve computer displays. Engineers had no choice but to present information to controllers “obtained from teleprinter output and posted up on a display board” (Mellor and Tocher, 1963: 133). Decisions taken by the controllers would then be fed back into a Ferranti Pegasus computer for further rounds of simulation. As the GSP evolved, it was linked up to analog physical displays which emulated the apparatuses being simulated. These were not computer graphics displays, but mechanical interfaces which might, for instance, reproduce a particular control panel (Figure 2). Such physical simulations aimed to place a user within the simulated system in a rudimentary way by visually representing some elements of it. This was a bridge to the next regime of simulation, one that would re-present the problem of the reality gap in a whole new form.

Figure 2. A “typewriter-type device being used as an interactive terminal” for a simulation exercise in the late 1960s.
Regime 3: visual-interactive
A new regime of simulation was initiated when Robert D. Hurrion (1976) produced a PhD dissertation at the University of Warwick proposing Visual Interactive Simulation (VIS), employing color graphics and limited interactivity. Hurrion was at the time working on the simulation of manufacturing processes. He found that a generic model did not work because there tended to be a human scheduler who had control over the process and intervened as needed on the basis of “rules … [which] were frequently difficult to encapsulate in the simulation” (Bell and O’Keefe, 1987: 110). He thus decided to create a simulation program which could visually represent the system at hand, so that the scheduler could see it and intervene as needed. Hurrion would go on, with funding from manufacturers including Rolls Royce and Imperial Chemical Industries, to develop a software package called VISION, later marketed as SEE WHY, which was subsequently used to produce a graphical version of GSP known as FORSSIGHT in 1982 (Bell and O’Keefe, 1987; Hollocks, 1983). We turn now to this simulation application, and the VIS approach more broadly: “the video game approach to simulation” (Bell and O’Keefe, 1987: 115).

With the visual-interactive regime, simulation became a more tangible alternate reality. But, in this regime, the necessity of data takes on a new form concurrent with the new visual mode of simulation. As early researchers put it, VIS demands “added data requirements” (Kirkpatrick and Bell, 1989: 142). Not only do parameters for key statistical properties of the domain need to be set, but the appearance of certain entities in that domain must be represented via computer graphics. This stands in contrast to the Monte Carlo method, in which the “geometry of the environment … plays little role except to define the local environment of objects interacting at a given place at a given time” (Bielajew, 2021: 1). Here the appearance of, and interfaces to, relevant objects become central concerns. First we consider the property of visuality, then we turn to interactivity.
Visuality
The shrinking size of components allowed vast increases in computational power in the 1980s such that it became feasible to render graphics as well as text. The so-called “microcomputers” drove a wave of enthusiasm for increased production speed at United Steel: “The graphics facilities of FORSSIGHT enable the visual displays, previously requiring special physical or electronic construction and control, to be generated quickly and easily” (Hollocks, 1983: 338). One computer could be programmed to visualize any simulation rather than having to build a bespoke physical display. Up until this point, as mentioned previously, simulations were dependent on printed paper outputs: analysis could only take place after a simulation was run. VIS thus put a “window … in the side of the simulation black box” (Hollocks, 1983: 338).
A new problem arose, however, concerning an economy of graphical fidelity to the real world. It became important for simulations to have “adequate realism in the display” (Hollocks, 1983: 337). What was deemed an “adequate” level of realism was, of course, relative to the usability of hard copy outputs and mechanical interfaces. However, despite the notable benefits of graphical simulation, practitioners noted that “visual displays were time consuming, and in some cases expensive to produce, and interactive studies took considerable time” (Hollocks, 1983: 336). For simulation designers, graphical outputs needed to be convincing. They would utilize new capacities for the dynamic display of “shapes and colors, together with alpha-numeric information” which made possible the “construction of clear, easily understood mimic diagrams” (Hollocks, 1983: 337). A mimic diagram was “a fixed background and a set of dynamic icons” (Kiernan, 1991: 11), or an animated graphical representation of a certain domain. At the present moment, when computer graphics are near-indistinguishable from reality, it can be difficult to appreciate the novelty of such visuals at the time.
FORSSIGHT produced simulations which were graphically inferior to contemporary video game systems, such as the Famicom, released in Japan in 1983 (Figure 3). The limits of these visuals motivated research on the introduction of realism into mimic diagrams; purely iconic simulations such as those produced with FORSSIGHT quickly became inadequate. Such work details the construction of dynamic icons, but insists that for “visual realism” to be introduced, icons must be combined with backgrounds consisting of “scanned images from plant photographs or schematic diagrams” (Kiernan, 1991: 10). While gauges might be graphically simulated, this was deemed insufficiently realistic unless they were presented on top of a photograph of the actual pumphouse, with the simulated gauges located in relation to their real-world counterparts. The visuals generated via programming could not adequately capture the real world within the simulation, so digitized analog photographs typically had to stand in for it.

Use of Visual Interactive Simulation (VIS) to model flight departures at an airport terminal.
Mimic diagrams encapsulate a central problem posed by VIS: picture definition. This consists of two elements. Problem formulation: “the picture must display ‘solutions’ to the problem that are useful to the user”; and conceptual validation: “the picture must correctly represent the system to the user” (Kirkpatrick and Bell, 1989: 141). In both cases, data which are not required for purely statistical simulations become necessary. Data are also required about the transitions between discrete states. Kirkpatrick and Bell (1989) describe how a VIS simulation of a train depot required animation of trains moving from one track to another, rather than “simply appearing” in a new location (142). While VIS may employ a discrete-event logic, it must represent the analog transitions which are abstracted away by discrete-event simulation. As one comparative study of a statistical batch-processing simulation versus a visual simulation concludes: “The VIS model … required more detailed data to appear to be accurate, which increased the demands on data acquisition … it was necessary to simulate a finer level of detail in many areas” (Kirkpatrick and Bell, 1989: 148). Visuals not only convey information about a simulation's operations and results, but also about how it can be interacted with.
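The “finer level of detail” demanded by visual simulation can be illustrated with the train example: a discrete-event record only states where the train is at one event and at the next, while the display must compute positions for every intermediate frame. The coordinates and times below are invented for illustration.

```python
# Illustration of the extra data a visual simulation needs: a discrete-event
# model records only the before/after states, while the display must show the
# train moving between them. Positions and times here are invented.
import numpy as np

# Discrete-event record: (time, (x, y) position of the train)
event_before = (10.0, np.array([0.0, 5.0]))   # on track A
event_after  = (14.0, np.array([30.0, 8.0]))  # on track B

def position_at(t):
    """Linearly interpolate the train's on-screen position between events."""
    (t0, p0), (t1, p1) = event_before, event_after
    alpha = np.clip((t - t0) / (t1 - t0), 0.0, 1.0)
    return (1 - alpha) * p0 + alpha * p1

# Frames the display must draw, none of which exist in the event record.
for frame_time in np.linspace(10.0, 14.0, 5):
    print(f"t={frame_time:4.1f}  position={position_at(frame_time)}")
```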
Interactivity
The analog interfaces described in the discrete-event regime section were the first form of interactivity in simulation. These displays enabled operators to see something of what a simulation was doing as it ran. Hurrion's VIS went further. As Hollocks (2006) describes, it: integrated the graphics display with the simulation model such that the graphics display was updated by the model as it ran, that is, the graphics are “concurrent” with the model … The principal USA development path in such graphics at the time … adopted a “replay” approach. In this, a simulation run produces an output data file detailing the sequence of events in the run; this is then the source for a second program that generates graphics from the file … There is no interaction with the model (Hollocks, 2006: 1390).
Bell and O’Keefe (1987) understood this as a difference between “animation” on the one hand, and full “visual interaction” on the other. Whilst an animation might allow the user to interact with a graphical representation of the simulation during a replay (e.g. zooming), it did not allow intervention in the simulation itself or its parameters.
Interactivity does not apply only within the set parameters of the model, but to those parameters as well. According to Sohnle et al. (1973) interactivity refers to the user being able to “not only observe, verify, and record data, but also to interrupt the simulation so that parameter values may be changed, or the model structure modified, or progress be retraced or redirected” (146). In Hurrion's (1986) words, interactivity involves “respecifying the model” or modifying its parameters (285). This is an operation, as explained above, which requires data about the domain to be simulated. Respecifying is only a simple task for the simplest domains which can be precisely specified, increasing in difficulty as the phenomena to be modeled become more complex.
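Interactivity in this stronger sense, interrupting a run in order to respecify the model, can be sketched as a simulation loop whose parameters are changed partway through. The toy queue below and its parameters are hypothetical, standing in schematically for the kind of user intervention Sohnle et al. and Hurrion describe.

```python
# Sketch of visual-interactive simulation's defining feature: a run can be
# interrupted and its parameters respecified before it resumes. The "model"
# here is a trivial queue; the interruption is simulated by a scheduled change.
import random

random.seed(0)
params = {"service_rate": 1.0}   # respecifiable parameter
queue_length = 0

for step in range(10):
    # Stand-in for a user interrupting the run at step 5 to change a parameter.
    if step == 5:
        params["service_rate"] = 2.0   # the model is "respecified" mid-run
        print("-- user intervention: service_rate respecified to 2.0 --")

    arrivals = random.randint(0, 2)
    served = min(queue_length + arrivals, round(params["service_rate"]))
    queue_length = queue_length + arrivals - served
    print(f"step {step}: arrivals={arrivals} served={served} queue={queue_length}")
```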
Discussion
Across the three regimes of simulation, real-world data are required to create a simulation such that it can then serve as an alternative to real-world data. The virtual worlds of simulations are not produced from nothing; they are built out of selections from, and observations of, the real-world systems they are meant to model.
Synthetic data producers tend to be committed to big data aspirations of exhaustivity (Kitchin, 2021: 62). The idea with synthetic data is that whatever data does not exist can be created (Jacobsen, 2023). Exemplifying this position, the synthetic data company CVEDIA (2021) announced that it had “officially solved the domain adaptation [reality] gap using its proprietary synthetic data pipeline” called CVEDIA-RT. This required designing over “30,000 3D models for many types of objects, including clothing and exotic animals to buildings and ships” (CVEDIA, n.d.-b). Once complete, CVEDIA-RT would “allow AI technologies to scale without the burdens of data collection and labeling” (CVEDIA, n.d.-b), resulting in a “resilient AI with zero data” (CVEDIA, 2021). We do not know how these models were created, but we recall that for the GSP's creators at United Steel, edge cases were hard to model. CVEDIA ambitiously proposes to create examples of all possible edge cases, despite their rarity. Notably, the company has never followed up on the pronouncement made in 2021, and the synthetic data market has yet to be conquered by its simulation platform. It seems safe to conclude that the reality gap has yet to be bridged.
Furthermore, we contend that as the ambitions of contemporary synthetic data producers increase and simulations scale up, the data required to create simulations are likely to increase in quantity and detail. In other words, the reality gap will only grow. The three regimes of simulation discussed above demonstrate how simulations have increased in their ambitions over time, modeling an ever-wider range of phenomena with increasing fidelity to the real world. Recall that the earliest simulations did not involve computer graphics, or even computers at all. This dynamic continues today. NVIDIA has partnered with the German automobile simulation firm dSPACE to create physically accurate models of vehicles for the Drive Sim platform, simulating “suspension, tires, brakes—all the way to the full vehicle powertrain and its interaction with the electronic control units that power actions such as steering, braking and acceleration” (Burke, 2020). This trajectory points to why the reality gap is not only a technical and epistemological problem, but a political economic one.
It is a political economic issue because it points to a possible trajectory for the AI industry, one differing from the industrialization or “the becoming mundane of AI” (Van der Vlist et al., 2024: 13) in which an increasingly standardized cloud-based machine learning stack is expanding. The constantly increasing demand for computational power, with its requisite energy demand, is now widely recognized as a consequence of this expansion (Lawson, 2024). But the AI industry faces an uncertain horizon in terms of the data it requires to train machine learning models. The rise of the simulation approach to synthetic data shows that alongside the intensification and expansion of existing data surveillance practices, the AI industry is pursuing qualitatively novel technical means for creating data, adding layers to the stack of AI. Advanced simulations are computationally intense and will no doubt contribute to steeply rising energy demands for the AI industry. But they may also end up reconfiguring the infrastructural basis of the industry, leading to an alternate stack. NVIDIA, which became the world's most valuable company in June 2024 at $3.34tn (Robins-Early, 2024), at least, appears to think so. In a keynote talk at the 2024 NVIDIA GPU Technology Conference, CEO Jensen Huang (2024) positioned the Omniverse simulation platform, rather than AI, as the “soul” of the company, tied to the generation of synthetic data, the operation of digital twins, the design of robotics and other automation technologies, as well as scientific research. “Big AI”, as van der Vlist et al. (2024) term it, might look decidedly different with a chip firm leading the way, rather than web services-turned-cloud providers.
As simulations grow in ambition they expand the reality gap, and thus the domain data required to create them also grows. As the example of dSPACE and NVIDIA shows, a team of data scientists cannot simulate vehicles at a fine-grained level without assistance from automotive engineers. This inclusion of domain expertise marks a shift away from the agnostic conception of contemporary AI and data science pervasive throughout industry, in which it figures as a meta-discipline of sorts that can be applied to any domain without knowledge of it (Ribes et al., 2019). Instead, we see a return to something more like the “expert systems” approach dominant in AI during the 1980s (Woolgar, 1985). It remains to be seen how big tech will obtain this expertise and what new social relations and technical configurations such efforts will impart to the AI industry. We suggest that such new relations and configurations might be glimpsed by attending to the following aspects of the synthetic data industry.
Firstly, it remains unclear where synthetic data will truly “take off”. Medicine, finance and autonomous vehicles are all lively areas for synthetic data applications (Devaux, 2022), but as of yet none can be definitively pointed to as the place where synthetic data has matured, despite bold claims made by some industry players (Wayve, 2024). The specificity of a particular domain could shape the future of synthetic data if the technology takes off there. Secondly, synthesizing data requires new forms of labor. For instance, beyond the domain experts discussed above, the simulation approach requires artists who can create virtual environments and objects (Steinhoff, 2023). The influence of such workers over the nature of synthetic data and the models trained on it is currently unexplored. Thirdly, synthetic data proponents argue that it will “democratize” access to data (Ebert, 2023), although this word is frequently invoked in the AI industry with little substance. It is more likely, we contend, that the value of simulations will accrue chiefly to the large firms which own the substantial fixed capital required for their development. However, it remains theoretically possible that data synthesis could contribute to a rupture within the oligopoly of data-intensive capitalism.
Conclusion
We conclude with a final epistemological-political economic reflection. The necessity of simulation being a selection from the manifold of the real entails that models trained on synthetic data produced in a simulation will reflect that selection process. As big tech moves into synthetic data, the question is raised of how the exigencies of hyperscale capital valorization will impinge upon the requisite process of selection and the qualities of the simulations produced. Acemoglu and Johnson (2023) have demonstrated that capitalist industry has generally been content with what they call “so-so automation”: the introduction of machines which lower labor costs to some degree but achieve only marginal gains in productivity and often reduce the quality of work. This, they argue, is a major driver behind increasing inequality and the falling labor share of value in most of the Global North since the 1960s. Will the AI industry accept a “so-so simulation”, and what might its implications be?
