Abstract
Keywords
The weirdness of synthetic data
The artificial intelligence (AI) startup
Using artificial data for applications that normally depend on real-world observations may seem troubling. Remarkably, even government agencies like the National Institutes of Health and the US Census Bureau, which traditionally conducts labor-intensive door-to-door data collection for the decennial census, have embraced synthetic data to enhance dataset accuracy and address privacy concerns (Syntegra, 2021; US Census Bureau, 2021). Despite its paradoxical premise, synthetic data is rapidly gaining traction in many fields. The Wall Street Journal estimates (with little elaboration) that by the end of the year 2024, 60% of all data used in training and data analytics will be generated by AI (Castellanos, 2021).
This paper provides a typology of the different forms and methods of synthetic data and examines its epistemic claims. It argues that the phenomenon demands a departure from the traditional representational concept of data toward a relational model, where data are defined not only through their associated phenomenon but also through their purpose, performances, and contexts of use. Synthetic data thus prompts a reevaluation of the notion of data realism—from accurately representing things in the world to allowing a model trained with it to recognize real things. Finally, the paper identifies Paul Edwards’ concept of data friction as a productive framework for evaluating synthetic data.
The representational model of data
Data can mean different things in different contexts—from verbal presentations to written records; from physical traces and archeological artifacts to abstract symbols. Data have been defined as singular differences in the world (Floridi, 2011), as inscriptions that are both mobile and immutable.
Considering a datum as a representation means viewing it as a reference to a physical or conceptual object in the world. As a record created through observation or measurement, the datum represents its object, points to it, and substitutes it in subsequent operations. In other words, the datum serves as a sign.
While this representational framing may seem uncontroversial, it has been subject to extensive critique:
- The representational relationship is necessarily, and often problematically, reductive, omitting its process of generation, its underlying assumptions, and arbitrary decisions (Coopmans et al., 2014; Drucker, 2011).
- Representation implies a duality between an external world, assumed to be stable, and data, assumed to be abstract. The world is assumed to simply exist rather than constantly being made and remade. Data are also never abstract, since every datum has a material embodiment that can interfere with its intended meaning.
- Once data are generated, the dataset establishes its own reality. How exactly data relate to things in the world fades into the background. Outside of scientific practice, data analysts often have little interest in the minutiae of data collection.
- As critical data literature has extensively discussed, data are not merely objective descriptions but interpretations of the world, shaped by worldviews and problem articulations (D’Ignazio and Klein, 2020; Kitchin, 2014; Loukissas, 2019).
- The representational model also has practical flaws. Data found in computer memory or hard drives are often merely by-products of algorithms rather than pointers to things in the world, such as the temporary information stored in caches and buffers. While they could be interpreted as second-order representations, their representational qualities are not necessary or useful for understanding the computational process. In the digital context, data is a flexible concept that extends far beyond representation and includes code, metadata, thumbnails, and temporary information (Dourish, 2017).
While these critiques are not new, the fundamental weirdness of synthetic data, discussed in the following sections, undermines the representational model most profoundly. Synthetic data may mimic empirical observations but no longer correspond to features of the world. They are defined based on their application and, therefore, no longer independent of their use. Finally, they are often generated through methods that elude exhaustive explanation and scrutiny. These violations of representational assumptions will be elaborated in the following sections.
Forms of synthetic data
Definitions of synthetic data vary, reflecting methodological differences and purposes. In terms of methodology, synthetic data has been divided into three groups: Data derived from real-world observations; data not connected to such observations; and a hybrid combination of the two (Emam et al., 2020a). In terms of purposes, the three main uses of synthetic data are to protect privacy, to generate training data for machine learning applications, and to improve existing datasets by correcting biases and augmenting them with other data. Existing definitions diverge as they tend to emphasize only one of these areas: “Synthetic data is annotated information that computer simulations or algorithms generate as an alternative to real-world data;” (Andrews, 2021) “information that's been generated on a computer to augment or replace real data to improve AI models, protect sensitive data, and mitigate bias;” (IBM, 2021) “data that has been generated from real data and that has the same statistical properties as the real data;” (Emam et al., 2020b) and “data points that a machine learning algorithm has generated; data which in turn are used to train other machine-learning algorithms” (Jacobsen, 2023).
The following typology is not meant to be canonical—some authors consider augmentation, imputation, and harmonization as data modifications rather than true synthetic data (Nikolenko, 2019: 12). However, all these approaches exist on a continuum between real and synthetic.
Unlike the forms of synthetic data discussed in the following sections, algorithmic data are typically not considered substitutes or imitations of real-world observations. As epistemic artifacts in their own right, they represent models and theories whose properties and limitations are well understood. Since they are based on explicit assumptions, they allow theoretical experiments that would not be possible with real-world data alone. For example, examining trends and seasonality in medical data depends on a complete time series, but gaps are common in patient data. Determining whether a patient improves over time or whether a long-term treatment is effective requires closing these gaps without distorting the outcome. Depending on the length and nature of the gaps, different imputation techniques can be considered, from simply carrying the last observation forward to interpolation or predictive modeling. As an elementary form of data synthesis, imputation generates the illusion of a complete dataset. It does not add information and invariably introduces new biases, such as smaller apparent margins of error and inflated statistical significance. While not recommended for explanatory data analysis, imputation may nevertheless improve prediction models and is therefore often used for training data. However, performing imputation correctly requires a thorough understanding of its limitations and the assumptions implied by the particular purpose. Commonly used for the analysis of surveys with incomplete responses, imputation is typically applied at the analysis stage; if redistributed, the imputed parts of datasets need to be clearly labeled. Other forms of administrative data, however, require more extensive data harmonization. Different administrative districts and counties often use different data formats and collection methodologies, which complicates data aggregation.
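The two elementary strategies mentioned above, carrying the last observation forward and interpolation, can be sketched in a few lines. This is a toy illustration with invented measurements and function names, not any particular clinical pipeline:

```python
# Toy illustration of two elementary imputation strategies for a patient
# time series with gaps (None). Values and function names are invented.

def locf(series):
    """Last observation carried forward: repeat the most recent value."""
    out, last = [], None
    for value in series:
        last = value if value is not None else last
        out.append(last)
    return out

def linear_interpolate(series):
    """Fill each gap by interpolating between its known neighbors."""
    out = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for a, b in zip(known, known[1:]):
        step = (series[b] - series[a]) / (b - a)
        for i in range(a + 1, b):
            out[i] = series[a] + step * (i - a)
    return out

measurements = [7.0, None, None, 4.0, None, 6.0]
print(locf(measurements))                # [7.0, 7.0, 7.0, 4.0, 4.0, 6.0]
print(linear_interpolate(measurements))  # [7.0, 6.0, 5.0, 4.0, 5.0, 6.0]
```

Both fills produce a visually complete series, but neither adds information; statistics computed on the filled values will understate uncertainty, as noted above.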
Furthermore, data from different countries are collected with different methods, whose details are often unknown and therefore difficult to reconcile. These numerous incompatibilities make the compilation and harmonization of global datasets a challenging endeavor that involves many arbitrary judgments. The hybridization of public and non-public data sources is particularly relevant in the social sciences. The annual American Community Survey (ACS) is a widely used socioeconomic dataset that includes information about income, education, employment, housing, and health of residents. However, the accuracy of ACS data can vary significantly at the most granular level, because of the small number of randomly selected households and low response rates in some locations. Margins of error surpassing 90% are not uncommon for certain variables. To address this issue, the US Census Bureau is currently exploring different data augmentation methods, combining ACS data with non-public administrative records into a new hybrid dataset that is more complete and accurate than either of its components. The bureau has previously offered similar synthetic datasets, such as the longitudinal Survey of Income and Program Participation (SIPP), which has been augmented with income data from tax returns and records of retirement and disability benefit receipt.
The sensitive nature of tax and social-benefit information naturally raises concerns about privacy, not only because the US Census Bureau is bound by federal law to protect respondent confidentiality. Even in aggregated form, administrative data can be deanonymized under certain conditions, leading to the disclosure of sensitive personal information. For example, when only a single individual changes between two versions of the same dataset, the difference between the published aggregates can expose that individual's records. Differential privacy (DP) quantifies and limits such disclosures through calibrated noise. Based on the concept of DP, the Census Bureau introduced a disclosure avoidance system (DAS), which involves obfuscation techniques that use noise injection and multiple stages of data imputation.
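The single-individual scenario described above can be made concrete with a minimal sketch of a differencing attack. All names and salaries are invented:

```python
# Hypothetical differencing attack: two releases of one aggregate differ
# by a single individual, whose value the difference exposes exactly.

salaries_v1 = {"alice": 62000, "bob": 48000, "carol": 87000}
total_v1 = sum(salaries_v1.values())   # published aggregate, version 1

# Carol's record is removed before the aggregate is republished.
salaries_v2 = {k: v for k, v in salaries_v1.items() if k != "carol"}
total_v2 = sum(salaries_v2.values())   # published aggregate, version 2

# Knowing only the two "anonymous" totals reveals Carol's salary.
leaked_salary = total_v1 - total_v2
print(leaked_salary)  # 87000
```

If both totals instead carried independent noise, their difference would no longer pin down one person's value, which is exactly the guarantee DP formalizes.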
For spatial analysis and mapping, the combination of data augmentation and DP enables access to data at a level of spatial granularity otherwise not available in public data sources. Propelled by the need for training data and the privacy restrictions of regulatory frameworks such as the European General Data Protection Regulation (GDPR)
and the US Health Insurance Portability and Accountability Act (US Congress, 1996), data synthesis for privacy protection is currently a burgeoning industry. Most of these new companies rely on artificial neural networks (ANNs) and specifically deep learning as a versatile alternative method for capturing the statistical properties and patterns found in the original dataset. Synthetic data companies advertise other benefits beyond privacy protection. First, synthetic datasets can be generated at any desired size, regardless of the amount of available input data. However, it is important to note that such an inflated dataset inherently contains less information than the original. The model can also be used to generalize the input dataset and eliminate gaps and outliers. Crucially, the representation of groups can be rebalanced to mitigate biases—with the caveat that such post hoc corrections do not inherently improve data quality. Finally, synthetic data allows exploring the effects of extreme outliers by creating fictional variations of the same dataset. Nevertheless, traditional caveats of statistical modeling still apply: The analyst needs to thread the needle between overfitting, which introduces errors and artifacts by fitting the model too tightly to the available data, and overgeneralization, which leads to a loss of detail and information. But mimicry datasets are not bulletproof, even from a privacy perspective. While deep learning models do not contain the original training data and their inner structures are notoriously difficult to interpret, researchers are exploring a range of possible attacks on such models.

As a synthetic technique, inpainting is used in medical imaging to reconstruct damaged or missing parts of magnetic resonance imaging (MRI) data. More recently, it has been used to create fully synthetic training data for medical AI applications, such as brain tumor detection and skin disease classification (Akrout et al., 2023; Rouzrokh et al., 2023).
Real-world MRI data of brain tumors are difficult to obtain in the desired volume, not only because of regulatory constraints but also because of the rarity of specific conditions. Data inpainting allows inserting tumors into the MRI images of healthy patients. The technique can be compared to AI image generators such as Stable Diffusion and DALL-E.
Users select an image region, and the model fills it with new content that integrates smoothly with the surrounding context. Rather than inconspicuously patching over gaps, data inpainting is used here to generate the relevant features of the dataset, which can then be used to train other machine learning models to detect real brain tumors in real patients. The circularity of the process and the premise of teaching models to recognize severe medical conditions without empirical data seem troubling. Image generators may introduce invisible artifacts, are susceptible to adversarial attacks, and their outputs may simply lack the richness of real phenomena. It has been demonstrated that training models on their own output degrades the model and makes it “forget” the more unusual cases (Shumailov et al., 2023).

Conversely, emerging machine-learning techniques increasingly depend on training data at web scale. The impossibility of evaluating data quality at such vast scales has raised concerns about data biases. ImageNet's biases have been extensively scrutinized along the lines of gender and race (Dulhanty and Wong, 2019), geography (Shankar et al., 2017), and fairness of algorithmic classification (Yang et al., 2020). In contrast, the newer and vastly larger models involved almost no human oversight and curation. Since the training data for both CLIP and LAION-5B were scraped from the internet, they are subject to all kinds of distortions, such as mislabeled image captions resulting from search engine optimization efforts. Image generators trained on LAION-5B reproduce watermarks of commercial stock photo platforms, indicating that the random scraping of training data did not consider image copyright, which is the basis of pending lawsuits.
Besides such scraped and automatically labeled data, a second and equally prominent group of synthetic training data is entirely artificial, generated using 3D rendering and game engines (Figure 1). The reason can be easily illustrated: Searching the LAION-5B dataset for the concept “tree” almost exclusively returns images of trees viewed from eye level, but rarely a tree from above. This presents a problem for computer vision (CV) applications. Training data for CV systems therefore often include footage generated in game engines populated by digital humans (Bellan, 2022). 3D rendering ensures accurate labels but presents a trade-off in realism and richness, which could make the resulting models less transferable to real-world environments (Nikolenko, 2021: 12).

The Falling Things synthetic dataset with labeled information highlighted (below).
Representation and model indecomposability
The described forms of synthetic data range from simple perturbations of real-world observations to artificial data generated from scratch based on a loose relationship of resemblance or plausibility. While the former maintains correspondence to the original observations, the latter is only indirectly connected to the phenomenon; for example, by serving as training data for a model to detect said phenomenon.
A more fundamental challenge to representation is presented by synthetic data generated through ANNs, an approach that is increasingly becoming the norm. ANNs can detect subtle patterns missed by other techniques; however, their logic is hard to interpret. Because of the holistic training process, one cannot take the models apart to explain how they reach a specific result—they are not decomposable (Burrell, 2016). There is often no simple or satisfying explanation for why an ANN generates a certain output. The model parameters that determine the output of the ANN have no obvious relationship to the input data and therefore no representational value for a human interpreter.
Since the black box cannot simply be opened, the method of choice is to feed the model various inputs and interpret its behavior—an approach that treats ANNs more like a natural phenomenon than a mathematical object. Attempts to study the inner workings of fully trained ANNs involve probing individual neurons to generate images resembling the stimuli these neurons respond to (Olah et al., 2017). The internal structures revealed by these studies are neither elegant nor pretty. Some artificial neurons indeed seem to correspond to specific perceptual qualities such as curvature, orientation, or contrast, while many are redundant or respond to multiple stimuli that seem to have nothing in common (Goh et al., 2021). This form of representation may resemble how neural activity relates to external stimuli in biological organisms but cannot be formally abstracted. From a perspective of algorithmic accountability, the chain of custody between real-world observations and derived synthetic data remains opaque.
Non-representational frameworks and the relational data model
The premises of representational frameworks have been questioned long before the emergence of synthetic data. Nigel Thrift's Non-representational Theory (NRT) summarizes its critiques and offers alternative views (Thrift, 2007). While representation assumes a frozen world whose objects are static and stable, NRT emphasizes fluid processes of becoming. Meaning is no longer created only through language and symbols but also through embodied practices and performances. Objects are no longer simply viewed from different subjective perspectives but are enacted through different practices and therefore exist in multiplicities. Anthropologist Annemarie Mol, who examined medical practices in a Dutch hospital, concluded that a disease such as atherosclerosis is not a single, stable object but is enacted differently in the various practices of diagnosis and treatment.
These aspects can be directly applied to synthetic data. As described, the concept of synthetic data is defined through different contexts, practices, and goals. The data themselves exist in multiples; they are generated at will for different purposes. They lack correspondence to real-world objects, and when they are generated by ANNs, their process of generation is often not explainable. One could view synthetic data as embodiments of models and their training data, similar to how the smoke trails in a wind tunnel embody the principles of fluid dynamics. The smoke trails do not represent these principles; they are their physical effect. If digital computation is about the manipulation of symbols, the perspective of embodiment harks back to the era of analog computing. Early cyberneticists were often wary of representation and more concerned with the behavior of a system than its internal states. In 1962, Stafford Beer noted about the biological computers he and Gordon Pask were working on: “The only trouble is you do not know what the answer is. … [But] you do not want the answer. What you do want is to use this answer. So why ever digitise it” (Beer, 1962: 220–221). Similarly, an ANN can be examined from a strictly phenomenological perspective through its interactions with the context of its training (Beckmann et al., 2023). The generated data are the manifestations of this coupling: Part of a phenomenon that not only contains the object of observation such as the brain tumor and its training data, but also the observer, the programmers, the MRI machine, and the ideas that guided its construction.
The relational data model
If the representational model considers a datum as a symbolic reference to an external feature, an alternative relational model derives its meaning from the context, as Sabina Leonelli explains: “What counts as data depends on who uses them, how and for which purposes” (Leonelli, 2016: 196). If a datum is relationally defined through its fitness to serve as evidence, then evidence is likewise “assumed to consist in whatever makes a given assertion believable, or anyhow increases its intelligibility and/or plausibility to a given audience” (Leonelli, 2016: 199). In a relational reading of synthetic data, the question “What do data represent?” turns into questions such as “Do they perform well in a particular real-world situation?”—that is, the ability of a model trained on them to make an accurate prediction, or the impossibility of de-anonymizing an individual captured in an aggregated dataset. Synthetic data is therefore not defined through correspondence to the data source but through the many relationships that define its contexts of use.
Without real-world referents, the quality of synthetic data can no longer be measured in absolute terms but only in relation to a particular purpose (Jordon et al., 2018). Therefore, data benchmarks become central to assessing the quality of a synthetic dataset and measuring its fitness for a particular use. Such benchmarks include measures of how closely a synthetic dataset resembles the trends and distributions of a real-world dataset; measures of the privacy budget—the amount of information that can be revealed in a DP framework; or the performance of a CV model in a real-world environment.
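One of the benchmarks named above, how closely a synthetic sample tracks the distribution of the real sample it imitates, can be sketched with a two-sample Kolmogorov-Smirnov distance. The samples are invented, and real benchmark suites combine many such measures:

```python
# Sketch of one relational quality measure: the maximum gap between the
# empirical CDFs of a real and a synthetic sample (0 = identical
# distributions of ranks, 1 = completely disjoint).

def ks_distance(real, synthetic):
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)
    points = sorted(set(real) | set(synthetic))
    return max(abs(ecdf(real, x) - ecdf(synthetic, x)) for x in points)

real      = [21, 25, 30, 34, 41, 47, 52, 58]
synthetic = [22, 26, 29, 35, 40, 48, 53, 57]   # close mimic
shifted   = [61, 65, 70, 74, 81, 87, 92, 98]   # poor mimic

print(ks_distance(real, synthetic))  # 0.125
print(ks_distance(real, shifted))    # 1.0: the samples do not overlap
```

A low distance only certifies distributional resemblance; it says nothing about the other contextual measures mentioned above, such as the privacy budget or downstream model performance.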
Evaluating synthetic data frictions
Paul Edwards’ concept of data friction describes the costs and resistances that arise whenever data move between people, organizations, and machines. Synthetic data promises to reduce many of these frictions, from restricted access to privacy constraints.
Synthetic data, however, also introduces new frictions. The lack of representational correspondence can greatly complicate their use and evaluation. Data quality can be difficult to establish; various contextual measures may be in conflict. The intertwinedness of object and purpose means that data collection can no longer be viewed as separate from their use. The following section examines such frictions in privacy, fairness, variability, evidence, and trust.
Privacy
Data friction can emerge from the task-dependent trade-off between utility and privacy protection, a calculation that plays an important role in the distribution of the Census Bureau's data products, such as the US Decennial Census and the ACS. Both are essential datasets for social research; the Decennial Census is used as the basis for countless derived datasets and for drawing the boundaries of congressional districts. The Bureau also conducts a wide range of other surveys that cover many facets of the US population. Survey data are publicly available in two formats: Spatial aggregates and public-use microdata samples (PUMSs), which contain anonymized individual responses.
In the past, the Bureau has aimed to protect such private information by limiting the disclosure of identifiable attributes in the PUMS while offering accurate counts of the same variables in the aggregates. In the early 2000s, however, researchers showed that combining both formats can be leveraged in database reconstruction attacks to re-identify individuals with high confidence (Keller and Abowd, 2023). The only remedy is introducing noise into the disclosed data at a mathematically determined amount: To prevent reconstruction, the noise must be calibrated so that the published statistics can no longer be combined to recover individual records.
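The “mathematically determined amount” can be illustrated with the Laplace mechanism of differential privacy, in which the noise scale is the query's sensitivity divided by the privacy parameter epsilon. This is a toy sketch with an invented population; the Bureau's actual DAS is far more elaborate:

```python
import random

# Toy Laplace mechanism: for a counting query, one person changes the
# result by at most 1 (sensitivity = 1), so noise drawn from
# Laplace(0, sensitivity / epsilon) bounds what any single release reveals.

def dp_count(true_count, epsilon, rng):
    scale = 1.0 / epsilon  # sensitivity 1 for a simple count
    # A Laplace variate is the difference of two exponential variates.
    noise = scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
    return true_count + noise

rng = random.Random(0)
true_count = 1337  # hypothetical precinct population
releases = [dp_count(true_count, 0.5, rng) for _ in range(5000)]
mean = sum(releases) / len(releases)
print(round(mean, 1))  # close to the true count on average; noisy per release
```

Smaller epsilon values (a tighter privacy budget) increase the noise scale, sharpening the utility-privacy trade-off at issue in this section.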
Starting with the 2020 Census, all disclosed summary statistics contain intentional distortions, including the total population counts for all localities below the state level. Additionally, the Bureau plans to offer microdata in the future only in synthetic form. Both initiatives have met with strong resistance from social scientists, who worry about free access to reliable public data (IPUMS, 2023). It has been argued that the threat of database reconstruction has been overstated and that the results of existing re-identification experiments are explainable by mere chance (Ruggles and Van Riper, 2022).
The plan to release ACS microdata in synthetic form has provoked similar controversy. It is unclear whether the synthetic models capture all possible patterns in the data beyond accounting for those already well understood. The Bureau's plan to require researchers to send their results acquired from synthetic data to the Bureau for validation on real data is seen as a practical obstacle that effectively eliminates exploratory data analysis, thus preventing new discoveries (IPUMS, 2023). Finally, the synthetic data models are an abstraction of real observations, lacking outliers and other irregularities due to regularization. But not every form of data analysis is statistical; sometimes, research has to focus precisely on these outliers. Visualization researcher Pedro Cruz's work with ACS microdata explores the composition of multiracial families in the US at the individual level—an example of research that would not be meaningful with synthetic datasets (Figure 2).

Pedro Cruz, diversity traces. 2022.
The composition of synthetic census data products in aggregate and microdata form is determined by their source and purpose. However, in this case, the constraints imposed by the purpose conflict both conceptually and legally. Protecting individuals from disclosure requires a different level of accuracy than the redistricting process, necessitating different synthetic data expressions. However, as foundational datasets, the ACS and the US Census cannot be easily adjusted for a specific purpose.
Fairness
DASs have additional implications for algorithmic fairness. Noise injection tends to diffuse contrasts in the data, which has been shown to introduce biases that disproportionately affect small precincts and make them appear more homogeneous. This results in undercounted minorities in small precincts with mixed populations. This bias has a measurable effect on the redistricting of congressional districts for both racial minorities and voters with minority party affiliations, even in scenarios where fair congressional maps are a priority (Kenny et al., 2021). Since small and racially diverse precincts generally experience larger DAS errors, this bias has many other implications beyond redistricting. One example is less accurate coronavirus disease 2019 mortality estimates, which affect precisely those communities most at risk (Hauer and Santos-Lozada, 2021).
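The disparity can be sketched with back-of-the-envelope arithmetic: the expected absolute error of symmetric noise is fixed by its scale, so the relative distortion grows as precincts shrink. The populations and noise scale below are invented:

```python
# Back-of-the-envelope: the expected absolute error of Laplace(0, b)
# noise equals b regardless of the true count, so the relative
# distortion of a noisy population count scales inversely with size.

noise_scale = 10.0  # hypothetical noise scale b
precinct_populations = {
    "small rural precinct": 120,
    "mid-size precinct": 2400,
    "large urban precinct": 96000,
}

for name, population in precinct_populations.items():
    relative_error = noise_scale / population
    print(f"{name}: {relative_error:.2%} expected relative error")
# small rural precinct: 8.33% ... large urban precinct: 0.01%
```

The same absolute noise that is negligible for a large city thus overwhelms the signal in a small, diverse precinct, which is the mechanism behind the biases described above.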
While privacy-protecting perturbations can introduce harmful biases, synthetic data methods are also explored to eliminate such biases. Many datasets are collected opportunistically from unreliable sources such as self-selected participants or scraped from the web. They are necessarily biased: collected by similar people in similar situations, covering cases that are similar to each other. That machine learning models tend to amplify biases in training data has been well documented (Buolamwini and Gebru, 2018; Wang et al., 2019; Wang et al., 2020). The first step to addressing this issue is introducing protected attributes and ensuring that every group is adequately represented in the training dataset. Unfortunately, this creates another friction: those protected attributes are also prone to revealing respondent identities and are, therefore, often removed from published datasets, complicating the task of bias correction.
Issues of implicit biases in synthetic data can be demonstrated by asking a popular large language model (LLM) such as OpenAI's GPT-4 to generate an artificial census dataset. The output for the city of Boston includes a list of synthetic citizens with plausible occupations, salaries, education, and even street addresses. Specifying the upscale neighborhood of Beacon Hill produces Asian investment bankers, white real estate developers, professors, and architects with high salaries (Figure 3). Specifying the working-class district of Roxbury results in predominantly black and Hispanic virtual citizens with jobs such as barber or construction worker (Figure 4). One may conclude that the LLM knows a lot about Boston until one asks the model for specific data about average salaries, only to find out that the results differ each time. We are quickly reminded that the LLM does not have an internal database about Boston's socio-economic geography but has merely learned to mimic linguistic patterns, including the format of census datasets—the racial and occupational information results from learned stereotypes about Boston's population.

Synthetic residents of Boston's Beacon Hill neighborhood, generated by GPT-4.

Synthetic residents of Boston's Roxbury neighborhood, generated by GPT-4.
The reproduction and amplification of hidden stereotypes are in friction with the earlier issue of the loss of hidden correlations. A brute-force solution may involve removing protected attributes from the training set. However, when information is missing, curation alone is not sufficient; uncorrelated data must be synthesized. Ramaswamy et al. (2021) offer the example of a CV dataset featuring many images of people wearing both hats and sunglasses as a surrogate for more racially sensitive cases. The correlation between hats and glasses can be broken, the researchers suggest, by removing hats in some images and adding glasses in others. While such approaches may seem clumsy, it is claimed that models trained on such rebalanced data make fewer mistakes and, therefore, fairer judgments (Chari et al., 2022). The trade-off between accuracy, privacy, and fairness, however, requires a nuanced look at correlation, especially since it is not always possible to determine whether a pattern corresponds to an external feature or is an artifact exaggerated by the model. As adversarial attack demonstrations show, it is difficult to anticipate how even a balanced dataset translates into model behavior. Since biases are easier to find in real observations, the lack of representation creates data friction that complicates debugging synthetically de-biased datasets.
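The rebalancing logic can be reduced to a toy calculation: start with labels in which "hat" and "glasses" always co-occur, add synthesized counterexamples, and watch the correlation collapse. All counts are invented:

```python
# Toy version of breaking a spurious correlation: in the original data,
# hats and glasses always co-occur; synthesized variants supply the
# missing combinations so a model can learn the concepts separately.

def phi(pairs):
    """Phi coefficient between two binary attributes."""
    n11 = sum(1 for a, b in pairs if a and b)
    n10 = sum(1 for a, b in pairs if a and not b)
    n01 = sum(1 for a, b in pairs if not a and b)
    n00 = sum(1 for a, b in pairs if not a and not b)
    denom = ((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)) ** 0.5
    return (n11 * n00 - n10 * n01) / denom

# (hat, glasses) labels: perfectly correlated original dataset.
original = [(1, 1)] * 40 + [(0, 0)] * 40
# Synthetic images: hats removed in some, glasses added in others.
augmented = original + [(1, 0)] * 40 + [(0, 1)] * 40

print(phi(original))   # 1.0 - the two concepts are indistinguishable
print(phi(augmented))  # 0.0 - the correlation is broken
```

Whether the synthesized counterexamples look natural is beside the point here; what matters for the model is that the attribute co-occurrence no longer carries signal.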
Variability
To de-bias a dataset with fake sunglasses and hats means increasing its variability. Now, the dataset has more training images of people with only hats or glasses, allowing the model to distinguish these two concepts. Beyond accurate representation, many training datasets have an additional objective: To cover all possible cases, regardless of their improbability. It is insufficient for a diagnostic model to work in 99% of typical situations if it is meant to recognize anomalies. Identifying edge cases before they arise requires imagination and speculation to create counterfactual scenarios.
In his political analysis of synthetic data, Jacobsen describes synthetic data as a
Amplifying the contrasts within the dataset can help, as Jacobsen puts it, “directing and disciplining the attention” of the algorithm and expanding its “field of vision” to include edge cases it might encounter (Jacobsen, 2023: 5). What may be more surprising is the sometimes extreme degree of these amplifications. Synthetic datasets for training facial recognition systems include not only realistic faces of all races, genders, and ethnicities under different light conditions (Figure 5) but also extremely distorted, cartoonish faces with unrealistic skin tones that one is unlikely to encounter in real life (Boutros et al., 2023). Whether such cartoon faces actually enhance the performance of the model or merely serve as good-enough proxies, in both cases, they emphasize that synthetic data is not just an inferior substitute for real-world observations but an entirely different beast. Synthetic data are an intervention that often contradicts rather than mimics empirical content.

Examples from Microsoft's Face Synthetics dataset.
Evidence
If a model trained on distorted data can perform better than one trained with real observations, it poses an interesting question: what constitutes realism in synthetic training data, and what exactly is their evidentiary value? In Leonelli's words, evidence is the “grounds on which specific claims about reality acquire credibility” (Leonelli, 2016: 199). In the case of synthetic data, these grounds are no longer ontological claims about reality but the behavior of the trained model.
But while exaggerated variability can lead to a more reliable model, other forms of variability get lost in synthetic data. Most datasets based on real-world observations contain outliers and irregularities, artifacts of the data collection method, or technical glitches. Such outliers pose challenges for statistical modeling. Trying to capture them in the model often leads to overfitting, resulting in overcomplicated models that perform well on their own training data, but adapt poorly to other situations. Statistical modeling, whether performed manually or through self-supervised machine learning, therefore prefers simpler models over more complex ones. Data produced by statistical models are always abstracted abstractions.
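The preference for simpler models can be illustrated with a toy comparison between a model that memorizes every training point, outliers included, and one that deliberately abstracts. All numbers are invented:

```python
# Toy contrast between a memorizing model (1-nearest neighbor) and a
# deliberately simple one (the training mean) on data with one outlier.

train = [(0, 1.0), (1, 1.1), (2, 0.9), (3, 9.0)]  # last point: a glitch
test = [(4, 1.0), (5, 1.05)]

def predict_mean(x):
    return sum(y for _, y in train) / len(train)

def predict_1nn(x):
    return min(train, key=lambda point: abs(point[0] - x))[1]

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

print(mse(predict_1nn, train))  # 0.0: the memorizer fits even the glitch
print(mse(predict_1nn, test) > mse(predict_mean, test))  # True
```

The memorizer achieves a perfect score on its own training data but lets the outlier dominate nearby predictions, while the cruder model generalizes better, which is exactly the trade-off regularization encodes.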
The abstracted outliers, however, are not always the result of technical glitches. Sometimes, they are the proverbial needle in the haystack, evidence of the individual case that stands out from the rest. Other times, they provide crucial context for deciding whether information is trustworthy in debates around misinformation: noise patterns and artifacts can betray image manipulations or AI-generated imagery. Finally, errors and glitches themselves carry relevant information and offer clues about the circumstances of data collection. While the sciences strive for generalization, forensics focuses on individualization: looking close enough to find what makes a particular datum different from all others (Kirschenbaum, 2008: 10).
In this regard, a synthetic dataset is akin to an airbrushed portrait with all imperfections removed—imperfections that, however, also make it authentic. Correcting glitches and biases may improve accuracy but comes at the cost of losing the information embodied in the excluded outliers. Because it focuses on the “content” of data, the representational model offers few conceptual tools for this problem. In contrast, the relational model allows a critical examination of the data origin by considering data as indexical signs (Weatherby and Justie, 2022) or material traces that bear the inscriptions of their generation (Offenhuber, 2019).
However, the relational model is not without its problems. As described, it involves three aspects: first, data generation as a chain of material interactions; second, the social framing of epistemic contexts, which determines what is accepted as evidence; and third, the effects and performances of data as their defining aspect. These three aspects, however, are often in conflict with each other, and without a real-world reference, it becomes difficult to resolve these conflicts. Replacing the question “What does it represent?” with “Does it work?” is not enough when grappling with adversarial attacks that can mysteriously throw off a deep learning model through seemingly inconsequential modifications. 15
Trust
Whether a real and a synthetic dataset are statistically equivalent cannot always be easily established. No introductory statistics textbook would be complete without a discussion of Anscombe's quartet, a cautionary tale in four charts against blind reliance on statistics (Figure 6). The quartet shows scatterplots of four datasets that clearly describe very different phenomena. However, these four distinct patterns share the same summary statistics, regression line, and standard error. Anscombe's argument challenges the notion that “performing intricate calculations is virtuous, whereas actually looking at the data is cheating” and cautions that any analysis should be sensitive to “whatever background information is available about the variables” (Anscombe, 1973).

Anscombe's quartet: Four datasets with the same descriptive statistics, including number of observations, means, regression line, and standard error.
However, while visual verification is trivial for simple datasets, it is impossible for massive synthetic datasets and machine learning models with an equally large number of parameters. A term frequently emphasized in machine learning literature is trust: a range of different tests, comparisons, and assessments help to build trust that a seemingly unrealistic synthetic datum can serve its purpose as part of the training dataset (Emam et al., 2020b). In the absence of ground truth, trust is built through a multitude of relationships. The solution to the trap of Anscombe's quartet is introducing yet another metric. Rather than a clear-cut set of quantitative standards, these quantitative metrics become mere signals in a larger interpretation-based assessment.
The relational nature of synthetic data expands into the scaffolding of a multitude of metrics of variation and performance. Without a simple reference to ground truth, data frictions emerge from the need to cross-reference unmoored metrics, laboriously evaluating agreements and discrepancies.
Contamination
Whether Gartner's estimate of the imminent flood of synthetic data is justified or not, the contamination of the web with generative AI content is a frequently voiced concern in public discourse. It is difficult to estimate how much new online content, including blog posts, comments, and reviews, is written with the help of LLMs. Still, the amount is certainly increasing, and such content is unwittingly sampled into new training datasets. As online data sources are commonly used in research and industry, synthetic data infiltrates even traditional survey responses. An experiment on the crowdworking platform Mechanical Turk showed that roughly half of the crowdworkers used LLMs such as ChatGPT to generate their survey responses (Veselovsky et al., 2023).
As a result, it will become even more difficult to observe human behavior and public sentiments online. A highly convincing deep fake of a president taking a bribe almost seems less harmful than the possibility of easily generating an entire discourse through AI astroturfing, with synthetic users discussing the allegation and thus influencing unwitting bystanders. In addition, any number of synthetic datasets can be generated and pointed to as evidence in the debate (Porsdam Mann et al., 2023). This scenario is a weaponized version of Asch's conformity experiments, in which participants ignored their own observations to conform with the statements of actors disguised as co-participants (Asch, 1951).
Conclusion—the ethics of anything goes
Any conclusion about the various issues surrounding synthetic data has to begin with a discussion of its ethical implications, including questions about honesty, fairness, and the well-being of others. The use of synthetic data is ostensibly driven by ethical considerations, such as protecting the privacy of individuals and decreasing biases in datasets. Since synthetic data are quick and easy to generate in any desired volume, they may mitigate the tech sector's inclination towards surveillance and reduce the risk of exposing personal information. At the same time, synthetic data raise new ethical issues. Their ease and convenience fit too well into the model of capitalistic acceleration (Steinhoff, 2022), and a nuanced discussion of their data issues may fall by the wayside.
The malleability of synthetic data raises the temptation of fabricating an “Instagram reality” of data: an idealized representation heavily mediated by beauty filters and image manipulations. While the frictions of real-world datasets often resist computational analysis, synthetic data can be made to perform beautifully in statistical models by correcting biases, seamlessly filling gaps, sanitizing and regularizing outliers, scaling up resolution by in-painting plausible details, increasing variability, and decreasing ambiguity. With the representational chain of custody broken, synthetic methods allow for generating representational relationships and resemblances at will.
In many forms of synthetic data, the representational relationship is not merely implicit but entirely absent. The goal of differential privacy (DP) is to dissociate a dataset entirely from the observed population. Training datasets are assembled with the aim of enabling a model to recognize events that are otherwise rarely observed. Who and what is represented in synthetic datasets becomes a complicated question. As discussed in this paper, the loss of a representational relationship means that data problems often go unnoticed—problems that are often revealed by artifacts, glitches, and data imperfections. Claims of debiasing training data by introducing extreme outliers warrant some practical skepticism. It is doubtful whether data in which marginalized groups are misrepresented can be debiased on a meaningful scale.
This paper calls for a relational approach to synthetic data, focusing on the circumstances of data generation, the purposes data are used for, and the contexts in which data serve as evidence. Instead of focusing on generalization, it calls for individualization: examining what makes a particular case unique. This uniqueness is found not in the data content but in the data context: its settings, performances, and interactions. At the same time, the paper addresses the shortcomings of the relational model in situations where representation remains essential.
And yet, the phenomenon of synthetic data offers important provocations that can be productive for the critical data discourse. The epistemic weirdness of the phenomenon challenges traditional assumptions about data. Switching from a representational to a performative notion of data realism, in which the performance of the trained model is the main metric, or adopting a definition of data that starts with the intended use rather than their descriptive capacity, forces us to think relationally and contextually about data. If there is one lesson we can draw from synthetic data, it is that data have a speculative, imaginary, and counterfactual dimension. As Judea Pearl's ladder of causation tells us, establishing causality requires more than simply observing data; it requires intervention and the imagination of counterfactual scenarios (Pearl and Mackenzie, 2018).
