Abstract
Keywords
The weirdness of synthetic data
The artificial intelligence (AI) startup
Using artificial data for applications that normally depend on real-world observations may seem troubling. Remarkably, even government agencies like the National Institutes of Health and the US Census Bureau, which traditionally conducts labor-intensive door-to-door data collection for the decennial census, have embraced synthetic data to enhance dataset accuracy and address privacy concerns (Syntegra, 2021; US Census Bureau, 2021). Despite its paradoxical premise, synthetic data is rapidly gaining traction in many fields. The Wall Street Journal estimates (with little elaboration) that by the end of the year 2024, 60% of all data used in training and data analytics will be generated by AI (Castellanos, 2021).
This paper provides a typology of the different forms and methods of synthetic data and examines its epistemic claims. It argues that the phenomenon demands a departure from the traditional representational concept of data toward a relational model, where data are defined not only through their associated phenomenon but also through their purpose, performances, and contexts of use. Synthetic data thus prompts a reevaluation of the notion of data realism—from accurately representing things in the world to allowing a model trained with it to recognize real things. Finally, the paper identifies Paul Edwards’ concept of data friction as a productive framework for evaluating synthetic data.
The representational model of data
Data can mean different things in different contexts—from verbal presentations to written records; from physical traces and archeological artifacts to abstract symbols. Data have been defined as singular differences in the world (Floridi, 2011), as inscriptions that are both mobile and immutable.
Considering a datum as a representation means viewing it as a reference to a physical or conceptual object in the world. As a record created through observation or measurement, the datum represents its object, points to it, and substitutes it in subsequent operations. In other words, the datum serves as a sign.
While this representational framing may seem uncontroversial, it has been subject to extensive critique:
- The representational relationship is necessarily, and often problematically, reductive, omitting its process of generation, its underlying assumptions, and arbitrary decisions (Coopmans et al., 2014; Drucker, 2011).
- Representation implies a duality between an external world, assumed to be stable, and data, assumed to be abstract. The world is assumed to simply exist rather than constantly being made and remade. Data are also never abstract, since every datum has a material embodiment that can interfere with its intended meaning.
- Once data are generated, the dataset establishes its own reality. How exactly data relate to things in the world fades into the background. Outside of scientific practice, data analysts often have little interest in the minutiae of data collection.
- As critical data literature has extensively discussed, data are not merely objective descriptions but interpretations of the world, shaped by worldviews and problem articulations (D’Ignazio and Klein, 2020; Kitchin, 2014; Loukissas, 2019).
- The representational model also has practical flaws. Data found in computer memory or hard drives are often merely by-products of algorithms rather than pointers to things in the world, such as the temporary information stored in caches and buffers. While they could be interpreted as second-order representations, their representational qualities are not necessary or useful for understanding the computational process. In the digital context, data is a flexible concept that extends far beyond representation and includes code, metadata, thumbnails, and temporary information (Dourish, 2017).
While these critiques are not new, the fundamental weirdness of synthetic data, discussed in the following sections, undermines the representational model most profoundly. Synthetic data may mimic empirical observations but no longer correspond to features of the world. They are defined based on their application and, therefore, no longer independent of their use. Finally, they are often generated through methods that elude exhaustive explanation and scrutiny. These violations of representational assumptions will be elaborated in the following sections.
Forms of synthetic data
Definitions of synthetic data vary, reflecting methodological differences and purposes. In terms of methodology, synthetic data has been divided into three groups: Data derived from real-world observations; data not connected to such observations; and a hybrid combination of the two (Emam et al., 2020a). In terms of purposes, the three main uses of synthetic data are to protect privacy, to generate training data for machine learning applications, and to improve existing datasets by correcting biases and augmenting them with other data. Existing definitions diverge as they tend to emphasize only one of these areas: “Synthetic data is annotated information that computer simulations or algorithms generate as an alternative to real-world data;” (Andrews, 2021) “information that's been generated on a computer to augment or replace real data to improve AI models, protect sensitive data, and mitigate bias;” (IBM, 2021) “data that has been generated from real data and that has the same statistical properties as the real data;” (Emam et al., 2020b) and “data points that a machine learning algorithm has generated; data which in turn are used to train other machine-learning algorithms” (Jacobsen, 2023).
The following typology is not meant to be canonical—some authors consider augmentation, imputation, and harmonization as data modifications rather than true synthetic data (Nikolenko, 2019: 12). However, all these approaches exist on a continuum between real and synthetic.
Unlike the forms of synthetic data discussed in the following sections, algorithmic data are typically not considered substitutes or imitations of real-world observations. As epistemic artifacts in their own right, they represent models and theories whose properties and limitations are well understood. Since they are based on explicit assumptions, they allow theoretical experiments that would not be possible with real-world data alone. For example, examining trends and seasonality in medical data depends on a complete time series, but gaps are common in patient data. Determining whether a patient improves over time or whether a long-term treatment is effective requires closing these gaps without distorting the outcome. Depending on the length and nature of the gaps, different imputation techniques can be considered, from simply carrying the last observation forward to interpolation or predictive modeling. As an elementary form of data synthesis, imputation generates the illusion of a complete dataset. It does not add information and invariably introduces new biases, such as smaller apparent margins of error and inflated statistical significance. While not recommended for explanatory data analysis, imputation may nevertheless improve prediction models and is therefore often used for training data. However, performing imputation correctly requires a thorough understanding of its limitations and the assumptions implied by the particular purpose. Commonly used for the analysis of surveys with incomplete responses, imputation is typically applied at the analysis stage; if redistributed, the imputed parts of datasets need to be clearly labeled. Other forms of administrative data, however, require more extensive data harmonization. Different administrative districts and counties often use different data formats and collection methodologies, which complicates data aggregation.
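The two elementary strategies mentioned above, carrying the last observation forward and interpolation, can be sketched in a few lines. This is a toy illustration with invented measurements and function names, not any particular clinical pipeline:

```python
# Toy illustration of two elementary imputation strategies for a patient
# time series with gaps (None). Values and function names are invented.

def locf(series):
    """Last observation carried forward: repeat the most recent value."""
    out, last = [], None
    for value in series:
        last = value if value is not None else last
        out.append(last)
    return out

def linear_interpolate(series):
    """Fill each gap by interpolating between its known neighbors."""
    out = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for a, b in zip(known, known[1:]):
        step = (series[b] - series[a]) / (b - a)
        for i in range(a + 1, b):
            out[i] = series[a] + step * (i - a)
    return out

measurements = [7.0, None, None, 4.0, None, 6.0]
print(locf(measurements))                # [7.0, 7.0, 7.0, 4.0, 4.0, 6.0]
print(linear_interpolate(measurements))  # [7.0, 6.0, 5.0, 4.0, 5.0, 6.0]
```

Both fills produce a visually complete series, but neither adds information; statistics computed on the filled values will understate uncertainty, as noted above.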
Furthermore, data from different countries are collected with different methods, whose details are often unknown and therefore difficult to reconcile. These numerous incompatibilities make the compilation and harmonization of global datasets a challenging endeavor that involves many arbitrary judgments. The hybridization of public and non-public data sources is particularly relevant in the social sciences. The annual American Community Survey (ACS) is a widely used socioeconomic dataset that includes information about income, education, employment, housing, and health of residents. However, the accuracy of ACS data can vary significantly at the most granular level, because of the small number of randomly selected households and low response rates in some locations. Margins of error surpassing 90% are not uncommon for certain variables. To address this issue, the US Census Bureau is currently exploring different data augmentation methods, combining ACS data with non-public administrative records into a new hybrid dataset that is more complete and accurate than either of its components. The bureau has previously offered similar synthetic datasets, such as the longitudinal Survey of Income and Program Participation (SIPP), which has been augmented with income data from tax returns and records of retirement and disability benefit receipt.
The sensitive nature of tax and social-benefit information naturally raises concerns about privacy, not only because the US Census Bureau is bound by federal law to protect respondent confidentiality. Even in aggregated form, administrative data can be deanonymized under certain conditions, leading to the disclosure of sensitive personal information. For example, when only a single individual changes between two versions of the same dataset, the difference between the published aggregates can expose that individual's records. Differential privacy (DP) quantifies and limits such disclosures through calibrated noise. Based on the concept of DP, the Census Bureau introduced a disclosure avoidance system (DAS), which involves obfuscation techniques that use noise injection and multiple stages of data imputation.
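The single-individual scenario described above can be made concrete with a minimal sketch of a differencing attack. All names and salaries are invented:

```python
# Hypothetical differencing attack: two releases of one aggregate differ
# by a single individual, whose value the difference exposes exactly.

salaries_v1 = {"alice": 62000, "bob": 48000, "carol": 87000}
total_v1 = sum(salaries_v1.values())   # published aggregate, version 1

# Carol's record is removed before the aggregate is republished.
salaries_v2 = {k: v for k, v in salaries_v1.items() if k != "carol"}
total_v2 = sum(salaries_v2.values())   # published aggregate, version 2

# Knowing only the two "anonymous" totals reveals Carol's salary.
leaked_salary = total_v1 - total_v2
print(leaked_salary)  # 87000
```

If both totals instead carried independent noise, their difference would no longer pin down one person's value, which is exactly the guarantee DP formalizes.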
For spatial analysis and mapping, the combination of data augmentation and DP enables access to data at a level of spatial granularity otherwise not available in public data sources. Propelled by the need for training data and the privacy restrictions of regulatory frameworks such as the European General Data Protection Regulation (GDPR)
and the US Health Insurance Portability and Accountability Act (US Congress, 1996), data synthesis for privacy protection is currently a burgeoning industry. Most of these new companies rely on artificial neural networks (ANNs) and specifically deep learning as a versatile alternative method for capturing the statistical properties and patterns found in the original dataset. Synthetic data companies advertise other benefits beyond privacy protection. First, synthetic datasets can be generated at any desired size, regardless of the amount of available input data. However, it is important to note that such an inflated dataset inherently contains less information than the original. The model can also be used to generalize the input dataset and eliminate gaps and outliers. Crucially, the representation of groups can be rebalanced to mitigate biases—with the caveat that such post hoc corrections do not inherently improve data quality. Finally, synthetic data allows exploring the effects of extreme outliers by creating fictional variations of the same dataset. Nevertheless, traditional caveats of statistical modeling still apply: The analyst needs to thread the needle between overfitting, which introduces errors and artifacts by fitting the model too tightly to the available data, and overgeneralization, which leads to a loss of detail and information. But mimicry datasets are not bulletproof, even from a privacy perspective. While deep learning models do not contain the original training data and their inner structures are notoriously difficult to interpret, researchers are exploring a range of possible attacks on such models.

As a synthetic technique, inpainting is used in medical imaging to reconstruct damaged or missing parts of magnetic resonance imaging (MRI) data. More recently, it has been used to create fully synthetic training data for medical AI applications, such as brain tumor detection and skin disease classification (Akrout et al., 2023; Rouzrokh et al., 2023).
Real-world MRI data of brain tumors are difficult to obtain in the desired volume, not only because of regulatory constraints but also because of the rarity of specific conditions. Data inpainting allows inserting tumors into the MRI images of healthy patients. The technique can be compared to AI image generators such as Stable Diffusion and DALL-E.
Users select an image region, and the model fills it with new content that integrates smoothly with the surrounding context. Rather than inconspicuously patching over gaps, data inpainting is used here to generate the relevant features of the dataset, which can then be used to train other machine learning models to detect real brain tumors in real patients. The circularity of the process and the premise of teaching models to recognize severe medical conditions without empirical data seem troubling. Image generators may introduce invisible artifacts, are susceptible to adversarial attacks, and their outputs may simply lack the richness of real phenomena. It has been demonstrated that training models on their own output degrades the model and makes it “forget” the more unusual cases (Shumailov et al., 2023).

Conversely, emerging machine-learning techniques increasingly depend on training data at web scale. The impossibility of evaluating data quality at such vast scales has raised concerns about data biases. ImageNet's biases have been extensively scrutinized along the lines of gender and race (Dulhanty and Wong, 2019), geography (Shankar et al., 2017), and fairness of algorithmic classification (Yang et al., 2020). In contrast, the newer and vastly larger models involved almost no human oversight and curation. Since the training data for both CLIP and LAION-5B were scraped from the internet, they are subject to all kinds of distortions, such as mislabeled image captions resulting from search engine optimization efforts. Image generators trained on LAION-5B reproduce watermarks of commercial stock photo platforms, indicating that the random scraping of training data did not consider image copyright, which is the basis of pending lawsuits.
Besides such scraped and automatically labeled data, a second and equally prominent group of synthetic training data is entirely artificial, generated using 3D rendering and game engines (Figure 1). The reason can be easily illustrated: Searching the LAION-5B dataset for the concept “tree” almost exclusively returns images of trees viewed from eye level, but rarely a tree from above. This presents a problem for computer vision (CV) applications. Training data for CV systems therefore often include footage generated in game engines populated by digital humans (Bellan, 2022). 3D rendering ensures accurate labels but presents a trade-off in realism and richness, which could make the resulting models less transferable to real-world environments (Nikolenko, 2021: 12).

The Falling Things synthetic dataset with labeled information highlighted (below).
Representation and model indecomposability
The described forms of synthetic data range from simple perturbations of real-world observations to artificial data generated from scratch based on a loose relationship of resemblance or plausibility. While the former maintains correspondence to the original observations, the latter is only indirectly connected to the phenomenon; for example, by serving as training data for a model to detect said phenomenon.
A more fundamental challenge to representation is presented by synthetic data generated through ANNs, an approach that is increasingly becoming the norm. ANNs can detect subtle patterns missed by other techniques; however, their logic is hard to interpret. Because of the holistic training process, one cannot take the models apart to explain how they reach a specific result—they are not decomposable (Burrell, 2016). There is often no simple or satisfying explanation for why an ANN generates a certain output. The model parameters that determine the output of the ANN have no obvious relationship to the input data and therefore no representational value for a human interpreter.
Since the black box cannot simply be opened, the method of choice is to feed the model various inputs and interpret its behavior—an approach that treats ANNs more like a natural phenomenon than a mathematical object. Attempts to study the inner workings of fully trained ANNs involve probing individual neurons to generate images resembling the stimuli these neurons respond to (Olah et al., 2017). The internal structures revealed by these studies are neither elegant nor pretty. Some artificial neurons indeed seem to correspond to specific perceptual qualities such as curvature, orientation, or contrast, while many are redundant or respond to multiple stimuli that seem to have nothing in common (Goh et al., 2021). This form of representation may resemble how neural activity relates to external stimuli in biological organisms but cannot be formally abstracted. From a perspective of algorithmic accountability, the chain of custody between real-world observations and derived synthetic data remains opaque.
Non-representational frameworks and the relational data model
The premises of representational frameworks have been questioned long before the emergence of synthetic data. Nigel Thrift's Non-representational Theory (NRT) summarizes its critiques and offers alternative views (Thrift, 2007). While representation assumes a frozen world whose objects are static and stable, NRT emphasizes fluid processes of becoming. Meaning is no longer created only through language and symbols but also through embodied practices and performances. Objects are no longer simply viewed from different subjective perspectives but are enacted through different practices and therefore exist in multiplicities. Anthropologist Annemarie Mol, who examined medical practices in a Dutch hospital, concluded that a disease such as atherosclerosis is not a single, stable object but is enacted differently in the various practices of diagnosis and treatment.
These aspects can be directly applied to synthetic data. As described, the concept of synthetic data is defined through different contexts, practices, and goals. The data themselves exist in multiples; they are generated at will for different purposes. They lack correspondence to real-world objects, and when they are generated by ANNs, their process of generation is often not explainable. One could view synthetic data as embodiments of models and their training data, similar to how the smoke trails in a wind tunnel embody the principles of fluid dynamics. The smoke trails do not represent these principles; they are their physical effect. If digital computation is about the manipulation of symbols, the perspective of embodiment harks back to the era of analog computing. Early cyberneticists were often wary of representation and more concerned with the behavior of a system than its internal states. In 1962, Stafford Beer noted about the biological computers he and Gordon Pask were working on: “The only trouble is you do not know what the answer is. … [But] you do not want the answer. What you do want is to use this answer. So why ever digitise it” (Beer, 1962: 220–221). Similarly, an ANN can be examined from a strictly phenomenological perspective through its interactions with the context of its training (Beckmann et al., 2023). The generated data are the manifestations of this coupling: Part of a phenomenon that not only contains the object of observation such as the brain tumor and its training data, but also the observer, the programmers, the MRI machine, and the ideas that guided its construction.
The relational data model
If the representational model considers a datum as a symbolic reference to an external feature, an alternative relational model derives its meaning from the context, as Sabina Leonelli explains: “What counts as data depends on who uses them, how and for which purposes” (Leonelli, 2016: 196). If a datum is relationally defined through its fitness to serve as evidence, then evidence is likewise “assumed to consist in whatever makes a given assertion believable, or anyhow increases its intelligibility and/or plausibility to a given audience” (Leonelli, 2016: 199). In a relational reading of synthetic data, the question “What do data represent?” turns into questions such as “Do they perform well in a particular real-world situation?”—that is, the ability of a model trained on them to make an accurate prediction, or the impossibility of de-anonymizing an individual captured in an aggregated dataset. Synthetic data is therefore not defined through correspondence to the data source but through the many relationships that define its contexts of use.
Without real-world referents, the quality of synthetic data can no longer be measured in absolute terms but only in relation to a particular purpose (Jordon et al., 2018). Therefore, data benchmarks become central to assessing the quality of a synthetic dataset and measuring its fitness for a particular use. Such benchmarks include measures of how closely a synthetic dataset resembles the trends and distributions of a real-world dataset; measures of the privacy budget—the amount of information that can be revealed in a DP framework; or the performance of a CV model in a real-world environment.
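One of the benchmarks named above, how closely a synthetic sample tracks the distribution of the real sample it imitates, can be sketched with a two-sample Kolmogorov-Smirnov distance. The samples are invented, and real benchmark suites combine many such measures:

```python
# Sketch of one relational quality measure: the maximum gap between the
# empirical CDFs of a real and a synthetic sample (0 = identical
# distributions of ranks, 1 = completely disjoint).

def ks_distance(real, synthetic):
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)
    points = sorted(set(real) | set(synthetic))
    return max(abs(ecdf(real, x) - ecdf(synthetic, x)) for x in points)

real      = [21, 25, 30, 34, 41, 47, 52, 58]
synthetic = [22, 26, 29, 35, 40, 48, 53, 57]   # close mimic
shifted   = [61, 65, 70, 74, 81, 87, 92, 98]   # poor mimic

print(ks_distance(real, synthetic))  # 0.125
print(ks_distance(real, shifted))    # 1.0: the samples do not overlap
```

A low distance only certifies distributional resemblance; it says nothing about the other contextual measures mentioned above, such as the privacy budget or downstream model performance.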
Evaluating synthetic data frictions
Paul Edwards’ concept of data friction describes the costs and resistances that arise whenever data move between people, organizations, and machines. Synthetic data promises to reduce many of these frictions, from restricted access to privacy constraints.
Synthetic data, however, also introduces new frictions. The lack of representational correspondence can greatly complicate their use and evaluation. Data quality can be difficult to establish; various contextual measures may be in conflict. The intertwinedness of object and purpose means that data collection can no longer be viewed as separate from their use. The following section examines such frictions in privacy, fairness, variability, evidence, and trust.
Privacy
Data friction can emerge from the task-dependent trade-off between utility and privacy protection, a calculation that plays an important role in the distribution of the Census Bureau's data products, such as the US Decennial Census and the ACS. Both are essential datasets for social research; the Decennial Census is used as the basis for countless derived datasets and for drawing the boundaries of congressional districts. The Bureau also conducts a wide range of other surveys that cover many facets of the US population. Survey data are publicly available in two formats: Spatial aggregates and public-use microdata samples (PUMSs), which contain anonymized individual responses.
In the past, the Bureau has aimed to protect such private information by limiting the disclosure of identifiable attributes in the PUMS while offering accurate counts of the same variables in the aggregates. In the early 2000s, however, researchers showed that combining both formats can be leveraged in database reconstruction attacks to re-identify individuals with high confidence (Keller and Abowd, 2023). The only remedy is introducing noise into the disclosed data at a mathematically determined amount: To prevent reconstruction, the noise must be calibrated so that the published statistics can no longer be combined to recover individual records.
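The “mathematically determined amount” can be illustrated with the Laplace mechanism of differential privacy, in which the noise scale is the query's sensitivity divided by the privacy parameter epsilon. This is a toy sketch with an invented population; the Bureau's actual DAS is far more elaborate:

```python
import random

# Toy Laplace mechanism: for a counting query, one person changes the
# result by at most 1 (sensitivity = 1), so noise drawn from
# Laplace(0, sensitivity / epsilon) bounds what any single release reveals.

def dp_count(true_count, epsilon, rng):
    scale = 1.0 / epsilon  # sensitivity 1 for a simple count
    # A Laplace variate is the difference of two exponential variates.
    noise = scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
    return true_count + noise

rng = random.Random(0)
true_count = 1337  # hypothetical precinct population
releases = [dp_count(true_count, 0.5, rng) for _ in range(5000)]
mean = sum(releases) / len(releases)
print(round(mean, 1))  # close to the true count on average; noisy per release
```

Smaller epsilon values (a tighter privacy budget) increase the noise scale, sharpening the utility-privacy trade-off at issue in this section.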
Starting with the 2020 Census, all disclosed summary statistics contain intentional distortions, including the total population counts for all localities below the state level. Additionally, the Bureau plans to offer microdata in the future only in synthetic form. Both initiatives have met with strong resistance from social scientists, who worry about free access to reliable public data (IPUMS, 2023). It has been argued that the threat of database reconstruction has been overstated and that the results of existing re-identification experiments are explainable by mere chance (Ruggles and Van Riper, 2022).
The plan to release ACS microdata in synthetic form has provoked similar controversy. It is unclear whether the synthetic models capture all possible patterns in the data beyond accounting for those already well understood. The Bureau's plan to require researchers to send their results acquired from synthetic data to the Bureau for validation on real data is seen as a practical obstacle that effectively eliminates exploratory data analysis, thus preventing new discoveries (IPUMS, 2023). Finally, the synthetic data models are an abstraction of real observations, lacking outliers and other irregularities due to regularization. But not every form of data analysis is statistical; sometimes, research has to focus precisely on these outliers. Visualization researcher Pedro Cruz's work with ACS microdata explores the composition of multiracial families in the US at the individual level—an example of research that would not be meaningful with synthetic datasets (Figure 2).

Pedro Cruz, diversity traces. 2022.
The composition of synthetic census data products in aggregate and microdata form is determined by their source and purpose. However, in this case, the constraints imposed by the purpose conflict both conceptually and legally. Protecting individuals from disclosure requires a different level of accuracy than the redistricting process, necessitating different synthetic data expressions. However, as foundational datasets, the ACS and the US Census cannot be easily adjusted for a specific purpose.
Fairness
DASs have additional implications for algorithmic fairness. Noise injection tends to diffuse contrasts in the data, which has been shown to introduce biases that disproportionately affect small precincts and make them appear more homogeneous. This results in undercounted minorities in small precincts with mixed populations. This bias has a measurable effect on the redistricting of congressional districts for both racial minorities and voters with minority party affiliations, even in scenarios where fair congressional maps are a priority (Kenny et al., 2021). Since small and racially diverse precincts generally experience larger DAS errors, this bias has many other implications beyond redistricting. One example is less accurate coronavirus disease 2019 mortality estimates, which affect precisely those communities most at risk (Hauer and Santos-Lozada, 2021).
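The disparity can be sketched with back-of-the-envelope arithmetic: the expected absolute error of symmetric noise is fixed by its scale, so the relative distortion grows as precincts shrink. The populations and noise scale below are invented:

```python
# Back-of-the-envelope: the expected absolute error of Laplace(0, b)
# noise equals b regardless of the true count, so the relative
# distortion of a noisy population count scales inversely with size.

noise_scale = 10.0  # hypothetical noise scale b
precinct_populations = {
    "small rural precinct": 120,
    "mid-size precinct": 2400,
    "large urban precinct": 96000,
}

for name, population in precinct_populations.items():
    relative_error = noise_scale / population
    print(f"{name}: {relative_error:.2%} expected relative error")
# small rural precinct: 8.33% ... large urban precinct: 0.01%
```

The same absolute noise that is negligible for a large city thus overwhelms the signal in a small, diverse precinct, which is the mechanism behind the biases described above.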
While privacy-protecting perturbations can introduce harmful biases, synthetic data methods are also explored to eliminate such biases. Many datasets are collected opportunistically from unreliable sources such as self-selected participants or scraped from the web. They are necessarily biased: collected by similar people in similar situations, covering cases that are similar to each other. That machine learning models tend to amplify biases in training data has been well documented (Buolamwini and Gebru, 2018; Wang et al., 2019; Wang et al., 2020). The first step to addressing this issue is introducing protected attributes and ensuring that every group is adequately represented in the training dataset. Unfortunately, this creates another friction: those protected attributes are also prone to revealing respondent identities and are, therefore, often removed from published datasets, complicating the task of bias correction.
Issues of implicit biases in synthetic data can be demonstrated by asking a popular large language model (LLM) such as OpenAI's GPT-4 to generate an artificial census dataset. The output for the city of Boston includes a list of synthetic citizens with plausible occupations, salaries, education, and even street addresses. Specifying the upscale neighborhood of Beacon Hill produces Asian investment bankers, white real estate developers, professors, and architects with high salaries (Figure 3). Specifying the working-class district of Roxbury results in predominantly black and Hispanic virtual citizens with jobs such as barber or construction worker (Figure 4). One may conclude that the LLM knows a lot about Boston until one asks the model for specific data about average salaries, only to find out that the results differ each time. We are quickly reminded that the LLM does not have an internal database about Boston's socio-economic geography but has merely learned to mimic linguistic patterns, including the format of census datasets—the racial and occupational information results from learned stereotypes about Boston's population.

Synthetic residents of Boston's Beacon Hill neighborhood, generated by GPT-4.

Synthetic residents of Boston's Roxbury neighborhood, generated by GPT-4.
The reproduction and amplification of hidden stereotypes are in friction with the earlier issue of the loss of hidden correlations. A brute-force solution may involve removing protected attributes from the training set. However, when information is missing, curation alone is not sufficient; uncorrelated data must be synthesized. Ramaswamy et al. (2021) offer the example of a CV dataset featuring many images of people wearing both hats and sunglasses as a surrogate for more racially sensitive cases. The correlation between hats and glasses can be broken, the researchers suggest, by removing hats in some images and adding glasses in others. While such approaches may seem clumsy, it is claimed that models trained on such rebalanced data make fewer mistakes and, therefore, fairer judgments (Chari et al., 2022). The trade-off between accuracy, privacy, and fairness, however, requires a nuanced look at correlation, especially since it is not always possible to determine whether a pattern corresponds to an external feature or is an artifact exaggerated by the model. As adversarial attack demonstrations show, it is difficult to anticipate how even a balanced dataset translates into model behavior. Since biases are easier to find in real observations, the lack of representation creates data friction that complicates debugging synthetically de-biased datasets.
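The rebalancing logic can be reduced to a toy calculation: start with labels in which "hat" and "glasses" always co-occur, add synthesized counterexamples, and watch the correlation collapse. All counts are invented:

```python
# Toy version of breaking a spurious correlation: in the original data,
# hats and glasses always co-occur; synthesized variants supply the
# missing combinations so a model can learn the concepts separately.

def phi(pairs):
    """Phi coefficient between two binary attributes."""
    n11 = sum(1 for a, b in pairs if a and b)
    n10 = sum(1 for a, b in pairs if a and not b)
    n01 = sum(1 for a, b in pairs if not a and b)
    n00 = sum(1 for a, b in pairs if not a and not b)
    denom = ((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)) ** 0.5
    return (n11 * n00 - n10 * n01) / denom

# (hat, glasses) labels: perfectly correlated original dataset.
original = [(1, 1)] * 40 + [(0, 0)] * 40
# Synthetic images: hats removed in some, glasses added in others.
augmented = original + [(1, 0)] * 40 + [(0, 1)] * 40

print(phi(original))   # 1.0 - the two concepts are indistinguishable
print(phi(augmented))  # 0.0 - the correlation is broken
```

Whether the synthesized counterexamples look natural is beside the point here; what matters for the model is that the attribute co-occurrence no longer carries signal.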
Variability
To de-bias a dataset with fake sunglasses and hats means increasing its variability. Now, the dataset has more training images of people with only hats or glasses, allowing the model to distinguish these two concepts. Beyond accurate representation, many training datasets have an additional objective: To cover all possible cases, regardless of their improbability. It is insufficient for a diagnostic model to work in 99% of typical situations if it is meant to recognize anomalies. Identifying edge cases before they arise requires imagination and speculation to create counterfactual scenarios.
In his political analysis of synthetic data, Jacobsen describes synthetic data as a
Amplifying the contrasts within the dataset can help, as Jacobsen puts it, “directing and disciplining the attention” of the algorithm and expanding its “field of vision” to include edge cases it might encounter (Jacobsen, 2023: 5). What may be more surprising is the sometimes extreme degree of these amplifications. Synthetic datasets for training facial recognition systems include not only realistic faces of all races, genders, and ethnicities under different light conditions (Figure 5) but also extremely distorted, cartoonish faces with unrealistic skin tones that one is unlikely to encounter in real life (Boutros et al., 2023). Whether such cartoon faces actually enhance the performance of the model or merely serve as good-enough proxies, in both cases, they emphasize that synthetic data is not just an inferior substitute for real-world observations but an entirely different beast. Synthetic data are an intervention that often contradicts rather than mimics empirical content.

Examples from Microsoft's Face Synthetics dataset.
Evidence
If a model trained on distorted data can perform better than one trained with real observations, it poses an interesting question: what constitutes realism in synthetic training data, and what exactly is their evidentiary value? In Leonelli's words, evidence is the “grounds on which specific claims about reality acquire credibility” (Leonelli, 2016: 199). In the case of synthetic data, these grounds are no longer ontological claims about reality but the behavior of the trained model.
But while exaggerated variability can lead to a more reliable model, other forms of variability get lost in synthetic data. Most datasets based on real-world observations contain outliers and irregularities, artifacts of the data collection method, or technical glitches. Such outliers pose challenges for statistical modeling. Trying to capture them in the model often leads to overfitting, resulting in overcomplicated models that perform well on their own training data, but adapt poorly to other situations. Statistical modeling, whether performed manually or through self-supervised machine learning, therefore prefers simpler models over more complex ones. Data produced by statistical models are always abstracted abstractions.
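The preference for simpler models can be illustrated with a toy comparison between a model that memorizes every training point, outliers included, and one that deliberately abstracts. All numbers are invented:

```python
# Toy contrast between a memorizing model (1-nearest neighbor) and a
# deliberately simple one (the training mean) on data with one outlier.

train = [(0, 1.0), (1, 1.1), (2, 0.9), (3, 9.0)]  # last point: a glitch
test = [(4, 1.0), (5, 1.05)]

def predict_mean(x):
    return sum(y for _, y in train) / len(train)

def predict_1nn(x):
    return min(train, key=lambda point: abs(point[0] - x))[1]

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

print(mse(predict_1nn, train))  # 0.0: the memorizer fits even the glitch
print(mse(predict_1nn, test) > mse(predict_mean, test))  # True
```

The memorizer achieves a perfect score on its own training data but lets the outlier dominate nearby predictions, while the cruder model generalizes better, which is exactly the trade-off regularization encodes.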
The abstracted outliers, however, are not always the result of technical glitches. Sometimes, they are the proverbial needle in the haystack, evidence of the individual case that stands out from the rest. Other times, they provide crucial context for deciding whether information is trustworthy in debates around misinformation: noise patterns and artifacts can betray image manipulations or AI-generated imagery. Finally, errors and glitches themselves carry relevant information and offer clues about the circumstances of data collection. While the sciences strive for generalization, forensics focuses on individualization: looking close enough to find what makes a particular datum different from all others (Kirschenbaum, 2008: 10).
In this regard, a synthetic dataset is akin to an airbrushed portrait with all imperfections removed—imperfections that, however, also make it authentic. Correcting glitches and biases may improve accuracy but comes at the cost of losing the information embodied in the excluded outliers. Because it focuses on the “content” of data, the representational model offers few conceptual tools for this problem. In contrast, the relational model allows a critical examination of the data origin by considering data as indexical signs (Weatherby and Justie, 2022) or material traces that bear the inscriptions of their generation (Offenhuber, 2019).
However, the relational model is not without its problems. As described, it involves three aspects: first, data generation as a chain of material interactions; second, the social framing of epistemic contexts, which determines what is accepted as evidence; and third, the effects and performances of data as their defining aspect. These three aspects, however, are often in conflict with each other, and without a real-world reference, it becomes difficult to resolve these conflicts. Replacing the question “What does it represent?” with “Does it work?” is not enough when grappling with adversarial attacks that can mysteriously throw off a deep learning model through seemingly inconsequential modifications. 15
Trust
Whether a real and a synthetic dataset are statistically equivalent cannot always be easily established. No introductory statistics textbook would be complete without a discussion of Anscombe's quartet, a cautionary tale in four charts against blind reliance on statistics (Figure 6). The quartet shows scatterplots of four datasets that clearly describe very different phenomena. However, these four distinct patterns share the same summary statistics, regression line, and standard error. Anscombe's argument challenges the notion that “performing intricate calculations is virtuous, whereas actually looking at the data is cheating” and cautions that any analysis should be sensitive to “whatever background information is available about the variables” (Anscombe, 1973).

Anscombe's quartet: Four datasets with the same descriptive statistics, including number of observations, means, regression line, and standard error.
However, while visual verification is trivial for simple datasets, it is impossible for massive synthetic datasets and machine learning models with an equally large number of parameters. A term frequently emphasized in machine learning literature is trust: a range of different tests, comparisons, and assessments help to build trust that a seemingly unrealistic synthetic datum can serve its purpose as part of the training dataset (Emam et al., 2020b). In the absence of ground truth, trust is built through a multitude of relationships. The solution to the trap of Anscombe's quartet is introducing yet another metric. Rather than a clear-cut set of quantitative standards, these quantitative metrics become mere signals in a larger interpretation-based assessment.
The relational nature of synthetic data expands into the scaffolding of a multitude of metrics of variation and performance. Without a simple reference to ground truth, data frictions emerge from the need to cross-reference unmoored metrics, laboriously evaluating agreements and discrepancies.
Contamination
Whether Gartner's estimate of the imminent flood of synthetic data is justified or not, the contamination of the web with generative AI content is a frequently voiced concern in public discourse. It is difficult to estimate how much new online content, including blog posts, comments, and reviews, is written with the help of LLMs. Still, the amount is certainly increasing, and such content is unwittingly sampled into new training datasets. As online data sources are commonly used in research and industry, synthetic data infiltrates even traditional survey responses. An experiment on the crowdworking platform Mechanical Turk showed that roughly half of the crowdworkers used LLMs such as ChatGPT to generate their survey responses (Veselovsky et al., 2023).
As a result, it will become even more difficult to observe human behavior and public sentiments online. A highly convincing deep fake of a president taking a bribe almost seems less harmful than the possibility of easily generating an entire discourse through AI astroturfing, with synthetic users discussing the allegation and thus influencing unwitting bystanders. In addition, any number of synthetic datasets can be generated and pointed to as evidence in the debate (Porsdam Mann et al., 2023). This scenario is a weaponized version of Asch's conformity experiments, in which participants ignored their own observations to conform with the statements of actors disguised as co-participants (Asch, 1951).
Conclusion—the ethics of anything goes
Any conclusion about the various issues surrounding synthetic data has to begin with a discussion of its ethical implications, including questions about honesty, fairness, and the well-being of others. The use of synthetic data is ostensibly driven by ethical considerations, such as protecting the privacy of individuals and decreasing biases in datasets. Since synthetic data are quick and easy to generate in any desired volume, they may mitigate the tech sector's inclination towards surveillance and reduce the risk of exposing personal information. At the same time, synthetic data raise new ethical issues. Their ease and convenience fit too well into the model of capitalistic acceleration (Steinhoff, 2022), and a nuanced discussion of their data issues may fall by the wayside.
The malleability of synthetic data raises the temptation of fabricating an “Instagram reality” of data: an idealized representation heavily mediated by beauty filters and image manipulations. While the frictions of real-world datasets often resist computational analysis, synthetic data can be made to perform beautifully in statistical models by correcting biases, seamlessly filling gaps, sanitizing and regularizing outliers, scaling up resolution by in-painting plausible details, increasing variability, and decreasing ambiguity. With the representational chain of custody broken, synthetic methods allow for generating representational relationships and resemblances at will.
In many forms of synthetic data, the representational relationship is not merely implicit but entirely absent. The goal of differential privacy (DP) is to dissociate a dataset entirely from the observed population. Training datasets are assembled with the aim of enabling a model to recognize events that are otherwise rarely observed. Who and what is represented in synthetic datasets becomes a complicated question. As discussed in this paper, the loss of a representational relationship means that data problems often go unnoticed—problems that are often revealed by artifacts, glitches, and data imperfections. Claims of debiasing training data by introducing extreme outliers warrant some practical skepticism. It is doubtful whether data in which marginalized groups are misrepresented can be debiased on a meaningful scale.
This paper calls for a relational approach to synthetic data, focusing on the circumstances of data generation, the purposes data are used for, and the contexts in which data serve as evidence. Instead of focusing on generalization, it calls for individualization: examining what makes a particular case unique. This uniqueness is found not in the data content but in the data context: its settings, performances, and interactions. At the same time, the paper addresses the shortcomings of the relational model in situations where representation remains essential.
And yet, the phenomenon of synthetic data offers important provocations that can be productive for the critical data discourse. The epistemic weirdness of the phenomenon challenges traditional assumptions about data. Switching from a representational to a performative notion of data realism, in which the performance of the trained model is the main metric, or adopting a definition of data that starts with the intended use rather than their descriptive capacity, forces us to think relationally and contextually about data. If there is one lesson we can draw from synthetic data, it is that data have a speculative, imaginary, and counterfactual dimension. As Judea Pearl's ladder of causation tells us, establishing causality requires more than simply observing data; it requires intervention and the imagination of counterfactual scenarios (Pearl and Mackenzie, 2018).
