Abstract
1. Introduction
The very first privacy notions in official statistics can be found in the randomized response mechanism (RR; Warner 1965), which protects the answers to sensitive survey questions by having respondents report their true answer only with a given probability and a randomized answer otherwise. Later, Dalenius (1977) formulated an ambitious privacy desideratum: access to the released statistical data should not allow an attacker to learn anything about an individual that could not be learned without that access.
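Warner's mechanism can be sketched in a few lines; the function names and the choice p = 0.75 below are ours for illustration, not part of the original design:

```python
import random

def randomized_response(true_answer: bool, p: float) -> bool:
    """Report the true answer with probability p, its negation otherwise."""
    return true_answer if random.random() < p else not true_answer

def estimate_proportion(reports, p):
    """Unbiased estimate of the true 'yes' proportion from RR reports."""
    lam = sum(reports) / len(reports)   # observed 'yes' rate
    return (lam + p - 1) / (2 * p - 1)  # invert the randomization

# Simulate: 30% of 100,000 respondents hold the sensitive attribute.
random.seed(42)
truth = [random.random() < 0.3 for _ in range(100_000)]
reports = [randomized_response(t, p=0.75) for t in truth]
estimate = estimate_proportion(reports, 0.75)  # close to 0.3
```

Each individual answer is deniable (it may have been flipped), yet the aggregate proportion remains estimable, which is the essence of the mechanism.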
As the need for privacy-preserving data sharing grew, it was soon realized that (i) Dalenius-like privacy notions were too stringent to produce protected data with reasonable analytical utility, and (ii) non-interactive data releases were much more useful and in demand than interactive query answering. Hence, the first practical approaches to anonymization focused on the release of statistical databases and were utility-first: anonymization parameters are iteratively adjusted until sufficient analytical utility is empirically achieved while the observed disclosure risk stays below a certain threshold. Examples of these heuristic mechanisms include data suppression, sampling, noise addition, microaggregation, data swapping, and value generalization.
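As an illustration of one of these heuristics, univariate fixed-size microaggregation can be sketched as follows (a simplification of the algorithms used in practice; the function name and group size are ours):

```python
def microaggregate(values, k=3):
    """Univariate fixed-size microaggregation: sort the values,
    partition them into groups of at least k, and replace each
    value by its group mean."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    start, n = 0, len(order)
    while start < n:
        # The last group absorbs the remainder so every group has >= k members.
        end = n if n - start < 2 * k else start + k
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            out[i] = mean
        start = end
    return out

ages = [23, 25, 24, 41, 40, 67, 65, 66, 39]
microaggregate(ages, k=3)
# → [24.0, 24.0, 24.0, 40.0, 40.0, 66.0, 66.0, 66.0, 40.0]
```

Each released value is a group mean, so no single record's exact value appears in the output, at the cost of some within-group information loss.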
To overcome the limitations of the empirical nature of classical SDC methods, the computer science community approached the problem in the late 90s from the privacy-first angle by introducing the notion of privacy model, by which an ex ante privacy condition is enforced using one or several SDC methods. Privacy models have the advantage of setting the privacy level by design, without having to empirically evaluate it. The first privacy model was k-anonymity (Samarati and Sweeney 1998), which requires each record in the released data to be indistinguishable from at least k−1 other records regarding the values of the quasi-identifier attributes.
In 2006, a new privacy model named ϵ-differential privacy (DP) was proposed in the cryptographic community, which aimed to bring to data sharing a privacy guarantee based on indistinguishability by reformulating the old Dalenius desideratum in an even more stringent manner: “the risk to one’s privacy […] should not substantially increase as a result of participating in a statistical database” (Dwork 2006).
DP was originally proposed for interactive statistical queries to a database, where a randomized query function (one that returns the query answer plus some noise) satisfies ϵ-DP if, for any two databases that differ in one record, the probabilities of obtaining any given answer differ by at most a multiplicative factor e^ϵ (ϵ is called the budget). This ensures that the presence or absence of any single record does not significantly affect query answers. The smaller ϵ, the higher the protection, with ϵ ≤ 1 being considered “safe” (Dwork 2011). The noise to be added to the answer to enforce a certain ϵ depends on the global sensitivity of the query, that is, on the maximum change in the query answer caused by the presence or absence of any single record. One can see that the enforcing mechanism employed by DP is very similar to the RR forerunner, as both produce randomized answers to interactive queries with a predefined probability.
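A minimal sketch of the standard Laplace mechanism for a counting query, whose global sensitivity is 1 (the function names and example data are ours):

```python
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

def dp_count(records, predicate, epsilon):
    """epsilon-DP counting query: adding or removing one record changes
    a count by at most 1, so Laplace noise of scale 1/epsilon suffices."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

random.seed(0)
ages = [34, 29, 51, 47, 62, 38, 45]
noisy = dp_count(ages, lambda a: a >= 40, epsilon=1.0)  # true count is 4
```

Each invocation returns the true count perturbed by noise whose magnitude is calibrated to the query's sensitivity divided by the budget ϵ.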
Thanks to its neat privacy guarantee, DP has been rapidly adopted by the research community, to the point that previous approaches tend to be regarded as obsolete. In fact, DP has become the de facto privacy standard in most data-intensive areas, such as data analytics, statistics, and machine learning (Wood et al. 2018). However, while mild noise may suffice to enforce DP on aggregated statistical queries (e.g., averages), its stringent privacy guarantee requires a very large amount of noise to be added for identity queries that return the contents of a specific record. In those cases, the noise needed to enforce any safe-enough ϵ is so large that the DP-protected outcomes become nearly random and, therefore, analytically useless. The latter is precisely the case of microdata releases, which are more in demand than interactive queries (Domingo-Ferrer et al. 2021). The research community has devoted enormous efforts during the last eighteen years trying to fit (or bend) DP to those settings, with the arduous aim of reconciling the DP privacy guarantee with sufficient data utility.
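The contrast between aggregate and identity queries can be made concrete with a back-of-the-envelope calculation, assuming (for illustration) an age attribute ranging over [0, 100] and the Laplace mechanism:

```python
def laplace_scale(sensitivity, epsilon):
    """Scale of the Laplace noise needed to enforce epsilon-DP."""
    return sensitivity / epsilon

epsilon = 1.0
attribute_range = 100.0  # ages in [0, 100]
n = 10_000               # records in the database

# Aggregate query: adding/removing one record changes the mean by at
# most range/n, so the required noise is negligible.
mean_noise = laplace_scale(attribute_range / n, epsilon)     # 0.01

# Identity query: returning one record's value can change by the full
# attribute range, so the noise is as large as the values themselves.
identity_noise = laplace_scale(attribute_range, epsilon)     # 100.0
```

The identity query needs noise four orders of magnitude larger than the mean query, which is why DP-protected record-level outputs become nearly random.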
2. Challenges
A shortcoming of the utility-first approach traditionally embraced by the SDC community, especially in official statistics, is that it requires (possibly several rounds of) empirical disclosure risk assessment to make sure the protected data are safe enough. Such an assessment is expensive, and it requires selecting plausible attack scenarios and parameters. In recent years, there has been a tendency to seek ex ante privacy guarantees. Whereas official statisticians have mostly opted for synthetic data generation (Burnett-Isaacs et al. 2022), computer scientists have favored privacy models, especially DP. We next sketch the challenges associated with those ex ante approaches.
Synthetic data are artificial data preserving some of the statistical features of the original data. Their main attraction is that, being artificial, it would seem they can be released without privacy concerns. However, they are not without issues. If they are generated with the only requirement that certain statistics be exactly preserved, their utility is limited to those statistics, whereas other statistics and subdomain analyses are not preserved. Then one might wonder why not just publish the statistics to be preserved rather than a synthetic data set. On the other hand, if synthetic data are generated by fitting a statistical or machine learning (ML) model to the original data and then sampling that model, the issue of potential overfitting appears. Indeed, sampling an overfitted model is likely to yield synthetic data extremely close to the original: while this is good for utility, it is very bad for privacy (re-identification and attribute disclosure become likely). Overfitting is especially worrisome when using complex ML models, such as deep neural networks, whose millions of parameters may encode a snapshot of the confidential original data they are trained on.
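A deliberately simple sketch of the fit-then-sample approach, using a univariate Gaussian as the model (an illustrative assumption, not a production generator); a model with only two parameters cannot memorize individual records, unlike an overfitted one:

```python
import random
import statistics

def fit_and_sample(original, n_synthetic):
    """Fit a simple parametric model (a Gaussian) to the original
    attribute and sample synthetic values from it."""
    mu = statistics.mean(original)
    sigma = statistics.stdev(original)
    return [random.gauss(mu, sigma) for _ in range(n_synthetic)]

random.seed(1)
incomes = [random.gauss(30_000, 8_000) for _ in range(1_000)]
synthetic = fit_and_sample(incomes, 1_000)
# The synthetic data preserve the fitted statistics (mean, variance),
# but any feature not captured by the model (skew, subdomain patterns,
# correlations with other attributes) is lost.
```

The trade-off described above is visible here: the simpler the model, the safer the synthetic data, but also the fewer the statistics they preserve.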
The deployment of privacy models has mostly focused on DP, whereas the k-anonymity family of models has received comparatively little attention, despite its more intuitive guarantee and easier enforcement.
Combining synthetic data and DP amounts to training the synthetic data generation models on the original data in a differentially private manner. Since training is an iterative procedure, the sequential composition property of DP applies, which means that the budget consumed grows with the number of iterations (Abadi et al. 2016) and hence the privacy level decreases. Despite this budget increase, the impact of DP training on the utility of the trained model remains significant (Domingo-Ferrer et al. 2021), which means that the utility of the synthetic data obtained by sampling the model can be seriously affected. In fact, Blanco-Justicia et al. (2022) show that anti-overfitting techniques applied during ML model training may yield a better privacy-utility trade-off than DP.
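Under basic sequential composition the per-iteration budgets simply add up; tighter accountants, such as the moments accountant of Abadi et al. (2016), give smaller totals, but the qualitative effect is the same (the per-epoch budget below is an illustrative assumption):

```python
def total_budget(per_step_epsilons):
    """Basic sequential composition: the epsilons of successive DP
    computations on the same data add up."""
    return sum(per_step_epsilons)

# Training a generator for 100 epochs at epsilon = 0.1 per epoch
# exhausts a total budget of 10 -- far above the 'safe' epsilon <= 1.
eps_total = total_budget([0.1] * 100)
print(round(eps_total, 2))  # 10.0
```

To stay under a safe total budget, the per-iteration noise must grow with the number of iterations, which is precisely what degrades the utility of the trained model.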
3. Directions
From the challenges above, it becomes clear that empirical disclosure risk assessment remains as unavoidable for synthetic or DP data as it was under the utility-first traditional approach to SDC. DP can only provide enough ex ante privacy to dispense with ex post disclosure risk assessment when the budget ϵ is really small (say below 1), but in this case utility is likely to be too low for most practical applications. When the budget is large, DP is little more than mild classical noise addition, and hence disclosure risk assessment is still needed. Future directions include:
Ex ante privacy that does not require disclosure risk assessment remains a very desirable goal. Privacy models are the way to go but, given the difficulties of deploying DP in practice, exploring the other large family of privacy models (k-anonymity and its extensions, such as l-diversity and t-closeness) seems worth the effort.
Coming up with realistic disclosure risk assessment in practical scenarios is necessary. Potential re-identification/disclosure attacks are sometimes invoked as an argument to justify the use of stronger SDC methods and privacy models. However, no successful attacks against properly anonymized data (e.g., those released by national statistical institutes) have been reported, which contradicts the pessimism in Ohm (2009). Much-publicized re-identification attacks were mounted against data that had undergone mere pseudonymization or suppression of identifiers—the re-identification of the Governor of Massachusetts (Sweeney 1997), the AOL attack (Barbaro and Zeller 2006), and the Netflix Prize Dataset attack (Narayanan and Shmatikov 2008)—or against poorly anonymized data (De Montjoye et al. 2015; Sánchez et al. 2016). Also, the much discussed reconstruction of data compatible with the released statistical outputs does not in general imply re-identification (Muralidhar and Domingo-Ferrer 2023).
When using ML models in data protection, especially to generate synthetic data, overfitting must be avoided. Overfitted models may memorize the entire original data and yield synthetic data extremely close to the original. One option is to use anti-overfitting methods. An alternative is to use only simple models for synthetic data generation, such as statistical models or basic ML methods (decision trees, random forests, etc.). The bottom line is that models used for synthetic data generation should not be too accurate if we do not want the resulting synthetic data to be too close to the original data.
Non-perturbative SDC methods (such as data suppression, generalization, or sampling; see Hundepool et al. (2012)) are probably a better choice than perturbative methods in settings where perturbed data (e.g., DP data obtained via noise addition) might be unacceptable or might invalidate analytical conclusions (e.g., in the medical domain or in official statistics). Non-perturbative methods keep the protected data truthful, and they are sufficient to enforce formal privacy models such as k-anonymity.
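A minimal sketch of generalization-based recoding together with a k-anonymity check over a single quasi-identifier (the attribute values and bin widths are illustrative):

```python
from collections import Counter

def generalize_age(age, width=10):
    """Non-perturbative generalization: recode an age into a truthful
    interval, e.g. 37 -> '30-39'. No noise is added."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def is_k_anonymous(quasi_identifiers, k):
    """Every combination of quasi-identifier values must occur >= k times."""
    return min(Counter(quasi_identifiers).values()) >= k

ages = [23, 25, 24, 41, 40, 67, 65, 66, 39]
is_k_anonymous([generalize_age(a, 10) for a in ages], k=2)  # False: 39 is alone in '30-39'
is_k_anonymous([generalize_age(a, 20) for a in ages], k=2)  # True: coarser bins merge it
```

Note that every generalized value is still true of the underlying record (37 really is in 30-39), which is what makes non-perturbative methods attractive where truthfulness matters.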
Alternative data sharing mechanisms that are privacy- and utility-preserving by design also deserve attention. Let us mention federated learning, anonymous channels, and secure multiparty computation as alternatives to sharing anonymized data.

