Abstract
Keywords
Introduction
Like many social norms, privacy, or the right to privacy, is an evolving term that is invoked in many contexts, as eloquently described by Louis Menand in [25]: “Privacy is associated with liberty, but it is also associated with privilege (private roads and private sales), with confidentiality (private conversations), with nonconformity and dissent, with shame and embarrassment, with the deviant and the taboo (...), and with subterfuge and concealment”.
In order to get some formal underpinning of privacy in the context of electronic data collection and publishing, Li et al. [22] have looked at privacy breaches, studied their general characteristics, and concluded that electronic privacy breaches always end with giving an attacker the ability to identify (using public data) whether an individual is a member of a set or class that had been intended to be anonymous (e.g., the class of individuals with high cholesterol). Hence, they define preservation of privacy as
For the public good, such as the advance of public health, or the fair distribution of government resources, such data is frequently made public. There are also situations in which governmental and commercial organizations collect and analyze data to improve or provide new services. Especially in such cases, society expects a certain level of privacy on the way these organizations use the data. Publishing data with perfect privacy means that no assumption can be made about the prior knowledge an attacker may have about the supposedly anonymous set. Under this assumption, there would be little utility in published data if perfect privacy is expected [11,22]. Therefore, the research community has looked at weaker definitions of “acceptable” privacy. Useful concepts like
In spite of its limitations [12], but because of its formal properties, a privacy notion that has gained a lot of acceptance is
Even though the notion of differential privacy is in principle independent of the data model and query language at stake, so far most practical, automated implementations over well-established languages have been in the context of relational databases, over SQL, and have been restricted to aggregation queries or grouping. Aggregations are queries such as counting, or finding maximum or average values, over a certain data subset; grouping is the creation of histograms based on aggregations.
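To make the distinction concrete, the two query classes can be sketched in a few lines of Python over a hypothetical table of records (all names and values below are purely illustrative):

```python
# Toy illustration of the two query classes discussed above:
# aggregation (e.g., COUNT) and grouping (histograms of counts).
# The table and column names are hypothetical.

records = [
    {"city": "Lyon", "cholesterol": 210},
    {"city": "Lyon", "cholesterol": 180},
    {"city": "Nice", "cholesterol": 240},
]

# Aggregation: count the records matching a condition.
high = sum(1 for r in records if r["cholesterol"] > 200)

# Grouping: a histogram of counts per city.
histogram = {}
for r in records:
    histogram[r["city"]] = histogram.get(r["city"], 0) + 1

print(high)       # number of high-cholesterol records
print(histogram)  # per-city counts
```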
Furthermore, to allow for reasonable approximations of sensitivity, the support of these implementations for queries with
In the past decades, graph data models have enjoyed growing adoption in comparison to the more traditional relational model. One notable example is the RDF data standard, queried using the SPARQL language, which has become extremely popular, in particular for its role in the development of the Semantic Web. By the simple nature of RDF, it can be stored using binary relations [6], and most interesting queries will require operations equivalent to joins. This raises the question whether Johnson
In this paper we provide a positive answer to this question by presenting an algorithm that can answer counting queries over a large class of SPARQL queries that guarantees differential privacy. This result has been made possible by introducing the notion of a
We demonstrate the applicability of our approach by implementing a differential privacy query engine that uses the approximation to answer counting and grouping SPARQL queries, and evaluate the implementation running simulations using the Wikidata knowledge base [34].
The rest of the paper is organized as follows: in Section 2 we introduce the reader to the fundamental concepts of differential privacy. In Section 3, we present the core concepts of SPARQL used within the paper, including the notion of differential privacy schema. In Section 4 we describe the subset of queries over RDF graphs for which we provide differential privacy. In Section 5 we prove the correctness of our proposed approximation to sensitivity, and in Section 6 we evaluate the effectiveness of our proposed approximation in an implementation that we apply to both synthetic and real world datasets and queries. We present related work in Section 7, and we conclude the paper in Section 8.
Preliminaries about differential privacy
We now describe the framework of differential privacy, the problem that arises when applying differential privacy to SQL queries with general joins and how it has been addressed by the scientific community.
Definition
Intuitively, a randomized algorithm [26] is differentially private if it behaves similarly on similar input datasets. To formalize this intuition, the framework of differential privacy relies on a notion of
Let
This inequality establishes a quantitative closeness condition between
Establishing differential privacy for numeric queries of limited sensitivity is relatively simple. The Laplacian mechanism [9] says that we can obtain a differentially private version of query
Here,
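As a minimal sketch (not the paper's implementation), the Laplacian mechanism can be realized by sampling Laplace noise with scale sensitivity/epsilon, here via inverse-CDF sampling:

```python
import math
import random

def laplace_noise(scale):
    # Inverse-CDF sampling from the Laplace(0, scale) distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def laplace_mechanism(true_answer, sensitivity, epsilon):
    # Noise with scale sensitivity / epsilon yields epsilon-differential
    # privacy for a query of the given global sensitivity [9].
    return true_answer + laplace_noise(sensitivity / epsilon)
```

For a counting query (global sensitivity 1), `laplace_mechanism(count, 1.0, epsilon)` returns a noisy count whose expected value is the true count.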
In practice, when implementing the Laplacian mechanism we approximate the global sensitivity of queries by exploiting their structure: numeric queries are typically constructed by first transforming the original dataset using some standard transformers and then returning as final result some aggregation on the obtained dataset. For example, we join two tables, filter the result (dataset transformations) and return the count (aggregation) of the obtained table. The global sensitivity of such a query can be estimated from the so-called stability properties of the involved transformers. Intuitively, a stable transformer can increase the distance between nearby datasets at most by a multiplicative factor. Formally, we call a dataset transformer
Conversely, the use of transformers with unbounded stability might result in queries of unbounded sensitivity. A prominent example of a transformer exhibiting this problem is join. Assume we join two tables, say
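The blow-up is easy to reproduce: if a join key occurs m times on one side and n times on the other, the join emits m * n rows for that key, so a single extra input row can change the output size by an amount that grows with the data. A toy sketch with hypothetical data:

```python
# A join between two "tables" (lists of pairs) multiplies the
# contributions of a repeated key: m matching rows on the left and
# n on the right produce m * n output rows, so the transformer's
# stability is unbounded.

left  = [("alice", "acme")] * 3   # 3 people at the same company
right = [("acme", "lyon")] * 4    # 4 facts about that company

joined = [(person, comp, city)
          for (person, comp) in left
          for (comp2, city) in right
          if comp == comp2]

print(len(joined))  # 3 * 4 = 12 rows from a single join key
```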
To handle queries that involve transformers of unbounded stability, such as joins, we require the use of more advanced techniques. The Laplacian mechanism calibrates noise according to the query alone, overlooking the fact that queries are evaluated on concrete datasets; hence the employed noise could potentially be customized for each dataset. Nissim
The
Observe that
For answering a query
A function
We can readily achieve differential privacy by adding noise calibrated according to a smooth upper bound of the query local sensitivity [28, Corollary 2.4].
The benefits of this mechanism are twofold. On the one hand, it allows handling queries that fail to have a bounded
To apply the mechanism from Theorem 2, we must provide a smooth upper bound for the local sensitivity of queries. We can construct the smooth upper bound using approximations for the local sensitivity at fixed distances.
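As a sketch of this construction, given a function `local_sens_at(k)` that upper-bounds the local sensitivity at every graph within distance k of the actual one, the smooth upper bound is the maximum of exp(-beta * k) * local_sens_at(k) over k (truncating the supremum at a finite max_k is a simplification for this sketch):

```python
import math

def smooth_upper_bound(local_sens_at, beta, max_k):
    # local_sens_at(k): non-decreasing upper bound on the local
    # sensitivity at distance k.  The exact construction takes the
    # supremum over all k >= 0; we truncate at max_k for simplicity.
    return max(math.exp(-beta * k) * local_sens_at(k)
               for k in range(max_k + 1))

# Hypothetical linear bound: local sensitivity grows by 1 per step.
print(smooth_upper_bound(lambda k: k + 2, 0.1, 50))
```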
The goal of Section 5 is to apply the differential privacy mechanism from Theorem 2 to SPARQL counting queries. To do so, we will use Lemma 1 to derive smooth upper bounds of the local sensitivity of queries. In turn, this requires constructing upper bounds for the local sensitivity of queries at fixed distances, for which we will leverage
In this section we examine the semantic information over RDF graphs that needs to be considered in order to answer counting queries in a differentially private manner. This comprises a data schema and upper bounds on the predicate multiplicities.
Privacy schema
Motivation
As mentioned earlier, the goal of differential privacy is to protect the (possibly sensitive) contribution of each individual within a dataset when publicly releasing aggregate information about the dataset – in our case, the result of counting queries. In the relational model, individuals are typically

RDF graph
To be able to apply differential privacy to a dataset in the form of an RDF graph, we must thus begin by identifying the different types of entities present in the graph, and the set of individuals in each type. Consider, for instance, the RDF graph
), two companies (depicted in
) and two cities (depicted in
). Said otherwise, there are two individuals of each entity type, adding up to six individuals in all. When querying the graph, we will be interested in protecting the contribution of all these individuals, and when applying differential privacy techniques to this end, we will then consider as a neighbor any other graph that differs in the contribution of either of them.
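The grouping of triples into per-individual contributions can be sketched as follows; the predicates and entity names here are hypothetical, and a real implementation would derive the grouping from the schema information discussed next:

```python
# Group the triples of a toy graph by entity ("star center").  The
# point is that each individual contributes exactly the sub-graph
# touching its center; neighboring graphs differ in one such sub-graph.

triples = [
    ("alice", "worksFor", "acme"),
    ("alice", "livesIn",  "lyon"),
    ("bob",   "worksFor", "acme"),
    ("acme",  "hq",       "lyon"),
]

# Hypothetical: predicates whose subject is a person-type individual.
person_preds = {"worksFor", "livesIn"}

induced = {}
for s, p, o in triples:
    if p in person_preds:
        induced.setdefault(s, []).append((s, p, o))

print(induced["alice"])  # the sub-graph contributed by one individual
```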
We refer to the semantic information necessary to identify the individuals in an RDF graph
We briefly review some basic RDF terminology following standard notation used in the literature [14,16], where more details can be found. The RDF language assumes the existence of an infinite set
(Note: the calculation of query sensitivity will depend on the domain of blank nodes, which can change under different contexts and implementations; this is a topic of future research. The more general definition of triple patterns also allows for variables in the predicate component of triple patterns.)
To hint at how we can formally define entities within an RDF graph, observe first the six colored sub-graphs identified in Fig. 1: two green, two blue and two red. All have a star shape [1], consisting of a “center” with outgoing and/or incoming edges,
A BGP is called a star BGP if both the subject and the object of all its triple patterns are variables, all triple patterns have different predicates, and it consists of either a single triple pattern with no join vertex, or multiple triple patterns with a single join vertex, which appears once and only once in every triple pattern
(Star BGP).
The three stars employed to identify the different entities in our running example (Fig. 1) are (modulo variable renaming): 
Note that in each The notion of
,
and
. Formally, we define the
A
The set
is a dp-schema, as well as its sub-set
. However, this second schema falls short of describing the whole graph
(Induced sub-graphs).
Let
(Induced sub-graph).
The star
induces two sub-graphs
and
over
. These correspond to the blue sub-graphs in Fig. 1, which are formally defined as: 
Likewise,
, where 
Intuitively, the set of induced sub-graphs
, but it is not materialized in
. From the privacy point of view, the values of the star center are the unique identifiers of the entities contributing the data in each sub-graph, and their values must be kept confidential.
The notion of induced sub-graphs naturally extends from single stars, to dp-schemas,
.
Now we have all the prerequisites to define the core concept of this section:
(Dp-schema compliance).
We say that an RDF graph
(Dp-schema compliance).
Our running example
. In contrast, it does not comply with dp-schema
.
In summary, if a graph
Discussion
We now address a few key points about dp-schemas.
(which identifies companies) and
(which identifies people). However, for the sake of protection it makes no difference to which star an individual belongs, since our application of differential privacy will protect the contribution of all individuals, regardless of their type.
Predicate multiplicity
Automatic approaches to answer dataset queries in a differentially private manner are typically obtained by adding noise to the query results, calibrated according to their sensitivity. Thus, a prerequisite to apply differential privacy to counting queries over RDF graphs is that they have bounded sensitivity. Unfortunately, this does not occur in the general case.
To see this, consider graph
) is replaced by somebody else’s contribution. The query answer over this neighboring graph can certainly be any integer
This problem arises because of the presence of predicates that are not
On the formal level, we associate such bounds to triple patterns rather than to predicates. This is because in the presence of compliant dp-schemas, predicates are identified with triple patterns (every predicate within a dp-schema occurs in a single triple pattern, in a single star).
(Triple-pattern multiplicity).
Let
Many predicates (or equivalently, triple patterns) would have a natural multiplicity bound of 1. For instance, a city has a unique
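A hypothetical helper for checking such multiplicity bounds against a concrete graph might look as follows (the predicate names are illustrative; the observed value must be dominated by the bound the administrator declares):

```python
from collections import Counter

def observed_multiplicity(triples, predicate):
    # Maximum number of triples with the given predicate that share a
    # subject.  A sound multiplicity bound for the predicate must be at
    # least this value on every graph the administrator intends to allow.
    counts = Counter(s for s, p, o in triples if p == predicate)
    return max(counts.values(), default=0)

triples = [
    ("alice", "coworker", "bob"),
    ("alice", "coworker", "carol"),
    ("lyon",  "mayor",    "m1"),
]
print(observed_multiplicity(triples, "coworker"))  # 2
print(observed_multiplicity(triples, "mayor"))     # 1
```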
Henceforth, in the remainder we assume the system administrator provides a dp-scheme
For bounding the local sensitivity of queries in Section 5, it will suffice to use a coarser notion of multiplicity, at the level of stars rather than triple patterns. The required generalization is straightforward: let
(Star multiplicity).
Assume that a graph administrator adopts dp-schema
and requires that each individual
will have multiplicity
.
Queries
In this section we describe the subset of queries over RDF graphs for which we provide differential privacy, and show how dp-schemas enable a decomposition result for the evaluation of such queries.
Supported queries
We develop differential privacy for counting queries over the SPARQL fragment of basic graph patterns with filter expressions, also known as
For simplicity, we consider only CBGPs that are semantically valid. We also assume that in a graph
Take RDF graph
. Now assume we want to know: how many people have a coworker in a company with headquarters in a city with over 20 daily robberies? The query can be cast in terms of the CBGP
Triple patterns in an RDF graph compliant with a dp-schema
which contains the triple pattern
Finally, a The query from the previous example can be expressed as
Continuing with the previous example, assume we want to evaluate
induces on
Hence, for any query and for any two triple patterns
Because the stars in
For the CBGP from Example 4.1,
,
, but they are considered different elementary BGPs because they share predicate
and
, respectively.
If
, but remain different members of
.)
Note also that, since the elementary BGP that a triple pattern belongs to is determined by the center of the triple pattern, all triple patterns in the same split must share the same variable or RDF term as their center. The interest in
Then, following the terminology defined in [29] for joins between multisets of solution mappings, we can extend the lemma to
We have already observed that
We are now in a position to establish differential privacy for SPARQL count and histogram queries.
In this section we develop all the prerequisites to extend Lemma 1 from the relational model to the graph model, in terms of the
Preliminary notions
We begin by defining the notions of size and distance between RDF graphs. These are straightforward adaptations of the relational case, where the induced sub-graphs play the role of table rows. Concretely, the size of a graph refers to the number of individuals present in it. Formally, given an RDF graph
Moreover, we say that two graphs are
Finally, this notion of distance between RDF graphs readily induces a notion of
In order not to clutter the presentation, we usually omit the underlying dp-schema when referring to the size of an RDF graph, the distance between a pair of RDF graphs, and the local sensitivity of a
Elastic sensitivity
Our next step is, given a user query
To this end, observe that the naive approach of evaluating the query on every neighbor (at distance
In the remainder of the section we adapt Johnson
In our case, these transformations are given by the CBGP of user queries, more concretely, by their BGP part. We thus introduce the auxiliary notion of BGP
The formal definition of elastic stability relies on the frequency of most popular values. More precisely, if Observe that there might be multiple such triple patterns in
The count of the latter query will be greater than (or equal to) that of the former and, therefore, a valid (possibly looser) upper bound for
To compute the elastic stability of BGP
The intuition behind the base case is easy to grasp. For
We are now ready to define the elastic stability of a BGP
If the covering star
As so defined, the local stability bounds the local sensitivity of counting queries:
The main intuition behind the proof is that changes made to a graph to obtain a new graph at distance 1 are limited to a sub-graph
The proof of this lemma follows the same strategy as the proof in [19, Lemma 2], and is by induction on the length of When In the symmetric case, when
□
We choose the larger of the two values when calculating
Now we have all the prerequisites to define the
And as in Johnson
By case analysis on the type of user query For plain counting queries ( For plain unique counting queries ( For counting queries after grouping (
□
Lemma 5 readily establishes our main result, which allows applying differential privacy to
The theorem follows immediately from Theorem 2 and Lemmas 1 and 5.
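Under the calibration of [28, Corollary 2.4] (beta = epsilon / (2 ln(2/delta)) and Laplace noise of scale 2S/epsilon, where S is the smooth bound), the end-to-end mechanism can be sketched as follows. Here `local_sens_at(k)` stands for any upper bound on the local sensitivity at distance k, such as one derived from elastic stability, and truncating the maximum at `max_k` is a simplification:

```python
import math
import random

def smooth_bound(local_sens_at, beta, max_k):
    # Smooth upper bound: max over k of exp(-beta * k) * local_sens_at(k).
    return max(math.exp(-beta * k) * local_sens_at(k)
               for k in range(max_k + 1))

def dp_count(true_count, local_sens_at, epsilon, delta, max_k=200):
    # Calibration as in [28, Corollary 2.4]:
    #   beta  = epsilon / (2 ln(2/delta))
    #   noise = Laplace with scale 2 * S / epsilon
    beta = epsilon / (2.0 * math.log(2.0 / delta))
    scale = 2.0 * smooth_bound(local_sens_at, beta, max_k) / epsilon
    u = random.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise
```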
Having characterized formally how an algorithm can be implemented to enforce differential privacy on SPARQL queries based on privacy schemas, in this section we present an empirical evaluation of how the algorithm would behave in real scenarios.
Repository

Our dp-schema is defined based on three pairwise disjoint stars,
This gathers the instances of the class
These three stars shouldn't be used directly as a privacy schema because
(Note that this observation suggests that a more subtle definition of privacy scheme would be useful.)
The Wikidata properties that allow joining data from two different stars are
Table showing key statistics about the data in our privacy schema (the largest schema is by far the Humans star)
We selected 26 queries from the query logs in [18,35] containing the predicates used in our dp-schema. Since the number of COUNT queries in the query logs is small [4], for each of these queries we added the
We consider star queries, which are queries covered by a single star from the dp-schema. These queries can only add filters and remove triple patterns from the star. Therefore, a star query is centered around a single join vertex

Star-shaped query

Linear query

Snowflake query
Results of the execution of Wikidata queries using our differential privacy method. Queries with sensitivity “1.0” are star queries, since the sensitivity of a COUNT query over a single star schema is 1 (a COUNT query over a table), and their elastic stability is “x” as described in Section 5.2. If there are joins between star BGPs, the sensitivity increases based on the stability polynomial, calculated according to Theorem 3
We report the results of our evaluation in Table 2, showing the actual count output by the queries and the average result of these queries with added noise (calculated by applying the method described in Section 5). We followed the query schema introduced in Section 5.2, which uses the BGP part of each query to calculate the initial most popular values (
(Blue) squares in Fig. 5 represent star queries (typically accessing a single star within the dp-schema); they have a very low error (and thus a high utility) compared to the other two types of queries, which involved at least one “join” operation across stars.
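This effect follows directly from the noise scale being independent of the query answer: the expected absolute Laplace noise is fixed, so the relative error shrinks as the true count grows. A small simulation (illustrative parameters only, not our actual experimental setup):

```python
import math
import random

def relative_error(true_count, scale, trials=2000):
    # Average |noise| / true_count for Laplace noise of a fixed scale:
    # the same absolute noise hurts small counts far more than large ones.
    random.seed(1)  # fixed seed so the simulation is repeatable
    total = 0.0
    for _ in range(trials):
        u = random.random() - 0.5
        noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
        total += abs(noise) / true_count
    return total / trials

print(relative_error(100, scale=10.0))     # relative error around 10%
print(relative_error(100000, scale=10.0))  # negligible relative error
```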
However, the smaller the result from the

This plot shows that star-shaped queries (blue squares) have the greatest utility, since they are likely not to have joins between stars in the schema. It also shows that queries accessing large amounts of data have high utility. Notice that
The
The study of how to guarantee the privacy of individuals contributing personal data to datasets is a long-studied problem. In this work we have focused on how to guarantee this privacy in RDF data graphs accessed through SPARQL queries using differential privacy. The related work can be roughly classified into approaches that provide some privacy guarantees for accesses to data stored in (social) graphs and those that guarantee privacy over the results returned by SPARQL queries. We briefly review these works in this section.
Privacy over SPARQL
There have been several approaches to address privacy concerns related to queries over RDF data. A good survey can be found in [32]. There is a basic anonymization protection that a SPARQL engine must provide to queries that directly return individuals, as opposed to aggregated data. Similar to the case of relational databases, where attribute values are anonymized using nulls, the work presented in [13] uses blank nodes to hide sensitive data. Delanoux et al. [7] introduce a more general framework with formal soundness guarantees for privacy policies that describe information that should be hidden, as well as utility policies that describe information that should be available. The framework checks whether policies are compatible with each other and, based on a set of basic update queries that use blank nodes and deletions of triples, automatically derives from the policies candidate sets of anonymization operations that are guaranteed to transform any input dataset into a dataset satisfying the required policies. However, their soundness guarantees do not imply any formal privacy guarantees. Two early methods developed for privacy protection when answering queries about classes in a dataset are
The only work known to us that directly applies differential privacy to SPARQL queries is [32]. Surprisingly, however, differential privacy is realized through local sensitivity alone, without the use of a smoothing function necessary for correctness [12]. A privacy-preserving query language for RDF streams is introduced in [8]. By limiting queries to that language, servers can continuously release privacy-preserving histograms (or distributions) from online streams. Han et al. [15] provide differentially-private variants of the algorithms TransE and RESCAL, aimed at constructing knowledge graph embeddings in the form of real-valued vectors. While the authors show that these encodings allow performing some analyses with a reasonable privacy-utility tradeoff, including clustering and link prediction, it is an open question whether this generalizes to further analyses or counting queries as addressed in the current article.
Privacy in social graphs
A central task to the development of any practical differentially private analysis tool is finding appropriate approximations and alternatives to global sensitivity: it should be easy to calculate, and, at the same time, close enough to the real sensitivity to allow the computation of statistically useful results. A well-known approach is to rely on the concept of restricted sensitivity [3]. Restricted sensitivity is tailored to provide privacy guarantees assuming datasets come from a specific subgroup of the universe of all possible datasets, and it was introduced in the context of social-graph data analysis. There are two natural notions from which one can define adjacency of graphs: differences on edges and differences on vertices. The distance between two graphs,
Conclusions
In this paper we have introduced a framework to develop differential privacy tools for RDF data repositories. We have used the framework to develop an
We have implemented our algorithm and tested it using the Wikidata RDF database, queries from its log files and other example queries found at the Wikidata endpoint. The simulations show the approach to be effective for queries over large repositories, such as Wikidata, and in many cases for queries with tens of thousands of answers to aggregate. However, even though elastic sensitivity has been designed to bound the stability of joins, the sensitivity of a query with joins can still be very high. As in the case of SQL queries in relational databases, in order to keep the noise in SPARQL queries below a single-digit percentage, query results should have over 1M tuples and
There are many pending issues to address. We can still apply several optimizations to our framework. For example, public graphs can be treated as public tables. If they participate in joins, we can directly use their most popular result mappings during calculation of the query sensitivity. From the more practical point of view, more operations need to be implemented. We can consider the approaches described in [19] for SQL to add aggregation functions like sum and averages to our framework. There are also issues to consider about the impact that such algorithms will have on SPARQL query engines. From the more formal side, it is still important to keep searching for better approximations of local and global sensitivities as well as alternative definitions that are less onerous than differential privacy. One possibility is to find a way to apply restricted sensitivity to more types of queries by adding more semantic information to a dp-schema. It might also be possible to find a more accurate approximation for the elastic sensitivity of
