Abstract
Introduction
Wireless sensor networks enable real-time monitoring and data collection of various environments or monitored objects. After the collection phase, the network processes the data and transmits it to the user.1 Nowadays, data collection in sensor networks often combines multiple sensors and gathers multiple attributes of data, and such multi-attribute data may leak certain sensitive information.2,3 For example, from pedometer data with the attributes of time and step count, we can infer whether the user is walking or resting.
Example 1
The data collected by a pedometer may leak someone's whereabouts. If we know the number of steps a user took in a certain period of time, we can guess whether he was exercising or lying in bed. To protect privacy and avoid revealing sensitive information, data protection technology is needed for data that may contain sensitive information.
Because sensors are usually deployed in a variety of complex environments, the data acquired at sensor nodes are highly unreliable; as a result, the collected data need to be cleaned before data mining and other data processing.4,5 As the last step for identifying and correcting identifiable errors in data files, data cleaning is an indispensable part of the data analysis process, and the quality of data cleaning directly determines the effectiveness of data analysis models. The purpose of cleaning data is to improve the quality of the data and the accuracy of data analysis and processing results. Verifying the functional dependencies of the knowledge base is the final step of cleaning. If the data do not satisfy the functional dependencies, we say the data contain a contradiction and call them contradictory data. After removing or repairing missing data, removing or modifying data with wrong format or content, and removing unnecessary data, the constraints in the knowledge base must be used to perform correlation verification, thereby correcting contradictory data.
Example 2
The most basic pedometer consists of two parts: a vibration sensor and an electronic counter. The vibration sensor records the periodic motion state; one cycle means one step. The electronic counter then records and displays this movement. The data of the vibration sensor and of the electronic counter are collected in the background and correspond one-to-one. If the vibration sensor shows 1 cycle and the counter also displays 1, the data are likely correct. If the vibration sensor shows 2 cycles but the counter shows 3, the correct number of steps may be 2 or 3; therefore, the record needs to be cleaned and repaired.
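The contradiction check in Example 2 can be sketched as a simple functional-dependency test. This is a minimal illustration with hypothetical field names, not the paper's implementation:

```python
# Hypothetical sketch of Example 2's consistency check: a record violates the
# functional dependency when the vibration-sensor cycle count and the
# electronic-counter step count disagree. Field names are assumptions.

def violates_fd(record):
    """Return True when the tuple is contradictory and needs cleaning."""
    return record["cycles"] != record["steps"]

records = [
    {"cycles": 1, "steps": 1},  # consistent: likely correct
    {"cycles": 2, "steps": 3},  # contradictory: must be cleaned/repaired
]

dirty = [r for r in records if violates_fd(r)]
```

Records flagged this way form the dirty portion of the pending database that the later privacy cleaning step operates on.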
In recent years, data privacy protection has attracted more and more attention. Common data privacy methods include k-anonymity, l-diversity, differential privacy, and so on. However, data with dependencies cannot be protected directly in the way ordinary single-attribute data are; the accuracy of the dependencies between attributes must be preserved while the data are protected.
Differential privacy is a strict privacy protection model that quantifies the strength of privacy protection:6 adding or deleting a record under the differential privacy model hardly affects the query results. At the same time, differential privacy need not consider the background knowledge an attacker may possess, because it protects the potential user privacy information in the published data by adding interference noise, so that even an attacker who already has all information except the attacked record cannot obtain the desired information.7 Traditional differential privacy consolidates raw data into a third-party collection platform that is assumed to be trusted; under this assumption, differential privacy is called centralized differential privacy. In practical applications, finding a truly trusted third party is difficult, so the applicability of centralized differential privacy is very limited. Based on this, the concept of local differential privacy8 was created. Local differential privacy transfers the perturbation process to each user while inheriting the guarantees of centralized differential privacy, and it provides more complete privacy protection.9
In this article, we address data privacy breaches caused by dependencies between data attributes. Based on the correlation verification and repair process in data cleaning,10 we combine the process of repairing contradictory data with a local differential privacy mechanism. That is, we use an improved randomized response mechanism during data cleaning to ensure that the cleaned data satisfy local differential privacy. While simplifying the data processing pipeline, the differential privacy cleaning model exploits the characteristics of dirty data to reduce the amount of added privacy noise and to find a balance between data availability and privacy.
The main steps of the algorithm are as follows: (1) find the external knowledge base corresponding to the pending database and compare it with the corresponding attributes of the pending database; (2) improve the randomized response mechanism so that both the tuples that satisfy the functional dependencies of the external knowledge base and the tuples that violate the constraints are replaced with certain probabilities; and (3) ensure that, after replacement, all tuples in the pending database satisfy the functional dependencies in the external knowledge base.
To summarize our contributions:
We propose H-RR, a model that combines privacy protection with data cleaning.
We combine the two steps of data processing into one, simplifying the complexity of data processing.
We improve the randomized response mechanism to handle data with dependencies between two attributes, whereas most previous randomized response mechanisms can only perturb single-attribute data.
Section “Related works” of this article summarizes and analyzes the main related works of data cleaning and local differential privacy; section “Preliminary” introduces some basic definition knowledge used in this article; section “H-RR” introduces the proposed differential privacy cleaning model; section “Optimizing the H-RR” is the improvement of the algorithm based on section “H-RR”; section “Experiment” introduces the experimental scheme adopted in this article, evaluates the proposed new method through experiments, and analyzes the experimental results; section “Conclusion” summarizes the full text and gives future research directions.
Related works
Improving data availability is an important prerequisite for data analysis.11 Data cleaning, the consolidation and restoration of data, is one of the effective ways to improve data quality, and the quality of cleaning is directly related to data availability. Among its steps, correlation verification between data is important for improving cleaning accuracy and availability. Data correlation verification mainly fixes consistency errors, that is, it repairs a database that violates predefined functional dependencies. Xie et al.12 proposed a probability-based method for repairing data consistency errors. This method presupposes a reasonable repair space that satisfies the consistency constraints; sampling in this space according to the probability distribution yields a reasonable repair, which is used to match and repair the data. Moser13 proposed a threshold-based sampling repair framework based on the quasi-equal-length concept. This method estimates the error of signal reconstruction by reconstructing the sampled signal and applying quasi-equal measures in the sample space. However, most cleaning methods based on functional dependencies do not take data security into account.
With the development of information technology and people's increasing awareness of data security, privacy protection technologies have received more and more attention, and differential privacy in particular has gained prominence in recent years. Against the background of untrusted third parties, Kasiviswanathan et al.8 proposed local differential privacy, which shifts the perturbation process to each user and protects user-level privacy. The main perturbation mechanism of local differential privacy is the randomized response mechanism proposed by Warner.14 To accommodate more complex data structures, the Rappor method proposed by Pihur and Korolova15 is a representative single-valued frequency statistics method that expresses the value of a variable as a string. Bassily and Smith16 proposed the S-Hist method, in which each user encodes a string, randomly selects a bit, perturbs it with a randomized response mechanism, and then sends it to the data collector. Kairouz et al.17 proposed K-RR to handle variables with more than two candidate values. On this basis, randomized response mechanisms have been improved from different perspectives, and the existing local differential privacy perturbation mechanisms are essentially variants of randomized response. In this article, we use the functional dependencies between attributes to optimize the randomized response mechanism for our scenario.
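For reference, Warner's original binary randomized response and its unbiased frequency estimator can be sketched as follows. This is the standard textbook construction, not code from any of the cited works:

```python
import math
import random

def warner_rr(true_value, epsilon=1.0):
    """Warner's randomized response for a binary attribute (0 or 1):
    report the true value with probability p = e^eps / (1 + e^eps),
    otherwise flip it. This satisfies eps-local differential privacy."""
    p = math.exp(epsilon) / (1 + math.exp(epsilon))
    return true_value if random.random() < p else 1 - true_value

def estimate_frequency(responses, epsilon=1.0):
    """Unbiased estimate of the true proportion of 1s. The observed
    proportion satisfies obs = pi*p + (1 - pi)*(1 - p), so inverting
    gives pi = (obs + p - 1) / (2p - 1)."""
    p = math.exp(epsilon) / (1 + math.exp(epsilon))
    observed = sum(responses) / len(responses)
    return (observed + p - 1) / (2 * p - 1)
```

The collector never sees raw values, only perturbed ones, yet the aggregate frequency can still be recovered approximately, which is the essence of local differential privacy.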
In recent years, data cleaning based on privacy protection has also received attention. Huang et al.18 designed a privacy-aware data cleaning model, PARC, which treats privacy protection as a step of data cleaning so that the cleaned data meet privacy requirements; however, PARC only proposes a model framework and does not specify a concrete privacy protection method. Krishnan et al.19 combined local differential privacy with data cleaning and proposed the PrivateClean method for privacy cleaning of discrete and numerical data; its drawback is that it does not consider functional dependencies between attributes.
Current privacy cleaning models simply concatenate the two processes of data cleaning and privacy protection and do not optimize the privacy cleaning process by accounting for the correlation between data attributes. Therefore, we take the external knowledge base as an auxiliary condition and perform data privacy cleaning based on the functional dependencies between attributes and the local differential privacy mechanism.
Preliminary
The summary of symbol notations is shown in Table 1.
Summary of symbol notations.
Data association verification
Performing data association verification based on functional dependencies is one of the important methods for current data cleaning and repair. In this process, we need to build an external knowledge base containing all the correct dependencies and then use this external knowledge base to verify whether a tuple to be cleaned meets the constraints.
Definition 1 (external knowledge base (EKB))
A public collection containing all factual functional dependencies between attributes. In this article, we will consider the EKB
In addition, according to the different properties of attribute
Description 1
If the value in
Example 3
We take the correspondence between the vibration sensor data and the electronic counter as an example. Each vibration-sensor number corresponds to one electronic-counter number, and there are no duplicate numbers; therefore, the correspondence between the vibration sensor data and the electronic counter satisfies Description 1.
Description 2
If the value in
Example 4
Consider a pedometer with a time attribute. In a short time, one person can take only a limited number of steps; for example, most people can walk at most two steps, and run about 4–10 steps, per second. As a result, one time interval can correspond only to several step counts. Therefore, the correspondence satisfies Description 2.
Definition 2 (pending database (PD))
A database before the data are cleaned according to the functional dependencies in the EKB.
Define a pending database
Privacy model
To protect data privacy and reduce the number of processing steps for data with functional dependencies between attributes, we need tuple-level cleaning of the data together with a user-level (tuple-level) privacy protection model, so that privacy protection is combined with data cleaning. Generally, local differential privacy can be used for user-level (tuple-level) privacy protection. The formal definition of local differential privacy is as follows.
For a given
In local differential privacy, each user perturbs its own data and then uploads it to the data collector. No two users know each other's records; that is, there is no concept of global sensitivity in local differential privacy, so the commonly used differential privacy noise mechanisms, the Laplace mechanism and the exponential mechanism, do not apply here. As a result, the randomized response mechanism (W-RR) is the mainstream perturbation mechanism of local differential privacy. Moreover, Kairouz et al. proposed the k-ary randomized response technique (K-RR), which overcomes the restriction of randomized response to binary variables and can be applied directly for cases where the variable contains
For each discrete attribute
where
To ensure local differential privacy, the privacy budget
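The K-RR perturbation described above can be sketched as follows. The sketch assumes the standard direct-encoding probabilities — keep the true value with probability e^ε/(e^ε + k − 1), otherwise report one of the other k − 1 values uniformly — since the paper's own equations are not reproduced here:

```python
import math
import random

def k_rr(true_value, domain, epsilon):
    """k-ary randomized response (direct encoding), a sketch under standard
    assumptions: keep the true value with probability
    e^eps / (e^eps + k - 1), otherwise output one of the other k - 1
    candidate values uniformly at random."""
    k = len(domain)
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if random.random() < p_keep:
        return true_value
    return random.choice([v for v in domain if v != true_value])
```

Note that K-RR perturbs one attribute in isolation; as Description 3 below points out, it therefore cannot by itself guarantee that a functional dependency between two attributes is preserved.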
Description 3
K-RR cannot guarantee attribute function dependency.
The K-RR mechanism only responds randomly to a certain attribute
K-RR can indirectly add noise to attributes
Theorem 1
The privacy budget of K-RR when the external knowledge base satisfies the constraint of function dependencies
For a tuple containing an attribute function dependency, the privacy budget
Problem description
For the functional dependencies in the external knowledge base, the privacy cleaning model integrates correlation verification into the randomized response mechanism, combining data cleaning with local differential privacy so that two data processing steps become one. The problem addressed in this article is how to exploit the noise already present in data that violate the constraint relationship, so that less noise needs to be added when protecting such data. The overall privacy budget can thus be reduced while the data still pass correlation verification, and the two steps of data cleaning and privacy protection are realized together.
In the pending database, the privacy cleaning operation is performed according to the function dependencies in the external knowledge base. If the tuple of the pending database satisfies
H-RR
In this section, we propose a privacy cleaning model, H-RR, for a pending database whose external knowledge base contains Description 1. In the next section, we improve this model to enable privacy cleaning of pending databases whose corresponding external knowledge bases contain Description 2.
Homologous randomized response
First, we define discrete attribute function dependencies as
Definition 3(homologous randomized response (H-RR))
For the function dependencies
where
Lemma 1
Homologous randomized response satisfies local differential privacy.
Proof
For any
According to equation (4) we can get
To ensure local differential privacy, according to definition 3, a privacy budget
Theorem 2
H-RR has a lower privacy budget than K-RR.
Proof
The process by which H-RR processes the pending database is as follows: we first traverse the pending database and then use equation (4) to perturb the data and make it satisfy local differential privacy
As it turns out, H-RR has a smaller privacy budget than K-RR.
Lines 5–16 of Algorithm 1 form the local differential privacy replacement process between the external knowledge base and the pending database. The core of the algorithm is to ensure that, after processing, all tuples in the pending database satisfy the functional dependencies in the external knowledge base A. Line 1 satisfies Lemma 1. Line 3 of the algorithm draws a (0,1) random number, replace_pre, to guarantee the randomness of each tuple's substitution probability. Lines 5–9 handle the replacement when the tuple in the pending database is clean data, and lines 10–16 handle the replacement when the tuple is dirty.
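The control flow just described might be sketched as follows. This is a speculative illustration only: equation (4)'s exact replacement probabilities are not reproduced in the text, so K-RR-style probabilities are assumed, and repairing a dirty tuple to a random consistent tuple is an assumption consistent with the paper's idea of reusing the inherent noise of dirty data:

```python
import math
import random

def h_rr_clean(pending_db, ekb, epsilon):
    """Hypothetical sketch of Algorithm 1's control flow. ekb maps each X
    value to its unique consistent Y value (Description 1), so every output
    tuple satisfies the functional dependency. The probability p_keep below
    is an assumed K-RR-style value, not the paper's equation (4)."""
    k = len(ekb)
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    cleaned = []
    for x, y in pending_db:
        replace_pre = random.random()           # line 3: (0,1) random number
        if ekb.get(x) == y:                     # lines 5-9: tuple is clean
            if replace_pre < p_keep:
                cleaned.append((x, y))          # keep the clean tuple
            else:
                x2 = random.choice([v for v in ekb if v != x])
                cleaned.append((x2, ekb[x2]))   # consistent replacement
        else:                                   # lines 10-16: tuple is dirty
            x2 = random.choice(list(ekb))       # repair to a random tuple
            cleaned.append((x2, ekb[x2]))       # that satisfies the EKB
    return cleaned
```

Whatever the exact probabilities, the invariant the algorithm guarantees is visible here: every emitted tuple satisfies the functional dependency in the external knowledge base, so no secondary repair pass is needed.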
Optimizing the H-RR
Problem description
The H-RR applies only when the knowledge base table meets Description 1. In real life, many functional dependencies meet Description 2 rather than Description 1, such as the city–state relationship. To perform privacy cleaning on such data, we must improve the H-RR.
If the function dependencies of the EKB satisfy Description 2, when the attribute
When the attribute
When the attribute Y is a sensitive attribute, if the tuple
Optimizing the H-RR
According to the analysis in section "Problem description," when privacy cleaning is performed on a pending database whose external knowledge base contains Description 2, it is necessary to consider whether the privacy attribute satisfies the uniqueness constraint, in order to determine whether the privacy attribute has a several-to-one or one-to-several relationship in the external knowledge base. Different attribute relationships are assigned different privacy budgets, so the H-RR must be optimized for the different situations.
Definition 4 (several-for-one homologous randomized response (SH-RR))
If
where
Lemma 2
SH-RR satisfies local differential privacy.
Proof
For any
If
To ensure local differential privacy, according to Definition 4, the privacy budget
Theorem 3
SH-RR has a lower privacy budget than K-RR.
The proof method is the same as that of Theorem 1.
The biggest difference between Algorithm 2 and Algorithm 1 lies in the processing of dirty data. Lines 10–18 are the SH-RR processing of dirty data.
Definition 5 (one-for-several homologous randomized response (OH-RR))
If Y is a sensitive attribute, the conversion mechanism is called OH-RR if the function dependencies
where
Lemma 3
OH-RR satisfies local differential privacy.
Proof
For any
If
To ensure local differential privacy, according to Definition 5, the privacy budget
Theorem 4
OH-RR has a lower privacy budget than K-RR.
The proof method is the same as Theorem 1.
The
Experiment
This section demonstrates the effect of our algorithms on data cleaning and privacy protection through experiments. To simplify the experimental steps, we use cafe licensing data issued by the Chicago government in the United States rather than normal sensor data. The degree of data cleaning is reflected in the availability of the data, so data availability can be used as a measure of data cleaning. Data availability is also one of the important indicators for measuring the quality of a privacy protection model. The degree of privacy protection of the data is mainly measured by
Experiment settings
Data description
The coffee shop license data issued by the Chicago Municipal Government include license number, time of issue, expiration date, cafe address, street, street type, latitude and longitude coordinates, and the police station to which the shop belongs. There is a constraint between street and street type, and the data themselves are clean.
Experimental steps
First, the functional dependency between street and street type is extracted from the original data and written into a new data table as an external knowledge base; two external knowledge bases are created, satisfying the functional dependencies of Description 1 and Description 2, respectively. Second, according to the different functional dependencies contained in the external knowledge bases, two types of pending databases are extracted from the original data, and some tuples are randomly selected from each pending database and turned into dirty data. Because we have not found prior work on the same problem, we first process the pending database using K-RR and then compare the result with those of H-RR, OH-RR, and SH-RR; adjust the value of
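The dirty-data generation step above might look like the following sketch. The field names `street` and `street_type` and the corruption strategy are assumptions for illustration:

```python
import random

def make_dirty(rows, ekb, n_dirty, seed=0):
    """Randomly corrupt n_dirty rows so that their street type no longer
    matches the street -> street-type dependency in the external knowledge
    base ekb. A sketch of the experiment's dirty-data generation; field
    names are assumptions. The input rows are left unmodified."""
    rng = random.Random(seed)
    rows = [dict(r) for r in rows]  # copy so the originals stay clean
    types = sorted(set(ekb.values()))
    for i in rng.sample(range(len(rows)), n_dirty):
        correct = ekb[rows[i]["street"]]
        wrong = rng.choice([t for t in types if t != correct])
        rows[i]["street_type"] = wrong
    return rows
```

Fixing the random seed keeps the set of corrupted tuples reproducible across runs, so the different mechanisms are compared on identical dirty data.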
Data availability
Data availability is an indicator of whether a privacy model is reasonably constructed.20
We examine the performance of the algorithm from two perspectives: data cleaning and privacy protection. The degree of privacy protection is negatively related to data availability: the higher the privacy protection, the lower the data availability, and vice versa. In local differential privacy, the privacy budget
Definition 6 (accuracy rate (Ar)): accuracy of data after cleaning and disturbance
To better measure the practicability of the algorithm, the raw data used in this article satisfy the EKB, so we can easily calculate the accuracy rate by comparing the processed data with the original data. If we take
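Under this definition, the accuracy rate can be computed by a straightforward tuple-wise comparison with the original data (a minimal sketch):

```python
def accuracy_rate(processed, original):
    """Ar: fraction of processed tuples identical to the original tuples.
    Because the raw data satisfy the EKB, a processed tuple equal to its
    original counterpart is both correct and dependency-consistent."""
    assert len(processed) == len(original)
    same = sum(1 for p, o in zip(processed, original) if p == o)
    return same / len(original)
```

For example, if 3 of 4 processed tuples match the originals, the accuracy rate is 0.75.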
Result
The following experiments used K-RR and the three methods proposed in this article to compare the effect of different values on the statistical results.
In local differential privacy, different privacy budgets

Functional relation between
Figure 1(a) shows the relationship between the privacy budget
Because the constraints contained in the external knowledge base are different, we will compare the results of K-RR with the results of H-RR, OH-RR, and SH-RR.
Figure 2 shows the comparison between K-RR and H-RR. The data set has 1000 rows, and the external knowledge base A contains nine constraints. In the pending database, 500 tuples are randomly selected and turned into dirty data. The accuracy rate of the repaired data is calculated after K-RR and H-RR under different privacy budgets.

Accuracy rate of K-RR and H-RR.
It can be clearly seen from the figure that as the privacy budget
When
Figures 3 and 4 show the comparison among K-RR, SH-RR, and OH-RR. The data set has 2000 rows, and the external knowledge base A contains 42 constraints; 1000 randomly selected tuples in the pending database are turned into dirty data. The accuracy rate of the repaired data is calculated after K-RR, SH-RR, and OH-RR under different privacy budgets.

Accuracy rate of K-RR and SH-RR.

Accuracy rate of K-RR and OH-RR.
From Figure 3, it can be clearly seen that with the increase in privacy budget
The experiment in Figure 4 uses the same knowledge base as Figure 3, with 42 constraints. Therefore, when
Conclusion
The model H-RR and its improved versions SH-RR and OH-RR are privacy cleaning algorithms that combine data cleaning and privacy protection for sensor-collected data. The model uses the functional dependencies in the external knowledge base to repair correlation errors in the data, and it incorporates a local differential privacy mechanism so that data cleaning and privacy protection are carried out within a single local differential privacy process. In this article, we utilize the noise inherent in data that violate the functional dependencies, together with the correlation verification repair strategy, to reduce the amount of noise added when protecting such data. Compared with the traditional pipeline in which cleaning and noise addition are separate, the differential privacy cleaning model presented here treats the functional dependencies between attributes as the basis of processing and treats noise addition as part of data cleaning; the noised data therefore satisfy the functional dependencies and require no secondary repair, which greatly simplifies data processing, improves processing speed, and can improve data availability.
However, much in this article can still be optimized. First, the functional dependencies in external knowledge bases can become more complex, such as transitive functional dependencies; the model then needs further improvement according to the relationships between attributes. Second, when the external knowledge base meets the functional dependencies between attributes, we have not considered the case where the sensitivity of
