Abstract
Introduction
Wireless sensor networks enable real-time monitoring and data collection of various environments or monitored objects. After the collection phase, the network processes the data and transmits it to the user.1 Nowadays, data collection in sensor networks often combines multiple sensors and gathers multiple attributes of data, and such multi-attribute data may leak certain sensitive information.2,3 For example, from pedometer data with the attributes of time and step count, we can infer whether the user is walking or resting.
Example 1
The data collected by a pedometer may leak someone's whereabouts. If we know the number of steps a user took in a certain period of time, we can guess whether he was exercising or lying in bed. To protect privacy and avoid revealing sensitive information, data protection technology is needed for data that may contain sensitive information.
Because sensors are usually deployed in a variety of complex environments, the data acquired at sensor nodes are highly unreliable; as a result, the collected data need to be cleaned before data mining and other data processing.4,5 As the last step for identifying and correcting identifiable errors in data files, data cleaning is an indispensable part of the data analysis process, and the quality of data cleaning directly determines the effectiveness of data analysis models. The purpose of cleaning data is to improve the quality of the data and the accuracy of data analysis and processing results. Verifying the functional dependencies of the knowledge base is the final step of cleaning. If the data do not satisfy the functional dependencies, we say the data contain a contradiction and call them contradictory data. After removing or repairing missing data, removing or modifying data with wrong format or content, and removing unnecessary data, the constraints in the knowledge base must be used to perform correlation verification, thereby correcting contradictory data.
Example 2
The most basic pedometer consists of two parts: a vibration sensor and an electronic counter. The vibration sensor records the periodic motion state; one cycle means one step. The electronic counter then records and displays this movement. The data of the vibration sensor and of the electronic counter are collected in the background and correspond one-to-one. If the vibration sensor shows 1 cycle and the counter also displays 1, the data are likely correct. If the vibration sensor shows 2 cycles but the counter shows 3, the correct number of steps may be 2 or 3; therefore, the record needs to be cleaned and repaired.
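The contradiction check in Example 2 can be sketched as a simple functional-dependency test. This is a minimal illustration with hypothetical field names, not the paper's implementation:

```python
# Hypothetical sketch of Example 2's consistency check: a record violates the
# functional dependency when the vibration-sensor cycle count and the
# electronic-counter step count disagree. Field names are assumptions.

def violates_fd(record):
    """Return True when the tuple is contradictory and needs cleaning."""
    return record["cycles"] != record["steps"]

records = [
    {"cycles": 1, "steps": 1},  # consistent: likely correct
    {"cycles": 2, "steps": 3},  # contradictory: must be cleaned/repaired
]

dirty = [r for r in records if violates_fd(r)]
```

Records flagged this way form the dirty portion of the pending database that the later privacy cleaning step operates on.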
In recent years, data privacy protection has attracted more and more attention. Common data privacy methods include k-anonymity, l-diversity, differential privacy, and so on. However, data with dependencies cannot be protected directly in the way ordinary single-attribute data are; the accuracy of the dependencies between attributes must be preserved while the data are protected.
Differential privacy is a strict privacy protection model that quantifies the strength of privacy protection:6 adding or deleting a record under the differential privacy model hardly affects the query results. At the same time, differential privacy need not consider the background knowledge an attacker may possess, because it protects the potential user privacy information in the published data by adding interference noise, so that even an attacker who already has all information except the attacked record cannot obtain the desired information.7 Traditional differential privacy consolidates raw data into a third-party collection platform that is assumed to be trusted; under this assumption, differential privacy is called centralized differential privacy. In practical applications, finding a truly trusted third party is difficult, so the applicability of centralized differential privacy is very limited. Based on this, the concept of local differential privacy8 was created. Local differential privacy transfers the perturbation process to each user while inheriting the guarantees of centralized differential privacy, and it provides more complete privacy protection.9
In this article, we address data privacy breaches caused by dependencies between data attributes. Based on the correlation verification and repair process in data cleaning,10 we combine the process of repairing contradictory data with a local differential privacy mechanism. That is, we use an improved randomized response mechanism during data cleaning to ensure that the cleaned data satisfy local differential privacy. While simplifying the data processing pipeline, the differential privacy cleaning model exploits the characteristics of dirty data to reduce the amount of added privacy noise and to find a balance between data availability and privacy.
The main steps of the algorithm are as follows: (1) find the external knowledge base corresponding to the pending database and compare it with the corresponding attributes of the pending database; (2) improve the randomized response mechanism so that both the tuples that satisfy the functional dependencies of the external knowledge base and the tuples that violate the constraints are replaced with certain probabilities; and (3) ensure that, after replacement, all tuples in the pending database satisfy the functional dependencies in the external knowledge base.
To summarize our contributions:
We propose H-RR, a model that combines privacy protection with data cleaning.
We combine the two steps of data processing into one, simplifying the complexity of data processing.
We improve the randomized response mechanism to handle data with dependencies between two attributes, whereas most previous randomized response mechanisms can only perturb single-attribute data.
Section “Related works” of this article summarizes and analyzes the main related works of data cleaning and local differential privacy; section “Preliminary” introduces some basic definition knowledge used in this article; section “H-RR” introduces the proposed differential privacy cleaning model; section “Optimizing the H-RR” is the improvement of the algorithm based on section “H-RR”; section “Experiment” introduces the experimental scheme adopted in this article, evaluates the proposed new method through experiments, and analyzes the experimental results; section “Conclusion” summarizes the full text and gives future research directions.
Related works
Improving data availability is an important prerequisite for data analysis.11 Data cleaning, the consolidation and restoration of data, is one of the effective ways to improve data quality, and the quality of cleaning is directly related to data availability. Among its steps, correlation verification between data is important for improving cleaning accuracy and availability. Data correlation verification mainly fixes consistency errors, that is, it repairs a database that violates predefined functional dependencies. Xie et al.12 proposed a probability-based method for repairing data consistency errors. This method presupposes a reasonable repair space that satisfies the consistency constraints; sampling in this space according to the probability distribution yields a reasonable repair, which is used to match and repair the data. Moser13 proposed a threshold-based sampling repair framework based on the quasi-equal-length concept. This method estimates the error of signal reconstruction by reconstructing the sampled signal and applying quasi-equal measures in the sample space. However, most cleaning methods based on functional dependencies do not take data security into account.
With the development of information technology and people's increasing awareness of data security, privacy protection technologies have received more and more attention, and differential privacy in particular has gained prominence in recent years. Against the background of untrusted third parties, Kasiviswanathan et al.8 proposed local differential privacy, which shifts the perturbation process to each user and protects user-level privacy. The main perturbation mechanism of local differential privacy is the randomized response mechanism proposed by Warner.14 To accommodate more complex data structures, the Rappor method proposed by Pihur and Korolova15 is a representative single-valued frequency statistics method that expresses the value of a variable as a string. Bassily and Smith16 proposed the S-Hist method, in which each user encodes a string, randomly selects a bit, perturbs it with a randomized response mechanism, and then sends it to the data collector. Kairouz et al.17 proposed K-RR to handle variables with more than two candidate values. On this basis, randomized response mechanisms have been improved from different perspectives, and the existing local differential privacy perturbation mechanisms are essentially variants of randomized response. In this article, we use the functional dependencies between attributes to optimize the randomized response mechanism for our scenario.
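For reference, Warner's original binary randomized response and its unbiased frequency estimator can be sketched as follows. This is the standard textbook construction, not code from any of the cited works:

```python
import math
import random

def warner_rr(true_value, epsilon=1.0):
    """Warner's randomized response for a binary attribute (0 or 1):
    report the true value with probability p = e^eps / (1 + e^eps),
    otherwise flip it. This satisfies eps-local differential privacy."""
    p = math.exp(epsilon) / (1 + math.exp(epsilon))
    return true_value if random.random() < p else 1 - true_value

def estimate_frequency(responses, epsilon=1.0):
    """Unbiased estimate of the true proportion of 1s. The observed
    proportion satisfies obs = pi*p + (1 - pi)*(1 - p), so inverting
    gives pi = (obs + p - 1) / (2p - 1)."""
    p = math.exp(epsilon) / (1 + math.exp(epsilon))
    observed = sum(responses) / len(responses)
    return (observed + p - 1) / (2 * p - 1)
```

The collector never sees raw values, only perturbed ones, yet the aggregate frequency can still be recovered approximately, which is the essence of local differential privacy.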
In recent years, data cleaning based on privacy protection has also received attention. Huang et al.18 designed a privacy-aware data cleaning model, PARC, which treats privacy protection as a step of data cleaning so that the cleaned data meet privacy requirements; however, PARC only proposes a model framework and does not specify a concrete privacy protection method. Krishnan et al.19 combined local differential privacy with data cleaning and proposed the PrivateClean method for privacy cleaning of discrete and numerical data; its drawback is that it does not consider functional dependencies between attributes.
Current privacy cleaning models simply concatenate the two processes of data cleaning and privacy protection and do not optimize the privacy cleaning process by accounting for the correlation between data attributes. Therefore, we take the external knowledge base as an auxiliary condition and perform data privacy cleaning based on the functional dependencies between attributes and the local differential privacy mechanism.
Preliminary
The summary of symbol notations is shown in Table 1.
Summary of symbol notations.
Data association verification
Performing data association verification based on functional dependencies is one of the important methods for current data cleaning and repair. In this process, we need to build an external knowledge base containing all the correct dependencies and then use this external knowledge base to verify whether a tuple to be cleaned meets the constraints.
Definition 1 (external knowledge base (EKB))
A public collection containing all factual functional dependencies between attributes. In this article, we will consider the EKB
In addition, according to the different properties of attribute
Description 1
If the value in
Example 3
We take the correspondence between the vibration sensor data and the electronic counter as an example. Each vibration-sensor number corresponds to one electronic-counter number, and there are no duplicate numbers; therefore, the correspondence between the vibration sensor data and the electronic counter satisfies Description 1.
Description 2
If the value in
Example 4
Consider a pedometer with a time attribute. In a short time, one person can take only a limited number of steps; for example, most people can walk at most two steps, and run about 4–10 steps, per second. As a result, one time interval can correspond only to several step counts. Therefore, the correspondence satisfies Description 2.
Definition 2 (pending database (PD))
A database before the data are cleaned according to the functional dependencies in the EKB.
Define a pending database
Privacy model
To protect data privacy and reduce the number of processing steps for data with functional dependencies between attributes, we need tuple-level cleaning of the data together with a user-level (tuple-level) privacy protection model, so that privacy protection is combined with data cleaning. Generally, local differential privacy can be used for user-level (tuple-level) privacy protection. The formal definition of local differential privacy is as follows.
For a given
In local differential privacy, each user perturbs its own data and then uploads it to the data collector. No two users know each other's records; that is, there is no concept of global sensitivity in local differential privacy, so the commonly used differential privacy noise mechanisms, the Laplace mechanism and the exponential mechanism, do not apply here. As a result, the randomized response mechanism (W-RR) is the mainstream perturbation mechanism of local differential privacy. Moreover, Kairouz et al. proposed the k-ary randomized response technique (K-RR), which overcomes the restriction of randomized response to binary variables and can be applied directly for cases where the variable contains
For each discrete attribute
where
To ensure local differential privacy, the privacy budget
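The K-RR perturbation described above can be sketched as follows. The sketch assumes the standard direct-encoding probabilities — keep the true value with probability e^ε/(e^ε + k − 1), otherwise report one of the other k − 1 values uniformly — since the paper's own equations are not reproduced here:

```python
import math
import random

def k_rr(true_value, domain, epsilon):
    """k-ary randomized response (direct encoding), a sketch under standard
    assumptions: keep the true value with probability
    e^eps / (e^eps + k - 1), otherwise output one of the other k - 1
    candidate values uniformly at random."""
    k = len(domain)
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if random.random() < p_keep:
        return true_value
    return random.choice([v for v in domain if v != true_value])
```

Note that K-RR perturbs one attribute in isolation; as Description 3 below points out, it therefore cannot by itself guarantee that a functional dependency between two attributes is preserved.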
Description 3
K-RR cannot guarantee attribute function dependency.
The K-RR mechanism only responds randomly to a certain attribute
K-RR can indirectly add noise to attributes
Theorem 1
The privacy budget of K-RR when the external knowledge base satisfies the constraint of function dependencies
For a tuple containing an attribute function dependency, the privacy budget
Problem description
For the functional dependencies in the external knowledge base, the privacy cleaning model integrates correlation verification into the randomized response mechanism, combining data cleaning with local differential privacy so that two data processing steps become one. The problem addressed in this article is how to exploit the noise already present in data that violate the constraint relationship, so that less noise needs to be added when protecting such data. The overall privacy budget can thus be reduced while the data still pass correlation verification, and the two steps of data cleaning and privacy protection are realized together.
In the pending database, the privacy cleaning operation is performed according to the function dependencies in the external knowledge base. If the tuple of the pending database satisfies
H-RR
In this section, we propose a privacy cleaning model, H-RR, for a pending database whose external knowledge base contains Description 1. In the next section, we improve this model to enable privacy cleaning of pending databases whose corresponding external knowledge bases contain Description 2.
Homologous randomized response
First, we define discrete attribute function dependencies as
Definition 3(homologous randomized response (H-RR))
For the function dependencies
where
Lemma 1
Homologous randomized response satisfies local differential privacy.
Proof
For any
According to equation (4) we can get
To ensure local differential privacy, according to definition 3, a privacy budget
Theorem 2
H-RR has a lower privacy budget than K-RR.
Proof
The process by which H-RR processes the pending database is as follows: we first traverse the pending database and then use equation (4) to perturb the data and make it satisfy local differential privacy
As it turns out, H-RR has a smaller privacy budget than K-RR.
Lines 5–16 of Algorithm 1 form the local differential privacy replacement process between the external knowledge base and the pending database. The core of the algorithm is to ensure that, after processing, all tuples in the pending database satisfy the functional dependencies in the external knowledge base A. Line 1 satisfies Lemma 1. Line 3 of the algorithm draws a (0,1) random number, replace_pre, to guarantee the randomness of each tuple's substitution probability. Lines 5–9 handle the replacement when the tuple in the pending database is clean data, and lines 10–16 handle the replacement when the tuple is dirty.
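The control flow just described might be sketched as follows. This is a speculative illustration only: equation (4)'s exact replacement probabilities are not reproduced in the text, so K-RR-style probabilities are assumed, and repairing a dirty tuple to a random consistent tuple is an assumption consistent with the paper's idea of reusing the inherent noise of dirty data:

```python
import math
import random

def h_rr_clean(pending_db, ekb, epsilon):
    """Hypothetical sketch of Algorithm 1's control flow. ekb maps each X
    value to its unique consistent Y value (Description 1), so every output
    tuple satisfies the functional dependency. The probability p_keep below
    is an assumed K-RR-style value, not the paper's equation (4)."""
    k = len(ekb)
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    cleaned = []
    for x, y in pending_db:
        replace_pre = random.random()           # line 3: (0,1) random number
        if ekb.get(x) == y:                     # lines 5-9: tuple is clean
            if replace_pre < p_keep:
                cleaned.append((x, y))          # keep the clean tuple
            else:
                x2 = random.choice([v for v in ekb if v != x])
                cleaned.append((x2, ekb[x2]))   # consistent replacement
        else:                                   # lines 10-16: tuple is dirty
            x2 = random.choice(list(ekb))       # repair to a random tuple
            cleaned.append((x2, ekb[x2]))       # that satisfies the EKB
    return cleaned
```

Whatever the exact probabilities, the invariant the algorithm guarantees is visible here: every emitted tuple satisfies the functional dependency in the external knowledge base, so no secondary repair pass is needed.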
Optimizing the H-RR
Problem description
The H-RR applies only when the knowledge base table meets Description 1. In real life, many functional dependencies meet Description 2 rather than Description 1, such as the city–state relationship. To perform privacy cleaning on such data, we must improve the H-RR.
If the function dependencies of the EKB satisfy Description 2, when the attribute
When the attribute
When the attribute Y is a sensitive attribute, if the tuple
Optimizing the H-RR
According to the analysis in section "Problem description," when privacy cleaning is performed on a pending database whose external knowledge base contains Description 2, it is necessary to consider whether the privacy attribute satisfies the uniqueness constraint, in order to determine whether the privacy attribute has a several-to-one or one-to-several relationship in the external knowledge base. Different attribute relationships are assigned different privacy budgets, so the H-RR must be optimized for the different situations.
Definition 4 (several-for-one homologous randomized response (SH-RR))
If
where
Lemma 2
SH-RR satisfies local differential privacy.
Proof
For any
If
To ensure local differential privacy, according to Definition 4, the privacy budget
Theorem 3
SH-RR has a lower privacy budget than K-RR.
The proof method is the same as that of Theorem 1.
The biggest difference between Algorithm 2 and Algorithm 1 lies in the processing of dirty data. Lines 10–18 are the SH-RR processing of dirty data.
Definition 5 (one-for-several homologous randomized response (OH-RR))
If Y is a sensitive attribute, the conversion mechanism is called OH-RR if the function dependencies
where
Lemma 3
OH-RR satisfies local differential privacy.
Proof
For any
If
To ensure local differential privacy, according to Definition 5, the privacy budget
Theorem 4
OH-RR has a lower privacy budget than K-RR.
The proof method is the same as Theorem 1.
The
Experiment
This section demonstrates the effect of our algorithms on data cleaning and privacy protection through experiments. To simplify the experimental steps, we use cafe licensing data issued by the Chicago government in the United States rather than normal sensor data. The degree of data cleaning is reflected in the availability of the data, so data availability can be used as a measure of data cleaning. Data availability is also one of the important indicators for measuring the quality of a privacy protection model. The degree of privacy protection of the data is mainly measured by
Experiment settings
Data description
The coffee shop license data issued by the Chicago Municipal Government include license number, time of issue, expiration date, cafe address, street, street type, latitude and longitude coordinates, and the police station to which the shop belongs. There is a constraint between street and street type, and the data themselves are clean.
Experimental steps
First, the functional dependency between street and street type is extracted from the original data and written into a new data table as an external knowledge base; two external knowledge bases are created, satisfying the functional dependencies of Description 1 and Description 2, respectively. Second, according to the different functional dependencies contained in the external knowledge bases, two types of pending databases are extracted from the original data, and some tuples are randomly selected from each pending database and turned into dirty data. Because we have not found prior work on the same problem, we first process the pending database using K-RR and then compare the result with those of H-RR, OH-RR, and SH-RR; adjust the value of
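The dirty-data generation step above might look like the following sketch. The field names `street` and `street_type` and the corruption strategy are assumptions for illustration:

```python
import random

def make_dirty(rows, ekb, n_dirty, seed=0):
    """Randomly corrupt n_dirty rows so that their street type no longer
    matches the street -> street-type dependency in the external knowledge
    base ekb. A sketch of the experiment's dirty-data generation; field
    names are assumptions. The input rows are left unmodified."""
    rng = random.Random(seed)
    rows = [dict(r) for r in rows]  # copy so the originals stay clean
    types = sorted(set(ekb.values()))
    for i in rng.sample(range(len(rows)), n_dirty):
        correct = ekb[rows[i]["street"]]
        wrong = rng.choice([t for t in types if t != correct])
        rows[i]["street_type"] = wrong
    return rows
```

Fixing the random seed keeps the set of corrupted tuples reproducible across runs, so the different mechanisms are compared on identical dirty data.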
Data availability
Data availability is an indicator of whether a privacy model is reasonably constructed.20
We examine the performance of the algorithm from two perspectives: data cleaning and privacy protection. The degree of privacy protection is negatively related to data availability: the higher the privacy protection, the lower the data availability, and vice versa. In local differential privacy, the privacy budget
Definition 6 (accuracy rate (Ar)): accuracy of data after cleaning and disturbance
To better measure the practicability of the algorithm, the raw data used in this article satisfy the EKB, so we can easily calculate the accuracy rate by comparing the processed data with the original data. If we take
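Under this definition, the accuracy rate can be computed by a straightforward tuple-wise comparison with the original data (a minimal sketch):

```python
def accuracy_rate(processed, original):
    """Ar: fraction of processed tuples identical to the original tuples.
    Because the raw data satisfy the EKB, a processed tuple equal to its
    original counterpart is both correct and dependency-consistent."""
    assert len(processed) == len(original)
    same = sum(1 for p, o in zip(processed, original) if p == o)
    return same / len(original)
```

For example, if 3 of 4 processed tuples match the originals, the accuracy rate is 0.75.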
Result
The following experiments used K-RR and the three methods proposed in this article to compare the effect of different values on the statistical results.
In local differential privacy, different privacy budgets

Functional relation between
Figure 1(a) shows the relationship between the privacy budget
Because the constraints contained in the external knowledge base are different, we will compare the results of K-RR with the results of H-RR, OH-RR, and SH-RR.
Figure 2 shows the comparison between K-RR and H-RR. The data set has 1000 rows, and the external knowledge base A contains nine constraints. In the pending database, 500 tuples are randomly selected and turned into dirty data. The accuracy rate of the repaired data is calculated after K-RR and H-RR under different privacy budgets.

Accuracy rate of K-RR and H-RR.
It can be clearly seen from the figure that as the privacy budget
When
Figures 3 and 4 show the comparison among K-RR, SH-RR, and OH-RR. The data set has 2000 rows, and the external knowledge base A contains 42 constraints; 1000 randomly selected tuples in the pending database are turned into dirty data. The accuracy rate of the repaired data is calculated after K-RR, SH-RR, and OH-RR under different privacy budgets.

Accuracy rate of K-RR and SH-RR.

Accuracy rate of K-RR and OH-RR.
From Figure 3, it can be clearly seen that with the increase in privacy budget
The experiment in Figure 4 uses the same knowledge base as Figure 3, with 42 constraints. Therefore, when
Conclusion
The model H-RR and its improved versions SH-RR and OH-RR are privacy cleaning algorithms that combine data cleaning and privacy protection for sensor-collected data. The model uses the functional dependencies in the external knowledge base to repair correlation errors in the data, and it incorporates a local differential privacy mechanism so that data cleaning and privacy protection are carried out within a single local differential privacy process. In this article, we utilize the noise inherent in data that violate the functional dependencies, together with the correlation verification repair strategy, to reduce the amount of noise added when protecting such data. Compared with the traditional pipeline in which cleaning and noise addition are separate, the differential privacy cleaning model presented here treats the functional dependencies between attributes as the basis of processing and treats noise addition as part of data cleaning; the noised data therefore satisfy the functional dependencies and require no secondary repair, which greatly simplifies data processing, improves processing speed, and can improve data availability.
However, much in this article can still be optimized. First, the functional dependencies in external knowledge bases can become more complex, such as transitive functional dependencies; the model then needs further improvement according to the relationships between attributes. Second, when the external knowledge base meets the functional dependencies between attributes, we have not considered the case where the sensitivity of
