Abstract

Although analyzing and mining users' trajectory data can provide outstanding benefits, data owners may not be willing to upload their trajectory data because of privacy concerns. Recently, differential privacy technology has achieved a good trade-off between data utility and privacy preservation by publishing noisy outputs, and relevant schemes have been proposed for trajectory release. However, we experimentally find that a relatively accurate estimate of the true data value can still be obtained from the noisy outputs by means of a posterior estimation, and there is no practical mechanism to verify the effectiveness and resistance of current schemes. To fill this gap, we propose a solution to evaluate the resistance performance of differential privacy on trajectory data release, including a notion of correlation-distinguishability filtering (CDF) and a privacy quantification measurement. Specifically, taking advantage of the filtering principle that independent noise can be filtered out from a correlated sequence, CDF is proposed to sanitize the noise added into the trajectory. To put this notion into practice, we apply a Kalman/particle filter to filter out the corresponding Gaussian/Laplace noise added by differential privacy schemes. Furthermore, to quantify the distortion of privacy strength before and after filtering, an entropy-based privacy quantification metric is proposed, which measures the lost uncertainty of the true locations for an adversary. Experimental results show that the resistance performance of current approaches degrades to varying degrees under the filtering attack model in our solution. Moreover, the privacy quantification metric can be regarded as a unified criterion to measure the privacy strength introduced by noise that does not conform to the form required by differential privacy.

1. Introduction

Trajectory data, generated by GPS terminals, smartphones, etc., are a typical form of spatio-temporal data. The data record the behavioral features of moving objects, including positions, timestamps and other attributes. Aggregating and mining trajectory data are beneficial for governments, businesses and individuals in many fields, such as travel route recommendation [1, 2], road traffic dispatching [3] and environmental protection (e.g. air quality monitoring) [4, 5]. For example, trajectory data uploaded by the user to the service provider can be used to provide better navigation service. Moreover, trajectories aggregated and mined by a third party can support business hotspot analysis. As the above examples suggest, trajectory data are significantly useful in knowledge discovery and acquisition. Nonetheless, trajectory data publishing without special sanitization may violate an individual's privacy. For example, by observing a user's uploaded trajectories from Monday to Friday, an adversary can infer that the most frequent starting position in these trajectories is his/her home address. Due to privacy leakage concerns [6, 7], data owners may not be willing to publish their trajectory data. A privacy disclosure instance of trajectory data release is illustrated in Example 1 and Fig. 1.

Figure 1. Privacy disclosure instance of trajectory data release.

Example 1.
Consider the scenario of location-based tourist attraction recommendation: if the user Amy wants to find interesting tourist spots near her planned travel route, she must upload her travel route to a service provider and launch a recommendation query to it. After analyzing Amy's travel route, the provider returns the spots near the route to her. In this interactive process, the provider may be semi-trusted, i.e. on the one hand, it renders the desired location-based service (LBS) to Amy; on the other hand, it transfers the trajectory data to the third party (e.g. researchers who may be untrusted), exposing Amy to the risk of location privacy disclosure. As a result, Amy's precise position is known to the untrusted third party. Thus, how to preserve an individual's trajectory data privacy while not affecting LBS results is a challenging issue.

The problem of private trajectory data publishing has attracted attention from researchers spanning multiple disciplines [8–11]. For example, Abul et al. [12] proposed a novel concept of |$k$|-anonymity based on co-localization that exploits the inherent uncertainty of the moving object's whereabouts. Cicek et al. [13] proposed a new privacy metric, |$p$|-confidentiality, that ensures location diversity by bounding the probability of a user visiting a sensitive location with the |$p$| input parameter. Gursoy et al. [8] presented AdaTrace, a scalable location trace synthesizer with three novel features: provable statistical privacy, deterministic attack resilience and strong utility preservation. Among these technologies, random perturbation introduces uncertainty (e.g. random noise) about individual values, and a small amount of noise has little impact on data utility. Therefore, it has become a widely accepted and practical approach for private data publishing. Among the alternatives, differential privacy [14, 15] is a state-of-the-art standard privacy notion. By introducing independent and identically distributed (IID) Gaussian (loose privacy guarantee but better utility) or Laplace (rigorous privacy guarantee but worse utility) noise, it provides a privacy guarantee that can be mathematically proved. An obvious advantage of differential privacy is that it guarantees strong security regardless of the extent of background knowledge an adversary has of the data. However, standard differential privacy intrinsically assumes that the data intended to be protected are independent, so it is not directly suitable for trajectory release, since the locations in a trajectory are always correlated.

State-of-the-art schemes have extended the notion of differential privacy to correlated trajectory data, and they can be categorized into model-based and transform-based methods. The model-based methods establish specific correlation models, e.g. Markov [16], Bayesian [17–19] or correlation degree matrix models [20], to describe the correlation of the original data. The IID noise is then recalculated according to these models. The transform-based methods either transform the correlated positions in the trajectory into an independent series in another domain (e.g. by the discrete Fourier transform (DFT) [21] or the discrete wavelet transform (DWT) [22]) or extract a set of independent properties to express the correlation features of the positions (e.g. by principal component analysis (PCA) [23]). The perturbation is then added to the transformed series or extracted properties.
Despite the improvement in privacy guarantee brought by these two kinds of trajectory data publishing schemes utilizing differential privacy technology, a more complete solution is still to be explored, since current approaches face the following challenges:

Privacy distortion. The Gaussian or Laplace noise employed in current methods is IID, but the positions in the trajectory are always correlated, so the IID noise may destroy the correlation of the trajectory to a certain extent. Current schemes may therefore exhibit a privacy or utility distortion compared with the expected level. However, up to now, there is no practical method to evaluate the degree of the privacy or utility distortion caused by the differences in correlations.

Privacy measurement. Most model-based or transform-based methods extend the notion of differential privacy to trajectory data based on their own privacy metrics, which makes it difficult to compare the protection performance of different methods. In addition, after the sanitizing operation, the retained noise may not strictly obey a Gaussian or Laplace distribution, so continuing to use the privacy metric of standard differential privacy is not appropriate.

These challenges imply that both a practical method to assess the risk of privacy distortion and a privacy metric to measure the retained privacy after sanitizing are still in high demand. Regarding the evaluation method, we find that, despite its Gaussian or Laplace form, the noise added into the original trajectory is IID, while the positions in a trajectory are always correlated. According to signal processing theory, the injected IID noise may destroy the correlation of the trajectory, leading to distortion of the privacy degree and data utility. In terms of the second challenge, entropy is an information theoretical approach to measure the amount of information, indicating the uncertainty an adversary has about the true data values. Thus, it can be regarded as a universal privacy measurement regardless of the form of noise. Based on this criterion, we design a privacy metric suitable for trajectory data.

Based on these observations, we propose a practical mechanism, including a notion of correlation-distinguishability filtering (CDF) and an entropy-based privacy metric (EBPM), to evaluate the effectiveness and the degree of privacy distortion of current schemes. CDF gives a formal definition of the verification mechanism using filtration and further designs two kinds of filters to conduct CDF in practice. EBPM measures the retained privacy after the filtering operation. We evaluate the performance of CDF and EBPM on real-world trajectory datasets. To the best of our knowledge, this is the first work utilizing a filtering-based mechanism and an entropy-based privacy metric to evaluate the resistance of current differential privacy schemes for trajectory data release. Our contributions are threefold:

A notion of CDF is proposed to verify the effectiveness of current schemes. CDF attempts to estimate the true trajectory data from the perturbed trajectory to verify the effectiveness of current schemes. Theoretical analysis demonstrates the privacy distortion under CDF and confirms that the noise added into the trajectory by current schemes can indeed be filtered out to a certain extent, so a scheme cannot ensure the privacy guarantee it claims.
Two kinds of practical filters are designed to conduct CDF. To conduct CDF efficiently in practice, we design two optimal filters (a Kalman filter and a particle filter) against the Gaussian and Laplace noise of differential privacy, respectively. Since the structures of these two filters are easy to build, they can be implemented efficiently on equipment with limited computing power.

An entropy-based privacy metric is proposed to measure the retained privacy strength of current methods after filtering. Since the noise after filtering does not conform to a definite form, we propose a privacy measurement based on entropy that evaluates the uncertainty before and after filtering, exploring the certainty an adversary retains about the true data values.

The remainder of this paper is organized as follows. In Section 2, we summarize related work on differentially private publication of trajectory data and describe the limitations of existing methods. We then briefly introduce the notations and definitions adopted in this work and illustrate the schematic diagram of privacy distortion in Section 3. Our proposal and experiments are described in Sections 4 and 5, respectively, followed by the conclusions and future work in Section 6.

2 Related work

Existing differential privacy preserving methods for trajectory data release can be categorized into model-based methods and transform-based methods. The model-based methods establish specific models to describe the correlations of trajectory data and recalculate the noise according to these models. The transform-based methods transform the correlated positions into an independent series in another domain or extract a set of independent properties to express the correlations. Next we describe these two kinds of methods in detail.

2.1 Model-based mechanisms

Among the model-based methods, Cao et al. [24] proposed a correlated hidden Markov detection model to deal with the problem that abnormal data may raise the global sensitivity. They detected and removed the abnormal data by applying the one-step transition probability, which can decrease the noise level added to the original data. However, this model assumes that the releasing probability of the current data is only relevant to its immediate predecessor, which degrades the detection results. To increase the accuracy of the detection results, Yang et al. [17] proposed a privacy definition called Bayesian differential privacy. They constructed a Gaussian correlation model, which assumes that the data to be released conform to a Gaussian distribution. Apart from these probability models, Zhu et al. [20] built a correlated degree matrix to measure the overall relationship between records. The coefficients of the correlated degree matrix are used as weights to rebuild the sensitivity function, in place of the traditional global sensitivity. The correlated sensitivity can therefore be used to decrease the redundant noise introduced by the global sensitivity. These model-based methods can preserve the privacy of correlated trajectory data under their assumptions. However, the correlations of trajectory data are always complicated and can rarely be represented by a single model. Thus, a certain risk of privacy leakage still exists in these methods.

2.2 Transform-based mechanisms

In the transform-based methods, a typical approach is to transform the trajectory into an independent series in another domain, so that the trajectory can be processed independently, as sketched below.
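The transform-then-perturb idea can be illustrated with a minimal sketch (assuming NumPy; the coefficient count |$k$| and the noise scale are illustrative placeholders, not the calibrated parameters of any cited scheme):

```python
import numpy as np

rng = np.random.default_rng(0)

def dft_perturb(coords, k=20, scale=1.0):
    """Perturb the first k DFT coefficients of a 1-D coordinate series,
    then invert the transform (sketch of the transform-then-perturb idea)."""
    spec = np.fft.rfft(coords)                   # correlated series -> frequency domain
    noise = rng.laplace(0.0, scale, size=2 * k)  # IID Laplace noise
    spec[:k] += noise[:k] + 1j * noise[k:]       # perturb the retained coefficients
    spec[k:] = 0.0                               # discard the high-frequency tail
    return np.fft.irfft(spec, n=len(coords))     # back to the spatial domain

lat = np.cumsum(rng.normal(0.0, 0.01, size=2000))  # a smooth synthetic latitude series
noisy_lat = dft_perturb(lat)
```

Because the inverse transform spreads the coefficient noise over the whole series, the noise in the spatial domain no longer has a simple Laplace form, which foreshadows the |$\epsilon $|-DP caveat discussed at the end of this subsection.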
For example, Rastogi et al. [21] transformed the trajectory into an independent series in another domain by applying the DFT, and then added the noise to the Fourier coefficients. A perturbed trajectory is then obtained by applying the inverse DFT. However, the DFT is a global transformation, which cannot accurately describe the local features of the original trajectory. As an improvement, Xiao et al. [22] expanded the range of applications by applying the DWT, which preserves more features of the trajectory in comparison with the DFT. To deal with high-dimensional trajectories, Jiang et al. [23] extracted the features of the trajectory using the properties of PCA, and then these correlated features were classified into several groups of independent features by applying singular value decomposition (SVD). Compared with the model-based methods, although these transform-based methods can ensure high data utility, the noise they add may not conform to |$\epsilon $|-DP, since the noise form is changed by the inverse transformation.

2.3 Summary

In terms of differential privacy preservation for trajectory data release, existing methods recalculate the sensitivity by establishing correlation models or by transforming correlated data into independent series. The noise corresponding to the sensitivity is then added into the original data. However, the noise generated by current schemes is still IID and risks destroying the correlation of the trajectory. Thus, the privacy intensity of these schemes may not be as strong as they claim. In this paper, we aim to address the following issues: Are the existing differentially private publishing methods for trajectory data effective, and how can their resistance be verified? How large is the privacy distortion if current schemes do not achieve the privacy level they claim? How can we measure the privacy strength when the noise does not obey a Gaussian or Laplace distribution?

3 Preliminaries

In this section, we first define the notations associated with our work and then review the theory of differential privacy. Finally, we demonstrate the problem of privacy distortion in the form of a comprehensible schematic diagram.

3.1 Notations

Since the location data are the part of a trajectory most at risk of disclosing one's privacy, we express a single trajectory as a sequence consisting of positions and their corresponding timestamps. A trajectory is formalized as Definition 1.

Definition 1. (Trajectory) A trajectory |$ T $| is an ordered list of time-location pairs: |$ T=(t_{1},l_{1})\rightarrow \cdots \rightarrow (t_{i},l_{i})\rightarrow \cdots \rightarrow (t_{N},l_{N}) $|, where |$N$| is the length of this trajectory and |$ \forall i, 1\leq i\leq N $|.

Definition 1 gives the form of a trajectory. In trajectory |$ T $|, some special locations (e.g. home and work addresses) may leak the user's privacy and need to be protected. In Definition 2, we depict these sensitive locations.

Definition 2. (Sensitive dataset) An arbitrary location |$ l_{i} $| that the user wants to protect is defined as a sensitive location at timestamp |$ t_{i} $|, i.e. |$ l_{i} $| should be addressed before publishing. |$ l_{i} $| is a discrete spatial point, represented by its latitude and longitude coordinates. We then denote by |${D}$| the universe of sensitive locations in trajectory |$T$|, sampled from |$T$| with length |$ n $|.
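As a concrete illustration of Definitions 1 and 2 (a hypothetical encoding, not code from the paper), a trajectory can be held as an ordered list of time-location pairs and the sensitive dataset |$D$| as the subset the user marks for protection:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TimeLocation:
    t: float    # timestamp t_i
    lat: float  # latitude of location l_i
    lon: float  # longitude of location l_i

# Definition 1: a trajectory T is an ordered list of time-location pairs.
T = [TimeLocation(0.0, 39.9042, 116.4074),
     TimeLocation(30.0, 39.9050, 116.4081),
     TimeLocation(60.0, 39.9061, 116.4090)]

# Definition 2: the sensitive dataset D is the subset of T that the user
# wants protected (indices chosen by the user; n = len(D)).
sensitive_idx = {0, 2}
D = [p for i, p in enumerate(T) if i in sensitive_idx]
```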
Differential privacy preserves the user's privacy by defining a neighboring dataset that differs in the record the user wants to protect. In Definition 3, we formalize the neighboring dataset that lacks this record.

Definition 3. (Neighboring dataset) If a sensitive location |$ l_{i} $| is removed from the dataset |$ {D} $|, the resulting dataset, denoted |${D}_{-i} $|, is called the neighboring dataset of |$ {D} $|.

The above definitions formalize the trajectory and the dataset of sensitive locations associated with differential privacy technology. Next, we demonstrate how differential privacy protects a sensitive location using the notion of the neighboring dataset.

3.2 Differential privacy

Differential privacy is a currently recognized preservation model that can guarantee stricter security than models that depend on the secrecy of their algorithms. It is essentially a noise perturbation mechanism. By adding noise to the raw data or statistical results, differential privacy guarantees that changing the value of a single record has a minimal effect on the statistical output results. Thus, differential privacy can not only preserve the privacy of sensitive data, but also support data mining on statistical results. Its formal definition is given in Definition 4.

Figure 2. Output probability density of random algorithm |$M$| on |${D}$| and |${D}_{-i}$|.

Definition 4. (|$\varepsilon $|-Differential privacy [14]) Given the dataset |${D}$| and its neighboring dataset |${D}_{-i}$|, which have the same cardinality but differ in only one record, a random perturbation mechanism |$M$| ensures |$\varepsilon $|-differential privacy if |$M$| makes every set of outcomes |$S$| for any pair of |${D}$| and |${D}_{-i}$| satisfy $$ \begin{equation} Pr[M({D})\in S]\leq exp(\varepsilon)\times Pr[M({D}_{-i})\in S], \end{equation}$$ (1) where |$S\subseteq Range(M)$| and |$Range(M)$| is the value range of |$M$|. |$Pr[\cdot ]$| and |$\varepsilon $| denote probability and the privacy budget parameter, respectively. A smaller |$\varepsilon $| means better privacy. Figure 2 depicts the output probability distribution of a randomized algorithm |$M$| satisfying |$\varepsilon $|-differential privacy on |${D}$| and |${D}_{-i}$|, where |$ f(D) $| and |$ f({D}_{-i}) $| denote the output of |$ D $| and |${D}_{-i}$|, respectively.

Figure 3. Illustration of privacy distortion for trajectory release.

In practical applications, |$M$| is generally realized by a Gaussian or Laplace mechanism. The Gaussian mechanism offers higher data availability, but a lower privacy degree, than the Laplace mechanism. The definitions of these two mechanisms are as follows.

Definition 5. (Gaussian mechanism [25]) Assuming that |$f(\cdot )$| is the statistical output function, a noise sequence |$Y\sim N(0,\sigma ^{2})$|, which obeys a Gaussian distribution, can make the randomized algorithm |$M({D})=f({D})+Y$| satisfy |$(\varepsilon ,\delta )$|-differential privacy. |$\sigma $| is the scale parameter of the Gaussian distribution, and the PDF of the Gaussian distribution is $$ \begin{equation} \rho(x)=\frac{1}{\sqrt{2\pi}\sigma}{\mathrm{exp}}\left(-\frac{x^{2}}{2\sigma^{2}}\right). \end{equation}$$ (2)
For |$ c^{2}>2{\textrm{ln}}(1.25/\delta ) $|, the Gaussian mechanism with parameter |$ \sigma \geq c \Delta _{2}\,f/\varepsilon $| satisfies |$(\varepsilon ,\delta )$|-differential privacy, where |$ c $| is a constant and |$ \Delta _{2}\,f $| is the |$\ell _2$|-sensitivity $$ \begin{equation} \Delta_{2}f=\max_{{D},{D}_{-i}} \|f({D})-f({D}_{-i})\|_{2}. \end{equation}$$ (3)

Definition 6. (Laplace mechanism [15]) Assuming that |$f(\cdot )$| is the statistical output function, a noise sequence |$Y\sim Lap(\lambda )$|, which obeys a Laplace distribution, can make the randomized algorithm |$M(D)=f(D)+Y$| satisfy |$\varepsilon $|-differential privacy. |$\lambda $| is the scale parameter of the Laplace distribution, and the PDF of the Laplace distribution is $$ \begin{equation} \rho(x)=\frac{1}{2\lambda}{\mathrm{exp}}\left(-\frac{|x|}{\lambda}\right). \end{equation}$$ (4)

For various derivations, it is convenient to reparameterize the Laplace density in terms of its standard deviation. In this case, the standard Laplace distribution is given by $$ \begin{equation} p(\sigma,x)=\frac{1}{\sqrt{2}\sigma}{\textrm{exp}}\left(-\frac{\sqrt{2}|x|}{\sigma}\right), \end{equation}$$ (5) and reformulating any result from one parametrization to the other is a matter of replacing |$\lambda $| by |$\sigma /\sqrt{2}$| or |$\sigma $| by |$\sqrt{2}\lambda $|. The scale parameter |$\lambda $| is determined by the sensitivity function |$\Delta f$| and the privacy preserving intensity |$\varepsilon $|: $$ \begin{equation} \lambda=\frac{\Delta\, f}{\varepsilon}, \end{equation}$$ (6) where |$\Delta f$| is the maximum effect that a single record has on the statistical output function: $$ \begin{equation} \Delta\, f=\max_{{D},{D}_{-i}} \|f({D})-f({D}_{-i})\|_{1}. \end{equation}$$ (7) As an example, consider a dataset for which the sensitivity of a query is 1. According to differential privacy, noise distributed according to |$Lap(1/\varepsilon )$| added to the true answer suffices to guarantee |$\varepsilon $|-differential privacy.

3.3 Privacy distortion

Figure 3 depicts the problem of the privacy distortion caused by IID noise. As shown in Fig. 3, to preserve differential privacy of the sensitive dataset |$ {D} $|, mechanism |$ M $| adds an IID noise series |$ Y $| to |$ {D} $| and obtains a perturbed dataset |$ {D}^{\prime }$|. |$ \varepsilon $|-differential privacy bounds the difference of the PDFs of the outputs between the perturbed dataset |$ {D}^{\prime }$| and its neighboring dataset |$ {D}_{-i}^{\prime }$| within |$ [e^{-\varepsilon },e^\varepsilon ] $|, as shown in Fig. 3b(1). However, the original sensitive locations in |$ {D} $| are correlated while the noise introduced by mechanism |$ M $| is an IID series. Intuitively, the IID noise series |$ Y $| can be sanitized from the perturbed sensitive dataset |$ {D}^{\prime } $| by applying a filter, and the adversary obtains a filtered dataset |$ \hat{D}$|, as shown in Fig. 3a(2). Compared with the PDF of |$ M\left ({D}_{-i}^{\prime } \right ) $|, the PDF of |$ M(\hat{D}_{-i}) $| is closer to |$ M\left ({D} \right ) $|, which means that the filtered dataset |$ \hat{D}$| is closer to the original dataset |$ {D}$|, as shown in Fig. 3b(2). As a result, the privacy parameter |$ \varepsilon ^{\prime } $| in Fig. 3b(2) is larger than |$ \varepsilon $| in Fig. 3b(1), indicating that current IID-noise approaches may not deliver the privacy level they claim.
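This intuition can be checked numerically with a minimal sketch (assuming NumPy; the random-walk trajectory model, the Laplace scale and the moving-average filter are illustrative stand-ins for the mechanisms and filters analyzed in Section 4):

```python
import numpy as np

rng = np.random.default_rng(1)

D = np.cumsum(rng.normal(0.0, 0.05, size=5000))  # correlated "trajectory" (random walk)
Y = rng.laplace(0.0, 1.0, size=D.size)           # IID Laplace noise, lambda = Delta f / eps
D_noisy = D + Y                                  # published output D'

# A crude correlation-exploiting filter: local averaging keeps the slowly
# varying signal and cancels much of the IID noise.
w = 25
D_hat = np.convolve(D_noisy, np.ones(w) / w, mode="same")

print("noise variance before filtering:", np.var(D_noisy - D))  # about 2*lambda^2 = 2.0
print("noise variance after filtering: ", np.var(D_hat - D))    # markedly smaller
```

The residual error after averaging falls far below the injected noise variance, i.e. the adversary's posterior estimate |$\hat{D}$| is much closer to |$D$| than the published |$D^{\prime}$| is.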
4 Methodology

Section 3.3 gave the diagrammatic sketch of the privacy distortion problem. In this section, we first formalize the problem definition and then give an overview of our solution, which includes the notion of correlation-distinguishability filtering (CDF) and an EBPM. To conduct CDF in practice, we also design two filters, a Kalman filter and a particle filter, to address Gaussian and Laplace noise, respectively. Finally, we quantify the retained privacy after filtering using EBPM.

4.1 Problem definition

Since our goal is to quantify the degree of this distortion, we next formalize the problem of privacy distortion and give some necessary definitions associated with the problem.

Definition 7. (Noisy dataset) Since the noisy dataset is a sequence of location data arranged by timestamp, we can regard it as a random process. Therefore, the noisy dataset |$ {D}^{\prime } $| can be explained as an original sensitive dataset |$ D $| plus a corresponding noise series, i.e. $$ \begin{equation} {D}^{\prime}:=D+U, \end{equation}$$ (8) where |$U$| is the perturbation added into the original sensitive dataset |$ D $|. In the differential privacy technology considered in this paper, |$U$| obeys a Gaussian or Laplace distribution with variance |$ \sigma ^2 $|.

Definition 8. (Posterior estimate) The dataset after filtering can be regarded as a series that contains the posterior estimates of the sensitive locations. The posterior estimate, denoted by |$ \hat{D}$|, is given by the following conditional estimate: $$ \begin{equation} \hat{D}:=Es(D|D^\prime), \end{equation}$$ (9) where |$ D^\prime =\{D^\prime _1,D^\prime _2,\ldots ,D^\prime _n\} $| is the noisy dataset and denotes the set of observations obtained by the data collector, and |$ Es(\cdot ) $| is the estimation function.

Definition 9. (Privacy distortion) Let |$ \hat{D} $| be the result of a filtering operation on the published outputs |$ {D}^{\prime } $|. The privacy distortion is the privacy leakage after this operation, i.e. $$ \begin{equation} Dis_{\hat{D},{D}^{\prime}}(\hat{D},{D}^{\prime}):=\dfrac{\mathcal{L}(D,{D}^{\prime})-\mathcal{L}(D,\hat{D})}{\mathcal{L}(D,{D}^{\prime})}, \end{equation}$$ (10) where |$ Dis_{\hat{D},{D}^{\prime }}(\hat{D},{D}^{\prime }) $| denotes the privacy distortion of |$ \hat{D} $| and |$ {D}^{\prime }$|, and |$ \mathcal{L}(\cdot ) $| denotes a mechanism or function to quantify the privacy degree between any two objects, whose details are given in Definition 14.

In the Gaussian or Laplace mechanism, the privacy intensity is controlled by the privacy budget |$ \varepsilon $|. Nonetheless, in this paper, the noise after sanitizing may not obey a fixed distribution. Thus, to analyze the privacy distortion |$ Dis_{\hat{D},{D}^{\prime }}(\hat{D},{D}^{\prime }) $|, we need to redefine |$ \mathcal{L}(\cdot ) $|.

4.2 Overview

Figure 4 gives an overview of our solution, which consists of two parts: the notion of CDF and an EBPM. Specifically, CDF utilizes the different correlations between the original dataset and the noise series to design an optimum filter to sanitize the IID noise. EBPM builds on the notion of entropy and provides a metric suitable for trajectory data that measures the privacy distortion regardless of the noise form.

Figure 4. Overview of our solution.

4.2.1 Correlation-distinguishability filtering

Since the location points in the trajectory are correlated, there is an inherent disadvantage in using IID noise to preserve differential privacy.
Although differential privacy achieves complete privacy within its defined strength, and many methods claim that they can effectively protect the privacy of trajectory data, IID noise can still be sanitized to a certain extent, so these schemes carry the potential risk of the privacy distortion illustrated in Section 3.3. Inspired by this observation, we propose the notion of CDF. It first defines a mechanism that attempts to obtain the posterior estimates of the true values under the condition of correlations, to verify the effectiveness of existing schemes; it then designs two kinds of filters to test the real effect of our notion in practice.

4.2.2 Privacy quantification

Differential privacy uses the mathematical tool |$ \varepsilon $| to represent the privacy degree, but the noise must strictly obey a Gaussian or Laplace distribution. This is a rigorous restriction, and the noise after sanitizing may not obey these specific distributions. Thus, we need a new privacy metric to measure the retained privacy strength. Since entropy can reflect the uncertainty of the true data values for an adversary regardless of the form of noise [26], in this paper, we calculate the distances between the observations and the corresponding estimates, together with their distribution, to quantify the privacy.

4.3 Correlation-distinguishability filtering

In this section, we demonstrate the notion of CDF, which utilizes the difference of the correlations between the original dataset |$D$| and the noise series |$Y$| to design an optimum filter to sanitize the IID noise. The formal definition of CDF is as follows.

Definition 10. (Correlation-distinguishability filtering) Let |$ C_Y $| and |$ C_D $| denote the data correlations in the noise series |$ Y $| and the sensitive dataset |$D$|; then CDF is defined as a mechanism |$ \mathcal{F} $| to obtain the posterior estimates of the true data values in |$D$| under the condition of |$ C_Y $| and |$ C_D $|: $$ \begin{equation} \hat{D}:=\mathcal{F}({D}|C_Y,C_D). \end{equation}$$ (11)

In practice, a mechanism satisfying CDF can be realized by a waveform-matching optimum filter, whose basic idea is to design a filter that minimizes the mean square error between the input and output. Mechanism |$ \mathcal{F} $| can then be formalized using the least mean square error (LMS) criterion [27]: $$ \begin{equation} \mathcal{F}(\hat{D},D):=min\{E[(\hat{D}-D)^2]\}. \end{equation}$$ (12)

There are various filter structures with different accuracy and complexity that achieve the goal in Equation (12), e.g. mean and median filters. Among the alternatives, the Kalman filter [28] assumes that the noise obeys a Gaussian distribution and performs best when addressing Gaussian noise, while the particle filter [29] places no restriction on the noise form and can address any special noise form. Thus, we choose these two kinds of filters to sanitize the Gaussian and Laplace noise, respectively. Next, we analyze the resistance performance of the Gaussian and Laplace mechanisms against Kalman and particle filtering, respectively.

4.3.1 Kalman filtering for Gaussian noise

To evaluate the resistance performance of the Gaussian mechanism, we design an adaptive Kalman filter. Figure 5 gives the schematic diagram of the Kalman filter. It utilizes a linear combination of the prior estimate |$ P(k) $| and the current observation |$ O(k) $| to approximate the true value.
The LMS criterion is used as feedback to continuously adjust the coefficients of the combination to achieve the optimal estimates. Next we give the definitions of the prior estimate and the current observation.

Figure 5. Schematic diagram of the Kalman filter.

Definition 11. (Prior estimate) In the scenario of trajectory data publishing, since the changes of the data values of two adjacent points in a trajectory are often very small, the correlations of the data in |$ D $|, denoted |$ C_D $|, are very strong. The prior estimate |$P(k)$| of the true data value at timestamp |$ t_k $| can then be regarded as the last estimate |$P(k-1)$| plus a noise |$w(k)$|, i.e. |$P(k)$| can be expressed as $$ \begin{equation} \begin{split} P(k)=&P(k-1)+w(k),\\ \textrm{subject} \quad \textrm{to}\quad&Y(k) \quad \textrm{is} \quad IID,\\ &D(k) \quad \textrm{is} \quad \textrm{not} \quad IID,\\ &C_D\neq C_Y, \end{split} \end{equation}$$ (13) where |$ w(k) $| is a Gaussian variable with variance |$ \sigma _{w}^2 $|. |$ \sigma _{w}^2 $| indicates the confidence we have in the prediction model, and its value is highly related to the specific dataset in practice. For example, if the positions to be protected in a trajectory lie on the same road and the speed changes slowly, the value of |$ \sigma _{w}^2 $| will be small, since the values of two adjacent positions change little.

Definition 12. (Observation) Let |$ O(k) $| denote the observed value of the current output at timestamp |$ t_k $|. The observation at timestamp |$ t_k $| can be modeled by the following equation: $$ \begin{equation} O(k)=D(k)+v(k). \end{equation}$$ (14) This observation model indicates that the current observation |$ O(k) $| is the result of the original data plus a white Gaussian noise |$ v(k) $| with variance |$ \sigma _v^{2} $|. |$\sigma _v^{2} $| can be initialized arbitrarily and converges after several iterations.

Definitions 11 and 12 formalize the predicted value |$ P(k) $| and the observation |$ O(k) $|. The estimate |$ \hat{D}(k) $| of the true value |$ D(k) $| is the linear combination of |$ P(k) $| and |$ O(k) $|. To improve the accuracy of the estimation, the Kalman filter conducts an iterative process and updates the weights of |$ P(k) $| and |$ O(k) $| at each round. By constantly updating the weights, the modified estimate converges to the true value. Denote the weights of |$ P(k) $| and |$ O(k) $| by |$ \eta _k $| and |$ \gamma _k $|, respectively; they are solved according to the LMS criterion in Equation (12). Next, we formalize these two weights in Definition 13 and give their expressions in Theorems 1 and 2.

Definition 13. (Coefficients) At timestamp |$ t_k $|, the posterior optimal state estimate |$ \hat{D}_{k} $| is formed by combining the prior estimate |$ P(k) $| with the current observation |$ O(k) $|: $$ \begin{equation} \hat{D}_{k}=\eta_{k}P(k)+\gamma_{k}O(k), \end{equation}$$ (15) where |$\eta _{k}$| and |$\gamma _{k}$| are the weighted coefficients of the prior estimate and the current observation, respectively.

To clearly describe the process of Kalman filtering, we summarize the implementation process in Algorithm 1, where |$ KF(\cdot ) $| denotes Kalman filtering.
As shown in Algorithm 1, a Kalman filter is equivalent to the optimal estimation algorithm for the weighted coefficients |$\eta _{k}$| and |$\gamma _{k}$| under the LMS criterion. Thus, if the values of |$\eta _{k}$| and |$\gamma _{k}$| are solved, we can obtain the optimum posterior estimate |$\hat{D}_{k}$|. Next, we utilize the LMS criterion to obtain the values of |$\eta _{k}$| and |$\gamma _{k}$|, as shown in Theorems 1 and 2.

Theorem 1. Given the weighted coefficient of the current observation |$\gamma _k$|, the weighted coefficient of the prior estimate |$ \eta _{k} $| can be expressed as $$ \begin{equation} \eta_{k}=1-\gamma_k. \end{equation}$$ (16)

Proof. The variance between the true value and the estimate at timestamp |$t_k$| is $$\begin{equation*} \sigma_k^2=E[D(k)-\hat{D}(k)]^2=E[D(k)-\eta_{k}P_{k}-\gamma_{k}O(k)]^2. \end{equation*}$$ According to Equation (12), the LMS criterion sets the partial derivative of |$\sigma _k^2$| with respect to |$\eta _{k}$| to 0, i.e. $$\begin{equation*} \mathcal{F}(\hat{D},D)=min\{\sigma_k^2\}\Rightarrow\frac{\partial\sigma_k^2}{\partial \eta_{k}}=0. \end{equation*}$$ We can then deduce the following expression from the above equations: $$\begin{equation*} E\{[\eta_{k}\hat{D}(k-1)]\hat{D}(k-1)\}=E\{[D(k)-\gamma_{k}O(k)]\hat{D}(k-1)\}. \end{equation*}$$ The left part of the above expression equals $$\begin{align*} &E\{[\eta_{k}\hat{D}(k-1)]\hat{D}(k-1)\}\\ &=\eta_{k}E\{[D(k-1)-e(k-1)]\hat{D}(k-1)\}\\ &=\eta_{k}E[D(k-1)\hat{D}(k-1)]-\eta_{k}E[e(k-1)\hat{D}(k-1)], \end{align*}$$ where |$ e(\cdot ) $| denotes the error between the true value and the estimate. In addition, since |$ \hat{D}(k-1)=\eta _{k-1}P(k-1)+\gamma _{k-1}O(k-1) $| by Definition 13, we have $$\begin{align*} &E[e(k-1)\hat{D}(k-1)]\\ &=\eta_{k-1}E[e(k-1)\hat{D}(k-1)]+\gamma_{k-1}E[e(k-1)O(k-1)]. \end{align*}$$ Since the noise is IID, |$ E[e(k-1)\hat{D}(k-1)]=0 $|. The left part therefore equals $$\begin{equation*} E\{[\eta_{k}\hat{D}(k-1)]\hat{D}(k-1)\}=\eta_{k}E[D(k-1)\hat{D}(k-1)]. \end{equation*}$$ Next, we analyze the right part, which can be extended as $$\begin{equation*} \begin{split} &E\{[D(k)-\gamma_{k}O(k)]\hat{D}(k-1)\}\\ &=E\{[D(k)-\gamma_{k}D(k)-\gamma_{k}v(k)]\hat{D}(k-1)\}\\ &=(1-\gamma_{k})E[D(k)\hat{D}(k-1)]-\gamma_{k}E[v(k)\hat{D}(k-1)].\\ \end{split} \end{equation*}$$ Since the estimate |$\hat{D}(k-1)$| at timestamp |$t_{k-1}$| is uncorrelated with the observation noise |$v(k)$| at timestamp |$t_k$|, we have |$ E[v(k)\hat{D}(k-1)]=0. $| The right part then equals $$\begin{equation*} \begin{split} &E\{[D(k)-\gamma_{k}O(k)]\hat{D}(k-1)\}\\ &=(1-\gamma_{k})E[D(k)\hat{D}(k-1)]\\ &=(1-\gamma_{k})E\{[D(k-1)+w(k-1)]\hat{D}(k-1)\}\\ &=(1-\gamma_{k})E[D(k-1)\hat{D}(k-1)]\\ &+(1-\gamma_{k})E[w(k-1)\hat{D}(k-1)]. \end{split} \end{equation*}$$ Since |$ E[w(k-1)\hat{D}(k-1)]=0 $|, the right part is $$\begin{equation*} E\{[D(k)-\gamma_{k}O(k)]\hat{D}(k-1)\}=(1-\gamma_{k})E[D(k-1)\hat{D}(k-1)]. \end{equation*}$$ Furthermore, since the left part equals the right part, we have $$\begin{equation*} \eta_{k}E[D(k-1)\hat{D}(k-1)]=(1-\gamma_{k})E[D(k-1)\hat{D}(k-1)]. \end{equation*}$$ Thus, $$\begin{equation*} \eta_{k}=1-\gamma_{k}. \end{equation*}$$

Theorem 1 gives the relationship between |$ \gamma _{k} $| and |$ \eta _{k} $|. Next, we give the mathematical expression of |$ \gamma _{k} $| in Theorem 2.
Theorem 2. The weighted coefficient of the current observation |$\gamma _{k}$| at timestamp |$ t_k $| is $$ \begin{equation} \gamma_{k}=\frac{\sigma_{k-1}^2+\sigma_{w}^2}{\sigma_{k-1}^2+\sigma_{w}^2+\sigma_{v}^2}. \end{equation}$$ (17)

Proof. The variance between the true value and the estimate at timestamp |$t_k$| is $$\begin{equation*} \begin{split} \sigma_k^2=\ &E[D(k)-\hat{D}(k)]^2\\ =\ &E\{e(k)[D(k)-\eta_{k}P_{k}-\gamma_{k}O(k)]\}\\ =\ &E[e(k)D(k)]-\eta_{k}E[e(k)\hat{D}(k-1)]-\gamma_{k}E[e(k)O(k)]. \end{split} \end{equation*}$$ Since |$E[e(k)\hat{D}(k-1)]$| and |$ E[e(k)O(k)] $| equal 0, $$\begin{equation*} \sigma_k^2=E[e(k)D(k)]. \end{equation*}$$ According to Equation (14), |$ D(k)=O(k)-v(k) $|. Plugging this into the above equation, we have $$\begin{equation*} \begin{split} \sigma_k^2=\ &E[e(k)O(k)]-E[e(k)v(k)]\\ =\ &-E[e(k)v(k)]\\ =&-E[D(k)v(k)]+\eta_{k}E[\hat{D}(k-1)v(k)]+\gamma_{k}E[O(k)v(k)]. \end{split} \end{equation*}$$ Since the noise is independent of the trajectory data, |$E[D(k)v(k)]$| and |$E[\hat{D}(k-1)v(k)]$| also equal 0, so $$\begin{equation*} \begin{split} \sigma_k^2=\,&\gamma_{k}E[O(k)v(k)]\\ =\,&\gamma_{k}E[D(k)v(k)]+\gamma_{k}E[v(k)^2]\\ =\ &\gamma_{k}\sigma_v^2. \end{split} \end{equation*}$$ The form of |$ \sigma _k^2 $| can also be expressed as $$\begin{equation*} \begin{split} \sigma_k^2=\ &E[D(k)-\hat{D}(k)]^2\\ =\ &E\{D(k-1)+w(k-1)\\&- \eta_{k}\hat{D}(k-1)-\gamma_{k}[D(k)+v(k)]\}^2\\ =\ &E\{[1-\gamma_{k}]e(k-1)+[1-\gamma_{k}]w(k-1)-\gamma_{k}v(k)\}^2\\ =\ &[1-\gamma_{k}]^{2}\sigma_{k-1}^2+[1-\gamma_{k}]^{2}\sigma_w^2+\gamma^{2}_k\sigma_v^2. \end{split} \end{equation*}$$ Combining these two forms of |$ \sigma _k^2 $|, we get a quadratic equation in one unknown: $$\begin{equation*} [1-\gamma_{k}]^{2}\sigma_{k-1}^2+[1-\gamma_{k}]^{2}\sigma_w^2+\gamma^{2}_k\sigma_v^2=\gamma_{k}\sigma_v^2. \end{equation*}$$ Solving this equation yields the expression of |$\gamma _{k}$|: $$\begin{equation*} \gamma_{k}=\frac{\sigma_{k-1}^2+\sigma_{w}^2}{\sigma_{k-1}^2+\sigma_{w}^2+\sigma_{v}^2}. \end{equation*}$$

Theorem 2 gives the expression of |$ \gamma _{k} $|. Combined with Theorem 1, we obtain the expression of |$ \eta _{k} $|: $$\begin{equation*} \eta_{k}=\frac{\sigma_{v}^2}{\sigma_{k-1}^2+\sigma_{w}^2+\sigma_{v}^2}. \end{equation*}$$ From the proof of Theorem 2, we know that |$ \gamma _{k} $| in Theorem 2 minimizes the variance |$ \sigma _k^2 $|, which satisfies the LMS criterion. The LMS between |$ \hat{D} $| and |$ D $| is then given in Theorem 3.

Theorem 3. The LMS between the estimates |$ \hat{D} $| and the true dataset |$ D $| after Kalman filtering is $$ \begin{equation} \mathcal{F}_{Kalman}(\hat{D},D)=\sum_{k=1}^{n}\frac{(\sigma_{k-1}^2+\sigma_{w}^2)\sigma_{v}^2}{\sigma_{k-1}^2+\sigma_{w}^2+\sigma_{v}^2}. \end{equation}$$ (18)

Proof. According to Equation (12), we have $$\begin{equation*} \mathcal{F}_{Kalman}(\hat{D},D)=\sum_{k=1}^{n}min\{\sigma_k^2\}. \end{equation*}$$ In addition, |$ min\{\sigma _k^2\}=\gamma _{k}\sigma _v^2 $|; combined with the result for |$ \gamma _{k} $| in Theorem 2, we have $$\begin{equation*} \mathcal{F}_{Kalman}(\hat{D},D)=\sum_{k=1}^{n}\gamma_{k}\sigma_v^2=\sum_{k=1}^{n}\frac{(\sigma_{k-1}^2+\sigma_{w}^2)\sigma_{v}^2}{\sigma_{k-1}^2+\sigma_{w}^2+\sigma_{v}^2}. \end{equation*}$$
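A minimal sketch of the filtering loop summarized in Algorithm 1 (the algorithm listing itself is not reproduced here), implementing Equations (15)–(17) for a one-dimensional coordinate series; the initialization choices are illustrative assumptions:

```python
import numpy as np

def kalman_filter(observations, sigma_w2=1.0, sigma_v2=1.0):
    """Scalar Kalman filter for the random-walk model of Definition 11:
    P(k) = D_hat(k-1) + w(k), O(k) = D(k) + v(k)."""
    d_hat = observations[0]   # initialize with the first observation
    sigma_k2 = sigma_v2       # initial error variance (arbitrary; it converges)
    estimates = [d_hat]
    for o in observations[1:]:
        p = d_hat                                                         # prior estimate P(k)
        gamma = (sigma_k2 + sigma_w2) / (sigma_k2 + sigma_w2 + sigma_v2)  # Eq. (17)
        eta = 1.0 - gamma                                                 # Eq. (16)
        d_hat = eta * p + gamma * o                                       # Eq. (15)
        sigma_k2 = gamma * sigma_v2          # minimal variance per step (cf. Theorem 3)
        estimates.append(d_hat)
    return np.array(estimates)
```

Applied to a perturbed coordinate series |$ D^{\prime} $|, the returned estimates play the role of |$ \hat{D} $| in Definition 8.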
4.3.2 Particle filtering for Laplace noise

In Section 4.3.1, we designed a Kalman filter against the Gaussian mechanism. However, because of the particular distribution of Laplace noise, there is no specific filter for this noise form. In this section, we use the universal Bayesian filtering [30] framework to obtain the posterior estimates of the true trajectory data. According to the Bayesian principle, the optimal estimate of the current real value, which satisfies the LMS criterion in Equation (12), is a conditional mean over all of the observations: $$ \begin{equation} \hat{D}_{LMS}(k):=E[\hat{D}(k)|O_{1:k}]=\int \hat{D}(k)p(\hat{D}(k)|O_{1:k})\textrm{d}\hat{D}(k), \end{equation}$$ (19) where |$ O_{1:k} $| is the set of all of the observation values and |$ p(\hat{D}(k)|O_{1:k}) $| is the posterior probability given the known |$ O_{1:k} $|. Nonetheless, the integral in Equation (19) is difficult to solve if the noise is non-Gaussian. Particle filtering is a specific implementation of Bayesian filtering in practice, which can be used when the posterior estimate of Bayesian filtering is difficult to calculate. The implementation process of particle filtering is described in Algorithm 2, where |$ PF(\cdot ) $| denotes particle filtering. Particle filtering takes advantage of the weighted sum of a series of random samples to express the posterior probability, and the integral operation is approximated by this summation. Theorem 4 gives a general solution of particle filtering.

Theorem 4. Denote by |$ q(\hat{D}(k)|O_{1:k}) $| an importance probability density function whose samples are easy to generate, e.g. obeying a uniform or Gaussian distribution. Assume that |$ M $| random samples |$ D^i(k), i=1,\ldots ,M $| can be extracted from the posterior probability |$ p(\hat{D}(k)|O_{1:k}) $|. Then the estimate |$ \hat{D}_{LMS}(k) $| at timestamp |$ t_k $| that conforms to the LMS criterion is $$ \begin{equation} \hat{D}_{LMS}(k)=\frac{1}{M}\sum_{i=1}^{M}D^i(k)w_k^i, \end{equation}$$ (20) where |$ w_k^i $| is a normalized weight parameter.

Proof. Since |$ p(\hat{D}(k)|O_{1:k}) $| can be viewed as a sampling approximation, $$\begin{equation*}p(\hat{D}(k)|O_{1:k}) = \sum\limits_{i = 1}^M w_k^i {\delta (\hat{D}(k) - D^i(k))}, \end{equation*}$$ where |$ \delta (\cdot ) $| is the unit impulse function (Dirac function), i.e. |$ \delta (\hat{D}(k) - D^i(k))=0 $| if |$ \hat{D}(k)\neq D^i(k) $| and |$ \int \delta (x)\mathrm{d}x=1 $|. |$ w_k^i $| is calculated by $$\begin{equation*} w_k^i\varpropto\frac{p(\hat{D}(k)|O_{1:k})}{q(\hat{D}(k)|O_{1:k})}. \end{equation*}$$ When the number of sampled particles is large, this approximates the real posterior probability density function. The expectation estimate of |$ p(\hat{D}(k)|O_{1:k}) $| is then $$\begin{align*} \hat{D}_{LMS}(k)&=E[\hat{D}(k)|O_{1:k}]\\ &=\frac{1}{M}\sum_{i=1}^{M}D^i(k)\frac{p(\hat{D}(k)|O_{1:k})}{q(\hat{D}(k)|O_{1:k})}=\frac{1}{M}\sum_{i=1}^{M}D^i(k)w_k^i. \end{align*}$$

In practice, a particle filter can be conducted in various ways, e.g. sampling from a uniform or Gaussian distribution to approximate the posterior probability. In this paper, as an example, we analyze the LMS using the Gaussian distribution as the sampled sequence. The following is the result of the error.
Theorem 5. Given original and perturbed data series |$ D,\hat{D}\in R $| and an importance distribution |$q(\cdot )$| of length |$M$| and variance |$\sigma _q^2$|, a sequence consisting of |$m$| samples from the importance distribution leads to the following posterior error: $$ \begin{equation} \mathcal{F}_{Particle}(\hat{D},D)=\dfrac{m}{M}\left[\left( \dfrac{\sigma_q^4}{2\sigma_q^{2}-1}\right)^{m/2}-1\right], \end{equation}$$ (21) where |$\sigma _q^{2}>1$|.

Proof. Consider the basic Monte Carlo method without resampling: $$\begin{equation*} \left\{ {w _k^i} \right\}_{i = 1}^M\left\{ {D^i(k)} \right\}_{i = 1}^M =\prod_{i=1}^M w _k^i(D^i(k)). \end{equation*}$$ In addition, $$\begin{equation*} p\left\{ {D^i(k)} \right\}_{i = 1}^M=\prod_{i=1}^M (2\pi)^{m/2}\exp\left( -\dfrac{( {D^i(k)} )^2}{2}\right). \end{equation*}$$ If the importance distribution obeys $$\begin{equation*} q\left\{ {D^i(k)} \right\}_{i = 1}^M=\prod_{i=1}^M q({D^i(k)}) \end{equation*}$$ and |$\sigma _q^2>1$|, we have |$ var(\hat D(k))<\infty $| and $$\begin{equation*} var(\hat D(k)-D(k))=\dfrac{1}{M}\left[\left(\dfrac{\sigma_q^4}{2\sigma_q^{2}-1}\right)^{m/2}-1\right]. \end{equation*}$$ If we sample |$m$| points from the importance distribution, their respective asymptotic variances are given by $$\begin{align*} &\dfrac{1}{M}\left(\int \dfrac{w_m^2(D(1))}{q(D(1))}\mathrm{d}D(1)-1\right.\\[3pt] &+\left.\sum_{k=2}^m \int \dfrac{w_m^2(D^k(1))}{w_{k-1}(D^{k-1}(1)q(D(k)|D^{k-1}(1)))}\mathrm{d}D^{k}(k-1)-1\right). \end{align*}$$ In this case, the asymptotic variance is finite only when |$ \sigma _q^{2}>\dfrac{1}{2} $| and $$\begin{align*} \mathcal{F}_{Particle}(\hat{D},D) &\approx\dfrac{1}{M}\sum_{k}^{m}[\int \dfrac{w_m^2(D(1))}{q(D(1))}\mathrm{d}D(1)-1\\ &\quad+\sum_{k=2}^m\int\dfrac{w_m^2(D(k))}{q(D(k))}\mathrm{d}D(k)-1]\\ &=\dfrac{m}{M}\left[\left( \dfrac{\sigma_q^4}{2\sigma_q^{2}-1}\right)^{m/2}-1\right]. \end{align*}$$

Theorem 4 demonstrates the principle of particle filtering and Theorem 5 gives the error after particle filtering.
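A minimal sketch of the loop summarized in Algorithm 2 (not reproduced here), instantiated as a standard sequential importance resampling filter; the random-walk proposal of Definition 11 and the parameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def particle_filter(observations, lam=1.0, sigma_w=1.0, M=500):
    """Particle filter: random-walk process model (Definition 11) with
    Laplace(lam) observation noise; weights follow the Laplace likelihood."""
    particles = observations[0] + rng.normal(0.0, sigma_w, size=M)
    estimates = []
    for o in observations:
        particles = particles + rng.normal(0.0, sigma_w, size=M)  # propagate: D(k)=D(k-1)+w(k)
        w = np.exp(-np.abs(o - particles) / lam)                  # Laplace likelihood weights
        w /= w.sum()                                              # normalize w_k^i
        estimates.append(np.dot(w, particles))                    # Eq. (20): weighted mean
        idx = rng.choice(M, size=M, p=w)                          # resample to avoid degeneracy
        particles = particles[idx]
    return np.array(estimates)
```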
4.4 Privacy quantification

In the above sections, we have given the error after Kalman and particle filtering. In this section, we analyze the privacy distortion defined in Definition 9. We first give a metric to measure the privacy strength that is suitable for any form of noise. Then we analyze the privacy distortion after filtering using this notion.

C. Dwork proposed the concept of differential privacy, including a privacy measurement approach for statistical databases. In this approach, the privacy strength introduced by the Laplace noise is measured by the privacy budget assigned to each record. However, in our work, the noise after filtering does not obey a fixed form, so the privacy budget |$ \varepsilon $| is not suitable for our case. The KL divergence metric proposed in [31] is often used to measure the similarity of two trajectories, whereas the purpose of this paper is to explore the information loss caused by the noise before and after filtering. Thus, we need to redefine a metric that intuitively expresses the privacy distortion. A more useful metric is entropy, an information theoretical approach to measure uncertainty. In [32, 33], entropy |$ H $| is proposed as an anonymity metric to measure the privacy offered by a system. It is defined as |$ H=-\sum _i p_i \log_2 p_i $|, where |$ p_i $| is the attacker's estimate of the probability that a participant |$ i $| is responsible for some observed action. Although entropy adequately assesses the uncertainty of the adversary, it cannot be applied directly to privacy measurement for trajectory data. To this end, we propose an entropy-based metric, called EBPM, where the privacy distortion is measured by the lost uncertainty of the error before and after filtering. Specifically, we calculate the distance of the filtered positions from the perturbed ones and multiply it by its probability in order to obtain the privacy quantification measurement |$ \mathcal{L}(\cdot ) $| of location inference for a corresponding user.

Definition 14. (Entropy-based privacy measurement) Given the perturbed and estimated sensitive trajectory datasets |$ D^\prime $| and |$ \hat{D} $|, the entropy-based privacy measurement |$ \mathcal{L}(\cdot ) $| is given by the following formula: $$ \begin{equation} \mathcal{L}(D^\prime,\hat{D})=-\sum_{k=1}^{n} Dist( D^\prime(k),\hat{D}(k))\cdot \ln p[Dist( D^\prime(k),\hat{D}(k))], \end{equation}$$ (22) where |$ Dist( D^\prime (k),\hat{D}(k)) $| is the distance between the estimated position |$ \hat{D}(k) $| and the perturbed position |$ D^\prime (k) $| at timestamp |$t_k$|, and |$ p[Dist( D^\prime (k),\hat{D}(k))] $| is the probability of that distance appearing.

In our work, we define |$ Dist( D^\prime (k),\hat{D}(k)) $| as a normalized distance function that gives distances in |$ [0,1] $|; therefore, the computed privacy level is in the interval |$ [0,1] $|, where 0 means no privacy protection and 1 means full privacy protection. This is done by normalizing the actual distance by an upper-bound distance per time step (e.g. the maximum driving speed in our case). We employ the Euclidean distance function, but other choices of distance functions are possible as well.
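A minimal sketch of the EBPM computation in Equation (22) for one-dimensional series (assuming NumPy; the histogram-based probability estimate and the normalization bound max_step are illustrative choices). It applies to any pair of series, e.g. |$(D,D^{\prime})$| and |$(D,\hat{D})$| as used in Definition 9:

```python
import numpy as np

def ebpm(a, b, max_step=1.0, bins=20):
    """Entropy-based privacy measurement, Eq. (22): normalized distances
    between corresponding positions, weighted by -ln of their empirical
    probability (estimated here with a histogram)."""
    dist = np.minimum(np.abs(np.asarray(a) - np.asarray(b)) / max_step, 1.0)
    hist, edges = np.histogram(dist, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()                                     # empirical p[Dist]
    idx = np.clip(np.digitize(dist, edges) - 1, 0, bins - 1)  # bin of each distance
    return float(-np.sum(dist * np.log(np.maximum(p[idx], 1e-12))))

def privacy_distortion(d, d_noisy, d_hat):
    """Definition 9: relative privacy leakage after filtering."""
    return (ebpm(d, d_noisy) - ebpm(d, d_hat)) / ebpm(d, d_noisy)
```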
Definition 14 gives the method to measure the privacy strength after filtering. Next, we theoretically analyze the privacy distortion after filtering, as demonstrated in Theorem 6.

Theorem 6. The privacy distortion |$ Dis_{\hat{D},{D}^{\prime }}(\hat{D},{D}^{\prime }) $| after Kalman/particle filtering is $$ \begin{equation} Dis_{\hat{D},{D}^{\prime}}(\hat{D},{D}^{\prime})=\dfrac{\ln(2\pi{\sigma^\prime}^2)+1}{\ln(2\pi \sigma^2)+1}, \end{equation}$$ (23) where |$ {\sigma ^\prime }^2 $| is the LMS, i.e. |$ \mathcal{F}_{Kalman}(\hat{D},D) $| or |$ \mathcal{F}_{Particle}(\hat{D},D) $|, and |$ \sigma ^2 $| is the variance of the noise introduced by the Gaussian or Laplace mechanism.

Proof. Since a Kalman filter has a linear structure, the noise after Kalman filtering also follows a Gaussian distribution. Similarly, since we use Gaussian samples to approximate the posterior estimates of the true data values, the noise after particle filtering also follows a Gaussian distribution. If we denote the LMS error, i.e. |$ \mathcal{F}_{Kalman}(\hat{D},D) $| or |$ \mathcal{F}_{Particle}(\hat{D},D) $|, by the symbol |$ {\sigma ^\prime }^{2} $|, the privacy distortion can be deduced according to Equation (10): $$\begin{equation*} \begin{split} Dis_{\hat{D},{D}^{\prime}}(\hat{D},{D}^{\prime}):&=\dfrac{\mathcal{L}(D,{D}^{\prime})-\mathcal{L}(D,\hat{D})}{\mathcal{L}(D,{D}^{\prime})}\\ &=\dfrac{-\int_{-\infty}^{+\infty}p({\sigma^\prime}^{2} )\ln p({\sigma^\prime}^{2} )d{\sigma^\prime}^{2}}{-\int_{-\infty}^{+\infty}p(\sigma^{2} )\ln p(\sigma^{2} )d\sigma^{2}}\\ &=\dfrac{\int_{-\infty}^{+\infty}p({\sigma^\prime}^{2} )\ln{\dfrac{1}{\sqrt{2\pi{\sigma^\prime}^{2}}}\exp \left(-\dfrac{x^2}{2{\sigma^\prime}^{2}}\right)dx}} {\int_{-\infty}^{+\infty}p(\sigma^{2} )\ln{\dfrac{1}{\sqrt{2\pi\sigma^{2}}}\exp \left(-\dfrac{x^2}{2\sigma^2}\right)dx}}\\ &=\dfrac{\dfrac{1}{2}\ln(2\pi{\sigma^\prime}^{2} )+\dfrac{1}{2}}{\dfrac{1}{2}\ln(2\pi\sigma^{2} )+\dfrac{1}{2}}\\ &=\dfrac{\ln(2\pi{\sigma^\prime}^2)+1}{\ln(2\pi \sigma^2)+1}. \end{split} \end{equation*}$$ Since |$ \ln (\cdot ) $| is a monotone increasing function and |$ {\sigma ^\prime }^2<\sigma ^2 $| always holds, we have |$ Dis_{\hat{D},{D}^{\prime }}(\hat{D},{D}^{\prime })<1 $|. In addition, |$ {\sigma ^\prime }^2 $| and |$ \sigma ^2 $| are usually bigger than |$ \sqrt{\dfrac{1}{2\pi e}} $|, leading to |$ Dis_{\hat{D},{D}^{\prime }}(\hat{D},{D}^{\prime })>0 $|. Thus, Theorem 6 demonstrates the existence of privacy distortion after filtering. Next we quantify this distortion through experiments on real-world datasets.

5 Experimental evaluation

In this section, we evaluate the resistance performance of current schemes using our proposed solution. Specifically, we first analyze the influence of correlation on privacy protection strength and explore whether our solution is effective. Then the resistance performance of six current schemes, including typical model-based (Markov [16], Bayesian [17] and CIM [20]) and transform-based methods (DFT [21], DWT [22] and PCA [23]), is evaluated on four real-world datasets. Finally, we analyze the impact of filter accuracy on the resistance performance.

5.1 Datasets and configuration

We evaluate the resistance performance on four real-world trajectory datasets. The experiments were performed on an Intel Core 2 Quad 3.06-GHz Windows 7 machine equipped with 16 GB main memory. Each experiment was run 100 times.

Check-in [34]: This dataset contains check-in locations from more than 49,000 social users in Los Angeles and 31,000 in New York. Each check-in record includes user ID, timestamp, location and place type ID.

Trajectory: From the Geolife project [35], this dataset contains 17 621 trajectories with a total distance of 1 292 951 km and a total duration of 50 176 h. These trajectories were collected by Microsoft Research Asia from 182 users over five years (from April 2007 to August 2012). A trajectory in this dataset contains latitude, longitude and altitude coordinates and a timestamp.

T-Drive Taxi [36]: This dataset describes the GPS trajectory data of 8602 taxis in Beijing in May 2009. The sampling interval varies from 30 to 300 s, and the dataset contains about 4 300 000 passenger records, each of which is composed of an interpolated sequence at about 30-s intervals.
Traffic [37]: Traffic is a daily traffic count dataset for Seattle-area highway traffic monitoring and control, provided by the Intelligent Transportation Systems Research Program at the University of Washington. We chose the traffic count at location I-5 143.62 southbound from April 2003 till October 2004. This time series consists of 5400 data points.

Among the four datasets, the data in the Traffic dataset have the strongest correlations, since cars can only travel on the road, i.e. the direction and velocity vary slowly. By contrast, the data in Check-in have the weakest correlation, since walkers may change their direction of motion suddenly. In order to collect various types of statistical query results, a query set |$Q$| containing 1000 random queries, covering applications such as LBSs and trajectory clustering, was conducted on every dataset. The values of key parameters are shown in Table 1. We use the privacy distortion defined in Equation (10) as the criterion to evaluate the resistance performance.

Figure 6. Comparison results of impact of correlation. (a) Kalman filtering. (b) Particle filtering.

5.2 Impact of correlation

Table 1. The values of key parameters.

Parameter | Value
|$\sigma _{w}^2$| | 2.0 (Check-in), 1.5 (Trajectory), 1.0 (T-Drive Taxi), 0.5 (Traffic)
|$\sigma _{v}^2$| | 1.0
|$\sigma _{q}^2$| | 2.0
Markov | Order = 1
Bayesian | Expectation maximization (EM)
CIM | |$T$| = 0.3 (iteration parameter), |$\eta $| = 7.0 (update parameter)
DFT | |$k$| = 20 (coefficients), |$n$| = 2000
DWT | |$k$| = 5 (translation parameter), Order = 10
PCA | Components = 2

The impact of correlation on privacy protection strength was examined through comparison on the four datasets with different correlation strengths. To exclude the impact of data transformation on the experimental results, we do not transform the trajectory data and directly add IID noise. We set the total privacy budget |$\varepsilon $| from 0.1 to 0.9 with a step of 0.2 and calculate the privacy distortion by Equation (10) under each privacy strength setting. The order of the Kalman filter is set to 4, and the number of samples of the particle filter equals the size of the original dataset. The privacy evaluation results are shown in Fig. 6 and Table 2.
Figure 7. Comparison results of different schemes after Kalman filtering. (a) Traffic. (b) T-Drive. (c) Trajectory. (d) Check-in.

Figure 8. Comparison results of different schemes after particle filtering. (a) Traffic. (b) T-Drive. (c) Trajectory. (d) Check-in.

Figure 6 illustrates the impact of correlation under Kalman and particle filtering. We observe that privacy distortion exists after both filtering methods, which means that the IID noise is sanitized to a certain extent. Besides, the differences in privacy distortion across the four datasets demonstrate that correlation indeed affects the privacy distortion. Specifically, as shown in Fig. 6(a) and the data in Table 2, when |$ \varepsilon =0.1 $| the privacy distortion after Kalman filtering is 97.8% on Traffic, the dataset with the strongest correlation, while it is 32.7% on Check-in, the dataset with the weakest correlation, a gap of 65.1 percentage points caused by the variation in correlation. When |$ \varepsilon =0.9 $|⁠, the privacy distortion on Traffic and Check-in is 50.1% and 8.6%, respectively, a gap of 41.5 percentage points. The same trend can also be observed on the four datasets with particle filtering. This demonstrates the effectiveness of our solution, i.e. the difference in correlation between the noise and the original data does lead to privacy distortion. Moreover, the privacy distortion at |$ \varepsilon =0.1 $| is larger than that at |$ \varepsilon =0.9 $|⁠: a smaller |$ \varepsilon $| means more noise is added to the original dataset, so more noise is sanitized and the privacy distortion becomes larger. This indicates that, in practice, noise added according to a stricter |$ \varepsilon $| provides a weaker effective privacy guarantee than the nominal one. The evaluation of the correlation impact thus demonstrates the effectiveness of our sanitization solution and shows how the privacy distortion is affected by the correlation.

5.3 Resistance of different schemes

The resistance performance of current schemes was examined on six state-of-the-art approaches. The experiments were run on the four datasets, and the parameters of each method were set following the suggestions in the corresponding papers. The evaluation results are shown in Figs 7 and 8 and Tables 3 and 4. As shown in Fig. 7, all of the existing typical methods exhibit a certain degree of privacy distortion. Compared with the transform-based methods (DFT, DWT and PCA), the model-based methods (Markov, Bayesian and CIM) suffer less privacy distortion, because the transform-based methods destroy the correlations of the data. Specifically, as shown in Table 3, among the transform-based methods on the Traffic dataset with |$ \varepsilon =0.1 $|⁠, the privacy distortion of DFT is 87.5% while that of PCA is 78.7%, a gap of 8.8 percentage points. In contrast, among the model-based methods, the privacy distortion of Markov is 75.5% while that of CIM is 51.9%, a gap of 23.6 percentage points. The same trend can also be observed on the other datasets.

Figure 8 and Table 4 show the privacy distortion of the different schemes under particle filtering. The privacy distortion of Laplace noise after particle filtering is slightly smaller than that of Gaussian noise after Kalman filtering, indicating that the Kalman filter is more effective against Gaussian noise than the particle filter is against Laplace noise. Moreover, the privacy distortion on Check-in is smaller than that on Traffic, since the data in Traffic have a stronger correlation, which corroborates the conclusions in Section 5.2.
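For the Laplace-noise branch of the attack, a bootstrap particle filter can play the role that the Kalman filter plays for Gaussian noise. The sketch below is again our illustration under assumed parameters (a random-walk motion model with variance q and a Laplace measurement model with scale b), not the exact filter used in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

n, eps, sens = 1000, 0.5, 1.0
true = np.cumsum(0.1 * rng.standard_normal(n))            # correlated series
released = true + rng.laplace(scale=sens / eps, size=n)   # Laplace mechanism

def particle_filter(z, m=1000, q=0.05, b=sens / eps):
    """Bootstrap particle filter: random-walk motion model, Laplace(b)
    measurement model. m, q and b are the attacker's assumptions."""
    parts = np.full(m, z[0])
    est = np.empty_like(z)
    for t, meas in enumerate(z):
        parts = parts + np.sqrt(q) * rng.standard_normal(m)   # propagate
        w = np.exp(-np.abs(meas - parts) / b)                  # Laplace likelihood
        w /= w.sum()
        est[t] = w @ parts                                     # posterior mean
        parts = parts[rng.choice(m, m, p=w)]                   # resample
    return est

filtered = particle_filter(released, m=n)   # samples equal to data length, as in Section 5.2
print("residual noise variance:", np.var(filtered - true))
```

Resampling at every step keeps the sketch short; a practical implementation would typically resample only when the effective sample size drops, which improves the estimate for the same number of particles.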
Table 2. Impact of correlation on four datasets (privacy distortion, %).

                     |$ \varepsilon $|   Check-in   Trajectory   T-Drive   Traffic
Kalman filtering     0.1                 32.7       54.1         96.3      97.8
                     0.3                 21.9       48.3         89.4      92.9
                     0.5                 18.4       33.7         75.6      86.1
                     0.7                 10.2       21.9         57.8      76.2
                     0.9                  8.6       19.5         36.2      50.1
Particle filtering   0.1                 21.4       41.2         87.6      89.9
                     0.3                 18.3       34.9         73.4      83.5
                     0.5                 14.2       25.9         69.0      79.7
                     0.7                 10.9       20.2         41.6      64.2
                     0.9                  6.4       13.1         27.2      41.6
Table 3. The resistance results of different schemes after Kalman filtering (privacy distortion, %).

Datasets     |$ \varepsilon $|   DFT    DWT    PCA    Markov   Bayesian   CIM
Traffic      0.1                 87.5   85.2   78.7   75.5     61.4       51.9
             0.3                 84.1   81.1   77.1   71.6     61.2       49.9
             0.5                 78.8   72.0   67.9   61.3     54.8       43.8
             0.7                 65.2   62.6   58.4   51.0     51.9       41.2
             0.9                 56.9   50.4   40.2   38.5     37.4       35.6
T-Drive      0.1                 85.2   82.1   75.4   72.1     60.3       50.2
             0.3                 81.3   80.6   72.2   70.8     59.0       45.8
             0.5                 71.5   70.9   63.5   60.5     52.9       41.7
             0.7                 54.2   52.1   54.9   50.9     47.0       39.5
             0.9                 35.8   32.6   30.2   28.7     27.6       26.4
Trajectory   0.1                 83.4   81.0   74.1   70.3     70.1       48.2
             0.3                 80.4   78.5   71.0   66.7     65.3       43.1
             0.5                 70.9   69.1   62.4   52.9     51.8       40.5
             0.7                 52.0   50.3   51.9   42.3     40.1       38.7
             0.9                 31.1   30.4   28.8   26.1     25.0       21.2
Check-in     0.1                 80.2   80.1   73.2   72.4     68.3       47.6
             0.3                 78.6   76.4   70.9   65.5     62.4       41.2
             0.5                 69.3   68.7   61.8   51.0     50.8       39.8
             0.7                 49.8   48.8   47.2   41.2     39.6       34.6
             0.9                 30.2   29.9   28.0   23.5     24.7       21.0

Figure 9. Comparison results of filter accuracy. (a) Kalman filtering. (b) Particle filtering.
Table 4. The resistance results of different schemes after particle filtering (privacy distortion, %).

Datasets     |$ \varepsilon $|   DFT    DWT    PCA    Markov   Bayesian   CIM
Traffic      0.1                 85.3   81.9   74.2   71.0     59.8       50.2
             0.3                 82.4   78.8   73.1   70.9     58.7       48.1
             0.5                 72.6   71.2   63.5   60.4     51.0       41.9
             0.7                 61.5   60.5   54.9   50.5     48.7       40.4
             0.9                 53.7   49.8   30.2   42.1     34.7       32.6
T-Drive      0.1                 81.0   81.5   74.3   70.4     59.8       46.9
             0.3                 80.1   78.3   71.0   67.9     57.6       43.2
             0.5                 71.4   68.2   62.6   58.8     51.0       40.1
             0.7                 53.8   50.6   51.4   47.7     43.2       37.8
             0.9                 34.6   30.3   27.9   26.7     26.9       24.5
Trajectory   0.1                 80.1   78.8   76.0   69.8     65.8       48.2
             0.3                 80.2   76.9   70.8   65.8     64.3       40.1
             0.5                 65.9   68.3   61.2   52.3     50.8       39.5
             0.7                 50.0   48.7   51.2   40.2     38.7       37.7
             0.9                 31.7   28.0   27.7   28.6     25.4       27.0
Check-in     0.1                 81.0   78.8   72.3   70.1     64.3       44.0
             0.3                 76.3   75.2   68.1   62.3     60.7       40.6
             0.5                 68.2   67.5   60.2   50.6     47.5       36.6
             0.7                 48.7   46.1   42.7   40.5     37.5       32.9
             0.9                 29.9   28.4   26.9   20.6     21.8       18.7

In summary, the above evaluation demonstrates the following: (i) model-based methods suffer a smaller degree of privacy distortion than transform-based methods, since transform-based methods destroy the data correlation; (ii) Kalman filtering is more effective against Gaussian noise than particle filtering is against Laplace noise; and (iii) the stronger the correlation, the larger the privacy distortion.
5.4 Impact of filter accuracy

Figure 9 shows the comparison results for different filter accuracies. From Fig. 9(a) we can see that the privacy distortion grows exponentially with the order of the Kalman filter, which means that a more accurate filter has a better filtering effect. Note that the privacy distortion exceeds 100% when the order of the filter reaches 64; the reason is that, as the order increases, some of the original data are also sanitized by the filter. Figure 9(b) shows the influence of the ratio of the number of samples in the particle filter to the original data length. The number of samples directly affects the accuracy of the estimate, and the privacy distortion increases with the number of samples. When the number of samples is twice the original data length, the privacy distortion approaches 100%. Figure 9(b) thus indicates that the more samples are used, the more noise is filtered out of the perturbed trajectory.
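A sketch of how such a filter-accuracy experiment can be probed is given below. It is our illustration under explicit assumptions: "order" is read here as the dimension of a kinematic state vector (position plus its first derivatives), which is one plausible interpretation the paper does not spell out, and the signal model, q and r are guesses, so the printed residual variances will vary with these choices.

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(3)
n = 1000
true = np.cumsum(0.1 * rng.standard_normal(n))
released = true + rng.normal(scale=3.0, size=n)     # Gaussian-mechanism-style noise

def kinematic_kalman(z, order, q=0.01, r=9.0, dt=1.0):
    """Kalman filter whose state holds position and its first order-1
    derivatives; q and r are the attacker's guessed noise variances."""
    A = np.array([[dt ** (j - i) / factorial(j - i) if j >= i else 0.0
                   for j in range(order)] for i in range(order)])
    H = np.zeros((1, order)); H[0, 0] = 1.0           # we observe position only
    Q = q * np.eye(order)
    x = np.zeros((order, 1)); x[0, 0] = z[0]
    P = np.eye(order) * r
    out = np.empty_like(z)
    for t, meas in enumerate(z):
        x = A @ x; P = A @ P @ A.T + Q                # predict
        S = H @ P @ H.T + r
        K = P @ H.T / S                               # Kalman gain
        x = x + K * (meas - (H @ x).item())           # update
        P = (np.eye(order) - K @ H) @ P
        out[t] = x[0, 0]
    return out

for order in (1, 2, 4, 8):
    resid = np.var(kinematic_kalman(released, order) - true)
    print(f"order {order}: residual noise variance {resid:.2f}")
```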
6 Conclusions and future work

In this paper, we evaluated the resistance performance of current schemes that utilize differential privacy technology for trajectory data publishing. We proposed a notion called CDF, which takes advantage of the different correlations of the noise and the trajectory data, and applied two efficient filtering methods to realize CDF in practice. Furthermore, to measure the privacy distortion caused by our filtering attack, we proposed a privacy metric based on entropy theory (EBPM). This is the first work that attempts to evaluate the resistance performance of current schemes using the different correlations of the noise and the original data. Note that our attack scheme is general and not limited to trajectories; when addressing high-dimensional locations, we can launch the attack along each coordinate dimension using our filtering method. Extensive experiments on real-life datasets demonstrate the effectiveness of our solution. Our solution can significantly reduce the privacy strength of the state-of-the-art approaches and provides a benchmark for researchers to design more effective privacy-preserving methods for trajectory data release. Future work includes investigating more accurate and effective attacks that exploit the features of trajectory data, and exploring corresponding defense methods, i.e. more robust differentially private trajectory data publishing mechanisms that can resist the filtering attack proposed in this paper.

Acknowledgments

This work was supported in part by the Scientific and Technological Projects of Chongqing Education Committee (KJQN201900612). The authors are grateful to the anonymous reviewers for their constructive comments and improvements.

REFERENCES

1 Wang, H. and Xu, Z. (2017) CTS-DP: Publishing correlated time-series data via differential privacy. Knowl. Based Syst., 122, 167–179.
2 Lee, W., Tseng, S. and Tsai, S. (2009) A knowledge based real-time travel time prediction system for urban network. Expert Syst. Appl., 36, 4239–4247.
3 Falvi, G. and Pedersen, T. (2009) Mining long, sharable patterns in trajectories of moving objects. Geoinformatica, 13, 27–55.
4 Wang, H. and Li, K. (2019) SRS-LM: Differentially private publication for infinite streaming data. J. Ambient Intell. Humaniz. Comput., 10, 2453–2466.
5 Wang, H., Xu, Z. and Jia, S. (2017) Cluster-indistinguishability: A practical differential privacy mechanism for trajectory clustering. Intell. Data Anal., 21, 1305–1326.
6 Komishani, E. and Abadi, M. (2013) A generalization-based approach for personalized privacy preservation in trajectory data publishing. In Int. Symposium on Telecommunications, pp. 1129–1135. IEEE.
7 Komishani, E.G., Abadi, M. and Deldar, F. (2016) PPTD: Preserving personalized privacy in trajectory data publishing by sensitive attribute generalization and trajectory local suppression. Knowl. Based Syst., 94, 43–59.
8 Gursoy, M.E., Liu, L., Truex, S., Yu, L. and Wei, W. (2018) Utility-aware synthesis of differentially private and attack-resilient location traces. In Proc. ACM Conf. on Computer and Communications Security (CCS). ACM.
9 Gu, K., Yang, L. and Liao, N. (2018) Trajectory data privacy protection based on differential privacy mechanism. In IOP Conference Series: Materials Science and Engineering.
10 Deldar, F. and Abadi, M. (2018) PLDP-TD: Personalized-location differentially private data analysis on trajectory databases. Pervasive Mob. Comput., 49, 1–22.
11 Kaplan, E., Gursoy, M.E. and Saygin, Y. (2018) Location disclosure risks of releasing trajectory distances. Data Knowl. Eng., 113, 43–63.
12 Abul, O., Bonchi, F. and Nanni, M. (2008) Never walk alone: Uncertainty for anonymity in moving objects databases. In Proc. 24th IEEE Int. Conf. on Data Engineering (ICDE), pp. 376–385. IEEE.
13 Cicek, A.E., Nergiz, M.E. and Saygin, Y. (2014) Ensuring location diversity in privacy-preserving spatio-temporal data publishing. VLDB J., 23, 609–625.
14 Dwork, C. (2006) Differential privacy. In Int. Colloquium on Automata, Languages and Programming, pp. 1–12.
15 Dwork, C. (2008) Differential privacy: A survey of results. Int. Conf. Theory Appl. Models Comput., 4978, 1–19.
16 Shen, E. and Yu, T. (2013) Mining frequent graph patterns with differential privacy. In Proc. 19th ACM SIGKDD Conf. on Knowledge Discovery and Data Mining, pp. 545–553. ACM.
17 Yang, B., Sato, I. and Nakagawa, H. (2015) Bayesian differential privacy on correlated data. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pp. 747–762. ACM.
18 Xiao, Y. and Xiong, L. (2014) Dynamic differential privacy for location based applications. IEEE Trans. Dependable Secure Comput., 25, 1400–1412.
19 Shokri, R. (2014) Privacy games: Optimal protection mechanism design for Bayesian and differential privacy. IEEE Trans. Mobile Comput., 12, 34–35.
20 Zhu, T., Xiong, P., Li, G. and Zhou, W. (2015) Correlated differential privacy: Hiding information in non-IID data set. IEEE Trans. Inf. Forensics Security, 10, 229–242.
21 Rastogi, V. and Nath, S. (2010) Differentially private aggregation of distributed time-series with transformation and encryption. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pp. 735–746. ACM.
22 Xiao, X., Wang, G. and Gehrke, J. (2011) Differential privacy via wavelet transforms. IEEE Trans. Knowl. Data Eng., 23, 1200–1214.
23 Jiang, W., Xie, C. and Zhang, Z. (2015) Wishart mechanism for differentially private principal components analysis. Comput. Sci., 9285, 458–473.
24 Cao, L., Ou, Y. and Yu, P. (2011) Coupled behavior analysis with applications. IEEE Trans. Knowl. Data Eng., 24, 1378–1392.
25 Dwork, C. (2011) A firm foundation for private data analysis. Commun. ACM, 54, 86–95.
26 Liu, X. (1992) Entropy, distance measure and similarity measure of fuzzy sets and their relations. Fuzzy Sets Syst., 52, 305–318.
27 Dallal, G. (1994) LMS: Least median of squares regression. J. Am. Stat. Assoc., 79, 871–880.
28 Fildes, R. (1991) Forecasting, structural time series models and the Kalman filter: Bayesian forecasting and dynamic models. Technometrics, 34, 496–497.
29 Salmond, D. and Birch, H. (2002) A particle filter for track-before-detect. In American Control Conf., pp. 3755–3760. IEEE.
30 Stenger, B., Thayananthan, A., Torr, P. and Cipolla, R. (2006) Model-based hand tracking using a hierarchical Bayesian filter. IEEE Trans. Pattern Anal. Mach. Intell., 28, 1372–1384.
31 Kifer, D. and Gehrke, J. (2006) Injecting utility into anonymized datasets. In Proc. ACM SIGMOD Int. Conf. on Management of Data. ACM.
32 Seys, S., Claessens, J. and Preneel, B. (2002) Towards measuring anonymity. In Int. Conf. on Privacy Enhancing Technologies, pp. 54–68.
33 Serjantov, A. and Danezis, G. (2002) Towards an information theoretic metric for anonymity. In Int. Conf. on Privacy Enhancing Technologies, pp. 41–53.
34 Cheng, Z., Caverlee, J., Lee, K. and Sui, D. (2011) Exploring millions of footprints in location sharing services. In Proc. Int. Conf. on Weblogs and Social Media, pp. 81–88. AAAI.
35 Zheng, Y., Xie, X. and Ma, W. (2010) GeoLife: A collaborative social networking service among user, location and trajectory. IEEE Data Eng. Bull., 33, 32–39.
36 Yuan, J., Zheng, Y., Xie, X. and Sun, G. (2012) T-Drive: Enhancing driving directions with taxi drivers' intelligence. IEEE Trans. Knowl. Data Eng., 25, 220–232.
37 Fan, L., Xiong, L. and Sunderam, V. (2013) FAST: Differentially private real-time aggregate monitor with filtering and adaptive sampling. In Proc. ACM SIGKDD Conf. on Knowledge Discovery and Data Mining, pp. 1065–1068. ACM.

© The British Computer Society 2019. All rights reserved.
For permissions, please e-mail: journals.permissions@oup.com. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model).
TI - Resistance of IID Noise in Differentially Private Schemes for Trajectory Publishing
JO - The Computer Journal
DO - 10.1093/comjnl/bxz097
DA - 2020-04-17
UR - https://www.deepdyve.com/lp/oxford-university-press/resistance-of-iid-noise-in-differentially-private-schemes-for-oOedjkFBzu
SP - 1
VL - Advance Article
IS -
DP - DeepDyve
ER -