The Computer Journal, Volume 61 (8) – Aug 1, 2018

10 pages

- Publisher
- Oxford University Press
- Copyright
- © The British Computer Society 2018. All rights reserved. For Permissions, please email: journals.permissions@oup.com
- ISSN
- 0010-4620
- eISSN
- 1460-2067
- D.O.I.
- 10.1093/comjnl/bxy024

Abstract

Outsourcing data mining tasks is beneficial for data owners who lack either expertise in data mining or sufficient computing resources. However, directly releasing the original data would leak private information. Research on Privacy-Preserving Data Mining (PPDM) is dedicated to addressing this issue; its aim is to reduce the risk of privacy violations while preserving the knowledge in the original data. However, most existing methods in the literature ignore the case in which service providers want to verify the integrity and authenticity of their clients’ data to avoid data tampering before performing data mining tasks. In this paper, a new method is proposed that extends the turtle shell algorithm of data hiding to protect the privacy of the original data and to provide authentication at the same time. Data perturbation is performed by replacing data values with their closest neighbors according to a reference matrix. Further, a message authentication code is hidden in the perturbed data to verify the integrity and authenticity of the perturbed data. The experimental results showed that the proposed method achieved the purpose of data perturbation and outperformed similar methods in satisfying the PPDM requirement.

1. INTRODUCTION

In recent years, data mining techniques [1] have been used extensively to obtain useful information, such as correlations and patterns, from large amounts of data. These techniques have enabled data owners to understand their data better and to gain competitive advantages. Because executing data mining tasks requires considerable expertise and, usually, intensive computational resources, it is beneficial for data owners who are not familiar with data mining techniques or who do not have sufficient computing resources to outsource the data mining tasks to external service providers.
However, releasing the original data unavoidably exposes confidential information that can be linked to specific individuals. Because data privacy is a critical concern for legal and commercial reasons [2], an appropriate privacy-preserving mechanism should be applied to the original data before they are sent to service providers. This area of data mining, known as Privacy-Preserving Data Mining (PPDM) [3–5], aims to protect private information effectively while preserving the knowledge in the original data.

Numerous methods have been proposed to deal with PPDM. The k-anonymity model [6, 7] aggregates the quasi-identifier attributes of the original data into different groups, where each quasi-identifier value is indistinguishably mapped to at least k records. For instance, Inan et al. [8] studied the problem of building classification models from k-anonymized data using different statistics of quasi-identifier attributes. Data swapping [9, 10] replaces portions of the data with data taken from the same distribution and preserves the statistical information of the original data. Moore et al. [11] proposed that two attribute values can be swapped only if the difference between them falls within a defined interval. Randomization, which adds controlled random noise to the original data to perturb the sensitive content, is another way of releasing perturbed data for PPDM. Lin [12, 13] utilized a randomization method based on a random linear transformation to perturb the original data and performed kernel k-means clustering and support vector machine classification on the perturbed data.

Although these methods are useful in preserving the private information of the original data to some extent, they cannot hide secret messages in the perturbed data. Such secret messages are used for various purposes, such as copyright protection, covert communications and content authentication.
Taking content authentication as an example, if an intruder tampers with the perturbed data at any stage of transmission, then the result of data mining on the tampered data cannot be trusted. So, when service providers receive their clients’ data, the first thing they should do is to verify the integrity and authenticity of the perturbed data with their clients. In some circumstances, the data owners are required to protect the privacy of their original data and to embed useful information into the perturbed data. Thus, in this paper, we propose a method that simultaneously perturbs the original data and hides secret messages by means of a reference matrix.

Two existing methods have motivations most similar to ours, namely Privacy Difference Expansion (PDE) [14] and Reversible Data Transform (RDT) [15]. Both of them preserve the privacy of the original data and hide a string of secret digits in the perturbed data using difference expansion [16] and integer transform [17], respectively. Because difference expansion is best suited to hiding secret digits in two records that have a small difference, PDE and RDT have an obvious defect: they may produce abnormal records if the difference between two adjacent records of the original data is large. To overcome this, a new method is proposed to perturb the original data using the turtle shell algorithm [18], which modifies the pixel values in a grayscale image according to a reference matrix composed of a large number of turtle shells. We extend this algorithm to replace the original data values with their neighbors in the same turtle shell set so that the proposed method can avoid producing abnormal records and preserve the availability of the perturbed data.

The rest of the paper is organized as follows. The related works are addressed in Section 2. The proposed method is described in Section 3, and examples of our method are presented in Section 4.
The experimental results are analyzed in Section 5, and Section 6 presents our conclusions.

2. RELATED WORKS

Many privacy-preserving methods for data mining have been proposed, which can be categorized into four groups: data anonymization, data swapping, randomization and cryptographic methods. The first group, introduced in [6], aggregates individual data into groups and makes it difficult to distinguish individuals from their corresponding groups. The second group, proposed in [9], swaps attribute values of records while preserving the original statistical information. The third group, introduced in [3], preserves the private data by adding noise to the original data. The last group uses cryptographic methods to protect the private information of the original data [19]. Our proposed method belongs loosely to the third group, focusing on perturbation methods that also provide data hiding.

Several works with a similar motivation have been presented recently. For example, PDE [14] uses the concept of difference expansion [16] from image processing. It pairs two adjacent records of the original data into groups and calculates the difference values of each neighboring data group. PDE expands these differences within each group and embeds secret digits into the new data differences. In this way, the original records are perturbed by the operations of difference expansion and hiding secret digits. Another method, RDT [15], improves on PDE by enlarging the size of the record groups, which can be determined by the users. RDT first performs the difference expansion on the data within the group and then distorts the data group by using integer transformation [17].

There are three notable differences between our proposed method and the above two works. First, the proposed method uses the turtle shell algorithm instead of the difference expansion.
Compared with the difference expansion, the turtle shell algorithm can replace the original value with its adjacent value and preserve the availability of the value. Second, the secret digits used in PDE and RDT to verify the integrity of the perturbed data are random and do not have any specific relationship with the original data, while our secret digits are generated from the original data. These secret digits are called the message authentication code in the following. Third, for PDE and RDT, the degree of data perturbation is adjusted by the degree of difference expansion. Different expansions correspond to different parameters, so there will be several parameters if there are many adjacent values with great differences in the original data. It is not feasible for the data owners to set and understand each of these parameters. By contrast, the proposed method provides users with a simple way of adjusting the degree of data perturbation by distorting different parts of the original values.

3. DATA TRANSFORM BASED ON THE TURTLE SHELL ALGORITHM

PPDM requires that the privacy of the original data be preserved and that the risk of losing knowledge be reduced [20, 21]. Although greater distortion provides better protection of private information, it makes it more difficult to get accurate results from data mining. Based on the process of data normalization and the turtle shell algorithm, the proposed method replaces the values of the original data with their closest neighbors and preserves the similarity between the original records and their corresponding perturbed records. Because of the process of data normalization, the proposed method is more suitable for perturbing numerical attributes. Although the values of categorical attributes can be converted to numerical forms, we leave a better solution for categorical attributes as future work.
A message authentication code generated from the original data is used to verify the integrity and authenticity of the perturbed data. The degree of data perturbation can be adjusted easily by selecting different parts of the data values to perturb.

3.1. Generation of the reference matrix

A reference matrix should be constructed before the data are perturbed. To keep this paper self-contained, we briefly introduce the reference matrix given in [18] in this subsection and present the important algorithms used to implement the process. The reference matrix has two obvious features: (1) in each row, the minimum value is 0 and the maximum value is 7, and each element is equal to its left neighbor plus 1, modulo 8; (2) in each column, the difference between two adjacent elements is set alternately to ‘2’ or ‘3’, also modulo 8. In this way, the reference matrix can be divided into a number of hexagons, called turtle shells, each of which contains the eight different digits from 0 to 7. In [18], because the pixel values in a grayscale image range from 0 to 255, a 256×256 reference matrix is generated. In this work, however, the size of the reference matrix is not restricted as long as the matrix has the above two features. Figure 1 presents an example of the reference matrix based on turtle shells. The first turtle shell in the left-bottom corner of the reference matrix is key to the matrix, because the subsequent turtle shells are arranged according to its position. That is, different start positions of the first turtle shell correspond to different reference matrices.

Figure 1. Example of the reference matrix based on turtle shells.
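The two features above are enough to generate a small reference matrix. The following Python sketch is illustrative only: the function names are hypothetical, it assumes that row 0 starts with the value 0, and it takes the eight positions of a shell from the offsets listed in Algorithm 2, with the bottom element assumed to lie on an even row.

```python
def build_reference_matrix(rows: int, cols: int) -> list[list[int]]:
    """Build a rows x cols reference matrix whose row 0 starts with 0 (assumption)."""
    matrix = []
    first = 0  # value of the first element of the current row
    for r in range(rows):
        # Feature (1): each element is its left neighbor plus 1, modulo 8.
        matrix.append([(first + c) % 8 for c in range(cols)])
        # Feature (2): moving to the next row adds 2 and 3 alternately, modulo 8.
        first = (first + (2 if r % 2 == 0 else 3)) % 8
    return matrix


def turtle_shell(matrix: list[list[int]], row: int, col: int) -> list[int]:
    """Return the eight values of the shell whose bottom element is (row, col)."""
    # (row offset, column offset) pairs in the order used by Algorithm 2.
    offsets = [(0, 0), (1, -1), (2, -1), (3, 0), (2, 1), (1, 1), (2, 0), (1, 0)]
    return [matrix[row + dr][col + dc] for dr, dc in offsets]


m = build_reference_matrix(8, 8)
print(m[5][6])                        # 2
print(sorted(turtle_shell(m, 0, 1)))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Under these assumptions, every shell whose bottom element lies on an even row contains each digit from 0 to 7 exactly once, which is the property the perturbation relies on.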
Note that there is no need to store the entire reference matrix because the value of any element in the reference matrix can be calculated from the element’s row and column indexes. To get the value at (RowIdx, ColIdx), the first element of row RowIdx, i.e. the element at (RowIdx, 0), is added to ColIdx, and the sum modulo 8 is the value at (RowIdx, ColIdx). For example, for (5, 6), the first element of row 5 is 4 at (5, 0). Adding 4 to the column index 6 gives 10, and 10 mod 8 is 2. That is, 2 is the value at (5, 6). As shown in Algorithm 1, the GetElementValue algorithm has a time complexity of O(n), where n is the number of rows of the reference matrix. The elements in the same turtle shell can also be calculated because the other seven elements of the turtle shell are determined by the element at the bottom of the shell. Algorithm 2 shows the GetTurtleShell algorithm, which also has a time complexity of O(n), where n is the number of rows of the reference matrix. The location of each element, whether inside a turtle shell or at the intersection of two or three turtle shells, can also be calculated. Thus, the proposed method does not impose any storage overhead on the users.

Algorithm 1 GetElementValue(rowIdx, colIdx)

    begin
        if (rowIdx is equal to 0)
            return (colIdx mod 8);
        first = 0;
        idx = 0;
        while (idx < rowIdx) do
        begin
            if (idx mod 2 is equal to 0)
                temp = 2;
            else
                temp = 3;
            first = (first + temp) mod 8;
            idx++;
        end
        value = (first + colIdx) mod 8;
        return value;
    end

Algorithm 2 GetTurtleShell(the bottom element)

    begin
        rowIdx = the row index of the bottom element;
        colIdx = the column index of the bottom element;
        the 1st element = the bottom element;
        the 2nd element = GetElementValue(rowIdx+1, colIdx−1);
        the 3rd element = GetElementValue(rowIdx+2, colIdx−1);
        the 4th element = GetElementValue(rowIdx+3, colIdx);
        the 5th element = GetElementValue(rowIdx+2, colIdx+1);
        the 6th element = GetElementValue(rowIdx+1, colIdx+1);
        the 7th element = GetElementValue(rowIdx+2, colIdx);
        the 8th element = GetElementValue(rowIdx+1, colIdx);
        a turtle shell is constructed by the eight elements;
    end

3.2. Perturbation of the original data

After generating the reference matrix, the original data are perturbed column by column. However, if we directly apply the proposed method to original data whose attributes have different units, it may cause large differences between the original data and the perturbed data, thereby leading to the loss of a significant amount of information. So, preprocessing of the data is necessary in the proposed method. The original data are first normalized, which preserves the relationship among the attributes of the original data.

3.2.1. Perturbation procedures

The proposed perturbation of the data is described as follows:

Input: the reference matrix M, the original data D, where $D=\{d_{i,j} \mid i=1,2,\ldots,n;\ j=1,2,\ldots,m\}$, n is the number of records of D, and m is the number of attributes of D. (The jth attribute of D is denoted as $A_j$.)
Output: the perturbed data D′, where $D'=\{d'_{i,j} \mid i=1,2,\ldots,n;\ j=1,2,\ldots,m\}$, and the message authentication code S.

Step 1. Generate the message authentication code S according to the original data D. In PDE and RDT, the original data are perturbed by embedding secret digits that can later be extracted to identify whether the perturbed data have been tampered with. This is a useful mechanism to preserve the integrity of the perturbed data; however, the secret digits are generated randomly and have no specific relationship with the original data. By contrast, our proposed method generates a message authentication code S according to the original data. The message authentication code, which can be generated by MD5 [22] or other hash algorithms [23], is a binary string in which every three bits are converted to an integer in the range of 0 to 7.

Step 2. Normalize $A_j$, where $j=1,2,\ldots,m$. The normalization methods in [1], such as min–max normalization, standard deviation normalization and normalization by decimal scaling, can be used in the proposed method to prevent different value ranges and numerical difficulty.

Step 3. Convert the normalized attribute values to integers. After normalization, attribute values become decimal numbers. Because the row and column indexes of each element of the reference matrix should be integers, the normalized attribute values denoting these indexes should be converted to integers.

Step 4. Group every two adjacent values $d_{i,j}$, $d_{i+1,j}$ of the jth attribute, where $j=1,2,\ldots,m$, into a data pair $(d_{i,j}, d_{i+1,j})$. Transform $(d_{i,j}, d_{i+1,j})$ to its closest neighbor $(d'_{i,j}, d'_{i+1,j})$ according to the reference matrix M and the current secret digit of the message authentication code S. For the original data, each data pair $(d_{i,j}, d_{i+1,j})$ of the jth attribute locates an element of the reference matrix M, where $d_{i,j}$ corresponds to the row index of the element, and $d_{i+1,j}$ corresponds to the column index.
Before perturbing, an associate set G consisting of turtle shells should be found. If the element located by the data pair $(d_{i,j}, d_{i+1,j})$ is inside a turtle shell, the associate set G contains only that turtle shell; if the element is on the edge of two or three turtle shells, the associate set G contains all of these turtle shells. Once the associate set G is found, the pair $(d'_{i,j}, d'_{i+1,j})$ whose element value is equal to the current secret digit of the message authentication code S and that has the shortest distance to $(d_{i,j}, d_{i+1,j})$ is selected from all candidate elements in the turtle shells of the associate set G. After that, the original data pair $(d_{i,j}, d_{i+1,j})$ is modified to $(d'_{i,j}, d'_{i+1,j})$, which implies the current secret digit. In this way, the current secret digit is embedded at the element of the reference matrix located by $(d'_{i,j}, d'_{i+1,j})$.

Step 5. Convert the new values $d'_{i,j}$, $d'_{i+1,j}$ back to their original format. The attribute values are expanded to integers in Step 3. So, after they have been perturbed, they should be converted back to their original decimal format.

3.2.2. Adjustment of the degree of data perturbation

To increase the flexibility of preserving privacy, different parts of the original values can be selected to construct different data pairs in the above Steps 3 and 4. Without loss of generality, for a normalized value $d=0.x_1\cdots x_k\cdots x_n$, the process of perturbation is summarized as follows:

$$0.x_1\cdots x_k\cdots x_n \xrightarrow{(1)} 0.x_1\cdots x_k \xrightarrow{(2)} x_1\cdots x_k \xrightarrow{(3)} x_1\cdots x_k' \xrightarrow{(4)} 0.x_1\cdots x_k'\cdots x_n,$$

where (1) takes the k digits following the decimal point, (2) converts the decimal to an integer, (3) perturbs the last digit of the integer and (4) converts the perturbed integer back to its original decimal format and concatenates the remaining digits $x_{k+1},\ldots,x_n$. The degree of data perturbation is controlled by the subscript k. The stricter the required protection of privacy, the smaller the value of k, which means greater distortion.
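The procedure and the four stages above can be illustrated with a short Python sketch. It is a simplification under several assumptions, not the method as specified: all function names are hypothetical; the reference matrix is assumed to start with 0 in row 0 and is evaluated with a closed form instead of the loop of Algorithm 1; leftover bits of the hash that do not fill a three-bit group are dropped; normalized values are passed as strings of the form '0.xxx'; and, most importantly, the candidate search scans a small square window around the pair rather than the turtle shells of the associate set G, so its result may differ from the associate-set rule in general.

```python
import hashlib


def bits_to_digits(bits: str) -> list[int]:
    """Convert every three bits to a digit in 0..7 (assumption: leftover bits dropped)."""
    usable = len(bits) - len(bits) % 3
    return [int(bits[i:i + 3], 2) for i in range(0, usable, 3)]


def mac_digits(data: bytes) -> list[int]:
    """Secret digits derived from the MD5 digest of the serialized original data."""
    bits = "".join(f"{byte:08b}" for byte in hashlib.md5(data).digest())
    return bits_to_digits(bits)


def element_value(row: int, col: int) -> int:
    """Closed-form lookup: rows advance by 2 and 3 alternately, so two rows add 5."""
    first = (5 * (row // 2) + 2 * (row % 2)) % 8
    return (first + col) % 8


def nearest_with_digit(row: int, col: int, digit: int, limit: int, radius: int = 3):
    """Window search (a stand-in for the associate set G): closest element holding digit."""
    best = None
    for r in range(max(0, row - radius), min(limit, row + radius) + 1):
        for c in range(max(0, col - radius), min(limit, col + radius) + 1):
            if element_value(r, c) == digit:
                dist = (r - row) ** 2 + (c - col) ** 2
                if best is None or dist < best[0]:
                    best = (dist, r, c)
    return best[1], best[2]


def perturb_pair(a: str, b: str, digit: int, k: int) -> tuple[str, str]:
    """Apply the four stages to a pair of normalized values written as '0.xxx'."""
    frac_a, frac_b = a.split(".")[1], b.split(".")[1]
    # Stages 1 and 2: keep the first k fractional digits and read them as integers.
    row, col = int(frac_a[:k]), int(frac_b[:k])
    # Stage 3: move to the closest element whose value equals the secret digit.
    new_row, new_col = nearest_with_digit(row, col, digit, 10 ** k - 1)
    # Stage 4: restore the decimal form and re-attach the untouched digits.
    return f"0.{new_row:0{k}d}{frac_a[k:]}", f"0.{new_col:0{k}d}{frac_b[k:]}"


print(bits_to_digits("001111010101011"))       # [1, 7, 2, 5, 3]
print(perturb_pair("0.513", "0.373", 1, k=3))  # ('0.514', '0.372')
print(perturb_pair("0.513", "0.373", 1, k=1))  # ('0.613', '0.273')
```

For these inputs the sketch reproduces the pairs of the worked example in Section 4, and extraction is a single lookup: `element_value(514, 372)` returns the embedded digit 1. A smaller k moves the perturbed digit closer to the decimal point.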
That is, for $d=0.x_1\cdots x_k\cdots x_n$, its greatest distortion form is $0.x_1'x_2\cdots x_n$, and its slightest distortion form is $0.x_1\cdots x_{n-1}x_n'$. Users can thus easily adjust the degree of data perturbation according to their requirements for the protection of private information. Meanwhile, every perturbed data pair is not far from its original data pair because both of them are in the same associate set G, which guarantees the data utility. Thus, our proposed method can achieve a good trade-off between privacy protection and data utility. After perturbation, the data owner can send the perturbed data D′ together with the generation rules of the reference matrix, the start position of the first turtle shell, and the message authentication code S to external service providers for data mining and knowledge discovery.

3.3. Extraction of the message authentication code

It is possible for an intruder to tamper with the perturbed data at any stage of transmission, and the results of data mining on the tampered data cannot be trusted. Thus, when service providers receive their clients’ data, their first action should be to verify the integrity and authenticity of the data. The procedure for extracting the message authentication code is described as follows:

Input: the reference matrix M, the perturbed data D′

Output: a message authentication code S′

Step 1: Each perturbed data pair $(d'_{i,j}, d'_{i+1,j})$ of the perturbed data is used to locate an element in the reference matrix M.

Step 2: The value of the element corresponds to a secret digit. After all data pairs of the perturbed data are mapped into the reference matrix M, the service providers can extract the string S′ consisting of the secret digits.

If the string S′ is equal to the original message authentication code S that was sent to the service providers, the perturbed data have not been tampered with, and the results of the data mining are trustworthy; otherwise, the perturbed data are not trustworthy.

3.4. Security analysis

Although the generation rules of the reference matrix are quite simple, it is difficult for an intruder who does not know the start position of the first turtle shell to deduce the original data. Consider the original data $D=\{d_{i,j} \mid i=1,2,\ldots,n;\ j=1,2,\ldots,m\}$, where n is the number of records of D, and m is the number of attributes of D. Depending on the relevance of each attribute to the entire data, suppose there are k (1≤k≤m) attributes that must be determined first to identify a record. Every turtle shell in the reference matrix always contains the eight digits from 0 to 7. If an intruder who knows the reference matrix tries to deduce the original data pair from a perturbed one, there will be eight possibilities. Moreover, every turtle shell always occupies three rows and four columns of the reference matrix. If the intruder tries to identify a record uniquely, there will be at least $3^k$ possibilities. If the intruder wants to deduce the entire data, there will be at least $3^{nk}$ possibilities. In practice, n, m and k are large numbers, so the proposed method fulfills strict privacy requirements.

4. EXAMPLES OF THE PROPOSED METHOD

In this section, examples are given to illustrate the proposed method. Figure 2a presents a column of example data. The normalized form of the column, obtained using min–max normalization [1], is presented in Fig. 2b. Suppose that part of the message authentication code is (001 111 010 101 011)₂. Therefore, the corresponding secret digits are 1, 7, 2, 5 and 3, as generated by 001, 111, 010, 101 and 011, respectively.

Figure 2. Different data forms before, during and after the proposed data perturbation.

The first data pair of the perimeter attribute in Fig. 2b is (0.513, 0.373). Set k (the degree of perturbation) to 3.
To work as the row and column indexes, the pair is expanded to (513, 373) by multiplying each value by 1000. Because the data pair (513, 373) is inside a turtle shell, the associate set G contains only one turtle shell, as shown in Fig. 3. Because the current secret digit of the message authentication code is ‘1’, (514, 372) is determined to be the selected candidate pair. Thus, the original data pair (513, 373) is modified to (514, 372). Then, the final perturbation form of (95.250, 60.250) is (0.514, 0.372). See Fig. 2a and c.

Figure 3. Example of the proposed data perturbation for the data pair (513, 373).

The second data pair of the perimeter attribute in Fig. 2b is (0.344, 0.473). To work as the row and column indexes, the pair is expanded to (344, 473) by multiplying each value by 1000. Because (344, 473) is at the intersection of three turtle shells, the associate set G contains three turtle shells, as shown in Fig. 4. Because the current secret digit of the message authentication code is ‘7’, the three candidate elements with blue borders in Fig. 4 are found. The pair (345, 473) has the shortest distance to (344, 473). Thus, the original data pair (344, 473) is modified to (345, 473). Then, the final perturbation form of (53.000, 85.250) is (0.345, 0.473). See Fig. 2a and c. The entire perturbation form of the perimeter attribute is presented in Fig. 2c.

Figure 4. Example of the proposed data perturbation for the data pair (344, 473).

If a greater distortion is required, then different parts of the original values can be selected to construct different data pairs. For example, set k (the degree of perturbation) to 1.
The original data pair (0.513, 0.373) is then reduced to the pair (0.5, 0.3), whose perturbed form is (0.6, 0.2) according to the corresponding secret digit ‘1’. Therefore, the data pair (0.513, 0.373) is modified to (0.613, 0.273).

When the service providers want to extract the message authentication code from the perturbed data to verify the integrity and authenticity of the data and ensure the correctness of the data mining results, they can extract secret digits from the same reference matrix according to the perturbed data pairs. For example, if we multiply the first data pair in Fig. 2c, (0.514, 0.372), by 1000, we get (514, 372). The new data pair (514, 372) locates the element whose value is ‘1’, which is equal to the first digit of the message authentication code, as shown in Fig. 5.

Figure 5. Example of extracting the secret digit according to the data pair (514, 372).

5. EXPERIMENTAL RESULTS AND ANALYSIS

The proposed method perturbs the original data based on a data hiding algorithm, as do the PDE [14] and RDT [15] methods. Therefore, we compared its performance with that of these two methods. The experiments were performed on a PC with a 3.4 GHz Intel Core i7 CPU and 6 GB of RAM running Windows 7 (64-bit).

5.1. Datasets

In our experiments, three datasets, i.e. small, medium and large datasets provided by the UCI Machine Learning Repository [24], were used to test the performance of the proposed method comprehensively. The details of these datasets are presented in Table 1, in which the column ‘Number of classes’ gives the number of categories of records in the original data. The more categories there are, the more complex the classification task is.

Table 1. Test datasets.
| Datasets | Number of attributes | Number of records | Number of classes |
|----------|----------------------|-------------------|-------------------|
| Breast   | 10                   | 699               | 2                 |
| Abalone  | 8                    | 4177              | 3                 |
| KDD Cup  | 38                   | 4 000 000         | 23                |

5.2. Measurements

The purpose of PPDM is to protect the private information in the original data while retaining the existing knowledge. There should be a trade-off between the risk of disclosing private information and preserving knowledge [14, 20, 21]. Thus, the three measurements listed below were used to compare the effectiveness of our proposed method with that of PDE and RDT.

Classification Accuracy

The classification technique was used on the above datasets in the experiments. The knowledge mined from the perturbed data should be similar to that mined from the original data, so classification accuracy was used to measure the knowledge preservation. In our experiments, three general classifiers in the Weka 3.6 tool [25], i.e. Decision Tree, Naive Bayes and Support Vector Machine (SVM), were used to analyze the knowledge preservation of the perturbed data generated by different perturbation methods. Note that in our experiments, all parameters in Weka 3.6 were left at their default values. The closer the accuracy on the perturbed data is to the accuracy on the original data, the more knowledge is preserved in the perturbed data.
Probabilistic Information Loss (PIL)

The equation used to calculate the total PIL is defined as follows [20, 21]:

$$\mathrm{PIL}=100\times\frac{\mathrm{PIL}(m_1)+\mathrm{PIL}(m_2)+\mathrm{PIL}(m_3)+\mathrm{PIL}(m_4)+\mathrm{PIL}(m_5)}{5},$$

where PIL($m_1$) is the difference between every original data value and its corresponding perturbed value, PIL($m_2$) is the variation of the attribute means, PIL($m_3$) is the variation of the attribute variances, PIL($m_4$) is the variation of the covariance matrix of the original data and PIL($m_5$) is the variation of the correlation matrix of the original data. The smaller the value of PIL is, the less information is likely to be lost.

Privacy Disclosure Risk (PDR)

The equation used to calculate PDR is [14]:

$$\mathrm{PDR}=0.5\times\mathrm{ID}+0.5\times\mathrm{DLD},$$

where ID is the Interval Disclosure [20, 21], which represents the average amount of the perturbed attribute values falling in the intervals around their corresponding original values, and DLD is the Distance Linkage Disclosure [20, 21], which represents the average amount of perturbed records falling within distances that may be subject to the linkage attack. The smaller the value of PDR is, the lower the risk of the disclosure of private information.

5.3. Comparison of performance

We compared the performances of PDE, RDT and our method based on the same datasets and measurements. The results are shown in Figs. 6–8.

Figure 6. Analysis of classification accuracy.

Figure 7. Analysis of PIL.

Figure 8. Analysis of PDR.

The classification accuracies of PDE, RDT and our method are presented in Fig. 6. For the Breast and Abalone datasets, the accuracies of PDE, RDT and our method were very close to the original accuracies. For the KDD Cup dataset, the accuracy of our method was closer to the original accuracy than those of PDE and RDT.
This showed that our method is capable of preserving more of the knowledge in the original data across datasets of different sizes.

The PIL values obtained by using PDE, RDT and our method to perturb each dataset are shown in Fig. 7. Our proposed method was clearly superior to the other two methods in reducing the information loss of the test datasets, because the normalization process preserves the internal similarity between the original attributes and the perturbed attributes, and changing the data only slightly avoids abnormal records. Combined with Table 1, we also concluded that the proposed method does not incur much greater information loss when the amount of original data increases exponentially.

The PDR values obtained using PDE, RDT and our method are presented in Fig. 8. The figure shows that the PDR values of our proposed method on the Breast and KDD Cup datasets were not as good as those of RDT, but they were better than those of PDE. The range of values in the Abalone dataset was small, so the proposed method produced perturbed data similar to the original data, which led to a high disclosure risk. In fact, PIL and PDR conflict, because they cannot both be small at the same time. Although the proposed method was inferior on the PDR index for some datasets, it clearly outperformed the other two methods on the PIL index.

The comparison also indicates that our method is better suited to datasets with larger differences between each pair of records. If the original data have large differences between each pair of records, normalization scales them into a fixed range while preserving the character of each pair of records. Slightly perturbing the data within the designated range then retains more complete knowledge and reduces the risk of disclosing private information.
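The two composite scores compared above can be illustrated with a short sketch. It assumes that the five PIL components are already available as values in [0, 1] and that ID and DLD are already expressed as percentages; neither scale is stated explicitly in the text, and the function names `total_pil` and `pdr` are hypothetical. Computing the individual components themselves is outside the scope of this sketch.

```python
from typing import Sequence


def total_pil(components: Sequence[float]) -> float:
    """Average the five PIL components m1..m5 and scale to a percentage.

    Assumption: each component is a value in [0, 1].
    """
    if len(components) != 5:
        raise ValueError("expected exactly five PIL components (m1..m5)")
    return 100.0 * sum(components) / 5.0


def pdr(interval_disclosure: float, distance_linkage_disclosure: float) -> float:
    """Equal-weight combination of ID and DLD.

    Assumption: both inputs are percentages in [0, 100].
    """
    return 0.5 * interval_disclosure + 0.5 * distance_linkage_disclosure


if __name__ == "__main__":
    # Illustrative inputs only; these are not values from the paper.
    print(total_pil([0.10, 0.20, 0.30, 0.40, 0.50]))
    print(pdr(10.0, 20.0))
```

Under these assumptions, unperturbed data (all five components equal to 0, with ID and DLD both at 100) yields a PIL of 0 and a PDR of 100, which is consistent with the k=0 row reported in Table 4.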
The execution time and memory requirements of PDE, RDT and our proposed method were compared on the three datasets. Table 2 shows that our method required a longer execution time than PDE and RDT. This was because our method takes time to calculate the element values and to search the corresponding turtle shells of the reference matrix during perturbation. However, both PDE and RDT are row-based, whereas our method is column-based and only perturbs two values each time; thus, our method requires less memory than PDE and RDT, as shown in Table 3.

Table 2. Comparison of execution time among three methods.

| Datasets | PDE (ms) | RDT (ms) | Our method (ms) |
|---|---|---|---|
| Breast | 16 | 16 | 1618 |
| Abalone | 218 | 218 | 3454 |
| KDD Cup | 2 390 219 | 2 390 219 | 46 502 000 |

Table 3. Comparison of memory requirement among three methods.

| Datasets | PDE (MB) | RDT (MB) | Our method (MB) |
|---|---|---|---|
| Breast | 10.86 | 31.47 | 10.54 |
| Abalone | 62.14 | 63.96 | 48.56 |
| KDD Cup | 574.14 | 616.21 | 453.17 |

Finally, we compared the performance of the proposed method under different degrees of perturbation.
The results for the Breast dataset are presented in Table 4, and the results for the other datasets have similar characteristics. When k (the degree of perturbation) is equal to 0, the original data are not perturbed. Thus, the PIL value is 0, which means no loss of information, and the PDR value is 100, which means the greatest risk of privacy disclosure. In Table 4, the classification accuracies for the different degrees of perturbation were very close to the original accuracy. As k increased, the perturbation became slighter: the corresponding PIL value decreased while the PDR value increased. This means that our proposed method can adjust the degree of data perturbation according to different requirements.

Table 4. Comparison of three measurements among different degrees of perturbation of the Breast dataset.

| Degree of perturbation | Accuracy (%) | PIL (%) | PDR (%) |
|---|---|---|---|
| k=0 | 94.54 | 0 | 100 |
| k=1 | 93.75 | 12.98 | 11.43 |
| k=2 | 94.23 | 12.36 | 13.08 |
| k=3 | 94.29 | 9.52 | 14.21 |

6. CONCLUSIONS

In this paper, we proposed a method for data transformation based on an improved turtle shell algorithm of data hiding. First, a reference matrix was generated. Then, according to the reference matrix, pairs of values of the original data were transformed to their closest neighbors, which protected the private information and avoided generating abnormal records.
At the same time, a message authentication code was hidden in the reference matrix and was used to verify the integrity and authenticity of the perturbed data. The experimental results demonstrated that the proposed method achieved the purpose of data perturbation and could be efficiently applied to privacy-preserving data mining to balance the trade-off between privacy preservation and data utility. Future work can investigate other data hiding techniques to achieve a higher level of privacy preservation and to reduce the execution time of perturbation. In addition, techniques for preserving the information of categorical attributes are under development.

FUNDING

The Academic and Technological Leadership Training Foundation of Sichuan Province, China (WZ0100112371601/004 and YH1500411031402).

ACKNOWLEDGEMENT

We would like to thank the anonymous reviewers for their helpful comments.

REFERENCES

[1] Han, J.W., Micheline, K. and Pei, J. (2011) Data Mining: Concepts and Techniques. Morgan Kaufmann, Waltham.
[2] Standards for the privacy of individually identifiable health information. https://www.hhs.gov/hipaa/index.html. Accessed 3 July 2017.
[3] Agrawal, R. and Srikant, R. (2000) Privacy-preserving data mining. ACM SIGMOD Rec., 29, 439–450.
[4] Aggarawal, C.C. and Yu, P.S. (2008) Privacy-Preserving Data Mining: Models and Algorithms. Springer, New York.
[5] Xu, L., Jiang, C.X., Wang, J., Yuan, J. and Ren, Y. (2014) Information security in big data: privacy and data mining. IEEE Access, 2, 1149–1176.
[6] Sweeney, L. (2002) k-anonymity: a model for protecting privacy. Int. J. Unc. Fuzz. Knowl. Based Syst., 10, 557–570.
[7] LeFevre, K., DeWitt, D.J. and Ramakrishnan, R. (2005) Incognito: Efficient Full-Domain k-Anonymity. Proc. 2005 ACM SIGMOD Int. Conf. Management of Data, Baltimore, Maryland, 14–16 June, pp. 49–60. ACM, New York.
[8] Inan, A., Kantarcioglu, M. and Bertino, E. (2009) Using Anonymized Data for Classification. Proc. ICDE 09, Shanghai, China, 29 March–2 April, pp. 429–440. IEEE, New York.
[9] Dalenius, T. and Reiss, S.P. (1982) Data-swapping: a technique for disclosure control. J. Stat. Plan. Inference, 6, 73–85.
[10] Reiss, S.P. (1984) Practical data-swapping: the first steps. ACM Trans. Database Syst., 9, 20–37.
[11] Moore, J. and Richard, A. (1996) Controlled data-swapping techniques for masking public use microdata sets. Technical report. U.S. Bureau of the Census, Statistical Research Division Report rr96/04.
[12] Lin, K.P., Chang, Y.W. and Chen, M.S. (2015) Secure support vector machines outsourcing with random linear transformation. Knowl. Inf. Syst., 44, 147–176.
[13] Lin, K.P. (2016) Privacy-preserving kernel k-means clustering outsourcing with random transformation. Knowl. Inf. Syst., 49, 885–908.
[14] Chen, T.S., Lee, W.B., Chen, J., Kao, Y.H. and Hou, P.W. (2013) Reversible privacy preserving data mining: a combination of difference expansion and privacy preserving. J. Supercomput., 66, 907–917.
[15] Lin, C.Y. (2016) A reversible data transform algorithm using integer transform for privacy-preserving data mining. J. Syst. Softw., 117, 104–112.
[16] Tian, J. (2003) Reversible data embedding using a difference expansion. IEEE Trans. Circuits Syst. Video Technol., 13, 890–896.
[17] Peng, F., Li, X.L. and Yang, B. (2012) Adaptive reversible data hiding scheme based on integer transform. Signal Process., 92, 54–62.
[18] Chang, C.C., Liu, Y.J. and Nguyen, T.S. (2014) A Novel Turtle Shell Based Scheme for Data Hiding. Proc. 10th Int. Conf. Intelligent Information Hiding and Multimedia Signal Processing, Kitakyushu, Japan, 27–29 August, pp. 89–93. IEEE, New York.
[19] Pinkas, B. (2002) Cryptographic techniques for privacy-preserving data mining. ACM SIGKDD Explor. Newsl., 4, 12–19.
[20] Domingo-Ferrer, J. and Torra, V. (2001) A quantitative comparison of disclosure control methods for microdata. Confidentiality Disclosure Data Access, 1, 111–134.
[21] Domingo-Ferrer, J., Mateo-Sanz, J.M. and Torra, V. (2001) Comparing SDC Methods for Microdata on the Basis of Information Loss and Disclosure Risk. Proc. ETK and NTTS 2001, Luxemburg, pp. 807–826.
[22] Rivest, R.L. (1992) RFC 1321: The MD5 Message-Digest Algorithm. http://www.ietf.org/rfc/rfc1321.txt. Accessed 3 July 2017.
[23] Deepakumara, J., Heys, H.M. and Venkatesan, R. (2001) FPGA implementation of MD5 hash algorithm. Proceedings of the Canadian Conference on Electrical and Computer Engineering, Toronto, Canada, 13–16 May, pp. 919–924. IEEE, New York.
[24] UCI Machine Learning Repository. http://archive.ics.uci.edu/ml/. Accessed 3 July 2017.
[25] Witten, I.H., Frank, E., Hall, M.A. and Pal, C.J. (2016) Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Cambridge.

Author notes

Handling editor: Steven Furnell

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

The Computer Journal – Oxford University Press

**Published: ** Aug 1, 2018
