Issue No.01 - January (2012 vol.61)

pp: 101-117

Yingpeng Sang, University of Adelaide, Adelaide

Hong Shen, University of Adelaide, Adelaide

Hui Tian, Beijing Jiaotong University, Beijing

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TC.2011.83

ABSTRACT

Random Projection (RP) has attracted considerable attention in the privacy-preserving data mining research community because of its high efficiency and utility, e.g., it approximately preserves the Euclidean distances among the data points. It was shown in [33] that, if the original data set composed of m attributes is multiplied by a $k \times m$ ($m > k$) mixing matrix that is random and orthogonal on expectation, then the k series of perturbed data can be released for mining purposes. To our knowledge, little work has been done on reconstructing the original data, given the data perturbed by RP and some necessary prior knowledge, in order to recover sensitive information. In this paper, we choose several typical scenarios in data mining with different assumptions on prior knowledge. For the cases in which an attacker has full or zero knowledge of the mixing matrix R, respectively, we propose reconstruction methods based on Underdetermined Independent Component Analysis (UICA) if the attributes of the original data are mutually independent and sparse, and reconstruction methods based on Maximum A Posteriori (MAP) if the attributes of the original data are correlated and nonsparse. Simulation results show that our reconstructions achieve high recovery rates and outperform reconstructions based on Principal Component Analysis (PCA). A successful reconstruction essentially means a leakage of privacy, so our work identifies the possible risks of RP when it is used for data perturbation.
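
As a rough illustration of the perturbation described in the abstract, the following Python sketch multiplies an m x n data matrix by a k x m Gaussian mixing matrix and compares one pairwise squared distance before and after. The setup is an assumption made for illustration and is not taken from the paper: records are stored as columns, the entries of R are drawn i.i.d. from N(0, 1/k) so that E[RᵀR] = I (one common way to obtain a matrix that is orthogonal on expectation), and the function name `random_projection` and all sizes are hypothetical.

```python
import numpy as np


def random_projection(X, k, rng):
    """Perturb an m x n data matrix X (one record per column).

    Hypothetical helper: R has i.i.d. N(0, 1/k) entries, so that
    E[R.T @ R] equals the m x m identity (orthogonal on expectation).
    """
    m = X.shape[0]
    R = rng.normal(loc=0.0, scale=1.0 / np.sqrt(k), size=(k, m))
    return R, R @ X


rng = np.random.default_rng(0)
m, k, n = 1000, 400, 20          # m attributes, k < m projected series, n records
X = rng.standard_normal((m, n))  # synthetic stand-in for the original data

R, Y = random_projection(X, k, rng)

# Squared Euclidean distance between records 0 and 1, before and after RP.
d_orig = np.sum((X[:, 0] - X[:, 1]) ** 2)
d_proj = np.sum((Y[:, 0] - Y[:, 1]) ** 2)
ratio = d_proj / d_orig

print(f"projected / original squared distance: {ratio:.3f}")
```

Under these assumptions the ratio has expectation 1 and variance 2/k, so it concentrates around 1 as k grows. This is the distance-preserving utility of RP; by itself it says nothing about how hard the original data are to reconstruct, which is the question the paper addresses.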

INDEX TERMS

Privacy-preserving data mining, data perturbation, data reconstruction, underdetermined independent component analysis, Maximum A Posteriori, principal component analysis.

CITATION

Yingpeng Sang, Hong Shen, Hui Tian, "Effective Reconstruction of Data Perturbed by Random Projections," *IEEE Transactions on Computers*, vol. 61, no. 1, pp. 101-117, January 2012, doi:10.1109/TC.2011.83

REFERENCES

- [1] N. Adam and J. Worthmann, "Security-Control Methods for Statistical Databases: A Comparative Study," ACM Computing Surveys, vol. 21, no. 4, pp. 515-556, 1989.
- [2] Privacy-Preserving Data Mining: Models and Algorithms, C. Aggarwal and P.S. Yu, eds. Springer, 2008.
- [3] D. Agrawal and C. Aggarwal, "On the Design and Quantification of Privacy Preserving Data Mining Algorithms," Proc. 20th ACM SIGMOD-SIGACT-SIGART Symp. Principles of Database Systems (PODS), pp. 247-255, 2001.
- [4] R. Agrawal and R. Srikant, "Privacy-Preserving Data Mining," Proc. 2000 ACM SIGMOD Conf. Management of Data, pp. 439-450, 2000.
- [5] S. Agrawal and J.R. Haritsa, "A Framework for High-Accuracy Privacy-Preserving Mining," Proc. 21st Int'l Conf. Data Eng. (ICDE '05), pp. 193-204, 2005.
- [6] M. Atallah, E. Bertino, A. Elmagarmid, M. Ibrahim, and V. Verykios, "Disclosure Limitation of Sensitive Rules," Proc. Workshop Knowledge and Data Eng. Exchange (KDEX '99), pp. 45-52, 1999.
- [7] P. Bofill and M. Zibulevsky, "Underdetermined Blind Source Separation Using Sparse Representations," Signal Processing, vol. 81, no. 11, pp. 2353-2362, 2001.
- [8] X. Cao and R. Liu, "General Approach to Blind Source Separation," IEEE Trans. Signal Processing, vol. 44, no. 3, pp. 562-571, Mar. 1996.
- [9] K. Chen, G. Sun, and L. Liu, "Towards Attack-Resilient Geometric Data Perturbation," Proc. SIAM Int'l Conf. Data Mining (SDM '07), Apr. 2007.
- [10] S.S. Chen, D.L. Donoho, and M.A. Saunders, "Atomic Decomposition by Basis Pursuit," SIAM Rev., vol. 43, no. 1, pp. 129-159, 2001.
- [11] R. Cramer, I. Damgard, and J. Nielsen, "Multiparty Computation from Threshold Homomorphic Encryption," EUROCRYPT '01: Proc. Int'l Conf. the Theory and Application of Cryptographic Techniques: Advances in Cryptology, pp. 280-300, 2001.
- [12] T. Dalenius and S.P. Reiss, "Data-Swapping: A Technique for Disclosure Control," J. Statistical Planning and Inference, vol. 6, pp. 73-85, 1982.
- [13] S. Dasgupta, D. Hsu, and N. Verma, "A Concentration Theorem for Projections," Proc. 22nd Conf. Uncertainty in Artificial Intelligence, pp. 1-17, 2006.
- [14] S. Dasgupta, "Learning Mixtures of Gaussians," Proc. 40th Ann. IEEE Symp. Foundations of Computer Science (FOCS), pp. 634-644, 1999.
- [15] W. Du and Z. Zhan, "Building Decision Tree Classifier on Private Data," Proc. IEEE ICDM Workshop Privacy, Security and Data Mining (PSDM '02), pp. 1-8, 2002.
- [16] A. Evfimievski, J. Gehrke, and R. Srikant, "Limiting Privacy Breaches in Privacy Preserving Data Mining," Proc. 22nd ACM Symp. Principles of Database Systems (PODS '03), pp. 211-222, 2003.
- [17] S.E. Fienberg and J. McIntyre, "Data Swapping: Variations on a Theme by Dalenius and Reiss," Proc. Privacy in Statistical Databases, pp. 14-29, 2004.
- [18] O. Goldreich, Foundations of Cryptography: Volume 2, Basic Applications. Cambridge Univ. Press, 2004.
- [19] A. Gretton, K. Fukumizu, C. Teo, L. Song, B. Scholkopf, and A. Smola, "A Kernel Statistical Test of Independence," Advances in Neural Information Processing Systems, pp. 585-592, MIT Press, 2007.
- [20] S. Guo and X. Wu, "Deriving Private Information from Arbitrarily Projected Data," Proc. 11th Pacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD '07), May 2007.
- [21] J.A. Halderman, S.D. Schoen, N. Heninger, W. Clarkson, W. Paul, J.A. Calandrino, A.J. Feldman, J. Appelbaum, and E.W. Felten, "Lest We Remember: Cold Boot Attacks on Encryption Keys," Proc. 17th USENIX Security Symp., pp. 45-60, 2008.
- [22] Z. Huang, W. Du, and B. Chen, "Deriving Private Information from Randomized Data," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 37-48, 2005.
- [23] A. Hyvärinen and E. Oja, "Independent Component Analysis: Algorithms and Applications," Neural Networks, vol. 13, pp. 411-430, 2000.
- [24] S. Jha, L. Kruger, and P. McDaniel, "Privacy Preserving Clustering," Proc. 10th European Symp. Research in Computer Security (ESORICS), pp. 397-417, 2005.
- [25] M. Kantarcioglu and C. Clifton, "Privacy-Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data," IEEE Trans. Knowledge and Data Eng., vol. 16, no. 9, pp. 1026-1037, Sept. 2004.
- [26] A. Kankainen and N. Ushakov, "A Consistent Modification of a Test for Independence Based on the Empirical Characteristic Function," J. Math. Sciences, vol. 89, no. 5, pp. 1-10, 1998.
- [27] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar, "On the Privacy Preserving Properties of Random Data Perturbation Techniques," Proc. Third IEEE Int'l Conf. Data Mining (ICDM '03), pp. 99-106, 2003.
- [28] S. Kotz, T.J. Kozubowski, and K. Podgórski, The Laplace Distribution and Generalizations: A Revisit with Applications to Communications, Economics, Engineering, and Finance. Birkhäuser, 2001.
- [29] E. Lefons, A. Silvestri, and F. Tangorra, "An Analytic Approach to Statistical Databases," Proc. Ninth Int'l Conf. Very Large Data Bases (VLDB), 1983.
- [30] N. Li, T. Li, and S. Venkatasubramanian, "T-Closeness: Privacy Beyond K-Anonymity and L-Diversity," Proc. IEEE 23rd Int'l Conf. Data Eng. (ICDE '07), pp. 106-115, 2007.
- [31] C.K. Liew, U.J. Choi, and C.J. Liew, "A Data Distortion by Probability Distribution," ACM Trans. Database Systems, vol. 10, no. 3, pp. 395-411, 1985.
- [32] Y. Lindell and B. Pinkas, "Privacy Preserving Data Mining," Proc. Advances in Cryptology (CRYPTO '00), pp. 36-54, 2000.
- [33] K. Liu, H. Kargupta, and J. Ryan, "Random Projection-Based Multiplicative Data Perturbation for Privacy Preserving Distributed Data Mining," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 1, pp. 92-106, Jan. 2006.
- [34] K. Liu, C. Giannella, and H. Kargupta, "An Attacker's View of Distance Preserving Maps for Privacy Preserving Data Mining," Proc. Principles of Data Mining and Knowledge Discovery (PKDD '06), pp. 297-308, 2006.
- [35] K. Liu, "Multiplicative Data Perturbation for Privacy Preserving Data Mining," PhD thesis, Univ. of Maryland, Jan. 2007.
- [36] J. Löfberg, "YALMIP: A Toolbox for Modeling and Optimization in MATLAB," Proc. IEEE Int'l Symp. Computer Aided Control Systems Design, pp. 284-289, Sept. 2004.
- [37] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, "L-Diversity: Privacy Beyond K-Anonymity," Proc. 22nd IEEE Int'l Conf. Data Eng. (ICDE '06), p. 24, 2006.
- [38] K.V. Mardia, "Measures of Multivariate Skewness and Kurtosis with Applications," Biometrika, vol. 57, no. 3, pp. 519-530, 1970.
- [39] C.J. Mecklin and D.J. Mundfrom, "An Appraisal and Bibliography of Tests for Multivariate Normality," Int'l Statistical Rev., vol. 72, no. 1, pp. 123-138, 2004.
- [40] P.D. O'Grady, B.A. Pearlmutter, and S.T. Rickard, "Survey of Sparse and Non-Sparse Methods in Source Separation," Int'l J. Imaging Systems and Technology, vol. 15, no. 1, pp. 18-33, 2005.
- [41] S.R.M. Oliveira and O.R. Zaïane, "A Privacy-Preserving Clustering Approach Toward Secure and Effective Data Analysis for Business Collaboration," Computers and Security, vol. 26, no. 1, pp. 81-93, 2007.
- [42] K.B. Peterson and M.S. Pederson, "The Matrix Cookbook," Version:, http://matrixcookbook.com/, Nov. 2008.
- [43] S. Rizvi and J. Haritsa, "Maintaining Data Privacy in Association Rule Mining," Proc. 28th Int'l Conf. Very Large Databases (VLDB), Aug. 2002.
- [44] Y. Sang, H. Shen, and H. Tian, "Reconstructing Data Perturbed by Random Projections when the Mixing Matrix Is Known," Proc. European Conf. Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), pp. 334-349, Sept. 2009.
- [45] Y. Sang, H. Shen, and H. Tian, "Privacy Preserving Tuple Matching in Distributed Database," IEEE Trans. Knowledge and Data Eng., vol. 21, no. 12, pp. 1767-1782, Dec. 2009.
- [46] Y. Sang and H. Shen, "Efficient and Secure Protocols for Privacy Preserving Set Operations," ACM Trans. Information and System Security, vol. 13, no. 1, 2009.
- [47] Y. Saygin, V.S. Verykios, and C. Clifton, "Using Unknowns to Prevent Discovery of Association Rules," ACM SIGMOD Record, vol. 30, no. 4, pp. 45-54, 2001.
- [48] L. Sweeney, "K-Anonymity: A Model for Protecting Privacy," Int'l J. Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 5, pp. 557-570, 2002.
- [49] F.J. Theis, E.W. Lang, and C.G. Puntonet, "A Geometric Algorithm for Overcomplete Linear ICA," Neurocomputing, vol. 56, pp. 381-398, 2004.
- [50] E.O. Turgay, T.B. Pedersen, Y. Saygin, E. Savas, and A. Levi, "Disclosure Risks of Distance Preserving Data Transformations," Proc. 20th Int'l Conf. Scientific and Statistical Database Management (SSDBM '08), pp. 79-94, 2008.
- [51] V. Verykios, A. Elmagarmid, B. Elisa, D. Elena, Y. Saygin, and E. Dasseni, "Association Rule Hiding," IEEE Trans. Knowledge and Data Eng., vol. 16, no. 4, pp. 434-447, Apr. 2004.
- [52] Z. Yang, S. Zhong, and R.N. Wright, "Privacy-Preserving Classification of Customer Data without Loss of Accuracy," Proc. SIAM Int'l Conf. Data Mining (SDM), 2005.
- [53] M. Zibulevsky and B.A. Pearlmutter, "Blind Source Separation by Sparse Decomposition in a Signal Dictionary," Neural Computation, vol. 13, no. 4, pp. 863-882, 2001.