Issue No.02 - February (2011 vol.23)
pp: 161-174
Gabriel Ghinita , Purdue University, West Lafayette
Panos Kalnis , King Abdullah University of Science and Technology (KAUST), Jeddah
Yufei Tao , Chinese University of Hong Kong, Hong Kong
ABSTRACT
Existing research on privacy-preserving data publishing focuses on relational data: in this context, the objective is to enforce privacy-preserving paradigms, such as k-anonymity and ℓ-diversity, while minimizing the information loss incurred in the anonymizing process (i.e., maximizing data utility). Existing techniques work well for fixed-schema data with low dimensionality. Nevertheless, certain applications require privacy-preserving publishing of transactional data (or basket data), which involve hundreds or even thousands of dimensions, rendering existing methods unusable. We propose two categories of novel anonymization methods for sparse high-dimensional data. The first category is based on approximate nearest-neighbor (NN) search in high-dimensional spaces, which is efficiently performed through locality-sensitive hashing (LSH). In the second category, we propose two data transformations that capture the correlation in the underlying data: 1) reduction to a band matrix and 2) Gray encoding-based sorting. These representations facilitate the formation of anonymized groups with low information loss, through an efficient linear-time heuristic. We show experimentally, using real-life data sets, that all our methods clearly outperform the existing state of the art. Among the proposed techniques, NN-search yields superior data utility compared to the band matrix transformation, but incurs higher computational overhead. The data transformation based on Gray code sorting performs best in terms of both data utility and execution time.
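To illustrate the idea behind Gray encoding-based sorting, the following minimal sketch orders binary transaction vectors by their position in the reflected Gray code sequence, so that neighboring rows tend to differ in few items. It is an illustration only, not the authors' implementation: the function names (`gray_rank`, `gray_sort`, `hamming`) and the toy data are assumptions, and the grouping heuristic and privacy checks described in the abstract are not shown.

```python
from typing import List, Sequence


def gray_rank(row: Sequence[int]) -> int:
    """Return the position of a 0/1 vector in reflected Gray-code order.

    The vector is read as a Gray codeword and decoded to its binary
    index with the prefix-XOR rule b_i = b_(i-1) XOR g_i.
    """
    rank = 0
    bit = 0
    for g in row:
        bit ^= g                   # running XOR decodes one Gray bit
        rank = (rank << 1) | bit   # append the decoded bit to the index
    return rank


def gray_sort(rows: List[Sequence[int]]) -> List[Sequence[int]]:
    """Sort transactions (rows of 0/1 item indicators) by Gray rank."""
    return sorted(rows, key=gray_rank)


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the items on which two transactions differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical toy data: 4 transactions over 4 items.
transactions = [
    [1, 0, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
]

ordered = gray_sort(transactions)
for prev, curr in zip(ordered, ordered[1:]):
    print(curr, "differs from previous row in", hamming(prev, curr), "item(s)")
```

In this toy example, rows that share most of their items end up adjacent after sorting, which is the property that makes consecutive rows natural candidates for the same anonymized group.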
INDEX TERMS
Privacy, anonymity, transactional data.
CITATION
Gabriel Ghinita, Panos Kalnis, Yufei Tao, "Anonymous Publication of Sensitive Transactional Data", IEEE Transactions on Knowledge & Data Engineering, vol.23, no. 2, pp. 161-174, February 2011, doi:10.1109/TKDE.2010.101