Issue No.03 - March (2010 vol.22)
pp: 334-347
Slava Kisilevich, University of Konstanz, Konstanz
Lior Rokach, Ben-Gurion University, Be'er-Sheva
Yuval Elovici, Ben-Gurion University, Be'er-Sheva
Bracha Shapira, Ben-Gurion University, Be'er-Sheva
Many applications that employ data mining techniques involve mining data that include private and sensitive information about the subjects. One way to enable effective data mining while preserving privacy is to anonymize the data set that includes private information about subjects before it is released for data mining. One way to anonymize a data set is to manipulate its content so that the records adhere to k-anonymity. Two common manipulation techniques used to achieve k-anonymity of a data set are generalization and suppression. Generalization refers to replacing a value with a less specific but semantically consistent value, while suppression refers to not releasing a value at all. Generalization is more commonly applied in this domain, since suppression may dramatically reduce the quality of the data mining results if not properly used. However, generalization has a major drawback: it requires a manually generated domain hierarchy taxonomy for every quasi-identifier in the data set on which k-anonymity has to be performed. In this paper, we propose a new method for achieving k-anonymity named K-anonymity of Classification Trees Using Suppression (kACTUS). In kACTUS, efficient multidimensional suppression is performed, i.e., values are suppressed only on certain records depending on other attribute values, without the need for manually produced domain hierarchy trees. Thus, in kACTUS, we identify attributes that have less influence on the classification of the data records and suppress them if needed in order to comply with k-anonymity. The kACTUS method was evaluated on 10 separate data sets to compare its accuracy with that of other generalization- and suppression-based k-anonymity methods. Encouraging results suggest that kACTUS' predictive performance is better than that of existing k-anonymity algorithms.
Specifically, on average, the accuracies of TDS, TDR, and kADET are lower than that of kACTUS by 3.5, 3.3, and 1.9 percent, respectively, despite their use of manually defined domain trees. The accuracy gap increases to 5.3, 4.3, and 3.1 percent, respectively, when no domain trees are used.
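The notions of k-anonymity and record-level suppression described above can be illustrated with a minimal Python sketch. This is not the kACTUS algorithm: the function names, the `*` marker, the toy records, and the hard-coded choice of which attribute to suppress are all assumptions made for illustration. In kACTUS, that choice is guided by how little an attribute influences classification.

```python
from collections import Counter

SUPPRESSED = "*"  # hypothetical marker for a value that is not released


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[a] for a in quasi_identifiers) for r in records)


def is_k_anonymous(records, quasi_identifiers, k):
    """A data set is k-anonymous if every such combination occurs at least k times."""
    return all(n >= k for n in group_sizes(records, quasi_identifiers).values())


def suppress_in_small_groups(records, quasi_identifiers, attribute, k):
    """Suppress `attribute` only on records whose group is smaller than k.

    Other records keep their original value, so the suppression depends on
    the other attribute values of each record rather than on the whole column.
    """
    sizes = group_sizes(records, quasi_identifiers)
    masked = []
    for r in records:
        key = tuple(r[a] for a in quasi_identifiers)
        copy = dict(r)
        if sizes[key] < k:
            copy[attribute] = SUPPRESSED
        masked.append(copy)
    return masked


# Toy data (invented for illustration only).
records = [
    {"age": "30-39", "zip": "78464", "label": "yes"},
    {"age": "30-39", "zip": "78464", "label": "no"},
    {"age": "30-39", "zip": "78465", "label": "yes"},
    {"age": "30-39", "zip": "78467", "label": "yes"},
    {"age": "40-49", "zip": "84105", "label": "no"},
    {"age": "40-49", "zip": "84105", "label": "no"},
]
quasi_identifiers = ("age", "zip")

masked = suppress_in_small_groups(records, quasi_identifiers, "zip", k=2)
print(is_k_anonymous(records, quasi_identifiers, 2))
print(is_k_anonymous(masked, quasi_identifiers, 2))
```

In this toy case, only the third and fourth records lose their `zip` value; a single pass like this is not guaranteed to reach k-anonymity on arbitrary data, which is one reason a real method needs a more careful search.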
Privacy-preserving data mining, k-anonymity, deidentified data, decision trees.
Slava Kisilevich, Lior Rokach, Yuval Elovici, Bracha Shapira, "Efficient Multidimensional Suppression for K-Anonymity", IEEE Transactions on Knowledge & Data Engineering, vol.22, no. 3, pp. 334-347, March 2010, doi:10.1109/TKDE.2009.91