Bibliographic References
Privacy: A Machine Learning View
August 2004 (vol. 16 no. 8)
pp. 939-948

Abstract—The problem of disseminating a data set for machine learning while controlling the disclosure of data source identity is described using a commuting diagram of functions. This formalization is used to present and analyze an optimization problem balancing privacy and data utility requirements. The analysis points to the application of a generalization mechanism for maintaining privacy in view of machine learning needs. We present new proofs of NP-hardness of the problem of minimizing information loss while satisfying a set of privacy requirements, both with and without the addition of a particular uniform coding requirement. As an initial analysis of the approximation properties of the problem, we show that the cell suppression problem with a constant number of attributes can be approximated within a constant. As a side effect, proofs of NP-hardness of the minimum k-union, maximum k-intersection, and parallel versions of these are presented. Bounded versions of these problems are also shown to be approximable within a constant.
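The privacy requirement discussed in the abstract can be illustrated by the k-anonymity condition enforced through suppression: every combination of quasi-identifier values must occur at least k times in the released data. The sketch below is a minimal illustration only, not the paper's algorithm; the paper shows that minimizing the resulting information loss is NP-hard, so any simple greedy scheme like this one is a heuristic.

```python
from collections import Counter

def is_k_anonymous(rows, k):
    """True if every quasi-identifier tuple occurs at least k times."""
    counts = Counter(map(tuple, rows))
    return all(c >= k for c in counts.values())

def suppress_to_k_anonymity(rows, k):
    """Coarse heuristic: suppress whole attribute columns (replace
    values with '*') left to right until the table is k-anonymous.
    Illustrative only; optimal suppression is NP-hard per the paper."""
    rows = [list(r) for r in rows]
    for col in range(len(rows[0])):
        if is_k_anonymous(rows, k):
            break
        for r in rows:
            r[col] = '*'  # suppress this entire column
    return rows

# Hypothetical quasi-identifiers: (ZIP code, sex)
table = [("02139", "M"), ("02139", "M"), ("02142", "F")]
anonymized = suppress_to_k_anonymity(table, k=2)
```

Here the singleton tuple ("02142", "F") violates 2-anonymity, so the heuristic suppresses columns until all rows collapse into groups of size at least 2, trading utility for privacy exactly as the optimization problem formalizes.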

[1] A. Øhrn and L. Ohno-Machado, "Using Boolean Reasoning to Anonymize Databases," Artificial Intelligence in Medicine, vol. 15, no. 3, pp. 235-254, 1999.
[2] M. Fischetti and J.J. Salazar, "Models and Algorithms for the 2-Dimensional Cell Suppression Problem in Statistical Disclosure Control," Math. Programming, no. 84, pp. 283-312, 1999.
[3] L. Sweeney, "Guaranteeing Anonymity When Sharing Medical Data, the Datafly System," Proc. Am. Medical Informatics Assoc. Ann. Fall Symp., pp. 51-55, 1997.
[4] P. Samarati and L. Sweeney, "Generalizing Data to Provide Anonymity When Disclosing Information (Abstract)," Proc. 17th ACM SIGACT-SIGMOD-SIGART Symp. Principles of Database Systems, p. 188, June 1998.
[5] A.J. Hundepool and L.C.R.J. Willenborg, "Mu- and Tau-Argus: Software for Statistical Disclosure Control," Proc. Third Int'l Seminar on Statistical Confidentiality, Bled, 1996.
[6] P. Kooiman, L. Willenborg, and J. Gouweleeuw, "PRAM: A Method for Disclosure Limitation of Microdata," Technical Report RSM-80330, Statistics Netherlands, 1997.
[7] M.P. Armstrong, G. Rushton, and D.L. Zimmerman, "Geographically Masking Health Data to Preserve Confidentiality," Statistics in Medicine, no. 18, pp. 497-525, 1999.
[8] D. Agrawal and C.C. Aggarwal, "On the Design and Quantification of Privacy Preserving Data Mining Algorithms," Proc. 20th ACM SIGACT-SIGMOD-SIGART Symp. Principles of Database Systems, pp. 247-255, May 2001.
[9] D.E. Denning, "Secure Statistical Databases with Random Sample Queries," ACM Trans. Database Systems, vol. 5, no. 3, pp. 291-315, 1980.
[10] A. Brodsky, C. Farkas, and S. Jajodia, "Secure Databases: Constraints, Inference Channels, and Monitoring Disclosures," IEEE Trans. Knowledge and Data Eng., vol. 12, no. 6, pp. 900-919, Nov./Dec. 2000.
[11] T.-A. Su and G. Ozsoyoglu, "Controlling FD and MVD Inferences in Multilevel Relational Database Systems," IEEE Trans. Knowledge and Data Eng., vol. 3, no. 4, pp. 474-485, Dec. 1991.
[12] R. Agrawal and R. Srikant, "Privacy-Preserving Data Mining," Proc. 2000 ACM SIGMOD Int'l Conf. Management of Data, vol. 29, no. 2, pp. 439-450, May 2000.
[13] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke, "Privacy Preserving Mining of Association Rules," Proc. Eighth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), July 2002.
[14] A.P. Felty and S. Matwin, "Privacy-Oriented Data Mining by Proof Checking," Proc. Sixth European Conf. Principles of Data Mining and Knowledge Discovery (PKDD), Aug. 2002.
[15] M. Atallah, E. Bertino, A. Elmagarmid, M. Ibrahim, and V. Verykios, "Disclosure Limitation of Sensitive Rules," Proc. Knowledge and Data Exchange Workshop, 1999.
[16] C. Clifton, "Using Sample Size to Limit Exposure to Data Mining," J. Computer Security, vol. 8, no. 4, 2000.
[17] V. Estivill-Castro and L. Brankovic, "Data Swapping: Balancing Privacy against Precision in Mining for Logic Rules," Proc. Conf. Data Warehousing and Knowledge Discovery (DaWaK-99), pp. 389-398, 1999.
[18] V.S. Iyengar, "Transforming Data to Satisfy Privacy Constraints," Proc. Eighth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 279-287, July 2002.
[19] P. Samarati and L. Sweeney, "Protecting Privacy When Disclosing Information: k-Anonymity and Its Enforcement through Generalization and Suppression," Proc. IEEE Symp. Research in Security and Privacy, May 1998.
[20] M. Kantarcioglu and J. Vaidya, "An Architecture for Privacy-Preserving Mining of Client Information," Proc. IEEE Int'l Conf. Data Mining Workshop on Privacy, Security, and Data Mining, vol. 14, pp. 37-42, Dec. 2002.
[21] J. Vaidya and C. Clifton, "Privacy Preserving Association Rule Mining in Vertically Partitioned Data," Proc. Eighth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 639-644, July 2002.
[22] Nat'l Center for Biotechnology Information, Nat'l Library of Medicine, Entrez-PubMed, http://www.ncbi.nlm.nih.gov/entrez/, 2004.
[23] L.H. Cox, "Suppression Methodology and Statistical Disclosure Control," J. Am. Statistical Assoc., vol. 75, no. 370, pp. 377-385, 1980.
[24] A.G. De Waal and L.C.R.J. Willenborg, "Global Recodings and Local Suppressions in Microdata Sets," technical report, Statistics Netherlands, Voorburg, 1995.
[25] L. Ohno-Machado, S. Vinterbo, and S. Dreiseitl, "Effects of Data Anonymization by Cell Suppression on Descriptive Statistics and Predictive Modeling Performance," Proc. Am. Medical Informatics Assoc. Symp., J. Am. Medical Informatics Assoc., vol. 8, pp. 503-508, 2001.
[26] W.E. Winkler, "Using Simulated Annealing for k-Anonymity," Technical Report Statistics #2002-07, Statistical Research Division, US Bureau of the Census, 2002.
[27] A. Skowron and C. Rauszer, "The Discernibility Matrices and Functions in Information Systems," Series D: System Theory, Knowledge Eng. and Problem Solving, vol. 11, chap. III-2, pp. 331-362, Dordrecht, The Netherlands: Kluwer Academic Publishers, 1992.
[28] M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. New York: W.H. Freeman and Company, 1979.
[29] G. Ausiello, P. Crescenzi, G. Gambosi, V. Kann, A. Marchetti-Spaccamela, and M. Protasi, Complexity and Approximation: Combinatorial Optimization Problems and Their Approximability Properties. Springer-Verlag, 1999.
[30] US Census Bureau, Geographic Areas Reference Manual, Jan. 2003.
[31] R. Schalkoff, Pattern Recognition: Statistical, Structural and Neural Approaches. John Wiley & Sons, Inc., 1992.
[32] J. Rissanen, "Modeling by the Shortest Data Description," Automatica, vol. 14, pp. 465-471, 1978.
[33] M. Li and P. Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications. Springer-Verlag, 1997.
[34] J.R. Quinlan, "Induction of Decision Trees," Machine Learning, vol. 1, pp. 81-106, 1986.
[35] Z. Pawlak, "Rough Sets," Int'l J. Information and Computer Science, vol. 11, 1982.

Index Terms:
Privacy, disclosure control, combinatorial optimization, complexity, approximation properties, machine learning.
Staal A. Vinterbo, "Privacy: A Machine Learning View," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 8, pp. 939-948, Aug. 2004, doi:10.1109/TKDE.2004.31