Issue No. 3 - March 2010 (vol. 22), pp. 437-446
Thomas A. Lasko, Google, Inc., Mountain View
Staal A. Vinterbo, Brigham and Women's Hospital, Boston
ABSTRACT
The goal of data anonymization is to allow the release of scientifically useful data in a form that protects the privacy of its subjects. This requires more than simply removing personal identifiers from the data because an attacker can still use auxiliary information to infer sensitive individual information. Additional perturbation is necessary to prevent these inferences, and the challenge is to perturb the data in a way that preserves its analytic utility. No existing anonymization algorithm provides both perfect privacy protection and perfect analytic utility. We make the new observation that anonymization algorithms are not required to operate in the original vector-space basis of the data, and many algorithms can be improved by operating in a judiciously chosen alternate basis. A spectral basis derived from the data's eigenvectors is one that can provide substantial improvement. We introduce the term spectral anonymization to refer to an algorithm that uses a spectral basis for anonymization, and give two illustrative examples. We also propose new measures of privacy protection that are more general and more informative than existing measures, and a principled reference standard with which to define adequate privacy protection.
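To make the basis-change idea concrete, the sketch below applies one conventional perturbation (simple cell swapping) in a spectral basis rather than in the original attribute basis. This is a minimal illustration under stated assumptions, not the paper's actual algorithms: the use of the SVD of the centered data matrix and the function name spectral_swap are illustrative choices. Because the spectral coordinates are mutually uncorrelated, permuting each one independently across records unlinks values from individual records while approximately preserving the data's mean and covariance structure.

```python
import numpy as np

def spectral_swap(X, seed=None):
    """Cell swapping in a spectral basis (illustrative sketch).

    Rows of X are records, columns are attributes. The data are moved
    into the basis of their right singular vectors, each spectral
    coordinate is permuted independently across records, and the result
    is mapped back to the original attribute basis.
    """
    rng = np.random.default_rng(seed)
    mu = X.mean(axis=0)                  # center so the basis tracks covariance
    U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    S = U * s                            # spectral coordinates of each record
    for j in range(S.shape[1]):          # independent swap per spectral column
        S[:, j] = rng.permutation(S[:, j])
    return S @ Vt + mu                   # back to the original attribute basis

# Example: the swapped data retain the original second-order structure.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 3))
Xa = spectral_swap(X, seed=1)
print(np.abs(np.cov(X.T) - np.cov(Xa.T)).max())  # small: covariance survives
```

The basis change composes with other perturbation methods (noise addition, microaggregation, and so on); the choice of basis and the choice of perturbation are independent design decisions.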
INDEX TERMS
Privacy, computational disclosure control, machine learning.
CITATION
Thomas A. Lasko, Staal A. Vinterbo, "Spectral Anonymization of Data," IEEE Transactions on Knowledge & Data Engineering, vol. 22, no. 3, pp. 437-446, March 2010, doi:10.1109/TKDE.2009.88