More Hybrid and Secure Protection of Statistical Data Sets
Sept.-Oct. 2012 (vol. 9 no. 5)
pp. 727-740
Javier Herranz, Universitat Politècnica de Catalunya, Barcelona
Jordi Nin, Universitat Politècnica de Catalunya, Barcelona
Marc Solé, Universitat Politècnica de Catalunya, Barcelona
Many methods and paradigms have been proposed and studied for protecting data sets that contain sensitive statistical information. The idea is to publish a perturbed version of the data set that does not leak confidential information but still allows users to obtain meaningful statistical values about the original data. The two main paradigms for data set protection are the classical one and the synthetic one. Recently, the possibility of combining the two into a hybrid paradigm has been considered. In this work, we first analyze the security of some synthetic and (partially) hybrid methods proposed in recent years, and we conclude that they suffer from a high interval disclosure risk. We then propose the first fully hybrid statistical disclosure control (SDC) methods; unfortunately, they also suffer from a fairly high interval disclosure risk. To mitigate this, we propose a postprocessing technique that can be applied to any data set protected with a synthetic method, with the goal of reducing its interval disclosure risk. Throughout the paper, we describe a set of experiments on reference data sets that support our claims.
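
The abstract does not define interval disclosure risk, so the following is only a rough sketch of one common rank-based reading of the idea, not the measure used in the paper. For a single numeric attribute, it builds an interval around each protected value that spans a few neighboring ranks in the protected data, and counts how often the corresponding original value falls inside it. The function name `interval_disclosure`, the parameter `p`, and the way the interval width is derived from `p` are all assumptions made for illustration.

```python
from bisect import bisect_left


def interval_disclosure(original, protected, p=0.1):
    """Fraction of records whose original value lies inside a rank-based
    interval centered on its protected value (hypothetical sketch).

    original, protected: equal-length lists of numbers, record i in one
    list corresponding to record i in the other.
    p: assumed interval width as a fraction of the number of records.
    """
    n = len(original)
    if n == 0 or n != len(protected):
        raise ValueError("inputs must be non-empty and of equal length")

    sorted_prot = sorted(protected)
    # Number of ranks to extend on each side of the protected value.
    half_width = max(1, int(p * n / 2))

    hits = 0
    for x, y in zip(original, protected):
        rank = bisect_left(sorted_prot, y)
        lo = sorted_prot[max(0, rank - half_width)]
        hi = sorted_prot[min(n - 1, rank + half_width)]
        # The record counts as disclosed if the original value is covered.
        if lo <= x <= hi:
            hits += 1
    return hits / n


if __name__ == "__main__":
    orig = list(range(1, 11))
    print(interval_disclosure(orig, orig))                  # no perturbation
    print(interval_disclosure(orig, orig[::-1], p=0.2))     # values reversed
```

Under this reading, a value close to 1 means an intruder who sees the protected value can almost always bracket the original value tightly, which is the kind of weakness the abstract attributes to the analyzed methods.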


Index Terms:
Privacy, Security, Data privacy, Couplings, Computational modeling, Databases, Generators, interval disclosure risk, Statistical data sets protection, synthetic methods, hybrid methods
Javier Herranz, Jordi Nin, Marc Solé, "More Hybrid and Secure Protection of Statistical Data Sets," IEEE Transactions on Dependable and Secure Computing, vol. 9, no. 5, pp. 727-740, Sept.-Oct. 2012, doi:10.1109/TDSC.2012.40