2015 IEEE 31st International Conference on Data Engineering (ICDE) (2015)
Seoul, South Korea
April 13, 2015 to April 17, 2015
ISBN: 978-1-4799-7964-6
pp: 1035-1046
Shengzhi Xu, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, China
Sen Su, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, China
Xiang Cheng, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, China
Zhengyi Li, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, China
Li Xiong, Math and Computer Science Department, Emory University, Atlanta, GA, USA
ABSTRACT
In this paper, we study the problem of mining frequent sequences under the rigorous differential privacy model. We explore the possibility of designing a differentially private frequent sequence mining (FSM) algorithm that achieves both high data utility and a high degree of privacy. We find that, in differentially private FSM, the amount of required noise is proportional to the number of candidate sequences. If the number of unpromising candidate sequences can be effectively reduced, the tradeoff between utility and privacy can be significantly improved. To this end, by leveraging a sampling-based candidate pruning technique, we propose a novel differentially private FSM algorithm, referred to as PFS2. The core of our algorithm is to use sample databases to further prune the candidate sequences generated based on the downward closure property. In particular, we use the noisy local support of candidate sequences in the sample databases to estimate which sequences are potentially frequent. To improve the accuracy of these private estimates, we propose a sequence shrinking method that enforces a length constraint on the sample databases. Moreover, to decrease the probability of misestimating frequent sequences as infrequent, we propose a threshold relaxation method that relaxes the user-specified threshold for the sample databases. Through formal privacy analysis, we show that our PFS2 algorithm is ε-differentially private. Extensive experiments on real datasets illustrate that our PFS2 algorithm can privately find frequent sequences with high accuracy.
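As a rough illustration of the pruning idea described in the abstract, the sketch below adds Laplace noise to each candidate's local support in a sample database and keeps only candidates whose noisy support reaches a relaxed threshold. It is not the PFS2 algorithm itself. The function names, the `relax` factor, and the noise calibration are assumptions made for illustration: each sequence is assumed to change every candidate's count by at most 1, so the noise scale is taken as the number of candidates divided by ε. Sequence shrinking and privacy-budget allocation are omitted.

```python
import random


def is_subsequence(candidate, sequence):
    """Return True if `candidate` occurs in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in candidate)


def local_support(candidate, sample_db):
    """Count the sequences in the sample database that contain `candidate`."""
    return sum(1 for seq in sample_db if is_subsequence(candidate, seq))


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def prune_candidates(candidates, sample_db, threshold, epsilon,
                     relax=0.8, rng=None):
    """Keep candidates whose noisy local support reaches relax * threshold.

    Assumption (hypothetical calibration): one sequence affects each of the
    k candidate counts by at most 1, so the L1 sensitivity is at most k and
    the Laplace scale is k / epsilon. More candidates means more noise.
    """
    if not candidates:
        return []
    rng = rng or random.Random()
    scale = len(candidates) / epsilon
    relaxed_threshold = relax * threshold  # hypothetical relaxation rule
    kept = []
    for cand in candidates:
        noisy = local_support(cand, sample_db) + laplace_noise(scale, rng)
        if noisy >= relaxed_threshold:
            kept.append(cand)
    return kept
```

Because the noise scale grows with the number of candidates, discarding unpromising candidates before the final counting step reduces the noise added to the candidates that remain, which is the tradeoff the abstract points to.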
INDEX TERMS
Databases, Privacy, Noise, Data privacy, Algorithm design and analysis, Sensitivity
CITATION
Shengzhi Xu, Sen Su, Xiang Cheng, Zhengyi Li, Li Xiong, "Differentially private frequent sequence mining via sampling-based candidate pruning", 2015 IEEE 31st International Conference on Data Engineering (ICDE), pp. 1035-1046, 2015, doi:10.1109/ICDE.2015.7113354