Issue No.04 - April (2013 vol.25)
pp: 751-763
Bin Jiang, Simon Fraser University, Burnaby
Jian Pei, Simon Fraser University, Burnaby
Yufei Tao, Chinese University of Hong Kong, Hong Kong
Xuemin Lin, The University of New South Wales, Sydney and East China Normal University, China
Clustering on uncertain data, one of the essential tasks in mining uncertain data, poses significant challenges in both modeling similarity between uncertain objects and developing efficient computational methods. Previous methods extend traditional partitioning clustering methods like $k$-means and density-based clustering methods like DBSCAN to uncertain data, and thus rely on geometric distances between objects. Such methods cannot handle uncertain objects that are geometrically indistinguishable, such as products with the same mean but very different variances in customer ratings. Surprisingly, probability distributions, which are essential characteristics of uncertain objects, have not been considered in measuring similarity between uncertain objects. In this paper, we systematically model uncertain objects in both continuous and discrete domains, where an uncertain object is modeled as a continuous and discrete random variable, respectively. We use the well-known Kullback-Leibler (KL) divergence to measure similarity between uncertain objects in both the continuous and discrete cases, and integrate it into partitioning and density-based clustering methods to cluster uncertain objects. Nevertheless, a naïve implementation is very costly. In particular, computing exact KL divergence in the continuous case is very costly or even infeasible. To tackle the problem, we estimate KL divergence in the continuous case by kernel density estimation and employ the fast Gauss transform technique to further speed up the computation. Our extensive experimental results verify the effectiveness, efficiency, and scalability of our approaches.
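To make the similarity measure concrete, the following is a minimal Python sketch of KL divergence for the two cases named in the abstract. It is an illustration under stated assumptions, not the paper's implementation: the function names (`kl_discrete`, `kl_kde_estimate`, and the helpers) are hypothetical, the data are one-dimensional, the bandwidth follows a common rule of thumb, and the kernel sums are computed by brute force rather than with the fast Gauss transform. The continuous estimator averages the log ratio of two kernel density estimates over the samples of the first object, which is one plausible sample-based form.

```python
import numpy as np


def kl_discrete(p, q, eps=1e-12):
    """D(p || q) for two discrete distributions over the same finite domain."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()  # normalize so each vector sums to 1
    q = q / q.sum()
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    # Floor q to avoid division by zero (assumption: a smoothing floor is acceptable).
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))


def rule_of_thumb_bandwidth(samples):
    """Normal-reference bandwidth for a 1-D Gaussian kernel (assumed choice)."""
    samples = np.asarray(samples, dtype=float)
    return 1.06 * samples.std(ddof=1) * len(samples) ** (-1.0 / 5.0)


def gaussian_kde_1d(samples, bandwidth):
    """Return a function evaluating a 1-D Gaussian kernel density estimate."""
    samples = np.asarray(samples, dtype=float)
    norm = len(samples) * bandwidth * np.sqrt(2.0 * np.pi)

    def density(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        # Brute-force O(len(x) * len(samples)) kernel sum; this is the step
        # that a fast Gauss transform would approximate more cheaply.
        z = (x[:, None] - samples[None, :]) / bandwidth
        return np.exp(-0.5 * z ** 2).sum(axis=1) / norm

    return density


def kl_kde_estimate(samples_p, samples_q, floor=1e-300):
    """Estimate D(P || Q) from samples as mean(log p_hat(x) - log q_hat(x)), x drawn from P."""
    x = np.asarray(samples_p, dtype=float)
    p_hat = gaussian_kde_1d(samples_p, rule_of_thumb_bandwidth(samples_p))
    q_hat = gaussian_kde_1d(samples_q, rule_of_thumb_bandwidth(samples_q))
    log_p = np.log(np.maximum(p_hat(x), floor))
    log_q = np.log(np.maximum(q_hat(x), floor))
    return float(np.mean(log_p - log_q))


# Two uncertain objects with the same mean but different variances:
# geometrically indistinguishable by their means, yet far apart in KL divergence.
rng = np.random.default_rng(0)
obj_a = rng.normal(loc=0.0, scale=1.0, size=500)
obj_b = rng.normal(loc=0.0, scale=3.0, size=500)
print("estimated D(A || B):", kl_kde_estimate(obj_a, obj_b))
print("estimated D(B || A):", kl_kde_estimate(obj_b, obj_a))
```

For reference, the closed-form KL divergence from N(0, 1) to N(0, 9) is about 0.654, so the first estimate should land in that neighborhood, with some bias from kernel smoothing. The two directions differ because KL divergence is asymmetric, which is one reason it needs care when plugged into clustering methods designed for symmetric distances.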
Probability distribution, Clustering algorithms, Cameras, Random variables, Kernel, Measurement uncertainty, Educational institutions, fast Gauss transform, Clustering, uncertain data, probabilistic distribution, density estimation
Bin Jiang, Jian Pei, Yufei Tao, Xuemin Lin, "Clustering Uncertain Data Based on Probability Distribution Similarity", IEEE Transactions on Knowledge & Data Engineering, vol.25, no. 4, pp. 751-763, April 2013, doi:10.1109/TKDE.2011.221