Issue No.06 - June (2012 vol.24)
pp: 1036-1050
Lijun Wang, Wayne State University, Detroit
Manjeet Rege, Rochester Institute of Technology, Rochester
Ming Dong, Wayne State University, Detroit
Yongsheng Ding, Donghua University, Shanghai
ABSTRACT
Traditional clustering techniques are inapplicable to problems where the relationships between data points evolve over time. The clustering algorithm must not only adapt to recent changes in the evolving data but also take the historical relationships between the data points into consideration. In this paper, we propose ECKF, a general framework for evolutionary clustering of large-scale data based on low-rank kernel matrix factorization. To the best of our knowledge, this is the first work that clusters large evolutionary data sets by combining low-rank matrix approximation methods with matrix factorization-based clustering. Because the low-rank approximation provides a compact representation of the original matrix and, in particular, a near-optimal low-rank approximation can preserve the sparsity of the original data, ECKF gains computational efficiency and is therefore applicable to large evolutionary data sets. Moreover, matrix factorization-based methods have been shown to cluster high-dimensional data effectively in text mining and multimedia data analysis. From a theoretical standpoint, we mathematically prove the convergence and correctness of ECKF and provide a detailed analysis of its computational efficiency in both time and space. Through extensive experiments on synthetic and real data sets, we show that ECKF outperforms existing methods for evolutionary clustering.
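The general idea of combining a low-rank kernel approximation with factorization-based clustering over time can be illustrated with a small sketch. This is not the ECKF algorithm itself, whose details the abstract does not give. It rests on several assumptions made here for illustration only: an RBF kernel, a truncated eigendecomposition standing in for the near-optimal low-rank approximation, a symmetric nonnegative factorization with a damped multiplicative update, and a simple weighted blend of the current and previous kernels as the temporal smoothing. All function names and the parameters `alpha`, `rank`, and `gamma` are hypothetical.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Dense RBF kernel matrix (assumed kernel, for illustration)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def low_rank_approx(K, rank):
    """Rank-limited approximation of a symmetric matrix.

    Truncated eigendecomposition is a stand-in here; it does not
    preserve sparsity the way the methods in the abstract are said to.
    """
    vals, vecs = np.linalg.eigh(K)
    idx = np.argsort(vals)[::-1][:rank]
    return (vecs[:, idx] * vals[idx]) @ vecs[:, idx].T

def symmetric_nmf(K, k, iters=200, seed=0):
    """Approximate K by H @ H.T with H >= 0 (damped multiplicative update)."""
    rng = np.random.default_rng(seed)
    H = rng.random((K.shape[0], k))
    Kp = np.maximum(K, 0.0)  # clip small negatives left by truncation
    for _ in range(iters):
        H *= 0.5 + 0.5 * (Kp @ H) / (H @ (H.T @ H) + 1e-12)
    return H

def evolutionary_step(K_now, K_prev, k, alpha=0.8, rank=10):
    """Cluster one snapshot, blending in the previous kernel if given."""
    if K_prev is None:
        K = K_now
    else:
        # alpha weights the current snapshot, (1 - alpha) the history
        K = alpha * K_now + (1.0 - alpha) * K_prev
    H = symmetric_nmf(low_rank_approx(K, rank), k)
    return H.argmax(axis=1)  # cluster label = largest factor entry

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X0 = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
    X1 = X0 + rng.normal(0, 0.05, X0.shape)  # slightly drifted snapshot
    K0, K1 = rbf_kernel(X0), rbf_kernel(X1)
    print(evolutionary_step(K0, None, k=2, rank=5))
    print(evolutionary_step(K1, K0, k=2, rank=5))
```

In this sketch a larger `alpha` makes each clustering follow the current snapshot more closely, while a smaller one favors consistency with the previous time step. The dense eigendecomposition is chosen only for brevity and would not scale to the large data sets the paper targets.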
INDEX TERMS
Clustering, low-rank matrix approximation, matrix decomposition.
CITATION
Lijun Wang, Manjeet Rege, Ming Dong, Yongsheng Ding, "Low-Rank Kernel Matrix Factorization for Large-Scale Evolutionary Clustering," IEEE Transactions on Knowledge & Data Engineering, vol. 24, no. 6, pp. 1036-1050, June 2012, doi:10.1109/TKDE.2010.258