Issue No.03 - March (2011 vol.33)
pp: 568-586
Yangqiu Song, Microsoft Research Asia, Beijing
Hongjie Bai, Google Information Technology (China) Co., Ltd., Beijing
Wen-Yen Chen, Yahoo! Inc., Sunnyvale
Edward Y. Chang, Google Research, Palo Alto
ABSTRACT
Spectral clustering algorithms have been shown to find clusters more effectively than some traditional algorithms, such as k-means. However, spectral clustering scales poorly in both memory use and computation time when the data set is large. To cluster large data sets, we investigate two representative ways of approximating the dense similarity matrix: sparsifying the matrix, and applying the Nyström method. After comparing the two, we select sparsification by retaining nearest neighbors and investigate its parallelization, distributing both memory use and computation across machines. Through an empirical study on a document data set of 193,844 instances and a photo data set of 2,121,863 instances, we show that our parallel algorithm can effectively handle large problems.
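To make the sparsification idea concrete, the following is a minimal single-machine sketch of retaining nearest neighbors in a similarity matrix. It is an illustration, not the paper's parallel implementation: the Gaussian similarity, the parameter names `t` and `sigma`, the dictionary-of-rows storage, the symmetrization rule, and the function names are all assumptions made for this example.

```python
import math

def gaussian_similarity(x, y, sigma=1.0):
    """Gaussian (RBF) similarity between two points (assumed kernel)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def sparsify_knn(points, t, sigma=1.0):
    """Return a sparse symmetric similarity matrix as {i: {j: s_ij}}.

    For each point, only the t most similar other points are retained.
    Each retained entry is mirrored so that S[i][j] == S[j][i].
    """
    n = len(points)
    sparse = {i: {} for i in range(n)}
    for i in range(n):
        sims = [(gaussian_similarity(points[i], points[j], sigma), j)
                for j in range(n) if j != i]
        # Sort by decreasing similarity; break ties by index for determinism.
        sims.sort(key=lambda sj: (-sj[0], sj[1]))
        for s, j in sims[:t]:
            sparse[i][j] = s
            sparse[j][i] = s  # symmetrize (an assumption of this sketch)
    return sparse

def count_nonzeros(sparse):
    """Number of stored off-diagonal entries."""
    return sum(len(row) for row in sparse.values())

# Two well-separated groups of three points each (toy data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
S = sparsify_knn(points, t=2)
print(count_nonzeros(S), "stored entries instead of", len(points) ** 2)
```

This sketch still computes every pairwise similarity, so only the storage shrinks, from n × n entries to roughly n × t. On the toy data, no entry connects the two groups, which is the structure a subsequent eigendecomposition step would exploit.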
INDEX TERMS
Parallel spectral clustering, distributed computing, normalized cuts, nearest neighbors, Nyström approximation.
CITATION
Yangqiu Song, Hongjie Bai, Wen-Yen Chen, Edward Y. Chang, "Parallel Spectral Clustering in Distributed Systems", IEEE Transactions on Pattern Analysis & Machine Intelligence, vol.33, no. 3, pp. 568-586, March 2011, doi:10.1109/TPAMI.2010.88