IEEE Transactions on Pattern Analysis & Machine Intelligence

Issue No.03 - March (2011 vol.33)

pp: 568-586

Wen-Yen Chen, Yahoo! Inc., Sunnyvale

Yangqiu Song, Microsoft Research Asia, Beijing

Hongjie Bai, Google Information Technology (China) Co., Ltd., Beijing

Chih-Jen Lin, National Taiwan University, Taipei

Edward Y. Chang, Google Research, Palo Alto

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TPAMI.2010.88

ABSTRACT

Spectral clustering algorithms have been shown to be more effective in finding clusters than some traditional algorithms, such as k-means. However, spectral clustering suffers from a scalability problem in both memory use and computational time when the size of a data set is large. To perform clustering on large data sets, we investigate two representative ways of approximating the dense similarity matrix. We compare one approach by sparsifying the matrix with another by the Nyström method. We then pick the strategy of sparsifying the matrix via retaining nearest neighbors and investigate its parallelization. We parallelize both memory use and computation on distributed computers. Through an empirical study on a document data set of 193,844 instances and a photo data set of 2,121,863, we show that our parallel algorithm can effectively handle large problems.
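
The sparsification strategy named in the abstract (retaining nearest neighbors) can be illustrated with a small, single-machine sketch. It is not the authors' parallel implementation. The Gaussian similarity, the parameter names `t` and `sigma`, the symmetrization rule, and the normalized formulation D^{-1/2} S D^{-1/2} are assumptions taken from standard spectral clustering practice rather than from the abstract. A dense NumPy array is used for brevity; a scalable version would store only the retained entries and distribute them across machines.

```python
import numpy as np

def knn_similarity(X, t, sigma=1.0):
    """Gaussian similarity that keeps only each point's t nearest neighbors.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    S = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(S, 0.0)
    # Exclude each point from its own neighbor list, then take the t closest.
    np.fill_diagonal(d2, np.inf)
    nn = np.argsort(d2, axis=1)[:, :t]
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), t), nn.ravel()] = True
    # Symmetrize: keep S_ij if i is a neighbor of j or j is a neighbor of i.
    mask |= mask.T
    return np.where(mask, S, 0.0)

def spectral_embedding(S, k):
    """Rows of the top-k eigenvectors of D^{-1/2} S D^{-1/2}, scaled to unit length."""
    d = S.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(d)
    M = S * inv_sqrt[:, None] * inv_sqrt[None, :]
    _, V = np.linalg.eigh(M)  # eigenvalues are returned in ascending order
    U = V[:, -k:]
    return U / np.linalg.norm(U, axis=1, keepdims=True)
```

In a full pipeline, the rows of the embedding would then be grouped with k-means to produce the final cluster labels.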

INDEX TERMS

Parallel spectral clustering, distributed computing, normalized cuts, nearest neighbors, Nyström approximation.

CITATION

Wen-Yen Chen, Yangqiu Song, Hongjie Bai, Chih-Jen Lin, Edward Y. Chang, "Parallel Spectral Clustering in Distributed Systems", *IEEE Transactions on Pattern Analysis & Machine Intelligence*, vol. 33, no. 3, pp. 568-586, March 2011, doi:10.1109/TPAMI.2010.88