Second IEEE International Conference on Data Mining (ICDM'02)
Iterative Clustering of High Dimensional Text Data Augmented by Local Search
Maebashi City, Japan
December 09-December 12
ISBN: 0-7695-1754-4
The k-means algorithm with cosine similarity, also known as the spherical k-means algorithm, is a popular method for clustering document collections. However, spherical k-means can often yield qualitatively poor results, especially when cluster sizes are small, say 25-30 documents per cluster, where it tends to get stuck at a local maximum far away from the optimal solution. In this paper, we present a local search procedure, which we call "first-variation" that refines a given clustering by incrementally moving data points between clusters, thus achieving a higher objective function value. An enhancement of first variation allows a chain of such moves in a Kernighan-Lin fashion and leads to a better local maximum. Combining the enhanced first-variation with spherical k-means yields a powerful "ping-pong" strategy that often qualitatively improves k-means clustering and is computationally efficient. We present several experimental results to high-light the improvement achieved by our proposed algorithm in clustering high-dimensional and sparse text data.
Citation:
Inderjit S. Dhillon, Yuqiang Guan, J. Kogan, "Iterative Clustering of High Dimensional Text Data Augmented by Local Search," icdm, pp.131, Second IEEE International Conference on Data Mining (ICDM'02), 2002