Issue No. 07 - July (2002 vol. 24)

ISSN: 0162-8828

pp: 881-892

ABSTRACT

<p>In k\hbox{-}{\rm{means}} clustering, we are given a set of n data points in d\hbox{-}{\rm{dimensional}} space {\bf{R}}^d and an integer k and the problem is to determine a set of k points in {\bf{R}}^d, called centers, so as to minimize the mean squared distance from each data point to its nearest center. A popular heuristic for k\hbox{-}{\rm{means}} clustering is Lloyd's algorithm. In this paper, we present a simple and efficient implementation of Lloyd's k\hbox{-}{\rm{means}} clustering algorithm, which we call the filtering algorithm. This algorithm is easy to implement, requiring a kd-tree as the only major data structure. We establish the practical efficiency of the filtering algorithm in two ways. First, we present a data-sensitive analysis of the algorithm's running time, which shows that the algorithm runs faster as the separation between clusters increases. Second, we present a number of empirical studies both on synthetically generated data and on real data sets from applications in color quantization, data compression, and image segmentation.</p>

INDEX TERMS

Pattern recognition, machine learning, data mining, k-means clustering, nearest-neighbor searching, k-d tree, computational geometry, knowledge discovery.

CITATION

Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, Angela Y. Wu, "An Efficient k-Means Clustering Algorithm: Analysis and Implementation",

*IEEE Transactions on Pattern Analysis & Machine Intelligence*, vol. 24, no. , pp. 881-892, July 2002, doi:10.1109/TPAMI.2002.1017616SEARCH