An Efficient k-Means Clustering Algorithm: Analysis and Implementation
July 2002 (vol. 24 no. 7)
pp. 881-892

In k-means clustering, we are given a set of n data points in d-dimensional space R^d and an integer k, and the problem is to determine a set of k points in R^d, called centers, so as to minimize the mean squared distance from each data point to its nearest center. A popular heuristic for k-means clustering is Lloyd's algorithm. In this paper, we present a simple and efficient implementation of Lloyd's k-means clustering algorithm, which we call the filtering algorithm. This algorithm is easy to implement, requiring a kd-tree as the only major data structure. We establish the practical efficiency of the filtering algorithm in two ways. First, we present a data-sensitive analysis of the algorithm's running time, which shows that the algorithm runs faster as the separation between clusters increases. Second, we present a number of empirical studies, both on synthetically generated data and on real data sets from applications in color quantization, data compression, and image segmentation.
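For context, the plain Lloyd's iteration described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's filtering algorithm (which accelerates the assignment step by pruning candidate centers with a kd-tree); the function name and parameters are illustrative only.

```python
import random

def lloyd_kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's iteration: repeatedly assign each point to its
    nearest center, then move each center to its cluster's centroid."""
    rng = random.Random(seed)
    # Initialize centers by sampling k distinct data points.
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: bucket each point with its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        for j, cl in enumerate(clusters):
            if cl:
                d = len(cl[0])
                centers[j] = tuple(sum(p[i] for p in cl) / len(cl)
                                   for i in range(d))
    return centers
```

Each iteration can only decrease the mean squared distortion, which is why the heuristic converges; the cost of the naive assignment step, O(nk) distance computations per iteration, is what the paper's kd-tree filtering reduces in practice.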

[1] P.K. Agarwal and C.M. Procopiuc, “Exact and Approximation Algorithms for Clustering,” Proc. Ninth Ann. ACM-SIAM Symp. Discrete Algorithms, pp. 658-667, Jan. 1998.
[2] K. Alsabti, S. Ranka, and V. Singh, “An Efficient k-means Clustering Algorithm,” Proc. First Workshop High Performance Data Mining, Mar. 1998.
[3] S. Arora, P. Raghavan, and S. Rao, “Approximation Schemes for Euclidean k-Medians and Related Problems,” Proc. 30th ACM STOC, pp. 106-113, 1998.
[4] S. Arya and D. M. Mount, “Approximate Range Searching,” Computational Geometry: Theory and Applications, vol. 17, pp. 135-163, 2000.
[5] S. Arya, D.M. Mount, N.S. Netanyahu, R. Silverman, and A.Y. Wu, “An Optimal Algorithm for Approximate Nearest Neighbor Searching in Fixed Dimensions,” J. ACM, vol. 45, no. 6, pp. 891-923, Nov. 1998.
[6] G.H. Ball and D.J. Hall, “Some Fundamental Concepts and Synthesis Procedures for Pattern Recognition Preprocessors,” Proc. Int'l Conf. Microwaves, Circuit Theory, and Information Theory, Sept. 1964.
[7] J.L. Bentley, “Multidimensional Binary Search Trees Used for Associative Searching,” Comm. ACM, vol. 18, no. 9, pp. 509-517, 1975.
[8] L. Bottou and Y. Bengio, “Convergence Properties of the k-means Algorithms,” Advances in Neural Information Processing Systems 7, G. Tesauro and D. Touretzky, eds., pp. 585-592. MIT Press, 1995.
[9] P.S. Bradley and U. Fayyad, “Refining Initial Points for K-means Clustering,” Proc. 15th Int'l Conf. Machine Learning, pp. 91-99, 1998.
[10] P.S. Bradley, U. Fayyad, and C. Reina, “Scaling Clustering Algorithms to Large Databases,” Proc. Fourth Int'l Conf. Knowledge Discovery and Data Mining, pp. 9-15, 1998.
[11] V. Capoyleas, G. Rote, and G. Woeginger, “Geometric Clusterings,” J. Algorithms, vol. 12, pp. 341-356, 1991.
[12] J.M. Coggins and A.K. Jain, “A Spatial Filtering Approach to Texture Analysis,” Pattern Recognition Letters, vol. 3, pp. 195-203, 1985.
[13] S. Dasgupta, “Learning Mixtures of Gaussians,” Proc. 40th IEEE Symp. Foundations of Computer Science, pp. 634-644, Oct. 1999.
[14] S. Dasgupta and L.J. Shulman, “A Two-Round Variant of EM for Gaussian Mixtures,” Proc. 16th Conf. Uncertainty in Artificial Intelligence (UAI-2000), pp. 152-159, June 2000.
[15] Q. Du, V. Faber, and M. Gunzburger, “Centroidal Voronoi Tessellations: Applications and Algorithms,” SIAM Rev., vol. 41, no. 4, pp. 637-676, 1999.
[16] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis. New York: John Wiley & Sons, 1973.
[17] M. Ester, H. Kriegel, and X. Xu, “A Database Interface for Clustering in Large Spatial Databases,” Proc. First Int'l Conf. Knowledge Discovery and Data Mining (KDD-95), pp. 94-99, 1995.
[18] V. Faber, “Clustering and the Continuous k-means Algorithm,” Los Alamos Science, vol. 22, pp. 138-144, 1994.
[19] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
[20] W. Feller, An Introduction to Probability Theory and Its Applications, third ed. New York: John Wiley & Sons, 1968.
[21] E. Forgy, “Cluster Analysis of Multivariate Data: Efficiency vs. Interpretability of Classification,” Biometrics, vol. 21, p. 768, 1965.
[22] K. Fukunaga, Introduction to Statistical Pattern Recognition, second ed. Academic Press, 1990.
[23] M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. New York: W.H. Freeman, 1979.
[24] A. Gersho and R.M. Gray, Vector Quantization and Signal Compression. Boston: Kluwer Academic, 1992.
[25] M. Inaba, private communication, 1997.
[26] M. Inaba, H. Imai, and N. Katoh, “Experimental Results of a Randomized Clustering Algorithm,” Proc. 12th Ann. ACM Symp. Computational Geometry, pp. C1-C2, May 1996.
[27] M. Inaba, N. Katoh, and H. Imai, “Applications of Weighted Voronoi Diagrams and Randomization to Variance-Based k-clustering,” Proc. 10th Ann. ACM Symp. Computational Geometry, pp. 332-339, June 1994.
[28] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, N.J.: Prentice Hall, 1988.
[29] A.K. Jain, R.P.W. Duin, and J. Mao, “Statistical Pattern Recognition: A Review,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4-37, Jan. 2000.
[30] A.K. Jain, M.N. Murty, and P.J. Flynn, “Data Clustering: A Review,” ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, 1999.
[31] T. Kanungo, D.M. Mount, N.S. Netanyahu, C. Piatko, R. Silverman, and A.Y. Wu, “Computing Nearest Neighbors for Moving Points and Applications to Clustering,” Proc. 10th Ann. ACM-SIAM Symp. Discrete Algorithms, pp. S931-S932, Jan. 1999.
[32] T. Kanungo, D.M. Mount, N.S. Netanyahu, C. Piatko, R. Silverman, and A.Y. Wu, “The Analysis of a Simple k-means Clustering Algorithm,” Technical Report CAR-TR-937, Center for Automation Research, Univ. of Maryland, College Park, Jan. 2000.
[33] T. Kanungo, D.M. Mount, N.S. Netanyahu, C. Piatko, R. Silverman, and A.Y. Wu, “The Analysis of a Simple k-means Clustering Algorithm,” Proc. 16th Ann. ACM Symp. Computational Geometry, pp. 100-109, June 2000.
[34] L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. New York: John Wiley&Sons, 1990.
[35] T. Kohonen, Self-Organization and Associative Memory. Berlin: Springer-Verlag, 1988.
[36] S. Kolliopoulos and S. Rao, “A Nearly Linear-Time Approximation Scheme for the Euclidean k-median Problem,” Proc. Seventh Ann. European Symp. Algorithms, J. Nesetril, ed., pp. 362-371, July 1999.
[37] S.P. Lloyd, “Least Squares Quantization in PCM,” IEEE Trans. Information Theory, vol. 28, pp. 129-137, Mar. 1982.
[38] J. MacQueen, “Some Methods for Classification and Analysis of Multivariate Observations,” Proc. Fifth Berkeley Symp. Math. Statistics and Probability, vol. 1, pp. 281-296, 1967.
[39] S. Maneewongvatana and D.M. Mount, “Analysis of Approximate Nearest Neighbor Searching with Clustered Point Sets,” Proc. Workshop Algorithm Eng. and Experiments (ALENEX '99), Jan. 1999. Available from: http://ftp.cs.umd.edu/pub/faculty/mount/Papers/dimacs99.ps.gz.
[40] O.L. Mangasarian, “Mathematical Programming in Data Mining,” Data Mining and Knowledge Discovery, vol. 42, no. 1, pp. 183-201, 1997.
[41] J. Matousek, “On Approximate Geometric k-clustering,” Discrete and Computational Geometry, vol. 24, pp. 61-84, 2000.
[42] A. Moore, “Very Fast EM-Based Mixture Model Clustering Using Multiresolution kd-Trees,” Proc. Conf. Neural Information Processing Systems, 1998.
[43] D.M. Mount and S. Arya, “ANN: A Library for Approximate Nearest Neighbor Searching,” Proc. Center for Geometric Computing Second Ann. Fall Workshop Computational Geometry, 1997. (available from http://www.cs.umd.edu/~mount/ANN/)
[44] R.T. Ng and J. Han, “Efficient and Effective Clustering Methods for Spatial Data Mining,” Proc. 20th Int'l Conf. Very Large Databases, Morgan Kaufmann, pp. 144-155, 1994.
[45] D. Pelleg and A. Moore, “Accelerating Exact k-means Algorithms with Geometric Reasoning,” Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 277-281, Aug. 1999.
[46] D. Pelleg and A. Moore, “x-means: Extending k-means with Efficient Estimation of the Number of Clusters,” Proc. 17th Int'l Conf. Machine Learning, July 2000.
[47] D. Pollard, “A Central Limit Theorem for k-means Clustering,” Annals of Probability, vol. 10, pp. 919-926, 1982.
[48] F.P. Preparata and M.I. Shamos, Computational Geometry. Springer-Verlag, 1985.
[49] S.Z. Selim and M.A. Ismail, “K-means-type Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 6, pp. 81-87, 1984.
[50] T. Zhang, R. Ramakrishnan, and M. Livny, “BIRCH: A New Data Clustering Algorithm and Its Applications,” Data Mining and Knowledge Discovery, vol. 1, no. 2, pp. 141-182, 1997.

Index Terms:
Pattern recognition, machine learning, data mining, k-means clustering, nearest-neighbor searching, k-d tree, computational geometry, knowledge discovery.
Citation:
Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, Angela Y. Wu, "An Efficient k-Means Clustering Algorithm: Analysis and Implementation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 881-892, July 2002, doi:10.1109/TPAMI.2002.1017616