An Efficient k-Means Clustering Algorithm: Analysis and Implementation
July 2002 (vol. 24 no. 7)
pp. 881-892
@article{10.1109/TPAMI.2002.1017616,
  author    = {Tapas Kanungo and David M. Mount and Nathan S. Netanyahu and Christine D. Piatko and Ruth Silverman and Angela Y. Wu},
  title     = {An Efficient k-Means Clustering Algorithm: Analysis and Implementation},
  journal   = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
  volume    = {24},
  number    = {7},
  pages     = {881-892},
  year      = {2002},
  issn      = {0162-8828},
  doi       = {10.1109/TPAMI.2002.1017616},
  publisher = {IEEE Computer Society},
  address   = {Los Alamitos, CA, USA}
}

In k-means clustering, we are given a set of n data points in d-dimensional space R^d and an integer k, and the problem is to determine a set of k points in R^d, called centers, so as to minimize the mean squared distance from each data point to its nearest center. A popular heuristic for k-means clustering is Lloyd's algorithm. In this paper, we present a simple and efficient implementation of Lloyd's k-means clustering algorithm, which we call the filtering algorithm. This algorithm is easy to implement, requiring a kd-tree as the only major data structure. We establish the practical efficiency of the filtering algorithm in two ways. First, we present a data-sensitive analysis of the algorithm's running time, which shows that the algorithm runs faster as the separation between clusters increases. Second, we present a number of empirical studies, both on synthetically generated data and on real data sets from applications in color quantization, data compression, and image segmentation.
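For readers unfamiliar with the baseline the paper accelerates, the following is a minimal sketch of Lloyd's algorithm (not the paper's filtering algorithm itself): alternate between assigning each point to its nearest center and moving each center to the centroid of its assigned points. Names, the deterministic initialization, and the tolerance are illustrative choices, not part of the paper; the expensive brute-force assignment step below is precisely what the filtering algorithm speeds up with a kd-tree.

```python
import numpy as np

def lloyd_kmeans(points, k, max_iters=100, tol=1e-6):
    """Sketch of Lloyd's algorithm (illustrative, not the paper's code).

    points: (n, d) array of data points; k: number of centers.
    Returns (centers, labels).
    """
    # Illustrative deterministic initialization: the first k points.
    centers = points[:k].copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iters):
        # Assignment step: brute-force nearest center, O(n * k) distance
        # computations per iteration -- the step the filtering algorithm
        # accelerates using a kd-tree over the data points.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the centroid of its cluster
        # (leave a center in place if no points were assigned to it).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels
```

On two well-separated blobs of points, the iteration converges quickly and recovers the blob means, which mirrors the paper's data-sensitive analysis: greater cluster separation means fewer points whose nearest center is ambiguous, and hence faster convergence.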


Index Terms:
Pattern recognition, machine learning, data mining, k-means clustering, nearest-neighbor searching, k-d tree, computational geometry, knowledge discovery.
Citation:
Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, Angela Y. Wu, "An Efficient k-Means Clustering Algorithm: Analysis and Implementation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 881-892, July 2002, doi:10.1109/TPAMI.2002.1017616