
Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, Angela Y. Wu, "An Efficient k-Means Clustering Algorithm: Analysis and Implementation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 881-892, July 2002.
In k-means clustering, we are given a set of n data points in d-dimensional space R^d and an integer k, and the problem is to determine a set of k points in R^d, called centers, so as to minimize the mean squared distance from each data point to its nearest center. A popular heuristic for k-means clustering is Lloyd's algorithm. In this paper, we present a simple and efficient implementation of Lloyd's k-means clustering algorithm, which we call the filtering algorithm. This algorithm is easy to implement, requiring a kd-tree as the only major data structure. We establish the practical efficiency of the filtering algorithm in two ways. First, we present a data-sensitive analysis of the algorithm's running time, which shows that the algorithm runs faster as the separation between clusters increases. Second, we present a number of empirical studies, both on synthetically generated data and on real data sets from applications in color quantization, data compression, and image segmentation.
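For readers unfamiliar with the baseline the paper accelerates, the following is a minimal sketch of plain Lloyd's heuristic (not the paper's kd-tree filtering algorithm): alternate between assigning every point to its nearest center and moving each center to the centroid of its assigned points. The function name and parameters here are illustrative, not from the paper.

```python
import random

def lloyd_kmeans(points, k, iters=20, seed=0):
    """A sketch of Lloyd's k-means heuristic on tuples of floats.

    Repeats two steps for a fixed number of iterations:
      1. Assignment: attach each point to its nearest center
         under squared Euclidean distance.
      2. Update: move each center to the centroid of its cluster.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize centers at k distinct data points
    for _ in range(iters):
        # Assignment step.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])),
            )
            clusters[nearest].append(p)
        # Update step: empty clusters keep their previous center.
        for j, cl in enumerate(clusters):
            if cl:
                centers[j] = tuple(sum(c) / len(cl) for c in zip(*cl))
    return centers
```

The filtering algorithm of the paper computes the same assignment step, but instead of scanning all k centers for every point individually, it walks a kd-tree over the data and prunes ("filters") candidate centers that cannot be nearest to any point in a node's cell, which is what makes it faster on well-separated clusters.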