Efficient Disk-Based K-Means Clustering for Relational Databases
August 2004 (vol. 16 no. 8)
pp. 909-921

Abstract—K-means is one of the most popular clustering algorithms. This article introduces an efficient disk-based implementation of K-means. The proposed algorithm is designed to work inside a relational database management system. It can cluster large data sets having very high dimensionality. In general, it only requires three scans over the data set. It is optimized to perform heavy disk I/O and its memory requirements are low. Its parameters are easy to set. An extensive experimental section evaluates quality of results and performance. The proposed algorithm is compared against the Standard K-means algorithm as well as the Scalable K-means algorithm.
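
As a rough illustration of why a scan-based K-means can keep memory requirements low, the sketch below performs one K-means iteration in a single sequential pass, holding only per-cluster counts and coordinate sums in memory. It is not the algorithm proposed in the article, whose details are not given in this abstract. The names `read_in_chunks`, `nearest`, and `kmeans_one_scan` are hypothetical, and an in-memory list stands in for a database table.

```python
from typing import Iterable, Iterator, List, Sequence

Point = Sequence[float]


def read_in_chunks(rows: Iterable[Point], chunk_size: int) -> Iterator[List[Point]]:
    """Yield rows in fixed-size chunks (a stand-in for a sequential table scan)."""
    chunk: List[Point] = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def nearest(point: Point, centroids: Sequence[Point]) -> int:
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    best, best_dist = 0, float("inf")
    for j, centroid in enumerate(centroids):
        dist = sum((p - q) ** 2 for p, q in zip(point, centroid))
        if dist < best_dist:
            best, best_dist = j, dist
    return best


def kmeans_one_scan(
    rows: Iterable[Point], centroids: Sequence[Point], chunk_size: int = 2
) -> List[List[float]]:
    """One K-means iteration in a single pass over rows.

    Only k counts and k * d running sums are kept, so memory use does not
    grow with the number of rows (assumption: the centroids fit in memory).
    """
    k, d = len(centroids), len(centroids[0])
    counts = [0] * k
    sums = [[0.0] * d for _ in range(k)]
    for chunk in read_in_chunks(rows, chunk_size):
        for point in chunk:
            j = nearest(point, centroids)
            counts[j] += 1
            for a in range(d):
                sums[j][a] += point[a]
    # An empty cluster keeps its previous centroid.
    return [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
        for j in range(k)
    ]
```

Calling `kmeans_one_scan` repeatedly with the centroids returned by the previous call gives the standard iterative algorithm at one scan per iteration. The abstract's figure of about three scans in total implies further techniques that this sketch does not attempt to reproduce.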

Index Terms:
Clustering, K-means, relational databases, disk.
Carlos Ordonez, Edward Omiecinski, "Efficient Disk-Based K-Means Clustering for Relational Databases," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 8, pp. 909-921, Aug. 2004, doi:10.1109/TKDE.2004.25