Integrating K-Means Clustering with a Relational DBMS Using SQL
February 2006 (vol. 18 no. 2)
pp. 188-201
Integrating data mining algorithms with a relational DBMS is an important problem for database programmers. We introduce three SQL implementations of the popular K-means clustering algorithm to integrate it with a relational DBMS: 1) a straightforward translation of K-means computations into SQL, 2) an optimized version based on improved data organization, efficient indexing, sufficient statistics, and rewritten queries, and 3) an incremental version that uses the optimized version as a building block with fast convergence and automated reseeding. We experimentally show the proposed K-means implementations work correctly and can cluster large data sets. We identify which K-means computations are more critical for performance. The optimized and incremental K-means implementations exhibit linear scalability. We compare K-means implementations in SQL and C++ with respect to speed and scalability and we also study the time to export data sets outside of the DBMS. Experiments show that SQL overhead is significant for small data sets, but relatively low for large data sets, whereas export times become a bottleneck for C++.
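To make the first variant concrete, the following is a minimal sketch of what a straightforward translation of K-means into SQL might look like. It is not the SQL from the article. The table names (Y for points, C for centroids, YD for distances, YN for nearest-centroid assignments), the two-dimensional layout, the function name kmeans_sql, and the use of SQLite through Python are all assumptions made for illustration.

```python
import sqlite3


def kmeans_sql(points, centroids, iterations=10):
    """Run K-means on 2-D points, doing the numeric work in SQL statements.

    Hypothetical schema (not taken from the article):
      Y(i, x1, x2)   data points
      C(j, x1, x2)   centroids
      YD(i, j, dist) squared distance from point i to centroid j
      YN(i, j)       nearest centroid j for point i
    """
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE Y  (i INTEGER PRIMARY KEY, x1 REAL, x2 REAL);
        CREATE TABLE C  (j INTEGER PRIMARY KEY, x1 REAL, x2 REAL);
        CREATE TABLE YD (i INTEGER, j INTEGER, dist REAL, PRIMARY KEY (i, j));
        CREATE TABLE YN (i INTEGER PRIMARY KEY, j INTEGER);
    """)
    con.executemany("INSERT INTO Y VALUES (?, ?, ?)",
                    [(i, p[0], p[1]) for i, p in enumerate(points)])
    con.executemany("INSERT INTO C VALUES (?, ?, ?)",
                    [(j, c[0], c[1]) for j, c in enumerate(centroids)])

    for _ in range(iterations):
        # Step 1: squared Euclidean distance from every point to every centroid.
        con.execute("DELETE FROM YD")
        con.execute("""
            INSERT INTO YD (i, j, dist)
            SELECT Y.i, C.j,
                   (Y.x1 - C.x1) * (Y.x1 - C.x1) + (Y.x2 - C.x2) * (Y.x2 - C.x2)
            FROM Y CROSS JOIN C
        """)
        # Step 2: assign each point to its nearest centroid (lowest j on ties).
        con.execute("DELETE FROM YN")
        con.execute("""
            INSERT INTO YN (i, j)
            SELECT i, MIN(j)
            FROM YD
            WHERE dist = (SELECT MIN(t.dist) FROM YD AS t WHERE t.i = YD.i)
            GROUP BY i
        """)
        # Step 3: recompute each non-empty cluster's centroid as the mean of
        # its points; centroids of empty clusters are left unchanged.
        con.execute("""
            INSERT OR REPLACE INTO C (j, x1, x2)
            SELECT YN.j, AVG(Y.x1), AVG(Y.x2)
            FROM Y JOIN YN ON Y.i = YN.i
            GROUP BY YN.j
        """)

    result = con.execute("SELECT x1, x2 FROM C ORDER BY j").fetchall()
    con.close()
    return result


if __name__ == "__main__":
    pts = [(0, 0), (0, 2), (10, 0), (10, 2)]
    print(kmeans_sql(pts, [(1, 1), (9, 1)], iterations=3))
```

This sketch recomputes every point-to-centroid distance on each pass and runs a fixed number of iterations. It does not attempt the improved data organization, indexing, sufficient statistics, or automated reseeding that the abstract attributes to the optimized and incremental versions.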


Index Terms: Clustering, K-means, SQL, relational DBMS.
Citation:
Carlos Ordonez, "Integrating K-Means Clustering with a Relational DBMS Using SQL," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 2, pp. 188-201, Feb. 2006, doi:10.1109/TKDE.2006.31