Bayesian Classifiers Programmed in SQL
January 2010 (vol. 22 no. 1)
pp. 139-144
Carlos Ordonez, University of Houston, Houston
Sasi K. Pitchaimalai, University of Houston, Houston
The Bayesian classifier is a fundamental classification technique. In this work, we focus on programming Bayesian classifiers in SQL. We introduce two classifiers: Naive Bayes and a classifier based on class decomposition using K-means clustering. We consider two complementary tasks: model computation and scoring a data set. We study several layouts for tables and several indexing alternatives. We analyze how to transform equations into efficient SQL queries and introduce several query optimizations. We conduct experiments with real and synthetic data sets to evaluate classification accuracy, query optimizations, and scalability. Our Bayesian classifier is more accurate than Naive Bayes and decision trees. Distance computation is significantly accelerated with horizontal layout for tables, denormalization, and pivoting. We also compare Naive Bayes implementations in SQL and C++: SQL is about four times slower. Our Bayesian classifier in SQL achieves high classification accuracy, can efficiently analyze large data sets, and has linear scalability.
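The two tasks named above, model computation and scoring, can be illustrated with a minimal sketch, assuming a Gaussian Naive Bayes model and a horizontal table layout (one column per dimension). The table names `X`, `T`, and `NB`, the column names, and the toy data are hypothetical and are not taken from the paper; the queries are one plausible formulation, not the authors' actual SQL. SQLite is driven from Python only so that the example is self-contained.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Some SQLite builds lack ln(), so register one from Python (assumption of this sketch).
conn.create_function("ln", 1, math.log)

conn.executescript("""
-- Hypothetical training table in horizontal layout: point id, two dimensions, class g.
CREATE TABLE X (i INTEGER PRIMARY KEY, x1 REAL, x2 REAL, g INTEGER);
INSERT INTO X VALUES
  (1, 1.0, 2.0, 0), (2, 1.2, 1.8, 0), (3, 0.8, 2.2, 0),
  (4, 5.0, 6.0, 1), (5, 5.2, 6.2, 1), (6, 4.8, 5.8, 1);

-- Hypothetical table of unlabeled points to score.
CREATE TABLE T (i INTEGER PRIMARY KEY, x1 REAL, x2 REAL);
INSERT INTO T VALUES (1, 1.1, 2.1), (2, 5.1, 5.9);

-- Model computation: one pass with GROUP BY yields, per class,
-- the prior plus the mean and variance of each dimension.
CREATE TABLE NB AS
SELECT g,
       COUNT(*) * 1.0 / (SELECT COUNT(*) FROM X) AS p_g,
       AVG(x1) AS m1, AVG(x1 * x1) - AVG(x1) * AVG(x1) AS v1,
       AVG(x2) AS m2, AVG(x2 * x2) - AVG(x2) * AVG(x2) AS v2
FROM X
GROUP BY g;
""")

# Scoring: cross join each point with every class, compute the Gaussian
# log-score (constant terms dropped), then keep the class with the highest score.
predictions = conn.execute("""
WITH S AS (
  SELECT T.i AS i, NB.g AS g,
         ln(p_g)
         - 0.5 * ln(v1) - (x1 - m1) * (x1 - m1) / (2 * v1)
         - 0.5 * ln(v2) - (x2 - m2) * (x2 - m2) / (2 * v2) AS score
  FROM T CROSS JOIN NB
),
B AS (SELECT i, MAX(score) AS best FROM S GROUP BY i)
SELECT S.i, S.g
FROM S JOIN B ON S.i = B.i AND S.score = B.best
ORDER BY S.i
""").fetchall()

print(predictions)
```

With the model stored one row per class, scoring needs no pivoting of the data table: each point is compared against every class row in a single join. This is one way a horizontal layout can keep per-point arithmetic inside one SELECT expression.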

Index Terms:
Classification, K-means, query optimization.
Carlos Ordonez, Sasi K. Pitchaimalai, "Bayesian Classifiers Programmed in SQL," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 1, pp. 139-144, Jan. 2010, doi:10.1109/TKDE.2009.127