Issue No.12 - December (2010 vol.22)
pp: 1752-1765
Carlos Ordonez, University of Houston, Houston
ABSTRACT
Statistical models are generally computed outside a DBMS due to their mathematical complexity. We introduce techniques to efficiently compute fundamental statistical models inside a DBMS exploiting User-Defined Functions (UDFs). Specifically, we study the computation of linear regression, PCA, clustering, and Naive Bayes. Two summary matrices on the data set are mathematically shown to be essential for all models: the linear sum of points and the quadratic sum of cross products of points. We consider two layouts for the input data set: horizontal and vertical. We first introduce efficient SQL queries to compute summary matrices and score the data set. Based on the SQL framework, we introduce UDFs that work in a single table scan: aggregate UDFs to compute summary matrices for all models and a set of primitive scalar UDFs to score data sets. Experiments compare UDFs and SQL queries (running inside the DBMS) with C++ (analyzing exported files). In general, UDFs are faster than SQL queries and not much slower than C++. Considering export times, C++ is slower than UDFs and SQL queries. Statistical models based on precomputed summary matrices are computed in a few seconds. UDFs scale linearly and only require one table scan, highlighting their efficiency.
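To make the role of the two summary matrices concrete, the following is a minimal Python sketch, not the paper's UDF code. It assumes the horizontal layout (one row per point) and uses the hypothetical names n, L, and Q for the point count, the linear sum of points, and the quadratic sum of cross products. It accumulates all three in one pass over the rows, then derives the mean and the population covariance from the summaries alone, without rereading the data.

```python
from typing import Iterable, List, Sequence, Tuple

Matrix = List[List[float]]


def summarize(rows: Iterable[Sequence[float]]) -> Tuple[int, List[float], Matrix]:
    """Single pass over the rows, analogous to one table scan.

    Returns (n, L, Q): the point count, the linear sum of points, and
    the quadratic sum of cross products. The names are illustrative
    assumptions, not identifiers taken from the paper.
    """
    n = 0
    L: List[float] = []
    Q: Matrix = []
    for x in rows:
        if n == 0:
            # Size the accumulators from the first row seen.
            d = len(x)
            L = [0.0] * d
            Q = [[0.0] * d for _ in range(d)]
        n += 1
        for a in range(d):
            L[a] += x[a]
            for b in range(d):
                Q[a][b] += x[a] * x[b]
    return n, L, Q


def mean_and_covariance(n: int, L: List[float], Q: Matrix) -> Tuple[List[float], Matrix]:
    """Derive the mean and population covariance from the summaries only."""
    d = len(L)
    mu = [L[a] / n for a in range(d)]
    cov = [[Q[a][b] / n - mu[a] * mu[b] for b in range(d)] for a in range(d)]
    return mu, cov


# Toy data set with three two-dimensional points.
rows = [(1.0, 2.0), (3.0, 4.0), (5.0, 9.0)]
n, L, Q = summarize(rows)
mu, cov = mean_and_covariance(n, L, Q)
```

Because n, L, and Q are plain sums, partial results computed over disjoint subsets of rows can be added together, which is one reason an aggregate function is a natural fit for this kind of computation.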
INDEX TERMS
DBMS, SQL, statistical model, UDF.
CITATION
Carlos Ordonez, "Statistical Model Computation with UDFs", IEEE Transactions on Knowledge & Data Engineering, vol. 22, no. 12, pp. 1752-1765, December 2010, doi:10.1109/TKDE.2010.44