A Hybrid Estimator for Selectivity Estimation
March/April 1999 (vol. 11 no. 2)
pp. 338-354

Abstract: Traditional sampling-based estimators infer the selectivity of a query purely from information gathered at run time and ignore previously collected information, so they underutilize what is available. Table-based and parametric estimators extrapolate the selectivity of a query only from previously collected information and ignore on-line information, which leads to inaccurate estimates in a frequently updated environment. We propose a novel hybrid estimator that optimally combines on-line and previously collected information. Theoretical analysis demonstrates that the two sources of information are complementary and that using them together is of value for improving performance further. Our theoretical results are validated by a comprehensive experimental study using a practical database, in the presence of insert, delete, and update operations. The hybrid approach is promising because it provides an adaptive mechanism that optimally combines information obtained from different sources to achieve higher estimation accuracy and reliability.
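
The abstract does not spell out how the two sources are combined, so the following is only a minimal sketch of the general idea, not the authors' estimator. It assumes that both estimates are unbiased and independent, and it swaps in plain inverse-variance weighting as a stand-in for the "optimal combination". The function names `sample_selectivity` and `combine`, the stored estimate, and its variance are all hypothetical.

```python
import random


def sample_selectivity(rows, predicate, n, rng):
    """On-line estimate: fraction of n sampled rows that satisfy the predicate."""
    sample = [rng.choice(rows) for _ in range(n)]  # sampling with replacement
    p = sum(1 for r in sample if predicate(r)) / n
    # Binomial variance of a sample proportion; the floor avoids a zero variance
    # when every sampled row agrees (assumption made for numerical safety).
    var = max(p * (1 - p) / n, 1e-9)
    return p, var


def combine(est_a, var_a, est_b, var_b):
    """Inverse-variance weighting of two independent, unbiased estimates.

    This is a generic textbook combination, used here as an assumption;
    the paper's actual combination rule is not given in the abstract.
    """
    w = var_b / (var_a + var_b)          # weight on estimate A
    est = w * est_a + (1 - w) * est_b    # the less noisy source gets more weight
    var = var_a * var_b / (var_a + var_b)
    return est, var


if __name__ == "__main__":
    rng = random.Random(0)
    rows = list(range(1000))

    def predicate(x):
        return x < 300  # true selectivity is 0.3

    online, online_var = sample_selectivity(rows, predicate, 200, rng)
    stored, stored_var = 0.25, 0.01  # hypothetical stale catalog estimate
    print(combine(online, online_var, stored, stored_var))
```

Under these assumptions the combined variance is never larger than the smaller of the two input variances, which illustrates why the two sources can be complementary: a stale stored estimate still tightens a small sample, and a fresh sample corrects a stored estimate that updates have degraded.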

Index Terms:
Hybrid estimator, sampling estimator, parametric estimator, table-based estimator, query optimization, estimation accuracy, estimation reliability.
Yibei Ling, Wei Sun, Naphtali D. Rishe, Xianjing Xiang, "A Hybrid Estimator for Selectivity Estimation," IEEE Transactions on Knowledge and Data Engineering, vol. 11, no. 2, pp. 338-354, March-April 1999, doi:10.1109/69.761667