Domains and Active Domains: What This Distinction Implies for the Estimation of Projection Sizes in Relational Databases
August 1995 (vol. 7 no. 4)
pp. 641-655

Abstract: Database optimizers require statistical information about data distributions in order to evaluate result sizes and access plan costs when processing user queries. In this context, we consider the problem of estimating the size of the projections of a database relation when measures of attribute domain cardinalities are maintained in the system. Our main theoretical contribution is a new formal model (AD), valid under the hypotheses of attribute independence and uniform distribution of attribute values, derived by considering the difference between the time-invariant domain (the set of values that an attribute can assume) and the time-dependent “active domain” (the set of values that are actually assumed at a certain time). Early models developed under the same assumptions are shown to be formally incorrect. Since the AD model is computationally demanding, we also introduce an approximate, easy-to-compute model (A2D) that, unlike previous approximations, yields low errors over the entire parameter space of the active domain cardinalities. Finally, we extend the A2D model to the case of nonuniform distributions and present experimental results confirming the good behavior of the model.
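
The distinction between a domain and an active domain can be illustrated with a small simulation. The sketch below is not the AD or A2D model, whose formulas are not given in this abstract. It assumes a much simpler textbook setting: attribute values drawn independently and uniformly, with replacement, from a fixed domain. The function names (expected_distinct, simulate_active_domain) and the parameter values are hypothetical and chosen only for illustration.

```python
import random


def expected_distinct(domain_size: int, num_tuples: int) -> float:
    """Expected number of distinct values observed when num_tuples values
    are drawn independently and uniformly, with replacement, from a domain
    of domain_size values. This is a classical occupancy estimate and is
    an assumption of this sketch, not the paper's AD or A2D model."""
    return domain_size * (1.0 - (1.0 - 1.0 / domain_size) ** num_tuples)


def simulate_active_domain(domain_size: int, num_tuples: int,
                           rng: random.Random) -> int:
    """Draw num_tuples values from the domain and return the size of the
    active domain, i.e. the number of values actually assumed."""
    return len({rng.randrange(domain_size) for _ in range(num_tuples)})


if __name__ == "__main__":
    rng = random.Random(0)           # fixed seed so runs are repeatable
    domain_size, num_tuples = 100, 100
    trials = 2000
    mean_active = sum(
        simulate_active_domain(domain_size, num_tuples, rng)
        for _ in range(trials)
    ) / trials
    print("domain size:           ", domain_size)
    print("simulated active size: ", round(mean_active, 2))
    print("occupancy estimate:    ",
          round(expected_distinct(domain_size, num_tuples), 2))
```

Even in this simplified setting the active domain is noticeably smaller than the domain when the number of tuples is comparable to the domain cardinality, which is why an estimator that treats the two as interchangeable can misjudge projection sizes.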

Index Terms:
Relational database, projection, error estimate, combinatorial models, statistical profile, query optimization.
Citation:
Paolo Ciaccia, Dario Maio, "Domains and Active Domains: What This Distinction Implies for the Estimation of Projection Sizes in Relational Databases," IEEE Transactions on Knowledge and Data Engineering, vol. 7, no. 4, pp. 641-655, Aug. 1995, doi:10.1109/69.404035