This Article 
   
 Share 
   
 Bibliographic References 
   
 Add to: 
 
Digg
Furl
Spurl
Blink
Simpy
Google
Del.icio.us
Y!MyWeb
 
 Search 
   
A Selectivity Model for Fragmented Relations: Applied in Information Retrieval
May 2004 (vol. 16 no. 5)
pp. 635-639

Abstract—New application domains cause today's database sizes to grow rapidly, posing great demands on technology. Data fragmentation facilitates techniques (like distribution, parallelization, and main-memory computing) meeting these demands. Also, fragmentation might help to improve efficient processing of query types such as top {\rm{N}}. Database design and query optimization require a good notion of the costs resulting from a certain fragmentation. Our mathematically derived selectivity model facilitates this. Once its two parameters have been computed based on the fragmentation, after each (though usually infrequent) update, our model can forget the data distribution, resulting in fast and quite good selectivity estimation. We show experimental verification for Zipfian distributed IR databases.

[1] Proc. ACM SIGMOD, 1996.
[2] Proc. Very Large Data Base Conf., M.P. Atkinson, M.E. Orlowska, P. Valduriez, S.B. Zdonik, and M.L. Brodie, eds., 1999.
[3] H.E. Blok, S. Choenni, H.M. Blanken, and P.M.G. Apers, A Selectivity Model for Fragmented Relations in Information Retrieval, Technical report, no. TR-CTIT-01-02, Centre for Telematics and Information Technology, University of Twente, The Netherlands, Jan. 2001.
[4] H.E. Blok, D. Hiemstra, S. Choenni, F. de Jong, H.M. Blanken, and P.M.G. Apers, Predicting the Cost-Quality Trade-Off for Information Retrieval Queries: Facilitating Database Design and Query Optimization Proc. ACM Conf. Information and Knowledge Management, 2001.
[5] H.E. Blok, Database Optimization Aspects for Information Retrieval PhD thesis, Univ. of Twente, Enschede, The Netherlands, 2002.
[6] A.F. Cardenas, Analysis and Performance of Inverted Data Base Structures Comm. ACM, vol. 18, no. 5, pp. 253-263, 1975.
[7] S. Chaudhuri and L. Gravano, Evaluating Top-$k$Selection Queries Proc. Very Large Data Base Conf., pp. 397-410, 1999.
[8] M.J. Carey and D. Kossmann, Reducing the Braking Distance of an SQL Query Engine Proc. Very Large Data Base Conf., pp. 158-169, 1998.
[9] S. Choenni, M. Kersten, A. Saad, and J. v.d. Akker, A Framework for Multi-Query Optimization Report CS-R9638, CWI, Centre for Math. and Computer Science, Amsterdam, 1996.
[10] S. Chaudhuri, R. Motwani, and V. Narasayya, Random Sampling for Histogram Construction: How Much Is Enough? Proc. ACM SIGMOD, pp. 436-447, 1998.
[11] S. Chaudhuri, R. Motwani, and V. Narasayya, On Random Sampling over Joins Proc. ACM SIGMOD, pp. 263-274, 1999.
[12] C.M. Chen and N. Roussopoulos, Adaptive Selectivity Estimation Using Query Feedback Proc. ACM SIGMOD, pp. 161-172, 1994.
[13] D. Donjerkovic and R. Ramakrishnan, Probabilistic Optimization of Top N Queries Proc. Very Large Data Base Conf., pp. 411-422, 1999.
[14] M. Fang, N. Shivakumar, H Garcia-Molina, R. Motwani, and J.D. Ullman, Computing Iceberg Queries Efficiently Proc. Very Large Data Base Conf., pp. 299-310, 1998.
[15] S. Ganguly, P.B. Gibbons, Y. Matias, and A. Silberschatz, Bifocal Sampling for Skew-Resistant Join Size Estimation Proc. ACM SIGMOD, pp. 271-281, 1996.
[16] Proc. Very Large Data Base Conf., A. Gupta, O. Shmueli, and J. Widom, eds., 1998.
[17] L.M. Haas, D. Kossmann, E.L. Wimmers, and J. Yang, Optimizing Queries across Diverse Data Sources Proc. Very Large Data Base Conf., 1997.
[18] A. Ijbema and H.M. Blanken, Estimating Bucket Accesses: A Practical Approach Proc. Int'l Conf. Data Eng., pp. 30-37, 1986.
[19] Y.E. Ioannidis and V. Poosala, Balancing Histogram Optimality and Practicality for Query Result Size Estimation Proc. ACM SIGMOD, pp. 233-244, 1995.
[20] R.J. Lipton, J.F. Naughton, and D.A. Schneider, Practical Selectivity Estimation through Adaptive Sampling Proc. ACM SIGMOD, H. Garcia-Molina and H.V. Jagadish, eds., pp. 1-11, 1990.
[21] V. Poosala, Y.E. Ioannidi, P.J. Haas, and E.J. Shekita, Improved Histograms for Selectivity Estimation of Range Predicates Proc. ACM SIGMOD, pp. 294-305, 1996.
[22] A.P. Sheth and J.A. Larson, Federated Database Systems for Managing Distributed, Heterogeneaous, and Autonomous Databases ACM Computing Surveys, vol. 22, no. 3, pp. 183-236, 1990.
[23] A.P. de Vries and D. Hiemstra, The miRRor DBMS at TREC Proc. Text REtrieval Conf., E.M. Voorhees and D.K. Harman, eds., pp. 725-734, NIST special publications, 1999.
[24] S.B. Yao, Approximating Block Accesses in Database Organizations Comm. ACM, vol. 20, no. 4, 260-261, 1977.
[25] G.K. Zipf, Human Behavior and the Principle of Least Effort. Reading, Mass.: Addison-Wesley, Reading, 1949.

Index Terms:
Selectivity, fragmentation, Zipf, information retrieval, databases.
Citation:
Henk Ernst Blok, Sunil Choenni, Henk M. Blanken, Peter M.G. Apers, "A Selectivity Model for Fragmented Relations: Applied in Information Retrieval," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 5, pp. 635-639, May 2004, doi:10.1109/TKDE.2004.1277824
Usage of this product signifies your acceptance of the Terms of Use.