Materialized Sample Views for Database Approximation
March 2008 (vol. 20 no. 3)
pp. 337-351
We consider the problem of creating a sample view of a database table. A sample view is an indexed, materialized view that permits efficient sampling from an arbitrary range query over the view. Such "sample views" are very useful to applications that require random samples from a database: approximate query processing, online aggregation, data mining, and randomized algorithms are a few examples. Our core technical contribution is a new file organization called the ACE Tree that is suitable for organizing and indexing a sample view. One of the most important aspects of the ACE Tree is that it supports online random sampling from the view. That is, at all times, the set of records returned so far by the ACE Tree constitutes a statistically random sample of the database records satisfying the relational selection predicate over the view. Our paper presents experimental results that demonstrate the utility of the ACE Tree.
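
To make the online sampling property concrete, the following is a minimal sketch of a naive baseline, not the ACE Tree itself. It assumes a hypothetical in-memory table of (key, payload) records; the function name `online_range_sample` and the record layout are illustrative and do not come from the paper. The sketch filters the records matching a range predicate and emits them in uniformly random order, so any prefix of the output is a random sample of the matching records.

```python
import random
from typing import Iterator, List, Tuple

# Hypothetical record layout for illustration: (key, payload).
Record = Tuple[int, str]


def online_range_sample(
    records: List[Record], lo: int, hi: int, seed: int = 0
) -> Iterator[Record]:
    """Yield the records with lo <= key <= hi in uniformly random order.

    Every prefix of the output is a simple random sample (without
    replacement) of the matching records. This naive version scans the
    whole table before returning anything; it only illustrates the
    property and is not the file organization described in the abstract.
    """
    rng = random.Random(seed)
    matching = [r for r in records if lo <= r[0] <= hi]
    rng.shuffle(matching)  # uniform random permutation of the matches
    yield from matching


# Usage: draw the first ten sample records for the range [100, 199].
table = [(k, f"row-{k}") for k in range(1000)]
stream = online_range_sample(table, 100, 199, seed=42)
first_ten = [next(stream) for _ in range(10)]
```

The caller can stop consuming the stream at any point and still hold a valid random sample. The cost of this baseline is a full scan before the first record is returned, which is the kind of overhead an indexed sample view is intended to avoid.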


Index Terms:
Indexing methods, Query processing, Sampling
Citation:
Shantanu Joshi, Christopher Jermaine, "Materialized Sample Views for Database Approximation," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 3, pp. 337-351, March 2008, doi:10.1109/TKDE.2007.190664