Online Random Shuffling of Large Database Tables
January 2007 (vol. 19 no. 1)
pp. 73-84
Many applications require a randomized ordering of input data. Examples include algorithms for online aggregation, data mining, and various randomized algorithms. Most existing work seems to assume that accessing the records of a large database in randomized order is not a difficult problem. In practice, however, it turns out to be extremely difficult. With existing methods, randomization is extremely expensive either at the front end (as data are loaded) or at the back end (as data are queried). This paper presents a simple file structure that supports both efficient online random shuffling of a large database and efficient online sampling or randomization of the database when it is queried. The key innovation of our method is the introduction of a small degree of carefully controlled, rigorously monitored nonrandomness into the file.

Index Terms:
Sampling methods, database systems.
Citation:
Christopher Jermaine, "Online Random Shuffling of Large Database Tables," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, pp. 73-84, Jan. 2007, doi:10.1109/TKDE.2007.13