Random Sampling for Continuous Streams with Arbitrary Updates
January 2007 (vol. 19 no. 1)
pp. 96-110
The existing random sampling methods have at least one of the following disadvantages: they 1) are applicable only to certain update patterns, 2) entail large space overhead, or 3) incur prohibitive maintenance cost. These drawbacks prevent their effective application in stream environments (where a relation is updated by a large volume of insertions and deletions that may arrive in any order), despite the considerable success of random sampling in conventional databases. Motivated by this, we develop several fully dynamic algorithms for obtaining random samples from individual relations, and from the join result of two tables. Our solutions can handle any update pattern with small space and computational overhead. We also present an in-depth analysis that provides valuable insight into the characteristics of alternative sampling strategies and leads to precision guarantees. Extensive experiments validate our theoretical findings and demonstrate the efficiency of our techniques in practice.

Index Terms:
Sampling, selectivity estimation.
Citation:
Yufei Tao, Xiang Lian, Dimitris Papadias, Marios Hadjieleftheriou, "Random Sampling for Continuous Streams with Arbitrary Updates," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, pp. 96-110, Jan. 2007, doi:10.1109/TKDE.2007.14