One-Pass Wavelet Decompositions of Data Streams
May/June 2003 (vol. 15 no. 3)
pp. 541-554

Abstract—We present techniques for computing small space representations of massive data streams. These are inspired by traditional wavelet-based approximations that consist of specific linear projections of the underlying data. We present general “sketch”-based methods for capturing various linear projections and use them to provide pointwise and rangesum estimation of data streams. These methods use small amounts of space and per-item time while streaming through the data and provide accurate representation as our experiments with real data streams show.
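
To make the "traditional wavelet-based approximations" mentioned in the abstract concrete, the following is a minimal offline sketch, not the paper's streaming algorithm. It assumes an orthonormal Haar basis, a signal whose length is a power of two, and that the whole signal fits in memory; the function names (`haar_transform`, `inverse_haar`, `keep_top_b`, `range_sum_estimate`) are illustrative and do not come from the paper. Each wavelet coefficient is a linear projection of the data, and keeping only the B largest coefficients yields a small representation that can answer point and range-sum queries approximately.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_transform(signal):
    """Return orthonormal Haar coefficients of a signal of length 2^k."""
    n = len(signal)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("signal length must be a power of two")
    coeffs = [float(x) for x in signal]
    length = n
    while length > 1:
        half = length // 2
        # Pairwise scaled averages go to the front, details behind them.
        averages = [(coeffs[2 * i] + coeffs[2 * i + 1]) / SQRT2 for i in range(half)]
        details = [(coeffs[2 * i] - coeffs[2 * i + 1]) / SQRT2 for i in range(half)]
        coeffs[:length] = averages + details
        length = half
    return coeffs


def inverse_haar(coeffs):
    """Invert haar_transform, rebuilding the signal level by level."""
    n = len(coeffs)
    out = list(coeffs)
    length = 2
    while length <= n:
        half = length // 2
        merged = [0.0] * length
        for i in range(half):
            avg, det = out[i], out[half + i]
            merged[2 * i] = (avg + det) / SQRT2
            merged[2 * i + 1] = (avg - det) / SQRT2
        out[:length] = merged
        length *= 2
    return out


def keep_top_b(coeffs, b):
    """Zero out all but the b coefficients of largest magnitude."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(order[:b])
    return [c if i in kept else 0.0 for i, c in enumerate(coeffs)]


def range_sum_estimate(approx_signal, lo, hi):
    """Estimate the sum of signal[lo..hi] (inclusive) from the approximation."""
    return sum(approx_signal[lo:hi + 1])


# Illustrative data: a step signal that two Haar coefficients describe fully.
signal = [2, 2, 2, 2, 10, 10, 10, 10]
approx = inverse_haar(keep_top_b(haar_transform(signal), 2))
point = approx[5]                          # point estimate for position 5
total = range_sum_estimate(approx, 2, 5)   # range-sum estimate over 2..5
```

In this offline setting the coefficients are computed exactly from the stored signal. The abstract's contribution is different: it estimates such linear projections from small sketches maintained in one pass, without storing the data.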

Index Terms:
Data streams, wavelets, randomized algorithms, approximate queries.
Citation:
Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan, Martin J. Strauss, "One-Pass Wavelet Decompositions of Data Streams," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 3, pp. 541-554, May-June 2003, doi:10.1109/TKDE.2003.1198389