Efficient Approximation of Correlated Sums on Data Streams
May/June 2003 (vol. 15 no. 3)
pp. 569-572
Johannes Gehrke, IEEE Computer Society
Flip Korn, IEEE Computer Society
Divesh Srivastava, IEEE Computer Society

Abstract: In many applications such as IP network management, data arrives in streams and queries over those streams need to be processed online using limited storage. Correlated-sum (CS) aggregates are a natural class of queries formed by composing basic aggregates on (x,y) pairs and are of the form SUM{g(y) : x ≤ f(AGG(x))}, where AGG(x) can be any basic aggregate and f(), g() are user-specified functions. CS-aggregates cannot be computed exactly in one pass through a data stream using limited storage; hence, we study the problem of computing approximate CS-aggregates. We guarantee a priori error bounds when AGG(x) can be computed in limited space (e.g., MIN, MAX, AVG), using two variants of Greenwald and Khanna's summary structure for the approximate computation of quantiles. Using real data sets, we experimentally demonstrate that an adaptation of the quantile summary structure uses much less space, and is significantly faster, than a more direct use of the quantile summary structure, for the same a posteriori error bounds. Finally, we prove that, when AGG(x) is a quantile (which cannot be computed over a data stream in limited space), the error of a CS-aggregate can be arbitrarily large.
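To make the query form SUM{g(y) : x ≤ f(AGG(x))} concrete, the following is a minimal Python sketch of the exact semantics only. It is not the paper's streaming algorithm: it stores all pairs and reads them twice, which is exactly what a limited-storage, one-pass setting rules out. The names `correlated_sum`, `avg`, and the sample `pairs` are hypothetical and chosen for illustration.

```python
from typing import Callable, Sequence, Tuple

Pair = Tuple[float, float]  # an (x, y) pair from the stream


def avg(xs: Sequence[float]) -> float:
    """Basic aggregate AVG(x)."""
    return sum(xs) / len(xs)


def correlated_sum(
    pairs: Sequence[Pair],
    agg: Callable[[Sequence[float]], float],
    f: Callable[[float], float] = lambda a: a,
    g: Callable[[float], float] = lambda y: y,
) -> float:
    """Exact SUM{g(y) : x <= f(AGG(x))}, computed in two passes.

    Illustrative only: this keeps every pair in memory, so it does not
    satisfy the one-pass, limited-storage constraint of a data stream.
    """
    # Pass 1: evaluate the basic aggregate over all x values.
    threshold = f(agg([x for x, _ in pairs]))
    # Pass 2: sum g(y) over the pairs whose x is within the threshold.
    return sum(g(y) for x, y in pairs if x <= threshold)


# Hypothetical sample data.
pairs = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0), (6.0, 40.0)]

# AVG(x) = 3.0, so the pairs with x <= 3.0 contribute 10 + 20 + 30.
print(correlated_sum(pairs, avg))  # 60.0

# MIN(x) = 1.0 and f doubles it, so the pairs with x <= 2.0 contribute 10 + 20.
print(correlated_sum(pairs, min, f=lambda a: 2 * a))  # 30.0
```

The second pass depends on a threshold that is known only after the whole stream has been seen, which is why a one-pass method has to summarize the (x, y) distribution and accept approximation error.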

[1] N. Alon, Y. Matias, and M. Szegedy, “The Space Complexity of Approximating the Frequency Moments,” JCSS: J. Computer and System Sciences, vol. 58, pp. 137-147, 1999.
[2] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, “Models and Issues in Data Stream Systems,” Proc. Symp. Principles of Database Systems, 2002.
[3] D. Chatziantoniou and K. Ross, “Querying Multiple Features of Groups in Relational Databases,” Proc. 22nd Int'l Conf. Very Large Data Bases, pp. 295-306, Sept. 1996.
[4] A. Dobra, M. Garofalakis, J. Gehrke, and R. Rastogi, “Processing Complex Aggregate Queries over Data Streams,” Proc. ACM SIGMOD Int'l Conf. Management of Data, June 2002.
[5] J. Feigenbaum, S. Kannan, M. Strauss, and M. Viswanathan, “An Approximate $L_1$-Difference Algorithm for Massive Data Streams,” Proc. 40th Ann. Symp. Foundations of Computer Science, pp. 501-511, 1999.
[6] J. Gehrke, F. Korn, and D. Srivastava, “On Computing Correlated Aggregates over Continuous Data Streams,” Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 13-24, May 2001.
[7] P. Gibbons and Y. Matias, “Synopsis Structures for Massive Data Sets,” DIMACS, Series in Discrete Math. and Theoretical Computer Science, A, 1999.
[8] M. Greenwald and S. Khanna, “Space-Efficient Online Computation of Quantile Summaries,” Proc. ACM SIGMOD Int'l Conf. Management of Data, 2001.
[9] M. Sullivan and A. Heybey, “Tribeca: A System for Managing Large Databases of Network Traffic,” Proc. USENIX, 1998.

Index Terms:
Correlated aggregates, data streams, approximation, summary structures, a priori error bounds, IP network management.
Rohit Ananthakrishna, Abhinandan Das, Johannes Gehrke, Flip Korn, S. Muthukrishnan, Divesh Srivastava, "Efficient Approximation of Correlated Sums on Data Streams," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 3, pp. 569-572, May-June 2003, doi:10.1109/TKDE.2003.1198391