Semantic Approximation of Data Stream Joins
January 2005 (vol. 17 no. 1)
pp. 44-59
We consider the problem of approximating sliding window joins over data streams in a data stream processing system with limited resources. In our model, we deal with resource constraints by shedding load in the form of dropping tuples from the data streams. We make two main contributions. First, we define the problem space by discussing architectural models for data stream join processing and surveying suitable measures for the quality of an approximation of a set-valued query result. Second, we examine in detail a large part of this problem space. More precisely, we consider the number of generated result tuples as the quality measure and we propose optimal offline and fast online algorithms for it. In a thorough experimental study with synthetic and real data, we show the efficacy of our solutions.
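
To make the setting concrete, the following is a minimal sketch of a sliding window equi-join under a memory budget, with the number of generated result tuples as the quality measure. It is an illustration only and not the optimal offline or fast online algorithms the abstract refers to. The arrival model (one tuple per stream per time step), the function names, and the two eviction policies are all assumptions made for this example.

```python
from collections import Counter


def drop_oldest(tuples, partner_freq):
    """Value-agnostic baseline policy: evict the oldest stored tuple."""
    return 0


def drop_least_frequent(tuples, partner_freq):
    """Value-aware policy (an assumed heuristic): evict the stored tuple
    whose join key has appeared least often on the partner stream so far.
    Ties go to the oldest tuple."""
    return min(range(len(tuples)), key=lambda i: partner_freq[tuples[i][1]])


def join_with_shedding(r, s, window, memory, choose_victim):
    """Count result tuples of a sliding window equi-join with load shedding.

    r, s          -- equal-length lists of join keys; element t arrives at time t
    window        -- tuples arriving at times i and j join only if |i - j| < window
    memory        -- maximum number of tuples stored per stream
    choose_victim -- policy returning the index of the stored tuple to drop
    """
    stored = {"R": [], "S": []}          # per stream: list of (arrival_time, key)
    seen = {"R": Counter(), "S": Counter()}
    partner = {"R": "S", "S": "R"}
    output = 0
    for t, (rk, sk) in enumerate(zip(r, s)):
        # Expire tuples that have slid out of the window.
        for side in stored:
            stored[side] = [(a, k) for (a, k) in stored[side] if a > t - window]
        # Probe: each new tuple joins with the stored tuples of the other stream.
        output += sum(1 for _, k in stored["S"] if k == rk)
        output += sum(1 for _, k in stored["R"] if k == sk)
        # The two tuples arriving at the same time may also join each other.
        output += 1 if rk == sk else 0
        # Insert the new tuples, shedding one tuple if the budget is exceeded.
        for side, key in (("R", rk), ("S", sk)):
            seen[side][key] += 1
            stored[side].append((t, key))
            if len(stored[side]) > memory:
                victim = choose_victim(stored[side], seen[partner[side]])
                stored[side].pop(victim)
    return output


if __name__ == "__main__":
    r = [1, 2, 1, 3, 1, 1, 2, 1]
    s = [1, 1, 2, 3, 1, 4, 1, 1]
    exact = join_with_shedding(r, s, window=4, memory=4, choose_victim=drop_oldest)
    for policy in (drop_oldest, drop_least_frequent):
        approx = join_with_shedding(r, s, window=4, memory=2, choose_victim=policy)
        print(policy.__name__, approx, "of", exact, "result tuples")
```

With `memory` at least as large as `window`, nothing is ever shed and the function returns the exact join size. With a smaller budget, every policy yields a subset of the exact result, so the count it reports can be compared directly against the exact size.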

Index Terms:
Data streams, approximation algorithms, semantic load shedding, set approximation error metrics, join processing.
Citation:
Abhinandan Das, Johannes Gehrke, Mirek Riedewald, "Semantic Approximation of Data Stream Joins," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 1, pp. 44-59, Jan. 2005, doi:10.1109/TKDE.2005.17