This Article 
 Bibliographic References 
 Add to: 
WhiteWater: Distributed Processing of Fast Streams
September 2007 (vol. 19 no. 9)
pp. 1214-1226
Monitoring systems today often involve continuous queries over streaming data, in a distributed collaborative fashion. The distribution of query operators over a network of processors, and their processing sequence, form a query configuration with inherent constraints on the throughput it can support. In this paper we discuss the implications of measuring and optimizing for output throughput, and its limitations. We propose to use instead the more granular input throughput and a version of throughput measure, the profiled input throughput, that is focused on matching the expected behavior of the input streams. We show how to evaluate a query configuration based on profiled input throughput, and that the problem of finding the optimal configuration is NP-hard. Furthermore, we describe how to overcome the complexity limitation by adapting hill-climbing heuristics to reduce the search space of configurations. We show experimentally that the approach used is not only efficient but also effective.

[1] D.J. Abadi, Y. Ahmad, M. Balazinska, U. Cetintemel, M. Cherniack, J.-H. Hwang, W. Lindner, A.S. Maskey, A. Rasin, E. Ryvkina, N. Tatbul, Y. Xing, and S. Zdonik, “The Design of the Borealis Stream Processing Engine,” Proc. Second Biennial Conf. Innovative Data Systems Research (CIDR '05), Jan. 2005.
[2] Y. Ahmad and U. Çetintemel, “Networked Query Processing for Distributed Stream-Based Applications,” Proc. 30th Int'l Conf. Very Large Data Bases (VLDB '04), 2004.
[3] D.J. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik, “Aurora: A New Model and Architecture for Data Stream Management,” The VLDBJ., vol. 12, no. 2, pp. 120-139, 2003.
[4] R. Avnur and J.M. Hellerstein, “Eddies: Continuously Adaptive Query Processing,” Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '00), W. Chen, J.F. Naughton, and P.A. Bernstein, eds., 2000.
[5] D.J. Abadi, W. Lindner, S. Madden, and J. Schuler, “An Integration Framework for Sensor Networks and Data Stream Management Systems,” Proc. 30th Int'l Conf. Very Large Data Bases (VLDB '04), 2004.
[6] R. Battiti and G. Tecchiolli, “The Reactive Tabu Search,” ORSA J. Computing., pp. 126-140, 1994.
[7] M. Cherniack, H. Balakrishnan, M. Balazinska, D. Carney, U. Cetintemel, Y. Xing, and S. Zdonik, “Scalable Distributed Stream Processing,” Proc. First Biennial Conf. Innovative Data Systems Research (CIDR '03), 2003.
[8] S. Chandrasekaran, O. Cooper, A. Deshpande, M.J. Franklin, J.M. Hellerstein, W. Hong, S. Krishnamurthy, S. Madden, V. Raman, F. Reiss, and M.A. Shah, “Telegraphcq: Continuous Dataflow Processing for an Uncertain World,” Proc. First Biennial Conf. Innovative Data Systems Research (CIDR '03), 2003.
[9] C.D. Cranor, T. Johnson, O. Spatscheck, and V. Shkapenyuk, “Gigascope: A Stream Database for Network Applications,” Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '03), 2003.
[10] M.J. Franklin, S.R. Jeffery, S. Krishnamurthy, F. Reiss, S. Rizvi, E. Wu, O. Cooper, A. Edakkunni, and W. Hong, “Design Considerations for High Fan-In Systems: The HIFI Approach,” Proc. Second Biennial Conf. Innovative Data Systems Research (CIDR '05), 2005.
[11] M.N. Garofalakis and Y.E. Ioannidis, “Multi-Dimensional Resource Scheduling for Parallel Queries,” Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '96), H.V. Jagadish and I.S.Mumick, eds., 1996.
[12] F.S. Hillier and G.J. Lieberman, Introduction to Operations Research, ninth ed. McGraw-Hill, 2005.
[13] Y.E. Ioannidis and Y. Kang, “Randomized Algorithms for Optimizing Large Join Queries,” Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '90), 1990.
[14] Y.E. Ioannidis and E. Wong, “Query Optimization by Simulated Annealing,” Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '87), pp. 9-22, 1987.
[15] M. Jarke and J. Koch, “Query Optimization in Database Systems,” ACM Computing Surveys, vol. 16, no. 2, pp. 111-152, 1984.
[16] Y. Jin and R. Strom, “Relational Subscription Middleware for Internet-Scale Publish-Subscribe,” Proc. Second Int'l Workshop Distributed Event-Based Systems (DEBS '03), 2003.
[17] J. Kang, J.F. Naughton, and S. Viglas, “Evaluating Window Joins over Unbounded Streams,” Proc. 19th IEEE Int'l Conf. Data Eng. (ICDE '03), 2003.
[18] R. Kumar, M. Wolenetz, B. Agarwalla, J. Shin, P. Hutto, A. Paul, and U. Ramachandran, “Dfuse: A Framework for Distributed Data Fusion,” Proc. First Int'l Conf. Embedded Networked Sensor Systems (SenSys '03), 2003.
[19] R.S.G. Lanzelotte, P. Valduriez, and M. Zaït, “On the Effectiveness of Optimization Search Strategies for Parallel Execution Spaces,” Proc. 19th Int'l Conf. Very Large Data Bases (VLDB '93), R. Agrawal, S. Baker, and D.A. Bell, eds., pp. 493-504, 1993.
[20] S. Madden, M.J. Franklin, J.M. Hellerstein, and W. Hong, “Tag: A Tiny Aggregation Service for Ad Hoc Sensor Networks,” SIGOPS Operating Systems Rev., vol. 36, no. SI, pp. 131-146, 2002.
[21] P. Pietzuch, J. Ledlie, J. Shneidman, M. Roussopoulos, M. Welsh, and M. Seltzer, “Network-Aware Operator Placement for Stream-Processing Systems,” Proc. 22nd IEEE Int'l Conf. Data Eng. (ICDE '06), 2006.
[22] V. Raman, A. Deshpande, and J.M. Hellerstein, “Using State Modules for Adaptive Query Processing,” Proc. 19th IEEE Int'l Conf. Data Eng. (ICDE '03), 2003.
[23] M. Stonebraker, P.M. Aoki, W. Litwin, A. Pfeffer, A. Sah, J. Sidell, C. Staelin, and A. Yu, “Mariposa: A Wide-Area Distributed Database System,” The VLDB J., vol. 5, no. 1, pp. 048-063, 1996.
[24] M. Shah, J. Hellerstein, S. Chandrasekaran, and M. Franklin, “Flux: An Adaptive Partitioning Operator for Continuous Query Systems,” Proc. Int'l Conf. Data Eng., 2002.
[25] N.G. Shivaratri, P. Krueger, and M. Singhal, “Load Distributing for Locally Distributed Systems,” Computer, vol. 25, no. 12, pp. 33-44, Dec. 1992.
[26] U. Srivastava, K. Munagala, and J. Widom, “Operator Placement for In-Network Stream Query Processing,” Proc. 24th ACM Symp. Principles of Database Systems (PODS '05), 2005.
[27] S. Viglas and J.F. Naughton, “Rate-Based Query Optimization for Streaming Information Sources,” Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '02), 2002.
[28] M.H. Willebeek-LeMair and A.P. Reeves, “Strategies for Dynamic Load Balancing on Highly Parallel Computers,” IEEE Trans. Parallel and Distributed Systems, vol. 4, no. 9, pp. 979-993, Sept. 1993.
[29] J. Widom and R. Motwani, “Query Processing, Resource Management, and Approximation in a Data Stream Management System,” Proc. First Biennial Conf. Innovative Data Systems Research, 2003.
[30] Y. Xing, S. Zdonik, and J.-H. Hwang, “Dynamic Load Distribution in the Borealis Stream Processor,” Proc. 21st IEEE Int'l Conf. Data Eng. (ICDE '05), 2005.

Index Terms:
Query processing, optimization, database architectures, distributed applications
Ioana Stanoi, George Mihaila, Themis Palpanas, Christian Lang, "WhiteWater: Distributed Processing of Fast Streams," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 9, pp. 1214-1226, Sept. 2007, doi:10.1109/TKDE.2007.1056
Usage of this product signifies your acceptance of the Terms of Use.