Issue No.12 - Dec. (2012 vol.23)
pp: 2351-2365
Vincenzo Gulisano , Universidad Politécnica de Madrid, Madrid
Ricardo Jiménez-Peris , Universidad Politécnica de Madrid, Madrid
Marta Patiño-Martínez , Universidad Politécnica de Madrid, Madrid
Claudio Soriente , Universidad Politécnica de Madrid, Madrid
Patrick Valduriez , INRIA and LIRMM, University Montpellier 2, Montpellier
Many applications in several domains such as telecommunications, network security, large-scale sensor networks, require online processing of continuous data flows. They produce very high loads that requires aggregating the processing capacity of many nodes. Current Stream Processing Engines do not scale with the input load due to single-node bottlenecks. Additionally, they are based on static configurations that lead to either under or overprovisioning. In this paper, we present StreamCloud, a scalable and elastic stream processing engine for processing large data stream volumes. StreamCloud uses a novel parallelization technique that splits queries into subqueries that are allocated to independent sets of nodes in a way that minimizes the distribution overhead. Its elastic protocols exhibit low intrusiveness, enabling effective adjustment of resources to the incoming load. Elasticity is combined with dynamic load balancing to minimize the computational resources used. The paper presents the system design, implementation, and a thorough evaluation of the scalability and elasticity of the fully implemented system.
Peer to peer computing, Semantics, Streaming media, Scalability, Load management, Cloud computing, Elasticity, elasticity, Data streaming, scalability
Vincenzo Gulisano, Ricardo Jiménez-Peris, Marta Patiño-Martínez, Claudio Soriente, Patrick Valduriez, "StreamCloud: An Elastic and Scalable Data Streaming System", IEEE Transactions on Parallel & Distributed Systems, vol.23, no. 12, pp. 2351-2365, Dec. 2012, doi:10.1109/TPDS.2012.24
[1] M. Stonebraker, U. Çetintemel, and S.B. Zdonik, "The 8 Requirements of Real-Time Stream Processing," SIGMOD Record, vol. 34, no. 4, pp. 42-47, 2005.
[2] S. Chandrasekaran, O. Cooper, A. Deshpande, M.J. Franklin, J.M. Hellerstein, W. Hong, S. Krishnamurthy, S. Madden, V. Raman, F. Reiss, and M.A. Shah, "Telegraphcq: Continuous Dataflow Processing for an Uncertain World," Proc. First Biennial Conf. Innovative Data Systems Research (CIDR), 2003.
[3] D.J. Abadi, Y. Ahmad, M. Balazinska, U. Çetintemel, M. Cherniack, J.-H. Hwang, W. Lindner, A. Maskey, A. Rasin, E. Ryvkina, N. Tatbul, Y. Xing, and S.B. Zdonik, "The Design of the Borealis Stream Processing Engine," Proc. Second Biennial Conf. Innovative Data Systems Research (CIDR), pp. 277-289, 2005.
[4] M.T. Özsu and P. Valduriez, Principles of Distributed Database Systems, third ed. Springer, 2011.
[5] V. Gulisano, R. Jiménez-Peris, M. Patiño-Martínez, and P. Valduriez, "Streamcloud: A Large Scale Data Streaming System," Proc. Int'l Conf. Distributed Computing Systems (ICDCS '10), pp. 126-137, 2010.
[6] N. Tatbul, U. Çetintemel, and S.B. Zdonik, "Staying Fit: Efficient Load Shedding Techniques for Distributed Stream Processing," Proc. Int'l Conf. Very Large Data Bases (VLDB), pp. 159-170, 2007.
[7] D.J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S.B. Zdonik, "Aurora: A New Model and Architecture for Data Stream Management," VLDB J., vol. 12, no. 2, pp. 120-139, 2003.
[8] M. Cherniack, H. Balakrishnan, M. Balazinska, D. Carney, U. Çetintemel, Y. Xing, and S.B. Zdonik, "Scalable Distributed Stream Processing," Proc. Biennial Conf. Innovative Data Systems Research (CIDR), 2003.
[9] M.A. Shah, J.M. Hellerstein, S. Chandrasekaran, and M.J. Franklin, "Flux: An Adaptive Partitioning Operator for Continuous Query Systems," Proc. IEEE Int'l Conf. Data Eng. (ICDE), pp. 25-36, 2003.
[10] G. Graefe, "Encapsulation of Parallelism in the Volcano Query Processing System," Proc. SIGMOD Conf., pp. 102-111, 1990.
[11] A.J. Demers, J. Gehrke, B. Panda, M. Riedewald, V. Sharma, and W.M. White, "Cayuga: A General Purpose Event Monitoring System," Proc. Biennial Conf. Innovative Data Systems Research (CIDR), pp. 412-422, 2007.
[12] L. Brenna, J. Gehrke, M. Hong, and D. Johansen, "Distributed Event Stream Processing with Non-Deterministic Finite Automata," Proc. ACM Int'l Conf. Distributed Event-Based Systems (DEBS), pp. 1-12, 2009.
[13] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari, "S4: Distributed Stream Computing Platform," Proc. IEEE Int'l Conf. Data Mining Workshops (ICDM), pp. 170-177, 2010.
[14] H. Andrade, B. Gedik, K.-L. Wu, and P.S. Yu, "Processing High Data Rate Streams in System S," J. Parallel Distributed Computing, vol. 71, no. 2, pp. 145-156, 2011.
[15] Y. Xing, S.B. Zdonik, and J.-H. Hwang, "Dynamic Load Distribution in the Borealis Stream Processor," Proc. Int'l Conf. Data Eng. (ICDE), pp. 791-802, 2005.
[16] R.L. Collins and L.P. Carloni, "Flexible Filters: Load Balancing through Backpressure for Stream Programs," Proc. Seventh ACM Int'l Conf. Embedded Software (EMSOFT), pp. 205-214, 2009.
[17] G. Soundararajan, C. Amza, and A. Goel, "Database Replication Policies for Dynamic Content Applications," Proc. ACM SIGOPS/EuroSys European Conf. Computer Systems (EuroSys), pp. 89-102, 2006.
[18] J. Chen, G. Soundararajan, and C. Amza, "Autonomic Provisioning of Backend Databases in Dynamic Content Web Servers," Proc. IEEE Int'l Conf. Autonomic Computing (ICAC), pp. 231-242, 2006.
[19] A. Gounaris, J. Smith, N.W. Paton, R. Sakellariou, A.A.A. Fernandes, and P. Watson, "Adaptive Workload Allocation in Query Processing in Autonomous Heterogeneous Environments," Distributed and Parallel Databases, vol. 25, no. 3, pp. 125-164, 2009.
[20] M. Balazinska, H. Balakrishnan, S. Madden, and M. Stonebraker, "Fault-Tolerance in the Borealis Distributed Stream Processing System," Proc. SIGMOD Conf., pp. 13-24, 2005.