2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Chicago, IL, USA
May 23, 2016 to May 27, 2016
As the data-driven economy evolves, enterprises have come to see a competitive advantage in acting on high-volume, high-velocity streams of data. Distributed message queues and stream processing platforms that scale to thousands of data stream partitions on commodity hardware have emerged in response. However, the programming APIs these systems provide are often low-level, requiring substantial custom code that steepens the programmer's learning curve and adds maintenance overhead. These systems also often lack the SQL querying capabilities that have proven popular on Big Data systems such as Hive, Impala, or Presto. We define a minimal set of extensions to standard SQL for data stream querying and manipulation. These extensions are prototyped in SamzaSQL, a new tool that compiles streaming SQL into physical plans executed on Samza, an open-source distributed stream processing framework. We compare the performance of streaming SQL queries against native Samza applications and discuss usability improvements. SamzaSQL is part of the open-source Apache Samza project and will be available for general use.
Yarn, Standards, Distributed databases, Fault tolerance, Fault tolerant systems, Computer architecture, Big data
M. Pathirage, J. Hyde, Y. Pan and B. Plale, "SamzaSQL: Scalable Fast Data Management with Streaming SQL," 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Chicago, IL, USA, 2016, pp. 1627-1636.