2015 IEEE 22nd International Conference on High Performance Computing Workshops (HiPCW) (2015)
Dec. 16, 2015 to Dec. 19, 2015
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/HiPCW.2015.22
The growing ability to collect observations on the physical and natural world has introduced data-driven science as a first-class paradigm. However, the types of data driving scientific and data-management challenges have themselves evolved over the last decade. From efforts to batch-analyze large datasets from a few massive instruments like the Large Synoptic Survey Telescope (LSST), which take hours if not days to process, we have made our way to high-throughput gene sequencing and adaptive weather analysis that process large data volumes in minutes or hours for clinical diagnostics and for steering weather instruments. This evolution continues with the deployment of sensors and mobile devices generating streams of events that must be examined in seconds or faster to drive decisions.
This highlights the need for middleware for decision support applications that process streams, in large numbers and at fast rates, and need to respond rapidly to control costly and critical systems in a closed loop. Two popular terms that capture these application domains and their middleware are the Internet of Things (IoT) and Big Data platforms, the latter particularly in the velocity dimension. Many applications within IoT exemplify Dynamic Data Driven Applications Systems (DDDAS), such as Smart Power and Water Grids, Smart Transportation, and urban monitoring. Such applications exhibit the need for an Observe Orient Decide Act (OODA) feedback cycle, where observations on the system are used to decide when and how to optimize the system, enact these decisions, and continue observing to ensure that the goal, such as reliability or efficiency, is met. The decision logic itself may use simple time-series forecasting models, rule-based systems or sophisticated machine learning algorithms, depending on the latency available for decision making.
Scientific workflows were popular for orchestrating data-intensive applications on Grids and Clouds, and some dynamic workflows, including our own work, have been used to control DDDAS applications. However, workflow systems are not crafted for low-latency applications. Distributed Stream Processing Systems (DSPS) have grown over the past few years to support streaming applications. Platforms like Apache Storm and Apache Spark Streaming allow streaming dataflows to be composed from user logic and executed incrementally with low latency over hundreds of streams at 1000's of events per second, on commodity clusters.
DSPS appear to be an intrinsic platform needed by IoT middleware, but need closer examination based on the emerging needs of this novel domain. In this paper, we characterize the role, capabilities and performance requirements of DSPS in order to leverage them for large-scale IoT applications. Specifically, we make the following contributions:
1) We offer IoT application scenarios which exemplify the role of DSPS, and other information flow processing systems, within a closed-loop data-driven decision making and control system.
2) We characterize IoT data streams based on real-world, public datasets from Smart Grids, Smart Transportation, and environmental monitoring, to understand their throughput and message size distributions.
3) We identify several micro-patterns used in the composition of DSPS applications, and application dataflow patterns for IoT scenarios, such as data pre-processing, temporal aggregation and forecasting, that are representative of workloads in this area. This, along with the stream characterization, helps define IoT benchmark workloads and associated performance characteristics to evaluate DSPS.
4) We validate this workload by offering a comparative evaluation of Storm and Spark Streaming, two popular open-source DSPS, and discuss the results.
5) Lastly, we identify open problems in DSPS that must be addressed to enable their effective use in IoT middleware, including support for elasticity on Clouds, use of edge devices for distributed analytics, deadline-driven stream processing, and integrating DSMS with DSPS for event and stream analytics.
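As an illustration of the temporal aggregation micro-pattern named in contribution 3, the following is a minimal sketch in plain Python, not code from the paper and not tied to Storm or Spark Streaming. The event layout (timestamp, sensor id, value), the function name tumbling_average, and the assumption that events arrive in timestamp order are all hypothetical choices made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical event layout: (timestamp in seconds, sensor id, reading).
Event = Tuple[float, str, float]


def tumbling_average(
    events: Iterable[Event], window_s: float
) -> Iterator[Tuple[float, Dict[str, float]]]:
    """Yield (window_start, {sensor_id: mean}) for each non-overlapping window.

    Assumes events arrive in non-decreasing timestamp order; late or
    out-of-order events are not handled in this sketch.
    """
    current_start = None
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)

    for ts, sensor, value in events:
        # Align the timestamp to the start of its window.
        start = (ts // window_s) * window_s
        if current_start is None:
            current_start = start
        if start != current_start:
            # The previous window is closed: emit its per-sensor means.
            yield current_start, {s: sums[s] / counts[s] for s in sums}
            sums.clear()
            counts.clear()
            current_start = start
        sums[sensor] += value
        counts[sensor] += 1

    # Flush the last, possibly partial, window.
    if counts:
        yield current_start, {s: sums[s] / counts[s] for s in sums}


if __name__ == "__main__":
    sample = [
        (0.0, "a", 1.0),
        (1.0, "a", 3.0),
        (2.5, "b", 10.0),
        (5.0, "a", 4.0),
        (6.0, "b", 2.0),
    ]
    for window_start, means in tumbling_average(sample, window_s=5.0):
        print(window_start, means)
```

Because the generator emits a result as soon as a window closes, it processes the stream incrementally rather than in batch, which is the property the abstract attributes to DSPS dataflows.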
Digital signal processing, Middleware, Process control, Smart grids, Conferences, Meteorology, Forecasting
A. Shukla, T. Sharma and Y. Simmhan, "Characterizing Distributed Stream Processing Systems for IoT Applications," 2015 IEEE 22nd International Conference on High Performance Computing Workshops (HiPCW), Bengaluru, India, 2015, pp. 61.