2016 IEEE Second International Conference on Big Data Computing Service and Applications (BigDataService) (2016)
Oxford, United Kingdom
March 29, 2016 to April 1, 2016
Forecasts of daily pollutant levels have become a standard part of weather predictions on television, online, and in newspapers. Research groups also need to analyze larger timeframes across more locations to correlate long-term developments for different pollutants with multiple serious health effects such as asthma. This paper presents a comparison of the Hadoop MapReduce and Spark programming models for air quality simulations, guiding future code development for the research groups interested in these analyses. Two use cases are considered, namely (i) calculating the eight-hour rolling average of pollutants in a restricted region, and (ii) identifying clusters of sensors showing similar patterns in pollutant concentration over multiple years in the state of Texas. The data set used in this analysis is air pollution data collected over fifteen years at 179 monitor sites across the state of Texas for a variety of pollutants. Our results reveal 20-25% performance benefits for the Spark solutions over MapReduce. Furthermore, the paper documents performance benefits of the Spark MLlib machine learning library over the Mahout library, which is based on the MapReduce programming model.
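The first use case, the eight-hour rolling average, can be illustrated with a minimal single-machine sketch in plain Python. This is not the authors' MapReduce or Spark code, which the abstract does not describe. The function name `rolling_average`, the `(site_id, hour_index, value)` record layout, and the assumption of gap-free hourly readings are all hypothetical choices made for illustration.

```python
from collections import defaultdict, deque


def rolling_average(readings, window=8):
    """Compute a trailing rolling mean per monitor site.

    readings: iterable of (site_id, hour_index, value) tuples (hypothetical layout).
    Returns {site_id: [(hour_index, mean of the last `window` values), ...]}.
    Assumes one reading per hour with no gaps; missing hours are not handled.
    """
    # Group readings by site, similar in spirit to keying records by site
    # before a reduce step.
    by_site = defaultdict(list)
    for site, hour, value in readings:
        by_site[site].append((hour, value))

    result = {}
    for site, series in by_site.items():
        series.sort()                 # order each site's readings by hour
        buf = deque(maxlen=window)    # holds the most recent `window` values
        total = 0.0
        averages = []
        for hour, value in series:
            if len(buf) == window:
                total -= buf[0]       # value about to be evicted by append
            buf.append(value)
            total += value
            if len(buf) == window:    # emit only once a full window exists
                averages.append((hour, total / window))
        result[site] = averages
    return result


if __name__ == "__main__":
    # Ten hourly readings for one hypothetical site "A" with values 1..10.
    sample = [("A", h, float(h + 1)) for h in range(10)]
    print(rolling_average(sample)["A"])
```

Grouping by site first keeps each site's window independent of the others, so the per-site loop could be distributed across workers in either programming model.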
Sparks, Atmospheric modeling, Air quality, Analytical models, Sensors, Computational modeling, Data models
H. Ayyalasomayajula, E. Gabriel, P. Lindner and D. Price, "Air Quality Simulations Using Big Data Programming Models," 2016 IEEE Second International Conference on Big Data Computing Service and Applications (BigDataService), Oxford, United Kingdom, 2016, pp. 182-184.