Issue No. 01 - Jan. (2014 vol. 63)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TC.2013.121
Chamikara Jayalath, Purdue University, West Lafayette
Julian Stephen, Purdue University, West Lafayette
Patrick Eugster, Purdue University, West Lafayette
Efficiently analyzing big data is a major challenge of the current era. Examples of analysis tasks include the identification or detection of global weather patterns, economic changes, social phenomena, or epidemics. The cloud computing paradigm, along with software tools such as implementations of the popular MapReduce framework, offers a response to the problem by distributing computations among large sets of nodes. In many scenarios, however, input data are geographically distributed (geodistributed) across data centers, and straightforwardly moving all data to a single data center before processing can be prohibitively expensive. The above-mentioned tools are designed to work within a single cluster or data center and perform poorly, or not at all, when deployed across data centers. This paper deals with executing sequences of MapReduce jobs on geodistributed data sets. We analyze possible ways of executing such jobs and propose data transformation graphs that can be used to determine schedules for job sequences that are optimized with respect to either execution time or monetary cost. We introduce G-MR, a system for executing such job sequences, which implements our optimization framework. We present empirical evidence from Amazon EC2 and VICCI of the benefits of G-MR over common, naïve deployments for processing geodistributed data sets. Our evaluations show that using G-MR significantly improves processing time and cost for geodistributed data sets.
Data center, geodistributed, MapReduce, big data
C. Jayalath, J. Stephen and P. Eugster, "From the Cloud to the Atmosphere: Running MapReduce across Data Centers," in IEEE Transactions on Computers, vol. 63, no. 1, pp. 74-87, Jan. 2014.