2013 IEEE 5th International Conference on Cloud Computing Technology and Science (2013)
Bristol, United Kingdom
Dec. 2, 2013 to Dec. 5, 2013
As an efficient parallel computing system based on the MapReduce model, Hadoop is widely used for large-scale data analysis such as data mining, machine learning, and scientific simulation. However, MapReduce still has some performance problems, especially in the shuffle phase. To address these problems, this paper proposes a lightweight, independent shuffle service component with a more efficient I/O policy to replace the existing shuffle phase in MapReduce. We also describe how to implement the shuffle service in three steps: extract the shuffle from the reduce task as a separate shuffle task, restructure the shuffle task as a service, and improve the I/O scheduling policy on the map side. Furthermore, both simulated experiments and comparative studies of MapReduce jobs are conducted to evaluate the performance of our improvements. The results show that our approach can decrease the overall execution time of a job and make full use of cluster resources.
Data models, Bandwidth, Computational modeling, Google, Facebook, Memory management, Protocols, shuffle, hadoop, mapreduce
Jingui Li, Xuelian Lin, Xiaolong Cui, Yue Ye, "Improving the Shuffle of Hadoop MapReduce", 2013 IEEE 5th International Conference on Cloud Computing Technology and Science, vol. 01, pp. 266-273, 2013, doi:10.1109/CloudCom.2013.42