2013 IEEE 5th International Conference on Cloud Computing Technology and Science (2013)
Bristol, United Kingdom
Dec. 2, 2013 to Dec. 5, 2013
The MapReduce programming model, along with its open-source implementation Hadoop, has provided a cost-effective solution for many data-intensive applications. Hadoop stores data in a distributed fashion and exploits data locality by assigning tasks to the nodes where the data is stored. In many cases, however, accessing remote data (rack-local and off-rack) is inevitable. In this paper we evaluate the possibility of improving remote data access performance by streaming data from multiple available replicas. The proposed design consists of a circular buffer, a slice reader, and an enhanced Data Node. Such a system is capable of adapting both to the static performance variance caused by network topology and to the dynamic variance caused by congestion. Extensive experiments show that multi-source streaming can significantly improve the throughput of remote data access and accelerate the related map tasks by 10%-20%. In some imbalanced environments, the proposed system can achieve as much as a 4x speedup.
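The idea of reading one block in slices from several replicas can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: the function name `stream_slices`, the replica names, and the per-slice cost model are hypothetical, and a plain list stands in for the circular buffer. It shows how a pull-based scheme, in which each replica serves the next unread slice as soon as it is idle, lets faster replicas carry more of the load.

```python
import heapq


def stream_slices(block, slice_size, replica_costs):
    """Simulate pulling fixed-size slices of one block from several replicas.

    replica_costs maps a hypothetical replica name to the seconds it needs
    per slice (an assumed cost model). Each replica takes the next unread
    slice as soon as it is idle, so faster replicas serve more slices.
    Returns (reassembled data, slices served per replica, finish time).
    """
    num_slices = (len(block) + slice_size - 1) // slice_size
    buffer = [None] * num_slices  # stands in for the reorder buffer
    served = {name: 0 for name in replica_costs}

    # Min-heap of (time at which the replica becomes idle, replica name).
    idle = [(0.0, name) for name in sorted(replica_costs)]
    heapq.heapify(idle)

    finish = 0.0
    for i in range(num_slices):
        now, name = heapq.heappop(idle)      # earliest idle replica
        done = now + replica_costs[name]     # when this slice arrives
        buffer[i] = block[i * slice_size:(i + 1) * slice_size]
        served[name] += 1
        finish = max(finish, done)
        heapq.heappush(idle, (done, name))

    return b"".join(buffer), served, finish


block = bytes(range(100))
data, served, finish = stream_slices(block, 10, {"fast": 1.0, "slow": 4.0})
print(served, finish)
```

In this toy setting the replica with the lower per-slice cost ends up serving most of the slices, and the block completes sooner than it would from either replica alone. Changing the costs between calls gives a rough feel for how such a scheme could adapt to congestion.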
Throughput, Bandwidth, Benchmark testing, Peer-to-peer computing, Network topology, Media, Servers, Streaming, MapReduce, Hadoop, Multi-source
Jiadong Wu, Bo Hong, "Improving MapReduce Performance by Streaming Input Data from Multiple Replicas", 2013 IEEE 5th International Conference on Cloud Computing Technology and Science, vol. 01, pp. 623-630, 2013, doi:10.1109/CloudCom.2013.88