2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (2013)
Cambridge, MA, USA
May 20, 2013 to May 24, 2013
MapReduce, in particular Hadoop, is a popular framework for the distributed processing of large datasets on clusters of relatively inexpensive servers. Although Hadoop clusters are highly scalable and ensure data availability in the face of server failures, their efficiency is poor. We study data placement as a potential source of inefficiency. Despite networking improvements that have narrowed the performance gap between map tasks that access local or remote data, we find that nodes servicing remote HDFS requests see significant slowdowns of collocated map tasks due to interference effects, whereas nodes making these requests do not experience proportionate slowdowns. To reduce remote accesses, and thus avoid their destructive performance interference, we investigate an intelligent data placement policy we call 'partitioned data placement'. We find that, in an unconstrained cluster where a job's map tasks may be scheduled dynamically on any node over time, Hadoop's default random data placement is effective in avoiding remote accesses. However, when task placement is restricted by long-running jobs or other reservations, partitioned data placement substantially reduces remote access rates (e.g., by as much as 86% over random placement for a job allocated only one-third of a cluster).
data placement, MapReduce, Hadoop, remote accesses
P. Tandon, M. J. Cafarella and T. F. Wenisch, "Minimizing Remote Accesses in MapReduce Clusters," 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), Cambridge, MA, USA, 2013, pp. 1928-1936.