ISSN: 1939-1374
Zujie Ren, Hangzhou Dianzi University, Hangzhou
Jian Wan, Hangzhou Dianzi University, Hangzhou
Weisong Shi, Wayne State University, Detroit
Xianghua Xu, Hangzhou Dianzi University, Hangzhou
Min Zhou, Taobao, Inc., Hangzhou
MapReduce is becoming the state-of-the-art computing paradigm for processing large-scale datasets on large clusters. Hadoop, an open-source implementation of MapReduce, is widely used to support data-intensive computation jobs in a cluster. As with other server-side systems, understanding the characteristics of the workload is key to making optimal configuration decisions and improving system throughput. However, workload analysis on Hadoop clusters, especially in large-scale e-commerce production environments, has not yet been well studied. In this paper, we perform a comprehensive workload analysis using a trace collected from a 2,000-node Hadoop cluster at Taobao, the biggest online e-commerce enterprise in Asia, ranked 11$^{th}$ in the world as reported by Alexa. The results of the workload analysis are representative of, and generally consistent with, data warehouses for e-commerce web sites. Based on the observations and implications derived from the trace, we designed a workload generator, Ankus, to expedite the performance evaluation and debugging of new mechanisms. Furthermore, we proposed and implemented a job scheduling algorithm, Fair4S, which is designed to be biased towards small jobs. Experimental evaluation verified that Fair4S reduces the average waiting time of small jobs by a factor of 7 compared with the fair scheduler.
Distributed systems, Distributed databases, Parallel databases

Z. Ren, J. Wan, W. Shi, X. Xu and M. Zhou, "Workload Analysis, Implications and Optimization on a Production Hadoop Cluster: A Case Study on Taobao," in IEEE Transactions on Services Computing.