Zujie Ren, Hangzhou Dianzi University, Hangzhou
Jian Wan, Hangzhou Dianzi University, Hangzhou
Weisong Shi, Wayne State University, Detroit
Xianghua Xu, Hangzhou Dianzi University, Hangzhou
Min Zhou, Taobao, Inc., Hangzhou
ABSTRACT
MapReduce is becoming the state-of-the-art computing paradigm for processing large-scale datasets on large clusters. Hadoop, an open-source implementation of MapReduce, is widely used to support data-intensive computation jobs in a cluster. As with other server-side systems, understanding the characteristics of workloads is the key to making optimal configuration decisions and improving system throughput. However, workload analysis on a Hadoop cluster, especially in a large-scale e-commerce production environment, has not yet been well studied. In this paper, we performed a comprehensive workload analysis using the trace collected from a 2,000-node Hadoop cluster at Taobao, which is the biggest online e-commerce enterprise in Asia and is ranked 11th in the world by Alexa. The results of the workload analysis are representative and generally consistent with those of data warehouses for e-commerce web sites. Based on the observations and implications derived from the trace, we designed a workload generator, Ankus, to expedite the performance evaluation and debugging of new mechanisms. Furthermore, we proposed and implemented a job scheduling algorithm, Fair4S, which is designed to be biased towards small jobs. Experimental evaluation verified that Fair4S reduces the average waiting time of small jobs by a factor of 7 compared with the fair scheduler.
INDEX TERMS
Distributed systems, Distributed databases, Parallel databases
CITATION
Zujie Ren, Jian Wan, Weisong Shi, Xianghua Xu, Min Zhou, "Workload Analysis, Implications and Optimization on a Production Hadoop Cluster: A Case Study on Taobao", IEEE Transactions on Services Computing, no. 1, pp. 1, PrePrints, doi:10.1109/TSC.2013.40