Workload Analysis, Implications and Optimization on a Production Hadoop Cluster: A Case Study on Taobao
PrePrint
ISSN: 1939-1374
Zujie Ren, Hangzhou Dianzi University, Hangzhou
Jian Wan, Hangzhou Dianzi University, Hangzhou
Weisong Shi, Wayne State University, Detroit
Xianghua Xu, Hangzhou Dianzi University, Hangzhou
Min Zhou, Taobao, Inc., Hangzhou
MapReduce is becoming the state-of-the-art computing paradigm for processing large-scale datasets on a large cluster. Hadoop, an open-source implementation of MapReduce, is widely used to support data-intensive computation jobs in a cluster. As with other server-side systems, understanding the characteristics of workloads is the key to making optimal configuration decisions and improving system throughput. However, workload analysis on a Hadoop cluster, especially in a large-scale e-commerce production environment, has not yet been well studied. In this paper, we perform a comprehensive workload analysis using a trace collected from a 2,000-node Hadoop cluster at Taobao, the biggest online e-commerce enterprise in Asia, ranked 11th in the world as reported by Alexa. The results of the workload analysis are representative and generally consistent with those of data warehouses for e-commerce web sites. Based on the observations and implications derived from the trace, we designed a workload generator, Ankus, to expedite the performance evaluation and debugging of new mechanisms. Furthermore, we proposed and implemented a job scheduling algorithm, Fair4S, which is designed to be biased towards small jobs. Experimental evaluation verified that Fair4S reduces the average waiting time of small jobs by a factor of 7 compared with the fair scheduler.
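To illustrate why biasing a scheduler towards small jobs can shorten their waiting time, the following is a minimal sketch. It is not Fair4S, whose design the abstract does not describe; it only compares arrival-order execution with a shortest-first ordering on a single execution slot. The job durations, the threshold for "small", and the function names are all hypothetical, and every job is assumed to arrive at time zero.

```python
def waiting_times(durations):
    """Return the time each job waits before starting, when jobs run
    one at a time in the given order on a single slot."""
    waits, clock = [], 0
    for d in durations:
        waits.append(clock)
        clock += d
    return waits


def mean_small_job_wait(durations, small_threshold):
    """Average waiting time over jobs whose duration is <= small_threshold."""
    waits = waiting_times(durations)
    small = [w for w, d in zip(waits, durations) if d <= small_threshold]
    return sum(small) / len(small)


# Hypothetical job durations (arbitrary time units), listed in arrival order.
jobs = [100, 2, 3, 80, 1]

# Arrival order: small jobs queue behind the large ones.
fifo_wait = mean_small_job_wait(jobs, small_threshold=5)

# Shortest-first order: a crude stand-in for a small-job bias (assumption).
small_first_wait = mean_small_job_wait(sorted(jobs), small_threshold=5)

print(f"arrival order:  {fifo_wait:.2f}")
print(f"shortest first: {small_first_wait:.2f}")
```

In this toy setting the small jobs wait far less when they are run first. A production scheduler must also account for multiple slots, staggered arrivals, and fairness to large jobs, none of which this sketch models.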
Index Terms:
Distributed systems, Distributed databases, Parallel databases
Citation:
Zujie Ren, Jian Wan, Weisong Shi, Xianghua Xu, Min Zhou, "Workload Analysis, Implications and Optimization on a Production Hadoop Cluster: A Case Study on Taobao," IEEE Transactions on Services Computing, 06 Aug. 2013. IEEE Computer Society Digital Library. IEEE Computer Society, <http://doi.ieeecomputersociety.org/10.1109/TSC.2013.40>