Issue No. 05 - May (2016 vol. 27)
ISSN: 1045-9219
pp: 1470-1483
Zhendong Bei, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences, Shenzhen, 518055, China
Zhibin Yu, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences, Shenzhen, 518055, China
Huiling Zhang, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences, Shenzhen, 518055, China
Wen Xiong, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences, Shenzhen, 518055, China
Chengzhong Xu, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences, Shenzhen, 518055, China
Lieven Eeckhout, Ghent University, Belgium
Shengzhong Feng, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences, Shenzhen, 518055, China
ABSTRACT
Hadoop is a widely used implementation of the MapReduce programming model for large-scale data processing. Hadoop performance, however, is significantly affected by the settings of its configuration parameters. Unfortunately, manually tuning these parameters is very time-consuming, if practical at all. This paper proposes an approach, called RFHOC, that automatically tunes the Hadoop configuration parameters to optimize the performance of a given application running on a given cluster. RFHOC constructs two ensembles of performance models using a random-forest approach, one for the map stage and one for the reduce stage. Leveraging these models, RFHOC employs a genetic algorithm to automatically search the Hadoop configuration space. An evaluation of RFHOC using five typical Hadoop programs, each with five different input data sets, shows that it achieves an average speedup of 2.11x, and up to 7.4x, over the recently proposed cost-based optimization (CBO) approach. In addition, RFHOC's performance benefit increases with input data set size.
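The search step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the parameter names and ranges in SPACE are hypothetical, and predicted_runtime is a hand-written stand-in for the trained map-stage and reduce-stage random-forest models, which are not reproduced here. The sketch only shows how a genetic algorithm can search a configuration space by querying a performance predictor instead of running the job.

```python
import random

# Hypothetical search space: parameter name -> (min, max) integer range.
# Names and ranges are illustrative assumptions, not taken from the paper.
SPACE = {
    "sort_buffer_mb": (50, 800),
    "reduce_tasks": (1, 64),
    "parallel_copies": (1, 50),
}

def predicted_runtime(cfg):
    """Stand-in predictor (assumption): in RFHOC this role is played by the
    map-stage and reduce-stage random-forest model ensembles."""
    map_time = 200.0 / (1 + cfg["sort_buffer_mb"] / 100.0) + 0.05 * cfg["sort_buffer_mb"]
    reduce_time = (300.0 / cfg["reduce_tasks"] + 2.0 * cfg["reduce_tasks"]
                   + 100.0 / cfg["parallel_copies"])
    return map_time + reduce_time

def random_config(rng):
    # Draw every parameter uniformly from its range.
    return {k: rng.randint(lo, hi) for k, (lo, hi) in SPACE.items()}

def crossover(a, b, rng):
    # Uniform crossover: each parameter comes from one of the two parents.
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in SPACE}

def mutate(cfg, rng, rate=0.2):
    # Resample each parameter with probability `rate`.
    out = dict(cfg)
    for k, (lo, hi) in SPACE.items():
        if rng.random() < rate:
            out[k] = rng.randint(lo, hi)
    return out

def genetic_search(predict, generations=40, pop_size=30, elite=5, seed=0):
    rng = random.Random(seed)
    pop = [random_config(rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=predict)          # lower predicted runtime is fitter
        parents = pop[:elite]          # elitism: the best configs survive unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - elite)
        ]
        pop = parents + children
    return min(pop, key=predict)

best = genetic_search(predicted_runtime)
print(best, round(predicted_runtime(best), 2))
```

Because the fittest configurations are carried over unchanged in each generation, the best predicted runtime never gets worse as the search proceeds. Swapping in a real trained model only requires passing a different callable as `predict`.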
INDEX TERMS
Training, Predictive models, Genetic algorithms, Analytical models, Support vector machines, Data models, Prediction algorithms
CITATION

Z. Bei et al., "RFHOC: A Random-Forest Approach to Auto-Tuning Hadoop's Configuration," in IEEE Transactions on Parallel & Distributed Systems, vol. 27, no. 5, pp. 1470-1483, 2016.
doi:10.1109/TPDS.2015.2449299