IEEE Transactions on Computers, vol. 66, no. 8, Aug. 2017
Dazhao Cheng , Department of Computer Science, University of North Carolina at Charlotte, Charlotte, NC
Xiaobo Zhou , Department of Computer Science, University of Colorado, Colorado Springs, CO
Palden Lama , Department of Computer Science, University of Texas at San Antonio, 1 UTSA Circle, San Antonio, TX
Jun Wu , Department of Computer Science & Technology, Tongji University, 1239 Siping Road, Shanghai, China
Changjun Jiang , Department of Computer Science & Technology, Tongji University, 1239 Siping Road, Shanghai, China
While MapReduce is inherently designed for batch, high-throughput processing workloads, there is an increasing demand for non-batch processing of big data, e.g., interactive jobs, real-time queries, and stream computations. The emerging Apache Spark framework fills this gap: it can run on an established Hadoop cluster and take advantage of the existing HDFS. As a result, the Spark-on-YARN deployment model is widely adopted by many industry leaders. However, we identify three key challenges in deploying Spark on YARN: inflexible reservation-based resource management, inter-task dependency-blind scheduling, and locality interference between Spark and MapReduce applications. These three challenges cause inefficient resource utilization and significant performance deterioration. We propose and develop a cross-platform resource scheduling middleware,
iKayak, which aims to improve resource utilization and application performance in multi-tenant Spark-on-YARN clusters. iKayak relies on three key mechanisms: reservation-aware executor placement to avoid long waits for resource reservations, dependency-aware resource adjustment to exploit under-utilized resources occupied by reduce tasks, and cross-platform locality-aware task assignment to coordinate locality competition between Spark and MapReduce applications. We implement iKayak in YARN. Experimental results on a testbed show that iKayak achieves up to 50 percent performance improvement for Spark applications and 19 percent for MapReduce applications, compared to two popular Spark-on-YARN deployment models, i.e., the YARN-client model and the YARN-cluster model.
Spark, Resource management, YARN, Job shop scheduling, Computer science, Processor scheduling, Big data
D. Cheng, X. Zhou, P. Lama, J. Wu and C. Jiang, "Cross-Platform Resource Scheduling for Spark and MapReduce on YARN," in IEEE Transactions on Computers, vol. 66, no. 8, pp. 1341-1353, 2017.