An Integrated Approach to Parallel Scheduling Using Gang-Scheduling, Backfilling, and Migration
March 2003 (vol. 14 no. 3)
pp. 236-247

Abstract—Effective scheduling strategies to improve response times, throughput, and utilization are an important consideration in large supercomputing environments. Parallel machines in these environments have traditionally used space-sharing strategies to accommodate multiple jobs at the same time by dedicating the nodes to a single job until it completes. This approach, however, can result in low system utilization and large job wait times. This paper discusses three techniques that can be used beyond simple space-sharing to improve the performance of large parallel systems. The first technique we analyze is backfilling, the second is gang-scheduling, and the third is migration. The main contribution of this paper is an analysis of the effects of combining the above techniques. Using extensive simulations based on detailed models of realistic workloads, the benefits of combining the various techniques are shown over a spectrum of performance criteria.
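
The contrast the abstract draws between plain space-sharing and backfilling can be made concrete with a small sketch. This is not the simulator used in the paper. It is a minimal toy model under stated assumptions: all jobs arrive at time 0, runtimes are known exactly, and the backfilling rule is an EASY-style one (a later job may jump ahead only if it does not delay the reserved start of the blocked job at the head of the queue). The abstract does not say which backfilling variant the authors use, and the function name `simulate`, its parameters, and the job tuples are hypothetical.

```python
import heapq


def simulate(jobs, total_nodes, backfill=False):
    """Return {job name: start time} for jobs that all arrive at time 0.

    jobs: list of (name, nodes, runtime) in queue order.
    Assumption: runtimes are exact; real schedulers only have estimates.
    """
    if any(nodes > total_nodes for _, nodes, _ in jobs):
        raise ValueError("a job requests more nodes than the machine has")

    queue = list(jobs)
    running = []  # min-heap of (finish_time, nodes)
    free = total_nodes
    now = 0
    starts = {}

    def start(job):
        nonlocal free
        name, nodes, runtime = job
        starts[name] = now
        free -= nodes
        heapq.heappush(running, (now + runtime, nodes))

    while queue:
        # Space-sharing, first come first served: start the head while it fits.
        while queue and queue[0][1] <= free:
            start(queue.pop(0))
        if not queue:
            break

        if backfill:
            # Reserve a start time ("shadow") for the blocked head job:
            # the earliest completion after which enough nodes are free.
            head_nodes = queue[0][1]
            avail, shadow = free, now
            for finish, nodes in sorted(running):
                avail += nodes
                shadow = finish
                if avail >= head_nodes:
                    break
            extra = avail - head_nodes  # nodes the head will not need at shadow time

            # Let later jobs run now if they cannot delay the reservation.
            for job in queue[1:]:
                _, nodes, runtime = job
                ends_in_time = now + runtime <= shadow
                if nodes <= free and (ends_in_time or nodes <= extra):
                    if not ends_in_time:
                        extra -= nodes
                    queue.remove(job)
                    start(job)

        # Advance the clock to the next job completion.
        finish, nodes = heapq.heappop(running)
        now = finish
        free += nodes

    return starts


jobs = [("A", 6, 10), ("B", 8, 5), ("C", 2, 8), ("D", 2, 20)]
print(simulate(jobs, total_nodes=8))                 # plain space-sharing
print(simulate(jobs, total_nodes=8, backfill=True))  # with backfilling
```

In this toy workload, job C waits behind the wide job B under plain space-sharing, while with backfilling it runs in the two nodes that A leaves idle and B still starts at the same time. Gang-scheduling and migration are not modeled here.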

Index Terms:
Parallel scheduling, gang scheduling, backfilling, migration, simulation.
Citation:
Yanyong Zhang, Hubertus Franke, Jose Moreira, Anand Sivasubramaniam, "An Integrated Approach to Parallel Scheduling Using Gang-Scheduling, Backfilling, and Migration," IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 3, pp. 236-247, March 2003, doi:10.1109/TPDS.2003.1189582