Impact of Memory Contention on Dynamic Scheduling on NUMA Multiprocessors
November 1996 (vol. 7 no. 11)
pp. 1201-1214

Abstract—Self-scheduling is a method for task scheduling in parallel programs, in which each processor acquires a new block of tasks for execution whenever it becomes idle. To get the best performance, the block size must be chosen to balance the scheduling overhead against the load imbalance. To determine the best block size, a better understanding of the role of load imbalance in self-scheduling performance is needed.

In this paper we study the effect of memory contention on task duration distributions and, hence, on load balancing in self-scheduling on a Nonuniform Memory Access (NUMA) machine. Experimental studies on a BBN TC2000 are used to reveal the strengths and weaknesses of analytical performance models in predicting running time and optimal block size. The models are shown to be very accurate for small block sizes. However, the models fail when the block size is large, due to a previously unrecognized source of load imbalance. We extend the analytical models to address this failure. The implications for the construction of compilers and runtime systems are discussed.
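The self-scheduling scheme the abstract describes can be sketched in a few lines. The following is a minimal illustration, not the paper's actual runtime (which targets the BBN TC2000); all names here are hypothetical. Each worker atomically claims the next block of `block_size` iterations from a shared counter whenever it goes idle, so the block size trades scheduling overhead (one counter update per block) against end-of-loop load imbalance:

```python
# Minimal sketch of chunked self-scheduling with a shared iteration counter.
# Hypothetical illustration only; function and variable names are not from the paper.
import threading

def self_schedule(n_iters, n_workers, block_size, body):
    """Run body(i) for each i in range(n_iters) using self-scheduling."""
    next_iter = 0
    lock = threading.Lock()

    def worker():
        nonlocal next_iter
        while True:
            # Claiming a block is the per-block scheduling overhead.
            with lock:
                start = next_iter
                next_iter += block_size
            if start >= n_iters:
                return  # iteration space exhausted; this worker goes idle
            # Execute the claimed block of tasks.
            for i in range(start, min(start + block_size, n_iters)):
                body(i)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

A larger `block_size` means fewer trips through the lock but a larger worst-case difference in finishing times among workers, which is exactly the trade-off the paper's models quantify.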

[1] G.F. Pfister, W.C. Brantley, D.A. George, S.L. Harvey, W.J. Kleinfelder, K.P. McAuliffe, E.A. Melton, V.A. Norton, and J. Weiss, "The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture," Proc. 1985 Int'l Conf. Parallel Processing, pp. 764-771, Aug. 1985.
[2] A. Gottlieb, R. Grishman, C.P. Kruskal, K.P. McAuliffe, L. Rudolph, and M. Snir, "The NYU Ultracomputer—Designing an MIMD Shared Memory Parallel Computer," IEEE Trans. Computers, pp. 175-189, Feb. 1983.
[3] D. Lenoski et al., "The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor," Proc. 17th Int'l Symp. Computer Architecture, Los Alamitos, Calif., pp. 148-159, 1990.
[4] Z.G. Vranesic, M. Stumm, D.M. Lewis, and R. White, “Hector: A Hierarchically Structured Shared-Memory Multiprocessor,” Computer, vol. 24, no. 1, pp. 72-79, Jan. 1991.
[5] KSR Parallel Programming, Kendall Square Research, 1991.
[6] D. Gajski, D. Kuck, D. Lawrie, and A. Sameh, "Cedar—A Large Scale Multiprocessor," Proc. Int'l Conf. Parallel Processing, pp. 524-529, Aug. 1983.
[7] Inside the TC2000, BBN Advanced Computers Inc., 1990.
[8] S.F. Hummel, E. Schonberg, and L.E. Flynn, "Factoring: A Practical and Robust Method for Scheduling Parallel Loops," Proc. Supercomputing Conf., pp. 610-619, Nov. 1991.
[9] C.P. Kruskal and A. Weiss, "Allocating Independent Subtasks on Parallel Processors," IEEE Trans. Software Eng., vol. 11, no. 10, pp. 1001-1016, Oct. 1985.
[10] C.D. Polychronopoulos and D.J. Kuck, “Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers,” IEEE Trans. Computers, vol. 36, no. 12, pp. 1425-1439, Dec. 1987.
[11] T.H. Tzen and L.M. Ni, "Trapezoid Self-Scheduling: A Practical Scheduling Scheme for Parallel Compilers," IEEE Trans. Parallel and Distributed Systems, vol. 4, pp. 87-98, Jan. 1993.
[12] J.-R. Abrial, "Assigning Programs to Meanings," Mathematical Logic and Programming Languages, Philosophical Trans. Royal Society, series A, vol. 312, 1984.
[13] H. Li, S. Tandri, M. Stumm, and K. Sevcik, "Locality and Loop Scheduling on NUMA Multiprocessors," Proc. 1993 Int'l Conf. Parallel Processing, pp. II-140-II-147, Aug. 1993.
[14] R.H. Thomas and W. Crowther, "The Uniform System: An Approach to Runtime Support for Large Scale Shared Memory Parallel Processors," Proc. Int'l Conf. Parallel Processing, pp. 245-254, Aug. 1988.
[15] R. Butler, J. Boyle, T. Disz, B. Glickfeld, E. Lusk, R. Overbeek, J. Patterson, and R. Stevens, Portable Programs for Parallel Processors. New York: Holt, Rinehart, and Winston, 1987.
[16] S.F. Hummel and E. Schonberg, "Low-Overhead Scheduling of Nested Parallelism," IBM J. Research and Development, pp. 743-765, Nov. 1991.
[17] S. Madala and J.B. Sinclair, "Performance of Synchronous Parallel Algorithms with Regular Structures," IEEE Trans. Parallel and Distributed Systems, vol. 2, no. 1, pp. 105-116, Jan. 1991.
[18] H. David, Order Statistics. New York: John Wiley, 1981.
[19] T. Montaut, "Ordonnancement dynamique d'un programme décomposé en sous-tâches indépendantes: Analyse des performances" (Dynamic Scheduling of a Program Decomposed into Independent Subtasks: A Performance Analysis), Master's thesis, Institut de Recherche en Informatique et Systèmes Aléatoires, Sept. 1991.
[20] M. Daydé, private communication, 1991.
[21] F. Bodin, D. Windheiser, W. Jalby, D. Ataputta, M. Lee, and D. Gannon, "Performance Evaluation and Prediction for Parallel Algorithms on BBN GP1000," Proc. Int'l Conf. Supercomputing, pp. 403-413, 1990.
[22] L. Kervella, "Etude expérimentale de la durée des tâches parallèles avec un ordonnancement dynamique" (An Experimental Study of Parallel Task Durations under Dynamic Scheduling), Master's thesis, Institut de Recherche en Informatique et Systèmes Aléatoires, Sept. 1991.
[23] L.S. Bowman and K.O. Bowman, Maximum Likelihood Estimation in Small Samples. Charles Griffin and Company, 1977.
[24] V. Sarkar, "Determining Average Program Execution Times and their Variance," Proc. SIGPLAN '89 Conf. Programming Language Design and Implementation (SIGPLAN Notices, vol. 24, no. 7), pp. 298-312, 1989.
[25] T. Fahringer and H. Zima, “A Static Parameter Based Performance Prediction Tool for Parallel Programs,” Proc. ACM Int'l Conf. Supercomputing, pp. 207-219, Tokyo, 1993.
[26] K. Gallivan, D. Gannon, W. Jalby, A. Malony, and H. Wijshoff, "Experimentally Characterizing the Behavior of Multiprocessor Memory Systems: A Case Study," IEEE Trans. Software Eng., vol. 16, no. 2, pp. 216-223, Feb. 1990.

Index Terms:
Dynamic scheduling, load balancing, memory performance, NUMA multiprocessors, self-scheduling.
Dannie Durand, Thierry Montaut, Lionel Kervella, William Jalby, "Impact of Memory Contention on Dynamic Scheduling on NUMA Multiprocessors," IEEE Transactions on Parallel and Distributed Systems, vol. 7, no. 11, pp. 1201-1214, Nov. 1996, doi:10.1109/71.544359