The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared Memory Multiprocessor
June 1994 (vol. 5 no. 6)
pp. 573-584

Trace-driven simulations of numerical Fortran programs are used to study the impact of the parallel loop scheduling strategy on data prefetching in a shared memory multiprocessor with private data caches. The simulations indicate that to maximize memory performance, it is important to schedule blocks of consecutive iterations to execute on each processor, and then to adaptively prefetch single-word cache blocks to match the number of iterations scheduled. Prefetching multiple single-word cache blocks on a miss reduces the miss ratio by approximately 5% to 30% compared to a system with no prefetching. In addition, the proposed adaptive prefetching scheme further reduces the miss ratio while significantly reducing the false sharing among cache blocks compared to nonadaptive prefetching strategies. Reducing the false sharing causes fewer coherence invalidations to be generated, and thereby reduces the total network traffic. The impact of the prefetching and scheduling strategies on the temporal distribution of coherence invalidations also is examined. It is found that invalidations tend to be evenly distributed throughout the execution of parallel loops, but tend to be clustered when executing sequential program sections. The distribution of invalidations in both types of program sections is relatively insensitive to the prefetching and scheduling strategy.
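
The scheduling and prefetching idea summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's simulator or its exact prefetching policy, which the abstract does not specify. The sketch assumes unit-stride access with one single-word cache block per iteration, and the function names and the `max_degree` parameter are hypothetical. It splits a loop into blocks of consecutive iterations, one per processor, and caps the number of blocks prefetched on a miss so that a processor never fetches words belonging to another processor's chunk.

```python
def block_schedule(n_iters, n_procs):
    """Assign contiguous blocks of iterations to processors.

    Returns one range per processor; earlier processors absorb the
    remainder when n_iters is not divisible by n_procs.
    """
    base, extra = divmod(n_iters, n_procs)
    blocks = []
    start = 0
    for p in range(n_procs):
        size = base + (1 if p < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def adaptive_prefetch_degree(block_len, position, max_degree):
    """Number of extra single-word blocks to prefetch on a miss.

    Assumption (not taken from the paper): the degree is limited to the
    iterations that remain in this processor's own chunk, so prefetches
    never reach into data another processor will write.
    """
    remaining = block_len - position - 1
    return max(0, min(max_degree, remaining))


if __name__ == "__main__":
    for proc, chunk in enumerate(block_schedule(10, 4)):
        degree = adaptive_prefetch_degree(len(chunk), 0, max_degree=4)
        print(proc, list(chunk), degree)
```

A nonadaptive scheme would correspond to always returning `max_degree`, which in this model would pull in words from a neighboring processor's chunk whenever the chunk is shorter than the prefetch degree.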

Index Terms:
scheduling; buffer storage; shared memory systems; parallel programming; performance evaluation; parallel loop scheduling; prefetching; shared memory multiprocessor; trace-driven simulations; numerical Fortran programs; data caches; memory performance; single-word cache blocks; cache coherence; cache pollution; false sharing; guided self-scheduling
D.J. Lilja, "The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared Memory Multiprocessor," IEEE Transactions on Parallel and Distributed Systems, vol. 5, no. 6, pp. 573-584, June 1994, doi:10.1109/71.285604