Performance Evaluation and Cost Analysis of Cache Protocol Extensions for Shared-Memory Multiprocessors
October 1998 (vol. 47 no. 10)
pp. 1041-1055

Abstract—We evaluate three extensions to directory-based cache coherence protocols in shared-memory multiprocessors. These extensions are aimed at reducing the penalties associated with memory accesses and include a hardware prefetching scheme, a migratory sharing optimization, and a competitive-update mechanism. Since each extension targets distinct components of the read and write penalties, they can be combined effectively. This paper identifies the combinations yielding the best performance gains and cost trade-offs in the context of a class of cache-coherent NUMA (Non-Uniform Memory Access) architectures. Detailed architectural simulations of a multiprocessor with single-issue, statically scheduled CPUs, using five benchmarks, show that the protocol extensions often provide additive gains when they are properly combined. For example, the combination of prefetching with the competitive-update mechanism speeds up the execution by nearly a factor of two under release consistency. The same speedup is obtained under sequential consistency by combining prefetching with the migratory sharing optimization. This paper shows that a basic write-invalidate protocol augmented by appropriate extensions can eliminate most memory access penalties without any support from the programmer or the compiler.

Index Terms:
Shared-memory multiprocessors, cache-coherence protocols, prefetching, competitive-update protocols, write caches, performance evaluation.
Citation:
Fredrik Dahlgren, Michel Dubois, Per Stenström, "Performance Evaluation and Cost Analysis of Cache Protocol Extensions for Shared-Memory Multiprocessors," IEEE Transactions on Computers, vol. 47, no. 10, pp. 1041-1055, Oct. 1998, doi:10.1109/12.729785