Data Forwarding in Scalable Shared-Memory Multiprocessors
December 1996 (vol. 7 no. 12)
pp. 1250-1264

Abstract—Scalable shared-memory multiprocessors are often slowed down by long-latency memory accesses. One way to cope with this problem is to use data forwarding to overlap memory accesses with computation. With data forwarding, when a processor produces a datum, in addition to updating its cache, it sends a copy of the datum to the caches of the processors that the compiler identified as consumers of it. As a result, when the consumer processors access the datum, they find it in their caches.
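
The mechanism described above can be illustrated with a toy model. The following is a minimal sketch, not the paper's simulator or compiler algorithm: the `Machine` and `Processor` classes, the `consumers` argument, infinite caches, invalidation-based coherence, and write-through memory are all assumptions made for illustration. It counts the sharing read misses that consumers incur with and without forwarding.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    """A processor whose private cache is modeled as an address -> value map."""
    pid: int
    cache: dict = field(default_factory=dict)
    read_misses: int = 0


class Machine:
    """Hypothetical machine: infinite caches, invalidate-on-write coherence."""

    def __init__(self, num_procs):
        self.memory = {}
        self.procs = [Processor(p) for p in range(num_procs)]

    def write(self, producer, addr, value, consumers=()):
        # Assumed invalidation protocol: stale copies elsewhere are dropped.
        for p in self.procs:
            if p.pid != producer:
                p.cache.pop(addr, None)
        # The producer updates its own cache (and memory, for simplicity).
        self.procs[producer].cache[addr] = value
        self.memory[addr] = value
        # Forwarding: push a copy into each consumer's cache. In the paper
        # the consumer set is identified by the compiler; here it is passed
        # in by hand.
        for c in consumers:
            self.procs[c].cache[addr] = value

    def read(self, consumer, addr):
        proc = self.procs[consumer]
        if addr not in proc.cache:
            # Sharing read miss: fetch the datum from memory.
            proc.read_misses += 1
            proc.cache[addr] = self.memory[addr]
        return proc.cache[addr]


def run(forwarding):
    """Processor 0 produces a datum three times; processors 1 and 2 consume it."""
    machine = Machine(num_procs=4)
    consumers = (1, 2) if forwarding else ()
    for step in range(3):
        machine.write(0, 0x100, step, consumers)
        for c in (1, 2):
            assert machine.read(c, 0x100) == step
    return sum(p.read_misses for p in machine.procs)


if __name__ == "__main__":
    print("read misses without forwarding:", run(forwarding=False))
    print("read misses with forwarding:   ", run(forwarding=True))
```

In this sketch every consumer read misses when forwarding is off, because each write invalidates the consumers' copies, and every read hits when forwarding is on. A real machine would also have to account for finite caches, where forwarded data can displace useful lines and cause conflict misses.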

This paper addresses two main issues. First, it presents a framework for a compiler algorithm for forwarding. Second, using address traces, it evaluates the performance impact of different levels of support for forwarding. Our simulations of a 32-processor machine show that an optimistic level of support for forwarding speeds up five applications by an average of 50% for large caches and 30% for small caches. For large caches, most sharing read misses are eliminated, while for small caches, forwarding does not significantly increase the number of conflict misses. Overall, support for forwarding in shared-memory multiprocessors promises to deliver good application speedups.

Index Terms:
Memory latency hiding, forwarding and prefetching, multiprocessor caches, scalable shared-memory multiprocessors, address trace analysis.
David A. Koufaty, Xiangfeng Chen, David K. Poulsen, Josep Torrellas, "Data Forwarding in Scalable Shared-Memory Multiprocessors," IEEE Transactions on Parallel and Distributed Systems, vol. 7, no. 12, pp. 1250-1264, Dec. 1996, doi:10.1109/71.553274