Performance Analysis of Four Memory Consistency Models for Multithreaded Multiprocessors
October 1995 (vol. 6 no. 10)
pp. 1085-1099

Abstract: Stochastic timed Petri nets are developed to evaluate the relative performance of distributed shared memory models for scalable multiprocessors that use multithreaded processors as building blocks. Four shared memory models are evaluated: the Sequential Consistency (SC) model by Lamport (1979), the Weak Consistency (WC) model by Dubois et al. (1986), the Processor Consistency (PC) model by Goodman (1989), and the Release Consistency (RC) model by Gharachorloo et al. (1990). We assume a scalable network with sufficient bandwidth to absorb the increased traffic from multithreading, coherent caches, and memory event reordering. The embedded Markov chains are solved to reveal the performance attributes. Under saturated conditions, we find that multithreading contributes more than 50% of the performance improvement, while the improvement from memory consistency models varies between 20% and 40% of the total performance gain. Petri net models are effective in predicting the performance of processors with a larger number of contexts than could be simulated in previous benchmark studies. The accuracy of these memory performance models was validated against simulation results from Stanford University. Our analytical results show that the SC model has the lowest performance among the four memory consistency models. The PC model requires larger write buffers, while the WC and RC models require smaller write buffers. The PC model may perform even worse than the SC model if a small buffer is used. The performance of the WC model depends heavily on the synchronization rate in user code. For a low synchronization rate, the WC model performs as well as the RC model. With sufficient multithreading and network bandwidth, the RC model shows the best performance among the four models.
Furthermore, we discovered that cache interference causes very little performance degradation in all relaxed memory consistency models, as long as the network is contention-free, even when multithreading has saturated the system.
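To make the notion of a saturated multithreaded processor concrete, the following minimal sketch uses the simplified deterministic efficiency model familiar from the multithreading literature. It is not the stochastic Petri net model of this paper. The function name `efficiency` and the parameter values (run length, latency, and context-switch cost in cycles) are hypothetical and chosen only for illustration.

```python
def efficiency(n_contexts, run_length, latency, switch_cost):
    """Fraction of processor cycles spent on useful work.

    Simplified deterministic model (an assumption, not the paper's
    Petri net): each context runs for `run_length` cycles, then waits
    `latency` cycles on a remote access, and every context switch
    costs `switch_cost` cycles.
    """
    # Linear region: too few contexts to cover the memory latency.
    linear = n_contexts * run_length / (run_length + switch_cost + latency)
    # Saturation: latency is fully hidden, only switch overhead remains.
    saturated = run_length / (run_length + switch_cost)
    return min(linear, saturated)


if __name__ == "__main__":
    # Hypothetical parameters: 20-cycle run length, 100-cycle latency,
    # 5-cycle context switch.
    for n in (1, 2, 4, 5, 8):
        print(n, round(efficiency(n, 20, 100, 5), 3))
```

In this toy model, adding contexts beyond the saturation point yields no further gain, so any remaining improvement has to come from shortening stalls per access, which is the kind of effect a relaxed consistency model is meant to provide.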

[1] S.V. Adve and M.D. Hill, “A unified formalization of four shared-memory models,” IEEE Trans. on Parallel and Distributed Systems, vol. 4, no. 6, pp. 613-624, June 1993.
[2] R. Alverson et al., "The Tera Computer System," Proc. Int'l Conf. Supercomputing, Assoc. of Computing Machinery, N.Y., 1990, pp. 1-6.
[3] A. Agarwal et al., “The MIT Alewife machine: A large-scale distributed-memory multiprocessor,” Dubois and Shreekant, eds., Scalable Shared Memory Multiprocessors. Boston, Mass.: Kluwer Academic Publishers, 1992.
[4] A. Agarwal, “Performance tradeoffs in multithreaded processors,” IEEE Trans. on Parallel and Distributed Systems, vol. 3, no. 5, pp. 525-539, Sept. 1992.
[5] M. Ajmone Marsan, G. Balbo, and G. Conte, “A class of generalized stochastic Petri nets for the performance evaluation of multiprocessor systems,” ACM Trans. Computer Systems, vol. 2, no. 2, pp. 93-122, May 1984.
[6] G. Bell, “Ultracomputers: A Teraflop Before Its Time,” Comm. ACM, vol. 35, no. 8, pp. 26-47, Aug. 1992.
[7] G. Chiola, “GreatSPN user manual version 1.3,” Technical report, Dipartimento di Informatica, Universita di Torino, Torino, Italy, Sept. 1987.
[8] Y.K. Chong, “Effects of memory consistency models on multithreaded multiprocessor performance,” MSc thesis, University of Southern California, May 1993.
[9] Convex Computer, Inc., The Exemplar Architecture. Richardson, Tex.: Convex Press, 1993.
[10] Cray Res. Inc., The Cray T3D System Architecture Overview. Madison, Wis.: Cray Res. Inc., 1993.
[11] M. Dubois and C. Scheurich, "Memory Access Dependencies in Shared-Memory Multiprocessors," IEEE Trans. Software Eng., vol. 16, no. 6, pp. 660-673, June 1990.
[12] M. Dubois, C. Scheurich, and F. Briggs, “Memory access buffering in multiprocessors,” Proc. 13th Int’l Symp. Comp. Arch., pp. 434-442, June 1986.
[13] K. Gharachorloo, A. Gupta, and J. Hennessy, "Performance Evaluation of Memory Consistency Models for Shared-Memory Multiprocessors," Proc. ASPLOS IV, pp. 245-257, 1991.
[14] A. Gupta et al., "Comparative Evaluation of Latency Reducing and Tolerating Techniques," Proc. 18th Annual Int'l Symp. Computer Architecture, IEEE CS Press, Los Alamitos, Calif., June 1991, pp. 254-263.
[15] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy, “Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors,” Proc. 17th Ann. Int'l Symp. Computer Architecture, 1990.
[16] J.R. Goodman, “Cache consistency and sequential consistency,” Technical Report 61, IEEE SCI Committee, 1989.
[17] K. Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill, 1993.
[18] L. Lamport, “How to make a multiprocessor computer that correctly executes multiprocess programs,” IEEE Transactions on Computers, vol. 28, no. 9, pp. 690-691, Sept. 1979.
[19] B.H. Lim and A. Agarwal, “Waiting algorithms for synchronization in large-scale multiprocessors,” ACM Trans. Computer Systems, vol. 11, no. 3, pp. 253-294, 1993.
[20] K. Li and P. Hudak, "Memory Coherence in Shared Virtual Memory Systems," ACM Trans. Computer Systems, vol. 7, no. 4, Nov. 1989.
[21] T. Mowry and A. Gupta, "Tolerating Latency through Software-Controlled Prefetching in Scalable Shared-Memory Multiprocessors," J. Parallel and Distributed Computing, vol. 12, pp. 87-106, June 1991.
[22] D. Lenoski et al., “The Stanford DASH Multiprocessor,” Computer, pp. 63-79, Mar. 1992.
[23] B. Nitzberg and V. Lo, "Distributed Shared Memory: A Survey of Issues and Algorithms," Computer, vol. 24, no. 8, Aug. 1991.
[24] R.H. Saavedra, D.E. Culler, and T. von Eicken, “Analysis of multithreaded architecture for parallel computing,” Proc. Second ACM Symp. Par. Algo. and Architecture, July 1990.
[25] R.H. Saavedra and D.E. Culler, “An analytical solution for a Markov chain modeling multithreaded execution,” Technical Report UCB/CSD-91/623, Computer Science Division, University of California at Berkeley, Mar. 1991.
[26] R.H. Saavedra, W. Mao, and K. Hwang, “Performance and optimization of data prefetching strategies in scalable multiprocessors,” J. of Parallel and Distributed Computing, pp. 427-448, Sept. 1994.
[27] Thinking Machines Corp., The CM-5 Technical Summary. Cambridge, Mass.: Thinking Machines Corp., 1991.
[28] W.D. Weber and A. Gupta, "Exploring the Benefits of Multiple Hardware Contexts in a Multiprocessor Architecture: Preliminary Results," Proc. 16th Ann. Int'l Symp. Computer Architecture, ACM Press, 1989, pp. 273-280.
[29] R.N. Zucker and J.L. Baer, “A performance study of memory consistency models,” Proc. 19th Int’l Symp. Comp. Arch., pp. 2-12, May 1992.

Index Terms:
Distributed shared memory, memory consistency models, stochastic Petri nets, scalable multiprocessors, latency hiding techniques, multithreaded processors, context switching, performance evaluation.
Citation:
Yong-Kim Chong, Kai Hwang, "Performance Analysis of Four Memory Consistency Models for Multithreaded Multiprocessors," IEEE Transactions on Parallel and Distributed Systems, vol. 6, no. 10, pp. 1085-1099, Oct. 1995, doi:10.1109/71.473517