Distributed Hardwired Barrier Synchronization for Scalable Multiprocessor Clusters
June 1995 (vol. 6 no. 6)
pp. 591-605

Abstract—Conventional multiprocessors mostly use centralized, memory-based barriers to synchronize concurrent processes running on multiple processors. These centralized barriers often become bottlenecks or hot spots in the shared memory. In this paper, we overcome this difficulty by presenting a distributed, hardwired barrier architecture that is hierarchically constructed for fast synchronization in cluster-structured multiprocessors. The hierarchical architecture allows cluster-structured multiprocessors to scale. A special set of synchronization primitives is developed so that programs can use the distributed barriers explicitly and dynamically. To show the application of the hardwired barriers, we demonstrate how to synchronize Doall and Doacross loops using a limited number of hardwired barriers. Timing analysis shows an $O(10^2)$ to $O(10^5)$ reduction in synchronization overhead compared with software-controlled barriers implemented in shared memory. The hardwired architecture is effective in implementing any partially ordered set of barriers, as well as fuzzy barriers with extended synchronization regions. Its versatility, scalability, programmability, and low overhead make the distributed barrier architecture attractive for constructing fine-grain, massively parallel MIMD systems from multiprocessor clusters with distributed shared memory.

Index Terms—Barrier synchronization, distributed shared memory, Doacross loops, Doall loops, fuzzy barriers, parallel processing, partially ordered barriers, scalable multiprocessors, wired-NOR logic.
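
The abstract does not describe the circuit itself, so the following is only a toy software model of the general wired-NOR idea named in the index terms, not the authors' design. It assumes that each processor holds one shared line low until it reaches the barrier, so the line reads high only once every processor has arrived. The names `WiredNorBarrier`, `arrive`, `line_high`, `rearm`, and `trace_arrivals` are hypothetical and chosen for illustration.

```python
class WiredNorBarrier:
    """Toy model of a single wired-NOR barrier line shared by n processors.

    Assumption: this is an illustrative model only, not the paper's hardware.
    """

    def __init__(self, n):
        self.n = n
        # busy[i] is True while processor i is still pulling the line low.
        self.busy = [True] * n

    def arrive(self, i):
        """Processor i reaches the barrier and releases the line."""
        self.busy[i] = False

    def line_high(self):
        """Wired-NOR: the line reads high only when no processor drives it."""
        return not any(self.busy)

    def rearm(self):
        """Prepare the line for the next barrier episode."""
        self.busy = [True] * self.n


def trace_arrivals(n, order):
    """Return the line state observed after each arrival in `order`."""
    barrier = WiredNorBarrier(n)
    states = []
    for i in order:
        barrier.arrive(i)
        states.append(barrier.line_high())
    return states
```

For example, `trace_arrivals(4, [2, 0, 3, 1])` returns `[False, False, False, True]`: the line goes high only after the last arrival. In this model, detecting completion needs no shared counter, which is the kind of memory hot spot the abstract attributes to centralized barriers.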

Shisheng Shang, Kai Hwang, "Distributed Hardwired Barrier Synchronization for Scalable Multiprocessor Clusters," IEEE Transactions on Parallel and Distributed Systems, vol. 6, no. 6, pp. 591-605, June 1995, doi:10.1109/71.388040