An Advanced Compiler Framework for Non-Cache-Coherent Multiprocessors
March 2002 (vol. 13 no. 3)
pp. 241-259

The Cray T3D and T3E are non-cache-coherent (NCC) computers with a NUMA structure. They have been shown to exhibit very stable and scalable performance for a variety of application programs, and considerable evidence suggests that they are more stable and scalable than many other shared-memory multiprocessors. Their principal drawback, however, is a lack of programmability, caused by the absence of the global cache coherence needed to provide a convenient shared view of memory in hardware. This forces the programmer to keep careful track of where each piece of data is stored, a complication that is unnecessary when a pure shared-memory view is presented to the user. We believe that advanced compiler technology is a remedy for this problem. In this paper, we present our experience with a compiler framework for automatic parallelization and communication generation that has the potential to reduce the time-consuming hand-tuning that would otherwise be necessary to achieve good performance on this type of machine. Our experiments show that the compiler performs well for a variety of applications on the T3D and T3E, and they identify a few sophisticated techniques that could improve performance further once they are fully implemented in the compiler.

Index Terms:
compiler, array privatization, dependence analysis, multiprocessors, noncoherent caches, shared-memory programming, Put/Get
Citation:
Y. Paek, A. Navarro, E. Zapata, J. Hoeflinger, D. Padua, "An Advanced Compiler Framework for Non-Cache-Coherent Multiprocessors," IEEE Transactions on Parallel and Distributed Systems, vol. 13, no. 3, pp. 241-259, March 2002, doi:10.1109/71.993205