A Programming Methodology for Dual-Tier Multicomputers
March 2000 (vol. 26 no. 3)
pp. 212-226

Abstract: Hierarchically organized ensembles of shared-memory multiprocessors possess a richer and more complex model of locality than previous-generation multicomputers with single-processor nodes. These dual-tier computers introduce many new factors into the programmer's performance model. We present a methodology for implementing block-structured numerical applications on dual-tier computers and a run-time infrastructure, called KeLP2, that implements the methodology. KeLP2 supports two levels of locality and parallelism via hierarchical SPMD control flow, run-time geometric meta-data, and asynchronous collective communication. KeLP applications can effectively overlap communication with computation under conditions where nonblocking point-to-point message passing fails to do so. KeLP's abstractions hide considerable detail without sacrificing performance, and dual-tier applications written in KeLP consistently outperform equivalent single-tier implementations written in MPI. We describe the KeLP2 model and show how it facilitates the implementation of five block-structured applications specially formulated to hide communication latency on dual-tier architectures. We support our arguments with empirical data from applications running on various single- and dual-tier multicomputers. KeLP2 supports a migration path from single-tier to dual-tier platforms, and we illustrate this capability with a detailed programming example.
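
The idea of overlapping communication with computation on a block-structured problem can be illustrated with a small sketch. This is not KeLP code and does not use KeLP's API; it is a hypothetical Python illustration in which a thread pool stands in for the processors of one node, the function `exchange_ghosts` stands in for inter-node communication, and the problem is a 1-D Jacobi relaxation with zero boundary values. All function names are assumptions made for this example. Python threads will not give a real speedup here; the sketch only shows the structure: start the exchange, update the points that need no remote data, then finish the block edges once the ghost values arrive.

```python
from concurrent.futures import ThreadPoolExecutor


def exchange_ghosts(blocks):
    """Stand-in for inter-node communication (hypothetical helper).

    Returns, for every block, the (left, right) ghost values taken from
    its neighbours. Blocks on the domain boundary receive 0.0.
    """
    ghosts = []
    for i in range(len(blocks)):
        left = blocks[i - 1][-1] if i > 0 else 0.0
        right = blocks[i + 1][0] if i < len(blocks) - 1 else 0.0
        ghosts.append((left, right))
    return ghosts


def relax_interior(block):
    """Jacobi update of the points that need no ghost data."""
    new = list(block)
    for j in range(1, len(block) - 1):
        new[j] = 0.5 * (block[j - 1] + block[j + 1])
    return new


def relax_edges(block, new, left, right):
    """Finish the edge points once the ghost values are available."""
    if len(block) == 1:
        new[0] = 0.5 * (left + right)
    else:
        new[0] = 0.5 * (left + block[1])
        new[-1] = 0.5 * (block[-2] + right)
    return new


def overlapped_sweep(blocks, pool):
    """One relaxation sweep with the exchange overlapped with interior work."""
    # Start the (simulated) communication first so it proceeds in the background.
    pending = pool.submit(exchange_ghosts, blocks)
    # Meanwhile, update every block's interior in parallel.
    interiors = list(pool.map(relax_interior, blocks))
    # Wait for the ghost values, then complete the block edges.
    ghosts = pending.result()
    return [
        relax_edges(block, new, left, right)
        for block, new, (left, right) in zip(blocks, interiors, ghosts)
    ]


def reference_sweep(blocks):
    """Same sweep on the undivided array, for checking the result."""
    flat = [x for block in blocks for x in block]
    padded = [0.0] + flat + [0.0]
    return [0.5 * (padded[k - 1] + padded[k + 1]) for k in range(1, len(padded) - 1)]
```

A caller would create a pool with `ThreadPoolExecutor(max_workers=...)`, pass a list of blocks (each a list of floats) to `overlapped_sweep`, and could compare the flattened result against `reference_sweep` to confirm that splitting the sweep into interior and edge phases does not change the answer.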

Index Terms:
Dual-tier parallel computers, hierarchical parallelism, KeLP, block-structured scientific applications, scientific application requirements, C++ framework, SMP clusters.
Scott B. Baden, Stephen J. Fink, "A Programming Methodology for Dual-Tier Multicomputers," IEEE Transactions on Software Engineering, vol. 26, no. 3, pp. 212-226, March 2000, doi:10.1109/32.842948