 Bibliographic References 
A Transformation Approach to Derive Efficient Parallel Implementations
April 2000 (vol. 26 no. 4)
pp. 315-339

Abstract—The construction of efficient parallel programs usually requires expert knowledge in the application area and a deep insight into the architecture of a specific parallel machine. Often, the resulting performance is not portable, i.e., a program that is efficient on one machine is not necessarily efficient on another machine with a different architecture. Transformation systems provide a more flexible solution. They start with a specification of the application problem and allow the generation of efficient programs for different parallel machines. The programmer gives an exact specification of the algorithm, expressing the inherent degree of parallelism, and is relieved of the low-level details of the architecture. In this article, we propose such a transformation system with an emphasis on the exploitation of data parallelism combined with a hierarchically organized structure of task parallelism. Starting with a specification of the maximum degree of task and data parallelism, the transformations generate a specification of a parallel program for a specific parallel machine. The transformations are based on a cost model and are applied in a predefined order, fixing the most important design decisions such as the scheduling of independent multitask activations, data distributions, pipelining of tasks, and assignment of processors to task activations. We demonstrate the usefulness of the approach with examples from scientific computing.
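The combination the abstract describes can be illustrated in spirit by a small sketch (this is not the authors' TwoL system; all names here are illustrative): two independent tasks execute concurrently (task parallelism), and within each task the same operation is applied element-wise to that task's data block (data parallelism). In a real message-passing program, each task would run on a disjoint processor group and each block element on its own processor.

```python
from concurrent.futures import ThreadPoolExecutor

def data_parallel_map(f, block):
    # Data parallelism: f is applied independently to every element
    # of the block, so all applications could proceed concurrently.
    return [f(x) for x in block]

def stage_square(block):
    return data_parallel_map(lambda x: x * x, block)

def stage_negate(block):
    return data_parallel_map(lambda x: -x, block)

# Task parallelism: the two stages are independent, so a scheduler may
# assign them to disjoint processor groups; here, to two worker threads.
with ThreadPoolExecutor(max_workers=2) as pool:
    fut_a = pool.submit(stage_square, [1, 2, 3])
    fut_b = pool.submit(stage_negate, [4, 5, 6])
    result = fut_a.result() + fut_b.result()

print(result)  # [1, 4, 9, -4, -5, -6]
```

The design decisions the transformation system fixes, such as how many processors each task activation receives, correspond here to the `max_workers` split between the two stages.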

[1] A. Alexandrov, M. Ionescu, K.E. Schauser, and C. Scheiman, “LogGP: Incorporating Long Messages into the LogP model—One Step Closer towards a Realistic Model for Parallel Computation,” Technical Report TRCS95-09, Univ. of California at Santa Barbara, 1995.
[2] B. Bacci, M. Danelutto, S. Orlando, S. Pelagatti, and M. Vanneschi, “P3L: A Structured High Level Programming Language and Its Structured Support,” Concurrency: Practice and Experience, vol. 7, no. 3, pp. 225-255, 1995.
[3] H. Bal and M. Haines, “Approaches for Integrating Task and Data Parallelism,” IEEE Concurrency, vol. 6, no. 3, pp. 74-84, July-Aug. 1998.
[4] P. Banerjee et al., “The Paradigm Compiler for Distributed-Memory Multicomputers,” Computer, vol. 28, no. 10, pp. 37-47, Oct. 1995.
[5] C.J. Beckmann and C. Polychronopoulos, “Microarchitecture Support for Dynamic Scheduling of Acyclic Task Graphs,” Technical Report CSRD 1207, Univ. of Illinois, 1992.
[6] G.E. Blelloch, “Programming Parallel Algorithms,” Comm. ACM, vol. 39, no. 3, pp. 85-97, Mar. 1996.
[7] D. Culler, J.P. Singh, and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, San Francisco, 1998.
[8] D. Culler, R. Karp, D. Patterson, A. Sahay, K.E. Schauser, E. Santos, R. Subramonian, and T. von Eicken, “LogP: Towards a Realistic Model of Parallel Computation,” Proc. Fourth ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, May 1993.
[9] J. Darlington, Y. Guo, H.W. To, and J. Yang, “Functional Skeletons for Parallel Coordination,” Proc. EURO-PAR '95, pp. 55-66, 1995.
[10] R. Dautray and J.-L. Lions, Mathematical Analysis and Numerical Methods for Science and Technology, vol. 4. Springer, 1990.
[11] J. Dongarra and D. Walker, “Software Libraries for Linear Algebra Computations on High Performance Computers,” SIAM Review, vol. 37, no. 2, pp. 151-180, 1995.
[12] M.A. Driscoll and W.R. Daasch, “Accurate Predictions of Parallel Program Execution Time,” J. Parallel and Distributed Computing, vol. 25, no. 1, pp. 16-30, 1995.
[13] High Performance Fortran Forum, “High Performance Fortran Language Specification,” Scientific Programming, vol. 2, no. 1, 1993.
[14] Message Passing Interface Forum, “MPI: A Message-Passing Interface Standard,” Technical Report CS-93-214, Univ. of Tennessee, Apr. 1994.
[15] R. Foschia, T. Rauber, and G. Rünger, “Modeling the Communication Behavior of the Intel Paragon,” Proc. Fifth Symp. Modeling, Analysis, and Simulation of Computer and Telecomm. Systems (MASCOTS '97), pp. 117-124, 1997.
[16] I. Foster and K.M. Chandy, “Fortran M: A Language for Modular Parallel Programming,” J. Parallel and Distributed Computing, vol. 26, no. 1, pp. 24-35, Apr. 1995.
[17] B. Avalani, I. Foster, M. Xu, and A. Choudhary, “A Compilation System That Integrates HPF and Fortran M,” Proc. Scalable High Performance Computing Conf. (SHPCC-94), pp. 293-300, May 1994.
[18] S. Ghosh, M. Martonosi, and S. Malik, “Cache Miss Equations: An Analytical Representation of Cache Misses,” Proc. Int'l Conf. Supercomputing (ICS '97), pp. 317-324, 1997.
[19] R. Graham, “Bounds on Multiprocessing Timing Anomalies,” SIAM J. Applied Math., vol. 17, no. 2, pp. 416-429, 1969.
[20] E. Hairer, S.P. Nørsett, and G. Wanner, Solving Ordinary Differential Equations I: Nonstiff Problems, Berlin: Springer-Verlag, 1993.
[21] E. Hairer and G. Wanner, Solving Ordinary Differential Equations II: Stiff and Differential-Algebraic Problems. Springer, 1991.
[22] P. Hanrahan, D. Saltzman, and L. Aupperle, “A Rapid Hierarchical Radiosity Algorithm,” Computer Graphics (Proc. SIGGRAPH '91), vol. 25, no. 4, pp. 197-206, Aug. 1991.
[23] M. Hill, W. McColl, and D. Skillicorn, “Questions and Answers about BSP,” Scientific Programming, vol. 6, no. 3, pp. 249-274, 1997.
[24] K. Hwang, Z. Xu, and M. Arakawa, “Benchmark Evaluation of the IBM SP2 for Parallel Signal Processing,” IEEE Trans. Parallel and Distributed Systems, vol. 7, no. 5, pp. 522-536, May 1996.
[25] S. Johnsson, “Performance Modeling of Distributed Memory Architecture,” J. Parallel and Distributed Computing, vol. 12, pp. 300-312, 1991.
[26] S.L. Johnsson and C.T. Ho, “Spanning Graphs for Optimum Broadcasting and Personalized Communication in Hypercubes,” IEEE Trans. Computers, vol. 38, no. 9, pp. 1,249-1,268, Sept. 1989.
[27] V. Kumar, A. Grama, A. Gupta, and G. Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin Cummings, 1994.
[28] W.F. McColl, “Universal Computing,” Proc. EuroPar '96, pp. 25-36, 1996.
[29] M.G. Norman and P. Thanisch, “Models of Machines and Computation for Mapping in Multicomputers,” ACM Computing Surveys, vol. 25, no. 3, pp. 263-302, 1993.
[30] J. O'Donnell and G. Rünger, “A Methodology for Deriving Parallel Programs with a Family of Abstract Parallel Machines,” Proc. Parallel Processing, Euro-Par '97, C. Lengauer, M. Griebl, and S. Gorlatch, eds., pp. 661–668, 1997.
[31] S. Ramaswamy, “Simultaneous Exploitation of Task and Data Parallelism in Regular Scientific Applications,” PhD thesis CRHC-96-03/UILU-ENG-96-2203, Dept. of Electrical and Computer Eng., Univ. of Illinois, Urbana, Jan. 1996.
[32] S. Ramaswamy and P. Banerjee, “Automatic Generation of Efficient Array Redistribution Routines for Distributed Memory Multicomputers,” Proc. Fifth Symp. Frontiers of Massively Parallel Computation (Frontiers '95), pp. 342-349, Feb. 1995.
[33] S. Ramaswamy, S. Sapatnekar, and P. Banerjee, “A Framework for Exploiting Task and Data Parallelism on Distributed-Memory Multicomputers,” IEEE Trans. Parallel and Distributed Systems, vol. 8, no. 11, pp. 1,098-1,116, Nov. 1997.
[34] T. Rauber and G. Rünger, “Optimal Data Distribution for LU Decomposition,” Proc. EuroPar '95, pp. 391-402, 1995.
[35] T. Rauber and G. Rünger, “Parallel Solution of a Schrödinger-Poisson System,” Proc. Int'l Conf. High-Performance Computing and Networking, pp. 697-702, 1995.
[36] T. Rauber and G. Rünger, “Comparing Task and Data Parallel Execution Schemes for the DIIRK Method,” Proc. EuroPar '96, pp. 52-61, 1996.
[37] T. Rauber and G. Rünger, “Parallel Iterated Runge-Kutta Methods and Applications,” Int'l J. Supercomputer Applications, vol. 10, no. 1, pp. 62-90, 1996.
[38] T. Rauber and G. Rünger, “The Compiler TwoL for the Design of Parallel Implementations,” Proc. Fourth Int'l Conf. Parallel Architectures and Compilation Techniques (PACT '96), pp. 292-301, 1996.
[39] T. Rauber and G. Rünger, “Load Balancing Schemes for Extrapolation Methods,” Concurrency: Practice and Experience, vol. 9, no. 3, pp. 181-202, 1997.
[40] T. Rauber and G. Rünger, “Compiler Support for Task Scheduling in Hierarchical Execution Models,” J. Systems Architecture, vol. 45, pp. 483-503, 1998.
[41] T. Rauber, G. Rünger, and R. Wilhelm, “Deriving Optimal Data Distributions for Group Parallel Numerical Algorithms,” Proc. Second Conf. Massively Parallel Programming Models, pp. 33-41, 1995.
[42] J. Rose, “LocusRoute: A Parallel Global Router for Standard Cells,” Proc. 25th ACM/IEEE Design Automation Conf., pp. 189-195, 1988.
[43] J. Rose, “Parallel Global Routing for Standard Cells,” IEEE Trans. Computer-Aided Design, vol. 9, no. 10, 1990.
[44] D. Skillicorn, “Architecture-Independent Parallel Computation,” Computer, vol. 23, no. 12, pp. 38-51, 1990.
[45] D.B. Skillicorn and D. Talia, “Models and Languages for Parallel Computation,” ACM Computing Surveys, vol. 30, no. 2, pp. 123-169, 1998.
[46] J. Stoer and R. Bulirsch, Introduction to Numerical Analysis. Springer, 1990.
[47] J. Subhlok, “Automatic Mapping of Task and Data Parallel Programs for Efficient Execution on Multiprocessors,” Technical Report CMU-CS-93-212, Carnegie Mellon Univ., 1993.
[48] P. Trinder, K. Hammond, H.-W. Loidl, and S.L. Peyton Jones, “Algorithm + Strategy = Parallelism,” J. Functional Programming, vol. 8, no. 1, pp. 23-60, 1998.
[49] J. Turek, U. Schwiegelshohn, J. Wolf, and P. Yu, “Scheduling Parallel Tasks to Minimize Average Response Times,” Proc. Fifth Ann. ACM-SIAM Symp. Discrete Algorithms, pp. 200-209, Jan. 1994.
[50] J. Turek, J. Wolf, and P. Yu, “Approximate Algorithms for Scheduling Parallelizable Tasks,” Proc. Fourth Ann. ACM Symp. Parallel Algorithms and Architectures, pp. 323-332, June 1992.
[51] L.G. Valiant, “A Bridging Model for Parallel Computation,” Comm. ACM, vol. 33, no. 8, pp. 103-111, Aug. 1990.
[52] P.J. van der Houwen and B.P. Sommeijer, “Parallel Iteration of High-Order Runge-Kutta Methods with Stepsize Control,” J. Computational and Applied Math., vol. 29, pp. 111-127, 1990.
[53] P.J. van der Houwen and B.P. Sommeijer, “Iterated Runge-Kutta Methods on Parallel Computers,” SIAM J. Scientific and Statistical Computing, vol. 12, no. 5, pp. 1,000-1,028, 1991.
[54] P.J. van der Houwen, B.P. Sommeijer, and W. Couzy, “Embedded Diagonally Implicit Runge-Kutta Algorithms on Parallel Computers,” Math. Computation, vol. 58, no. 197, pp. 135-159, Jan. 1992.
[55] M. Wolfe, High Performance Compilers for Parallel Computing. Addison-Wesley, 1996.
[56] S.C. Woo et al., “The SPLASH-2 Programs: Characterization and Methodological Considerations,” Proc. 22nd Ann. Int'l Symp. Computer Architecture, pp. 24-36, June 1995.
[57] Z. Xu and K. Hwang, “Early Prediction of MPP Performance: SP2, T3D, and Paragon Experiences,” Parallel Computing, vol. 22, pp. 917-942, Oct. 1996.
[58] K. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. Hilfinger, S. Graham, D. Gay, P. Colella, and A. Aiken, “Titanium: A High-Performance Java Dialect,” Concurrency: Practice and Experience, 1998.

Index Terms:
Transformation system, coordination language, task and data parallelism, hierarchical module structure, data distribution types, scientific computing, message-passing program, MPI.
Thomas Rauber, Gudula Rünger, "A Transformation Approach to Derive Efficient Parallel Implementations," IEEE Transactions on Software Engineering, vol. 26, no. 4, pp. 315-339, April 2000, doi:10.1109/32.844492