
Thomas Rauber and Gudula Rünger, "A Transformation Approach to Derive Efficient Parallel Implementations," IEEE Transactions on Software Engineering, vol. 26, no. 4, pp. 315-339, Apr. 2000.
Index Terms—Transformation system, coordination language, task and data parallelism, hierarchical module structure, data distribution types, scientific computing, message-passing program, MPI.
Abstract—The construction of efficient parallel programs usually requires expert knowledge in the application area and a deep insight into the architecture of a specific parallel machine. Often, the resulting performance is not portable, i.e., a program that is efficient on one machine is not necessarily efficient on another machine with a different architecture. Transformation systems provide a more flexible solution. They start with a specification of the application problem and allow the generation of efficient programs for different parallel machines. The programmer has to give an exact specification of the algorithm expressing the inherent degree of parallelism and is relieved of the low-level details of the architecture. In this article, we propose such a transformation system with an emphasis on the exploitation of data parallelism combined with a hierarchically organized structure of task parallelism. Starting with a specification of the maximum degree of task and data parallelism, the transformations generate a specification of a parallel program for a specific parallel machine. The transformations are based on a cost model and are applied in a predefined order, fixing the most important design decisions such as the scheduling of independent multitask activations, data distributions, pipelining of tasks, and assignment of processors to task activations. We demonstrate the usefulness of the approach with examples from scientific computing.
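One design decision named in the abstract, scheduling independent multitask activations, can be illustrated with a toy cost-model sketch: run the activations one after another, each data-parallel on all p processors, or run them concurrently on disjoint processor groups. The sketch below is hypothetical and is not the paper's actual cost model; the runtime formula, the `startup` and `byte_cost` parameters, and the `schedule` function are invented for illustration, loosely in the spirit of the LogP/BSP-style models cited in [8] and [23].

```python
import math

def task_time(work, procs, startup=1e-3):
    """Toy runtime estimate for one data-parallel task activation:
    computation divided among procs, plus a log-shaped communication
    term (assumed, not taken from the paper)."""
    comm = startup * math.log2(procs) if procs > 1 else 0.0
    return work / procs + comm

def schedule(works, p):
    """Compare two canonical schedules for independent activations and
    return the cheaper one together with both cost estimates."""
    # Consecutive: each activation uses all p processors, one at a time.
    consecutive = sum(task_time(w, p) for w in works)
    # Group-parallel: split p into equal-sized disjoint groups,
    # one activation per group, all running concurrently.
    g = max(p // len(works), 1)
    group_parallel = max(task_time(w, g) for w in works)
    choice = "group-parallel" if group_parallel < consecutive else "consecutive"
    return choice, consecutive, group_parallel
```

For equal-sized activations whose communication cost grows with the processor count, the group-parallel schedule tends to win because each activation communicates within a smaller group; a single large activation degenerates to the consecutive case. This is exactly the kind of trade-off the paper's transformations resolve using measured machine parameters rather than the invented constants above.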
[1] A. Alexandrov, M. Ionescu, K.E. Schauser, and C. Scheiman, “LogGP: Incorporating Long Messages into the LogP Model—One Step Closer towards a Realistic Model for Parallel Computation,” Technical Report TRCS95-09, Univ. of California at Santa Barbara, 1995.
[2] B. Bacci, M. Danelutto, S. Orlando, S. Pelagatti, and M. Vanneschi, “P3L: A Structured High Level Programming Language and Its Structured Support,” Concurrency: Practice and Experience, vol. 7, no. 3, pp. 225-255, 1995.
[3] H. Bal and M. Haines, “Approaches for Integrating Task and Data Parallelism,” IEEE Concurrency, vol. 6, no. 3, pp. 74-84, July-Aug. 1998.
[4] P. Banerjee et al., “The Paradigm Compiler for Distributed-Memory Multicomputers,” Computer, vol. 28, no. 10, pp. 37-47, Oct. 1995.
[5] C.J. Beckmann and C. Polychronopoulos, “Microarchitecture Support for Dynamic Scheduling of Acyclic Task Graphs,” Technical Report CSRD 1207, Univ. of Illinois, 1992.
[6] G.E. Blelloch, “Programming Parallel Algorithms,” Comm. ACM, vol. 39, no. 3, pp. 85-97, Mar. 1996.
[7] D. Culler, J.P. Singh, and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, San Francisco, 1998.
[8] D. Culler, R. Karp, D. Patterson, A. Sahay, K.E. Schauser, E. Santos, R. Subramonian, and T. von Eicken, “LogP: Towards a Realistic Model of Parallel Computation,” Proc. Fourth ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPoPP '93), May 1993.
[9] J. Darlington, Y. Guo, H.W. To, and J. Yang, “Functional Skeletons for Parallel Coordination,” Proc. EuroPar '95, pp. 55-66, 1995.
[10] R. Dautray and J.L. Lions, Mathematical Analysis and Numerical Methods for Science and Technology, vol. 4, Springer, 1990.
[11] J. Dongarra and D. Walker, “Software Libraries for Linear Algebra Computations on High Performance Computers,” SIAM Review, vol. 37, no. 2, pp. 151-180, 1995.
[12] M.A. Driscoll and W.R. Daasch, “Accurate Predictions of Parallel Program Execution Time,” J. Parallel and Distributed Computing, vol. 25, no. 1, pp. 16-30, 1995.
[13] High Performance Fortran Forum, “High Performance Fortran Language Specification,” Scientific Programming, vol. 2, no. 1, 1993.
[14] Message Passing Interface Forum, “MPI: A Message-Passing Interface Standard,” Technical Report CS-93-214, Univ. of Tennessee, Apr. 1994.
[15] R. Foschia, T. Rauber, and G. Rünger, “Modeling the Communication Behavior of the Intel Paragon,” Proc. Fifth Symp. Modeling, Analysis, and Simulation of Computer and Telecomm. Systems (MASCOTS '97), pp. 117-124, 1997.
[16] I. Foster and K.M. Chandy, “Fortran M: A Language for Modular Parallel Programming,” J. Parallel and Distributed Computing, vol. 26, no. 1, pp. 24-35, Apr. 1995.
[17] B. Avalani, I. Foster, M. Xu, and A. Choudhary, “A Compilation System That Integrates HPF and Fortran M,” Proc. Scalable High Performance Computing Conf. (SHPCC-94), pp. 293-300, May 1994.
[18] S. Ghosh, M. Martonosi, and S. Malik, “Cache Miss Equations: An Analytical Representation of Cache Misses,” Proc. Int'l Conf. Supercomputing (ICS '97), pp. 317-324, 1997.
[19] R. Graham, “Bounds on Multiprocessing Timing Anomalies,” SIAM J. Applied Math., vol. 17, no. 2, pp. 416-429, 1969.
[20] E. Hairer, S.P. Nørsett, and G. Wanner, Solving Ordinary Differential Equations I: Nonstiff Problems. Springer-Verlag, Berlin, 1993.
[21] E. Hairer and G. Wanner, Solving Ordinary Differential Equations II. Springer, 1991.
[22] P. Hanrahan, D. Saltzman, and L. Aupperle, “A Rapid Hierarchical Radiosity Algorithm,” Computer Graphics (Proc. SIGGRAPH '91, Las Vegas), vol. 25, no. 4, pp. 197-206, Aug. 1991.
[23] M. Hill, W. McColl, and D. Skillicorn, “Questions and Answers about BSP,” Scientific Programming, vol. 6, no. 3, pp. 249-274, 1997.
[24] K. Hwang, Z. Xu, and M. Arakawa, “Benchmark Evaluation of the IBM SP2 for Parallel Signal Processing,” IEEE Trans. Parallel and Distributed Systems, vol. 7, no. 5, pp. 522-536, May 1996.
[25] S. Johnsson, “Performance Modeling of Distributed Memory Architectures,” J. Parallel and Distributed Computing, vol. 12, pp. 300-312, 1991.
[26] S.L. Johnsson and C.T. Ho, “Spanning Graphs for Optimum Broadcasting and Personalized Communication in Hypercubes,” IEEE Trans. Computers, vol. 38, no. 9, pp. 1249-1268, Sept. 1989.
[27] V. Kumar, A. Grama, A. Gupta, and G. Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin Cummings, 1994.
[28] W.F. McColl, “Universal Computing,” Proc. EuroPar '96, pp. 25-36, 1996.
[29] M.G. Norman and P. Thanisch, “Models of Machines and Computation for Mapping in Multicomputers,” ACM Computing Surveys, vol. 25, no. 3, pp. 263-302, 1993.
[30] J. O'Donnell and G. Rünger, “A Methodology for Deriving Parallel Programs with a Family of Abstract Parallel Machines,” Proc. Parallel Processing, EuroPar '97, C. Lengauer, M. Griebl, and S. Gorlatch, eds., pp. 661-668, 1997.
[31] S. Ramaswamy, “Simultaneous Exploitation of Task and Data Parallelism in Regular Scientific Applications,” PhD thesis CRHC-96-03/UILU-ENG-96-2203, Dept. of Electrical and Computer Eng., Univ. of Illinois, Urbana, Jan. 1996.
[32] S. Ramaswamy and P. Banerjee, “Automatic Generation of Efficient Array Redistribution Routines for Distributed Memory Multicomputers,” Proc. Frontiers '95: Fifth Symp. Frontiers of Massively Parallel Computation, pp. 342-349, McLean, Va., Feb. 1995.
[33] S. Ramaswamy, S. Sapatnekar, and P. Banerjee, “A Framework for Exploiting Task and Data Parallelism on Distributed-Memory Multicomputers,” IEEE Trans. Parallel and Distributed Systems, vol. 8, no. 11, pp. 1098-1116, Nov. 1997.
[34] T. Rauber and G. Rünger, “Optimal Data Distribution for LU Decomposition,” Proc. EuroPar '95, pp. 391-402, 1995.
[35] T. Rauber and G. Rünger, “Parallel Solution of a Schrödinger-Poisson System,” Proc. Int'l Conf. High-Performance Computing and Networking, pp. 697-702, 1995.
[36] T. Rauber and G. Rünger, “Comparing Task and Data Parallel Execution Schemes for the DIIRK Method,” Proc. EuroPar '96, pp. 52-61, 1996.
[37] T. Rauber and G. Rünger, “Parallel Iterated Runge-Kutta Methods and Applications,” Int'l J. Supercomputer Applications, vol. 10, no. 1, pp. 62-90, 1996.
[38] T. Rauber and G. Rünger, “The Compiler TwoL for the Design of Parallel Implementations,” Proc. Fourth Int'l Conf. Parallel Architectures and Compilation Techniques (PACT '96), pp. 292-301, 1996.
[39] T. Rauber and G. Rünger, “Load Balancing Schemes for Extrapolation Methods,” Concurrency: Practice and Experience, vol. 9, no. 3, pp. 181-202, 1997.
[40] T. Rauber and G. Rünger, “Compiler Support for Task Scheduling in Hierarchical Execution Models,” J. Systems Architecture, vol. 45, pp. 483-503, 1998.
[41] T. Rauber, G. Rünger, and R. Wilhelm, “Deriving Optimal Data Distributions for Group Parallel Numerical Algorithms,” Proc. Second Conf. Massively Parallel Programming Models, pp. 33-41, 1995.
[42] J. Rose, “LocusRoute: A Parallel Global Router for Standard Cells,” Proc. 25th ACM/IEEE Design Automation Conf., pp. 189-195, 1988.
[43] J. Rose, “Parallel Global Routing for Standard Cells,” IEEE Trans. ComputerAided Design, vol. 9, no. 10, 1990.
[44] D. Skillicorn, “Architecture-Independent Parallel Computation,” Computer, vol. 23, no. 12, pp. 38-51, 1990.
[45] D.B. Skillicorn and D. Talia, “Models and Languages for Parallel Computation,” ACM Computing Surveys, vol. 30, no. 2, pp. 123-169, 1998.
[46] J. Stoer and R. Bulirsch, Introduction to Numerical Analysis. Springer, 1990.
[47] J. Subhlok, “Automatic Mapping of Task and Data Parallel Programs for Efficient Execution on Multiprocessors,” Technical Report CMU-CS-93-212, Carnegie Mellon Univ., 1993.
[48] P. Trinder, K. Hammond, H.W. Loidl, and S.L. Peyton Jones, “Algorithm + Strategy = Parallelism,” J. Functional Programming, vol. 8, no. 1, pp. 23-60, 1998.
[49] J. Turek, U. Schwiegelshohn, J. Wolf, and P. Yu, “Scheduling Parallel Tasks to Minimize Average Response Times,” Proc. Fifth Ann. ACM-SIAM Symp. Discrete Algorithms, Alexandria, Va., pp. 200-209, Jan. 1994.
[50] J. Turek, J. Wolf, and P. Yu, “Approximate Algorithms for Scheduling Parallelizable Tasks,” Proc. Fourth Ann. ACM Symp. Parallel Algorithms and Architectures, San Diego, Calif., pp. 323-332, June 1992.
[51] L.G. Valiant, “A Bridging Model for Parallel Computation,” Comm. ACM, vol. 33, no. 8, pp. 103-111, Aug. 1990.
[52] P.J. van der Houwen and B.P. Sommeijer, “Parallel Iteration of High-Order Runge-Kutta Methods with Stepsize Control,” J. Computational and Applied Math., vol. 29, pp. 111-127, 1990.
[53] P.J. van der Houwen and B.P. Sommeijer, “Iterated Runge-Kutta Methods on Parallel Computers,” SIAM J. Scientific and Statistical Computing, vol. 12, no. 5, pp. 1000-1028, 1991.
[54] P.J. van der Houwen, B.P. Sommeijer, and W. Couzy, “Embedded Diagonally Implicit Runge-Kutta Algorithms on Parallel Computers,” Math. Computation, vol. 58, no. 197, pp. 135-159, Jan. 1992.
[55] M. Wolfe, High Performance Compilers for Parallel Computing. Addison-Wesley, 1996.
[56] S.C. Woo et al., “The SPLASH-2 Programs: Characterization and Methodological Considerations,” Proc. 22nd Ann. Int'l Symp. Computer Architecture, pp. 24-36, June 1995.
[57] Z. Xu and K. Hwang, “Early Prediction of MPP Performance: SP2, T3D, and Paragon Experiences,” J. Parallel Computing, vol. 22, pp. 917-942, Oct. 1996.
[58] K. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. Hilfinger, S. Graham, D. Gay, P. Colella, and A. Aiken, “Titanium: A High-Performance Java Dialect,” Concurrency: Practice and Experience, 1998.