Issue No. 04 - April 2009 (vol. 20)
pp. 498-511
Georgios Goumas , National Technical University of Athens, Heroon Polytechniou, Zografou
Nikolaos Drosinos , National Technical University of Athens, Heroon Polytechniou, Zografou
Nectarios Koziris , National Technical University of Athens, Heroon Polytechniou, Zografou
ABSTRACT
In this paper we revisit the supernode-shape selection problem, which has been widely discussed in the literature. In general, the choice of supernode transformation greatly affects the parallel execution time of the transformed algorithm. Because minimizing the overall parallel execution time through an appropriate supernode transformation is very difficult to accomplish, researchers have focused on scheduling-aware supernode transformations that maximize parallelism during execution. In this paper we argue that the communication volume of the transformed algorithm is an important criterion, and that its minimization should be given high priority. For this reason we define the metric of per-process communication volume and propose a method to minimize this metric by selecting a communication-aware supernode shape. Our approach is equivalent to defining a proper Cartesian process grid with MPI_Cart_create, which means that it can be incorporated into applications in a straightforward manner. Our experimental results illustrate that selecting the tile shape with the proposed method significantly reduces the total parallel execution time, due to the minimization of the communication volume, despite the fact that a few more parallel execution steps are required.
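As a rough illustration of the last point, the C/MPI sketch below (not taken from the paper) shows how a per-dimension process layout is expressed as a Cartesian communicator with MPI_Cart_create. The dims[] values here are a placeholder default produced by MPI_Dims_create; the paper's method would instead prescribe a communication-aware layout for that array.

/* Minimal sketch: mapping a chosen process layout onto a Cartesian grid.
   The dims[] values are placeholders; the communication-aware selection
   itself is the subject of the paper, not shown here. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Hypothetical 3D layout: MPI_Dims_create gives a balanced default;
       a communication-aware method would fill dims[] directly. */
    int dims[3] = {0, 0, 0};
    MPI_Dims_create(nprocs, 3, dims);

    int periods[3] = {0, 0, 0};            /* non-periodic boundaries */
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);

    int rank, coords[3];
    MPI_Comm_rank(cart, &rank);            /* rank within the Cartesian grid */
    MPI_Cart_coords(cart, rank, 3, coords);
    if (rank == 0)
        printf("process grid: %d x %d x %d\n", dims[0], dims[1], dims[2]);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}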
INDEX TERMS
I/O and Data Communications, Load balancing and task assignment, Parallel processors, Parallel Architectures, Scheduling and task partitioning, Data communications
CITATION
Georgios Goumas, Nikolaos Drosinos, Nectarios Koziris, "Communication-Aware Supernode Shape", IEEE Transactions on Parallel & Distributed Systems, vol. 20, no. 4, pp. 498-511, April 2009, doi:10.1109/TPDS.2008.114