loading...
 This Article 
   
 Share 
   
 Bibliographic References 
   
 Add to: 
 
Digg
Furl
Spurl
Blink
Simpy
Google
Del.icio.us
Y!MyWeb
 
 Search 
   
Data Locality Exploitation in the Decomposition of Regular Domain Problems
November 2000 (vol. 11 no. 11)
pp. 1141-1150

Abstract—The aim of this paper is to study the effect of local memory hierarchy and communication network exploitation on message sending and the influence of this effect on the decomposition of regular applications. In particular, we have considered two different parallel computers, a Cray T3E-900 and an SGI Origin 2000. In both systems, the bandwidth reduction due to non-unit-stride memory access is quite significant and could be more important than the reduction due to contention in the network. These conclusions affect the choice of optimal decompositions for regular domains problems. Thus, although traditional 3D decompositions lead to lower inherent communication-to-computation ratios and could exploit more efficiently the interconnection network, lower dimensional decompositions are found to be more efficient due to the data decomposition effects on the spatial locality of the messages to be communicated. This increasing importance of local optimisations has also been shown using a well-known communication-computation overlapping technique which increases execution time, instead of reducing it as we could expect, due to poor cache memory exploitation.

[1] J. Duato, S. Yalamanchili, and L.M. Ni, Interconnection Networks: An Engineering Approach. Los Alamitos, Calif.: IEEE CS Press, 1997.
[2] S.S. Mukherjee and M.D. Hill, “Making Network Interfaces Less Peripheral,” Computer, Oct. 1998.
[3] W.A. Wulf and S.A. McKee, "Hitting the Memory Wall: Implications of the Obvious," Computer Architecture News, Vol. 23, No. 1, Mar. 1995, pp. 20-24.
[4] A. Saulsbury, F. Pong, and A. Nowatzyk, “Missing the Memory Wall: The Case for Processor/Memory Integration,” Proc. 23rd Ann. Int'l Symp. Computer Architecture (ISCA '96), pp. 90-101, May 1996.
[5] J. Dongarra and D. Walker, “Software Libraries for Linear Algebra Computations on High Performance Computers,” SIAM Review, vol. 37, no. 2,pp. 151–180, 1995.
[6] U. Rüde, “Iterative Algorithms on High Performance Architectures,” Proc. Third European Conf. Parallel Processing (Euro-Par '97), pp. 57-71, 1997.
[7] S. Carr, K.S. McKinley, and C.-W. Tseng, “Compiler Optimizations for Improving Data Locality,” Proc. Sixth Int'l Conf. Architectural Support for Programming Languages and Operating Systems, pp. 252-262, Oct. 1994.
[8] J. Anderson, S. Amarasinghe, and M. Lam, “Data and Computation Transformations for Multiprocessors,” Proc. Fifth ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, July 1995.
[9] M. Cierniak and W. Li, “Unifying Data and Control Transformations for Distributed Shared Memory Machines,” Proc. SIGPLAN Conf. Programming Language Design and Implementation, June 1995.
[10] R. Chandra, D. Chen, R. Cox, D. Maydan, N. Nedeljkovic, and J. Anderson, "Data-Distribution Support on Distributed-Shared Memory Multiprocessors," Proc. SIGPLAN Conf. Programming Language Design and Implementation, pp. 334-345,Las Vegas, Nev., 1997.
[11] I.T. Foster, Designing and Building Parallel Programs Addison-Wesley, Reading, Mass., 1995.
[12] I.M. Llorente and F. Tirado, “Relationships Between Efficiency an Execution Time of Full Multigrid Methods on Parallel Computers,” IEEE Trans. Parallel and Distributed Systems, vol. 8, no. 6, pp. 562-573, 1997.
[13] R. von Hanxleden and K. Kennedy, "Give-N-Take—A Balanced Code Placement Framework," Proc. SIGPLAN '94 Conf. Programming Language Design and Implementation, pp. 107-120. ACM Press, June 1994.
[14] A. Lain and P. Banerjee, “Techniques to Overlap Computation and Communication in Irregular Iterative Applications,” Proc. Eighth ACM Int'l Conf. Supercomputing, pp. 236–245, July 1994.
[15] S.L. Scott, "Synchronization and Communication in the T3E Multiprocess," Proc. ASPLOS-VII, Oct. 1996.
[16] D. Culler, J.P. Singh, and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, San Francisco, 1998.
[17] E. Anderson, J. Brooks, C. Grass, and S. Scott, “Performance of the CRAY T3E Multiprocessor,” Proc. High Performance Networking and Computing Conf. (SC'97), Nov. 1997.
[18] J. Laudon and D. Lenoski, "The SGI Origin: A cc-NUMA Highly Scalable Server," Proc. 24th Ann. Int'l Symp. Computer Architecture, May 1997.
[19] SGI Technical Publications Library, “Origin 2000 and Onyx2 Performance Tuning and Optimization Guide,” SGI Technical Document number: 007-3430-002, Aug. 2000, http:/techpubs.sgi.com.
[20] E. Huedo, M. Prieto, I.M. Llorente, and F. Tirado, “Impact of PE Mapping on Cray T3E Message-Passing Performance,” Proc. Sixth European Conf. Parallel Processing (Euro-Par '00), pp. 199-207, 2000.
[21] V. Getov, E. Hernández, and T. Hey, “Message-Passing Performance of Parallel Computers,” Proc. Third European Conf. Parallel Processing (Euro-Par '97), pp. 1,009-1,016, 1997.
[22] J.J. Dongarra, T. Hey, and E. Strohmaier, “Selected Results from the PARKBENCH Benchmark,” Proc. Second European Conf. Parallel Processing (Euro-Par '96), pp. 251-254, Aug. 1996.
[23] A.J. van der Steen and R. van der Pas, “A Performance Analysis of the SGI Origin 2000,” Proc. Third Int'l Meeting on Vector and Parallel Processing, pp. 534-547, 1999.
[24] M. Resch, H. Berger, R. Rabenseifner, and T. Bönish, “Performance of MPI on the CRAY T3E-512,” Third European CRAY-SGI MPP Workshop, Sep. 1997.
[25] MPI Forum, “A Message-Passing Interface Standard,” Int'l J. Supercomputer Applications, vol. 8,nos. 3 and 4, 1994.
[26] M. Ashworth, “OCCOMM, Ocean Model Boundary Exchange Communications Kernel Benchmark,” Aug. 2000, http://www.cse.clrc.ac.uk/ActivityOCCOMM .
[27] M. Prieto, I.M. Llorente, and F. Tirado, “Partitioning of Regular Domains on Modern Parallel Computers,” Proc. Third Int'l Meeting Vector and Parallel Processing, pp. 411-425, 1999.
[28] M. Prieto, D. Espadas, I.M. Llorente, and F. Tirado, “Message Passing Evaluation and Analysis on Cray T3E and SGI Origin 2000 Systems,” Proc. Fifth European Conf. Parallel Processing (Euro-Par '99), pp. 173-182, 1999.
[29] M. Müller and M. Resch, “PE Mapping and the Congestion Problem on the T3E,” Proc. Fourth European Cray-SGI MPP Workshop, H. Lederer and F. Hertweck, eds., 1998.
[30] D. Espadas, M. Prieto, I.M. Llorente, and F. Tirado, “Parallel Resolution of Alternating-Line Processes by Means of Pipelining Techniques,” Proc. Seventh Euromicro Workshop Parallel and Distributed Processing, pp. 289-296, Feb. 1999.
[31] D. Espadas, M. Prieto, I.M. Llorente, and F. Tirado, “Solution of Alternating-Line Processes on Modern Parallel Computers” Proc. 28th Int'l Conf. Parallel Processing (ICPP '99), pp. 208-215, Sep. 1999.
[32] Z. Xu and K. Hwang, "Modeling Communication Overhead: MPI and MPL Performance on the IBM SP2," IEEE Parallel&Distributed Technology, Vol. 4, No. 1, Spring 1996, pp. 9-23.
[33] J. Miguel, R. Beivide, A. Arruabarrena, and J.A. Gregorio, “Assessing the Performance of the New IBM SP2 Communication Subsystem,” IEEE Parallel and Distributed Technology, vol. 4, no. 14, pp. 12-22, Winter, 1996.
[34] I.M. Llorente, J.C. Fabero, F. Tirado, and A. Bautista, “Distributed Parallel Computers versus PVM on a Workstation Cluster in the Simulation of Time Dependent PDE,” Proc. Third Euromicro Workshop Parallel and Distributed Processing, pp. 20-26, Jan. 1995.

Index Terms:
MPI performance evaluation, data locality, data partitioning, domain decomposition applications.
Citation:
Manuel Prieto, Ignacio M. Llorente, Franscisco Tirado, "Data Locality Exploitation in the Decomposition of Regular Domain Problems," IEEE Transactions on Parallel and Distributed Systems, vol. 11, no. 11, pp. 1141-1150, Nov. 2000, doi:10.1109/71.888635
Usage of this product signifies your acceptance of the Terms of Use.