Exploiting Locality for Irregular Scientific Codes
July 2006 (vol. 17, no. 7)
pp. 606-618

Abstract—Irregular scientific codes experience poor cache performance due to their irregular memory access patterns. In this paper, we present two new locality-improving techniques for irregular scientific codes. Our techniques exploit geometric structures hidden in data access patterns and computation structures. Our new data reordering (Gpart) finds the graph structure within data accesses and applies hierarchical clustering. Quality partitions are constructed quickly by clustering multiple neighboring nodes, giving priority to high-degree nodes, and repeating for a few passes. Overhead is kept low by clustering multiple nodes in each pass and considering only edges between partitions. Our new computation reordering (Z-Sort) treats the values of index arrays as coordinates and reorders the corresponding computations in Z-curve order. Applied to dense inputs, Z-Sort achieves performance close to that of data reordering combined with other computation reorderings, but without the overhead involved in data reordering. Experiments on irregular scientific codes for a variety of meshes show that locality optimization techniques are effective for both sequential and parallelized codes, improving performance by 60-87 percent. Gpart achieves within 1-2 percent of the performance of more sophisticated partitioning algorithms, but with one-third of the overhead. Z-Sort also yields a performance improvement of 64 percent for dense inputs, comparable to data reordering combined with computation reordering.
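To make the clustering idea concrete, the C sketch below shows one clustering pass in the spirit of Gpart on an access graph stored in CSR form: nodes are visited from highest degree down, and each unassigned node absorbs several unassigned neighbors into its cluster, producing a new data order. The CSR arrays adj_ptr/adj, the cluster_size limit, and the single-pass structure are illustrative assumptions, not the paper's implementation; the actual Gpart repeats such passes hierarchically on the contracted graph and restricts attention to edges between partitions.

#include <stdlib.h>

/* One clustering pass in the spirit of Gpart (illustrative sketch only).
 * n            : number of nodes
 * adj_ptr, adj : CSR adjacency of the data access graph (assumed inputs)
 * cluster_size : maximum nodes merged into one cluster (assumed parameter)
 * new_order    : output permutation, new_order[new_pos] = old node id
 */
void gpart_pass(int n, const int *adj_ptr, const int *adj,
                int cluster_size, int *new_order)
{
    int *cluster = malloc(n * sizeof *cluster);   /* -1 = unassigned */
    for (int i = 0; i < n; i++)
        cluster[i] = -1;

    /* Bucket nodes by degree so high-degree nodes are visited first. */
    int maxdeg = 0;
    for (int i = 0; i < n; i++) {
        int d = adj_ptr[i + 1] - adj_ptr[i];
        if (d > maxdeg)
            maxdeg = d;
    }
    int *head = malloc((maxdeg + 1) * sizeof *head);
    int *next = malloc(n * sizeof *next);
    for (int d = 0; d <= maxdeg; d++)
        head[d] = -1;
    for (int i = 0; i < n; i++) {
        int d = adj_ptr[i + 1] - adj_ptr[i];
        next[i] = head[d];
        head[d] = i;
    }

    /* Visit nodes from highest degree down; each unassigned node starts a
     * cluster and absorbs up to cluster_size-1 unassigned neighbors, so
     * nodes accessed together end up contiguous in the new data layout. */
    int pos = 0;
    for (int d = maxdeg; d >= 0; d--) {
        for (int v = head[d]; v != -1; v = next[v]) {
            if (cluster[v] != -1)
                continue;
            cluster[v] = v;
            new_order[pos++] = v;
            int taken = 1;
            for (int e = adj_ptr[v]; e < adj_ptr[v + 1] && taken < cluster_size; e++) {
                int u = adj[e];
                if (cluster[u] == -1) {
                    cluster[u] = v;
                    new_order[pos++] = u;
                    taken++;
                }
            }
        }
    }
    free(head);
    free(next);
    free(cluster);
}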
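The next sketch illustrates the Z-curve idea behind Z-Sort for an edge-based irregular loop: each iteration's pair of index-array values (hypothetical left/right arrays here) is bit-interleaved into a Morton key, and iterations are sorted by that key so that consecutive iterations touch nearby data. The 16-bit coordinate range and the use of qsort are simplifications for exposition, not the paper's implementation.

#include <stdint.h>
#include <stdlib.h>

/* Spread the low 16 bits of v so one zero bit separates each original bit. */
static uint32_t spread_bits(uint32_t v)
{
    v &= 0xFFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

/* Morton (Z-curve) key: interleave the bits of the two index-array values. */
static uint32_t morton_key(uint32_t x, uint32_t y)
{
    return spread_bits(x) | (spread_bits(y) << 1);
}

struct edge { uint32_t key; int left, right; };

static int cmp_key(const void *a, const void *b)
{
    uint32_t ka = ((const struct edge *)a)->key;
    uint32_t kb = ((const struct edge *)b)->key;
    return (ka > kb) - (ka < kb);
}

/* Reorder the iterations (edges) of an irregular loop in Z-curve order so
 * that consecutive iterations reference nearby elements of the data arrays. */
void zsort_edges(int n, int *left, int *right)
{
    struct edge *e = malloc(n * sizeof *e);
    for (int i = 0; i < n; i++) {
        e[i].left = left[i];
        e[i].right = right[i];
        e[i].key = morton_key((uint32_t)left[i], (uint32_t)right[i]);
    }
    qsort(e, n, sizeof *e, cmp_key);
    for (int i = 0; i < n; i++) {
        left[i] = e[i].left;
        right[i] = e[i].right;
    }
    free(e);
}

Unlike data reordering, this only permutes the iteration order (the index arrays themselves), which is why it avoids the cost of relocating and renumbering the underlying data.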

Index Terms:
Compiler optimization, cache memories, inspector/executor, data reordering, computation reordering.
Citation:
Hwansoo Han, Chau-Wen Tseng, "Exploiting Locality for Irregular Scientific Codes," IEEE Transactions on Parallel and Distributed Systems, vol. 17, no. 7, pp. 606-618, July 2006, doi:10.1109/TPDS.2006.88