
Mahmut Kandemir, Alok Choudhary, Nagaraj Shenoy, Prithviraj Banerjee, J. Ramanujam, "A Linear Algebra Framework for Automatic Determination of Optimal Data Layouts," IEEE Transactions on Parallel and Distributed Systems, vol. 10, no. 2, pp. 115-135, February 1999.
@article{10.1109/71.752779,
  author = {Mahmut Kandemir and Alok Choudhary and Nagaraj Shenoy and Prithviraj Banerjee and J. Ramanujam},
  title = {A Linear Algebra Framework for Automatic Determination of Optimal Data Layouts},
  journal = {IEEE Transactions on Parallel and Distributed Systems},
  volume = {10},
  number = {2},
  issn = {1045-9219},
  year = {1999},
  pages = {115--135},
  doi = {http://doi.ieeecomputersociety.org/10.1109/71.752779},
  publisher = {IEEE Computer Society},
  address = {Los Alamitos, CA, USA},
}
Keywords: Data reuse, locality optimizations, spatial locality, memory performance, parallelism, array restructuring.
Abstract—This paper presents a data layout optimization technique for sequential and parallel programs based on the theory of hyperplanes from linear algebra. Given a program, our framework automatically determines suitable memory layouts that can be expressed by hyperplanes for each array that is referenced. We discuss the cases where data transformations are preferable to loop transformations and show that under certain conditions a loop nest can be optimized for perfect spatial locality by using data transformations. We argue that data transformations can also optimize spatial locality for some arrays without distorting temporal/spatial locality exhibited by others. We divide the problem of optimizing data layout into two independent subproblems: 1) determining optimal static data layouts, and 2) determining data transformation matrices to implement the optimal layouts. By postponing the determination of the transformation matrix to the last stage, our method can be adapted to compilers with different default layouts. We then present an algorithm that considers optimizing parallelism and spatial locality simultaneously. Our results on eight programs on two distributed shared-memory multiprocessors, the Convex Exemplar SPP-2000 and the SGI Origin 2000, show that the layout optimizations are effective in optimizing spatial locality and parallelism.
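To make the idea of a layout "expressed by a hyperplane" concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's algorithm. The function name `layout_hyperplane` and the search over coefficients in {-1, 0, 1} are hypothetical. It also assumes the usual convention for two-dimensional arrays in which the hyperplane vector (1, 0) denotes row-major storage, (0, 1) column-major, and (1, -1) diagonal. The sketch treats the last column of a reference's access matrix as the direction the reference moves in the array on each innermost-loop iteration, and looks for a vector g orthogonal to that direction, so that consecutive iterations stay on the same hyperplane.

```python
from itertools import product


def layout_hyperplane(access_matrix):
    """Return a nonzero vector g with entries in {-1, 0, 1} such that
    g . u == 0, where u is the last column of the access matrix (the
    per-iteration movement of the reference in the innermost loop).
    Returns None if no such small vector exists.

    Hypothetical helper for illustration only.
    """
    # Last column: how the array index changes when the innermost
    # loop index advances by one.
    u = [row[-1] for row in access_matrix]
    for g in product((0, 1, -1), repeat=len(u)):
        if any(g) and sum(gi * ui for gi, ui in zip(g, u)) == 0:
            return g
    return None


# Loop nest (i, j) with j innermost.
row_major = layout_hyperplane([[1, 0], [0, 1]])   # reference A[i][j]
col_major = layout_hyperplane([[0, 1], [1, 0]])   # reference B[j][i]
diagonal = layout_hyperplane([[1, 1], [0, 1]])    # reference C[i+j][j]
print(row_major, col_major, diagonal)
```

Under these assumptions, the reference A[i][j] is matched with a row-major layout, B[j][i] with a column-major layout, and C[i+j][j] with a diagonal layout. Each array gets its own layout without changing the loop order, which is the sense in which a data transformation can serve one array without disturbing the locality of the others.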