Automatic Compilation of Loops to Exploit Operator Parallelism on Configurable Arithmetic Logic Units
January 2002 (vol. 13 no. 1)
pp. 45-66

Configurable Arithmetic Logic Units (ALUs) offer opportunities for adapting the underlying hardware to support the varying amounts of parallelism in a computation. Identifying the optimal parallel configurations at different steps in a program (a configuration is defined as a given hardware implementation of different operators along with their multiplicities) is a very complex problem but, if solved, allows the power of these ALUs to be maximally used. This paper develops an automatic compilation framework for configuration analysis that exploits operator parallelism within loop nests, with a focus on minimizing costly reconfiguration overheads. In our framework, we initially carry out some operator and loop transformations to expose more opportunities for configuration reuse. We then present a two-pass solution. The first pass attempts to generate either maximal cutsets (a cutset is defined as a group of statements that execute under a given configuration) or maximally parallel configurations by performing an analysis on the program dependency graph (PDG) of a loop nest. The second pass analyzes the trade-offs between the costs and benefits of reconfigurations across different cutsets and attempts to eliminate reconfiguration overheads by merging cutsets. This methodology is implemented in the SUIF compilation system and is tested on loops extracted from the Perfect benchmarks and Livermore kernels. Good speedups are obtained, showing the merit of the proposed method. The method also scales well with the loop sizes and the amount of space available on FPGAs for configurable logic.
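
The two-pass idea can be illustrated with a small sketch. This is not the paper's algorithm, which works on the PDG of a loop nest inside SUIF; it is a toy model under stated assumptions: statements are already in a dependence-respecting order, each statement is a hypothetical map from operator name to the number of operations it could issue in parallel, every operator unit costs one unit of area, and a statement's run time is the worst-case number of rounds over its operators. All function names and the cost model are invented for illustration.

```python
from math import ceil

def merged_config(stmts):
    """Elementwise max of operator demands: runs every statement fully in parallel."""
    cfg = {}
    for s in stmts:
        for op, n in s.items():
            cfg[op] = max(cfg.get(op, 0), n)
    return cfg

def area(cfg):
    # Assumption: each operator unit occupies one unit of configurable logic.
    return sum(cfg.values())

def shrink_to_fit(cfg, capacity):
    """Drop units from the most-replicated operator until the config fits."""
    cfg = dict(cfg)
    if len(cfg) > capacity:          # cannot keep even one unit per operator
        return None
    while area(cfg) > capacity:
        op = max(cfg, key=cfg.get)   # some operator must have >= 2 units here
        cfg[op] -= 1
    return cfg

def exec_time(stmts, cfg):
    """Toy cost: each statement takes as many rounds as its slowest operator."""
    return sum(max(ceil(n / cfg[op]) for op, n in s.items()) for s in stmts)

def first_pass(stmts, capacity):
    """Greedily grow cutsets while the maximally parallel config still fits.
    Assumes every single statement fits within the capacity on its own."""
    cutsets, current = [], []
    for s in stmts:
        if current and area(merged_config(current + [s])) > capacity:
            cutsets.append(current)
            current = []
        current.append(s)
    if current:
        cutsets.append(current)
    return cutsets

def second_pass(cutsets, capacity, reconfig_cost):
    """Merge adjacent cutsets when a shared (possibly less parallel)
    configuration is no slower than reconfiguring between them."""
    result = [cutsets[0]]
    for nxt in cutsets[1:]:
        prev = result[-1]
        own_prev = shrink_to_fit(merged_config(prev), capacity)
        own_next = shrink_to_fit(merged_config(nxt), capacity)
        separate = (exec_time(prev, own_prev) + exec_time(nxt, own_next)
                    + reconfig_cost)
        shared = shrink_to_fit(merged_config(prev + nxt), capacity)
        if shared is not None and exec_time(prev + nxt, shared) <= separate:
            result[-1] = prev + nxt
        else:
            result.append(nxt)
    return result

# Hypothetical loop body: two add-heavy statements, then two multiply-heavy ones.
stmts = [{"add": 4}, {"add": 4}, {"mul": 4}, {"mul": 3, "add": 1}]
cutsets = first_pass(stmts, capacity=4)
merged = second_pass(cutsets, capacity=4, reconfig_cost=5)
print(len(cutsets), len(merged))
```

In this toy model a high reconfiguration cost makes the second pass collapse the cutsets into one shared, less parallel configuration, while a reconfiguration cost of zero leaves the first-pass cutsets untouched.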

Index Terms:
Compilers, parallel computing, operator parallelism, reconfigurable systems, loop transformation, FPGAs.
Narasimhan Ramasubramanian, Ram Subramanian, Santosh Pande, "Automatic Compilation of Loops to Exploit Operator Parallelism on Configurable Arithmetic Logic Units," IEEE Transactions on Parallel and Distributed Systems, vol. 13, no. 1, pp. 45-66, Jan. 2002, doi:10.1109/71.980026