Achieving Full Parallelism Using Multidimensional Retiming
November 1996 (vol. 7 no. 11)
pp. 1150-1163

Abstract: Most scientific and digital signal processing (DSP) applications are recursive or iterative. Transformation techniques are usually applied to obtain optimal execution rates on parallel and/or pipelined systems. Retiming is a common and valuable transformation tool for one-dimensional problems, in which loops are represented by data flow graphs (DFGs). In this paper, uniform nested loops are modeled as multidimensional data flow graphs (MDFGs). Full parallelism of the loop body, i.e., executing all nodes of the MDFG in parallel, substantially decreases the overall computation time. It is well known that, for one-dimensional DFGs, retiming cannot always achieve full parallelism, and other existing optimization techniques for nested loops cannot always achieve it either. This paper presents an important and counterintuitive result: full parallelism can always be obtained for MDFGs with more than one dimension. The result is obtained by transforming the MDFG into a new structure through a multidimensional retiming technique. The theory and two algorithms for obtaining full parallelism are presented, and examples of optimizing nested loops and digital signal processing designs demonstrate the effectiveness of the algorithms.
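
To make the idea of retiming an MDFG concrete, the following is a minimal sketch and not one of the paper's two algorithms. It assumes a common sign convention in which an edge from u to v with delay vector d(e) becomes d(e) + r(u) - r(v) after retiming by r. The two-node graph, the retiming values, and the schedule vector (1, 2) are hypothetical and chosen only for illustration. The sketch treats the loop body as fully parallel when no edge has an all-zero delay vector, and it checks that a linear schedule vector still orders every dependence.

```python
from typing import Dict, List, Tuple

Vec = Tuple[int, ...]
Edge = Tuple[str, str, Vec]  # (source node, target node, delay vector)


def retime(edges: List[Edge], r: Dict[str, Vec]) -> List[Edge]:
    """Apply a retiming r to every edge (assumed convention: d + r(u) - r(v))."""
    result = []
    for u, v, d in edges:
        new_d = tuple(di + ru - rv for di, ru, rv in zip(d, r[u], r[v]))
        result.append((u, v, new_d))
    return result


def fully_parallel(edges: List[Edge]) -> bool:
    """True when every edge carries a nonzero delay vector, so no two
    nodes of the same iteration depend on each other."""
    return all(any(c != 0 for c in d) for _, _, d in edges)


def respects_schedule(edges: List[Edge], s: Vec) -> bool:
    """True when the schedule vector s has a positive inner product with
    every delay vector, so each dependence points forward in time."""
    return all(sum(a * b for a, b in zip(d, s)) > 0 for _, _, d in edges)


# Hypothetical two-node MDFG: A feeds B within the same iteration,
# and B feeds A one step later along the second loop index.
edges = [("A", "B", (0, 0)), ("B", "A", (0, 1))]

# Hypothetical retiming: move one delay along the first dimension through A.
r = {"A": (1, 0), "B": (0, 0)}
retimed = retime(edges, r)

print(fully_parallel(edges))    # original graph has a zero-delay edge
print(retimed)                  # delay vectors after retiming
print(fully_parallel(retimed))  # every edge now has a nonzero delay
print(respects_schedule(retimed, (1, 2)))
```

In this sketch the delay sum around the cycle is unchanged by retiming, which is why the second dimension is needed: a one-dimensional cycle with two edges and a single delay cannot give every edge a nonzero delay.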

Index Terms:
Retiming, multidimensional data-flow graphs, instruction level parallelism, loop transformation, nested loops, VLIW, superscalar.
Citation:
Nelson Luiz Passos, Edwin Hsing-Mean Sha, "Achieving Full Parallelism Using Multidimensional Retiming," IEEE Transactions on Parallel and Distributed Systems, vol. 7, no. 11, pp. 1150-1163, Nov. 1996, doi:10.1109/71.544356