
Nelson Luiz Passos, Edwin Hsing-Mean Sha, "Achieving Full Parallelism Using Multidimensional Retiming," IEEE Transactions on Parallel and Distributed Systems, vol. 7, no. 11, pp. 1150-1163, November 1996.
Keywords: retiming, multidimensional data flow graphs, instruction-level parallelism, loop transformation, nested loops, VLIW, superscalar.
Abstract—Most scientific and digital signal processing (DSP) applications are recursive or iterative. Transformation techniques are usually applied to obtain optimal execution rates in parallel and/or pipelined systems. Retiming is a common and valuable transformation tool for one-dimensional problems, in which loops are represented by data flow graphs (DFGs). In this paper, uniform nested loops are modeled as multidimensional data flow graphs (MDFGs). Full parallelism of the loop body, i.e., all nodes in the MDFG executed in parallel, substantially decreases the overall computation time. It is well known that, for one-dimensional DFGs, retiming cannot always achieve full parallelism. Other existing optimization techniques for nested loops also cannot always achieve full parallelism. This paper shows an important and counterintuitive result: full parallelism can always be obtained for MDFGs with more than one dimension. The result is obtained by transforming the MDFG into a new structure, using a restructuring process based on a multidimensional retiming technique. The theory and two algorithms to obtain full parallelism are presented. Examples of the optimization of nested loops and digital signal processing designs demonstrate the effectiveness of the algorithms.
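The central claim of the abstract can be illustrated with a small sketch. It is not one of the paper's two algorithms: the retiming vector is chosen by hand, and the names `retime`, `is_fully_parallel`, and `admits_schedule` are hypothetical. It assumes the convention that an edge u -> v with delay vector d(e) becomes d(e) + r(u) - r(v) after retiming, that full parallelism means no edge keeps a zero delay vector, and that a retimed graph is executable when some schedule vector s has a positive dot product with every delay.

```python
from typing import Dict, List, Tuple

Vector = Tuple[int, int]
Edge = Tuple[str, str, Vector]  # (source node, target node, delay vector)


def retime(edges: List[Edge], r: Dict[str, Vector]) -> List[Edge]:
    """Apply a retiming r; assumed convention: d_r(e) = d(e) + r(u) - r(v)."""
    result = []
    for u, v, d in edges:
        ru = r.get(u, (0, 0))  # nodes missing from r are not retimed
        rv = r.get(v, (0, 0))
        result.append((u, v, (d[0] + ru[0] - rv[0], d[1] + ru[1] - rv[1])))
    return result


def is_fully_parallel(edges: List[Edge]) -> bool:
    """True when no edge has a zero delay, so no node waits on another
    node of the same iteration."""
    return all(d != (0, 0) for _, _, d in edges)


def admits_schedule(edges: List[Edge], s: Vector) -> bool:
    """True when s . d > 0 for every edge, i.e. every dependence points
    forward in the execution order defined by s."""
    return all(s[0] * d[0] + s[1] * d[1] > 0 for _, _, d in edges)


# Two-node loop body: B uses A's result in the same iteration (zero delay),
# and A uses B's result from the previous inner-loop iteration.
edges = [("A", "B", (0, 0)), ("B", "A", (0, 1))]

# Hand-picked retiming: move one delay along the first dimension through A.
retimed = retime(edges, {"A": (1, 0)})
```

Here `retimed` has delays (1, 0) on A -> B and (-1, 1) on B -> A. The cycle total is still (0, 1), no edge has a zero delay, and the schedule vector s = (1, 2) is positive on both edges. In one dimension the same cycle would carry a total delay of 1 over two edges with non-negative integer delays, so one edge would always stay at zero; the second dimension is what leaves room to remove it.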