Designing a Scalable Processor Array for Recurrent Computations
August 1997 (vol. 8 no. 8)
pp. 840-856

Abstract—In this paper, we study the design of a coprocessor (CoP) for efficiently executing recursive algorithms with uniform dependencies. Our design is based on two objectives: 1) fixed bandwidth to main memory (MM), and 2) scalability to higher performance without increasing MM bandwidth. Our CoP has an access unit (AU) organized as multiple queues, a processor array (PA) with regularly connected processing elements (PEs), and input/output networks for data routing. Our design is unique because it addresses the input/output bottleneck and scalability, two of the most important issues in integrating processor arrays into current systems. For processor arrays to be widely usable, they must be scalable to high performance with little or no impact on the supporting memory system. The use of multiple queues in AU also eliminates the need for explicit data addresses, thereby simplifying the design of the control program. We present a mapping algorithm that partitions the data dependence graph (DG) of an application into regular blocks, sequences the blocks through AU, and schedules the execution of the blocks, one at a time, on PA. We show that our mapping procedure minimizes the amount of communication between blocks in the partitioned DG and sequences the blocks through AU to reduce the communication between AU and MM. Using the matrix-product and transitive-closure applications, we study design trade-offs involving 1) the division of a fixed chip area between PA and AU, and 2) improvements in speedup with respect to increases in chip area. Our results show, for a fixed chip area, 1) that there is little degradation in throughput when using a linear PA as compared to a PA organized as a square mesh, and 2) that the design is not sensitive to the division of chip area between PA and AU. We further show that, for a fixed throughput, there is an inverse square-root relationship between speedup and total chip area.
Our study demonstrates the feasibility of a low-cost, memory-bandwidth-limited, and scalable coprocessor system for evaluating recurrent algorithms with uniform dependencies.
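The block-partitioning idea in the abstract can be illustrated with a small sketch (this is not the authors' algorithm, and all names here are hypothetical): when a DG with uniform dependencies is cut into regular blocks, the dependence edges that cross block boundaries scale with the block perimeter, not the block area, which is why larger blocks reduce inter-block communication.

```python
# Illustrative sketch only (hypothetical names, not the paper's mapping
# algorithm): partition a 2D iteration space with uniform dependencies
# into regular b x b blocks and count boundary-crossing dependence edges.

def cross_block_edges(n, b, deps=((1, 0), (0, 1))):
    """Count dependence edges of an n x n uniform-dependence DG that
    cross boundaries of a regular b x b block partition."""
    crossing = 0
    for i in range(n):
        for j in range(n):
            for di, dj in deps:
                ni, nj = i + di, j + dj
                # An edge crosses a boundary when its endpoints fall in
                # different blocks (block coordinates via floor division).
                if ni < n and nj < n and (ni // b, nj // b) != (i // b, j // b):
                    crossing += 1
    return crossing

# When b divides n, each dependence direction contributes (n/b - 1) * n
# crossings, so doubling the block size roughly halves the communication.
for b in (2, 4, 8):
    print(b, cross_block_edges(16, b))
```

Running the loop for a 16 x 16 iteration space shows the crossing count falling as the block size grows, mirroring the abstract's claim that the mapping procedure minimizes communication between blocks in the partitioned DG.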

[1] M. Annaratone, E. Arnould, T. Gross, H.T. Kung, M. Lam, O. Menzilcioglu, and J. Webb, "The Warp Computer: Architecture, Implementation, and Performance," IEEE Trans. Computers, vol. 36, no. 12, pp. 1523-1538, Dec. 1987.
[2] J. Bu and E.F. Deprettere, "Processor Clustering for the Design of Optimal Fixed-Size Systolic Arrays," Proc. Int'l Conf. Application Specific Array Processors, pp. 402-413, Sept. 1991.
[3] W.P. Burleson, "Partitioning Problems on VLSI Arrays: I/O and Local Memory Complexity," Proc. ICASSP, pp. 1217-1220, Toronto, Canada, May 1991.
[4] S. Carr and K. Kennedy, "Compiler Blockability of Numerical Algorithms," Proc. Supercomputing, pp. 114-124, Minneapolis, Minn., Nov. 1992.
[5] V. Van Dongen, "Mapping Uniform Recurrences onto Small Size Arrays," Proc. PARLE, pp. 191-208, 1991.
[6] B.L. Drake, F.T. Luk, J.M. Speiser, and J.J. Symanski, "SLAPP: A Systolic Linear Algebra Computer," Computer, vol. 20, no. 7, p. 45, July 1987.
[7] J.A.B. Fortes, B.W. Wah, W. Shang, and K.N. Ganapathy, "Algorithm-Specific Parallel Processing with Linear Processor Arrays," Advances in Computers, M. Yovits, ed. Academic Press, 1994.
[8] D.E. Foulser and R. Schreiber, "The Saxpy Matrix-1: A General Purpose Systolic Computer," Computer, vol. 20, no. 7, p. 35, July 1987.
[9] K. Ganapathy and B.W. Wah, "Optimal Synthesis of Algorithm-Specific Lower-Dimensional Processor Arrays," IEEE Trans. Parallel and Distributed Systems, vol. 7, no. 3, pp. 274-287, Mar. 1996.
[10] K. Ganapathy, "Mapping Regular Recursive Algorithms to Fine-Grained Processor Arrays," PhD thesis, Univ. of Illinois, Urbana-Champaign, May 1994.
[12] K.N. Ganapathy and B.W. Wah, "Synthesizing Optimal Lower Dimensional Processor Arrays," Proc. Int'l Conf. Parallel Processing, pp. 96-103, Pennsylvania State Univ. Press, Aug. 1992.
[13] J.-W. Hong and H.T. Kung, "I/O Complexity: The Red-Blue Pebble Game," Proc. 13th Ann. ACM Symp. Theory of Computing, pp. 326-333, May 1981.
[14] F. Irigoin and R. Triolet, “Supernode Partitioning,” Proc. 15th ACM Symp. Principles of Programming Languages, pp. 319-329, Jan. 1988.
[15] K. Jainandunsing, "Optimal Partitioning Schemes for Wavefront/Systolic Array Processors," technical report, Delft Univ. of Technology, Delft, The Netherlands, Apr. 1986.
[16] R. Karp, R. Miller, and S. Winograd, "The Organization of Computations for Uniform Recurrence Equations," J. ACM, vol. 14, July 1967.
[17] P. Kuchibhotla and B.D. Rao, "Efficient Scheduling Methods for Partitioned Systolic Algorithms," Proc. Application Specific Array Processors, pp. 649-663, IEEE CS Press, Aug. 1992.
[18] D. Kulkarni, K. Kumar, A. Basu, and A. Paulraj, "Loop Partitioning for Distributed Memory Multiprocessors as Unimodular Transformations," Proc. Int'l Conf. Supercomputing, pp. 206-215, 1991.
[19] S.Y. Kung, VLSI Array Processors. Englewood Cliffs, N.J.: Prentice Hall, 1988.
[20] D. Le, M. Ercegovac, T. Lang, and J. Moreno, "MAMACG: A Tool for Mapping Matrix Algorithms onto Mesh-Connected Processor Arrays," Proc. Application Specific Array Processors, pp. 511-525, Aug. 1992.
[21] D.I. Moldovan and J.A.B. Fortes, "Partitioning and Mapping Algorithms into Fixed Size Systolic Arrays," IEEE Trans. Computers, vol. 35, no. 1, pp. 1-12, Jan. 1986.
[22] J.H. Moreno and T. Lang, "Matrix Computations on Systolic-Type Meshes," Computer, vol. 23, no. 4, pp. 32-51, Apr. 1990.
[23] J.H. Moreno, "Matrix Computations on Mesh Arrays," PhD thesis, Univ. of California, Los Angeles, June 1989.
[24] J.H. Moreno and M.E. Figueroa, "A Decoupled Access/Execute Processor for Matrix Algorithms: Architecture and Programming," Proc. Application Specific Array Processors, pp. 281-295, IEEE CS Press, 1991.
[25] J.J. Navarro, J.M. Llaberia, and M. Valero, "Partitioning: An Essential Step in Mapping Algorithms into Systolic Array Processors," Computer, vol. 20, no. 7, pp. 77-89, July 1987.
[26] J.K. Peir and R. Cytron, "Minimum Distance: A Method for Partitioning Recurrences for Multiprocessors," Proc. Int'l Conf. Parallel Processing, pp. 217-225, 1987.
[27] K.W. Przytula, "Medium Grain Parallel Architecture for Image and Signal Processing," Parallel Architectures and Algorithms for Image Understanding, V.K.P. Kumar, ed., pp. 95-119. Academic Press, 1991.
[28] W. Shang and J.A.B. Fortes, "On Time Mapping of Uniform Dependence Algorithms into Lower Dimensional Processor Arrays," IEEE Trans. Parallel and Distributed Systems, vol. 3, no. 5, pp. 350-363, May 1992.
[29] A. Suarez, J.M. Llaberia, and A. Fernandez, "Scheduling Partitions in Systolic Algorithms," Proc. Application Specific Array Processors, pp. 619-633, IEEE CS Press, Aug. 1992.
[30] J. Symanski and K. Bromley, "Video Analysis Transputer Array (VATA) Processor," Proc. SPIE Real-Time Signal Processing XI, Aug. 1988.
[31] M. Wolf and M. Lam, “A Data Locality Optimizing Algorithm,” Proc. SIGPLAN Conf. Programming Language Design and Implementation, pp. 30-44, June 1991.
[32] X. Zhong and S. Rajopadhye, "Deriving Fully Efficient Systolic Arrays by Quasi-Linear Allocation Functions," Proc. PARLE, pp. 219-235, 1991.

Index Terms:
Access unit, affine dependencies, area index, clock-rate reduction, dependence graph, memory bandwidth, multimesh graph, partitioning, processor array, scheduling, uniform dependencies.
Kumar N. Ganapathy and Benjamin W. Wah, "Designing a Scalable Processor Array for Recurrent Computations," IEEE Transactions on Parallel and Distributed Systems, vol. 8, no. 8, pp. 840-856, Aug. 1997, doi:10.1109/71.605770