Scalable Hardware-Algorithms for Binary Prefix Sums
August 2000 (vol. 11 no. 8)
pp. 838-850

Abstract: In this work, we address the problem of designing efficient and scalable hardware-algorithms for computing the sum and prefix sums of a $w^k$-bit ($k\geq 2$) sequence, using as basic building blocks linear arrays of at most $w^2$ shift switches, where $w$ is a small power of $2$. An immediate consequence of this feature is that, in our designs, broadcasts are limited to buses of length at most $w^2$. We adopt a VLSI delay model in which the “length” of a bus is proportional to the number of devices on the bus. We begin by discussing a hardware-algorithm that computes the sum of a $w^k$-bit binary sequence in the time of $2k-2$ broadcasts, while the corresponding prefix sums can be computed in the time of $3k-4$ broadcasts. Quite remarkably, even though our hardware-algorithm uses only linear arrays of size at most $w^2$, the total number of broadcasts involved is less than three times the number required by an “ideal” design. We then propose a second hardware-algorithm, operating in pipelined fashion, that computes the sum of a $kw^k$-bit binary sequence in the time of $3k+\lceil\log_w k\rceil -3$ broadcasts. Using this design, the corresponding prefix sums can be computed in the time of $4k+\lceil\log_w k\rceil -5$ broadcasts.
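As an illustration only, the following minimal Python sketch shows what the binary prefix sums of a sequence are and evaluates the broadcast counts quoted in the abstract for given $w$ and $k$. It is a software reference, not the shift-switch hardware design; the function names `binary_prefix_sums`, `ceil_log`, and `broadcast_counts` are hypothetical and do not come from the paper.

```python
from itertools import accumulate


def binary_prefix_sums(bits):
    """Software reference: prefix sums b0, b0+b1, ... of a 0/1 sequence."""
    return list(accumulate(bits))


def ceil_log(w, k):
    """Smallest integer e >= 0 with w**e >= k, i.e. ceil(log_w k).

    Uses integer arithmetic only, to avoid floating-point rounding.
    """
    e, power = 0, 1
    while power < k:
        power *= w
        e += 1
    return e


def broadcast_counts(w, k):
    """Broadcast counts as stated in the abstract (requires k >= 2).

    The first two entries are for a w**k-bit input; the pipelined
    entries are for a k * w**k-bit input.
    """
    if k < 2:
        raise ValueError("the abstract assumes k >= 2")
    log_term = ceil_log(w, k)
    return {
        "sum": 2 * k - 2,
        "prefix_sums": 3 * k - 4,
        "pipelined_sum": 3 * k + log_term - 3,
        "pipelined_prefix_sums": 4 * k + log_term - 5,
    }


if __name__ == "__main__":
    print(binary_prefix_sums([1, 0, 1, 1]))
    print(broadcast_counts(w=4, k=3))
```

For example, with $w=4$ and $k=3$ the formulas give 4 broadcasts for the sum and 5 for the prefix sums of a 64-bit input, and 7 and 8 broadcasts, respectively, for the pipelined design on a 192-bit input.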

Index Terms:
Hardware-algorithms, shift switching, binary prefix sums, binary counting, scalable architectures, pipelining.
R. Lin, K. Nakano, S. Olariu, M.C. Pinotti, J.L. Schwing, A.Y. Zomaya, "Scalable Hardware-Algorithms for Binary Prefix Sums," IEEE Transactions on Parallel and Distributed Systems, vol. 11, no. 8, pp. 838-850, Aug. 2000, doi:10.1109/71.877941