An Efficient Parallel Prefix Sums Architecture with Domino Logic
September 2003 (vol. 14 no. 9)
pp. 922-931

Abstract: The main contribution of this work is an efficient parallel prefix sums architecture based on the recently developed technique of shift switching with domino logic, in which the charge/discharge signals propagate along the switch chain, producing semaphores in a network that is fast and highly hardware-compact. The proposed architecture for computing the prefix sums of $N-1$ bits features a total delay of $(4 \log N + \sqrt{N} - 2)\,T_d$, where $T_d$ is the delay for charging or discharging a row of two prefix sum units of eight shift switches. Our simulation results show that, under 0.8-micron CMOS technology, the delay $T_d$ does not exceed 1 ns. As it turns out, our design is faster than any design known to us for values of $N$ in the range $1 \leq N \leq 2^{10}$. Another important and novel feature of the proposed architecture is that it requires very simple controls, partially driven by the semaphores. This significantly reduces the hardware complexity of the design and fully utilizes the inherent speed of the process.
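For orientation, the following is a minimal software sketch of the two quantities the abstract refers to: the binary prefix sums themselves and the stated delay expression. It is not the shift-switching hardware design. The function names are hypothetical, the logarithm is assumed to be base 2 (the abstract does not state the base), and T_d is set to 1 ns, the upper bound reported for 0.8-micron CMOS.

```python
import math
from itertools import accumulate


def binary_prefix_sums(bits):
    """Sequential reference: return [b0, b0+b1, b0+b1+b2, ...] for 0/1 inputs."""
    return list(accumulate(bits))


def delay_bound_ns(n, t_d_ns=1.0):
    """Evaluate (4 log N + sqrt(N) - 2) * T_d.

    Assumptions (not stated in the abstract): log is base 2;
    t_d_ns defaults to the 1 ns upper bound on T_d.
    """
    return (4 * math.log2(n) + math.sqrt(n) - 2) * t_d_ns


if __name__ == "__main__":
    print(binary_prefix_sums([1, 0, 1, 1]))  # [1, 1, 2, 3]
    print(delay_bound_ns(1024))              # 70.0
```

Under these assumptions, the expression evaluates to at most 70 ns at the upper end of the stated range, N = 2^10.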


Index Terms:
Hardware-algorithms, shift switching, binary prefix sums, binary counting, scalable architectures, VLSI design, domino logic.
Rong Lin, Koji Nakano, Stephan Olariu, Albert Y. Zomaya, "An Efficient Parallel Prefix Sums Architecture with Domino Logic," IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 9, pp. 922-931, Sept. 2003, doi:10.1109/TPDS.2003.1233714