
Rong Lin, Koji Nakano, Stephan Olariu, and Albert Y. Zomaya, "An Efficient Parallel Prefix Sums Architecture with Domino Logic," IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 9, pp. 922–931, September 2003. IEEE Computer Society, Los Alamitos, CA, USA. ISSN 1045-9219. doi: 10.1109/TPDS.2003.1233714.
Index Terms: Hardware algorithms, shift switching, binary prefix sums, binary counting, scalable architectures, VLSI design, domino logic.
Abstract: The main contribution of this work is an efficient parallel prefix sums architecture based on the recently developed technique of shift switching with domino logic, in which the charge/discharge signals propagate along the switch chain, producing semaphores in a network that is fast and highly hardware-compact. The proposed architecture for computing the prefix sums of $N - 1$ bits features a total delay of $(4 \log N + \sqrt{N} - 2) \cdot T_d$, where $T_d$ is the delay for charging or discharging a row of two prefix sum units of eight shift switches. Our simulation results show that, under 0.8-micron CMOS technology, the delay $T_d$ does not exceed 1 ns. As it turns out, our design is faster than any design known to us for values of $N$ in the range $1 \leq N \leq 2^{10}$. Yet another important and novel feature of the proposed architecture is that it requires very simple controls, partially driven by the semaphores. This significantly reduces the hardware complexity of the design and fully utilizes the inherent speed of the process.
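To make the two quantities in the abstract concrete, the following is a minimal software sketch, not the hardware design itself. It computes the prefix sums of a bit sequence (the running count of 1s) and evaluates the stated delay expression in units of $T_d$. The function names are hypothetical, and the sketch assumes that the logarithm is base 2 and that $N$ is an even power of two so that both $\log N$ and $\sqrt{N}$ are integers; the abstract does not state either assumption.

```python
import math
from itertools import accumulate


def binary_prefix_sums(bits):
    """Return the running counts of 1s: out[i] = bits[0] + ... + bits[i]."""
    return list(accumulate(bits))


def total_delay_in_td(n):
    """Evaluate 4*log2(N) + sqrt(N) - 2, the total delay in units of T_d.

    Assumption (not from the abstract): log is base 2 and N is an even
    power of two, so log2(N) and sqrt(N) are both exact integers.
    """
    log_n = n.bit_length() - 1
    root_n = math.isqrt(n)
    if 1 << log_n != n or root_n * root_n != n:
        raise ValueError("this sketch expects N to be an even power of two")
    return 4 * log_n + root_n - 2


bits = [1, 0, 1, 1, 0, 1, 0]           # seven input bits
print(binary_prefix_sums(bits))         # [1, 1, 2, 3, 3, 4, 4]
print(total_delay_in_td(1 << 10))       # 70
```

Under these assumptions, the largest size in the stated range, $N = 2^{10}$, gives $4 \cdot 10 + 32 - 2 = 70$ units of $T_d$, which is at most 70 ns when $T_d$ does not exceed 1 ns.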