Scalar Operand Networks
February 2005 (vol. 16 no. 2)
pp. 145-162

Abstract—The bypass paths and multiported register files in microprocessors serve as an implicit interconnect to communicate operand values among pipeline stages and multiple ALUs. Previous superscalar designs implemented this interconnect using centralized structures that do not scale with increasing ILP demands. In search of scalability, recent microprocessor designs in industry and academia exhibit a trend toward distributed resources such as partitioned register files, banked caches, multiple independent compute pipelines, and even multiple program counters. Some of these partitioned microprocessor designs have begun to implement bypassing and operand transport using point-to-point interconnects. We call interconnects optimized for scalar data transport, whether centralized or distributed, scalar operand networks. Although these networks share many of the challenges of multiprocessor networks such as scalability and deadlock avoidance, they have many unique requirements, including ultra-low latency (a few cycles versus tens of cycles) and ultra-fast operation-operand matching. This paper discusses the unique properties of scalar operand networks (SONs), examines alternative ways of implementing them, and introduces the AsTrO taxonomy to distinguish between them. It discusses the design of two alternative networks in the context of the Raw microprocessor, and presents timing, area, and energy statistics for a real implementation. The paper also presents a 5-tuple performance model for SONs and analyzes their performance sensitivity to network properties for ILP workloads.

Index Terms:
Interconnection architectures, distributed architectures, microprocessors.
Michael Bedford Taylor, Walter Lee, Saman P. Amarasinghe, Anant Agarwal, "Scalar Operand Networks," IEEE Transactions on Parallel and Distributed Systems, vol. 16, no. 2, pp. 145-162, Feb. 2005, doi:10.1109/TPDS.2005.24