Scalability Aspects of Instruction Distribution Algorithms for Clustered Processors
October 2005 (vol. 16 no. 10)
pp. 944-955
Aneesh Aggarwal, IEEE Computer Society
Manoj Franklin, IEEE Computer Society

Abstract—In the evolving submicron technology, wire delays and design complexity are making it particularly attractive to use decentralized designs. A common form of decentralization adopted in processors is to partition the execution core into multiple clusters, each with a small instruction window and a set of functional units. A number of algorithms have been proposed for distributing instructions among the clusters. The first part of this paper analyzes (qualitatively as well as quantitatively) the effect of various hardware parameters—the type of cluster interconnect, the fetch size, the cluster issue width, the cluster window size, and the number of clusters—on the performance of different instruction distribution algorithms. The study shows that the relative performance of the algorithms is very sensitive to these hardware parameters and that the algorithms that perform relatively better with four or fewer clusters are generally not the best ones for a larger number of clusters. This is important, given that with an imminent increase in the transistor budget, more clusters are expected to be integrated on a single chip. The second part of the paper investigates alternate interconnects that provide scalable performance as the number of clusters is increased. In particular, it investigates two hierarchical interconnects—a single ring of crossbars and multiple rings of crossbars—as well as instruction distribution algorithms that take advantage of these interconnects. Our study shows that these new interconnects, with the appropriate distribution techniques, achieve an IPC (instructions per cycle) that is 15-20 percent better than the most scalable existing configuration and is within 2 percent of that achieved by a hypothetical ideal processor having a 1-cycle-latency crossbar interconnect. These results confirm the utility and applicability of hierarchical interconnects and hierarchical distribution algorithms in clustered processors.
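To make the two ideas in the abstract concrete, the sketch below models (a) a simple dependence-plus-load-balance steering heuristic of the kind the paper compares, and (b) a hop-count distance function for a single ring of crossbars. This is an illustrative sketch only, not the paper's exact algorithms; the cluster count, the group size per crossbar, and the `steer`/`ring_hops` interfaces are assumptions chosen for clarity.

```python
# Illustrative sketch (NOT the paper's exact algorithms): a dependence-based
# instruction-steering heuristic with load balancing, plus a hop-count model
# for a single ring of crossbars connecting groups of clusters.

NUM_CLUSTERS = 8          # assumed cluster count
CLUSTERS_PER_CROSSBAR = 4 # assumption: 4 clusters share each crossbar

def ring_hops(src, dst, num_clusters=NUM_CLUSTERS,
              group=CLUSTERS_PER_CROSSBAR):
    """Inter-cluster distance on a single ring of crossbars.

    Clusters under the same crossbar reach each other in one hop;
    otherwise the message crosses its local crossbar and then travels
    the ring of crossbars, taking the shorter direction around.
    """
    g_src, g_dst = src // group, dst // group
    if g_src == g_dst:
        return 1
    n_groups = (num_clusters + group - 1) // group
    d = abs(g_src - g_dst)
    return 1 + min(d, n_groups - d)  # local crossbar hop + ring hops

def steer(instr_srcs, producer_cluster, load):
    """Pick a cluster for an incoming instruction.

    instr_srcs       : source register names of the instruction
    producer_cluster : register name -> cluster producing that value
    load             : pending-instruction count per cluster

    Prefer a cluster that already holds a source operand (avoiding
    inter-cluster forwarding latency); break ties, or fall back, by
    choosing the least-loaded cluster.
    """
    candidates = {producer_cluster[r] for r in instr_srcs
                  if r in producer_cluster}
    if candidates:
        return min(candidates, key=lambda c: load[c])
    return min(range(len(load)), key=lambda c: load[c])
```

The tension the paper quantifies is visible even in this toy: steering toward producers minimizes `ring_hops` penalties but skews `load`, while pure load balancing keeps clusters busy at the cost of more ring traversals, and the right balance shifts as `NUM_CLUSTERS` grows.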


Index Terms:
Clustered processor architecture, pipeline processors, interconnection architectures, load balancing and task assignment.
Aneesh Aggarwal, Manoj Franklin, "Scalability Aspects of Instruction Distribution Algorithms for Clustered Processors," IEEE Transactions on Parallel and Distributed Systems, vol. 16, no. 10, pp. 944-955, Oct. 2005, doi:10.1109/TPDS.2005.128