Multithreading with Distributed Functional Units
April 1997 (vol. 46 no. 4)
pp. 399-411

Abstract—Multithreaded processors multiplex the execution of a number of concurrent threads onto the hardware in order to hide latencies associated with memory access, synchronization, and arithmetic operations. Conventional multithreading aims to maximize throughput in a single instruction pipeline whose execution stages are served by a collection of centralized functional units. This paper examines a multithreaded microarchitecture where the heterogeneous functional unit set is expanded so that units may be distributed and partly shared across several instruction pipelines operating simultaneously, thereby allowing greater exploitation of interthread parallelism in improving utilization factors of costly resources. The multiple pipeline approach is studied specifically in the Concurro processor architecture—a machine supporting multiple thread contexts and capable of context switching asynchronously in response to dynamic data and resource availability.

Detailed simulations of Concurro processors indicate that instruction throughputs for programs accessing main memory directly can be scaled, without recompilation, from one to over eight instructions per cycle simply by varying the number of pipelines and functional units. In comparison with an equivalent coherent-cache, single-chip multiprocessor, Concurro offers marginally better performance at less than half of the estimated implementation cost. With suitable prefetching, multiple instruction caches can be avoided, and multithreading is shown to obviate the need for sophisticated instruction dispatch mechanisms on parallel workloads. Distribution of functional units results in a 150% improvement over the centralized approach in utilization factors of arithmetic units, and enables saturation of the most critical processor resources.

Index Terms:
Distributed functional units, hardware utilization, latency tolerance, multiple context processors, multithreading, pipelined computers, pre-access instruction cache, simulation, synchronization.
Bernard K. Gunther, "Multithreading with Distributed Functional Units," IEEE Transactions on Computers, vol. 46, no. 4, pp. 399-411, April 1997, doi:10.1109/12.588034