Issue No. 04 - April (1997 vol. 46)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/12.588034
<p><b>Abstract</b>—Multithreaded processors multiplex the execution of a number of concurrent threads onto the hardware in order to hide latencies associated with memory access, synchronization, and arithmetic operations. Conventional multithreading aims to maximize throughput in a single instruction pipeline whose execution stages are served by a collection of centralized functional units. This paper examines a multithreaded microarchitecture where the heterogeneous functional unit set is expanded so that units may be distributed and partly shared across several instruction pipelines operating simultaneously, thereby allowing greater exploitation of interthread parallelism in improving utilization factors of costly resources. The multiple pipeline approach is studied specifically in the Concurro processor architecture—a machine supporting multiple thread contexts and capable of context switching asynchronously in response to dynamic data and resource availability.</p><p>Detailed simulations of Concurro processors indicate that instruction throughputs for programs accessing main memory directly can be scaled, without recompilation, from one to over eight instructions per cycle simply by varying the number of pipelines and functional units. In comparison with an equivalent coherent-cache, single-chip multiprocessor, Concurro offers marginally better performance at less than half of the estimated implementation cost. With suitable prefetching, multiple instruction caches can be avoided, and multithreading is shown to obviate the need for sophisticated instruction dispatch mechanisms on parallel workloads. Distribution of functional units results in a 150% improvement over the centralized approach in utilization factors of arithmetic units, and enables saturation of the most critical processor resources.</p>
Index Terms: Distributed functional units, hardware utilization, latency tolerance, multiple context processors, multithreading, pipelined computers, pre-access instruction cache, simulation, synchronization.
Bernard K. Gunther, "Multithreading with Distributed Functional Units", IEEE Transactions on Computers, vol. 46, no. 4, pp. 399-411, April 1997, doi:10.1109/12.588034