Improving Latency Tolerance of Multithreading through Decoupling
October 2001 (vol. 50 no. 10)
pp. 1084-1094

Abstract: The increasing hardware complexity of dynamically scheduled superscalar processors may compromise the ability of this organization to scale and make efficient use of future increases in transistor budget. SMT processors, built on a superscalar core, are therefore directly affected by this problem. This work presents and evaluates a novel processor microarchitecture which combines two paradigms: simultaneous multithreading and access/execute decoupling. Since its decoupled units issue instructions in order, this architecture is significantly less complex, in terms of critical path delays, than a centralized out-of-order design, and it is better suited to future growth in issue width and clock speed. We investigate how the two techniques complement each other. Since decoupling is highly efficient at hiding memory latency, the large amount of parallelism exploited by multithreading may be used to hide the latency of functional units and keep them fully utilized. Our study shows that, by adding decoupling to a multithreaded architecture, fewer threads are needed to achieve maximum throughput. Therefore, in addition to the obvious reduction in hardware complexity, it places lower demands on the memory system. Since one of the problems of multithreading is the degradation of memory system performance, both in terms of miss latency and bandwidth requirements, this improvement becomes critical for high miss latencies, where bandwidth might become a bottleneck. Finally, although it may seem rather surprising, our study reveals that multithreading by itself exhibits little memory latency tolerance. Our results suggest that most of the latency hiding effectiveness of SMT architectures comes from dynamic scheduling. On the other hand, decoupling is very effective at hiding memory latency. An increase in the cache miss penalty from 1 to 32 cycles reduces the performance of a 4-context multithreaded decoupled processor by less than 2 percent. For the nondecoupled multithreaded processor, the loss of performance is about 23 percent.
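
The latency-hiding argument in the abstract can be illustrated with a deliberately small model. The sketch below is not the authors' simulator and does not reproduce the figures quoted above; it assumes a single thread, one load per dependent operation, single-issue units, and a fixed latency for every load. The function names and the `queue_depth` parameter are hypothetical, introduced only for this illustration.

```python
def cycles_coupled(n_loads, miss_latency):
    """Toy in-order core: each dependent op waits for its own load.

    Assumption: the next load cannot issue until the previous
    load's consumer has issued, so latencies are fully exposed.
    """
    return n_loads * (miss_latency + 1)


def cycles_decoupled(n_loads, miss_latency, queue_depth):
    """Toy access/execute pair joined by a bounded load queue.

    The access unit issues one load per cycle and may run ahead of the
    execute unit by at most `queue_depth` outstanding entries. The
    execute unit consumes one entry per cycle, in order, once its data
    has arrived.
    """
    if n_loads == 0:
        return 0
    issue = []    # cycle in which the access unit issues load i
    consume = []  # cycle in which the execute unit consumes load i
    for i in range(n_loads):
        t = issue[-1] + 1 if issue else 0
        if i >= queue_depth:
            # Wait for the execute unit to free a queue slot.
            t = max(t, consume[i - queue_depth] + 1)
        issue.append(t)
        ready = t + miss_latency
        consume.append(max(ready, consume[-1] + 1) if consume else ready)
    return consume[-1] + 1


if __name__ == "__main__":
    n = 1000
    for lat in (1, 32):
        print(lat, cycles_coupled(n, lat), cycles_decoupled(n, lat, queue_depth=64))
```

In this toy setting, as long as the queue is deeper than the miss latency, the decoupled pair pays the latency once at start-up rather than once per load, which is the qualitative effect the abstract attributes to decoupling. A shallow queue (try `queue_depth=4`) exposes the latency again.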

Index Terms:
Access/execute decoupling, simultaneous multithreading, latency hiding, instruction-level parallelism, hardware complexity.
Joan-Manuel Parcerisa, Antonio González, "Improving Latency Tolerance of Multithreading through Decoupling," IEEE Transactions on Computers, vol. 50, no. 10, pp. 1084-1094, Oct. 2001, doi:10.1109/12.956093