Issue No.03 - March (2009 vol.20)
pp: 389-403
Rama Sangireddy, University of Texas at Dallas, Richardson
Hui Wang, University of Texas at Dallas, Richardson
ABSTRACT
In Simultaneous Multithreading (SMT) processors, resource sharing and long-latency instructions from concurrent threads make the instruction scheduling window (IW), a primary shared component among the key pipeline structures, a performance bottleneck. Because its physical size is tightly constrained, the IW is under severe pressure to accommodate instructions from multiple threads while avoiding resource monopolization by low-ILP threads. Optimizing both the efficiency and the fairness of IW utilization, so that SMT delivers the performance it can offer in the presence of long-latency instructions, is particularly challenging. Most existing optimization schemes for SMT processors rely on the fetch policy to control which instructions are allowed to enter the pipeline; little effort has gone into controlling long-latency instructions that already reside in the IW. In this paper, we propose streamline buffers to handle long-latency instructions that have already entered the pipeline and clog the IW while the controlling fetch policies take time to react. Each streamline buffer extracts from the IW, and holds, a chain of instructions from one thread that are stalled by a dependency on a long-latency load.
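To make the extraction step concrete, the following is a minimal software sketch of pulling a load-dependent chain out of a shared window. It is only an illustration under stated assumptions, not the hardware mechanism evaluated in the paper: the names `Instr` and `extract_chain`, the register-name model, and the choice of a FIFO (in-order) buffer are all hypothetical, and the window is assumed to hold only instructions younger than the missing load, in program order.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: owning thread, opcode, registers."""
    thread: int
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...] = ()


def extract_chain(window, thread, load_dest):
    """Split `window` into (remaining, chain) for one long-latency load.

    `chain` receives every instruction of `thread` that depends, directly
    or transitively, on `load_dest`; everything else stays in the window.
    """
    tainted = {load_dest}      # registers whose value waits on the load
    chain = deque()            # assumed FIFO: the chain drains in order
    remaining = []
    for ins in window:
        if ins.thread != thread:
            remaining.append(ins)          # other threads are untouched
        elif tainted.intersection(ins.srcs):
            chain.append(ins)              # stalled behind the load
            if ins.dest is not None:
                tainted.add(ins.dest)      # its result is stalled too
        else:
            # An independent write replaces a stalled value with a ready one.
            tainted.discard(ins.dest)
            remaining.append(ins)
    return remaining, chain


if __name__ == "__main__":
    window = [
        Instr(0, "add", "r2", ("r1", "r3")),   # reads the load result r1
        Instr(1, "mul", "r5", ("r1", "r4")),   # thread 1 has its own r1
        Instr(0, "sub", "r6", ("r7", "r8")),   # independent
        Instr(0, "st", None, ("r2",)),         # depends on the add
    ]
    remaining, chain = extract_chain(window, thread=0, load_dest="r1")
    print([i.op for i in chain])
    print([i.op for i in remaining])
```

In this toy model the two dependent instructions of thread 0 leave the window, freeing entries for the other thread and for thread 0's independent work. A real design would operate on renamed physical registers and wakeup tags rather than register names.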
INDEX TERMS
Multithreaded processors, Speculative multi-threading, Support for multi-threaded execution
CITATION
Rama Sangireddy, Hui Wang, "Optimizing Instruction Scheduling through Combined In-Order and O-O-O Execution in SMT Processors", IEEE Transactions on Parallel & Distributed Systems, vol. 20, no. 3, pp. 389-403, March 2009, doi:10.1109/TPDS.2008.97