Beating In-Order Stalls with "Flea-Flicker" Two-Pass Pipelining
January 2006 (vol. 55 no. 1)
pp. 18-33
While compilers have generally proven adept at planning useful static instruction-level parallelism for in-order microarchitectures, the efficient accommodation of unanticipable latencies, like those of load instructions, remains a vexing problem. Traditional out-of-order execution hides some of these latencies, but repeats scheduling work already done by the compiler and adds pipeline overhead. Other techniques, such as prefetching and multithreading, can hide some anticipable, long-latency misses, but not the shorter, more diffuse stalls due to difficult-to-anticipate first- or second-level misses. Our work proposes a microarchitectural technique, two-pass pipelining, whereby the program executes on two in-order back-end pipelines coupled by a queue. The "advance" pipeline often defers instructions dispatching with unready operands rather than stalling. The "backup" pipeline allows concurrent resolution of instructions deferred by the first pipeline, allowing overlapping of useful "advanced" execution with miss resolution. An accompanying compiler technique and instruction marking further enhance the handling of miss latencies. Applying our technique to an Itanium 2-like design achieves a speedup of 1.38× in mcf, the most memory-intensive SPECint2000 benchmark, and an average of 1.12× across other selected benchmarks, yielding between 32 percent and 67 percent of an idealized out-of-order design's speedup at a much lower design cost and complexity.
Index Terms:
Runahead execution, out-of-order execution, prefetching, cache-miss tolerance.
Citation:
Ronald D. Barnes, John W. Sias, Erik M. Nystrom, Sanjay J. Patel, Jose (Nacho) Navarro, Wen-mei W. Hwu, "Beating In-Order Stalls with 'Flea-Flicker' Two-Pass Pipelining," IEEE Transactions on Computers, vol. 55, no. 1, pp. 18-33, Jan. 2006, doi:10.1109/TC.2006.4