Enlarging Instruction Streams
October 2007 (vol. 56 no. 10)
pp. 1342-1357

Abstract: The stream fetch engine is a high-performance fetch architecture based on the concept of an instruction stream. We define a stream as a sequence of instructions from the target of a taken branch to the next taken branch, potentially containing multiple basic blocks. The large size of instruction streams makes it possible for the stream fetch engine to provide high fetch bandwidth and to hide the branch predictor access latency, leading to performance close to that of a trace cache at lower implementation cost and complexity. Therefore, enlarging instruction streams is an effective way to improve the stream fetch engine. In this paper, we present several hardware and software mechanisms focused on enlarging those streams that end at particular branch types. However, our results show that focusing on particular branch types is not a good strategy due to Amdahl's law. Consequently, we propose the multiple stream predictor, a novel mechanism that deals with all branch types by combining single streams into long virtual streams. This proposal tolerates the prediction table access latency without requiring the complexity of additional hardware mechanisms such as prediction overriding. Moreover, it delivers performance comparable to that of state-of-the-art fetch architectures, but with a simpler design that consumes less energy.
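
To make the stream definition above concrete, the following is a minimal sketch that splits a dynamic instruction trace into streams. It is an illustration only, not the authors' implementation: the trace format (a list of `(address, is_taken_branch)` pairs) and the names `Stream` and `split_into_streams` are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Stream:
    """Hypothetical stream record: start address plus instruction count."""
    start: int   # address of the first instruction (target of a taken branch)
    length: int  # instructions in the stream, including the closing taken branch


def split_into_streams(trace: Iterable[Tuple[int, bool]]) -> List[Stream]:
    """Split a dynamic trace into streams.

    Each trace entry is (address, is_taken_branch). A stream runs from the
    target of one taken branch up to and including the next taken branch.
    Not-taken branches do not end a stream, so one stream may span several
    basic blocks.
    """
    streams: List[Stream] = []
    start = None
    length = 0
    for addr, taken in trace:
        if start is None:
            start = addr          # first instruction after a taken branch
        length += 1
        if taken:                 # a taken branch closes the current stream
            streams.append(Stream(start, length))
            start, length = None, 0
    if start is not None:         # trailing instructions with no closing branch
        streams.append(Stream(start, length))
    return streams


if __name__ == "__main__":
    # Assumed toy trace: four sequential instructions ending in a taken
    # branch to 0x200, two instructions ending in a taken branch back to
    # 0x100, and one final instruction.
    toy_trace = [
        (0x100, False), (0x104, False), (0x108, False), (0x10C, True),
        (0x200, False), (0x204, True),
        (0x100, False),
    ]
    for s in split_into_streams(toy_trace):
        print(hex(s.start), s.length)
```

In this sketch a stream is identified by its start address and length, so longer streams mean fewer records are needed to cover the same number of fetched instructions.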

Index Terms:
Superscalar processor design, instruction fetch, branch prediction, access latency, code optimization
Oliverio J. Santana, Alex Ramirez, Mateo Valero, "Enlarging Instruction Streams," IEEE Transactions on Computers, vol. 56, no. 10, pp. 1342-1357, Oct. 2007, doi:10.1109/TC.2007.70742