A High-Bandwidth Memory Pipeline for Wide Issue Processors
July 2001 (vol. 50 no. 7)
pp. 709-723

Abstract: Providing adequate data bandwidth is essential if a future wide-issue processor is to achieve its full performance potential. Adding a large number of ports to a data cache, however, becomes increasingly inefficient and can significantly increase hardware complexity. This paper takes an alternative or complementary approach to providing more data bandwidth, called data decoupling. In particular, it studies an interesting yet less explored behavior of memory access instructions called access region locality, which concerns each static memory instruction and the range of locations it accesses at runtime. Our experimental study using a set of SPEC95 benchmark programs shows that most memory access instructions reference a single region at runtime. It also shows that the access region of a memory instruction can be predicted accurately at runtime from the instruction's addressing mode and its past access history. We describe and evaluate a wide-issue superscalar processor with two distinct sets of memory pipelines and caches, driven by the access region predictor. Experimental results indicate that the proposed mechanism is very effective in providing high memory bandwidth to the processor, yielding performance comparable to or better than that of a conventional memory design with a heavily multiported data cache, which can incur much higher hardware complexity.
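
The prediction idea in the abstract (use the addressing mode first, then fall back on the instruction's past access history) can be illustrated with a minimal sketch. It is not the paper's design: the split into stack and non-stack regions, the register names `sp` and `fp`, the table size, the one-entry "last region" history, and all class and function names below are assumptions chosen for illustration.

```python
from enum import Enum


class Region(Enum):
    STACK = "stack"
    NON_STACK = "non_stack"


# Assumption: accesses whose base register is the stack or frame pointer
# are treated as stack accesses without consulting any history.
STACK_BASE_REGS = {"sp", "fp"}


def classify(addr, stack_low, stack_high):
    """Return the region an address actually falls in (hypothetical bounds)."""
    return Region.STACK if stack_low <= addr < stack_high else Region.NON_STACK


class AccessRegionPredictor:
    """Per-instruction predictor: addressing mode first, then last-seen region."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.table = {}  # table index -> last observed Region

    def _index(self, pc):
        # Drop the two low bits (assumes 4-byte instructions), then wrap.
        return (pc >> 2) % self.entries

    def predict(self, pc, base_reg):
        if base_reg in STACK_BASE_REGS:
            return Region.STACK
        # No addressing-mode hint: reuse the region seen last time,
        # defaulting to NON_STACK for instructions with no history.
        return self.table.get(self._index(pc), Region.NON_STACK)

    def update(self, pc, actual):
        # Record the region resolved once the effective address is known.
        self.table[self._index(pc)] = actual


predictor = AccessRegionPredictor(entries=16)
first_guess = predictor.predict(0x400, "r5")
predictor.update(0x400, classify(0x7FFF0010, 0x7FFF0000, 0x80000000))
second_guess = predictor.predict(0x400, "r5")
print(first_guess, second_guess)
```

In a design like the one the abstract describes, such a prediction would steer each memory instruction to one of the two memory pipelines before its address is computed; a misprediction would then have to be detected and recovered from, which this sketch does not model.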

Index Terms:
Data bandwidth, data locality, instruction level parallelism, runtime stack, data stream partitioning, multiported data cache.
Sangyeun Cho, Pen-Chung Yew, Gyungho Lee, "A High-Bandwidth Memory Pipeline for Wide Issue Processors," IEEE Transactions on Computers, vol. 50, no. 7, pp. 709-723, July 2001, doi:10.1109/12.936237