Issue No.10 - October (2009 vol.58)
pp: 1307-1320
Enrique Torres, University of Zaragoza, Spain
Pablo Ibáñez, University of Zaragoza, Spain
Víctor Viñals-Yúfera, University of Zaragoza, Spain
José M. Llabería, Polytechnic University of Catalonia, Barcelona
ABSTRACT
This paper focuses on how to design a Store Buffer (STB) well suited to first-level multibanked data caches. The goal is to forward data from in-flight stores to dependent loads within the latency of a cache bank. Taking into account the store lifetime in the processor pipeline and the data forwarding behavior, we propose a two-level STB design in which forwarding is done speculatively from a distributed first-level STB made of extremely small banks, whereas a centralized second-level STB enforces correct store-load ordering. In addition, the two-level STB admits two simplifications that leave performance almost unchanged. For the second-level STB, we suggest removing its data forwarding capability; for the first-level STB, it is possible to: 1) remove the instruction age checking and 2) compare only the least significant address bits. Experimentation covers both integer and floating-point codes executing in dynamically scheduled processors. Following our guidelines and running SPEC-2K over an 8-way processor, a two-level STB with four 8-entry banks in the first level performs similarly to an ideal, single-level STB with 128-entry banks working at the first-level cache latency. We also show that the proposed two-level design is suitable for a memory-latency-tolerant processor.
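The two-level organization described in the abstract can be illustrated with a small behavioral sketch. It is not the authors' implementation: the class and function names, the number of low address bits compared (`LOW_BITS`), the bank-selection function, the point at which stores enter each level, and the omission of store retirement are all assumptions made for illustration. Only the four 8-entry first-level banks, the lack of age checking and the partial address comparison in the first level, and the non-forwarding second level come from the abstract.

```python
from collections import deque
from dataclasses import dataclass

NUM_BANKS = 4      # from the abstract: four first-level banks
BANK_ENTRIES = 8   # from the abstract: 8 entries per bank
LOW_BITS = 8       # assumption: how many low address bits the first level compares


@dataclass
class Store:
    age: int   # program-order sequence number (smaller = older)
    addr: int
    data: int


def bank_of(addr):
    # Assumption: banks are interleaved on 8-byte words.
    return (addr >> 3) % NUM_BANKS


class TwoLevelSTB:
    """Hypothetical behavioral model; store retirement is not modeled."""

    def __init__(self):
        # First level: one tiny FIFO per cache bank (oldest entry is evicted).
        self.stb1 = [deque(maxlen=BANK_ENTRIES) for _ in range(NUM_BANKS)]
        # Second level: centralized list of all in-flight stores.
        self.stb2 = []

    def insert_store(self, store):
        self.stb1[bank_of(store.addr)].append(store)
        self.stb2.append(store)

    def speculative_forward(self, load_addr):
        """First level: partial address match, no age check.

        Returns the most recently inserted matching store, or None.
        """
        mask = (1 << LOW_BITS) - 1
        for store in reversed(self.stb1[bank_of(load_addr)]):
            if (store.addr & mask) == (load_addr & mask):
                return store
        return None

    def verify(self, load_age, load_addr, forwarded):
        """Second level: full address and age check, no data forwarding.

        Returns True if the speculative result was the right one.
        """
        older = [s for s in self.stb2
                 if s.age < load_age and s.addr == load_addr]
        correct = max(older, key=lambda s: s.age) if older else None
        return forwarded is correct
```

In this sketch a load first calls `speculative_forward` and later calls `verify`. A `False` result models a misspeculation, for example when two addresses share their low bits or when the matching store is younger than the load; how the processor recovers from that is outside the scope of the sketch.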
INDEX TERMS
Cache memories, computer architecture, memory architecture, pipeline processing.
CITATION
Enrique Torres, Pablo Ibáñez, Víctor Viñals-Yúfera, José M. Llabería, "Store Buffer Design for Multibanked Data Caches", IEEE Transactions on Computers, vol.58, no. 10, pp. 1307-1320, October 2009, doi:10.1109/TC.2009.57