Issue No.04 - April (2009 vol.58)
pp: 496-511
Fernando Castro, Universidad Complutense de Madrid, Madrid
Regana Noor, University of Rochester, Rochester
Alok Garg, University of Rochester, Rochester
Daniel Chaver, Universidad Complutense de Madrid, Madrid
Michael C. Huang, University of Rochester, Rochester
Luis Piñuel, Universidad Complutense de Madrid, Madrid
Manuel Prieto, Universidad Complutense de Madrid, Madrid
Francisco Tirado, Universidad Complutense de Madrid, Madrid
ABSTRACT
One of the main challenges of modern processor design is implementing a scalable and efficient mechanism to detect memory access order violations caused by out-of-order execution. Traditional age-ordered associative load queues (LQs) are complex, inefficient, and power hungry. In this paper, we introduce two new dependence checking schemes that make different design tradeoffs but both rely explicitly on timing information as the primary instrument to rule out dependence violations. Our timing-centric designs operate at a fraction of the energy cost of an associative LQ and achieve the same functionality with an insignificant performance impact on average. Studies with parallel benchmarks also show that they are equally effective and efficient in a chip-multiprocessor environment.
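As a rough illustration of how timing information can rule out a dependence violation, the sketch below checks one store against a set of loads and skips the address comparison whenever a younger load executed after the store's address was resolved. It is not one of the two schemes proposed in the paper; the `Load` and `Store` records, the cycle fields, and the function name are hypothetical, and the sketch assumes that a load executing after an older store has resolved its address observes that store's value.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Load:
    age: int         # program-order sequence number (smaller = older); hypothetical field
    addr: int        # effective address of the load
    exec_cycle: int  # cycle in which the load read its value


@dataclass
class Store:
    age: int            # program-order sequence number
    addr: int           # effective address of the store
    resolve_cycle: int  # cycle in which the store's address became known


def timing_filtered_check(store: Store, loads: List[Load]) -> Tuple[List[Load], int]:
    """Return (violating loads, number of address comparisons performed).

    Assumption: a younger load that executed strictly after the store's
    address was resolved has already observed the store, so timing alone
    rules out a violation and no address comparison is needed.
    """
    violations = []
    comparisons = 0
    for ld in loads:
        if ld.age <= store.age:
            continue  # older loads cannot violate ordering with this store
        if ld.exec_cycle > store.resolve_cycle:
            continue  # ruled out by timing, no address comparison
        comparisons += 1
        if ld.addr == store.addr:
            violations.append(ld)  # load ran too early and read a stale value
    return violations, comparisons


store = Store(age=5, addr=0x100, resolve_cycle=20)
loads = [
    Load(age=3, addr=0x100, exec_cycle=5),   # older than the store: ignored
    Load(age=6, addr=0x100, exec_cycle=10),  # younger, ran early, same address
    Load(age=7, addr=0x100, exec_cycle=25),  # younger, ran after resolution
    Load(age=8, addr=0x200, exec_cycle=12),  # younger, ran early, other address
]
violations, comparisons = timing_filtered_check(store, loads)
```

In this toy input only the load with age 6 is flagged, and only two of the four loads need an address comparison. The comparison count stands in, very loosely, for the associative search work that a timing-based filter is meant to avoid.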
INDEX TERMS
LSQ, memory disambiguation, energy efficiency.
CITATION
Fernando Castro, Regana Noor, Alok Garg, Daniel Chaver, Michael C. Huang, Luis Piñuel, Manuel Prieto, Francisco Tirado, "Replacing Associative Load Queues: A Timing-Centric Approach", IEEE Transactions on Computers, vol. 58, no. 4, pp. 496-511, April 2009, doi:10.1109/TC.2008.146