Reducing Memory Latency via Read-after-Read Memory Dependence Prediction
March 2002 (vol. 51 no. 3)
pp. 313-326

We observe that typical programs exhibit highly regular read-after-read (RAR) memory dependence streams. To exploit this regularity, we introduce read-after-read (RAR) memory dependence prediction. This technique predicts: 1) whether a load will access a memory location that a preceding load accesses and 2) exactly which preceding load that is. This prediction is made without actual knowledge of the corresponding memory addresses. We also present two techniques that utilize RAR memory dependence prediction to reduce memory latency. In the first technique, a load may obtain a value by naming a preceding load with which an RAR dependence is predicted. The second technique speculatively converts a series of ${\rm LOAD}_1{\hbox{-}}{\rm USE}_1,\ldots,{\rm LOAD_N}{\hbox{-}}{\rm USE_N}$ chains into a single ${\rm LOAD}_1{\hbox{-}}{\rm USE}_1\ldots{\rm USE_N}$ producer/consumer graph. This is done whenever RAR dependences are predicted among the ${\rm LOAD_i}$ instructions. Our techniques can be implemented as small extensions to the previously proposed speculative memory cloaking and speculative memory bypassing, which are based on read-after-write (RAW) dependence prediction. On average, our RAR-based techniques provide correct values for an additional 20 percent (integer codes) and 30 percent (floating-point codes) of all loads. Moreover, a combined RAW- and RAR-based cloaking/bypassing mechanism improves performance by 6.44 percent (integer) and 4.66 percent (floating-point) over a highly aggressive, dynamically scheduled superscalar processor that uses naive memory dependence speculation. By comparison, the original RAW-based cloaking/bypassing mechanism yields improvements of 4.28 percent (integer) and 3.20 percent (floating-point). When no memory dependence speculation is used, our techniques yield speedups of 9.85 percent (integer) and 6.14 percent (floating-point).
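To make the two halves of RAR dependence prediction concrete, the sketch below shows one way to model it in software: at verification time, when a load's address is finally known, the predictor detects that a preceding load read the same location and records the pair by PC; at fetch/decode time, a load can then be predicted to match an earlier load by PC alone, before any address is computed. This is a minimal illustration under assumed structure names and sizes (RARPredictor, last_load_by_addr, predicted_source, a 4K-entry bound), not the hardware tables described in the paper.

```python
# A minimal sketch of RAR (read-after-read) memory dependence prediction,
# for illustration only.  Structure names, sizes, and the software form
# are assumptions; the paper describes hardware prediction structures.

from collections import OrderedDict


class RARPredictor:
    """Learns which preceding load last read each address (detection) and
    remembers (later-load PC -> earlier-load PC) pairs so that future
    instances of the later load can be predicted by PC alone."""

    def __init__(self, table_size=4096):
        self.table_size = table_size
        self.last_load_by_addr = OrderedDict()  # data address -> PC of the
                                                # most recent load to read it
        self.predicted_source = {}              # load PC -> PC of the earlier
                                                # load it is predicted to
                                                # share an address with

    def train(self, pc, addr):
        """Called once a load's address is actually known (verification time)."""
        prev_pc = self.last_load_by_addr.get(addr)
        if prev_pc is not None and prev_pc != pc:
            # An RAR dependence was observed: this load read a location that a
            # preceding load also read.  Record the pair by PC only.
            self.predicted_source[pc] = prev_pc
        # Remember this load as the most recent reader of addr (bounded table,
        # evicting the least recently touched entry).
        self.last_load_by_addr[addr] = pc
        self.last_load_by_addr.move_to_end(addr)
        if len(self.last_load_by_addr) > self.table_size:
            self.last_load_by_addr.popitem(last=False)

    def predict(self, pc):
        """Called at fetch/decode, before the load's address exists.  Returns
        the PC of the earlier load whose value this load is predicted to
        reuse (subject to later verification), or None."""
        return self.predicted_source.get(pc)


# Usage: three static loads, two of which (0x100 and 0x108) happen to read
# the same location.  After one trained pass, the load at 0x108 is predicted
# to depend on the load at 0x100 without computing any address.
if __name__ == "__main__":
    p = RARPredictor()
    for pc, addr in [(0x100, 0xA000), (0x104, 0xB000), (0x108, 0xA000)]:
        p.train(pc, addr)
    assert p.predict(0x108) == 0x100
    assert p.predict(0x104) is None
```

In the cloaking-style use of such a prediction, the later load would speculatively name the earlier load's value (or, for bypassing, its consumers would be linked directly to the earlier producer), and the prediction would be confirmed or squashed once the real address and value become available.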

Index Terms:
memory dependence prediction, load, cache, dynamic optimization
Citation:
A. Moshovos, G.S. Sohi, "Reducing Memory Latency via Read-after-Read Memory Dependence Prediction," IEEE Transactions on Computers, vol. 51, no. 3, pp. 313-326, March 2002, doi:10.1109/12.990129