Address-Value Delta (AVD) Prediction: A Hardware Technique for Efficiently Parallelizing Dependent Cache Misses
December 2006 (vol. 55 no. 12)
pp. 1491-1508
Onur Mutlu, Hyesoon Kim, and Yale N. Patt
While runahead execution is effective at parallelizing independent long-latency cache misses, it is unable to parallelize dependent long-latency cache misses. To overcome this limitation, this paper proposes a novel hardware technique, address-value delta (AVD) prediction. An AVD predictor keeps track of the address (pointer) load instructions for which the arithmetic difference (i.e., delta) between the effective address and the data value is stable. If such a load instruction incurs a long-latency cache miss during runahead execution, its data value is predicted by subtracting the stable delta from its effective address. This prediction enables the preexecution of dependent instructions, including load instructions that incur long-latency cache misses. We analyze why and for what kind of loads AVD prediction works and describe the design of an implementable AVD predictor. We also describe simple hardware and software optimizations that can significantly improve the benefits of AVD prediction and analyze the interaction of AVD prediction with runahead efficiency techniques and stream-based data prefetching. Our analysis shows that AVD prediction is complementary to these techniques. Our results show that augmenting a runahead processor with a simple, 16-entry AVD predictor improves the average execution time of a set of pointer-intensive applications by 14.3 percent (7.5 percent excluding benchmark health).
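The mechanism described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' hardware design: the abstract specifies only the 16-entry size, the definition of the delta (effective address minus data value), and the prediction rule (effective address minus stable delta). The direct-mapped indexing by load PC, the saturating confidence counter, the threshold values, and the `max_delta` bound are all assumptions made here for illustration, as are the class and method names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AVDEntry:
    tag: Optional[int] = None  # PC of the tracked load (None means empty)
    delta: int = 0             # last observed (effective address - data value)
    confidence: int = 0        # assumed saturating counter: how often delta repeated


class AVDPredictor:
    """Toy software model of an address-value delta predictor (illustrative only)."""

    def __init__(self, entries: int = 16, threshold: int = 2,
                 max_confidence: int = 3, max_delta: int = 64 * 1024) -> None:
        # Only the 16-entry size comes from the abstract; the rest is assumed.
        self.threshold = threshold
        self.max_confidence = max_confidence
        self.max_delta = max_delta
        self.table: List[AVDEntry] = [AVDEntry() for _ in range(entries)]

    def _entry_index(self, pc: int) -> int:
        # Assumed direct-mapped organization indexed by the load's PC.
        return pc % len(self.table)

    def update(self, pc: int, effective_address: int, data_value: int) -> None:
        """Train on a load whose address and loaded value are both known."""
        delta = effective_address - data_value
        index = self._entry_index(pc)
        entry = self.table[index]
        if abs(delta) > self.max_delta:
            # A large delta suggests the value is not a nearby pointer; stop tracking.
            if entry.tag == pc:
                self.table[index] = AVDEntry()
            return
        if entry.tag == pc and entry.delta == delta:
            entry.confidence = min(entry.confidence + 1, self.max_confidence)
        else:
            # New load or changed delta: start over with zero confidence.
            self.table[index] = AVDEntry(tag=pc, delta=delta, confidence=0)

    def predict(self, pc: int, effective_address: int) -> Optional[int]:
        """Predict the value of a missing load, or None if the delta is not stable."""
        entry = self.table[self._entry_index(pc)]
        if entry.tag == pc and entry.confidence >= self.threshold:
            return effective_address - entry.delta
        return None


# Hypothetical pointer-chasing load: the `next` field sits 8 bytes into each
# node, and nodes were allocated back to back, 32 bytes apart.
predictor = AVDPredictor()
LOAD_PC = 0x400
for node in (0x1000, 0x1020, 0x1040):
    predictor.update(LOAD_PC, node + 8, node + 32)

# The load of the next node's `next` field misses in the cache during
# runahead; the predictor supplies the value so dependent loads can proceed.
predicted = predictor.predict(LOAD_PC, 0x1060 + 8)
print(hex(predicted))  # 0x1080
```

In this example the delta is -24 for every node, so after three training updates the predictor returns the address of the following node without waiting for the miss to be serviced. A load whose delta keeps changing never reaches the confidence threshold and yields no prediction.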


Index Terms:
Single data stream architectures, runahead execution, value prediction, memory-level parallelism.
Citation:
Onur Mutlu, Hyesoon Kim, Yale N. Patt, "Address-Value Delta (AVD) Prediction: A Hardware Technique for Efficiently Parallelizing Dependent Cache Misses," IEEE Transactions on Computers, vol. 55, no. 12, pp. 1491-1508, Dec. 2006, doi:10.1109/TC.2006.191