Enhancing Memory-Level Parallelism via Recovery-Free Value Prediction
July 2005 (vol. 54 no. 7)
pp. 897-912
The ever-increasing computational power of contemporary microprocessors significantly reduces the execution time spent on arithmetic computations (i.e., computations that do not involve slow memory operations such as cache misses). For memory-intensive workloads, it therefore becomes more important to overlap multiple cache misses with one another than to overlap slow memory operations with other computations. In this paper, we propose a novel technique to parallelize sequential cache misses, thereby increasing memory-level parallelism (MLP). Our idea is based on value prediction, which was originally proposed as an instruction-level parallelism (ILP) optimization to break true data dependencies. We advocate value prediction for its ability to enhance MLP rather than ILP. We propose using value prediction and value-speculative execution only for prefetching, which avoids the complex prediction validation and misprediction recovery mechanisms and also achieves better performance for memory-intensive workloads. The minor hardware modifications that are required also enable aggressive memory disambiguation for prefetching. The experimental results show that our technique enhances MLP effectively and achieves significant speedups, even with a simple stride value predictor.
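
The idea of using a stride value predictor only to generate prefetches can be illustrated with a small software model. The sketch below is not the hardware design evaluated in the paper; the class names, the table indexed by load PC, the two-observation warm-up, and the example addresses are all assumptions made for illustration. It models a pointer-chasing load whose returned value is the address of the next node, and it uses each predicted value solely as a prefetch address, so a wrong guess never needs validation or recovery.

```python
from dataclasses import dataclass, field


@dataclass
class StrideEntry:
    """Per-load state (hypothetical layout): last value, stride, observation count."""
    last_value: int = 0
    stride: int = 0
    seen: int = 0


@dataclass
class StrideValuePredictor:
    """Toy stride value predictor indexed by load PC (an illustrative assumption)."""
    table: dict = field(default_factory=dict)

    def predict(self, pc):
        entry = self.table.get(pc)
        # Two observations are needed before a stride is known.
        if entry is None or entry.seen < 2:
            return None
        return entry.last_value + entry.stride

    def train(self, pc, value):
        entry = self.table.setdefault(pc, StrideEntry())
        if entry.seen > 0:
            entry.stride = value - entry.last_value
        entry.last_value = value
        entry.seen += 1


def prefetch_addresses(loaded_values, pc=0x400):
    """Return the prefetch addresses issued for one load over a value trace.

    Each predicted value is used only as a prefetch address. It is never
    written to architectural state, so no misprediction recovery is modeled.
    """
    predictor = StrideValuePredictor()
    prefetches = []
    for actual in loaded_values:
        guess = predictor.predict(pc)
        if guess is not None:
            prefetches.append(guess)  # speculative prefetch only
        predictor.train(pc, actual)   # train with the value the load returns
    return prefetches


# Next-node pointers of a list whose nodes happen to sit 0x40 bytes apart.
trace = [0x1000, 0x1040, 0x1080, 0x10C0]
print([hex(a) for a in prefetch_addresses(trace)])
```

In this toy trace, the predictor warms up on the first two values and then issues prefetches for the third and fourth nodes before the dependent loads return, which is the sense in which otherwise sequential misses could be overlapped. A mispredicted value in this model would only cause a useless prefetch.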

Index Terms:
Single data stream architectures.
Citation:
Huiyang Zhou, Thomas M. Conte, "Enhancing Memory-Level Parallelism via Recovery-Free Value Prediction," IEEE Transactions on Computers, vol. 54, no. 7, pp. 897-912, July 2005, doi:10.1109/TC.2005.117