Understanding Why Correlation Profiling Improves the Predictability of Data Cache Misses in Nonnumeric Applications
April 2000 (vol. 49 no. 4)
pp. 369-384

Abstract— Latency-tolerance techniques offer the potential for bridging the ever-increasing speed gap between the memory subsystem and today's high-performance processors. However, to fully exploit the benefit of these techniques, one must be careful to apply them only to the dynamic references that are likely to suffer cache misses—otherwise the runtime overheads can potentially offset any gains. In this paper, we focus on isolating dynamic miss instances in nonnumeric applications, which is a difficult but important problem. Although compilers cannot statically analyze data locality in nonnumeric applications, one viable approach is to use profiling information to measure the actual miss behavior. Unfortunately, the state-of-the-art in cache miss profiling (which we call summary profiling) is inadequate for references with intermediate miss ratios—it either misses opportunities to hide latency, or else inserts overhead that is unnecessary. To overcome this problem, we propose and evaluate a new profiling technique that helps predict which dynamic instances of a static memory reference will hit or miss in the cache: correlation profiling. Our experimental results demonstrate that roughly half of the 21 nonnumeric applications we study can potentially enjoy significant reductions in memory stall time by exploiting at least one of the three forms of correlation profiling we consider: control-flow correlation, self correlation, and global correlation. In addition, our detailed case studies illustrate that self correlation succeeds because a given reference's cache outcomes often contain repeated patterns and control-flow correlation succeeds because cache outcomes are often call-chain dependent. Finally, we suggest a number of ways to exploit correlation profiling in practice and demonstrate that software prefetching can achieve better performance on a modern superscalar processor when directed by correlation profiling rather than summary profiling information.
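As an illustration only (this is not the authors' profiler), the difference between summary profiling and self correlation can be sketched on a made-up hit/miss trace for a single static reference. The function names, the trace, and the history length below are all assumptions chosen for the example. The sketch shows how a reference with an intermediate overall miss ratio can become almost perfectly predictable once its outcomes are grouped by its own recent history.

```python
from collections import defaultdict


def summary_profile(outcomes):
    """Overall miss ratio of one static reference (1 = miss, 0 = hit)."""
    return sum(outcomes) / len(outcomes)


def self_correlation_profile(outcomes, history_len=3):
    """Miss ratio grouped by the reference's own previous outcomes.

    Returns a dict mapping each observed history tuple to the miss
    ratio of the outcomes that immediately followed that history.
    """
    counts = defaultdict(lambda: [0, 0])  # history -> [misses, total]
    for i in range(history_len, len(outcomes)):
        history = tuple(outcomes[i - history_len:i])
        counts[history][0] += outcomes[i]
        counts[history][1] += 1
    return {h: misses / total for h, (misses, total) in counts.items()}


# Hypothetical trace: the reference misses once every four executions.
trace = [1, 0, 0, 0] * 8

print(summary_profile(trace))              # one intermediate ratio
print(self_correlation_profile(trace, 3))  # one ratio per history
```

On this hypothetical trace the summary profile reports a single miss ratio of 0.25, which gives no basis for choosing between always and never applying a latency-tolerance technique. Grouped by the previous three outcomes, every history is followed by either a certain miss or a certain hit. Control-flow and global correlation would group the same outcomes by a different key (for example, the call chain or the outcomes of other references); they are not shown here.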

Index Terms:
Cache performance, cache miss prediction, correlation-based profiling.
Citation:
Todd C. Mowry, Chi-Keung Luk, "Understanding Why Correlation Profiling Improves the Predictability of Data Cache Misses in Nonnumeric Applications," IEEE Transactions on Computers, vol. 49, no. 4, pp. 369-384, April 2000, doi:10.1109/12.844349