Run-Time Cache Bypassing
December 1999 (vol. 48 no. 12)
pp. 1338-1354

Abstract: The growing disparity between processor and memory performance has made cache misses increasingly expensive. Additionally, data and instruction caches are not always used efficiently, resulting in large numbers of cache misses. Therefore, the importance of cache performance improvements at each level of the memory hierarchy will continue to grow. For numeric programs, there are several known compiler techniques for optimizing data cache performance. However, integer (nonnumeric) programs often have irregular access patterns that are more difficult for the compiler to optimize. In the past, cache management techniques such as cache bypassing were implemented manually at the machine-language-programming level. As the available chip area grows, it makes sense to spend more resources to allow intelligent control over cache management. In this paper, we present an approach to improving cache effectiveness that takes advantage of the growing chip area and uses run-time adaptive cache management techniques, optimizing both performance and cost of implementation. Specifically, we aim to increase data cache effectiveness for integer programs. We propose a microarchitecture scheme where the hardware determines data placement within the cache hierarchy based on dynamic referencing behavior. This scheme is fully compatible with existing Instruction Set Architectures. This paper examines the theoretical upper bounds on the cache hit ratio that cache bypassing can provide for integer applications, including several Windows applications with OS activity. Then, detailed trace-driven simulations of the integer applications are used to show that the implementation described in this paper can achieve performance close to that of the upper bound.
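
To make the idea of cache bypassing concrete, the following is a minimal trace-driven sketch, not the mechanism proposed in the paper. It assumes a tiny fully associative LRU cache and a hypothetical, unbounded table of per-block reference counters; on a miss, the incoming block bypasses the cache when it has been referenced less often than the block it would evict. The names `simulate` and `demo_trace` and the counter-comparison policy are illustrative assumptions.

```python
from collections import OrderedDict, defaultdict


def simulate(trace, capacity, bypass=False):
    """Return the hit ratio of a small fully associative LRU cache.

    Assumption: `counts` stands in for a hardware table that tracks how
    often each block is referenced; real hardware would be finite.
    """
    cache = OrderedDict()      # block address -> None, ordered by recency
    counts = defaultdict(int)  # hypothetical per-block reference counters
    hits = 0
    for block in trace:
        counts[block] += 1
        if block in cache:
            hits += 1
            cache.move_to_end(block)  # mark as most recently used
            continue
        if len(cache) < capacity:
            cache[block] = None       # free slot: always allocate
            continue
        victim = next(iter(cache))    # least recently used block
        if bypass and counts[block] < counts[victim]:
            continue                  # bypass: leave the cache untouched
        del cache[victim]
        cache[block] = None
    return hits / len(trace) if trace else 0.0


def demo_trace(iterations):
    """Two reused blocks (0 and 1) interleaved with single-use blocks."""
    trace = []
    for i in range(iterations):
        trace += [0, 1, 100 + 2 * i, 101 + 2 * i]
    return trace
```

On `demo_trace(50)` with a capacity of two blocks, plain LRU lets the single-use blocks evict the two reused blocks on every iteration, whereas the bypassing variant keeps the reused blocks resident once their counters exceed those of the streaming data.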

Index Terms:
Data cache, cache management, cache bypassing, temporal locality, spatial locality.
Teresa L. Johnson, Daniel A. Connors, Matthew C. Merten, Wen-mei W. Hwu, "Run-Time Cache Bypassing," IEEE Transactions on Computers, vol. 48, no. 12, pp. 1338-1354, Dec. 1999, doi:10.1109/12.817393