Issue No.11 - Nov. (2012 vol.61)
pp: 1535-1550
Yoav Etsion, Israel Institute of Technology, Technion City
Dror G. Feitelson, Hebrew University, Jerusalem
ABSTRACT
Locality is often characterized by working sets, defined by Denning as the set of distinct addresses referenced within a certain window of time. This definition ignores the dramatic differences between the usage patterns of frequently used data and transient data. We therefore propose to extend Denning's definition with that of core working sets, which identify the blocks that are used most frequently and for the longest time. The concept of a core motivates the design of dual-cache structures that provide special treatment for the core. In particular, we present a probabilistic locality predictor for L1 caches that leverages the skewed popularity of blocks to distinguish transient cache insertions from more persistent ones. We further present a dual L1 design that inserts only frequently used blocks into a low-latency, low-power, direct-mapped main cache, while serving others from a small fully associative filter. To reduce the prohibitive cost of such a filter, we present a content addressable memory design that eliminates most of the costly lookups using a small auxiliary lookup table. The proposed design enables a 16K direct-mapped L1 cache, augmented with a small 2K filter, to outperform a 32K 4-way cache, while consuming 70-80 percent less dynamic power and 40 percent less static power.
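The dual L1 idea described in the abstract can be illustrated with a small simulation. The sketch below is only one plausible reading, not the authors' design: it assumes a 64-byte block, an LRU-managed filter, a promotion rule in which each reference that misses the main cache is sampled with a fixed probability, and that a block evicted from the main cache is simply dropped. The class name `DualL1Sketch` and the parameters `promote_prob`, `main_sets`, and `filter_blocks` are hypothetical names introduced here. It also omits the auxiliary lookup table and any power modeling.

```python
import random
from collections import OrderedDict


class DualL1Sketch:
    """Toy model: direct-mapped main cache plus a small fully associative filter.

    Assumptions (not taken from the paper): LRU filter replacement, a fixed
    sampling probability, and no write handling or latency model.
    """

    def __init__(self, main_sets, filter_blocks, block_size=64,
                 promote_prob=0.05, seed=0):
        self.main = [None] * main_sets      # one block number per set
        self.filter = OrderedDict()         # keys kept in LRU order
        self.filter_blocks = filter_blocks
        self.block_size = block_size
        self.promote_prob = promote_prob
        self.rng = random.Random(seed)      # seeded for repeatable runs
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Simulate one reference; return 'main', 'filter', or 'miss'."""
        block = addr // self.block_size
        idx = block % len(self.main)
        if self.main[idx] == block:
            self.hits += 1
            return "main"

        in_filter = block in self.filter
        if in_filter:
            self.hits += 1
        else:
            self.misses += 1
        outcome = "filter" if in_filter else "miss"

        # Random sampling: popular blocks are referenced often, so they are
        # likely to be sampled eventually; transient blocks rarely are.
        if self.rng.random() < self.promote_prob:
            self.filter.pop(block, None)
            self.main[idx] = block          # overwrites the previous occupant
            return outcome

        # Not sampled: keep (or place) the block in the filter, most recent last.
        self.filter[block] = True
        self.filter.move_to_end(block)
        if len(self.filter) > self.filter_blocks:
            self.filter.popitem(last=False)  # evict least recently used
        return outcome


if __name__ == "__main__":
    # 16K main cache / 64 B blocks = 256 sets; 2K filter / 64 B = 32 blocks.
    cache = DualL1Sketch(main_sets=256, filter_blocks=32)
    trace_rng = random.Random(1)
    hot = [i * 64 for i in range(16)]        # small, heavily reused set
    for _ in range(10000):
        if trace_rng.random() < 0.8:
            cache.access(trace_rng.choice(hot))
        else:
            cache.access(trace_rng.randrange(1 << 20))  # transient reference
    print("hit rate:", cache.hits / (cache.hits + cache.misses))
```

With a skewed trace like the one in the usage section, the heavily reused blocks tend to reach the main cache after a few references, while one-off references pass through the filter without displacing them. The sampling probability of 0.05 is an arbitrary illustrative value.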
INDEX TERMS
Measurement, Histograms, Probabilistic logic, Power demand, Benchmark testing, Educational institutions, System performance, cache insertion policy, dual-cache design, cache filtering, Core working sets, random insertion policy, mass-count disparity, L1 cache
CITATION
Yoav Etsion, Dror G. Feitelson, "Exploiting Core Working Sets to Filter the L1 Cache with Random Sampling", IEEE Transactions on Computers, vol.61, no. 11, pp. 1535-1550, Nov. 2012, doi:10.1109/TC.2011.197