Reducing Data Cache Susceptibility to Soft Errors
October-December 2006 (vol. 3 no. 4)
pp. 353-364
Data caches are a fundamental component of most modern microprocessors. They provide efficient read/write access to data memory. Errors occurring in the data cache can corrupt data values or state, and can easily propagate throughout the memory hierarchy. One of the main threats to data cache reliability is soft (transient, nonreproducible) errors. These errors can occur more often than hard (permanent) errors, and most often arise from Single Event Upsets (SEUs) caused by strikes from energetic particles such as neutrons and alpha particles. Many protection techniques exist for data caches; the most common are ECC (Error-Correcting Codes) and parity. Both techniques detect all single-bit errors, and ECC also corrects them. To make proper design decisions about which protection technique to use, accurate design-time modeling of cache reliability is crucial. In addition, as caches increase in storage capacity, another important goal is to reduce the failure rate of a cache in order to limit disruption to normal system operation. In this paper, we present our modeling approach for assessing the impact of soft errors using architectural simulators. We also describe a new technique for reducing the vulnerability of data caches: refetching. By selectively refetching cache lines from the ECC-protected L2 cache, we can significantly reduce the vulnerability of the L1 data cache. We discuss and present results for two different algorithms that perform selective refetch. Experimental results show that we can obtain an 85 percent decrease in vulnerability when running the SPEC2K benchmark suite while experiencing only a slight decrease in performance. Our results demonstrate that selective refetch can cost-effectively decrease the error rate of an L1 data cache.
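
The intuition behind refetching can be illustrated with a toy model. The sketch below is not either of the paper's two selective-refetch algorithms and not its simulator-based reliability model; it is a minimal illustration under stated assumptions. It assumes a clean L1 line, assumes a bit flip matters only if it occurs between the most recent refresh of the line (the initial fill or a later refetch from the ECC-protected L2) and a subsequent read, and assumes a hypothetical fixed refetch period. The function name `vulnerable_time` and its parameters are invented for this example.

```python
def vulnerable_time(fill, reads, refetch_period=None):
    """Toy estimate of how long a clean cache line is vulnerable.

    Assumption (not from the paper): a soft error is harmful only if it
    strikes after the line's most recent refresh (fill or refetch) and
    before a later read. Times are in arbitrary cycle units.
    """
    intervals = []
    for r in sorted(reads):
        if refetch_period is None:
            # Without refetch, the only refresh is the initial fill.
            start = fill
        else:
            # Most recent periodic refetch at or before this read.
            start = fill + ((r - fill) // refetch_period) * refetch_period
        intervals.append((start, r))

    # Sum the length of the union of the vulnerable intervals.
    total = 0
    cur_start, cur_end = None, None
    for s, e in sorted(intervals):
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


if __name__ == "__main__":
    reads = [10, 50, 90]
    base = vulnerable_time(0, reads)
    refetched = vulnerable_time(0, reads, refetch_period=40)
    print(base, refetched)
```

In this toy setting, each refetch replaces the L1 copy with a known-good copy, so only the time since the last refetch remains exposed before a read. The actual reduction reported in the abstract comes from the authors' own algorithms and experiments, not from this model.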

Index Terms:
Fault tolerance, reliability, soft errors, error modeling, cache memories, refresh, refetch.
Citation:
Vilas Sridharan, Hossein Asadi, Mehdi B. Tahoori, David Kaeli, "Reducing Data Cache Susceptibility to Soft Errors," IEEE Transactions on Dependable and Secure Computing, vol. 3, no. 4, pp. 353-364, Oct.-Dec. 2006, doi:10.1109/TDSC.2006.55