PERFECTORY: A Fault-Tolerant Directory Memory Architecture
May 2010 (vol. 59 no. 5)
pp. 638-650
Hyunjin Lee, University of Pittsburgh, Pittsburgh
Sangyeun Cho, University of Pittsburgh, Pittsburgh
Bruce R. Childers, University of Pittsburgh, Pittsburgh
The number of CPUs in chip multiprocessors is growing at the rate of Moore's Law, driven by continued technology advances. However, new technologies pose serious reliability challenges, such as more frequent occurrences of degraded or even nonoperational devices, and these challenges threaten the cost-effectiveness and dependability of future computing systems. This work studies how to protect the on-chip coherence directory from faults. In a chip multiprocessor, cache coherence mechanisms such as directory memory are critical for offering a consistent view of data to all CPUs. We propose a novel online fault detection and correction scheme that enhances yield and resilience to runtime errors at a small performance cost. The proposed scheme uses smart encoding and coherence protocol adaptation strategies to salvage faulty directory entries. We also develop an online error recovery scheme that protects the directory memory from soft errors. We call our fault-tolerant directory memory architecture PERFECTORY. Evaluation results show that PERFECTORY achieves very high fault resilience: over 99 percent chip yield at a 0.05 percent hard error ratio, and an MTTF of 1,934 years at 1,000 FIT using a 100-processor cluster configuration. PERFECTORY limits performance degradation to less than 1 percent at a 0.05 percent hard error ratio and requires significantly smaller area overheads than existing redundancy approaches.


Index Terms:
Chip multiprocessor, cache coherence, chip yield, lifetime reliability.
Hyunjin Lee, Sangyeun Cho, Bruce R. Childers, "PERFECTORY: A Fault-Tolerant Directory Memory Architecture," IEEE Transactions on Computers, vol. 59, no. 5, pp. 638-650, May 2010, doi:10.1109/TC.2009.138