Fault-Containment in Cache Memories for TMR Redundant Processor Systems
April 1999 (vol. 48 no. 4)
pp. 386-397

Abstract: In redundant processor systems, cache data errors read by a processor may cause CPU control-flow errors and force the system into a CPU-cache reintegration process. Reintegration degrades both system performance and reliability. To reduce the occurrence of such events, we propose a real-time error recovery scheme that provides effective fault containment for data errors in cache memories. The scheme is based on broadcasting the data of a dirty cache line after it is modified, and it exploits the redundancy of a fault-tolerant system through hardware voting. The scheme recovers, with full coverage, from erroneous cache data written by a processor. This recovery feature remedies a shortcoming of error-correcting codes, which cannot prevent such errors. In addition, more than 60 percent of cache lines are fully covered for recovery from errors that originate in the cache itself, including ECC-unrecoverable errors. The protocol can also be used to speed up the CPU-cache reintegration process for a temporarily failed processor. The performance overhead of the protocol is limited to broadcasting only 2-3 percent of the total memory references.
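
The voting step behind the broadcast scheme can be illustrated with a minimal sketch. It is not the protocol from the paper: the abstract does not specify the voter design, so the bitwise 2-of-3 majority vote, the dictionary model of a cache, and the names `vote_line` and `broadcast_and_repair` are all assumptions made for illustration. In a real TMR system the voting is done in hardware.

```python
from typing import Dict, List, Tuple


def vote_line(copies: List[bytes]) -> Tuple[bytes, List[int]]:
    """Bitwise 2-of-3 majority vote over three copies of one cache line.

    Returns the voted line and the indices of the copies that differ from it.
    (Hypothetical helper; the paper's hardware voter is not described here.)
    """
    if len(copies) != 3:
        raise ValueError("TMR voting needs exactly three copies")
    a, b, c = copies
    if not (len(a) == len(b) == len(c)):
        raise ValueError("all copies must have the same length")
    # A bit is set in the result when at least two of the three copies set it.
    voted = bytes((x & y) | (y & z) | (x & z) for x, y, z in zip(a, b, c))
    faulty = [i for i, copy in enumerate(copies) if copy != voted]
    return voted, faulty


def broadcast_and_repair(caches: List[Dict[int, bytes]], addr: int) -> List[int]:
    """Model one broadcast of a dirty line: vote, then overwrite bad copies.

    Each cache is modeled as a dict from line address to line data
    (an assumption of this sketch, not a structure from the paper).
    """
    copies = [cache[addr] for cache in caches]
    voted, faulty = vote_line(copies)
    for i in faulty:
        caches[i][addr] = voted  # contain the fault before it is read again
    return faulty
```

For example, if two caches hold `b"\x12\x34"` at some address and the third holds `b"\x12\x30"`, `broadcast_and_repair` reports the third cache as faulty and restores its line to the majority value, so the error is contained without a full reintegration.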

Index Terms:
Caches, error detection and recovery, fault-containment, redundant systems, transient faults.
Chung-Ho Chen, Arun K. Somani, "Fault-Containment in Cache Memories for TMR Redundant Processor Systems," IEEE Transactions on Computers, vol. 48, no. 4, pp. 386-397, April 1999, doi:10.1109/12.762529