Issue No.03 - May/June (2011 vol.8)
pp: 391-403
Refik Samet , Ankara University, Ankara, Turkey
ABSTRACT
This paper proposes the design of specialized hardware, called the Recovery Device, for a dual-redundant computer system that operates in real time. The Recovery Device executes all fault-tolerant services, including fault detection, fault type determination, fault localization, recovery of the system after a temporary (transient) fault, and reconfiguration of the system after a permanent fault. The paper also proposes algorithms for determining the fault type (whether the fault is temporary or permanent) and for localizing the faulty computer without using self-testing techniques or diagnosis routines. Determining the fault type makes it possible to eliminate only a computer with a permanent fault; in other words, it prevents the elimination of a nonfaulty computer because of a short temporary fault. Localizing the faulty computer without self-testing techniques or diagnosis routines shortens the recovery point time period and reduces the probability that a fault will occur during execution of the fault-tolerant procedure, which is very important for real-time fault-tolerant systems. These contributions increase both system performance and the degree of system reliability.
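The distinction between temporary and permanent faults can be illustrated with a small software sketch. This is not the paper's algorithm and not a model of the Recovery Device hardware; it only assumes a common retry scheme in which both computers are rolled back to the same recovery point and re-executed, and a mismatch that disappears on retry is treated as temporary. The names `classify_fault`, `run_a`, `run_b`, and `max_retries` are hypothetical.

```python
from enum import Enum


class FaultType(Enum):
    NONE = "none"
    TEMPORARY = "temporary"
    PERMANENT = "permanent"


def classify_fault(run_a, run_b, recovery_point, max_retries=2):
    """Compare two redundant computations started from one recovery point.

    Assumption (not taken from the paper): a mismatch that vanishes after
    rollback and re-execution is temporary; one that survives every retry
    is permanent. Returns (fault_type, suspect, agreed_output).
    """
    first_a, first_b = run_a(recovery_point), run_b(recovery_point)
    if first_a == first_b:
        return FaultType.NONE, None, first_a

    for _ in range(max_retries):
        # Roll back: both computers restart from the same recovery point.
        out_a, out_b = run_a(recovery_point), run_b(recovery_point)
        if out_a == out_b:
            # The computer whose first output differs from the agreed
            # value is where the temporary fault occurred.
            suspect = "A" if first_a != out_a else "B"
            return FaultType.TEMPORARY, suspect, out_a

    # The mismatch persisted. With only two computers, this simple sketch
    # cannot say which one is faulty, so no suspect is reported.
    return FaultType.PERMANENT, None, None
```

In this sketch, a computer is a candidate for elimination only when the result is `FaultType.PERMANENT`; a temporary fault leaves both computers in service, which matches the motivation stated in the abstract.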
INDEX TERMS
Dual-redundant computer system, fault-tolerant procedure, hardware implementation, real-time, recovery device, recovery point, temporary and permanent faults.
CITATION
Refik Samet, "Recovery Device for Real-Time Dual-Redundant Computer Systems", IEEE Transactions on Dependable and Secure Computing, vol.8, no. 3, pp. 391-403, May/June 2011, doi:10.1109/TDSC.2010.12