Low-Cost Error Containment and Recovery for Onboard Guarded Software Upgrading and Beyond
February 2002 (vol. 51 no. 2)
pp. 121-137

Message-driven confidence-driven (MDCD) error containment and recovery, a low-cost approach to mitigating the effects of software design faults in distributed embedded systems, is developed for onboard guarded software upgrading for deep-space missions. In this paper, we first describe and verify the MDCD algorithms, in which we introduce the notion of "confidence-driven" to complement the "communication-induced" approach employed by a number of existing checkpointing protocols to achieve error containment and recovery efficiency. We then conduct a model-based analysis to show that the algorithms ensure low performance overhead. Finally, we discuss the advantages of the MDCD approach and its potential utility as a general-purpose, low-cost software fault tolerance technique for distributed embedded computing.
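
The abstract gives no algorithmic detail, so the following is only a toy sketch of the general idea it names: a checkpoint is triggered by message receipt (message-driven) and only when the sender's state is not yet trusted (confidence-driven). It is not the MDCD algorithm from the paper. The names `Message`, `Process`, `suspect`, `low_confidence`, and `rollback`, and the rule of checkpointing only on the first suspect message, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    payload: int
    sender_suspect: bool  # hypothetical flag: sender's state may be contaminated


class Process:
    """Toy process; every name and rule here is an illustrative assumption."""

    def __init__(self, name, low_confidence=False):
        self.name = name
        self.low_confidence = low_confidence  # e.g. a newly upgraded version
        self.suspect = low_confidence         # state is potentially contaminated
        self.state = 0
        self.checkpoints = []

    def send(self, payload):
        # Outgoing messages carry the sender's current confidence status.
        return Message(self.name, payload, self.suspect)

    def receive(self, msg):
        # Checkpoint only when a trusted state is about to absorb data from
        # a suspect sender; otherwise no checkpointing cost is paid.
        if msg.sender_suspect and not self.suspect:
            self.checkpoints.append(self.state)
            self.suspect = True
        self.state += msg.payload

    def rollback(self):
        # Recovery: return to the last state saved before contamination.
        if self.checkpoints:
            self.state = self.checkpoints.pop()
            self.suspect = self.low_confidence


upgraded = Process("upgraded", low_confidence=True)
peer = Process("peer")

peer.receive(Message("trusted", 5, False))  # trusted sender: no checkpoint
peer.receive(upgraded.send(7))              # first suspect message: checkpoint
peer.receive(upgraded.send(1))              # already suspect: no new checkpoint
peer.rollback()                             # error detected: restore saved state
```

In this sketch the peer saves its state once, immediately before the first message from the low-confidence process, which hints at how limiting checkpoints to such moments could keep overhead low.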

Index Terms:
Guarded software upgrading, message-driven confidence-driven, global state consistency and recoverability, performance overhead, software fault tolerance, distributed embedded systems.
Citation:
A.T. Tai, K.S. Tso, L. Alkalai, S.N. Chau, W.H. Sanders, "Low-Cost Error Containment and Recovery for Onboard Guarded Software Upgrading and Beyond," IEEE Transactions on Computers, vol. 51, no. 2, pp. 121-137, Feb. 2002, doi:10.1109/12.980004