This Article 
   
 Share 
   
 Bibliographic References 
   
 Add to: 
 
Digg
Furl
Spurl
Blink
Simpy
Google
Del.icio.us
Y!MyWeb
 
 Search 
   
Rigorous Development of an Embedded Fault-Tolerant System Based on Coordinated Atomic Actions
February 2002 (vol. 51 no. 2)
pp. 164-179

This paper describes our experience using coordinated atomic (CA) actions as a system structuring tool to design and validate a sophisticated and embedded control system for a complex industrial application that has high reliability and safety requirements. Our study is based on an extended production cell model, the specification and simulator for which were defined and developed by FZI (Forschungszentrum Informatik, Germany). This "Fault-Tolerant Production Cell" represents a manufacturing process involving redundant mechanical devices (provided in order to enable continued production in the presence of machine faults). The challenge posed by the model specification is to design a control system that maintains specified safety and liveness properties even in the presence of a large number and variety of device and sensor failures. Based on an analysis of such failures, we provide in this paper details of: 1) a design for a control program that uses CA actions to deal with both safety-related and fault tolerance concerns and 2) the formal verification of this design based on the use of model-checking. We found that CA action structuring facilitated both the design and verification tasks by enabling the various safety problems (involving possible clashes of moving machinery) to be treated independently. Even complex situations involving the concurrent occurrence of any pairs of the many possible mechanical and sensor failures can be handled simply yet appropriately. The formal verification activity was performed in parallel with the design activity and the interaction between them resulted in a combined exercise in "design for validation"; formal verification was very valuable in identifying some very subtle residual bugs in early versions of our design which would have been difficult to detect otherwise.

[1] R.H. Campbell and B. Randell, “Error Recovery in Asynchronous Systems,” IEEE Trans. Software Eng., vol. 12, no. 8, pp. 811-826, Aug. 1986.
[2] E. Canver, D. Schwier, A. Romanovsky, and J. Xu, “Formal Verification of CAA-Based Designs: The Fault-Tolerant Production Cell,” third year report, ESPRIT Long Term Research Project 20072 on Design for Validation, pp. 229-258, Nov. 1998.
[3] S.J. Caughey, M.C. Little, and S.K. Shrivastava, “Checked Transactions in an Asynchronous Message Passing Environment,” Proc. First IEEE Int'l Symp. Object-Oriented Real-Time Distributed Computing, pp. 222-229, Apr. 1998.
[4] R. DeLemos and A. Romanovsky, “Co-Ordinated Atomic Actions in Modelling Object Co-Operation,” Proc. First IEEE Int'l Symp. Object-Oriented Real-Time Distributed Computing, pp. 152-161, Apr. 1998.
[5] E. Emerson, "Temporal and Modal Logic," Handbook of Theoretical Computer Science, J. Leeuwen, ed., chapter 16, pp. 995-1, 072. Elsevier, 1990.
[6] J. Gray, "A Census of Tandem System Availability Between 1985 and 1990," IEEE Trans. Reliability, vol. 39, no. 4, pp. 409-418, Oct. 1990.
[7] K.H. Kim, “Approaches to Mechanization of the Conversation Scheme Based on Monitors,” IEEE Trans. Software Eng., vol. 8, no. 3, pp. 189-197, 1982.
[8] C. Lewerentz and T. Lindner, Formal Development of Reactive Systems: Case Study“Production Cell.” Springer-Verlag, 1995.
[9] P. Liggesmeyer and M. Rothfelder, “Improving System Reliability with Automatic Fault Tree Generation,” Proc. 28th Int'l Symp. Fault-Tolerant Computing, pp. 90-99, June 1998.
[10] A. Lötzbeyer, “Task Description of a Fault-Tolerant Production Cell,” version 1.6, available fromhttp://www.fzi.de/prost/projects/korsyskorsys.html , 1996.
[11] A. Lötzbeyer and R. Mühlfeld, “Task Description of a Flexible Production Cell with Real Time Properties,” FZI Technical Report, (ftp://ftp.fzi.de/pub/PROST/projects/korsys task_descr_flex_2_2.ps), 1996.
[12] G. Matos and E. White, “Application of Dynamic Reconfiguration in the Design of Fault-Tolerant Production Cell,” Proc. Fourth Int'l Conf. Configurable Distributed Systems, pp. 2-9, 1998.
[13] K.L. McMillan, Symbolic Model Checking. Kluwer Academic, 1993.
[14] G.D. Parrington, S.K. Shrivastava, S.M. Wheater, and M.C. Little, “The Design and Implementation of Arjuna,” USENIX Computing Systems J., vol. 8, no. 3, 1995.
[15] B. Randell, “System Structure for Software Fault Tolerance,” IEEE Trans. Software Eng., vol. 1, no. 2, pp. 220-232, 1975.
[16] A. Romanovsky, J. Xu, and B. Randell, “Coordinated Exception Handling in Real-Time Distributed Object Systems,” Int'l J. Computer Systems Science&Eng., vol. 14, no. 4, pp. 197-207, July 1999.
[17] D. Schwier, F. von Henke, J. Xu, R.J. Stroud, A. Romanovsky, and B. Randell, “Formalization of the CA Action Concept Based on Temporal Logic,” second year report, ESPRIT Long Term Research Project 20072 on Design for Validation, pp. 3-15, Dec. 1997.
[18] J. Vachon, D. Buchs, M. Buffo, G.D.M. Serugendo, B. Randell, A. Romanovsky, R.J. Stroud, and J. Xu, “COALA—A Formal Language for Co-Ordinated Atomic Actions,” third year report, ESPRIT Long Term Research Project 20072 on Design for Validation, pp. 43-86, Nov. 1998.
[19] J. Xu, B. Randell, A. Romanovsky, C. Rubira, R.J. Stroud, and Z. Wu, “Fault Tolerance in Concurrent Object-Oriented Software through Coordinated Error Recovery,” Proc. 25th Int'l Symp. Fault-Tolerant Computing, pp. 499-508, June 1995.
[20] J. Xu, A. Romanovsky, and B. Randell, “Coordinated Exception Handling in Distributed Object Systems: From Model to System Implementation,” Proc. Int'l Conf. Distributed Computing Systems, May 1998.
[21] J. Xu, A. Romanovsky, A. Zorzo, B. Randell, R.J. Stroud, and E. Canver, “Developing Control Software for Production Cell II: Failure Analysis and System Design Using CA Actions,” third year report, ESPRIT Long Term Research Project 20072 on Design for Validation, pp. 167-188, Nov. 1998.
[22] J. Xu, B. Randell, A. Romanovsky, R.J. Stroud, A.F. Zorzo, E. Canver, and F. von Henke, “Rigorous Development of a Safety-Critical System Based on Coordinated Atomic Actions,” Proc. 29th Int'l Symp. Fault-Tolerant Computing, pp. 68-75, June 1999.
[23] J. Xu, A. Romanovsky, and B. Randell, “Concurrent Exception Handling and Resolution in Distributed Object Systems,” IEEE Trans. Parallel and Distributed Systems, vol. 11, no. 10, pp. 1019-1032, Oct. 2000.
[24] A.F. Zorzo, A. Romanovsky, J. Xu, B. Randell, R.J. Stroud, and I.S. Welch, “Using Co-Ordinated Atomic Actions to Design Complex Safety-Critical Systems: The Production Cell Case Study,” Software—Practice&Experience, vol. 29, no. 2, pp. 667-697, 1999.

Index Terms:
Concurrency, coordinated atomic (CA) actions, embedded fault-tolerant systems, exception handling, object orientation, formal verification, model checking, reliability, safety.
Citation:
J. Xu, B. Randell, A. Romanovsky, R.J. Stroud, A.F. Zorzo, E. Canver, F. von Henke, "Rigorous Development of an Embedded Fault-Tolerant System Based on Coordinated Atomic Actions," IEEE Transactions on Computers, vol. 51, no. 2, pp. 164-179, Feb. 2002, doi:10.1109/12.980006
Usage of this product signifies your acceptance of the Terms of Use.