Implementing Fail-Silent Nodes for Distributed Systems
November 1996 (vol. 45 no. 11)
pp. 1226-1238

Abstract—A fail-silent node is a self-checking node that either functions correctly or stops functioning after an internal failure is detected. Such a node can be constructed from a number of conventional processors. In a software-implemented fail-silent node, the nonfaulty processors of the node need to execute message order and comparison protocols to "keep in step" and check each other, respectively. In this paper, the design and implementation of efficient protocols for a two processor fail-silent node are described in detail. The performance figures obtained indicate that in a wide class of applications requiring a high degree of fault-tolerance, software-implemented fail-silent nodes constructed simply by utilizing standard "off-the-shelf" components are an attractive alternative to their hardware-implemented counterparts that do require special-purpose hardware components, such as fault-tolerant clocks, comparator, and bus interface circuits.
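The fail-silent behaviour described in the abstract can be illustrated with a minimal sketch. This is not the paper's protocol: it assumes the message order protocol has already delivered the same input sequence to both processors, models each processor as a plain function, and replaces the comparison protocol with a direct equality check. The names `FailSilentNode`, `double`, and `faulty_double` are hypothetical and chosen only for illustration.

```python
from typing import Callable, List


class FailSilentNode:
    """Toy two-replica node: emits an output only when both replicas agree."""

    def __init__(self, replica_a: Callable[[int], int],
                 replica_b: Callable[[int], int]) -> None:
        self.replicas = (replica_a, replica_b)
        self.silent = False  # once True, the node never emits again

    def process(self, ordered_inputs: List[int]) -> List[int]:
        """Feed identically ordered inputs to both replicas and compare results.

        Assumption: ordering has already been agreed, which in a real node is
        the job of the message order protocol.
        """
        outputs: List[int] = []
        for message in ordered_inputs:
            if self.silent:
                break
            result_a = self.replicas[0](message)
            result_b = self.replicas[1](message)
            if result_a != result_b:
                # A disagreement signals an internal failure: stop producing output.
                self.silent = True
                break
            outputs.append(result_a)
        return outputs


def double(x: int) -> int:
    """A correct replica computation (hypothetical workload)."""
    return 2 * x


def faulty_double(x: int) -> int:
    """A replica that returns a wrong value for one specific input."""
    return 2 * x + 1 if x == 3 else 2 * x


healthy = FailSilentNode(double, double)
faulty = FailSilentNode(double, faulty_double)
healthy_out = healthy.process([1, 2, 3, 4])
faulty_out = faulty.process([1, 2, 3, 4])
```

In this sketch the healthy node emits a result for every input, while the faulty node emits results only up to the first disagreement and then stays silent for all later inputs. A real implementation would also have to handle timeouts, so that a processor that stops responding is detected as well as one that returns a wrong value.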

Index Terms:
Distributed processing, fault-tolerance, fail-silence, reliability, replicated processing.
Citation:
Francisco V. Brasileiro, Paul Devadoss Ezhilchelvan, Santosh K. Shrivastava, Neil A. Speirs, S. Tao, "Implementing Fail-Silent Nodes for Distributed Systems," IEEE Transactions on Computers, vol. 45, no. 11, pp. 1226-1238, Nov. 1996, doi:10.1109/12.544479