Effect of Fault Tolerance on Response Time-Analysis of the Primary Site Approach
April 1992 (vol. 41 no. 4)
pp. 420-428

The effect of the primary site approach for fault tolerance on the response time is studied. In the primary site approach, the service to be made fault tolerant is replicated at many nodes, one of which is designated as the primary and the others as backups. All requests for operations on the data object are sent to the primary site. If the primary fails, one of the backups takes over as primary. The primary site periodically checkpoints its state on the backups. An analytical model is presented for studying the average response time of the primary site system and for analyzing the effects of the checkpointing frequency and the degree of replication on the response time. This model is used to compare the response time of the system with that of a system without any fault tolerance.
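
The mechanism described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' model or implementation: the class names, the `checkpoint_every` parameter, and the choice to checkpoint after a fixed number of operations (rather than on a timer) are assumptions made purely for illustration. The sketch also assumes that a backup taking over resumes from the last checkpointed state.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One node holding a copy of the data object (hypothetical structure)."""
    name: str
    state: dict = field(default_factory=dict)
    alive: bool = True


class PrimarySiteService:
    """Toy primary site scheme: one primary, all other replicas are backups."""

    def __init__(self, names, checkpoint_every):
        self.replicas = [Replica(n) for n in names]
        self.primary = self.replicas[0]
        # Assumption: checkpoint after this many operations.
        self.checkpoint_every = checkpoint_every
        self.ops_since_checkpoint = 0

    def backups(self):
        """Return the live replicas other than the current primary."""
        return [r for r in self.replicas if r.alive and r is not self.primary]

    def request(self, key, value):
        """All requests for operations are served by the primary."""
        self.primary.state[key] = value
        self.ops_since_checkpoint += 1
        if self.ops_since_checkpoint >= self.checkpoint_every:
            self.checkpoint()

    def checkpoint(self):
        """Copy the primary's state to every live backup."""
        for backup in self.backups():
            backup.state = dict(self.primary.state)
        self.ops_since_checkpoint = 0

    def fail_primary(self):
        """Mark the primary as failed and promote the first live backup."""
        self.primary.alive = False
        candidates = self.backups()
        if not candidates:
            raise RuntimeError("no backup left to take over")
        self.primary = candidates[0]
        self.ops_since_checkpoint = 0


svc = PrimarySiteService(["a", "b", "c"], checkpoint_every=2)
svc.request("x", 1)
svc.request("y", 2)   # second operation triggers a checkpoint
svc.request("z", 3)   # not yet checkpointed
svc.fail_primary()    # backup "b" takes over from the last checkpoint
```

In this toy version, a smaller `checkpoint_every` means more copying work per request but less state lost at takeover, and each additional backup adds one more copy per checkpoint. This only hints informally at the trade-off between checkpointing frequency and degree of replication that the paper analyzes.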

Index Terms:
fault tolerance; response time; primary site approach; backups; analytical model; primary site system; checkpointing frequency; fault tolerant computing.
Citation:
Y. Huang, P. Jalote, "Effect of Fault Tolerance on Response Time-Analysis of the Primary Site Approach," IEEE Transactions on Computers, vol. 41, no. 4, pp. 420-428, April 1992, doi:10.1109/12.135555