The Effect of Program Behavior on Fault Observability
August 1996 (vol. 45 no. 8)
pp. 868-880

Abstract: This paper studies fault observability based on the behavior of memory references. Traditional studies view memory as one monolithic entity that must work completely to be considered reliable. Here the emphasis is on the usage patterns of a particular program's memory. The paper develops a new model for the successful execution of a program that takes the usage of the data into account by extending a cache memory performance model. Three variations are presented, based on well-known allocation schemes: the program's storage is preallocated, dynamically allocated, or constrained in allocation. The model is contrasted with traditional memory reliability calculations to show that the actual mean time to failure may be more optimistic when program behavior is considered. The paper also develops expressions for the probability of unobserved faults. Several studies have reported correlations between increased workloads and increased failure rates, and a new theory is proposed here to explain this behavior. Applying the model to several program traces shows that increased workloads could raise observed failure rates by 32% to 53%.
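
The contrast between the monolithic view and a usage-aware view can be illustrated with a toy calculation. The sketch below is not the paper's cache-based model. It assumes independent words with a constant (exponential) fault rate, and it assumes that a fault is observed only if it lies in a word the program actually references. The function names and the `used_fraction` parameter are hypothetical, introduced only for illustration.

```python
def mttf_monolithic(words: int, fault_rate: float) -> float:
    """MTTF when a fault in any word counts as a failure.

    Assumption: each word fails independently at a constant rate,
    so the whole memory fails at rate words * fault_rate.
    """
    if words <= 0 or fault_rate <= 0:
        raise ValueError("words and fault_rate must be positive")
    return 1.0 / (words * fault_rate)


def mttf_observed(words: int, fault_rate: float, used_fraction: float) -> float:
    """MTTF when only faults in referenced words are observed.

    Assumption (illustrative only): the program touches a fixed
    fraction of memory, and faults elsewhere go unobserved.
    """
    if not 0.0 < used_fraction <= 1.0:
        raise ValueError("used_fraction must be in (0, 1]")
    return mttf_monolithic(words, fault_rate) / used_fraction


if __name__ == "__main__":
    total_words = 1_000_000
    rate = 1e-9  # hypothetical faults per word per hour
    print(mttf_monolithic(total_words, rate))
    print(mttf_observed(total_words, rate, used_fraction=0.25))
```

Under these assumptions, a program that references a quarter of memory sees a mean time to failure four times the monolithic figure. Raising `used_fraction`, as a heavier workload plausibly would, lowers the observed figure again, which is the qualitative effect the abstract describes.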

Index Terms:
Program behavior, fault tolerance, memory reliability, unobserved faults, system reliability.
Citation:
Nicholas S. Bowen, Dhiraj K. Pradhan, "The Effect of Program Behavior on Fault Observability," IEEE Transactions on Computers, vol. 45, no. 8, pp. 868-880, Aug. 1996, doi:10.1109/12.536230