DEPEND: A Simulation-Based Environment for System Level Dependability Analysis
January 1997 (vol. 46 no. 1)
pp. 60-74

Abstract: The paper presents the rationale for a functional simulation tool, called DEPEND, which provides an integrated design and fault injection environment for system level dependability analysis. The paper discusses the issues and problems of developing such a tool and describes how DEPEND tackles them. Techniques developed to simulate realistic fault scenarios, reduce simulation time explosion, and handle the large fault model and component domain associated with system level analysis are presented. Examples are used to motivate and illustrate the benefits of this tool. To further illustrate its capabilities, DEPEND is used to simulate the Unix-based Tandem triple-modular-redundancy (TMR) prototype fault-tolerant system and evaluate how well it handles near-coincident errors caused by correlated and latent faults. Issues such as memory scrubbing, re-integration policies, and workload-dependent repair times, which affect how the system handles near-coincident errors, are also evaluated. Unlike in other simulation-based dependability studies, the accuracy of the simulation model is validated by comparing the simulation results with measurements obtained from fault injection experiments conducted on a production Tandem machine.

Index Terms:
Simulation, fault injection, dependability analysis, correlated errors, latent errors, intercomponent dependence, object-oriented design, Tandem TMR-based prototype analysis, validation.
Kumar K. Goswami, Ravishankar K. Iyer, Luke Young, "DEPEND: A Simulation-Based Environment for System Level Dependability Analysis," IEEE Transactions on Computers, vol. 46, no. 1, pp. 60-74, Jan. 1997, doi:10.1109/12.559803