A Global-State-Triggered Fault Injector for Distributed System Evaluation
July 2004 (vol. 15 no. 7)
pp. 593-605
Abstract—Validation of the dependability of distributed systems via fault injection is gaining importance because distributed systems are being increasingly used in environments with high dependability requirements. The fact that distributed systems can fail in subtle ways that depend on the state of multiple parts of the system suggests that a global-state-based fault injection mechanism should be used to validate them. However, global-state-based fault injection is challenging since it is very difficult in practice to maintain the global state of a distributed system at runtime with minimal intrusion into the system execution. This paper presents Loki, a global-state-based fault injector, which has been designed with the goals of low intrusion, high precision, and high flexibility. Loki achieves these goals by utilizing the ideas of partial view of global state, optimistic synchronization, and offline analysis. In Loki, faults are injected based on a partial view of the global state of the system, and a postruntime analysis is performed to place events and injections into a single global timeline and to discard experiments with incorrect fault injections. Finally, the experiments with correct fault injections are used to estimate user-specified performance and dependability measures. A flexible measure language has been designed that facilitates the specification of a wide range of measures.
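As an illustration of the post-runtime check described in the abstract, the following is a minimal sketch and not Loki's actual implementation. It assumes that each node's local events have already been mapped onto a single global timeline as intervals with clock-uncertainty bounds. An injection is accepted only if every node named in its trigger was certainly in the required state for the whole injection interval. All names here (`Interval`, `StateChange`, `Injection`, `injection_is_correct`) are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Interval:
    """Bounds on when an event occurred on the global timeline."""
    lo: float
    hi: float


@dataclass
class StateChange:
    """A node entering a new local state (hypothetical record format)."""
    node: str
    state: str
    when: Interval


@dataclass
class Injection:
    """A fault injected when the injector believed `required` held."""
    fault: str
    when: Interval
    required: Dict[str, str]  # node name -> state it must be in


def state_spans(changes: List[StateChange], node: str) -> List[Tuple[str, Interval, Interval]]:
    """Return (state, entered, left) spans for one node, in time order.

    The last state of a node is treated as never being left.
    """
    own = sorted((c for c in changes if c.node == node), key=lambda c: c.when.lo)
    never = Interval(float("inf"), float("inf"))
    spans = []
    for i, cur in enumerate(own):
        left = own[i + 1].when if i + 1 < len(own) else never
        spans.append((cur.state, cur.when, left))
    return spans


def injection_is_correct(inj: Injection, changes: List[StateChange]) -> bool:
    """True only if every required state certainly held during the injection.

    "Certainly" means the state was entered no later than the earliest
    possible injection time and left no earlier than the latest possible one.
    """
    for node, wanted in inj.required.items():
        held = any(
            state == wanted
            and entered.hi <= inj.when.lo
            and inj.when.hi <= left.lo
            for state, entered, left in state_spans(changes, node)
        )
        if not held:
            return False
    return True


# Example timeline for two nodes (values are made up for illustration).
changes = [
    StateChange("A", "INIT", Interval(0.0, 0.1)),
    StateChange("A", "LEADER", Interval(1.0, 1.2)),
    StateChange("B", "FOLLOWER", Interval(0.5, 0.6)),
    StateChange("B", "ELECTING", Interval(2.0, 2.3)),
]
trigger = {"A": "LEADER", "B": "FOLLOWER"}
good = Injection("crash", Interval(1.5, 1.6), trigger)
late = Injection("crash", Interval(1.9, 2.1), trigger)

# Keep only the experiments whose injection landed in the intended global state.
kept = [inj for inj in (good, late) if injection_is_correct(inj, changes)]
```

In this sketch the second injection is discarded because node B may already have left `FOLLOWER` by the latest possible injection time. A conservative test of this kind trades some discarded experiments for confidence that the retained ones reflect the intended global state.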
Index Terms:
Distributed systems, reliable systems, system evaluation, fault injection, partial view of global state, offline clock synchronization, measure estimation.
Citation:
Ramesh Chandra, Ryan M. Lefever, Kaustubh R. Joshi, Michel Cukier, William H. Sanders, "A Global-State-Triggered Fault Injector for Distributed System Evaluation," IEEE Transactions on Parallel and Distributed Systems, vol. 15, no. 7, pp. 593-605, July 2004, doi:10.1109/TPDS.2004.14