Transparent Recovery from Intermittent Faults in Time-Triggered Distributed Systems
February 2003 (vol. 52 no. 2)
pp. 113-125
The time-triggered model, with tasks scheduled in static (offline) fashion, provides a high degree of timing predictability in safety-critical distributed systems. Such systems must also tolerate transient and intermittent failures, which occur far more frequently than permanent ones. Software-based recovery methods using temporal redundancy, such as task reexecution and primary/backup, incur performance overhead but are cost-effective ways of handling these failures. We present a constructive approach to integrating runtime recovery policies in a time-triggered distributed system. The method also provides transparent failure recovery: a processor recovering from task failures does not disrupt the operation of other processors. Given a general task graph with precedence and timing constraints and a specific fault model, the proposed method constructs the corresponding fault-tolerant (FT) schedule with sufficient slack to accommodate recovery. We introduce the cluster-based failure recovery concept, which determines the best placement of slack within the FT schedule so as to minimize the resulting time overhead. Contingency schedules, also generated offline, revise this FT schedule to mask task failures on individual processors while preserving precedence and timing constraints. We present simulation results showing that, for small-scale embedded systems having task graphs of moderate complexity, the proposed approach generates FT schedules that incur about 30-40 percent performance overhead compared to the corresponding non-fault-tolerant ones.
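
To illustrate why the placement of slack affects time overhead, the following is a minimal sketch, not the algorithm from the article. It assumes a single processor, recovery by task reexecution, and a hypothetical fault model of at most k task failures per cluster; the function names, the worst-case execution times, and the cluster groupings are all made up for the example. Under those assumptions, a cluster of sequential tasks can share one slack block sized for k reexecutions of its longest task, instead of reserving k reexecution slots for every task.

```python
from typing import List


def per_task_slack_length(wcets: List[int], k: int) -> int:
    """Schedule length when every task reserves k reexecution slots of its own."""
    return sum(c * (1 + k) for c in wcets)


def shared_slack_length(clusters: List[List[int]], k: int) -> int:
    """Schedule length when each cluster shares one slack block.

    Assumption (hypothetical fault model): at most k task failures occur
    within a cluster, so slack for k reexecutions of the cluster's longest
    task is enough to rerun any k failed tasks in that cluster.
    """
    return sum(sum(cluster) + k * max(cluster) for cluster in clusters)


def overhead(ft_length: int, base_length: int) -> float:
    """Relative time overhead of an FT schedule over the non-FT schedule."""
    return (ft_length - base_length) / base_length


# Illustrative worst-case execution times (arbitrary time units).
wcets = [4, 2, 3, 1]
base = sum(wcets)

per_task = per_task_slack_length(wcets, k=1)
two_clusters = shared_slack_length([[4, 2], [3, 1]], k=1)
one_cluster = shared_slack_length([wcets], k=1)

for name, length in [("per-task slack", per_task),
                     ("two clusters", two_clusters),
                     ("one cluster", one_cluster)]:
    print(f"{name}: length={length}, overhead={overhead(length, base):.0%}")
```

In this toy setting, larger clusters reduce overhead because more tasks share the same slack, but they also weaken the fault model per unit of schedule time and delay recovery, which can conflict with the precedence and timing constraints a real FT schedule must preserve.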

Nagarajan Kandasamy, John P. Hayes, Brian T. Murray, "Transparent Recovery from Intermittent Faults in Time-Triggered Distributed Systems," IEEE Transactions on Computers, vol. 52, no. 2, pp. 113-125, Feb. 2003, doi:10.1109/TC.2003.1