A Nonpreemptive Real-Time Scheduler with Recovery from Transient Faults and Its Implementation
August 2003 (vol. 29 no. 8)
pp. 752-767
Daniel Mossé, IEEE Computer Society

Abstract: Real-time systems (RTS) are those whose correctness depends on satisfying both the required functional properties and the required temporal properties. Due to the criticality of such systems, recovery from faults is an essential part of an RTS. In many systems, such as those supporting space applications, single event upsets (SEUs) are the prevalent type of fault; SEUs are transient faults and affect a single task at a time. This paper presents a scheme to guarantee that the execution of real-time tasks can tolerate SEUs and intermittent faults, assuming any queue-based scheduling technique. Three algorithms are presented to solve the problem of adding fault tolerance to a queue of real-time tasks by reserving sufficient slack in a schedule so that recovery can be carried out before the task deadline without compromising guarantees given to other tasks. The first algorithm is an optimal dynamic programming solution, the second is a linear-time heuristic for scheduling dynamic tasks, and the third comprises extensions to address queues with gaps between tasks (gaps are caused by precedence, resource, or timing constraints). We show through simulations that the heuristics closely approximate the optimal algorithm. Finally, the paper describes the implementation of the modified admission control algorithm, the nonpreemptive scheduler, and a recovery mechanism in the FT-RT-Mach operating system.
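
To make the idea of reserving slack concrete, the sketch below checks whether a nonpreemptive queue can absorb one transient fault without any task missing its deadline. It is an illustration only and not one of the paper's three algorithms: the `Task` fields, the function name `tolerates_single_fault`, and the fault model (at most one fault in the whole queue, recovery run immediately after the faulty task, no gaps between tasks) are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    """Hypothetical task record; field names are assumptions for this sketch."""
    name: str
    exec_time: int      # worst-case execution time
    recovery_time: int  # time needed to recover this task after a fault
    deadline: int       # absolute deadline


def tolerates_single_fault(queue: List[Task], start: int = 0) -> bool:
    """Return True if every task meets its deadline even when one task
    at or before it in the queue faults and runs its recovery.

    Assumes tasks run back to back in queue order with no preemption.
    A fault in an earlier task delays every later task by that task's
    recovery time, so each task must leave room for the largest
    recovery time seen so far (including its own).
    """
    finish = start
    worst_recovery = 0
    for task in queue:
        finish += task.exec_time
        worst_recovery = max(worst_recovery, task.recovery_time)
        if finish + worst_recovery > task.deadline:
            return False
    return True


if __name__ == "__main__":
    ok_queue = [Task("a", 2, 2, 5), Task("b", 3, 3, 8)]
    tight_queue = [Task("a", 2, 2, 5), Task("b", 3, 3, 7)]
    print(tolerates_single_fault(ok_queue))
    print(tolerates_single_fault(tight_queue))
```

The check runs in a single pass over the queue, so it could serve as a simple admission test under these assumptions: a new task is accepted only if the queue with the task appended still passes.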

Index Terms:
Fault tolerance, operating system, real-time, scheduling, transient faults.
Daniel Mossé, Rami Melhem, Sunondo Ghosh, "A Nonpreemptive Real-Time Scheduler with Recovery from Transient Faults and Its Implementation," IEEE Transactions on Software Engineering, vol. 29, no. 8, pp. 752-767, Aug. 2003, doi:10.1109/TSE.2003.1223648