Issue No.03 - March (2009 vol.58)
pp: 380-393
Qin Zheng, Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore
Bharadwaj Veeravalli, National University of Singapore, Singapore
Chen-Khong Tham, National University of Singapore, Singapore
Fault-tolerant scheduling is essential for large-scale computational Grid systems, in which geographically distributed nodes often cooperate to execute a task. The primary-backup approach is a common methodology for fault tolerance, wherein each task has a primary copy and a backup copy on two different processors. In this paper, we identify two cases that may arise when scheduling dependent tasks with the primary-backup approach, and we derive two important constraints that must be satisfied. Further, we show that these two constraints play a crucial role in limiting the schedulability and overloading efficiency of backups of dependent tasks. We then propose two strategies to improve schedulability and overloading efficiency, respectively. We also propose two algorithms, MRC-ECT and MCT-LRC, to schedule backups of independent jobs and dependent jobs, respectively. MRC-ECT is shown to guarantee an optimal backup schedule in terms of replication cost for an independent task, while MCT-LRC can schedule a backup of a dependent task with minimum completion time and lower replication cost. We conduct extensive simulation experiments to quantify the performance of the proposed algorithms.
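The basic primary-backup rule described in the abstract (two copies of each task on two different processors) can be illustrated with a toy sketch. This is not MRC-ECT or MCT-LRC, whose details the abstract does not give. The names `Placement`, `place_with_backup`, and `ready_times` are hypothetical, and the sketch assumes a simple model in which each processor has a single earliest-free time and the backup starts no earlier than the primary's finish time.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    task: str
    role: str        # "primary" or "backup"
    processor: int
    start: float
    finish: float


def place_with_backup(task, duration, ready_times):
    """Place a primary and a backup copy of one task on two different processors.

    ready_times[p] is the earliest time processor p is free (assumed model).
    """
    if len(ready_times) < 2:
        raise ValueError("primary-backup needs at least two processors")

    # Primary copy: the processor that becomes free first.
    p = min(range(len(ready_times)), key=lambda i: ready_times[i])
    primary = Placement(task, "primary", p,
                        ready_times[p], ready_times[p] + duration)

    # Backup copy: must be on a different processor. Assumption: it starts
    # no earlier than the primary's finish, so it runs only after a failure.
    candidates = [i for i in range(len(ready_times)) if i != p]
    b = min(candidates, key=lambda i: max(ready_times[i], primary.finish))
    b_start = max(ready_times[b], primary.finish)
    backup = Placement(task, "backup", b, b_start, b_start + duration)
    return primary, backup


primary, backup = place_with_backup("t1", 4.0, [2.0, 0.0, 10.0])
print(primary)
print(backup)
```

The greedy choice here minimizes only the backup's completion time. The paper's algorithms additionally consider replication cost and task dependencies, which this sketch leaves out.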
Grid computing, directed acyclic graphs, independent tasks, primary-backup, fault-tolerance
Qin Zheng, Bharadwaj Veeravalli, Chen-Khong Tham, "On the Design of Fault-Tolerant Scheduling Strategies Using Primary-Backup Approach for Computational Grids with Low Replication Costs", IEEE Transactions on Computers, vol.58, no. 3, pp. 380-393, March 2009, doi:10.1109/TC.2008.172