Issue No.04 - April (2009 vol.20)
pp: 460-473
Zhiling Lan, Illinois Institute of Technology, Chicago
Yawei Li, Illinois Institute of Technology, Chicago
Xian-He Sun, Illinois Institute of Technology, Chicago
As the scale of parallel systems continues to grow, fault management of these systems is becoming a critical challenge. While existing research mainly focuses on developing or improving fault tolerance techniques, a number of key issues remain open. In this paper, we propose runtime strategies for spare node allocation and job rescheduling in response to failure prediction. These strategies, together with a failure predictor and fault tolerance techniques, form a runtime system called FARS (Fault-Aware Runtime System). In particular, we propose a 0-1 knapsack model and demonstrate its flexibility and effectiveness for reallocating running jobs to avoid failures. Experiments using synthetic data and real traces from production systems show that FARS has the potential to significantly improve system productivity (i.e., performance and reliability).
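The abstract names a 0-1 knapsack model for reallocating running jobs but does not give its formulation here. As a rough illustration only, the sketch below shows the textbook 0-1 knapsack solved by dynamic programming, under the assumption that each job needs a fixed number of nodes, carries some scalar benefit if it is reallocated, and that a limited pool of healthy nodes is available. The function name `select_jobs`, the parameters `sizes`, `values`, and `capacity`, and the notion of a per-job "value" are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def select_jobs(sizes: List[int], values: List[float],
                capacity: int) -> Tuple[float, List[int]]:
    """Textbook 0-1 knapsack via dynamic programming.

    sizes[i]  : nodes job i would need (assumed, illustrative)
    values[i] : benefit of reallocating job i (assumed, illustrative)
    capacity  : number of healthy nodes available (assumed, illustrative)

    Returns the best total value and the indices of the chosen jobs.
    """
    n = len(sizes)
    # best[i][c] = best value using only the first i jobs and at most c nodes
    best = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        size, value = sizes[i - 1], values[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]          # skip job i-1
            if size <= c:                        # or take it, if it fits
                candidate = best[i - 1][c - size] + value
                if candidate > best[i][c]:
                    best[i][c] = candidate

    # Walk the table backwards to recover which jobs were taken.
    chosen: List[int] = []
    c = capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= sizes[i - 1]
    chosen.reverse()
    return best[n][capacity], chosen


if __name__ == "__main__":
    total, jobs = select_jobs([4, 3, 2], [10.0, 7.0, 4.0], capacity=5)
    print(total, jobs)
```

With five nodes available, the two smaller jobs together are worth more than the single large job, so the sketch selects them. Each job is either reallocated whole or not at all, which is what makes this the 0-1 variant rather than the fractional one.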
Scheduling, Fault-tolerance, Parallel systems, Performance
Zhiling Lan, Yawei Li, Xian-He Sun, "Fault-Aware Runtime Strategies for High-Performance Computing," IEEE Transactions on Parallel & Distributed Systems, vol. 20, no. 4, pp. 460-473, April 2009, doi:10.1109/TPDS.2008.128