Issue No. 12 - December (2008 vol. 57)
pp. 1647-1660
Zhiling Lan, Illinois Institute of Technology, Chicago
Yawei Li, Illinois Institute of Technology, Chicago
As the scale of high-performance computing (HPC) grows, application fault resilience becomes increasingly important. In this paper, we propose FT-Pro, an adaptive fault management approach that combines the merits of reactive checkpointing and proactive migration. It enables parallel applications to avoid anticipated failures through preventive migration and, in the case of unforeseeable failures, to minimize their impact through selective checkpointing. An adaptation manager is designed to make runtime decisions in response to failure predictions. We evaluate FT-Pro through stochastic modeling and case studies with real applications under a wide range of settings. Preliminary results indicate that FT-Pro outperforms periodic checkpointing by up to 43%, both in reducing application completion time and in improving resource utilization.
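To make the idea of a runtime adaptation decision concrete, the following is a minimal sketch, not the FT-Pro algorithm itself. It assumes a simple expected-cost comparison among three actions (migrate, checkpoint, or skip). The names `Costs` and `choose_action`, the cost model, and all parameters are hypothetical and introduced here only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Costs:
    """Hypothetical per-action overheads, in seconds."""
    checkpoint: float  # time to write one checkpoint
    migration: float   # time to move processes off a suspect node
    recovery: float    # time to restart from the last checkpoint


def choose_action(p_fail: float, work_at_risk: float,
                  spare_nodes: int, costs: Costs) -> str:
    """Pick the action with the lowest expected cost.

    Assumed model (illustrative only):
      skip       -> a failure loses the unsaved work plus recovery time
      checkpoint -> pay the checkpoint now; a failure costs only recovery
      migrate    -> pay the migration cost; the predicted failure is avoided
    """
    expected = {
        "skip": p_fail * (work_at_risk + costs.recovery),
        "checkpoint": costs.checkpoint + p_fail * costs.recovery,
    }
    # Migration is only an option when a spare node is available.
    if spare_nodes > 0:
        expected["migrate"] = costs.migration
    return min(expected, key=expected.get)


costs = Costs(checkpoint=60.0, migration=30.0, recovery=120.0)
print(choose_action(0.8, 3600.0, 1, costs))
print(choose_action(0.8, 3600.0, 0, costs))
print(choose_action(0.01, 600.0, 1, costs))
```

Under these assumed costs, a likely failure with a spare node available favors migration, the same prediction without a spare node falls back to checkpointing, and a very unlikely failure makes skipping the cheapest choice.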
Fault tolerance, Performance evaluation of algorithms and systems
Zhiling Lan, Yawei Li, "Adaptive Fault Management of Parallel Applications for High-Performance Computing", IEEE Transactions on Computers, vol. 57, no. 12, pp. 1647-1660, December 2008, doi:10.1109/TC.2008.90