Issue No. 06 - June 2012 (vol. 61)
pp. 767-779
Xuejun Yang, National University of Defense Technology, Changsha
Zhiyuan Wang, National University of Defense Technology, Changsha
Jingling Xue, University of New South Wales, Sydney
Yun Zhou, National University of Defense Technology, Changsha
ABSTRACT
Reliability is a key challenge that must be understood in order to turn the vision of exascale supercomputing into reality. Inevitably, large-scale supercomputing systems, especially those at the peta/exascale levels, must tolerate failures by incorporating fault-tolerance mechanisms to improve their reliability and availability. As the benefits of fault-tolerance mechanisms rarely come without associated time and/or capital costs, reliability will limit the scalability of parallel applications. This paper introduces for the first time the concept of the "Reliability Wall" to highlight the significance of achieving scalable performance in peta/exascale supercomputing with fault tolerance. We quantify the effects of reliability on scalability by proposing a reliability speedup, defining the reliability wall quantitatively, giving an existence theorem for the reliability wall, and categorizing a given system according to the time overhead incurred by fault tolerance. We also generalize these results into a general reliability speedup/wall framework by considering not only speedup but also costup. We analyze and extrapolate the existence of the reliability wall using two representative supercomputers, Intrepid and ASCI White, both employing checkpointing for fault tolerance, and also study the general reliability wall using Intrepid. These case studies provide insights into how to mitigate reliability-wall effects in system design and through hardware/software optimizations in peta/exascale supercomputing.
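To make the reliability-wall effect concrete, the sketch below estimates how the effective speedup of a perfectly parallel job behaves as the node count grows once checkpoint/restart overhead is charged against it. This is a minimal numerical sketch, assuming a first-order Young/Daly checkpoint model; all parameters (MTBF_NODE_HOURS, CHECKPOINT_HOURS, RESTART_HOURS) are illustrative assumptions, and the speedup expression here is not the paper's formal definition of reliability speedup.

import math

# Illustrative assumptions, not values or formulas taken from the paper.
MTBF_NODE_HOURS = 25 * 365 * 24   # assumed mean time between failures of one node
CHECKPOINT_HOURS = 0.1            # assumed time to write one checkpoint
RESTART_HOURS = 0.1               # assumed recovery time after a failure

def effective_speedup(nodes):
    """Speedup of a perfectly parallel job on `nodes` nodes after charging
    checkpoint/restart overhead (first-order Young/Daly checkpoint interval)."""
    mtbf_system = MTBF_NODE_HOURS / nodes                     # system MTBF shrinks with scale
    interval = math.sqrt(2 * CHECKPOINT_HOURS * mtbf_system)  # Young/Daly optimal interval
    # Overhead: one checkpoint per interval, plus expected rework (half an
    # interval on average) and a restart for each failure.
    overhead = CHECKPOINT_HOURS / interval + (interval / 2 + RESTART_HOURS) / mtbf_system
    useful_fraction = max(0.0, 1.0 - overhead)
    return nodes * useful_fraction

if __name__ == "__main__":
    for nodes in (1_000, 10_000, 100_000, 1_000_000):
        print(f"{nodes:>9} nodes -> effective speedup {effective_speedup(nodes):,.0f}")

With these assumed numbers, the effective speedup grows almost linearly at small scales, flattens as checkpoint and rework overheads rise with the shrinking system MTBF, and eventually collapses; this is the kind of scalability ceiling that the reliability wall formalizes.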
INDEX TERMS
Fault tolerance, exascale, performance metric, reliability speedup, reliability wall, checkpointing.
CITATION
Xuejun Yang, Zhiyuan Wang, Jingling Xue, Yun Zhou, "The Reliability Wall for Exascale Supercomputing," IEEE Transactions on Computers, vol. 61, no. 6, pp. 767-779, June 2012, doi:10.1109/TC.2011.106