Xuejun Yang, Zhiyuan Wang, Jingling Xue, Yun Zhou, "The Reliability Wall for Exascale Supercomputing," IEEE Transactions on Computers, vol. 61, no. 6, pp. 767-779, June 2012. ISSN 0018-9340. DOI: 10.1109/TC.2011.106. IEEE Computer Society, Los Alamitos, CA, USA.
Keywords: fault tolerance, exascale, performance metric, reliability speedup, reliability wall, checkpointing.

[1] D.B. Kothe, “Science Prospects and Benefits with Exascale Computing,” Technical Report ORNL/TM-2007/232, Oak Ridge Nat'l Laboratory, 2007.
[2] H. Simon, T. Zacharia, and R. Stevens, “Modeling and Simulation at the Exascale for Energy and the Environment,” http://www.sc.doe.gov/ascr/ProgramDocuments ProgDocs.html, 2011.
[3] J. Stearley, “Defining and Measuring Supercomputer Reliability, Availability, and Serviceability (RAS),” Proc. Linux Clusters Inst. Conf., 2005.
[4] F. Cappello, A. Geist, B. Gropp, L. Kale, B. Kramer, and M. Snir, “Toward Exascale Resilience,” Int'l J. High Performance Computing Applications, vol. 23, pp. 374-388, Nov. 2009.
[5] N. DeBardeleben, J. Laros, J. Daly, S. Scott, C. Engelmann, and B. Harrod, “High-End Computing Resilience: Analysis of Issues Facing the HEC Community and Path-Forward for Research and Development,” White Paper, http://www.csm.ornl.gov/~engelman/publications debardeleben09high-end.pdf, 2009.
[6] E.N. Elnozahy, R. Bianchini, T. El-Ghazawi, A. Fox, F. Godfrey, A. Hoisie, K. McKinley, R. Melhem, J. Plank, P. Ranganathan, and J. Simons, “System Resilience at Extreme Scale,” white paper, Defense Advanced Research Projects Agency (DARPA), 2008.
[7] E.N.M. Elnozahy, L. Alvisi, Y.-M. Wang, and D.B. Johnson, “A Survey of Rollback-Recovery Protocols in Message-Passing Systems,” ACM Computing Surveys, vol. 34, no. 3, pp. 375-408, 2002.
[8] S. Chakravorty, “A Fault Tolerance Protocol for Fast Recovery,” PhD dissertation, Univ. of Illinois at Urbana-Champaign, 2008.
[9] D. Scott, “HW & SW Challenges and Trends to Reach Exascale,” HPCChina '09: Proc. High Performance Computing of China, 2009.
[10] D.A. Wood and M.D. Hill, “Cost-Effective Parallel Computing,” Computer, vol. 28, no. 2, pp. 69-72, Feb. 1995.
[11] G.M. Amdahl, “Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities,” AFIPS '67 (Spring): Proc. Spring Joint Computer Conf., pp. 483-485, 1967.
[12] J.L. Gustafson, “Reevaluating Amdahl's Law,” Multiprocessor Performance Measurement and Evaluation, pp. 92-93. IEEE Computer Society Press, 1995.
[13] X.H. Sun and L.M. Ni, “Scalable Problems and Memory-Bounded Speedup,” J. Parallel and Distributed Computing, vol. 19, no. 1, pp. 27-37, 1993.
[14] X.H. Sun and D.T. Rover, “Scalability of Parallel Algorithm-Machine Combinations,” IEEE Trans. Parallel and Distributed Systems, vol. 5, no. 6, pp. 599-613, June 1994.
[15] D.B. Johnson, “Distributed System Fault Tolerance Using Message Logging and Checkpointing,” PhD dissertation, Rice Univ., 1990.
[16] A. Bouteiller, T. Herault, G. Krawezik, P. Lemarinier, and F. Cappello, “MPICH-V: A Multiprotocol Fault Tolerant MPI,” Int'l J. High Performance Computing Applications, vol. 20, no. 3, pp. 319-333, 2006.
[17] G. Bronevetsky, D. Marques, K. Pingali, and P. Stodghill, “Automated Application-Level Checkpointing of MPI Programs,” PPoPP '03: Proc. Ninth ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, pp. 84-94, 2003.
[18] A. Beguelin, E. Seligman, and P. Stephan, “Application Level Fault Tolerance in Heterogeneous Networks of Workstations,” J. Parallel and Distributed Computing, vol. 43, no. 2, pp. 147-155, 1997.
[19] Z. Chen, G.E. Fagg, E. Gabriel, J. Langou, T. Angskun, G. Bosilca, and J. Dongarra, “Fault Tolerant High Performance Computing by a Coding Approach,” PPoPP '05: Proc. 10th ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, pp. 213-223, 2005.
[20] M. Beck, J.S. Plank, and G. Kingsley, “Compiler-Assisted Checkpointing,” technical report, Univ. of Tennessee, Knoxville, 1994.
[21] J.S. Plank, M. Beck, and G. Kingsley, “Compiler-Assisted Memory Exclusion for Fast Checkpointing,” IEEE Technical Committee on Operating Systems and Application Environments, vol. 7, no. 4, pp. 10-14, Winter 1995.
[22] J. Li and W.K. Fuchs, “CATCH - Compiler Assisted Techniques for Checkpointing,” FTCS-20: Proc. 20th Int'l Symp. Fault-Tolerant Computing, pp. 74-81, 1990.
[23] J.S. Plank, M. Beck, G. Kingsley, and K. Li, “Libckpt: Transparent Checkpointing under Unix,” technical report, Univ. of Tennessee, Knoxville, 1994.
[24] C.D. Lu, “Scalable Diskless Checkpointing for Large Parallel Systems,” PhD dissertation, Univ. of Illinois at Urbana-Champaign, 2005.
[25] J.S. Plank, K. Li, and M.A. Puening, “Diskless Checkpointing,” IEEE Trans. Parallel and Distributed Systems, vol. 9, no. 10, pp. 972-986, Oct. 1998.
[26] D.A. Patterson, G. Gibson, and R.H. Katz, “A Case for Redundant Arrays of Inexpensive Disks (RAID),” SIGMOD '88: Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 109-116, 1988.
[27] S. Lin and D.J. Costello, Error Control Coding, second ed. Prentice-Hall, Inc., 2004.
[28] T.V. Ramabadran and S.S. Gaitonde, “A Tutorial on CRC Computations,” IEEE Micro, vol. 8, no. 4, pp. 62-75, Aug. 1988.
[29] O. Wolfson, S. Jajodia, and Y. Huang, “An Adaptive Data Replication Algorithm,” ACM Trans. Database Systems, vol. 22, no. 2, pp. 255-314, 1997.
[30] L. Mancini and M. Koutny, “Formal Specification of N-Modular Redundancy,” CSC '86: Proc. ACM 14th Ann. Conf. Computer Science, pp. 199-204, 1986.
[31] W. Rudin, Principles of Mathematical Analysis, third ed. R.R. Donnelley & Sons, 1976.
[32] M. Wu, X.-H. Sun, and H. Jin, “Performance under Failures of High-End Computing,” Proc. ACM/IEEE Conf. Supercomputing, pp. 48:1-48:11, 2007.
[33] Los Alamos Nat'l Laboratory, “Operational Data to Support and Enable Computer Science Research,” http://institute.lanl.gov/datalanldata.shtml , 2011.
[34] R. Gupta, H. Naik, and P. Beckman, “Understanding Checkpointing Overheads on Massive-Scale Systems: Analysis on the IBM Blue Gene/P System,” Int'l J. High Performance Computing Applications, vol. 25, no. 2, May 2011.
[35] UK High-End Computing, “Overview of the Advanced Simulation and Computing Program,” http://www.ukhec.ac.uk/ publications/ reportsasci.pdf, 2011.
[36] B. Schroeder and G.A. Gibson, “A Large-Scale Study of Failures in High-Performance Computing Systems,” Proc. Int'l Conf. Dependable Systems and Networks, pp. 249-258, 2006.
[37] A. Moody and G. Bronevetsky, “Scalable I/O Systems via Node-Local Storage: Approaching 1 TB/Sec File I/O,” Technical Report LLNL-TR-415791, Lawrence Livermore Nat'l Laboratory (LLNL), 2008.
[38] T. Budnik, A. Peters, and G. Thain, “Blue Heron Project,” http://www.cs.wisc.edu/condor/PCW2008/condor_presentations peters_blue_heron.ppt , 2011.
[39] J. Dongarra, G. Bosilca, Z. Chen, V. Eijkhout, G.E. Fagg, E. Fuentes, J. Langou, P. Luszczek, J. Pjesivac-Grbovic, K. Seymour, H. You, and S.S. Vadhiyar, “Self-Adapting Numerical Software (SANS) Effort,” IBM J. Research and Development, vol. 50, nos. 2/3, pp. 223-238, 2006.
[40] J.T. Daly, “A Higher Order Estimate of the Optimum Checkpoint Interval for Restart Dumps,” Future Generation Computer Systems, vol. 22, pp. 303-312, Feb. 2006.

