|
| This Article | ||
| ||
| Share | ||
| Bibliographic References | ||
| Add to: | ||
| | ||
| Search | ||
| ||
| ASCII Text | x | ||
| Zizhong Chen, Jack Dongarra, "Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing," IEEE Transactions on Computers, vol. 58, no. 11, pp. 1512-1524, November, 2009. | |||
| BibTex | x | ||
| @article{ 10.1109/TC.2009.42, author = {Zizhong Chen and Jack Dongarra}, title = {Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing}, journal ={IEEE Transactions on Computers}, volume = {58}, number = {11}, issn = {0018-9340}, year = {2009}, pages = {1512-1524}, doi = {http://doi.ieeecomputersociety.org/10.1109/TC.2009.42}, publisher = {IEEE Computer Society}, address = {Los Alamitos, CA, USA}, } | |||
| RefWorks Procite/RefMan/Endnote | x | ||
| TY - JOUR JO - IEEE Transactions on Computers TI - Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing IS - 11 SN - 0018-9340 SP1512 EP1524 EPD - 1512-1524 A1 - Zizhong Chen, A1 - Jack Dongarra, PY - 2009 KW - Self-healing KW - diskless checkpointing KW - fault tolerance KW - pipeline KW - parallel and distributed systems KW - high-performance computing KW - Message Passing Interface. VL - 58 JA - IEEE Transactions on Computers ER - | |||
[1] N.R. Adiga et al. “An Overview of the BlueGene/L Supercomputer,” Proc. Supercomputing Conf. (SC '02), pp. 1-22, 2002.
[2] R. Barrett, M. Berry, T.F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H.V. der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, second ed. SIAM, 1994.
[3] F. Berman, G. Fox, and A. Hey, Grid Computing: Making the Global Infrastructure a Reality. Wiley, 2003.
[4] Z. Chen and J. Dongarra, “Numerically Stable Real Number Codes Based on Random Matrices,” Proc. Fifth Int'l Conf. Computational Science (ICCS '05), May 2005.
[5] Z. Chen and J. Dongarra, “Condition Numbers of Gaussian Random Matrices,” SIAM J. Matrix Analysis and Applications, vol. 27, no. 3, pp. 603-620, 2005.
[6] Z. Chen, J. Dongarra, P. Luszczek, and K. Roche, “Self-Adapting Software for Numerical Linear Algebra and LAPACK for Clusters,” Parallel Computing, vol. 29, nos. 11/12, pp. 1723-1743, Nov./Dec. 2003.
[7] Z. Chen, G.E. Fagg, E. Gabriel, J. Langou, T. Angskun, G. Bosilca, and J. Dongarra, “Fault Tolerant High Performance Computing by a Coding Approach,” Proc. ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPoPP '05), June 2005.
[8] T.C. Chiueh and P. Deng, “Evaluation of Checkpoint Mechanisms for Massively Parallel Machines,” Proc. 26th Ann. Int'l Symp. Fault-Tolerant Computing (FTCS '96), pp. 370-379, 1996.
[9] J. Dongarra, H. Meuer, and E. Strohmaier, “TOP500 Supercomputer Sites, 24th Edition,” Proc. Supercomputing Conf. (SC'2004), 2004.
[10] A. Edelman, “Eigenvalues and Condition Numbers of Random Matrices,” SIAM J. Matrix Analysis and Applications, vol. 9, no. 4, pp. 543-560, 1988.
[11] G.E. Fagg and J. Dongarra, “FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World,” Proc. Parallel Virtual Machine/Message Passing Interface Conf. (PVM/MPI '00), pp. 346-353, 2000.
[12] G.E. Fagg, E. Gabriel, G. Bosilca, T. Angskun, Z. Chen, J. Pjesivac-Grbovic, K. London, and J.J. Dongarra, “Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems,” Proc. Int'l Supercomputer Conf., 2004.
[13] G.E. Fagg, E. Gabriel, Z. Chen, T. Angskun, G. Bosilca, J. Pjesivac-Grbovic, and J.J. Dongarra, “Process Fault-Tolerance: Semantics, Design and Applications for High Performance Computing,” Int'l J. High Performance Computing Applications, vol. 19, no. 4, pp. 465-477, 2005.
[14] I. Foster and C. Kesselman, The Grid: Blueprint for a New Computing Infrastructure. Morgan Kauffman, 1999.
[15] E. Gelenbe, “On the Optimum Checkpoint Interval,” J. ACM, vol. 26, no. 2, pp. 259-270, 1979.
[16] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, “A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard,” Parallel Computing, vol. 22, no. 6, pp. 789-828, Sept. 1996.
[17] G.H. Golub and C.F. Van Loan, Matrix Computations. The Johns Hopkins Univ. Press, 1989.
[18] Y. Kim, “Fault Tolerant Matrix Operations for Parallel and Distributed Systems,” PhD dissertation, Univ. of Tennessee, June 1996.
[19] Message Passing Interface Forum “MPI: A Message Passing Interface Standard,” Technical Report ut-cs-94-230, Univ. of Tennessee, 1994.
[20] J.S. Plank, “A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-Like Systems,” Software—Practice & Experience, vol. 27, no. 9, pp. 995-1012, Sept. 1997.
[21] J.S. Plank, Y. Kim, and J. Dongarra, “Fault-Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing,” J. Parallel and Distributed Computing, vol. 43, no. 2, pp.125-138, 1997.
[22] J.S. Plank and K. Li, “Faster Checkpointing with $n+1$ Parity,” Proc. Int'l Symp. Fault-Tolerant Computing (FTCS), pp. 288-297, 1994.
[23] J.S. Plank, K. Li, and M.A. Puening, “Diskless Checkpointing,” IEEE Trans. Parallel and Distributed Systems, vol. 9, no. 10, pp. 972-986, Oct. 1998.
[24] J.S. Plank and M.G. Thomason, “Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems,” J.Parallel and Distributed Computing, vol. 61, no. 11, pp. 1570-1590, Nov. 2001.
[25] L.M. Silva and J.G. Silva, “An Experimental Study about Diskless Checkpointing,” Proc. EUROMICRO '98 Conf., pp. 395-402, 1998.
[26] N.H. Vaidya, “A Case for Two-Level Recovery Schemes,” IEEE Trans. Computers, vol. 47, no. 6, pp. 656-666, June 1998.
[27] J.W. Young, “A First Order Approximation to the Optimal Checkpoint Interval,” Comm. ACM, vol. 17, no. 9, pp. 530-531, 1974.

