
ASCII Text  
Zizhong Chen, Jack Dongarra, "Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing," IEEE Transactions on Computers, vol. 58, no. 11, pp. 1512-1524, November 2009.  
BibTeX  
@article{ 10.1109/TC.2009.42, author = {Zizhong Chen and Jack Dongarra}, title = {Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing}, journal = {IEEE Transactions on Computers}, volume = {58}, number = {11}, issn = {0018-9340}, year = {2009}, pages = {1512-1524}, doi = {http://doi.ieeecomputersociety.org/10.1109/TC.2009.42}, publisher = {IEEE Computer Society}, address = {Los Alamitos, CA, USA}, }  
RefWorks Procite/RefMan/Endnote  
TY  - JOUR  
JO  - IEEE Transactions on Computers  
TI  - Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing  
IS  - 11  
SN  - 0018-9340  
SP  - 1512  
EP  - 1524  
EPD  - 1512-1524  
A1  - Zizhong Chen  
A1  - Jack Dongarra  
PY  - 2009  
KW  - Self-healing  
KW  - diskless checkpointing  
KW  - fault tolerance  
KW  - pipeline  
KW  - parallel and distributed systems  
KW  - high-performance computing  
KW  - Message Passing Interface.  
VL  - 58  
JA  - IEEE Transactions on Computers  
ER  -  
[1] N.R. Adiga et al., “An Overview of the BlueGene/L Supercomputer,” Proc. Supercomputing Conf. (SC '02), pp. 1-22, 2002.
[2] R. Barrett, M. Berry, T.F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H.V. der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, second ed. SIAM, 1994.
[3] F. Berman, G. Fox, and A. Hey, Grid Computing: Making the Global Infrastructure a Reality. Wiley, 2003.
[4] Z. Chen and J. Dongarra, “Numerically Stable Real Number Codes Based on Random Matrices,” Proc. Fifth Int'l Conf. Computational Science (ICCS '05), May 2005.
[5] Z. Chen and J. Dongarra, “Condition Numbers of Gaussian Random Matrices,” SIAM J. Matrix Analysis and Applications, vol. 27, no. 3, pp. 603-620, 2005.
[6] Z. Chen, J. Dongarra, P. Luszczek, and K. Roche, “Self-Adapting Software for Numerical Linear Algebra and LAPACK for Clusters,” Parallel Computing, vol. 29, nos. 11/12, pp. 1723-1743, Nov./Dec. 2003.
[7] Z. Chen, G.E. Fagg, E. Gabriel, J. Langou, T. Angskun, G. Bosilca, and J. Dongarra, “Fault Tolerant High Performance Computing by a Coding Approach,” Proc. ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPoPP '05), June 2005.
[8] T.C. Chiueh and P. Deng, “Evaluation of Checkpoint Mechanisms for Massively Parallel Machines,” Proc. 26th Ann. Int'l Symp. Fault-Tolerant Computing (FTCS '96), pp. 370-379, 1996.
[9] J. Dongarra, H. Meuer, and E. Strohmaier, “TOP500 Supercomputer Sites, 24th Edition,” Proc. Supercomputing Conf. (SC'2004), 2004.
[10] A. Edelman, “Eigenvalues and Condition Numbers of Random Matrices,” SIAM J. Matrix Analysis and Applications, vol. 9, no. 4, pp. 543-560, 1988.
[11] G.E. Fagg and J. Dongarra, “FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World,” Proc. Parallel Virtual Machine/Message Passing Interface Conf. (PVM/MPI '00), pp. 346-353, 2000.
[12] G.E. Fagg, E. Gabriel, G. Bosilca, T. Angskun, Z. Chen, J. Pjesivac-Grbovic, K. London, and J.J. Dongarra, “Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems,” Proc. Int'l Supercomputer Conf., 2004.
[13] G.E. Fagg, E. Gabriel, Z. Chen, T. Angskun, G. Bosilca, J. Pjesivac-Grbovic, and J.J. Dongarra, “Process Fault-Tolerance: Semantics, Design and Applications for High Performance Computing,” Int'l J. High Performance Computing Applications, vol. 19, no. 4, pp. 465-477, 2005.
[14] I. Foster and C. Kesselman, The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 1999.
[15] E. Gelenbe, “On the Optimum Checkpoint Interval,” J. ACM, vol. 26, no. 2, pp. 259-270, 1979.
[16] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, “A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard,” Parallel Computing, vol. 22, no. 6, pp. 789-828, Sept. 1996.
[17] G.H. Golub and C.F. Van Loan, Matrix Computations. The Johns Hopkins Univ. Press, 1989.
[18] Y. Kim, “Fault Tolerant Matrix Operations for Parallel and Distributed Systems,” PhD dissertation, Univ. of Tennessee, June 1996.
[19] Message Passing Interface Forum, “MPI: A Message Passing Interface Standard,” Technical Report UT-CS-94-230, Univ. of Tennessee, 1994.
[20] J.S. Plank, “A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-Like Systems,” Software: Practice & Experience, vol. 27, no. 9, pp. 995-1012, Sept. 1997.
[21] J.S. Plank, Y. Kim, and J. Dongarra, “Fault-Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing,” J. Parallel and Distributed Computing, vol. 43, no. 2, pp. 125-138, 1997.
[22] J.S. Plank and K. Li, “Faster Checkpointing with $n+1$ Parity,” Proc. Int'l Symp. Fault-Tolerant Computing (FTCS), pp. 288-297, 1994.
[23] J.S. Plank, K. Li, and M.A. Puening, “Diskless Checkpointing,” IEEE Trans. Parallel and Distributed Systems, vol. 9, no. 10, pp. 972-986, Oct. 1998.
[24] J.S. Plank and M.G. Thomason, “Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems,” J. Parallel and Distributed Computing, vol. 61, no. 11, pp. 1570-1590, Nov. 2001.
[25] L.M. Silva and J.G. Silva, “An Experimental Study about Diskless Checkpointing,” Proc. EUROMICRO '98 Conf., pp. 395-402, 1998.
[26] N.H. Vaidya, “A Case for Two-Level Recovery Schemes,” IEEE Trans. Computers, vol. 47, no. 6, pp. 656-666, June 1998.
[27] J.W. Young, “A First Order Approximation to the Optimal Checkpoint Interval,” Comm. ACM, vol. 17, no. 9, pp. 530-531, 1974.