
ASCII Text  
Zizhong Chen, Jack Dongarra, "Algorithm-Based Fault Tolerance for Fail-Stop Failures," IEEE Transactions on Parallel and Distributed Systems, vol. 19, no. 12, pp. 1628-1641, December, 2008.  
BibTeX  
@article{ 10.1109/TPDS.2008.58,  
  author = {Zizhong Chen and Jack Dongarra},  
  title = {Algorithm-Based Fault Tolerance for Fail-Stop Failures},  
  journal = {IEEE Transactions on Parallel and Distributed Systems},  
  volume = {19},  
  number = {12},  
  issn = {1045-9219},  
  year = {2008},  
  pages = {1628-1641},  
  doi = {http://doi.ieeecomputersociety.org/10.1109/TPDS.2008.58},  
  publisher = {IEEE Computer Society},  
  address = {Los Alamitos, CA, USA},  
}  
RefWorks Procite/RefMan/Endnote  
TY  - JOUR  
JO  - IEEE Transactions on Parallel and Distributed Systems  
TI  - Algorithm-Based Fault Tolerance for Fail-Stop Failures  
IS  - 12  
SN  - 1045-9219  
SP  - 1628  
EP  - 1641  
EPD - 1628-1641  
A1  - Zizhong Chen  
A1  - Jack Dongarra  
PY  - 2008  
KW  - Reliability and robustness  
KW  - Mathematical Software  
KW  - Parallel algorithms  
VL  - 19  
JA  - IEEE Transactions on Parallel and Distributed Systems  
ER  -  
[1] J. Anfinson and F.T. Luk, “A Linear Algebraic Model of Algorithm-Based Fault Tolerance,” IEEE Trans. Computers, vol. 37, no. 12, pp. 1599-1604, Dec. 1988.
[2] P. Banerjee, J.T. Rahmeh, C.B. Stunkel, V.S.S. Nair, K. Roy, V. Balasubramanian, and J.A. Abraham, “Algorithm-Based Fault Tolerance on a Hypercube Multiprocessor,” IEEE Trans. Computers, vol. C-39, pp. 1132-1145, 1990.
[3] V. Balasubramanian and P. Banerjee, “Compiler-Assisted Synthesis of Algorithm-Based Checking in Multiprocessors,” IEEE Trans. Computers, vol. C-39, pp. 436-446, 1990.
[4] L.S. Blackford, J. Choi, A. Cleary, A. Petitet, R.C. Whaley, J. Demmel, I. Dhillon, K. Stanley, J. Dongarra, S. Hammarling, G. Henry, and D. Walker, “ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers - Design Issues and Performance,” Proc. ACM/IEEE Conf. Supercomputing (Supercomputing '96), CD-ROM, p. 5, 1996.
[5] D.L. Boley, R.P. Brent, G.H. Golub, and F.T. Luk, “Algorithmic Fault Tolerance Using the Lanczos Method,” SIAM J. Matrix Analysis and Applications, vol. 13, pp. 312-332, 1992.
[6] L.E. Cannon, “A Cellular Computer to Implement the Kalman Filter Algorithm,” PhD dissertation, Montana State Univ., 1969.
[7] Z. Chen and J. Dongarra, “Numerically Stable Real Number Codes Based on Random Matrices,” Proc. Fifth Int'l Conf. Computational Science (ICCS '05), May 2005.
[8] Z. Chen and J. Dongarra, “Condition Numbers of Gaussian Random Matrices,” SIAM J. Matrix Analysis and Applications, vol. 27, no. 3, pp. 603-620, 2005.
[9] Z. Chen, G.E. Fagg, E. Gabriel, J. Langou, T. Angskun, G. Bosilca, and J. Dongarra, “Fault Tolerant High Performance Computing by a Coding Approach,” Proc. ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPOPP '05), June 2005.
[10] Z. Chen, G.E. Fagg, E. Gabriel, J. Langou, T. Angskun, G. Bosilca, and J. Dongarra, “Building Fault Survivable MPI Programs with FT-MPI Using Diskless Checkpointing,” Technical Report UT-CS-04-540, Dept. Computer Science, Univ. of Tennessee, 2004.
[11] Z. Chen, “Scalable Techniques for Fault Tolerant High Performance Computing,” PhD dissertation, Univ. of Tennessee, 2006.
[12] G.E. Fagg, E. Gabriel, G. Bosilca, T. Angskun, Z. Chen, J. Pjesivac-Grbovic, K. London, and J.J. Dongarra, “Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems,” Proc. Int'l Supercomputer Conf., 2004.
[13] G.E. Fagg, E. Gabriel, Z. Chen, T. Angskun, G. Bosilca, J. Pjesivac-Grbovic, and J.J. Dongarra, “Process Fault-Tolerance: Semantics, Design and Applications for High Performance Computing,” Int'l J. High Performance Computing Applications, 2004.
[14] I. Foster and C. Kesselman, The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 1999.
[15] I. Foster and C. Kesselman, “The GLOBUS Toolkit,” The Grid: Blueprint for a New Computing Infrastructure, pp. 259-278, 1999.
[16] G.C. Fox, M. Johnson, G. Lyzenga, S.W. Otto, J. Salmon, and D. Walker, Solving Problems on Concurrent Processors: Volume 1. Prentice-Hall, 1988.
[17] E. Gabriel, G.E. Fagg, G. Bosilca, T. Angskun, J. Dongarra, J.M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R.H. Castain, D.J. Daniel, R.L. Graham, and T.S. Woodall, “Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation,” Proc. 11th European PVM/MPI Users' Group Meeting (Euro PVM/MPI '04), pp. 97-104, 2004.
[18] G.H. Golub and C.F. Van Loan, Matrix Computations. The Johns Hopkins Univ. Press, 1989.
[19] K.H. Huang and J.A. Abraham, “Algorithm-Based Fault Tolerance for Matrix Operations,” IEEE Trans. Computers, vol. 33, pp. 518-528, 1984.
[20] Y. Kim, “Fault Tolerant Matrix Operations for Parallel and Distributed Systems,” PhD dissertation, Univ. of Tennessee, June 1996.
[21] F.T. Luk and H. Park, “An Analysis of Algorithm-Based Fault Tolerance Techniques,” Proc. SPIE Advanced Algebra and Architecture for Signal Processing, vol. 696, pp. 222-228, 1986.
[22] J.S. Plank, Y. Kim, and J. Dongarra, “Fault Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing,” IEEE J. Parallel and Distributed Computing, vol. 43, pp. 125-138, 1997.
[23] J.S. Plank, K. Li, and M.A. Puening, “Diskless Checkpointing,” IEEE Trans. Parallel and Distributed Systems, vol. 9, no. 10, pp. 972-986, Oct. 1998.
[24] J.S. Plank, “A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-Like Systems,” Software: Practice and Experience, vol. 27, no. 9, pp. 995-1012, Sept. 1997.
[25] P. Sanders and J.F. Sibeyn, “A Bandwidth Latency Tradeoff for Broadcast and Reduction,” Information Processing Letters, vol. 86, no. 1, pp. 33-38, 2003.
[26] M. Snir, S. Otto, S. Huss-Lederman, D.W. Walker, and J. Dongarra, MPI: The Complete Reference, vol. 1, second ed. The MIT Press, 1998.
[27] V.S. Sunderam, “PVM: A Framework for Parallel Distributed Computing,” Concurrency: Practice and Experience, vol. 2, no. 4, pp. 315-339, 1990.
[28] C. Wang, F. Mueller, C. Engelmann, and S. Scott, “Job Pause Service Under LAM/MPI+BLCR for Transparent Fault Tolerance,” Proc. 21st IEEE Int'l Parallel and Distributed Processing Symp. (IPDPS '07), Mar. 2007.