Issue No.12 - December (2008 vol.19)
pp: 1628-1641
Zizhong Chen, University of Tennessee, Knoxville
Jack Dongarra, University of Tennessee, Knoxville
ABSTRACT
Fail-stop failures in distributed environments are often tolerated by checkpointing or message logging. In this paper, we show that fail-stop process failures in the ScaLAPACK matrix-matrix multiplication kernel can be tolerated without checkpointing or message logging. Previous work on algorithm-based fault tolerance has proved that, for matrix-matrix multiplication, the checksum relationship in the input checksum matrices is preserved at the end of the computation no matter which algorithm is chosen. From this checksum relationship in the final results, processor miscalculations can be detected, located, and corrected at the end of the computation. However, whether this checksum relationship can be maintained in the middle of the computation has remained an open question. We first demonstrate that, for many matrix-matrix multiplication algorithms, the checksum relationship in the input checksum matrices is not maintained in the middle of the computation. We then prove that, for the outer product version of the algorithm, the checksum relationship can be maintained in the middle of the computation. Based on this relationship, we demonstrate that fail-stop process failures in ScaLAPACK matrix-matrix multiplication can be tolerated without checkpointing or message logging.
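The difference between the two cases can be illustrated on a small example. The following is a minimal single-process sketch in plain Python, not the ScaLAPACK implementation described in the paper: the function names, the tiny integer matrices, and the choice of a row-by-row ordering as the counterexample are all assumptions made for illustration. It appends a column-checksum row to A and a row-checksum column to B, then checks whether each partial result is itself a full checksum matrix.

```python
def with_column_checksum(a):
    """Return a with one extra row holding the sum of each column."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Return b with one extra column holding the sum of each row."""
    return [row[:] + [sum(row)] for row in b]


def is_full_checksum(c):
    """True if the last row and last column of c equal the sums of the others."""
    rows_ok = all(row[-1] == sum(row[:-1]) for row in c)
    cols_ok = all(c[-1][j] == sum(row[j] for row in c[:-1])
                  for j in range(len(c[0])))
    return rows_ok and cols_ok


def outer_product_steps(ac, br):
    """Yield the partial product after each rank-1 update with column s of ac and row s of br."""
    m, k, n = len(ac), len(br), len(br[0])
    c = [[0] * n for _ in range(m)]
    for s in range(k):
        for i in range(m):
            for j in range(n):
                c[i][j] += ac[i][s] * br[s][j]
        yield [row[:] for row in c]


def row_by_row_steps(ac, br):
    """Yield the partial product after each complete row of the result is computed."""
    m, k, n = len(ac), len(br), len(br[0])
    c = [[0] * n for _ in range(m)]
    for i in range(m):
        c[i] = [sum(ac[i][s] * br[s][j] for s in range(k)) for j in range(n)]
        yield [row[:] for row in c]


def recover_row(c, lost):
    """Rebuild data row `lost` from the checksum row and the surviving data rows."""
    survivors = [row for i, row in enumerate(c[:-1]) if i != lost]
    return [c[-1][j] - sum(row[j] for row in survivors)
            for j in range(len(c[0]))]


# Illustrative inputs (hypothetical values, exact integer arithmetic).
A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8, 9], [1, 2, 3]]
Ac, Br = with_column_checksum(A), with_row_checksum(B)

outer = list(outer_product_steps(Ac, Br))
rowwise = list(row_by_row_steps(Ac, Br))

print([is_full_checksum(c) for c in outer])
print([is_full_checksum(c) for c in rowwise])
print(recover_row(outer[0], 1) == outer[0][1])
```

With these inputs, every partial result of the outer product ordering passes the checksum test, while the row-by-row ordering passes only after its last step. Because the relationship holds mid-computation in the outer product case, a row lost after the first update can be rebuilt from the checksum row and the surviving rows, which is the property that lets a failed process's data be recovered without a checkpoint. Floating-point data would need a tolerance in place of the exact equality used here.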
INDEX TERMS
Reliability and robustness, Mathematical Software, Parallel algorithms
CITATION
Zizhong Chen, Jack Dongarra, "Algorithm-Based Fault Tolerance for Fail-Stop Failures", IEEE Transactions on Parallel & Distributed Systems, vol.19, no. 12, pp. 1628-1641, December 2008, doi:10.1109/TPDS.2008.58