2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT) (2012)
Minneapolis, MN, USA
Sept. 19, 2012 to Sept. 23, 2012
ISBN: 978-1-5090-6609-4
pp: 213-222
Ignacio Laguna , Purdue University, School of Electrical and Computer Engineering, West Lafayette, IN 47907, USA
Dong H. Ahn , Lawrence Livermore National Laboratory, Computation Directorate, CA 94550, USA
Bronis R. de Supinski , Lawrence Livermore National Laboratory, Computation Directorate, CA 94550, USA
Saurabh Bagchi , Purdue University, School of Electrical and Computer Engineering, West Lafayette, IN 47907, USA
Todd Gamblin , Lawrence Livermore National Laboratory, Computation Directorate, CA 94550, USA
ABSTRACT
Debugging large-scale parallel applications is challenging. Most existing techniques provide mechanisms for process control but little information about the causes of failures. Most debuggers also scale poorly despite continued growth in supercomputer core counts. Our novel, highly scalable tool helps developers to understand and to fix performance failures and correctness problems at scale. Our tool probabilistically infers the least progressed task in MPI programs using Markov models of execution history and dependence analysis. This analysis guides program slicing to find code that may have caused a failure. In a blind study, we demonstrate that our tool can isolate the root cause of a particularly perplexing bug encountered at scale in a molecular dynamics simulation. Further, we perform fault injections into two benchmark codes and measure the scalability of the tool. Our results show that it accurately detects the least progressed task in most cases and can perform the diagnosis in a fraction of a second with thousands of tasks.
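The abstract does not spell out the inference procedure, so the following is only a toy sketch of the general idea, not the authors' algorithm. It assumes each task records a sequence of named execution states, builds a first-order Markov model from the observed transitions, and treats a task as "least progressed" when its current state is likely to lead to the other tasks' states but not the reverse. All function names, the state labels, and the scoring rule are hypothetical choices made for illustration.

```python
from collections import defaultdict


def build_markov_model(histories):
    """Estimate transition probabilities from per-task state histories."""
    counts = defaultdict(lambda: defaultdict(int))
    for history in histories:
        for src, dst in zip(history, history[1:]):
            counts[src][dst] += 1
    model = {}
    for src, successors in counts.items():
        total = sum(successors.values())
        model[src] = {dst: n / total for dst, n in successors.items()}
    return model


def reach_probability(model, src, dst, max_steps=50):
    """Probability of reaching dst from src within max_steps transitions."""
    if src == dst:
        return 1.0
    dist = {src: 1.0}   # probability mass that has not yet hit dst
    reached = 0.0
    for _ in range(max_steps):
        nxt = defaultdict(float)
        for state, p in dist.items():
            for succ, q in model.get(state, {}).items():
                if succ == dst:
                    reached += p * q      # dst is treated as absorbing
                else:
                    nxt[succ] += p * q
        dist = nxt
        if not dist:
            break
    return reached


def least_progressed(histories):
    """Return indices of tasks whose current state most precedes the others.

    Illustrative score (an assumption): for task i in state s, sum over the
    other tasks' states t of P(s reaches t) - P(t reaches s).
    """
    model = build_markov_model(histories)
    current = [history[-1] for history in histories]
    scores = []
    for i, s in enumerate(current):
        score = sum(
            reach_probability(model, s, t) - reach_probability(model, t, s)
            for j, t in enumerate(current)
            if j != i
        )
        scores.append(score)
    best = max(scores)
    return [i for i, score in enumerate(scores) if score == best]


if __name__ == "__main__":
    # Hypothetical histories: task 2 is stuck in "compute" while the others
    # have already reached "barrier".
    histories = [
        ["init", "compute", "send", "barrier"],
        ["init", "compute", "send", "barrier"],
        ["init", "compute"],
        ["init", "compute", "send", "barrier"],
    ]
    print(least_progressed(histories))
```

In this sketch the task identified as least progressed would be the starting point for further analysis such as program slicing; a real tool would also need to gather the histories scalably across tasks, which is outside the scope of this example.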
INDEX TERMS
Debugging, Probabilistic logic, Computational modeling, Markov processes, History, Computer bugs, Performance, Reliability
CITATION
Ignacio Laguna, Dong H. Ahn, Bronis R. de Supinski, Saurabh Bagchi, Todd Gamblin, "Probabilistic diagnosis of performance faults in large-scale parallel applications", 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT), vol. 00, no. , pp. 213-222, 2012, doi: