Issue No.03 - March (2013 vol.24)
pp: 535-549
Zhezhe Chen , The Ohio State University, Columbus
Qi Gao , Facebook Inc., Menlo Park
Wenbin Zhang , The Ohio State University, Columbus
Feng Qin , The Ohio State University, Columbus
ABSTRACT
Despite the success of the Message Passing Interface (MPI), many MPI libraries suffer from software bugs. These bugs cause program failures or other errors and severely reduce the productivity of a large number of users. As a result, MPI application developers often spend days or weeks debugging their own code in vain, because the fault lies in the library. To address this problem, this paper presents a new method called FlowChecker, which detects communication-related bugs in MPI libraries. First, FlowChecker extracts the program's message-passing intentions (MP-intentions), which specify the messages to be delivered from sources to destinations. Then, FlowChecker tracks the message flows that actually occur in the underlying MPI library. Finally, FlowChecker checks whether the messages are correctly delivered from the sources to the destinations by comparing the message flows against the MP-intentions. If a mismatch is found, FlowChecker reports a bug and provides diagnostic information to help MPI library developers understand and fix it. We have built a FlowChecker prototype on Linux and evaluated it with five real-world and two injected bug cases in three widely used MPI libraries: Open MPI, MPICH2, and MVAPICH2. Our experimental results show that FlowChecker detects all seven evaluated bug cases. Additionally, it provides useful diagnostic information for narrowing down or even pinpointing the root causes of the bugs. Moreover, our experiments with High Performance Linpack and the NAS Parallel Benchmarks show that FlowChecker induces low runtime overhead (0.9-5.6 percent on Open MPI, 0.9-8.1 percent on MPICH2, and 1.6-9.7 percent on MVAPICH2).
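The comparison step described in the abstract can be illustrated with a small sketch. This is not FlowChecker's implementation, which the abstract does not detail. It is a toy model under stated assumptions: the names Message and check_flows are hypothetical, an MP-intention is reduced to a (source rank, destination rank, payload) triple, and the observed message flows are assumed to be available as a list of the same triples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """Hypothetical stand-in for one intended or observed message."""
    src: int        # sending rank
    dst: int        # receiving rank
    payload: bytes  # data the application asked to send, or data delivered


def check_flows(intentions, observed):
    """Compare intended messages against observed deliveries.

    Returns a list of mismatch reports; an empty list means every
    intention was matched by exactly one observed delivery.
    """
    reports = []
    remaining = list(observed)
    for want in intentions:
        if want in remaining:
            remaining.remove(want)  # exact match: delivered correctly
            continue
        # Same endpoints but different data suggests corruption in transit.
        corrupted = next(
            (m for m in remaining if (m.src, m.dst) == (want.src, want.dst)),
            None,
        )
        if corrupted is not None:
            remaining.remove(corrupted)
            reports.append(f"corrupted: {want.src}->{want.dst}")
        else:
            reports.append(f"lost: {want.src}->{want.dst}")
    # Deliveries that no intention asked for.
    for extra in remaining:
        reports.append(f"unexpected: {extra.src}->{extra.dst}")
    return reports


if __name__ == "__main__":
    intended = [Message(0, 1, b"abc"), Message(1, 2, b"xyz")]
    delivered = [Message(0, 1, b"abc"), Message(1, 2, b"xyQ")]
    for line in check_flows(intended, delivered):
        print(line)
```

In this toy model a mismatch report names only the endpoints. The diagnostic information described in the abstract is richer, since it is meant to help narrow down root causes inside the library.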
INDEX TERMS
Message passing interfaces, software reliability, bug detection
CITATION
Zhezhe Chen, Qi Gao, Wenbin Zhang, Feng Qin, "Improving the Reliability of MPI Libraries via Message Flow Checking", IEEE Transactions on Parallel & Distributed Systems, vol. 24, no. 3, pp. 535-549, March 2013, doi:10.1109/TPDS.2012.127