Issue No. 03 - March (2013 vol. 24)
ISSN: 1045-9219
pp: 535-549
Zhezhe Chen, The Ohio State University, Columbus
Qi Gao, Facebook Inc., Menlo Park
Wenbin Zhang, The Ohio State University, Columbus
Feng Qin, The Ohio State University, Columbus
ABSTRACT
Despite the success of the Message Passing Interface (MPI), many MPI libraries have suffered from software bugs. These bugs severely impact the productivity of a large number of users, causing program failures or other errors. As a result, MPI application developers often have to spend days or weeks in vain debugging their own code. To address this daunting problem, this paper presents a new method called FlowChecker, which detects communication-related bugs in MPI libraries. First, FlowChecker extracts program intentions of message passing (MP-intentions), which specify messages to be delivered from the sources to the destinations. Then FlowChecker tracks the message flows that actually occur in the underlying MPI libraries. Finally, FlowChecker checks whether the messages are correctly delivered from the sources to the destinations by comparing the message flows against the MP-intentions. If a mismatch is found, FlowChecker reports a bug and provides diagnostic information to help MPI library developers understand and fix it. We have built a FlowChecker prototype on Linux and evaluated it with five real-world and two injected bug cases in three widely used MPI libraries: Open MPI, MPICH2, and MVAPICH2. Our experimental results show that FlowChecker effectively detects all seven evaluated bug cases. Additionally, it provides useful diagnostic information for narrowing down or even pinpointing root causes of the bugs. Moreover, our experiments with High Performance Linpack and NAS Parallel Benchmarks show that FlowChecker induces low runtime overhead (0.9-5.6 percent on Open MPI, 0.9-8.1 percent on MPICH2, and 1.6-9.7 percent on MVAPICH2).
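The comparison step described in the abstract can be illustrated with a minimal sketch. This is not the FlowChecker implementation. It assumes that both the MP-intentions and the observed message flows have already been reduced to simple records, and the names Message and check_flows, along with the record fields, are hypothetical. The sketch only shows the idea of reporting a bug when intended and delivered messages do not match.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """One message record (hypothetical fields, chosen for illustration)."""
    source: int     # rank that sends the message
    dest: int       # rank that should receive the message
    tag: int        # message tag
    payload: bytes  # message contents


def check_flows(intentions, observed_flows):
    """Compare intended messages with observed deliveries.

    Returns a list of (kind, message, count) tuples, where kind is
    "missing" for an intended message that was never observed and
    "unexpected" for an observed message that nobody intended.
    An empty list means every intention was matched exactly.
    """
    intended = Counter(intentions)
    observed = Counter(observed_flows)
    reports = []
    # Counter subtraction keeps only strictly positive counts.
    for msg, count in (intended - observed).items():
        reports.append(("missing", msg, count))
    for msg, count in (observed - intended).items():
        reports.append(("unexpected", msg, count))
    return reports
```

Under these assumptions, a message whose payload is corrupted in transit appears as one "missing" report for the intended message and one "unexpected" report for the corrupted delivery. A real checker would track flows inside the MPI library rather than compare finished records.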
INDEX TERMS
message passing interfaces, software reliability, bug detection
CITATION
Zhezhe Chen, Qi Gao, Wenbin Zhang, Feng Qin, "Improving the Reliability of MPI Libraries via Message Flow Checking", IEEE Transactions on Parallel & Distributed Systems, vol. 24, no. 3, pp. 535-549, March 2013, doi:10.1109/TPDS.2012.127