Issue No.03 - March (2014 vol.25)
pp: 740-749
Minh Ngoc Dinh , Monash University, Victoria
David Abramson , Monash University, Victoria
Chao Jin , Monash University, Victoria
ABSTRACT
Detecting and isolating bugs that arise only at high processor counts is a challenging task. Over a number of years, we have implemented a debugging method called "relative debugging" that supports debugging applications as they evolve or are ported to larger machines. It allows a user to compare the state of a suspect program against a reference version, even as the number of processors is increased. The key idea is to compare runtime data in order to reason about the state of the suspect program. While powerful, a naïve implementation of the comparison phase does not scale to large problems running on large machines. In this paper, we propose two solutions: a hash-based scheme and a direct point-to-point scheme. We present the implementation, a case study, and the performance of our techniques on 20K cores of a Cray XE6 system.
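As a rough illustration of the hash-based idea mentioned in the abstract, the sketch below compares two arrays block by block using digests instead of raw values, so that only small fixed-size hashes would need to be exchanged. This is not the paper's implementation: the function names (`block_digests`, `differing_blocks`), the choice of SHA-256, the block size, and the sample data are all assumptions made for illustration, and the abstract does not say how the actual scheme partitions or hashes data.

```python
import hashlib
import struct


def block_digests(values, block_size):
    """Hash each fixed-size block of a float sequence (hypothetical helper)."""
    digests = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        # Pack as little-endian doubles so equal values give equal bytes.
        packed = struct.pack(f"<{len(block)}d", *block)
        digests.append(hashlib.sha256(packed).hexdigest())
    return digests


def differing_blocks(reference, suspect, block_size):
    """Return indices of blocks whose digests differ between the two runs."""
    ref = block_digests(reference, block_size)
    sus = block_digests(suspect, block_size)
    if len(ref) != len(sus):
        raise ValueError("reference and suspect data have different shapes")
    return [i for i, (a, b) in enumerate(zip(ref, sus)) if a != b]


# Assumed sample data: the suspect run diverges at element 9.
reference = [float(i) for i in range(16)]
suspect = list(reference)
suspect[9] = -1.0

print(differing_blocks(reference, suspect, block_size=4))  # prints [2]
```

One consequence of comparing digests is that only exact, bit-for-bit equality can be tested; a comparison that tolerates small numerical differences would need access to the values themselves.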
INDEX TERMS
Assertion checkers, parallelism and concurrency, distributed debugging
CITATION
Minh Ngoc Dinh, David Abramson, Chao Jin, "Scalable Relative Debugging", IEEE Transactions on Parallel & Distributed Systems, vol.25, no. 3, pp. 740-749, March 2014, doi:10.1109/TPDS.2013.86