Issue No. 02 - February (2012, vol. 61)
pp. 199-212
Chi-Neng Wen, National Chung-Cheng University
Shu-hsuan Chou, National Chung-Cheng University, Chia-Yi
Chien-Chih Chen, National Chiao-Tung University, Hsin-Chu
Tien-Fu Chen, National Chiao-Tung University, Hsin-Chu
ABSTRACT
Traditional debugging methodologies are limited in their ability to provide debugging support for many-core parallel programming. Synchronization problems or bugs due to race conditions are particularly difficult to detect with existing debugging tools. Most traditional debugging approaches rely on globally synchronized signals, but these pose their own problems in terms of scalability. The first contribution of this paper is to propose a novel non-uniform debugging architecture (NUDA) based on a ring interconnection schema. Our approach makes hardware-assisted debugging both feasible and scalable for many-core processing scenarios. The key idea is to distribute the debugging support structures across a set of hierarchical clusters while avoiding address overlap. The design strategy allows the address space to be monitored using non-uniform protocols. Our second contribution is to propose a nonintrusive approach to lockset-based race detection supported by the NUDA. A non-uniform page-based monitoring cache in each NUDA node is used to keep track of the access footprints. The union of all the caches can serve as a race detection probe without disturbing execution ordering. Using the proposed approach, we show that parallel race bugs can be precisely captured, and that most false-positive alerts can be efficiently eliminated at an average slowdown cost of only 1.4-3.6 percent. The net hardware cost is relatively low, so that the NUDA can easily be scaled to increasingly complex many-core systems.
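To make the lockset idea in the abstract concrete, the following is a minimal software sketch of lockset refinement in the style of Eraser-type detectors. It is not the NUDA hardware design: the class `LocksetDetector`, its method names, and the per-address bookkeeping are assumptions made for illustration only, and the sketch leaves out the page-based monitoring caches, the ring interconnect, and the state machine that real detectors use to suppress initialization and read-sharing false positives.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Set, Tuple


@dataclass
class LocksetDetector:
    """Hypothetical, simplified lockset checker (illustration only)."""

    # thread id -> locks that thread currently holds
    held: Dict[int, Set[str]] = field(default_factory=dict)
    # address -> candidate lockset (locks held on every access so far)
    candidates: Dict[int, FrozenSet[str]] = field(default_factory=dict)
    # address -> threads that have touched it (the "access footprint")
    accessors: Dict[int, Set[int]] = field(default_factory=dict)
    # (address, thread id) pairs flagged as potential races
    races: List[Tuple[int, int]] = field(default_factory=list)

    def acquire(self, tid: int, lock: str) -> None:
        self.held.setdefault(tid, set()).add(lock)

    def release(self, tid: int, lock: str) -> None:
        self.held.setdefault(tid, set()).discard(lock)

    def access(self, tid: int, addr: int) -> bool:
        """Record an access; return True if it is flagged as a potential race."""
        locks = frozenset(self.held.get(tid, set()))
        if addr not in self.candidates:
            # First access: every lock held now is a candidate protector.
            self.candidates[addr] = locks
        else:
            # Refinement: keep only locks held on all accesses.
            self.candidates[addr] = self.candidates[addr] & locks
        threads = self.accessors.setdefault(addr, set())
        threads.add(tid)
        # Shared by more than one thread with no common lock: report.
        if len(threads) > 1 and not self.candidates[addr]:
            self.races.append((addr, tid))
            return True
        return False
```

In this sketch, two threads that both take the same lock around an address keep its candidate lockset non-empty, so nothing is reported; once a second thread touches an address with no lock in common, the candidate set becomes empty and the access is flagged.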
INDEX TERMS
NUDA, lockset, data race, nonintrusive, manycore, debugging.
CITATION
Chi-Neng Wen, Shu-hsuan Chou, Chien-Chih Chen, Tien-Fu Chen, "NUDA: A Non-Uniform Debugging Architecture and Nonintrusive Race Detection for Many-Core Systems", IEEE Transactions on Computers, vol. 61, no. 2, pp. 199-212, February 2012, doi:10.1109/TC.2010.254