Issue No.05 - September/October (2011 vol.8)
pp: 714-727
Omer Khan , Massachusetts Institute of Technology, Cambridge
Sandip Kundu , University of Massachusetts Amherst, Amherst
ABSTRACT
As the semiconductor industry continues its relentless push toward nano-CMOS technologies, long-term device reliability and the occurrence of hard errors have emerged as major concerns. Long-term device reliability includes parametric degradation that results in loss of performance as well as hard failures that result in loss of functionality. The ITRS roadmap reports that the effectiveness of traditional burn-in testing for product life acceleration is eroding. Thus, to assure sufficient product reliability, fault detection and system reconfiguration must be performed in the field at runtime. Although regular memory structures are protected against hard errors using error-correcting codes, many structures within cores are left unprotected. Several proposed online testing techniques either rely on concurrent testing or periodically check for correctness. These techniques are attractive but limited by significant design effort and hardware cost. Furthermore, the lack of observability and controllability of microarchitectural states results in long latency, long test sequences, and large storage of golden patterns. In this paper, we propose a low-cost scheme for detecting and debugging hard errors at a fine granularity within cores and keeping the faulty cores functional, with potentially reduced capability and performance. The solution includes both hardware and runtime software based on the codesigned virtual machine concept. It can detect, debug, and isolate hard errors in small noncache array structures, execution units, and combinational logic within cores. Hardware signature registers capture the footprint of execution at the output of functional modules within the cores. A runtime layer of software (the microvisor) initiates functional tests concurrently on multiple cores and compares the captured signature footprints across cores to detect, debug, and isolate hard errors.
Results show that, using a targeted set of functional test sequences, faults can be debugged to a fine-grained level within cores. The hardware cost of the scheme is less than three percent, while the software tasks are performed at a high level, resulting in relatively low design effort and cost.
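The cross-core signature comparison described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's implementation: the abstract does not specify how the signature registers compact their inputs, so the 16-bit MISR-style update, the feedback taps, the toy ALU models, the injected stuck-at fault, and the majority vote across cores are all hypothetical choices made for illustration.

```python
from collections import Counter

MASK = 0xFFFF   # assumed 16-bit signature register width
TAPS = 0xB400   # assumed feedback taps (a common 16-bit Galois LFSR polynomial)


def update_signature(sig, value):
    """Fold one module output into the signature, MISR-style."""
    lsb = sig & 1
    sig >>= 1
    if lsb:
        sig ^= TAPS
    return (sig ^ value) & MASK


def run_test(alu, operands):
    """Run a functional test sequence on one core and return its signature."""
    sig = 0
    for a, b in operands:
        sig = update_signature(sig, alu(a, b) & MASK)
    return sig


def healthy_alu(a, b):
    """Hypothetical fault-free adder."""
    return (a + b) & MASK


def faulty_alu(a, b):
    """Hypothetical adder with output bit 3 stuck at 0."""
    return (a + b) & MASK & ~0x0008


def find_suspects(signatures):
    """Return the ids of cores whose signature differs from the majority."""
    majority, _ = Counter(signatures.values()).most_common(1)[0]
    return sorted(core for core, s in signatures.items() if s != majority)


# The same test sequence runs on every core (hypothetical operands).
TEST_SEQUENCE = [(1, 2), (7, 1), (0x00FF, 0x0001), (12, 12)]
cores = {0: healthy_alu, 1: healthy_alu, 2: faulty_alu, 3: healthy_alu}

signatures = {cid: run_test(alu, TEST_SEQUENCE) for cid, alu in cores.items()}
suspects = find_suspects(signatures)
print("suspect cores:", suspects)
```

Comparing compact signatures between cores, rather than against stored expected responses, is one way to avoid keeping golden patterns. In this toy model, a core is flagged only if the test sequence exercises the faulty bit and a majority of cores are healthy.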
INDEX TERMS
Chip Multiprocessor (CMP), hard error detection, isolation and tolerance, hardware/software codesign.
CITATION
Omer Khan, Sandip Kundu, "Hardware/Software Codesign Architecture for Online Testing in Chip Multiprocessors", IEEE Transactions on Dependable and Secure Computing, vol.8, no. 5, pp. 714-727, September/October 2011, doi:10.1109/TDSC.2011.19