Issue No. 05 - September/October (2011 vol. 8)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TDSC.2011.19
Omer Khan , Massachusetts Institute of Technology, Cambridge
Sandip Kundu , University of Massachusetts Amherst, Amherst
As the semiconductor industry continues its relentless push for nano-CMOS technologies, long-term device reliability and occurrence of hard errors have emerged as a major concern. Long-term device reliability includes parametric degradation that results in loss of performance as well as hard failures that result in loss of functionality. It has been reported in the ITRS roadmap that effectiveness of traditional burn-in test in product life acceleration is eroding. Thus, to assure sufficient product reliability, fault detection and system reconfiguration must be performed in the field at runtime. Although regular memory structures are protected against hard errors using error-correcting codes, many structures within cores are left unprotected. Several proposed online testing techniques either rely on concurrent testing or periodically check for correctness. These techniques are attractive, but limited due to significant design effort and hardware cost. Furthermore, lack of observability and controllability of microarchitectural states result in long latency, long test sequences, and large storage of golden patterns. In this paper, we propose a low-cost scheme for detecting and debugging hard errors with a fine granularity within cores and keeping the faulty cores functional, with potentially reduced capability and performance. The solution includes both hardware and runtime software based on codesigned virtual machine concept. It has the ability to detect, debug, and isolate hard errors in small noncache array structures, execution units, and combinational logic within cores. Hardware signature registers are used to capture the footprint of execution at the output of functional modules within the cores. A runtime layer of software (microvisor) initiates functional tests concurrently on multiple cores to capture the signature footprints across cores to detect, debug, and isolate hard errors. Results show that using targeted set of functional test sequences, faults can be debugged to a fine-granular level within cores. The hardware cost of the scheme is less than three percent, while the software tasks are performed at a high-level, resulting in a relatively low design effort and cost.
Chip Multiprocessor (CMP), hard error detection, isolation and tolerance, hardware/software codesign.
Omer Khan, Sandip Kundu, "Hardware/Software Codesign Architecture for Online Testing in Chip Multiprocessors", IEEE Transactions on Dependable and Secure Computing, vol. 8, no. , pp. 714-727, September/October 2011, doi:10.1109/TDSC.2011.19