Guest Editors' Introduction: Special Section on Concurrent On-Line Testing and Error/Fault Resilience of Digital Systems

Cecilia Metra, IEEE
Rajesh Galivanche, IEEE

Pages: pp. 1217-1218

The continuous scaling of microelectronic technology allows increasingly complex, high-performance systems to be integrated on a single die, but it also poses new challenges to their reliable operation: faults and aging phenomena are increasingly likely to occur in the field and compromise the system's correct operation.

Several on-line testing and error/fault resilience techniques have been employed in the past to implement highly reliable, fault tolerant systems for mission critical applications, in areas like space, military, automotive, medical, banking, etc. However, new faults and aging phenomena occurring in the field are posing unique on-line testing and error/fault resilience challenges even for mainstream applications, where cost is a crucial factor. This mandates the development and adoption of innovative solutions optimized for cost, power and area.

This Special Section consists of eleven articles, selected to provide readers with a single comprehensive reference on the theoretical and practical aspects of innovative techniques for on-line testing and error/fault resilience of electronic systems. These techniques can be adopted to address the reliability challenges of today's complex electronic systems, including high-performance microprocessors, multicore systems, real-time systems, and systems for cryptographic applications.

In “ReviveNet: A Self-adaptive Architecture for Improving Lifetime Reliability via Localized Timing Adaptation”, G. Yan, Y. Han, and X. Li propose a new on-line approach to detect and compensate for aging. Aging sensors and a mechanism to tolerate aging-induced delay are presented, which are proven to improve the Mean-Time-To-Failure by up to 48.7 percent, at the cost of 9.5 percent area overhead and a small increase in power consumption.

In “CEDA: Control-flow Error Detection using Assertions”, R. Vemu and J.A. Abraham propose a software technique for on-line detection of control-flow errors (i.e., errors in which a processor executes a wrong sequence of instructions because of faults). Compared to previously proposed methods, their approach provides higher error coverage while incurring lower performance overhead.
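
To make the notion of a control-flow error concrete, the following minimal sketch checks an executed sequence of basic blocks against a known control-flow graph. It is a generic illustration only and not the CEDA algorithm; the graph `CFG`, the block names, and the function `check_trace` are hypothetical.

```python
# Hypothetical control-flow graph: basic block -> set of legal successor blocks.
CFG = {
    "entry": {"loop"},
    "loop": {"loop", "exit"},
    "exit": set(),
}


def check_trace(trace, cfg=CFG, start="entry"):
    """Return the index of the first illegal step in `trace`, or None if legal.

    A step is illegal when the trace does not begin at `start` or when a block
    is not a legal successor of the block executed immediately before it.
    """
    if not trace or trace[0] != start:
        return 0
    for i in range(1, len(trace)):
        if trace[i] not in cfg.get(trace[i - 1], set()):
            return i
    return None


legal = check_trace(["entry", "loop", "loop", "exit"])
illegal = check_trace(["entry", "exit"])  # "entry" may only branch to "loop"
```

Software schemes of this kind typically embed such checks in the program itself as signature updates and assertions, which is where the trade-off between error coverage and performance overhead arises.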

In “Modeling Yield, Cost, and Quality of a Spare-enhanced Multicore Chip”, S. Shamshiri and K.-T. Cheng propose a model for the yield and cost of a NoC-based multicore chip. They show that, by adding extra cores and wires to replace faulty cores and wires before shipment or in the field, the effective yield of the chip and its cost can be significantly improved, and manufacturing testing requirements can be relaxed.

In “Instruction-Level Impact Analysis of Low-Level Faults in a Modern Microprocessor Controller”, M. Maniatakos, N. Karimi, C. Tirumurti, A. Jas, and Y. Makris address the problem of assessing the relevance of low-level faults (i.e., faults in the RT- or Gate Level description) in the control logic of modern microprocessors, based on their impact on the execution of typical programs. They propose a fault simulation infrastructure, allowing the injection of stuck-at faults and transient errors, as well as the classification of their repercussions into instruction-level errors. Considering an Alpha-like superscalar microprocessor as a case study, they perform extensive fault injection experiments into the control modules to assess the distribution of low-level faults into the instruction-level error types.

In “Workload-Cognizant Concurrent Error Detection in the Scheduler of a Modern Microprocessor”, N. Karimi, M. Maniatakos, A. Jas, C. Tirumurti, and Y. Makris present a concurrent error detection scheme for the scheduler of modern microprocessors, based on monitoring a set of invariances imposed through added hardware. Considering an Alpha-like superscalar microprocessor as a case study, they show that, at a hardware overhead of 32 percent of the scheduler, their proposed approach allows the detection of over 85 percent of faults affecting the architectural state of the microprocessor. Over 99.5 percent of these faults are detected before they corrupt the architectural state, while the remaining faults have an average detection latency on the order of a few clock cycles.

In “A Comparative Study of System Level Energy Management Methods for Fault-Tolerant Hard Real Time Systems”, S. Aminzadeh and A. Ejlali consider the case of embedded real-time systems using replication for fault tolerance and analyze the impact of diverse system level energy reduction methods on their reliability and energy consumption. Based on the performed comparative study, guidelines are provided to allow designers to choose the optimal energy management method for applications with diverse energy-reliability constraints.

In “Time-Multiplexed Online Checking”, M. Gao, H.-M. Chang, P. Lisherness, and K.-T. Cheng introduce a new on-line testing approach based on time multiplexing, which uses embedded field-programmable blocks for checker implementation. The proposed technique allows on-line checking of various parts of a system at lower area and power costs than traditional approaches, at the expense of some increase in fault detection latency.

In “Guided Probabilistic Checksums for Error Control in Low-Power Digital Filters”, M.M. Nisar and A. Chatterjee consider the case of low-power linear digital filters employing checksum codes for detection and compensation of intermittent errors due to voltage overscaling. They propose a guided probabilistic error compensation technique that allows significant power savings with minimal degradation in system performance.

In “A Low-Power High-Performance Concurrent Fault Detection Approach for the Composite Field S-Box and Inverse S-Box”, M. Mozaffari-Kermani and A. Reyhani-Masoleh address the problem of the security and reliability of the Advanced Encryption Standard (AES). They propose a concurrent fault detection scheme for the nonlinear operations within the AES. They prove that the proposed technique requires lower costs in terms of area overhead, critical path delay, and power consumption than alternative approaches, for the same target fault detection capability.

In “Concurrent Error Detection in Montgomery Multiplication over Binary Extension Fields”, A. Hariri and A. Reyhani-Masoleh consider the case of Montgomery multipliers, which are frequently adopted in cryptographic and coding applications. They propose a parity-based concurrent error detection approach for natural faults, as well as for fault attacks in cryptography, that is proven to provide significant error detection capability at low time and area costs.

In “Efficient On-line Self-Checking Modulo $2^n + 1$ Multiplier Design”, W. Hong, R. Modugu, and M. Choi consider the case of modulo $2^n + 1$ multipliers, which are frequently used in cryptographic applications employing the International Data Encryption Algorithm (IDEA). They propose a hardware, residue-code-based self-checking implementation of such multipliers that allows the on-line testing of faults affecting a single gate at a time, at a 20 to 45 percent area overhead and a two to seven percent performance penalty relative to the non-self-checking implementation, for $n = 64$ down to $n = 8$, respectively.
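
The general idea behind residue-code checking can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the authors' hardware design: the function name `mul_mod_checked`, the check modulus of 3, and the `inject_error` fault-injection parameter are all hypothetical, and the IDEA convention of encoding $2^n$ as zero is ignored.

```python
def mul_mod_checked(a, b, n, check_mod=3, inject_error=0):
    """Multiply a and b modulo 2**n + 1 and verify the result with a residue check.

    Since a*b = q*M + p with M = 2**n + 1, the residues must satisfy
    (p + q*M) mod check_mod == ((a mod check_mod) * (b mod check_mod)) mod check_mod.
    `inject_error` is XORed into p to emulate a fault in the multiplier output.
    Returns (p, ok), where ok is False if the residue check detects a mismatch.
    """
    M = (1 << n) + 1
    q, p = divmod(a * b, M)
    p ^= inject_error  # hypothetical fault injection on the result bits

    # Residue predicted independently from the operands.
    predicted = ((a % check_mod) * (b % check_mod)) % check_mod
    # Residue observed from the (possibly faulty) result and the quotient.
    observed = (p % check_mod + (q % check_mod) * (M % check_mod)) % check_mod
    return p, predicted == observed


good = mul_mod_checked(5, 7, 4)                  # fault-free: 35 mod 17
bad = mul_mod_checked(5, 7, 4, inject_error=1)   # flip the least significant bit
```

With a check modulus of 3, any single-bit flip in the result changes it by a power of two, which is never a multiple of 3, so the mismatch is always detected in this simplified model.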

We hope that this Special Section will constitute a reference publication for future research and developments in the field of on-line testing and error/fault resilience of electronic systems. We thank all authors and reviewers. We also thank the IEEE Transactions on Computers past editor-in-chief, Fabrizio Lombardi, and current editor-in-chief, Albert Zomaya, for allowing us to create this Special Section.

Cecilia Metra

Rajesh Galivanche

Guest Editors

About the Authors

Cecilia Metra is a professor in electronics in the Department of Electronic, Computer Science and Systems (DEIS) of the University of Bologna. She is also affiliated with the Advanced Research Center on Electronic Systems for Information and Communication Technologies E. De Castro (ARCES) of the University of Bologna. She has been Visiting Scholar at the University of Washington, Seattle (USA) from 1998 to 2001, and Visiting Faculty Consultant for Intel Corporation, Santa Clara (CA) in 2002. She is the General Chair of the IEEE Int'l VLSI Test Symposium 2011, and she has been General Cochair of The IEEE Int'l Symposium on Defect and Fault Tolerance in VLSI Systems 2005 and 1999, of the IEEE Int'l On-Line Testing Symposium 2006 and of the IEEE Int'l On-Line Testing Workshop 2001, and Program Chair/Cochair of the IEEE Int'l VLSI Test Symposium 2009 and 2008, of the IEEE International Workshop on Design and Test of Nano Devices, Circuits and Systems (NDCS) 2008, of The IEEE Int'l Symposium on Defect and Fault Tolerance in VLSI Systems 1998, of the IEEE Int'l On-Line Testing Symp. 2005, 2004, and 2003, and of the IEEE Int'l On-Line Testing Workshop 2002. She serves/served as Topic Chair and as Member of the Organizing Committee and/or Technical Program Committee of several international conferences. Her research interests are in the field of Design and Test of Integrated Digital Systems, Reliable and Error Resilient Systems, Fault Tolerance, On-Line Testing, Fault Modeling, Diagnosis and Debug, Emergent Technologies, Energy Harvesting and Security. She is Associate Editor in Chief of the IEEE Transactions on Computers, and a Member of the Editorial Board of the Journal of Electronic Testing: Theory and Applications and the International Journal of Highly Reliable Electronic System Design. She is a Senior Member and a Golden Core member of the IEEE Computer Society.
Rajesh Galivanche is a senior principal engineer in the Technology and Manufacturing Group at Intel. As the architect for DFT and Test Technology, Rajesh sets the strategy for research and development of Design-for-Test and HVM test technologies for Intel microprocessor and consumer SoC products. Rajesh also chaired the Intel-wide task force on Logic Fault tolerance in Intel products. In these roles, he works closely with both academia and the EDA industry in advancing the state of the art in test and fault-tolerant systems. Rajesh has published several papers in IEEE conference proceedings, and has two patents issued and two patent applications pending. He has served as keynote speaker at many workshops on manufacturing test and on-line testing. Rajesh has served on the Program Committees of the IEEE VLSI Test Symposium, the IEEE International Test Conference, and the European Test Symposium. Rajesh has an MS in electrical and computer engineering from the University of Iowa and is a senior member of the IEEE.