# Guest Editors' Introduction: Special Section on Concurrent On-Line Testing and Error/Fault Resilience of Digital Systems

Cecilia Metra, IEEE
Rajesh Galivanche, IEEE

Pages: pp. 1217-1218

The continuous scaling of microelectronic technology allows increasingly complex, high-performance systems to be integrated on a single die, but it also poses new challenges to their reliable operation in the field, because faults and aging phenomena are more likely to occur and compromise the system's correct operation.

Several on-line testing and error/fault resilience techniques have been employed in the past to implement highly reliable, fault-tolerant systems for mission-critical applications in areas such as space, military, automotive, medical, and banking. However, new faults and aging phenomena occurring in the field are posing unique on-line testing and error/fault resilience challenges even for mainstream applications, where cost is a crucial factor. This mandates the development and adoption of innovative solutions optimized for cost, power, and area.

This Special Section consists of eleven articles selected to provide readers with a single comprehensive reference on the theoretical and practical aspects of innovative techniques for on-line testing and error/fault resilience of electronic systems. These techniques can be adopted to address the reliability challenges of today's complex electronic systems, including high-performance microprocessors, multicore systems, real-time systems, and systems for cryptographic applications.

In “ReviveNet: A Self-adaptive Architecture for Improving Lifetime Reliability via Localized Timing Adaptation”, G. Yan, Y. Han, and X. Li propose a new on-line approach to detect and compensate for aging. They present aging sensors and a mechanism to tolerate aging-induced delay, which are proven to improve the mean time to failure by up to 48.7 percent, at the cost of 9.5 percent area overhead and a small increase in power consumption.

In “CEDA: Control-flow Error Detection using Assertions”, R. Vemu and J.A. Abraham propose a software technique for on-line detection of control-flow errors (i.e., errors in which a processor executes a wrong sequence of instructions because of faults). Compared to previously proposed methods, their approach provides higher error coverage while incurring lower performance overhead.

In “Modeling Yield, Cost, and Quality of a Spare-enhanced Multicore Chip”, S. Shamshiri and K.-T. Cheng propose a model for the yield and cost of a NoC-based multicore chip. They show that, by adding extra cores and wires to replace faulty cores and wires before shipment or in the field, the effective yield of the chip and its cost can be significantly improved, and manufacturing testing requirements can be relaxed.

In “Instruction-Level Impact Analysis of Low-Level Faults in a Modern Microprocessor Controller”, M. Maniatakos, N. Karimi, C. Tirumurti, A. Jas, and Y. Makris address the problem of assessing the relevance of low-level faults (i.e., faults in the RT- or gate-level description) in the control logic of modern microprocessors, based on their impact on the execution of typical programs. They propose a fault simulation infrastructure that allows the injection of stuck-at faults and transient errors, as well as the classification of their repercussions into instruction-level errors. Considering an Alpha-like superscalar microprocessor as a case study, they perform extensive fault injection experiments on the control modules to assess how low-level faults are distributed across the instruction-level error types.

In “Workload-Cognizant Concurrent Error Detection in the Scheduler of a Modern Microprocessor”, N. Karimi, M. Maniatakos, A. Jas, C. Tirumurti, and Y. Makris present a concurrent error detection scheme for the scheduler of modern microprocessors, based on monitoring a set of invariances imposed through added hardware. Considering an Alpha-like superscalar microprocessor as a case study, they show that, at a hardware overhead of 32 percent of the scheduler, their proposed approach allows the detection of over 85 percent of faults affecting the architectural state of the microprocessor. Over 99.5 percent of these faults are detected before they corrupt the architectural state, while the remaining faults have an average detection latency on the order of a few clock cycles.

In “A Comparative Study of System Level Energy Management Methods for Fault-Tolerant Hard Real Time Systems”, S. Aminzadeh and A. Ejlali consider the case of embedded real-time systems using replication for fault tolerance and analyze the impact of diverse system level energy reduction methods on their reliability and energy consumption. Based on the performed comparative study, guidelines are provided to allow designers to choose the optimal energy management method for applications with diverse energy-reliability constraints.

In “Time-Multiplexed Online Checking”, M. Gao, H.-M. Chang, P. Lisherness, and K.-T. Cheng introduce a new on-line testing approach based on time multiplexing, which uses embedded field-programmable blocks for checker implementation. The proposed technique allows on-line checking of various parts of a system at lower area and power costs than traditional approaches, at the expense of some increase in fault detection latency.

In “Guided Probabilistic Checksums for Error Control in Low-Power Digital Filters”, M.M. Nisar and A. Chatterjee consider the case of low-power linear digital filters employing checksum codes for detection and compensation of intermittent errors due to voltage overscaling. They propose a guided probabilistic error compensation technique that allows significant power savings with minimal degradation in system performance.

In “A Low-Power High-Performance Concurrent Fault Detection Approach for the Composite Field S-Box and Inverse S-Box”, M. Mozaffari-Kermani and A. Reyhani-Masoleh address the problem of the security and reliability of the Advanced Encryption Standard (AES). They propose a concurrent fault detection scheme for the nonlinear operations within the AES. They prove that the proposed technique incurs lower costs in terms of area overhead, critical path delay, and power consumption than alternative approaches, for the same target fault detection capability.

In “Concurrent Error Detection in Montgomery Multiplication over Binary Extension Fields”, A. Hariri and A. Reyhani-Masoleh consider the case of Montgomery multipliers, which are frequently adopted in cryptographic and coding applications. They propose a parity-based concurrent error detection approach that targets both natural faults and fault attacks in cryptography, and that is proven to provide significant error detection capability at low time and area costs.

In “Efficient On-line Self-Checking Modulo $2^n + 1$ Multiplier Design”, W. Hong, R. Modugu, and M. Choi consider the case of modulo $2^n + 1$ multipliers, which are frequently used in cryptographic applications employing the International Data Encryption Algorithm (IDEA). They propose a residue-code-based self-checking hardware implementation of such multipliers that allows the on-line testing of faults affecting a single gate at a time, at an area overhead of 20 to 45 percent and a performance penalty of two to seven percent over the non-self-checking implementation, for $n = 64$ down to $n = 8$, respectively.

We hope that this Special Section will constitute a reference publication for future research and developments in the field of on-line testing and error/fault resilience of electronic systems. We thank all authors and reviewers. We also thank the IEEE Transactions on Computers past editor-in-chief, Fabrizio Lombardi, and current editor-in-chief, Albert Zomaya, for allowing us to create this Special Section.

Cecilia Metra

Rajesh Galivanche

Guest Editors