Guest Editors' Introduction: Special Section on System-Level Design of Reliable Architectures
MAY 2010 (Vol. 59, No. 5) pp. 577-578
0018-9340/10/$31.00 © 2010 IEEE

Published by the IEEE Computer Society
Cristiana Bolchini, IEEE Senior Member

Donatella Sciuto, IEEE Senior Member
It is with great pleasure that we introduce this special section on System-Level Design of Reliable Architectures to the audience of the IEEE Transactions on Computers.
Six papers have been selected covering a wide spectrum of topics ranging from architectural fault-tolerant techniques to formal methodologies for reliability analysis. These papers are authored by relevant researchers in the field and cover theoretical and experimental topics.
The widespread use of electronics in everyday life is directing more and more attention to the reliability of such systems in order to preserve both user and environmental safety; the design of reliable architectures is therefore a necessity today rather than an option, even in noncritical application domains. At the same time, these systems are reaching high levels of complexity, leading the designer both to develop specific components and to use and compose existing ones to achieve the desired overall functionality. In the former case, ad hoc techniques may be devised, acting on either the hardware or the software to cope with the occurrence of faults. In the latter case, when combining independently designed modules, the enhancement and assessment of reliability become particularly important; for instance, specific approaches are required both to apply fault detection/tolerance techniques from the initial steps of the design flow and to evaluate the effects of faults in a component as it interacts with the others composing the overall system. As a result, the entire design flow needs to be enhanced to support reliability: from the initial modeling of the system, together with the desired properties/requirements and the fault model, through the hardware/software partitioning step, to the subsequent design exploration phase, where the more traditional metrics covering performance, cost, and power consumption need to be extended to also weigh fault detection/tolerance capabilities. Functional verification and reliability analysis constitute two further aspects of this scenario, used to assess the quality of the designed system in terms of correctness and of its ability to deal with failures.
In this scenario, new advances have been achieved in all the relevant issues pertaining to the system-level design of reliable systems, supporting designers in the development of innovative architectures able to cope with the occurrence of failures. Such advances lead to the definition of both new methodologies and new architectures. Furthermore, depending on the application environment in which the system will be deployed, different classes of reliability might be necessary; in some situations it suffices to achieve an autonomous fault detection capability, whereas in critical environments fault effects need to be completely masked, thus providing fault tolerance properties.
The six papers presented in this special section were selected to address different aspects of the important challenges related to the system-level design of reliable systems. They cover various facets of the issue, offering interesting solutions to specific problems.
The first two papers deal with reliability analysis, which has become a fundamental tool for computer engineers in validating the design of hardened system architectures, in particular in safety- and mission-critical domains such as medicine, the military, and transportation. The first paper, "Formal Reliability Analysis Using Theorem Proving," by Osman Hasan, Sofiène Tahar, and Naeem Abbasi, addresses an important aspect of reliability analysis: it attempts to introduce formal verification in place of simulation-based and probabilistic approaches for assessing the fault tolerance characteristics of the designed systems. The authors propose to conduct a formal reliability analysis of systems within the framework of a higher-order-logic theorem prover. They present the higher-order-logic formalization of some fundamental reliability theory concepts, which can be built upon to precisely analyze the reliability of various engineering systems. The proposed formalization is then applied to analyze the repairability conditions for a reconfigurable memory array in the presence of stuck-at and coupling faults.
Still within the context of reliability analysis, the second paper, "Efficient Microarchitectural Vulnerabilities Prediction Using Boosted Regression Trees and Patient Rule Inductions," by Bin Li, Lide Duan, and Lu Peng, deals with the analysis of the Architectural Vulnerability Factor (AVF), which reflects the probability that a transient fault eventually causes a visible error in the program output and thus indicates a system's susceptibility to transient faults. This metric is increasingly being adopted to evaluate microprocessor architectures, whose vulnerability to transient faults is growing as a result of shrinking feature sizes and threshold voltages and increasing frequency. The authors propose an innovative way to predict the AVF using Boosted Regression Trees, a nonparametric tree-based predictive modeling scheme, to identify the correlation, across workloads, execution phases, and processor configurations, between the estimated AVF of a key processor structure and various performance metrics.
The next two papers deal with fault detection techniques for different architectural components. The first, "Concurrent Structure-Independent Fault Detection Schemes for the Advanced Encryption Standard," is authored by Mehran Mozaffari-Kermani and Arash Reyhani-Masoleh. The authors focus on the Advanced Encryption Standard (AES), widely adopted and accepted as the symmetric cryptography standard for confidential data transmission. Natural or maliciously injected faults may, however, lead to the leakage of confidential information, so countermeasures need to be taken and reliability plays an important role for this kind of architectural component. In the paper, the authors study a number of fault detection schemes for both AES encryption and decryption, and define new ones that are independent of the specific internal structures. The reported results are interesting, achieving an error coverage of close to 100 percent.
The second paper dealing with online fault detection is entitled "Microarchitectural Online Testing for Failure Detection in Memory Order Buffers," by Javier Carretero, Xavier Vera, Pedro Chaparro, and Jaume Abella. The authors propose to exploit microarchitectural knowledge of application runtime behavior to implement online testing techniques able to detect hard errors in the memory order buffer logic at a limited cost. The paper presents three different implementations of the idea, providing different trade-offs in terms of error coverage, performance overhead, and design complexity.
The last two papers target fault tolerance and propose architectural solutions against the occurrence of soft and hard errors in multiprocessor systems. The first, "PERFECTORY: A Fault-Tolerant Directory Memory Architecture," by Hyunjin Lee, Sangyeun Cho, and Bruce R. Childers, introduces ways to protect the on-chip coherence directory from fault occurrences in multiprocessor systems. In particular, the authors propose a novel online fault detection and error recovery scheme that protects the directory memory from soft errors. The second paper is entitled "Thread Relocation: A Runtime Architecture for Tolerating Hard-Errors in Chip Multiprocessors," by Omer Khan and Sandip Kundu. It provides runtime methods to tolerate hard errors in on-chip multiprocessors. The authors' scheme exploits intercore redundancies and combines hardware reconfiguration with a runtime software layer that manages the mapping of threads to cores, taking into account the reduced functionality of degraded cores. The reported results show that, in the presence of cores degraded by hard errors, the software layer succeeds in limiting performance loss to an average of 2 percent.
We hope that the papers included in this special section can offer an interesting perspective on the most recent work in the design of reliable architectures at system level. The issues raised by the authors are important and the proposed solutions present interesting results, which could be taken as a reference for future developments and further research.
We would like to thank the editor-in-chief of the IEEE Transactions on Computers, Dr. Fabrizio Lombardi, for hosting this section, and all of the editorial staff for their support in preparing this issue. We would also like to thank the authors of the submitted papers and the numerous reviewers who contributed to the high quality of this special section.
Cristiana Bolchini
Donatella Sciuto
Guest Editors

    The guest editors are with the Dipartimento di Elettronica e Informazione, Politecnico di Milano, P.zza L. da Vinci, 32, 20133 Milano, Italy. E-mail: {bolchini, sciuto}

Cristiana Bolchini received the degree in electronic engineering and the PhD in automation and computing engineering from the Politecnico di Milano, where she is an associate professor at the Dipartimento di Elettronica e Informazione. Since 2008, she has been an associate editor of the IEEE Transactions on Computers. Dr. Bolchini is or has been a member of various technical program committees of conferences and symposia in the area of test and fault tolerance for digital systems, and has served as program cochair and general chair of DFT; she is also a reviewer for journals and transactions in this same area. Her research interests include digital system design with a specific focus on reliability properties, hardware/software co-design of dependable systems, and reconfigurable systems. She has authored several papers in this area. In recent years she has also contributed to founding a research group on context-aware data design, tailoring, and management to cope with mobility and huge, noisy amounts of information. She is a senior member of the IEEE.

Donatella Sciuto received the Laurea in electronic engineering from the Politecnico di Milano and the PhD degree in electrical and computer engineering from the University of Colorado, Boulder. She is currently a full professor in the Dipartimento di Elettronica e Informazione of the Politecnico di Milano, Italy. She has served as an associate editor of the IEEE Transactions on Computers, and now serves as an associate editor of the IEEE Embedded Systems Letters for the design methodologies topic area and as an associate editor of the Journal of Design Automation of Embedded Systems. Professor Sciuto has offered several technical services to the IEEE. In particular, she has been on the executive committee of DATE for more than five years, serving as technical program chair in 2006 and general chair in 2008, as well as in a number of other positions. She is general cochair of ESWEEK for 2009 and 2010. She also served as an executive committee member of ICCAD for three years and is currently serving a two-year term as VP of Finance for the Council of EDA, for which she will serve as President-elect for the next two years. She has received various IEEE service awards and the Outstanding Contribution Award from the IEEE Computer Society in 2009. She is or has been a member of various program committees of ACM, IEEE, and EDA conferences and workshops. Her main research interests cover the methodologies for the design of embedded systems and multicore systems, from the specification level down to the implementation of both the hardware and software components, including reconfigurable and adaptive systems. She is a senior member of the IEEE, IFIP 10.5, and the EDAA.