Pages: pp. 3-4
Computer architecture is a broad, active research field that spans all aspects of computer systems design. Dependability, the trustworthiness of a computing system that allows reliance to be justifiably placed on the service it delivers, has always been a key aspect of computer architecture and has been extensively investigated since the first days of computing. Traditionally, dependable computer architectures have been deployed in high-end computing systems or systems for critical applications, where continuous, uninterrupted operation is among the most important requirements and systems are characterized by high reliability, availability, and maintainability (well-known measures that quantify dependability).
The time has now come for the wealth of concepts and methodologies in the field of dependable (also known as fault-tolerant) computer architecture, proposed over the last several decades, to be adopted and extended in mainstream, general-purpose computing systems. Dependable operation of computing systems is a mandatory requirement in virtually all application fields (at lower or higher cost), due to the increasing reliance of everyday human activities on computers and microprocessor-based systems in general. Unfortunately, this ubiquitous computing revolution comes hand in hand with hard-to-solve technological issues that are closely related to the dependable operation of a computing system.
Integrated circuits are implemented today in miniaturized and inherently unreliable technologies that render circuits more vulnerable both to temporary disturbances leading to transient (or soft) errors and to permanent (or hard) errors. Soft errors in silicon-based circuits are caused by alpha particles emitted by radioactive decay in integrated circuit packages or by cosmic rays, which produce high-energy neutrons and protons. Hard errors, on the other hand, appear either because of manufacturing defects that escape high-volume production testing or because of material aging and wearout mechanisms during the system's life cycle, which are exacerbated by the high clock frequencies of modern circuits.
Computing systems are constantly growing more complex (in particular with the recent turn toward multicore processors and high-performance memory systems), while at the same time strict time-to-market constraints demand extremely short design, verification, and validation intervals.
The net outcome of all the previous factors is that dependable operation of computing systems in the field should be a first-order design consideration in all application domains, obviously at different cost points. This special section of IEEE Transactions on Computers focuses on architectural techniques that enhance the dependability of different components of a computing system. A total of 27 high-quality manuscripts were submitted to the special section from academic and industrial research groups worldwide, and more than 140 reviews from a strong group of expert reviewers were required to make the final decisions. The special section published in the current issue of IEEE Transactions on Computers includes seven of the papers, roughly one quarter of those submitted on this topic.
The seven papers of the special section on Dependable Computer Architecture cover a wide spectrum of subsystems of mainstream architectures: processors (in particular CMPs), cache memories, and flash and disk storage; a paper on probabilistic (stochastic) architectures is also included. Methodologies are proposed for the effective handling of both hard and soft errors, and the papers comprehensively discuss the major tradeoffs between dependability and other design aspects such as performance, cost, yield, and energy/power.
The first paper, entitled “StageNet: A Reconfigurable Fabric for Constructing Dependable CMPs”, by Shantanu Gupta, Shuguang Feng, Amin Ansari, and Scott Mahlke from the University of Michigan, Ann Arbor, opens the special section and deals with tolerating the progressively higher defect densities caused by aging in massively parallel chip multiprocessors (CMPs). The authors present StageNet, a reconfigurable CMP architecture that is primarily designed to provide fault tolerance and extend the lifetime of the system, which degrades gracefully as the number of permanent faults increases.
The subsequent three papers focus on different aspects of cache memories with respect to dependability.
The second paper of the special section entitled “Reliability-Driven ECC Allocation for Multiple Bit Error Resilience in Processor Cache”, by Somnath Paul, Fang Cai, Xinmiao Zhang, and Swarup Bhunia, from Case Western Reserve University, discusses a non-uniform, variable ECC allocation scheme to effectively tolerate multiple bit errors in cache memories. The method utilizes post-fabrication characterization information to provide different ECC allocation depending on the relative vulnerability of cache memory blocks.
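The paper's exact allocation algorithm is not reproduced here; as a rough, hypothetical illustration of the general idea, a vulnerability-driven policy might map each cache block's measured fault count (from post-fabrication characterization) to an ECC tier of increasing strength. The tier names, thresholds, and characterization data below are invented for this sketch:

```python
# Hypothetical policy: give stronger (costlier) ECC only to cache blocks
# that post-fabrication characterization flagged as more vulnerable.
ECC_TIERS = ["parity", "SECDED", "DECTED"]  # increasing strength and cost

def assign_ecc(observed_faults: int) -> str:
    """Map a block's measured fault count to an ECC tier (illustrative thresholds)."""
    if observed_faults == 0:
        return "parity"   # detect-only protection for robust blocks
    if observed_faults == 1:
        return "SECDED"   # single-error-correct, double-error-detect
    return "DECTED"       # double-error-correct for the weakest blocks

faults_per_block = [0, 2, 1, 0]  # made-up characterization data
plan = [assign_ecc(f) for f in faults_per_block]
print(plan)  # ['parity', 'DECTED', 'SECDED', 'parity']
```

The point of such non-uniform allocation is that only the few weak blocks pay the storage and latency overhead of strong ECC.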
The third paper entitled “Maximizing Spare Utilization by Virtually Reorganizing Faulty Cache Lines”, by Amin Ansari, Shantanu Gupta, Shuguang Feng, and Scott Mahlke from the University of Michigan, Ann Arbor, proposes a flexible, reconfigurable architecture for redundant spares allocation in cache memories, aiming to effectively deal with both manufacturing and lifetime failures and improve both manufacturing yield and in-field reliability.
The fourth paper, entitled “Adaptive Cache Design to Enable Reliable, Low-Voltage Operation”, by Alaa R. Alameldeen, Zeshan Chishti, Chris Wilkerson, Wei Wu, and Shih-Lien Lu from Intel Labs, Oregon, focuses on the critical reliability vs. energy tradeoff in cache memory design. The authors propose an adaptive cache design that utilizes a synergy between a hardware mechanism and the operating system to obtain the best energy/reliability combination for the desired performance of the application.
The fifth paper, entitled “Improving Availability of RAID-Structured Storage Systems by Workload Outsourcing”, by Suzhen Wu, Hong Jiang, Dan Feng, Lei Tian, and Bo Mao from the Huazhong University of Science and Technology and the University of Nebraska-Lincoln, proposes a novel scheme that improves the availability of RAID-structured storage systems by outsourcing part of the workload so that low-priority background RAID tasks complete faster. A comprehensive experimental analysis demonstrates the advantages of this portable approach.
The sixth paper, entitled “Flash-Aware RAID Techniques for Dependable and High-Performance Flash Memory SSD”, by Soojun Im and Dongkun Shin from Sungkyunkwan University, discusses reliability and fault tolerance in solid-state disks (SSDs), an important storage technology competing with hard disks. The authors propose a novel RAID-based technique for flash memory SSDs with significantly improved performance in parity calculations and updates.
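Independently of the authors' specific flash-aware scheme, the parity-update cost that such techniques optimize stems from the classical RAID parity identity: when one data block in a stripe changes, the parity need not be recomputed over the whole stripe but can be patched with two XORs. A minimal sketch (the stripe contents are made up):

```python
def full_parity(blocks):
    """Recompute parity by XOR-ing every data block in the stripe (expensive)."""
    parity = 0
    for b in blocks:
        parity ^= b
    return parity

def delta_parity(old_parity, old_block, new_block):
    """Incremental update: XOR out the old block's contribution, XOR in the new one."""
    return old_parity ^ old_block ^ new_block

# Hypothetical 4-block stripe of 8-bit values.
stripe = [0b10110001, 0b01010101, 0b11110000, 0b00001111]
parity = full_parity(stripe)

# Overwrite block 2; the delta update must match a full recomputation.
old = stripe[2]
stripe[2] = 0b10101010
parity = delta_parity(parity, old, stripe[2])
assert parity == full_parity(stripe)
```

The delta update touches only the modified block and the parity block, which is why reducing the number and placement of these parity writes matters so much on flash, where writes are costly and wear out cells.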
The seventh and last paper of the special section, entitled “An Architecture for Fault-Tolerant Computation with Stochastic Logic”, by Weikang Qian, Xin Li, Marc D. Riedel, Kia Bazargan, and David J. Lilja from the University of Minnesota, reveals the advantages that probabilistic circuit design approaches such as stochastic logic can provide over conventional hardware implementations. The authors show that better fault coverage against many sources of errors can be achieved at lower cost.
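As a hedged illustration of the underlying idea (not the authors' architecture), stochastic logic encodes a value in [0, 1] as the probability of a 1 in a random bit stream; a single AND gate then multiplies two independent streams, and a transient flip of any one bit perturbs the decoded value by only 1/n, which is the source of its graceful fault behavior. A small Python simulation:

```python
import random

def to_stream(p, n, rng):
    """Encode probability p as a Bernoulli bit stream of length n."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def stream_value(bits):
    """Decode a stream: the value is the fraction of 1s."""
    return sum(bits) / len(bits)

def stochastic_multiply(a_bits, b_bits):
    """A single AND gate per bit multiplies two independent streams."""
    return [a & b for a, b in zip(a_bits, b_bits)]

rng = random.Random(42)
n = 100_000
a = to_stream(0.5, n, rng)
b = to_stream(0.4, n, rng)
prod = stochastic_multiply(a, b)
print(round(stream_value(prod), 2))  # close to 0.5 * 0.4 = 0.2

# A soft error (one flipped bit) shifts the decoded value by only 1/n.
prod[0] ^= 1
```

Contrast this with a binary-encoded multiplier, where a single flipped high-order bit can change the result by half its range.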
The guest editors of the special section on Dependable Computer Architecture would like to thank the authors for the excellent quality of their submissions, and the reviewers for their timely and competent reviews under a very tight schedule that included one or two revision rounds for each paper.