
Guest Editors' Introduction: Embedded Fault-Tolerant Systems

Dimiter R. Avresky, Boston University
Karl E. Grosspietsch, GMD
Barry W. Johnson, University of Virginia
Fabrizio Lombardi, Northeastern University

Pages: pp. 8-11

Embedded computer systems represent a rapidly growing field in the computer industry. The tremendous growth is partly the result of technology advances that have made it possible to place significant computing power and other functionality on a single integrated circuit.

Embedded systems are typically hardware-software systems placed within a larger overall application, often performing real-time command and control operations. Typical applications include medical electronics (pacemakers), personal communications devices (wireless phones), nuclear reactors (reactor protection systems), automobiles (antilock braking systems), aviation (fly-by-wire flight control systems), railroad (high-speed train control), and others.

Often, embedded systems perform their functions without human oversight. In other words, the typically autonomous embedded system makes decisions that affect an application's integrity, as well as the equipment and people affected by that application. Consequently, the dependability of embedded systems must usually be very high. Reliability, availability, and safety are example measures of dependability, and fault tolerance technology is one approach for improving it. While the fault tolerance field has been active for decades, the specific focus on embedded systems has taken on increased importance due to the rapid expansion of the market for embedded products.

The IEEE Computer Society has sponsored two recent workshops on embedded fault-tolerant systems. The first was held in Dallas, Texas, in 1996, and the second, in Boston in 1998. The workshops brought together researchers and practitioners who are working to solve the technical issues associated with embedded systems and to apply those systems to practical problems. This special issue of IEEE Micro is an outgrowth of those workshops and provides a snapshot of the many technological advances and the variety of interesting applications.

Many technical challenges face the designers of embedded systems. Embedded systems are truly mixed-technology systems that include integrated hardware and software, analog circuitry, digital hardware, mechanical elements, sensors, actuators, real-time operating systems, and application software. It is insufficient to consider the various elements separately because the overall system performance, reliability, and safety will depend on the interactions of the system's components. For example, the ability to meet a hard, real-time deadline will depend on the performance of the processor hardware and on the software's efficient use of that hardware, as well as the analog circuitry's timely response to its input signals.

The ability to design a correctly functioning embedded system depends on our ability to design and analyze mixed-technology systems and to properly account for the interactions that occur between the various technologies. Unfortunately, design environments and computer-aided design tools have not yet successfully integrated the mixed technologies needed to design and analyze a complete embedded system.

Existing fault tolerance techniques often depend on significant knowledge of the specific hardware-software implementation. For example, many practical applications use hardware-based or software-based diagnostics to detect faults and to perform system reconfiguration in response to those faults. The development of such diagnostics requires an intimate knowledge of the hardware-software implementation. Unfortunately, the increasing complexity of ICs such as microcontrollers and the increasing use of commercial off-the-shelf (COTS) hardware and software mean that designers seldom have the detailed implementation knowledge needed to design diagnostic-based systems. Also, design approaches that depend on implementation specifics are not easily portable from one application to the next, so the fault tolerance techniques become dependent on both application and implementation.

Many embedded systems also require some form of certification before they can be licensed for use in applications that affect human safety: nuclear reactors, rail transit, commercial aviation, and medical devices such as pacemakers. Certification requires that the embedded system be assessed and its reliability, safety, and fault tolerance attributes be evaluated in some manner. The assessment techniques needed for certification are also important during design so designers can make informed trade-offs and decisions. In fact, if we knew how to model and analyze mixed-technology systems to adequately support design, we could adequately support system certification. The lack of sufficient analysis techniques and tools that support modeling the mixed-technology systems makes both design and certification formidable tasks.

There is a myth that embedded systems are relatively simple hardware-software systems. Technological advances have allowed us to place on one IC the computing power equivalent to early supercomputers and to surround that computing with analog circuits and other devices. Communications links (including both hardware and software), real-time operating systems, and application software provide an integrated hardware-software system with hundreds of thousands of lines (perhaps millions of lines) of software. The term embedded supercomputing is becoming a reality, and complexity management is a daunting task in embedded systems.

Embedded systems are also often very cost sensitive. The increase in cost of a few pennies is often crucial when the application involves the sale of millions of units. However, cost sensitivity and fault tolerance are often at odds with one another. The industry has rejected many elegant solutions for fault-tolerant systems because of the significant cost associated with them. Highly redundant voting architectures are a superb solution to a very difficult technical problem, but few embedded applications can afford the cost, power, size, and weight associated with these solutions. We must develop techniques that can provide efficient solutions to the fault tolerance problem.
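To make the cost trade-off concrete, here is a minimal sketch of a voting architecture in its simplest form, triple modular redundancy. It is only an illustration: the function names (`majority_vote`, `run_tmr`, `healthy`, `faulty`) are hypothetical and do not come from any of the articles in this issue.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas.

    Raises RuntimeError when no value has a strict majority.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among replica outputs")
    return value


def run_tmr(replicas, x):
    """Run every replica on the same input and vote on the results."""
    return majority_vote([replica(x) for replica in replicas])


def healthy(x):
    # Hypothetical correct computation.
    return 2 * x + 1


def faulty(x):
    # Hypothetical replica with a fault that corrupts its output.
    return 2 * x + 2


# One faulty replica out of three is masked by the voter.
tmr_result = run_tmr([healthy, faulty, healthy], 10)
```

Masking a single fault this way requires three copies of the computation plus a voter, which is the cost, power, size, and weight overhead that few embedded applications can afford.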

The list of technical issues is a very long one, and space does not permit all of them to be discussed here. So, we stop with one last issue. Design faults remain one of the most significant challenges for the embedded fault-tolerant systems community. In the early days, we assumed that software and hardware were free of design faults, and we worried about randomly occurring hardware faults. Then software began to overwhelm us, and we worried about design faults in software. Now, system complexity is such that design faults can (and probably will) exist in hardware (both analog and digital) and software. In fact, the existing design paradigm is gradually making hardware and software indistinguishable in the embedded system. We must be able to handle design faults that occur in the integrated hardware-software system.

We have selected eight articles that cover a spectrum of technologies and applications. Due to space limitations, six will appear in this current issue, and the remaining two, in later issues. Individuals from industry and universities cover issues associated with hardware, software, design environments, and complete systems.

We begin with a report of the development of a fault-tolerant intravenous infusion control system, an important medical application for delivering medication or nutrition to a patient. The system must control the infusion flow rate accurately to prevent over- or under-medication. In developing their system, the authors used a hardware-software codesign process. They describe the hardware-software partitioning process and consider the software's impact on system fault tolerance as part of the partitioning. The authors also present a detailed implementation of their system.

The author of the second article examines portable and fault-tolerant software systems using checkpoints and automatic code generation. Checkpoints are an important feature of embedded, real-time, fault-tolerant systems. The state of a system is periodically stored (this is called a checkpoint), so that any future recovery from faults can restore the state of the machine using the stored checkpoint. This article describes the development of compiler technology that supports the simple and efficient inclusion of fault tolerance techniques into a program during the compilation process. The author has developed a prototype compiler called Porch that supports the developed techniques.
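The checkpoint-and-restore idea described above can be sketched in a few lines. This is a hand-written illustration under simple assumptions (an in-memory state dictionary and a simulated fault), not the mechanism that Porch generates; the names `CheckpointedCounter` and `run_with_recovery` are hypothetical.

```python
import copy


class CheckpointedCounter:
    """Toy task whose state can be saved and restored."""

    def __init__(self):
        self.state = {"step": 0, "total": 0}
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self):
        # Store a copy of the current state for later recovery.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Restore the state saved by the most recent checkpoint.
        self.state = copy.deepcopy(self._checkpoint)

    def advance(self, value):
        self.state["step"] += 1
        self.state["total"] += value


def run_with_recovery(values, checkpoint_every=2, fail_at_step=None):
    """Sum the values, checkpointing periodically and recovering from one simulated fault."""
    task = CheckpointedCounter()
    failed_once = False
    i = 0
    while i < len(values):
        if fail_at_step is not None and i == fail_at_step and not failed_once:
            failed_once = True
            task.state["total"] = -999  # simulated state corruption
            task.rollback()
            i = task.state["step"]  # resume from the checkpointed step
            continue
        task.advance(values[i])
        i += 1
        if i % checkpoint_every == 0:
            task.checkpoint()
    return task.state


final_state = run_with_recovery([1, 2, 3, 4, 5], checkpoint_every=2, fail_at_step=3)
```

In this sketch the work done since the last checkpoint is simply repeated after the rollback, so the checkpoint interval trades storage overhead against the amount of recomputation after a fault.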

IC technology advancements have made it possible to place large computational capabilities in very small physical spaces. The result is the potential for an embedded system with processing power comparable to a supercomputer. The authors of the third article focus on the inclusion of fault tolerance mechanisms to support both synchronous and asynchronous communications between application segments executing on different nodes of the parallel system.

Next we examine the creation of special-purpose logic within processors to support enhanced reliability in embedded systems. The techniques take advantage of an application's knowledge to provide redundancy for a subset of a processor's instruction set when those instructions are needed. The approach supports so-called adaptive computing. Software executing on the processor can use built-in, redundant, programmable logic to provide a backup function on demand. For example, during the execution of one portion of the program, the redundant logic might be configured to provide a redundant arithmetic and logic unit. During another section, the redundant logic might be reconfigured to provide a redundant input/output structure. This enables the software to configure the hardware as needed for particular portions of the application. The authors demonstrate their approach with their design of a simple 8-bit processor.

Replica management is necessary in systems where applications are replicated on different processing elements. Each instance of a replicated application is called a replica. Replica management is a form of redundancy management to coordinate the activities of the replicas. The fifth article develops an approach called active parallel replica management, which provides fault tolerance as well as improved efficiency that results in better system performance. The authors describe an implementation of their approach.

The last article in this current issue addresses the problem of hardware-software codesign in an automobile's safety-critical airbag control system. The authors develop techniques for creating and analyzing executable system specifications and supporting automatic synthesis along with hardware-software prototyping. A major focus of the work is on the interfaces that exist within a system. Of particular importance are the interfaces between software and the remainder of the system since these are often the source of problems.

Though the remaining two articles will appear in later issues of IEEE Micro, we consider them to be part of our overall special issue. They address system-level verification using multilevel concurrent simulation, and fault injection in VHDL models of fault-tolerant systems. Multilevel simulation and fault injection are extremely important in assessing fault-tolerant systems. Multilevel simulation supports the management of complexity during the analysis. Fault injection in VHDL allows faults to be simulated in a language very commonly used in the design process.
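As a rough illustration of what a fault injection experiment measures, the sketch below flips bits in a stored word and checks whether a single parity bit detects the corruption. It is an assumption-laden toy written in Python on plain integers, not VHDL and not the method of the forthcoming article; all function names are hypothetical.

```python
def parity(word):
    """Return the even-parity bit of a non-negative integer."""
    return bin(word).count("1") % 2


def inject_bit_flip(word, bit):
    """Return the word with the given bit position inverted."""
    return word ^ (1 << bit)


def single_fault_campaign(word, width=8):
    """Inject every possible single-bit flip and return the fraction detected by parity."""
    stored_parity = parity(word)
    detected = 0
    for bit in range(width):
        corrupted = inject_bit_flip(word, bit)
        if parity(corrupted) != stored_parity:
            detected += 1
    return detected / width


def double_fault_detected(word, bit_a, bit_b):
    """Report whether parity detects two bit flips at distinct positions."""
    corrupted = inject_bit_flip(inject_bit_flip(word, bit_a), bit_b)
    return parity(corrupted) != parity(word)


single_fault_coverage = single_fault_campaign(0b10110010)
double_detected = double_fault_detected(0b10110010, 0, 1)
```

A campaign like this yields a coverage estimate for a chosen fault model. Here every single-bit flip is detected, while a pair of flips leaves the parity unchanged and goes unnoticed, which is the kind of weakness that fault injection is intended to expose.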


We sincerely hope that you enjoy this special issue. The topics are timely and important, and the authors and editors have done an excellent job of presenting the material. We extend our sincere thanks to all the authors and reviewers. We also thank Steve Diamond, the Editor-in-Chief of IEEE Micro, for allowing us to create this special issue. Finally, special thanks are due to Marie English for editing and assembling this issue.

About the Authors

Dimiter R. Avresky is an associate professor of computer engineering in the Department of Electrical and Computer Engineering at Boston University, where he heads the Network Computing Research Lab. His research interests focus on hardware and software fault-tolerant systems, network computing, performance evaluation, parallel computers, testing, and fault injection. He chaired the 1996-1999 Annual IEEE Workshops on Fault-Tolerant Parallel and Distributed Systems, which are held in conjunction with the International Parallel Processing Symposium. He served as Program and General Co-chair of the 1996 and 1998 IEEE Workshops on Embedded Fault-Tolerant Systems and will do so again in 2000. He is a member of the IEEE, the IEEE Computer Society, and The New York Academy of Sciences.
Karl E. Grosspietsch is a researcher at the German National Research Center for Information Technology (GMD) in St. Augustin, Germany. His main activities comprise research in the fields of computer architecture, fault tolerance, VLSI design, and autonomous systems. The results of his work have been documented in more than 100 publications. Since 1998, he has been the speaker of the Technical Committee on Dependability and Fault Tolerance of the German Computing Society, Gesellschaft fuer Informatik.
Barry W. Johnson is a professor in the Department of Electrical Engineering at the University of Virginia at Charlottesville. His research focuses on all issues associated with designing, analyzing, and implementing fault-tolerant and safety-critical systems for real-time applications. He has published more than 100 papers in these areas. Johnson received his BS, ME, and PhD degrees in electrical engineering from the University of Virginia. He is a fellow of the IEEE and was the 1997 president of the IEEE Computer Society.
Fabrizio Lombardi is chair and the International Test Conference endowed professor in the Department of Electrical and Computer Engineering at Northeastern University. His research interests are fault-tolerant computing, testing and design of digital systems, configurable computing, defect tolerance, and VLSI CAD. Lombardi graduated from the University of Essex, UK, with a BSc in electronic engineering. He received a master's degree in microwaves and modern optics and a diploma in microwave engineering from the Microwave Research Unit at University College, London. He received a PhD from the University of London. He is a member of the IEEE and the Computer Society.