# Guest Editors' Introduction: Online VLSI Testing

Ramesh Karri, Polytechnic University, Brooklyn
Michael Nicolaidis, French National Research Center and TIMA Laboratory

Pages: pp. 12-16

Very large scale integration has become an important implementation technology for many application domains including automotive, communication, medical, and satellite electronics.

Because of the limited on-chip real estate in previous VLSI generations, online testing technology had been restricted to very few application domains, and even there, to a very limited extent. The reduction in VLSI feature sizes and the increase in modern VLSI integrated circuit sizes have resulted in an abundance of on-chip interconnect and data path resources, making online VLSI testing affordable.

Furthermore, with transient and intermittent faults becoming a dominant failure mode in modern VLSI, widespread deployment of online VLSI test technology has become crucial. Key benefits of online VLSI testing include fault tolerance at source, low-latency fault detection and correction, and fault effect localization.

We can implement online testing using hardware redundancy (hardware duplication, triplication, and spares), time redundancy (recomputing with shifted operands), information redundancy (error-detecting and error-correcting codes), or parametric testing methods (built-in current sensors). Nicolaidis and Zorian 1 provide a comprehensive survey of online VLSI testing techniques.
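As a minimal sketch of the time-redundancy idea, the following Python model recomputes an addition with both operands shifted left by one bit and compares the two results. The fault model (one adder output bit stuck at 0) and the function names are assumptions made for illustration; they are not taken from the survey.

```python
def faulty_add(a, b, stuck_bit=None):
    """Model an adder whose output bit `stuck_bit` is stuck at 0.

    With stuck_bit=None the adder is fault-free.
    """
    total = a + b
    if stuck_bit is not None:
        total &= ~(1 << stuck_bit)
    return total


def reso_add(a, b, stuck_bit=None):
    """Add twice on the same (possibly faulty) adder.

    The second pass shifts both operands left by one bit, so a faulty
    bit slice corrupts a different bit of the true sum. A mismatch
    between the two passes signals an error.
    Returns (result_of_first_pass, error_detected).
    """
    first = faulty_add(a, b, stuck_bit)
    shifted = faulty_add(a << 1, b << 1, stuck_bit)
    second = shifted >> 1  # undo the shift before comparing
    return first, first != second


if __name__ == "__main__":
    print(reso_add(5, 3))               # fault-free adder
    print(reso_add(5, 3, stuck_bit=3))  # bit 3 of the adder output stuck at 0
```

Under this fault model, every wrong first-pass result is flagged, because the stuck bit corrupts bit k of the sum in the first pass and bit k-1 in the second. The price is roughly doubled computation time rather than doubled hardware.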

The articles selected for this special issue focus on important areas of online testing. Hussain Al-Assad, Brian Murray, and John P. Hayes discuss the development of a systematic framework based on error coverage, error latency, hardware redundancy, and time redundancy. This framework helps designers evaluate the suitability of existing online testing techniques for embedded system applications.

Two articles address the incorporation of online testing constraints during the high-level synthesis phase of a top-down VLSI design methodology. Samuel Hamilton and Alex Orailoğlu describe techniques for fault detection and isolation via algorithm duplication and on-chip reconfiguration via graceful degradation in data-path-dominated VLSI designs.

Sybille Hellebrand and Hans Wunderlich show us how to make the control units of such data path VLSI designs online testable by partitioning the controller states into groups. They also show how we can constrain the transitions such that they can emanate from states within a group and terminate on states in a different group.

Long clock signal lines in synchronous digital VLSI systems are highly susceptible to transient and intermittent faults. Moreover, the correctness of clock signals distributed across a VLSI circuit is essential for the correct operation of a system. Based on these observations, Cecilia Metra, Michele Favalli, and Bruno Riccò present their design of a self-checking self-checker. It checks whether the clock signals emanating from a single clock source are all simultaneously high (all-1s code) or all simultaneously low (all-0s code).

In contrast to the previously discussed techniques that use hardware, time, or information redundancy for online testing, Jien-Chung Lo makes a strong case for the use of built-in current sensors for both online and production-time monitoring of quiescent ($I_{DDQ}$) and/or transient ($I_{DD}$) current in deep-submicron designs.

Automotive electronics is an important application domain in which online testing techniques have been used. Critical automotive functions such as antilock braking, throttle control, and air-bag control have been implemented using microcontrollers. While straightforward duplication with checking yields low error detection latency and negligible performance penalty, it entails more than 100% area/cost overhead. To reduce the costs associated with duplication with checking, researchers at Robert Bosch GmbH have used low-cost online testing techniques such as parity coding and checking, periodic application of BIST, and $I_{DDQ}$ testing of peripherals in their AE11 microcontroller. Eberhard Böhl, Thomas Lindenkreuz, and Matthias Meerwein describe some of the online testing features of AE11.

Finally, Cristiana Bolchini, Fabio Salice, and Donatella Sciuto propose an approach for fault analysis and simulation of complete or partially concurrent error-detecting designs so as to characterize the faults to which a design is susceptible.

## CONCLUSION

We trust you will find this special online VLSI testing issue helpful and interesting as we make the move to a new testing paradigm and techniques that will support our industry for the year 2000 and beyond.

## Some Practical Applications of Online Testing

Ramesh Karri

Several cost-conscious and safety-critical application domains, including railway, automotive, communication, medical, and satellite electronics, have deployed low-cost online VLSI testing techniques for high reliability, high availability, and/or diagnosability.

Electronic control and signaling of railways is an important domain in which designers have extensively used these testing techniques. Electronic railway control and signaling imposes very stringent safety requirements. First, the probability of a safety-critical failure should be less than $10^{-9}$ per hour. Further, the mean time to failure should be about 10 years, and the fault-detection latencies should be a few milliseconds. A number of safety-critical functions such as continuous speed control and on-board signaling are being implemented as VLSI-based systems satisfying these stringent requirements.

French railways SNCF developed and deployed the SACEM intelligent train control system. 1 This system consists of fixed rail-side and mobile train-borne equipment. The rail-side equipment collects information such as the presence or absence of trains, gradient of the terrain, and local speed limits, then transmits it to the train-borne equipment. On receiving this information, the train-borne equipment integrates it with dynamic information related to its speed, position, and direction, then takes the appropriate action. This includes either triggering emergency brakes or displaying any deceleration information and target speed.

To meet the targeted safety requirements, SACEM designers devised an information-redundancy-based, online testable VLSI processor called the vital coded processor. Since the electromagnetic railway environment is a potential source for common mode errors in VLSI, the designers used information redundancy instead of hardware redundancy. While arithmetic codes detect operation, information storage, and data transfer errors, a signature-based technique detects operator, operand, and control flow errors. The electronic interface elements between the coded processor and the environment were designed to be fail-safe.
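To illustrate how an arithmetic code can detect operation, storage, and transfer errors, here is a minimal Python sketch of an AN code, one common family of arithmetic codes. The constant `A = 97` and all function names are hypothetical; the text above does not give the parameters or the exact code used in the vital coded processor, and the signature-based part of the scheme is not modeled.

```python
A = 97  # hypothetical code constant; any odd value greater than 1 detects single-bit flips


def encode(x):
    """Represent the integer x by the code word A * x."""
    return A * x


def is_valid(word):
    """A word is a code word only if it is a multiple of A."""
    return word % A == 0


def decode(word):
    """Return the data value, or raise if the word fails the code check."""
    if not is_valid(word):
        raise ValueError("arithmetic code violation")
    return word // A


def coded_add(cx, cy):
    """Add two code words; A*x + A*y = A*(x + y) is again a code word."""
    return cx + cy


if __name__ == "__main__":
    total = coded_add(encode(12), encode(30))
    print(decode(total))          # the sum computed entirely on coded operands
    corrupted = total ^ (1 << 4)  # a single-bit upset in storage or transfer
    print(is_valid(corrupted))    # the check flags the corrupted word
```

Because the check is a property of the data itself rather than of a second hardware copy, a disturbance that hits the whole chip does not go unnoticed merely by affecting two copies identically, which fits the common-mode concern described above.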

Shinkansen, the Japanese railway, uses different fault-tolerance techniques at the various levels within its computer-aided train control system. These include online VLSI testing at the lowest physical level and triple modular redundancy at the highest levels. 2 Critical elements of this system include the centralized train controller (CTC) that operates an interlocking device to control train routes and an automatic train controller (ATC) to control train speeds. By clearly identifying the safe and risky states and providing mechanisms to switch to a safe state in case of a failure, CTC and ATC can meet the stringent fail-safe requirements.

The CTC continuously monitors the different sections of the railway tracks for the presence of another train and operates the interlocking device that controls the train route. To guarantee train operation safety, the CTC uses an online testing mechanism built into the interlocking device. This mechanism is implemented in VLSI using a bus-level comparator. Input data, control data, and data from and to memory go through two identical buses, which are continuously compared. Online VLSI testing maintains low fault detection latencies.

Fail-safe capability in other VLSI circuits used in the Japanese railway control/signaling system is achieved by duplicating the functional blocks. 3 Spatial and temporal diversity techniques minimize common mode errors among these redundant functional blocks. Designers implement spatial diversity in the IC floor-planning stage by placing a functional block and its duplicate as far apart as possible. Making a functional block and its duplicate operate at a clock phase difference accomplishes temporal diversity. Finally, self-checking comparators monitor the operation of a functional block and its duplicate.

Saab Ericsson Space has designed and manufactured concurrent error detection and correction circuits for use in European space program 16- and 32-bit buses. 4 The program deployed these radiation-tolerant, 1-micron CMOS Matra MHS gate array circuits in space-borne embedded computer systems designed for the ARTEMIS, ENVISAT, and METEOSAT satellites. The targeted errors included both single-event upsets and permanent chip errors. While a (24,16) code was implemented for the 16-bit buses, the 32-bit buses used a (40,32) code. These codes, in addition to being single error correcting and double error detecting (SEC-DED), can detect all single-byte errors with a 4- or 8-bit-per-byte chip organization. Being a purely combinational design, the 16-bit error detection and correction circuits preclude the occurrence of erroneous states caused by single-byte errors.

Examples of online VLSI testing practice in communication systems include the Philips implementation of (4,2)-redundancy in its SOPHO S-2500 communication switches and H1 broadband switches. 5

In a (4,2)-redundant system, the processing logic is quadrupled, while the memory size is doubled. Each module in the (4,2)-redundant system consists of a processor, an encoder, a decoder, and memory, all operating synchronously and deterministically. When writing 8-bit data into memory, each encoder encodes it into a 4-bit code symbol that results in a 16-bit code word (one symbol contributed by each encoder).

On reading a 16-bit code word from memory, the decoder in a module generates the 8-bit data even in the presence of some errors. This (4,2)-code ensures that any two code symbols are sufficient to derive the original information. The code can tolerate all single symbol errors and correct all double-bit errors. The (4,2)-code coupled with (4,2)-redundancy localizes an error to the module in which it occurred. An error in a faulty module can manifest in a different way in each of the fault-free modules. A solution to this Byzantine Generals problem that maintains interactive consistency during error detection and correction has been implemented in VLSI.
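The property that any two code symbols suffice to derive the original information can be illustrated with a small Python sketch of a (4,2) code over GF(16). The generator used here (symbols d0, d1, d0 + d1, and d0 + α·d1 with the field polynomial x^4 + x + 1) is an assumption chosen for illustration, not the code used in the switches described above, and this sketch makes no claim about double-bit error correction.

```python
def gf16_mul(a, b):
    """Multiply two 4-bit symbols in GF(16) modulo x^4 + x + 1."""
    product = 0
    for _ in range(4):
        if b & 1:
            product ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:   # reduce when the degree reaches 4
            a ^= 0x13  # 0x13 encodes x^4 + x + 1
    return product


ALPHA = 2  # the field element x


def encode(byte):
    """Map 8 data bits to four 4-bit code symbols, one per module."""
    d0, d1 = byte >> 4, byte & 0xF
    return (d0, d1, d0 ^ d1, d0 ^ gf16_mul(ALPHA, d1))


# All 256 code words, indexed by data byte (small enough to search exhaustively).
CODEBOOK = [encode(byte) for byte in range(256)]


def recover(known):
    """Recover the data byte from any two symbols.

    `known` maps symbol position (0..3) to symbol value.
    """
    matches = [byte for byte, word in enumerate(CODEBOOK)
               if all(word[pos] == val for pos, val in known.items())]
    if len(matches) != 1:
        raise ValueError("need two symbols from distinct positions")
    return matches[0]


def correct(received):
    """Return the data byte whose code word is within one symbol of `received`."""
    for byte, word in enumerate(CODEBOOK):
        if sum(r != w for r, w in zip(received, word)) <= 1:
            return byte
    raise ValueError("no code word within one symbol of the received word")
```

For example, `recover({2: s2, 3: s3})` rebuilds the byte from the two redundant symbols alone, and `correct` repairs a word in which one module's symbol is wrong. Exhaustive search over 256 code words keeps the sketch short; a hardware decoder would solve the corresponding linear equations directly.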

The HaL general-purpose processor also incorporated concurrent error detection and correction. 6 Key design considerations in HaL were that the hardware implementation be simple and have negligible impact on the cycle time. Based on these criteria, the HaL memory management unit (MMU) implemented information-redundancy-based concurrent error detection and correction. In the HaL MMU, error-correcting codes protect the address and data buses in the cache and memory systems as well as the data storage in the cache and the memory systems. Further, the available architectural functionality during the address translation phase was combined with simple linear polynomial codes to implement a low-cost online testing mechanism as follows:

Consider the address translation mechanism in HaL. A virtual address indexes into a fully associative translation lookaside buffer implemented by content-addressable memory (TLB-CAM). On a TLB-CAM hit, the matching virtual address indexes into a random access memory (TLB-SRAM) to obtain the corresponding physical address and protection information. On a TLB-CAM miss, the virtual address entry and the associated physical address/protection information are retrieved from the main memory into the TLB-CAM and TLB-SRAM.

On one hand, a transient failure may corrupt one virtual address (the corresponding entry is present) into a different virtual address (the corresponding entry is not present) and result in a false miss. This can be treated as an ordinary miss. On the other hand, a transient failure may result in a false hit. Since a false hit can update a potentially incorrect physical address and provide incorrect protection information, it is first detected (by using check bits on the virtual address entries in the TLB-CAM), then the entry at the location is invalidated, and finally the TLB-miss handler is invoked.

A linear polynomial error-detecting code can be used to derive the check bits from the corresponding virtual address entry. This code can detect up to 2-bit errors, entails simple hardware implementation, and does not incur critical delays in encoding and decoding circuits.
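The false-hit handling and the polynomial check bits can be sketched together in Python. The 32-bit address width, the 7 check bits, the generator polynomial x^7 + x + 1, and the `TlbCam` class are all assumptions chosen for illustration; the text above does not specify the HaL parameters. With this polynomial, every 1-bit and 2-bit error in the 39-bit protected word changes the remainder, because the polynomial is primitive and its period (127) exceeds the word length.

```python
ADDR_BITS = 32   # illustrative virtual-address width (assumption)
CHECK_BITS = 7   # illustrative number of check bits (assumption)
POLY = 0x83      # x^7 + x + 1, chosen for illustration


def check_bits(addr):
    """Remainder of addr(x) * x^7 divided by POLY over GF(2)."""
    reg = addr << CHECK_BITS
    for bit in range(ADDR_BITS + CHECK_BITS - 1, CHECK_BITS - 1, -1):
        if reg & (1 << bit):
            reg ^= POLY << (bit - CHECK_BITS)
    return reg


class TlbCam:
    """Toy model of a TLB-CAM whose entries carry check bits."""

    def __init__(self):
        self.entries = {}  # slot -> (stored virtual address, stored check bits)

    def fill(self, slot, vaddr):
        """Install an entry, as the TLB-miss handler would."""
        self.entries[slot] = (vaddr, check_bits(vaddr))

    def lookup(self, vaddr):
        """Return the matching slot, or None when the miss handler must run."""
        for slot, (stored, stored_check) in list(self.entries.items()):
            if stored == vaddr:
                if check_bits(stored) != stored_check:
                    del self.entries[slot]  # false hit: invalidate the entry
                    return None             # then fall back to the miss path
                return slot
        return None  # ordinary miss (a false miss looks the same)
```

A transient fault that flips a bit of a stored virtual address makes the entry match the wrong address; `lookup` then finds that the stored check bits no longer agree, invalidates the entry, and reports a miss, mirroring the detect, invalidate, and re-fetch sequence described above.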

References

• 1. C. Hennebert and G. Guiho, "SACEM: A Fault Tolerant System for Train Speed Control," Proc. Fault-Tolerant Computing Symp., IEEE Computer Society Press, Los Alamitos, Calif., 1993, pp. 624-628.
• 2. A. Hachiga, K. Akita, and Y. Hasegawa, "The Design Concepts and Operational Results of Fault Tolerant Computer Systems for the Shinkansen Train Control," Proc. Fault-Tolerant Computing Symp., IEEE CS, 1993, pp. 78-87.
• 3. N. Kanekawa et al., "Self-Checking and Fail-Safe LSIs by Intra-Chip Redundancy," Proc. Fault-Tolerant Computing Symp., IEEE CS, 1996, pp. 426-430.
• 4. R. Johansson, "Two Error Detecting and Correcting Circuits for Space Applications," Proc. Fault-Tolerant Computing Symp., IEEE CS, 1996, pp. 436-439.
• 5. C.-J.L. van Driel et al., "The Error Resistant Interactively Consistent Architecture (ERICA)," Proc. Fault-Tolerant Computing Symp., IEEE CS, 1990, pp. 474-480.
• 6. N.R. Saxena et al., "Fault-Tolerant Features in the HaL Memory Management Unit," IEEE Trans. Computers, Vol. 44, No. 2, Feb. 1995, pp. 170-180.

## Future Trends in Online Testing: A New VLSI Design Paradigm?

Michael Nicolaidis

Online testing and hardware fault tolerance, among the oldest fields of computer science, were developed to improve reliability and/or availability of electronic systems. Reliability improvements become mandatory in three main situations, when

• the reliability of components produced by a fabrication process is very low
• an application requires very high levels of reliability
• a hostile environment reduces the reliability of components, which, under normal conditions, provide acceptable reliability levels

In the very early age of computers, unreliable electronic components made hardware fault tolerance mandatory. In the VLSI era, component reliability improved dramatically, restricting the use of online testing techniques to specific application domains. These included domains requiring very high levels of dependability (fault-tolerant computers, safety-critical applications, and so on), possibly operating in hostile environments (for example, space). Such applications often correspond to low-volume production.

The small number of these applications did not make the development of tools specific to the design of online testable ICs attractive to CAD vendors. The lack of such tools dramatically increases the effort needed to design online testable ICs. Further, low-volume production often does not justify the resulting high development cost, which weighs heavily on the per-unit product cost. As a result, designers more often adopt techniques such as duplication or triplication using off-the-shelf components, which incur a much lower development cost, although the production cost is relatively high.

We can expect this situation to change. Due to increasing reliability or availability requirements, various industrial sectors increasingly need online testing features. Such sectors include railway control, satellites, avionics, telecommunications, critical automotive function control, medical electronics, and industrial control. Some of these applications involve mass production and should support the standardization of such techniques and the development of commercial CAD tools supporting them. Since silicon is relatively inexpensive, such tools should make the design of online testable circuits popular.

In addition to these trends, the high complexity of today's systems requires more efficient solutions. In fact, yesterday's complex multichip systems are today's single-chip components. Older fault-tolerant and fail-safe system designs must be integrated at the chip level, which calls for online testing techniques for VLSI.

While applications with increased reliability needs should support an increased use of online testing design, another emerging factor will influence these trends more drastically. In the past, progress in VLSI processes dramatically improved the reliability of electronic components, restricting the use of online testing to specific application domains. We are now seeing these trends inverted. Drastic device shrinking, low power supply levels, and increasing operating speeds that accompany the technological evolution to deeper submicron levels significantly reduce the noise margins and increase the soft-error rates. As a consequence, technological progress will be blocked quickly if we take no particular action to cope with increasingly high soft-error rates. 1-3 In this context, design for online testability seems to be the most suitable solution for designing soft-error robust circuits and aggressively pushing the limits of technological scaling.

These emerging needs require the development of a new design paradigm that supports efficient online VLSI testing solutions, including a rich variety of online testing design methodologies and design automation tools. The understanding of this situation led a few years ago to the creation of the Online Testing Technical Activities Committee of the Test Technology Technical Committee (http://www.computer.org/tab/tttc/tac/online.html). Under this committee, several actions were undertaken to anticipate future needs by stimulating activities in the online testing domain. Such actions include the creation of the International Online Testing Workshop (http://tima-cmp.imag.fr/tima/ris/ioltw.html) and the establishment of online testing as one of the basic topics in various conferences, including the International Test Conference, the VLSI Test Symposium, and DATE (formerly the European Design and Test Conference). Other actions are the organization of several panel sessions and tutorials at conferences like ITC and VTS, and the organization of special issues, including this magazine issue and the February-April 1998 JETTA double issue. We should intensify these activities in the future as we start to understand that the current VLSI design paradigm cannot maintain acceptable levels of reliability for future IC generations.

References

• 1. M. Nicolaidis, "Scaling Deeper to Submicron: Online Testing to the Rescue," Proc. Fault-Tolerant Computing Symp., IEEE Computer Society Press, Los Alamitos, Calif., 1998, pp. 299-301.
• 2. M. Nicolaidis, "Design for Soft-Error Robustness to Rescue Deep Submicron Scaling," Proc. Int'l Test Conf. 98, IEEE-CS Press, 1998, p. 1140.
• 3. M. Nicolaidis, "Online Testing for VLSI: State of the Art and Trends," Integration: The VLSI J., special issue on "VLSI Testing Toward the 21st Century," to be published autumn 1998.

## Online VLSI Testing Resources

When designing highly reliable systems, VLSI circuit designers must be aware of available online testing techniques, be familiar with integration of off- and online testing approaches, and understand the attendant cost-reliability trade-off. Existing sources on online VLSI testing are the

• IEEE Transactions on Computers
• IEEE Design & Test of Computers
• IEEE Transactions on CAD of ICs and Systems
• IEEE Transactions on VLSI Systems
• IEEE Transactions on Reliability
• IEEE Transactions on Nuclear Science
• Journal of Electronic Testing: Theory and Applications (Kluwer)

The topical conferences in this area are the

• IEEE International Workshop on Online Testing
• IEEE International Symposium on Fault-Tolerant Computing
• IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems

Finally, conferences that have organized sessions and/or published papers related to online testing include

• IEEE International Test Conference
• IEEE VLSI Test Symposium
• DATE (formerly the European Design and Test Conference)
• European Test Workshop
• IEEE International Conference on CAD
• IEEE/ACM Design Automation Conference

## Acknowledgments

We gratefully acknowledge the help of Editor-in-Chief Yervant Zorian, the authors, the reviewers, and the IEEE Design & Test staff, especially Jason True Seaborn, in putting together this special issue. NSF grant MIP9702676 partially supported Ramesh Karri's work.

## Reference

• 1. M. Nicolaidis, and Y. Zorian, "Online Testing for VLSI—A Compendium of Approaches," J. Electronic Testing, Theory and Applications (JETTA), Vol. 2, Nos. 1/2, Feb.-Apr. 1998, pp. 7-20.