ED4I: Error Detection by Diverse Data and Duplicated Instructions
February 2002 (vol. 51 no. 2)
pp. 180-199

Errors in computing systems can cause abnormal behavior and degrade data integrity and system availability. They should be avoided especially in embedded systems for critical applications. However, as VLSI technologies have moved toward smaller feature sizes, lower supply voltages, and higher frequencies, there is growing concern about temporary as well as permanent errors in embedded systems; detecting those errors is therefore essential. Software Implemented Hardware Fault Tolerance (SIHFT) is a low-cost alternative to hardware fault tolerance techniques for embedded processors: it does not require any hardware modification of Commercial Off-The-Shelf (COTS) processors. ED4I is a SIHFT technique that detects both permanent and temporary errors by executing two "different" programs (with the same functionality) and comparing their outputs. ED4I maps each number, x, in the original program into a new number, x', and then transforms the program so that it operates on the new numbers and its results can be mapped back for comparison with the results of the original program. The mapping in the ED4I transformation is x' = k·x for integer numbers, where k determines the fault detection probability and data integrity of the system. For floating point numbers, we find a value k_f for the fraction and a value k_e for the exponent separately and use k = k_f × 2^(k_e). We demonstrate how to choose an optimal value of k for the transformation. This paper shows that, for integer programs, the transformation with k = -2 was the most desirable choice in six out of seven benchmark programs we simulated: it maximizes fault detection probability under the condition that data integrity is highest.
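
The integer mapping x' = k·x can be illustrated with a small sketch. This is not the paper's transformation tool: the function names (run_sum, good_add, stuck_add, ed4i_check) and the fault model (one result bit of the adder stuck at 1) are assumptions made up for illustration, and the multiplication by k is assumed to happen outside the faulty unit. The sketch runs a summation on the original data and on the data scaled by k = -2, then checks whether the second result equals k times the first.

```python
K = -2  # diversity factor; the abstract reports k = -2 as the preferred choice


def good_add(a, b):
    """Fault-free adder."""
    return a + b


def stuck_add(a, b):
    """Hypothetical faulty adder: bit 2 of every result is stuck at 1."""
    return (a + b) | 0b100


def run_sum(values, add, k=1):
    """Sum the values after mapping each x to k*x, using the given adder."""
    total = 0
    for v in values:
        total = add(total, k * v)
    return total


def ed4i_check(values, add, k=K):
    """Run the original and the transformed program on the same adder.

    Returns (original_result, results_agree). A mismatch between
    k * original_result and the transformed result signals an error.
    """
    r = run_sum(values, add)        # original program
    r_t = run_sum(values, add, k)   # transformed program on x' = k*x
    return r, (k * r == r_t)


if __name__ == "__main__":
    data = [1, 2, 3]
    print(ed4i_check(data, good_add))
    print(ed4i_check(data, stuck_add))
```

Running the same program twice on identical data would not expose this permanent fault, because both runs would be corrupted in the same way; scaling the data makes the fault affect the two runs differently, so the comparison fails. A real transformation must also handle details this sketch ignores, such as overflow of the scaled values and the reversal of inequality comparisons when k is negative.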

Index Terms:
Software implemented hardware fault tolerance (SIHFT), low cost fault tolerance, concurrent error detection, data diversity, duplicated instructions.
Citation:
N. Oh, S. Mitra, E.J. McCluskey, "ED4I: Error Detection by Diverse Data and Duplicated Instructions," IEEE Transactions on Computers, vol. 51, no. 2, pp. 180-199, Feb. 2002, doi:10.1109/12.980007