Concurrent Error Detection Using Watchdog Processors-A Survey
February 1988 (vol. 37 no. 2)
pp. 160-174
Concurrent system-level error detection techniques using a watchdog processor are surveyed. A watchdog processor is a small and simple coprocessor that detects errors by monitoring the behavior of a system. Like replication, it does not depend on any fault model for error detection. However, it requires less hardware than replication. It is shown that a large number of errors can be detected by

Index Terms:
concurrent error detection; system behaviour monitoring; assertion execution; fault tolerant computing; watchdog processors; system-level error detection; coprocessor; control-flow checking; memory-access checking; capability-based addressing; reasonable checks; automatic testing; computer architecture; computer testing; computerised monitoring; error detection; reviews; satellite computers; special purpose computers.
Citation:
A. Mahmood, E.J. McCluskey, "Concurrent Error Detection Using Watchdog Processors-A Survey," IEEE Transactions on Computers, vol. 37, no. 2, pp. 160-174, Feb. 1988, doi:10.1109/12.2145