Issue No.01 - Jan. (1986 vol.12)
Kewal K. Saluja , Department of Electrical and Computer Engineering, University of Newcastle, N.S.W. 2308, Australia
A common assumption in existing rollback techniques is that transients, the cause of most failures, subside very quickly, implying that a single retry of the program from the previous rollback point is sufficient. We discuss a general rollback strategy with n (n ≥ 2) retries which takes into consideration multiple transient failures as well as transients of long duration. Ways of deriving practical values of n for a given program are also discussed. Furthermore, we propose the use of a watchdog processor as an error detection tool to initiate recovery action through rollback, since the watchdog processor offers low error latency. We also discuss the merging of the watchdog processor with the rollback recovery technique to enhance overall system reliability.
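The retry strategy described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the names `run_with_rollback`, `make_transient`, and `detect_error` are hypothetical, the checkpoint is a plain in-memory copy of the program state, and the watchdog processor is stood in for by an ordinary Python callable that flags a corrupted result. The sketch only shows the general idea of retrying a program segment up to n times from the previous rollback point, so that a transient lasting longer than one execution can still be survived.

```python
import copy


class RollbackFailure(Exception):
    """Raised when the initial run and all n retries were flagged as faulty."""


def run_with_rollback(segment, state, detect_error, n=2):
    """Run `segment` on a copy of `state`, retrying from the checkpoint.

    `detect_error` is a stand-in for the watchdog check (an assumption of
    this sketch). Returns (resulting_state, retries_used).
    """
    checkpoint = copy.deepcopy(state)  # the rollback point
    # One initial attempt plus up to n retries.
    for attempt in range(n + 1):
        working = copy.deepcopy(checkpoint)  # roll back to the checkpoint
        segment(working)
        if not detect_error(working):
            return working, attempt
    raise RollbackFailure(f"error persisted after {n} retries")


def make_transient(duration):
    """Hypothetical fault model: the first `duration` executions are corrupted."""
    calls = {"count": 0}

    def segment(state):
        calls["count"] += 1
        state["total"] = sum(state["data"])
        if calls["count"] <= duration:
            state["total"] = -1  # simulate a transient corrupting the result

    return segment


def detect_error(state):
    """Stand-in for the watchdog: flag a result that fails a consistency check."""
    return state["total"] != sum(state["data"])
```

With a simulated transient that corrupts the first two executions, `run_with_rollback(make_transient(2), {"data": [1, 2, 3], "total": 0}, detect_error, n=3)` succeeds on the second retry, whereas the same fault with `n=1` exhausts its retries and raises `RollbackFailure`. This is the situation the abstract has in mind when it argues that a single retry may not be sufficient.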
Transient analysis, Load modeling, Computational modeling, Hardware, Australia, Computers, Image edge detection, transient errors, Error detection, error latency, recovery time, rollback recovery, program retry
Kewal K. Saluja, "A watchdog processor based general rollback technique with multiple retries", IEEE Transactions on Software Engineering, vol.12, no. 1, pp. 87-95, Jan. 1986, doi:10.1109/TSE.1986.6312923