Software-Based Failure Detection and Recovery in Programmable Network Interfaces
November 2007 (vol. 18 no. 11)
pp. 1539-1550
Emerging network technologies have complex network interfaces that have renewed concerns about network reliability. In this paper, we present an effective low-overhead fault tolerance technique to recover from network interface failures. Failure detection is based on a software watchdog timer that detects network processor hangs and a self-testing scheme that detects interface failures other than processor hangs. The proposed self-testing scheme achieves failure detection by periodically directing the control flow to go through only active software modules in order to detect errors that affect instructions in the local memory of the network interface. Our failure recovery is achieved by restoring the state of the network interface using a small backup copy containing just the right amount of information required for complete recovery. The paper shows how this technique can be made to minimize the performance impact to the host system and be completely transparent to the user.
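
The watchdog-plus-backup idea described in the abstract can be illustrated with a minimal sketch. This is a toy simulation in plain Python, not the authors' implementation: the names `SimulatedNic`, `SoftwareWatchdog`, `heartbeat`, `send_seq`, and `timeout` are all hypothetical, and the self-testing scheme is not modeled. It only shows the general pattern of detecting a hang through a stalled progress counter and then restoring state from a small backup copy.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class SimulatedNic:
    """Toy stand-in for a programmable NIC; every field here is hypothetical."""
    state: dict = field(default_factory=lambda: {"send_seq": 0})
    heartbeat: int = 0   # progress counter bumped while the processor is alive
    hung: bool = False

    def tick(self):
        """One iteration of the NIC control loop; a hung processor makes no progress."""
        if not self.hung:
            self.heartbeat += 1
            self.state["send_seq"] += 1


class SoftwareWatchdog:
    """Declares a hang when the heartbeat stalls for `timeout` consecutive checks."""

    def __init__(self, nic, timeout=3):
        self.nic = nic
        self.timeout = timeout
        self.last_seen = nic.heartbeat
        self.stale_checks = 0
        self.backup = copy.deepcopy(nic.state)

    def checkpoint(self):
        """Save the small amount of state assumed sufficient for recovery."""
        self.backup = copy.deepcopy(self.nic.state)

    def check(self):
        """Return True if a hang was detected and the NIC state was restored."""
        if self.nic.heartbeat != self.last_seen:
            # Progress was made since the last check, so reset the stall counter.
            self.last_seen = self.nic.heartbeat
            self.stale_checks = 0
            return False
        self.stale_checks += 1
        if self.stale_checks < self.timeout:
            return False
        # Hang detected: restore the backup copy and let the NIC run again.
        self.nic.state = copy.deepcopy(self.backup)
        self.nic.hung = False
        self.stale_checks = 0
        return True
```

In this sketch the watchdog only observes a counter, so it cannot catch failures that leave the processor running; that gap is what the abstract's separate self-testing scheme is meant to cover.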

Index Terms:
Programmable Network Interface Card (NIC), Single Event Upset (SEU), radiation induced faults, failure detection, self-testing
Yizheng Zhou, Vijay Lakamraju, Israel Koren, C. M. Krishna, "Software-Based Failure Detection and Recovery in Programmable Network Interfaces," IEEE Transactions on Parallel and Distributed Systems, vol. 18, no. 11, pp. 1539-1550, Nov. 2007, doi:10.1109/TPDS.2007.1093