Commercial Fault Tolerance: A Tale of Two Systems
January-March 2004 (vol. 1 no. 1)
pp. 87-96
Wendy Bartlett, IEEE Computer Society; Lisa Spainhower
This paper compares and contrasts the design philosophies and implementations of two computer system families: the IBM S/360 and its evolution to the current zSeries line, and the Tandem (now HP) NonStop® Server. Both systems have a long history; the initial IBM S/360 machines were shipped in 1964, and the Tandem NonStop System was first shipped in 1976. They were aimed at similar markets, what would today be called enterprise-class applications. The requirement for the original S/360 line was for very high availability; the requirement for the NonStop platform was for single fault tolerance against unplanned outages. Since their initial shipments, availability expectations for both platforms have continued to rise and the system designers and developers have been challenged to keep up. There were and still are many similarities in the design philosophies of the two lines, including the use of redundant components and extensive error checking. The primary difference is that the S/360-zSeries focus has been on localized retry and restore to keep processors functioning as long as possible, while the NonStop developers have based systems on a loosely coupled multiprocessor design that supports a "fail-fast" philosophy implemented through a combination of hardware and software, with workload being actively taken over by another resource when one fails.
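
The contrast drawn in the abstract between localized retry and fail-fast takeover can be illustrated with a toy sketch. This is not how either system is implemented (both rely on hardware and operating-system mechanisms); every name below (`TransientFault`, `run_with_local_retry`, `run_fail_fast`, `make_flaky`, `cpu_op`) is hypothetical and exists only for illustration.

```python
class TransientFault(Exception):
    """Hypothetical error standing in for a detected fault."""


def run_with_local_retry(op, max_retries=3):
    """Retry-and-restore style: re-run the operation on the same resource."""
    for attempt in range(1, max_retries + 1):
        try:
            return op(), attempt
        except TransientFault:
            continue  # state restore is elided; retry in place
    raise TransientFault("retries exhausted")


def run_fail_fast(op, processors):
    """Fail-fast style: a faulting processor stops at the first error and
    the workload is taken over by the next processor in the list."""
    for cpu in processors:
        try:
            return op(cpu), cpu
        except TransientFault:
            continue  # no retry on this processor; move the work elsewhere
    raise TransientFault("no processor left")


def make_flaky(failures):
    """Build an operation that faults `failures` times, then succeeds."""
    remaining = {"count": failures}

    def op():
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise TransientFault("simulated fault")
        return "done"

    return op


def cpu_op(cpu):
    """Operation that always faults on the (assumed) processor name 'cpu0'."""
    if cpu == "cpu0":
        raise TransientFault("simulated fault on cpu0")
    return "done"


if __name__ == "__main__":
    print(run_with_local_retry(make_flaky(2)))
    print(run_fail_fast(cpu_op, ["cpu0", "cpu1"]))
```

In the first function the same resource keeps working through the fault; in the second, the faulting resource is abandoned immediately and a peer completes the work, which mirrors the takeover behavior the abstract attributes to the loosely coupled design.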

Index Terms:
Computer systems implementation, fault tolerance, high availability.
Citation:
Wendy Bartlett, Lisa Spainhower, "Commercial Fault Tolerance: A Tale of Two Systems," IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 1, pp. 87-96, Jan.-March 2004, doi:10.1109/TDSC.2004.4