Teraflops Supercomputer: Architecture and Validation of the Fault Tolerance Mechanisms
September 2000 (vol. 49 no. 9)
pp. 886-894

Abstract—Intel Corporation developed the Teraflops supercomputer for the US Department of Energy (DOE) as part of the Accelerated Strategic Computing Initiative (ASCI). This is the most powerful computing machine available today, performing over two trillion floating point operations per second with the aid of more than 9,000 Intel processors. The Teraflops machine employs complex hardware and software fault/error handling mechanisms for complying with DOE's reliability requirements. This paper gives a brief description of the system architecture and presents the validation of the fault tolerance mechanisms. Physical fault injection at the IC pin level was used for validation purposes. An original approach was developed for assessing signal sensitivity to transient faults and the effectiveness of the fault/error handling mechanisms. Dependency between fault/error detection coverage and fault duration was also determined. Fault injection experiments unveiled several malfunctions at the hardware, firmware, and software levels. The supercomputer performed according to the DOE requirements after corrective actions were implemented. The fault injection approach presented in this paper can be used for validation of any fault-tolerant or highly available computing system.

Index Terms:
Supercomputing, fault-tolerant computing, validation, fault injection, fault/error detection coverage.
Citation:
Cristian Constantinescu, "Teraflops Supercomputer: Architecture and Validation of the Fault Tolerance Mechanisms," IEEE Transactions on Computers, vol. 49, no. 9, pp. 886-894, Sept. 2000, doi:10.1109/12.869320