Loop Transformations for Fault Detection in Regular Loops on Massively Parallel Systems
December 1996 (vol. 7 no. 12)
pp. 1238-1249

Abstract: Distributed-memory systems can incorporate thousands of processors at a reasonable cost. However, as the number of processors in a system grows, fault detection and fault tolerance become critical issues. Errors can be detected by replicating a computation on more than one processor and comparing the results these processors produce. Because of data dependencies, typically not all processors in a multiprocessor system are busy at all times during the execution of a program. Processor schedules therefore contain idle time slots, and the goal of this work is to exploit these slots to schedule duplicated computation for fault detection. We propose a compiler-assisted approach to fault detection in regular loops on distributed-memory systems. The approach achieves fault detection by duplicating the execution of statement instances: after a careful analysis of the data dependencies of a regular loop, selected instances of loop statements are duplicated in a way that ensures the desired fault coverage. We first present duplication strategies for fault detection and show that these strategies use idle processor time to execute replicated statements whenever possible. Next, we present loop transformations that implement these fault-detection strategies, and we develop a general framework for selecting appropriate loop transformations. Experiments on the CRAY-T3D show that the overhead of adding the fault-detection capability is usually less than 25%, and less than 10% when communication overhead is reduced by grouping messages.
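
The idea of filling idle schedule slots with duplicated statement instances can be illustrated with a small sketch. This is not the paper's algorithm: the greedy placement rule, the schedule representation, and the names `add_duplicates` and `detect_faults` are all assumptions made for illustration. The sketch also assumes that a duplicate may run in any slot no earlier than its primary, and it ignores communication cost entirely.

```python
from typing import Dict, List, Optional

# schedule[p][t] holds the name of the statement instance that processor p
# executes in time slot t, or None if the processor is idle in that slot.
Schedule = List[List[Optional[str]]]


def add_duplicates(schedule: Schedule) -> Schedule:
    """Greedily place one duplicate of every instance on a different processor.

    The duplicate of the instance at (p, t) goes into the earliest idle slot
    (q, s) with q != p and s >= t. The schedule is lengthened by one time
    step only when no such idle slot exists. (Hypothetical placement rule.)
    """
    if len(schedule) < 2:
        raise ValueError("duplication needs at least two processors")
    result = [list(row) for row in schedule]
    length = len(result[0])  # all rows are assumed to have equal length
    primaries = [(t, p, name)
                 for p, row in enumerate(schedule)
                 for t, name in enumerate(row) if name is not None]
    for t, p, name in sorted(primaries):
        placed = False
        s = t
        while not placed:
            if s == length:  # ran out of slots: add one idle step everywhere
                for row in result:
                    row.append(None)
                length += 1
            for q in range(len(result)):
                if q != p and result[q][s] is None:
                    result[q][s] = name + "'"  # mark the duplicate with a prime
                    placed = True
                    break
            s += 1
    return result


def detect_faults(primary: Dict[str, float],
                  duplicate: Dict[str, float]) -> List[str]:
    """Return the instances whose primary and duplicate results disagree."""
    return sorted(name for name in primary if primary[name] != duplicate[name])
```

With the schedule `[["a", "b", None], [None, None, "c"]]`, every duplicate fits into an existing idle slot, so the schedule keeps its length of three steps. With the fully busy schedule `[["a"], ["b"]]`, the sketch has to add a time step, which corresponds to the case where duplication introduces overhead.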

Index Terms:
Compiler-assisted approach, data dependence analysis, distributed-memory systems, duplicating execution, execution pattern, fault detection, loop transformation.
Chun Gong, Rami Melhem, Rajiv Gupta, "Loop Transformations for Fault Detection in Regular Loops on Massively Parallel Systems," IEEE Transactions on Parallel and Distributed Systems, vol. 7, no. 12, pp. 1238-1249, Dec. 1996, doi:10.1109/71.553273