A Gracefully Degrading Massively Parallel System Using the BSP Model, and Its Evaluation
January 1999 (vol. 48 no. 1)
pp. 38-52

Abstract—The Bulk-Synchronous Parallel (BSP) Model was proposed as a unifying model for parallel computation. By using Randomized Shared Memory (RSM), the model offers an asymptotically optimal emulation of the Parallel Random Access Machine (PRAM). By using the BSP model with RSM, we construct a gracefully degrading massively parallel system using a fault tolerance (FT) scheme that relies on memory duplication to ensure global memory integrity and to speed up the reconfiguration. After a fault occurs, global reconfiguration restores the logical properties of the system. Work done during reconfiguration is shared equally among the live processors, with minimal coordination. We analyze, at the level of the BSP model, how the performance of a system may change as processors fail and the performance of the interconnection network degrades. We relate the change in overall system performance to the change in computation and communication load on the live processors. Further, we show how to estimate the overhead imposed by the FT scheme. We evaluate the reconfiguration time, the overhead, and graceful degradation of the system experimentally by an implementation on a Massively Parallel Processor (MPP). We show that the predictions about the degradation of the system and the overhead cost of the scheme are accurate.
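The relation between processor failures and per-processor load can be illustrated with the standard BSP superstep cost, w + g*h + l, where w is the local computation, h the number of words sent or received by a processor, g the communication cost per word, and l the barrier synchronization cost. The sketch below is not the paper's analysis: it assumes that work and communication are spread evenly over the live processors and that g and l stay constant as processors fail, and all names (`BspMachine`, `superstep_cost`, `slowdown`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BspMachine:
    """Hypothetical BSP machine parameters (standard g and l of the model)."""
    g: float  # communication cost per word, in units of local operations
    l: float  # barrier synchronization cost per superstep


def superstep_cost(total_work: float, total_words: float,
                   live: int, machine: BspMachine) -> float:
    """Cost of one superstep, w + g*h + l, on `live` processors.

    Assumption: work and communication are shared equally among the
    live processors, so w = total_work / live and h = total_words / live.
    """
    if live <= 0:
        raise ValueError("at least one live processor is required")
    w = total_work / live
    h = total_words / live
    return w + machine.g * h + machine.l


def slowdown(total_work: float, total_words: float,
             p_full: int, p_live: int, machine: BspMachine) -> float:
    """Ratio of degraded to fault-free superstep cost (assumes fixed g and l)."""
    degraded = superstep_cost(total_work, total_words, p_live, machine)
    healthy = superstep_cost(total_work, total_words, p_full, machine)
    return degraded / healthy


if __name__ == "__main__":
    machine = BspMachine(g=2.0, l=100.0)
    # Illustrative numbers only: 64 processors, half of them failed.
    print(slowdown(64000.0, 6400.0, p_full=64, p_live=32, machine=machine))
```

Under these assumptions the slowdown stays below the ratio of full to live processors, because the synchronization term l does not grow with the per-processor load. A degraded interconnection network could be modelled by passing a larger g for the degraded configuration, which this sketch does not do.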

Index Terms:
BSP model, graceful degradation, fault tolerance, memory duplication, MPP, PRAM, RSM.
Andreas Savva, Takashi Nanya, "A Gracefully Degrading Massively Parallel System Using the BSP Model, and Its Evaluation," IEEE Transactions on Computers, vol. 48, no. 1, pp. 38-52, Jan. 1999, doi:10.1109/12.743410