
Andreas Savva, Takashi Nanya, "A Gracefully Degrading Massively Parallel System Using the BSP Model, and Its Evaluation," IEEE Transactions on Computers, vol. 48, no. 1, pp. 38-52, January 1999, doi: 10.1109/12.743410.
Keywords: BSP model, graceful degradation, fault tolerance, memory duplication, MPP, PRAM, RSM.
Abstract: The Bulk-Synchronous Parallel (BSP) model was proposed as a unifying model for parallel computation. With Randomized Shared Memory (RSM), the model offers an asymptotically optimal emulation of the Parallel Random Access Machine (PRAM). Using the BSP model with RSM, we construct a gracefully degrading massively parallel system with a fault tolerance (FT) scheme that relies on memory duplication to ensure global memory integrity and to speed up reconfiguration. After a fault occurs, global reconfiguration restores the logical properties of the system. Work done during reconfiguration is shared equally among the live processors, with minimal coordination. We analyze, at the level of the BSP model, how the performance of a system may change as processors fail and the performance of the interconnection network degrades. We relate the change in overall system performance to the change in computation and communication load on the live processors. Further, we show how to estimate the overhead imposed by the FT scheme. We evaluate the reconfiguration time, the overhead, and the graceful degradation of the system experimentally with an implementation on a Massively Parallel Processor (MPP). We show that the predictions about the degradation of the system and the overhead cost of the scheme are accurate.
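To make the idea of performance changing "as processors fail" concrete, the following is a minimal sketch based on the standard BSP superstep cost, w + g*h + l. It is not the authors' analysis. It assumes that computation and communication are spread evenly over the live processors, and the class, function, and parameter names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BspMachine:
    """Illustrative BSP parameters; the names are assumptions, not from the paper."""
    p: int      # number of live processors
    g: float    # communication cost per word of an h-relation
    l: float    # barrier synchronization cost per superstep


def superstep_cost(total_work: float, total_words: float, m: BspMachine) -> float:
    """Standard BSP cost w + g*h + l, assuming an even split over live processors."""
    w = total_work / m.p       # local computation per processor
    h = total_words / m.p      # words sent or received per processor
    return w + m.g * h + m.l


def slowdown(total_work: float, total_words: float,
             before: BspMachine, after: BspMachine) -> float:
    """Ratio of superstep cost after degradation to the cost before it."""
    return (superstep_cost(total_work, total_words, after)
            / superstep_cost(total_work, total_words, before))


if __name__ == "__main__":
    healthy = BspMachine(p=64, g=4.0, l=100.0)
    degraded = BspMachine(p=48, g=4.0, l=100.0)   # 16 processors lost, network unchanged
    print(superstep_cost(64000, 6400, healthy))   # 1500.0
    print(slowdown(64000, 6400, healthy, degraded))
```

In this toy model, losing processors raises both the per-processor work and the per-processor traffic, while the barrier cost stays fixed. A degraded network could be modeled by also increasing `g` or `l` in the `degraded` machine.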