18th International Parallel and Distributed Processing Symposium (IPDPS'04) - Workshop 11
System-Level Fault-Tolerance in Large-Scale Parallel Machines with Buffered Coscheduling
Santa Fe, New Mexico
April 26-30, 2004
ISBN: 0-7695-2132-0
Fabrizio Petrini, Los Alamos National Laboratory
Kei Davis, Los Alamos National Laboratory
José Carlos Sancho, Los Alamos National Laboratory
As the number of processors for multi-teraflop systems grows to tens of thousands, with proposed petaflops systems likely to contain hundreds of thousands of processors, the assumption of fully reliable hardware has been abandoned. Although the mean time between failures for the individual components can be very high, the large total component count will inevitably lead to frequent failures. It is therefore of paramount importance to develop new software solutions to deal with the unavoidable reality of hardware faults. In this paper we will first describe the nature of the failures of current large-scale machines, and extrapolate these results to future machines. Based on this preliminary analysis we will present a new technology that we are currently developing, buffered coscheduling, which seeks to implement fault tolerance at the operating system level. Major design goals include dynamic reallocation of resources to allow continuing execution in the presence of hardware failures, very high scalability, high efficiency (low overhead), and transparency — requiring no changes to user applications. Preliminary results show that this is attainable with current hardware.
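The scaling argument in the abstract can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the paper: it assumes that components fail independently with constant failure rates and that the failure of any one component counts as a system failure, in which case the system mean time between failures (MTBF) is roughly the component MTBF divided by the component count. The function name and the component MTBF of one million hours are hypothetical values chosen only for illustration.

```python
def system_mtbf_hours(component_mtbf_hours: float, component_count: int) -> float:
    """Approximate MTBF of a system that fails when any single component fails.

    Assumption (not from the paper): components fail independently with
    constant failure rates, so the individual rates add up and the system
    MTBF is the component MTBF divided by the number of components.
    """
    if component_count <= 0:
        raise ValueError("component_count must be positive")
    return component_mtbf_hours / component_count


if __name__ == "__main__":
    # Hypothetical per-component MTBF: one million hours (over a century).
    component_mtbf = 1_000_000.0
    for count in (1_000, 10_000, 100_000):
        hours = system_mtbf_hours(component_mtbf, count)
        print(f"{count:>7} components: system MTBF of about {hours:g} hours")
```

Under these assumptions, even very reliable components yield a system that fails every few hours once the count reaches the hundreds of thousands, which is the regime the abstract describes for proposed petaflops systems.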
Index Terms:
Failure characterization, fault-tolerance, checkpointing, large-scale parallel computers, operating systems, communication protocols
Citation:
Fabrizio Petrini, Kei Davis, José Carlos Sancho, "System-Level Fault-Tolerance in Large-Scale Parallel Machines with Buffered Coscheduling," 18th International Parallel and Distributed Processing Symposium (IPDPS'04) - Workshop 11, vol. 12, p. 209b, 2004