19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'07)
Fault-tolerance in filter-labeled-stream applications
Gramado, RS, Brazil
October 24-27, 2007
ISBN: 0-7695-3014-1
Fault tolerance is a desirable feature in distributed high-performance systems, since applications tend to run for long periods of time and faults become more likely as the number of nodes in the system increases. However, most distributed environments lack fault-tolerance features, since such features tend to be hard to implement and use, and often hurt performance dramatically. In this paper we discuss how we successfully added fault tolerance to the Anthill distributed programming environment by using an application-level checkpoint/rollback solution. The programming model offers an abstraction in which the programmer can easily identify points during execution where the communication pattern is well defined. These points form a consistent cut at which checkpoints may be saved without requiring extra communication, avoiding any domino effect during recovery from faults. We present the new abstractions for fault tolerance, describe how the solution was implemented, and present performance results that show the efficiency of the solution with both regular and irregular applications.
Citation:
Bruno Coutinho, Dorgival Guedes, Wagner Meira Jr., Renato A. Ferreira, "Fault-tolerance in filter-labeled-stream applications," 19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'07), pp. 229-236, 2007