17th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP 2009)
Weimar, Germany
Feb. 18, 2009 to Feb. 20, 2009
ISSN: 1066-6192
ISBN: 978-0-7695-3544-9
pp: 252-257
ABSTRACT
Proactive fault tolerance (FT) in high-performance computing prevents compute node failures from impacting running parallel applications by preemptively migrating application parts away from nodes that are about to fail. This paper provides a foundation for proactive FT by defining its architecture and classifying implementation options. It also relates prior work to the presented architecture and classification, and discusses the remaining challenges for the supporting technologies that proactive FT requires.
INDEX TERMS
fault tolerance, high-performance computing, preemptive migration
CITATION
Stephen L. Scott, Thomas Naughton, Geoffroy R. Vallee, Christian Engelmann, "Proactive Fault Tolerance Using Preemptive Migration", 17th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP 2009), vol. 00, pp. 252-257, 2009, doi:10.1109/PDP.2009.31