2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID)
Using Probabilistic Characterization to Reduce Runtime Faults in HPC Systems
May 19-May 22
ISBN: 978-0-7695-3156-4
The current trend in high performance computing is to aggregate ever larger numbers of processing and interconnection elements in order to achieve desired levels of computational power. This, however, also comes with a decrease in the Mean Time To Interrupt because the elements comprising these systems are not becoming significantly more robust. There is substantial evidence that the relationship between Mean Time To Interrupt and the number of processor elements involved is quite similar across a large number of platforms. In this paper we present a system that uses hardware-level monitoring coupled with statistical analysis and modeling to select processing system elements based on where they lie in the statistical distribution of similar elements. These characterizations can be used by the scheduler/resource manager to deliver a close-to-optimal set of processing elements given the available pool and the reliability requirements of the application.
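The selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' system: the per-node metric, the use of a simple z-score against the population of similar nodes, the `max_abs_z` threshold, and the function names `rank_nodes_by_deviation` and `select_nodes` are all assumptions made for illustration only.

```python
from statistics import mean, stdev


def rank_nodes_by_deviation(readings):
    """Rank nodes by how far a monitored metric lies from the population mean.

    `readings` maps a node name to one hardware-level metric value (for
    example a temperature); both the metric and the z-score model are
    illustrative assumptions. At least two readings are required.
    """
    values = list(readings.values())
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        # All nodes are identical on this metric; no node stands out.
        return [(node, 0.0) for node in sorted(readings)]
    scored = [(node, (value - mu) / sigma) for node, value in readings.items()]
    # Most typical nodes first; ties broken by node name for determinism.
    return sorted(scored, key=lambda pair: (abs(pair[1]), pair[0]))


def select_nodes(readings, count, max_abs_z=2.0):
    """Return up to `count` nodes whose |z-score| does not exceed `max_abs_z`.

    A hypothetical scheduler could derive `max_abs_z` from the
    application's reliability requirement.
    """
    ranked = rank_nodes_by_deviation(readings)
    eligible = [node for node, z in ranked if abs(z) <= max_abs_z]
    return eligible[:count]
```

For example, given readings for five nodes where one runs much hotter than the rest, `select_nodes(readings, 3, max_abs_z=1.5)` would return the three most typical nodes and exclude the outlier. A single outlier also inflates the mean and standard deviation in a small pool, so a robust estimator (median and median absolute deviation) may be a better choice in practice.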
Index Terms:
resilience, RAS, fault tolerance, probabilistic characterization, statistical analysis, abnormality detection, cluster monitoring
Citation:
Jim Brandt, Bert Debusschere, Ann Gentile, Jackson Mayo, Philippe Pébay, David Thompson, Matthew Wong, "Using Probabilistic Characterization to Reduce Runtime Faults in HPC Systems," ccgrid, pp. 759-764, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID), 2008