On HPC systems whose component failure rates are high enough to reduce productivity, it becomes important to consider the effect of those failures on individual applications. Typically, this is done by assuming that the mean time between failures (MTBF) for hardware and software components on the system is equivalent to the mean time to fatal error (MTTFE) for an application running on that system. In addition, one commonly applies the rule-of-thumb estimate that application MTTFE scales as the inverse of the number of nodes used to run the application, so that running on half as many nodes increases MTTFE by a factor of two. However, this estimate ignores the fact that a non-trivial fraction of failures affect multiple compute nodes, so a single component failure can cause multiple application fatal errors. In the work that follows, a new model for application MTTFE is derived based on the impact of multi-component failures and their potential to terminate multiple applications.
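The rule-of-thumb estimate described in the abstract can be illustrated with a minimal sketch. It covers only the conventional inverse-scaling estimate, not the new model the paper derives. The function name, the per-node MTBF parameter, and the numbers in the usage lines are hypothetical and chosen purely for illustration; the sketch also assumes that every failure affects exactly one node, which is the very assumption the paper questions.

```python
def rule_of_thumb_mttfe(node_mtbf_hours: float, nodes: int) -> float:
    """Conventional estimate: application MTTFE scales as 1 / nodes.

    Assumes (hypothetically) identical nodes and that each failure
    takes down exactly one node. Multi-node failures are NOT modeled.
    """
    if nodes <= 0:
        raise ValueError("nodes must be a positive integer")
    return node_mtbf_hours / nodes


# Illustrative numbers only; they are not taken from the paper.
full_run = rule_of_thumb_mttfe(50_000.0, 1_000)  # estimate on 1000 nodes
half_run = rule_of_thumb_mttfe(50_000.0, 500)    # estimate on 500 nodes
print(full_run, half_run)
```

Under these assumptions, halving the node count doubles the estimated MTTFE, which is the behavior the abstract attributes to the rule of thumb.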
reliability, resilience, MTBF, failures, failure rate

S. E. Michalak, J. T. Daly and L. A. Pritchett-Sheats, "Application MTTFE vs. Platform MTBF: A Fresh Perspective on System Reliability and Application Throughput for Computations at Scale," 2008 8th International Symposium on Cluster Computing and the Grid (CCGRID '08), Lyon, 2008, pp. 795-800.