Parallel and Distributed Processing Symposium, International (2011)
Anchorage, Alaska USA
May 16, 2011 to May 20, 2011
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/IPDPS.2011.83
With the growth of system size and complexity, reliability has become of paramount importance for petascale systems. Reliability, Availability, and Serviceability (RAS) logs have been commonly used for failure analysis. However, analysis based on the RAS logs alone has proved insufficient for understanding failures and system behaviors. To overcome the limitations of existing methodologies, we analyze the Blue Gene/P RAS logs and the Blue Gene/P job logs in a cooperative manner. From our co-analysis effort, we have identified a dozen important observations about failure characteristics and job interruption characteristics on the Blue Gene/P systems. These observations can significantly facilitate research on fault resilience in large-scale systems.
W. Tang et al., "Co-analysis of RAS Log and Job Log on Blue Gene/P," 25th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2011), Anchorage, AK, 2011, pp. 840-851.