Second IEEE International Conference on e-Science and Grid Computing (e-Science'06)
Job Failure Analysis and Its Implications in a Large-Scale Production Grid
Amsterdam, Netherlands
December 04-December 06
ISBN: 0-7695-2734-5
Hui Li, Leiden University, The Netherlands
David Groep, National Institute for Nuclear and High Energy Physics (NIKHEF), The Netherlands
Lex Wolters, Leiden University, The Netherlands
Jeff Templon, National Institute for Nuclear and High Energy Physics (NIKHEF), The Netherlands
In this paper we present an initial analysis of job failures in a large-scale data-intensive Grid. Based on three representative periods in production, we characterize the interarrival times and life spans of failed jobs. Different failure types are distinguished and the analysis is carried out further at the Virtual Organization (VO) level. The spatial behavior, namely where job failures occur in the Grid, is also examined. Cross-correlation structures, including how arrivals correlate with life spans of job failures, are analyzed and illustrated. We further investigate statistical models to fit the failure data and propose several failure-aware scheduling strategies at the Grid level.

Our results show that the overall failure rates in the Grid are quite significant, ranging from 25% to 33% of all submitted jobs. However, only 5% to 8% of the jobs failed after running on a certain Computing Element (CE). The rest of the failed jobs were aborted or cancelled without running. A majority of failed jobs come from several large production VOs, and a large number of these failures are concentrated on several main CEs. The interarrival time processes of failed jobs are shown to be bursty, and the life spans exhibit strong autocorrelations. Based on the failure patterns, we argue that it is important for the Grid resource brokers to track historical failures and take them into account in decision making. Some proactive measures and accountability issues are also discussed.
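
As an illustration of the kind of measurement the abstract refers to, the following is a minimal sketch of how burstiness and autocorrelation might be estimated from a failure log. It is not the authors' method: the function names, the use of the coefficient of variation as a burstiness indicator, and the sample timestamps are all assumptions made for this example.

```python
from statistics import mean, pstdev


def interarrival_times(timestamps):
    """Gaps between consecutive failure arrival times (input is sorted first)."""
    ts = sorted(timestamps)
    return [b - a for a, b in zip(ts, ts[1:])]


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean.

    Assumption for this sketch: values well above 1 are read as a sign of
    arrivals that are burstier than a Poisson process would produce.
    """
    m = mean(values)
    if m == 0:
        raise ValueError("mean is zero; coefficient of variation is undefined")
    return pstdev(values) / m


def autocorrelation(values, lag=1):
    """Sample autocorrelation of a series at the given lag."""
    n = len(values)
    m = mean(values)
    var = sum((v - m) ** 2 for v in values)
    if var == 0 or lag >= n:
        return 0.0
    cov = sum((values[i] - m) * (values[i + lag] - m) for i in range(n - lag))
    return cov / var


if __name__ == "__main__":
    # Hypothetical failure arrival times in seconds (not data from the paper).
    arrivals = [0, 2, 3, 4, 120, 121, 123, 400, 401, 402]
    gaps = interarrival_times(arrivals)
    print("interarrival gaps:", gaps)
    print("coefficient of variation:", coefficient_of_variation(gaps))

    # Hypothetical life spans of consecutive failed jobs, in seconds.
    life_spans = [30, 32, 31, 300, 310, 305, 29, 33]
    print("lag-1 autocorrelation:", autocorrelation(life_spans, lag=1))
```

A resource broker that tracks historical failures could apply the same functions per CE or per VO, though how the results would feed into scheduling decisions is left open here.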

Citation:
Hui Li, David Groep, Lex Wolters, Jeff Templon, "Job Failure Analysis and Its Implications in a Large-Scale Production Grid," e-science, pp.27, Second IEEE International Conference on e-Science and Grid Computing (e-Science'06), 2006