2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID)
Bad Words: Finding Faults in Spirit's Syslogs
May 19-May 22
ISBN: 978-0-7695-3156-4
Accurate fault detection is a key element of resilient computing. Syslogs provide key information regarding faults, and are found on nearly all computing systems. Discovering new fault types requires expert human effort, however, as no previous algorithm has been shown to localize faults in time and space with an operationally acceptable false positive rate. We present experiments on three weeks of syslogs from Sandia's 512-node "Spirit" Linux cluster, showing one algorithm that localizes 50% of faults with 75% precision, corresponding to an excellent false positive rate of 0.05%. The salient characteristics of this algorithm are (1) calculation of nodewise information entropy, and (2) encoding of word position. The key observation is that similar computers correctly executing similar work should produce similar logs.
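The two characteristics named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the input shape (a dictionary from node name to message list), the function names `encode` and `node_scores`, and the specific log-entropy weighting are all assumptions made for illustration. Each word is tagged with its position in the message, each tagged term is weighted by how unevenly it is spread across nodes, and each node is scored by the weighted terms it emits, so a node that logs things its peers do not stands out.

```python
import math
from collections import Counter


def encode(line):
    # Encode word position: "error" as word 2 is a different term
    # from "error" as word 5 (illustrative encoding, an assumption).
    return [f"{i}:{w}" for i, w in enumerate(line.split())]


def node_scores(logs):
    """Score each node by the information content of its messages.

    logs: dict mapping node name -> list of message strings
    (a hypothetical input shape chosen for this sketch).
    """
    # Per-node term counts.
    counts = {
        node: Counter(t for line in lines for t in encode(line))
        for node, lines in logs.items()
    }
    n_nodes = len(counts)

    # Total occurrences of each term across all nodes.
    totals = Counter()
    for c in counts.values():
        totals.update(c)

    # Entropy-based weight per term: 1.0 when a term appears on a
    # single node, 0.0 when it is spread evenly across all nodes.
    weights = {}
    for term, total in totals.items():
        h = 0.0
        for c in counts.values():
            x = c.get(term, 0)
            if x:
                p = x / total
                h += p * math.log2(p)
        weights[term] = 1.0 + h / math.log2(n_nodes) if n_nodes > 1 else 1.0

    # Node score: Euclidean norm of weighted, log-damped term counts.
    return {
        node: math.sqrt(
            sum((weights[t] * math.log2(x + 1)) ** 2 for t, x in c.items())
        )
        for node, c in counts.items()
    }
```

With four nodes that all log `kernel: eth0 link up` and one of them also logging `kernel: scsi error on sda`, the extra message's terms receive weight 1.0 and that node gets the highest score, which reflects the observation that similar computers doing similar work should produce similar logs. A real deployment would also need a time window and a score threshold, neither of which is specified here.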
Citation:
Jon Stearley, Adam J. Oliner, "Bad Words: Finding Faults in Spirit's Syslogs," ccgrid, pp.765-770, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID), 2008