Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Problem Diagnosis in Large-Scale Computing Environments
Tampa, Florida
November 11-November 17
ISBN: 0-7695-2700-0
Naoya Maruyama, Tokyo Institute of Technology
Barton P. Miller, University of Wisconsin
We describe a new approach for locating the causes of anomalies in distributed systems. Our target environment is a distributed application that contains multiple identical processes performing similar activities. We use a new, lightweight form of dynamic instrumentation to collect function-level traces from each process. If the application fails, the traces are automatically compared to each other. We find anomalies by identifying processes that stopped earlier than the rest (a sign of a fail-stop problem) or processes that behaved differently from the rest (a sign of a non-fail-stop problem). Our algorithm does not require reference data to distinguish anomalies from normal behaviors. However, it can make use of such data when available to reduce the number of false positives. Ultimately, we identify a function that is likely to explain the anomalous behavior. We demonstrated the efficacy of our approach by finding two problems in a large distributed cluster environment called SCore.
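The trace comparison described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the trace format, the distance measure, or the ranking rule, so the per-process time profiles, the Manhattan distance, the k-th-nearest-neighbor score, and all function names below are assumptions chosen only for illustration.

```python
from statistics import median


def earliest_stoppers(last_timestamps, gap_threshold):
    """Fail-stop check (assumed rule): flag processes whose last trace
    record is more than gap_threshold earlier than the median process."""
    mid = median(last_timestamps.values())
    return sorted(p for p, t in last_timestamps.items() if mid - t > gap_threshold)


def normalize(profile):
    """Turn {function: seconds} into {function: fraction of total time}."""
    total = sum(profile.values())
    return {f: t / total for f, t in profile.items()} if total else {}


def distance(a, b):
    """Manhattan distance between two normalized profiles (an assumption)."""
    return sum(abs(a.get(f, 0.0) - b.get(f, 0.0)) for f in set(a) | set(b))


def suspect_scores(profiles, k=1):
    """Non-fail-stop check: score each process by the distance to its k-th
    nearest neighbor. No reference data is needed; a process that is far
    from all of its peers receives a high score."""
    norm = {p: normalize(prof) for p, prof in profiles.items()}
    scores = {}
    for p in norm:
        dists = sorted(distance(norm[p], norm[q]) for q in norm if q != p)
        scores[p] = dists[min(k, len(dists)) - 1] if dists else 0.0
    return scores


def most_divergent_function(profiles, suspect):
    """Pick the function whose share of time in the suspect process differs
    most from the average share in the other processes."""
    mine = normalize(profiles[suspect])
    others = [normalize(prof) for p, prof in profiles.items() if p != suspect]
    funcs = set(mine)
    for o in others:
        funcs |= set(o)

    def delta(f):
        avg = sum(o.get(f, 0.0) for o in others) / len(others) if others else 0.0
        return abs(mine.get(f, 0.0) - avg)

    # sorted() makes tie-breaking deterministic
    return max(sorted(funcs), key=delta)
```

In this sketch, a process is first checked for an early stop; if none is found, the process with the highest suspect score is taken as the anomaly, and `most_divergent_function` names the function that best separates it from its peers. Reference data from known-good runs, which the abstract mentions as optional, could be added as extra profiles so that behavior matching a normal run is no longer scored as an outlier.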
Citation:
Alexander V. Mirgorodskiy, Naoya Maruyama, Barton P. Miller, "Problem Diagnosis in Large-Scale Computing Environments," sc, pp.11, Proceedings of the 2006 ACM/IEEE conference on Supercomputing, 2006