2008 Seventh IEEE International Symposium on Network Computing and Applications
Identifying Failures in Grids through Monitoring and Ranking
July 10-July 12
ISBN: 978-0-7695-3192-2
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/NCA.2008.10
In this paper we present FailRank, a novel framework for integrating and ranking information sources that characterize failures in a Grid system. Once the failing sites have been ranked, they can be removed from the job scheduling resource pool, yielding a more predictable, dependable and adaptive infrastructure. We also present the tools we developed to evaluate the FailRank framework. In particular, we present the FailBase Repository, a 38GB corpus of state information that characterizes the EGEE Grid for one month in 2007. Such a corpus paves the way for the community to systematically uncover new, previously unknown patterns and rules among the multitude of parameters that can contribute to failures in a Grid environment. Additionally, we present an experimental evaluation of the FailRank system over 30 days, which shows that our framework identifies failures in 93% of the cases. We believe that our work constitutes another important step towards realizing adaptive Grid computing systems.
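The rank-then-exclude idea summarized in the abstract can be illustrated with a minimal sketch. The abstract does not give the scoring function, so the weighted sum below is an assumption, and the indicator names (`queue_delay`, `failed_tests`), the site names, and the helper functions are hypothetical; this is not the FailRank implementation.

```python
from typing import Dict, List

# Hypothetical per-site failure indicators in [0, 1]; higher means less healthy.
SiteMetrics = Dict[str, float]


def failure_score(metrics: SiteMetrics, weights: Dict[str, float]) -> float:
    """Weighted sum of indicators (an assumed scoring rule, not the paper's)."""
    return sum(weights.get(name, 0.0) * value for name, value in metrics.items())


def top_k_failing(sites: Dict[str, SiteMetrics],
                  weights: Dict[str, float], k: int) -> List[str]:
    """Return the k sites with the highest failure score, worst first."""
    ranked = sorted(sites, key=lambda s: failure_score(sites[s], weights),
                    reverse=True)
    return ranked[:k]


def schedulable_pool(sites: Dict[str, SiteMetrics],
                     weights: Dict[str, float], k: int) -> List[str]:
    """Drop the top-k failing sites from the pool offered to the scheduler."""
    excluded = set(top_k_failing(sites, weights, k))
    return [s for s in sites if s not in excluded]


# Illustrative data only; these values are made up for the example.
sites = {
    "site-a": {"queue_delay": 0.9, "failed_tests": 0.8},
    "site-b": {"queue_delay": 0.1, "failed_tests": 0.0},
    "site-c": {"queue_delay": 0.4, "failed_tests": 0.6},
}
weights = {"queue_delay": 0.5, "failed_tests": 0.5}

print(top_k_failing(sites, weights, 1))
print(schedulable_pool(sites, weights, 1))
```

In this sketch, a larger `k` trades a smaller resource pool for a lower chance of scheduling jobs on an unhealthy site.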
Index Terms:
Grid Computing, Dependability, Top-k Ranking
Citation:
Demetrios Zeinalipour-Yazti, Kyriacos Neocleous, Chryssis Georgiou, Marios D. Dikaiakos, "Identifying Failures in Grids through Monitoring and Ranking," nca, pp.291-298, 2008 Seventh IEEE International Symposium on Network Computing and Applications, 2008