Fourth International Workshop on Grid Computing
Faults in Grids: Why are they so bad and What can be done about it?
Phoenix, Arizona
November 17, 2003
ISBN: 0-7695-2026-X
Raissa Medeiros, Universidade Federal de Campina Grande, Paraíba, Brazil
Walfredo Cirne, Universidade Federal de Campina Grande, Paraíba, Brazil
Francisco Brasileiro, Universidade Federal de Campina Grande, Paraíba, Brazil
Jacques Sauvé, Universidade Federal de Campina Grande, Paraíba, Brazil
Computational Grids have the potential to become the main execution platform for high-performance and distributed applications. However, such systems are extremely complex and prone to failures. In this paper, we present a survey of the grid community in which several people shared their actual experience with fault treatment. The survey reveals that, nowadays, users have to be highly involved in diagnosing failures, that most failures are due to configuration problems (a hint of the area's immaturity), and that solutions for dealing with failures are mainly application-dependent. Going further, we identify two main reasons for this state of affairs. First, grid components that provide high-level abstractions when working expose all the gory details when broken. Since there are no appropriate mechanisms to deal with the complexity exposed (configuration, middleware, hardware, and software issues), users need to be deeply involved in the diagnosis and correction of failures. To address this problem, one needs a way to coordinate the different support teams working at the grid's different levels of abstraction. Second, the fault tolerance schemes implemented on grids today tolerate only crash failures. Since grids are prone to more complex failures, such as those caused by heisenbugs, one needs to tolerate tougher failures. Our hope is that the very heterogeneity that makes a grid a complex environment can help in the creation of diverse software replicas, a strategy that can tolerate more complex failures.
Citation:
Raissa Medeiros, Walfredo Cirne, Francisco Brasileiro, Jacques Sauvé, "Faults in Grids: Why are they so bad and What can be done about it?," grid, pp.18, Fourth International Workshop on Grid Computing, 2003