Cluster Computing and the Grid, IEEE International Symposium on (2008)
May 19, 2008 to May 22, 2008
The Grid is inherently unreliable due to its geographical dispersion, heterogeneity, and the involvement of multiple administrative domains. The most general class of failures is that of so-called Byzantine failures, where no assumptions can be made about the behavior of faulty components. This paper describes a novel system that diagnoses and tolerates Byzantine faults based on service replication. We propose, briefly describe, and compare two fail-stop and two Byzantine fault tolerance algorithms. Because many larger-scale scientific Grid applications produce complex outputs, comparing replica results, as required to implement Byzantine fault tolerance, becomes a non-trivial task. We therefore include an automation mechanism for this particular problem, based on a generic description language and code generation. Our approach has been implemented as an extension to the Otho Toolkit, a system that synthesizes tailor-made wrapper services for a given application, Grid environment, and resource. An analysis of performance and overheads for three real-world applications completes our work.
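The replica-comparison step mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm or the Otho Toolkit's generated code. It assumes that up to `max_faulty` replicas may return arbitrary results, that a result is accepted once `max_faulty + 1` replicas agree on it, and that agreement is decided by an application-specific comparator. The names `vote`, `close_enough`, and `max_faulty` are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def vote(
    results: Sequence[T],
    equivalent: Callable[[T, T], bool],
    max_faulty: int,
) -> Optional[T]:
    """Return a result backed by at least max_faulty + 1 replicas, else None.

    Assumption: with at most max_faulty faulty replicas, any group of
    max_faulty + 1 equivalent results contains at least one correct replica.
    """
    groups: List[List[T]] = []
    for result in results:
        # Compare against the first member of each group (a simplification:
        # tolerance-based equivalence is not strictly transitive).
        for group in groups:
            if equivalent(group[0], result):
                group.append(result)
                break
        else:
            groups.append([result])

    largest = max(groups, key=len, default=[])
    return largest[0] if len(largest) >= max_faulty + 1 else None


def close_enough(a: Sequence[float], b: Sequence[float], tol: float = 1e-6) -> bool:
    """Hypothetical comparator for numeric outputs: element-wise tolerance."""
    return len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b))


if __name__ == "__main__":
    replica_outputs = [[1.0, 2.0], [1.0, 2.0000001], [9.0, 9.0]]
    print(vote(replica_outputs, close_enough, max_faulty=1))
```

The comparator is passed in as a parameter because, as the abstract notes, deciding whether two complex outputs agree is application-specific; exact equality would reject results that differ only by floating-point noise.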
Grid, HPC, Fault Tolerance, Byzantine Fault Tolerance
J. Hofer and T. Fahringer, "Synthesizing Byzantine Fault-Tolerant Grid Application Wrapper Services," 2008 8th International Symposium on Cluster Computing and the Grid (CCGRID '08), Lyon, 2008, pp. 467-474.