2003 International Conference on Parallel Processing (ICPP'03)
The Robust Middleware Approach for Transparent and Systematic Fault Tolerance in Parallel and Distributed Systems
Kaohsiung, Taiwan
October 6-9, 2003
ISBN: 0-7695-2017-0
In this paper, we propose the robust middleware approach to transparent fault tolerance in parallel and distributed systems. The proposed approach inserts a robust middleware layer between algorithms/programs and the system architecture/hardware. With the robust middleware, hardware faults are transparent to algorithms/programs, so ordinary algorithms/programs developed for fault-free networks can run on faulty parallel/distributed systems without modification. Moreover, the robust middleware automatically adds fault tolerance capability to ordinary algorithms/programs, so no hardware redundancy or reconfiguration capability is required, and no assumption is made about the availability of a complete subnetwork (of lower dimension or smaller size). We also propose nomadic agent multithreaded programming as a novel fault-aware programming paradigm that is independent of network topologies and fault patterns. Nomadic agent multithreaded programming adapts to fault, traffic, and workload patterns, and can take advantage of various components of the robust middleware, including the fault tolerance features and multiple embeddings, without relying on specialized robust algorithms.