Vienna, Austria
April 20, 2006 to April 22, 2006
ISBN: 0-7695-2567-9
pp: 639-645
C. Engelmann, University of Reading, Reading, RG6 6AH, UK
S. L. Scott, Oak Ridge National Laboratory, Oak Ridge, TN
C. Leangsuksun, Louisiana Tech University, Ruston, LA
X. He, Tennessee Technological University, Cookeville, TN
ABSTRACT
Today's high performance computing systems have several reliability deficiencies resulting in availability and serviceability issues. Head and service nodes represent a single point of failure and control for an entire system: if one fails, the system is inaccessible and unmanageable until it is repaired, causing significant downtime. This paper introduces two distinct replication methods (internal and external) for providing symmetric active/active high availability for multiple head and service nodes running in virtual synchrony. It compares both methods in terms of expected correctness, ease of use, and performance, based on early results from ongoing work on providing symmetric active/active high availability for two HPC system services (TORQUE and the PVFS metadata server). It continues with a short description of a distributed mutual exclusion algorithm and a brief statement regarding the handling of Byzantine failures. The paper concludes with an overview of past and ongoing work and a short summary of the presented research.
CITATION
C. Engelmann, S. L. Scott, C. Leangsuksun, X. He, "Active/Active Replication for Highly Available HPC System Services", Proceedings of the First International Conference on Availability, Reliability and Security (ARES), 2006, pp. 639-645, doi:10.1109/ARES.2006.23