First International Conference on Availability, Reliability and Security (ARES'06)
Vienna, Austria
Apr. 20, 2006 to Apr. 22, 2006
ISBN: 0-7695-2567-9
pp: 639-645
S. L. Scott, Oak Ridge National Laboratory, Oak Ridge, TN
C. Leangsuksun, Louisiana Tech University, Ruston, LA
C. Engelmann, University of Reading, Reading, RG6 6AH, UK
X. He, Tennessee Technological University, Cookeville, TN
ABSTRACT
Today's high performance computing systems have several reliability deficiencies that result in availability and serviceability issues. Head and service nodes represent a single point of failure and control for an entire system: when one fails, the system is inaccessible and unmanageable until it is repaired, causing significant downtime. This paper introduces two distinct replication methods (internal and external) for providing symmetric active/active high availability for multiple head and service nodes running in virtual synchrony. It compares both methods in terms of expected correctness, ease of use, and performance, based on early results from ongoing work on symmetric active/active high availability for two HPC system services (TORQUE and the PVFS metadata server). It continues with a short description of a distributed mutual exclusion algorithm and a brief statement regarding the handling of Byzantine failures. The paper concludes with an overview of past and ongoing work and a short summary of the presented research.
CITATION
S. L. Scott, C. Leangsuksun, C. Engelmann, X. He, "Active/Active Replication for Highly Available HPC System Services", First International Conference on Availability, Reliability and Security (ARES'06), pp. 639-645, 2006, doi:10.1109/ARES.2006.23