Transparent Fault-Tolerance Using Intra-Machine Full-Software-Stack Replication on Commodity Multicore Hardware
2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS) (2017)
Atlanta, Georgia, USA
June 5, 2017 to June 8, 2017
As the number of processors and the size of the memory of computing systems keep increasing, the likelihood of CPU core failures, memory errors, and bus failures increases and can threaten system availability. Software components can be hardened against such failures by running several replicas of a component on hardware replicas that fail independently and that are coordinated by a State-Machine Replication protocol. One common solution is to replicate the physical machine to provide redundancy, and to rewrite the software to address coordination. However, a CPU core failure, a memory error, or a bus error is unlikely to always crash an entire machine. Thus, full machine replication may sometimes be an overkill, increasing resource costs. In this paper, we introduce full software stack replication within a single commodity machine. Our approach runs replicas on fault-independent hardware partitions (e.g., NUMA nodes), wherein each partition is software-isolated from the others and has its own CPU cores, memory, and full software stack. A hardware failure in one partition can be recovered by another partition taking over its functionality. We have realized this vision by implementing FT-Linux, a Linux-based operating system that transparently replicates race-free, multithreaded POSIX applications on different hardware partitions of a single machine. Our evaluations of FT-Linux on several popular Linux applications show a worst case slowdown (due to replication) by ≈20%.
Hardware, Fault tolerance, Fault tolerant systems, Linux, Kernel, Servers
G. Losa, A. Barbalace, Y. Wen, H. Chuang and B. Ravindran, "Transparent Fault-Tolerance Using Intra-Machine Full-Software-Stack Replication on Commodity Multicore Hardware," 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), Atlanta, Georgia, USA, 2017, pp. 1521-1531.