Issue No. 04 - April (2013 vol. 24)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TPDS.2012.96
B. Balasubramanian, Dept. of Electr. Eng., Princeton Univ., Princeton, NJ, USA
V. K. Garg, Electr. & Comput. Eng. Dept., Univ. of Texas at Austin, Austin, TX, USA
Replication is the prevalent solution to tolerate faults in large data structures hosted on distributed servers. To tolerate f crash faults (dead/unresponsive data structures) among n distinct data structures, replication requires f + 1 replicas of each data structure, resulting in nf additional backups. We present a solution, referred to as fusion, that uses a combination of erasure codes and selective replication to tolerate f crash faults using just f additional fused backups. We show that our solution achieves O(n) savings in space over replication. Further, we present a solution to tolerate f Byzantine faults (malicious data structures) that requires only nf + f backups, as compared to the 2nf backups required by replication. We explore the theory of fused backups and provide a library of such backups for all the data structures in the Java Collections Framework. The theoretical and experimental evaluation confirms that fused backups are space-efficient compared to replication, while causing very little overhead during normal operation. To illustrate the practical usefulness of fusion, we use fused backups for reliability in Amazon's highly available key-value store, Dynamo. While the current replication-based solution uses 300 backup structures, we present a solution that requires only 120 backup structures. This results in savings in space as well as in other resources such as power.
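The backup counts quoted in the abstract, and the basic idea of a fused backup, can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class and method names (FusionSketch, fuse, recover, and the counting helpers) are hypothetical and are not taken from the authors' library, the primaries are modeled as equal-length int arrays, and plain XOR parity is swapped in as the simplest possible erasure code, which covers only a single crash fault (f = 1). The abstract does not say which erasure code the authors use, and their scheme also relies on selective replication, which is not modeled here.

```java
import java.util.Arrays;

// Hypothetical sketch; names and data layout are assumptions, not the paper's API.
class FusionSketch {

    // Backup counts as stated in the abstract (n data structures, f faults).
    static long crashBackupsReplication(long n, long f) { return n * f; }
    static long crashBackupsFusion(long n, long f) { return f; }
    static long byzantineBackupsReplication(long n, long f) { return 2 * n * f; }
    static long byzantineBackupsFusion(long n, long f) { return n * f + f; }

    // Fuse n equal-length primaries into ONE backup using XOR parity.
    // Assumption: XOR stands in for a general erasure code and handles f = 1 only.
    static int[] fuse(int[][] primaries) {
        int[] fused = new int[primaries[0].length];
        for (int[] primary : primaries) {
            for (int j = 0; j < fused.length; j++) {
                fused[j] ^= primary[j];
            }
        }
        return fused;
    }

    // Rebuild the crashed primary at index 'lost' from the fused backup and
    // the surviving primaries. The entry primaries[lost] is never read.
    static int[] recover(int[][] primaries, int lost, int[] fused) {
        int[] rebuilt = fused.clone();
        for (int i = 0; i < primaries.length; i++) {
            if (i == lost) {
                continue;
            }
            for (int j = 0; j < rebuilt.length; j++) {
                rebuilt[j] ^= primaries[i][j];
            }
        }
        return rebuilt;
    }

    public static void main(String[] args) {
        long n = 10, f = 2;
        System.out.println("crash, replication: " + crashBackupsReplication(n, f));
        System.out.println("crash, fusion:      " + crashBackupsFusion(n, f));
        System.out.println("Byzantine, replication: " + byzantineBackupsReplication(n, f));
        System.out.println("Byzantine, fusion:      " + byzantineBackupsFusion(n, f));

        int[][] primaries = { {1, 2, 3}, {4, 5, 6}, {7, 8, 9} };
        int[] fused = fuse(primaries);
        int[][] afterCrash = { primaries[0], null, primaries[2] }; // primary 1 crashed
        System.out.println(Arrays.toString(recover(afterCrash, 1, fused)));
    }
}
```

With replication, each of the three arrays above would need its own copy to survive one crash; here a single fused array of the same length serves all three, which is the source of the factor-of-n space saving the abstract describes. The trade-off visible even in this toy version is that recovery must read every surviving primary.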
replicated databases, client-server systems, computational complexity, data structures, fault tolerant computing, Java, distributed systems, fused data structure backup library, distributed servers, crash fault tolerance, dead data structure replication, unresponsive data structure replication, erasure codes, Byzantine faults, malicious data structures, Java Collections Framework, Amazon, Dynamo, computer crashes, servers, indexes, fault tolerance, fault tolerant systems, arrays
B. Balasubramanian, V. K. Garg, "Fault Tolerance in Distributed Systems Using Fused Data Structures", IEEE Transactions on Parallel & Distributed Systems, vol. 24, no. 4, pp. 701-715, April 2013, doi:10.1109/TPDS.2012.96