Issue No. 07 - July (2015 vol. 64)
ISSN: 0018-9340
pp: 1898-1911
Runhui Li , Department of Computer Science and Engineering, Chinese University of Hong Kong, Shatin, NT, Hong Kong
Jian Lin , Department of Computer Science and Engineering, Chinese University of Hong Kong, Shatin, NT, Hong Kong
Patrick P.C. Lee , Department of Computer Science and Engineering, Chinese University of Hong Kong, Shatin, NT, Hong Kong
ABSTRACT
Data availability is critical in distributed storage systems, especially when node failures are prevalent in real life. A key requirement is to minimize the amount of data transferred among nodes when recovering the lost or unavailable data of failed nodes. This paper explores recovery solutions based on regenerating codes, which have been designed to provide fault-tolerant storage and minimum recovery bandwidth. Existing optimal regenerating codes are designed for single node failures. We build a system called CORE, which augments existing optimal regenerating codes to support the recovery of a general number of failures, including single and concurrent failures. We show theoretically that CORE achieves the minimum possible bandwidth in most cases. We implement a CORE prototype and evaluate it atop an HDFS cluster testbed with up to 20 storage nodes. We demonstrate that our CORE prototype conforms to our theoretical findings and achieves bandwidth savings when compared to the conventional recovery approach based on erasure codes.
INDEX TERMS
Bandwidth, Peer-to-peer computing, Nickel, Strips, Equations, Encoding, Availability
CITATION

R. Li, J. Lin and P. P. Lee, "Enabling Concurrent Failure Recovery for Regenerating-Coding-Based Storage Systems: From Theory to Practice," in IEEE Transactions on Computers, vol. 64, no. 7, pp. 1898-1911, 2015.
doi:10.1109/TC.2014.2349518