Issue No. 04 - April (2015 vol. 64)
ISSN: 0018-9340
pp: 1162-1176
Wen Xia, Wuhan National Laboratory for Optoelectronics, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
Hong Jiang, Department of Computer Science and Engineering, University of Nebraska-Lincoln, 217 Schorr Center, 1101 T Street, Lincoln,
Dan Feng, Wuhan National Laboratory for Optoelectronics, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
Yu Hua, Wuhan National Laboratory for Optoelectronics, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
ABSTRACT
Data deduplication has gained increasing attention and popularity as a space-efficient approach in backup storage systems. One of the main challenges for centralized data deduplication is the scalability of fingerprint-index search. In this paper, we propose SiLo, a near-exact and scalable deduplication system that effectively and complementarily exploits the similarity and locality of data streams to achieve high duplicate elimination, high throughput, and well-balanced load at extremely low RAM overhead. The main idea behind SiLo is to expose and exploit more similarity by grouping strongly correlated small files into a segment and segmenting large files, and to leverage the locality in the data stream by grouping contiguous segments into blocks to capture similar and duplicate data missed by the probabilistic similarity detection. SiLo also employs a locality-based stateless routing algorithm to parallelize and distribute data blocks to multiple backup nodes. By judiciously enhancing similarity through the exploitation of locality and vice versa, SiLo is able to significantly reduce RAM usage for index lookup, achieve near-exact duplicate-elimination efficiency, maintain a high deduplication throughput, and obtain load balance among backup nodes.
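The abstract describes the mechanism only at a high level, so the following is a minimal Python sketch of the idea rather than the authors' implementation. It groups chunk fingerprints into segments, groups contiguous segments into blocks, keeps only one representative fingerprint per segment in a RAM-resident similarity index, and loads a whole block's fingerprints on a similarity hit so that locality catches duplicates the similarity lookup alone would miss. The class name `SiLoSketch`, the constants, the use of the minimum fingerprint as the segment signature, and the modulo-based `route` function are all assumptions made for illustration; none of them is stated in the abstract.

```python
import hashlib

SEGMENT_CHUNKS = 4   # chunks per segment (illustrative value, not from the paper)
BLOCK_SEGMENTS = 2   # contiguous segments per block (illustrative value)


def fingerprint(chunk: bytes) -> str:
    """Content fingerprint of a chunk (SHA-1 chosen here as an assumption)."""
    return hashlib.sha1(chunk).hexdigest()


def make_segments(chunks):
    """Group a stream of chunks into fixed-size segments of fingerprints."""
    fps = [fingerprint(c) for c in chunks]
    return [fps[i:i + SEGMENT_CHUNKS] for i in range(0, len(fps), SEGMENT_CHUNKS)]


def representative(segment):
    """Assumed similarity signature: the minimum fingerprint in the segment."""
    return min(segment)


def route(block, num_nodes):
    """Assumed stateless routing: depends only on the block's own content."""
    block_signature = min(representative(seg) for seg in block)
    return int(block_signature, 16) % num_nodes


class SiLoSketch:
    def __init__(self):
        self.similarity_index = {}  # representative fingerprint -> block id (RAM)
        self.blocks = {}            # block id -> fingerprints (stands in for disk)
        self.cache = set()          # fingerprints of blocks loaded in this backup

    def backup(self, chunks):
        """Deduplicate one backup stream; return the number of chunks stored."""
        self.cache = set()          # start each backup with a cold cache
        segments = make_segments(chunks)
        stored = 0
        for start in range(0, len(segments), BLOCK_SEGMENTS):
            block = segments[start:start + BLOCK_SEGMENTS]
            # Similarity: one RAM lookup per segment, using its signature only.
            for seg in block:
                hit = self.similarity_index.get(representative(seg))
                if hit is not None:
                    # Locality: load the whole matching block, not one segment.
                    self.cache |= self.blocks[hit]
            block_fps = set()
            for seg in block:
                for fp in seg:
                    if fp not in self.cache and fp not in block_fps:
                        stored += 1  # treated as new data
                    block_fps.add(fp)
            block_id = len(self.blocks)
            self.blocks[block_id] = block_fps
            for seg in block:
                self.similarity_index.setdefault(representative(seg), block_id)
        return stored
```

In this sketch, RAM holds one index entry per segment instead of one per chunk, which is the source of the memory saving; the price is that a duplicate whose segment and neighbouring segments all miss the similarity index is stored again, which is why the result is near-exact rather than exact.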
INDEX TERMS
Random access memory, Servers, Indexing, Throughput, Scalability, Probabilistic logic
CITATION
Wen Xia, Hong Jiang, Dan Feng, Yu Hua, "Similarity and Locality Based Indexing for High Performance Data Deduplication", IEEE Transactions on Computers, vol. 64, no. 4, pp. 1162-1176, April 2015, doi:10.1109/TC.2014.2308181