Shengmei Luo is with the Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China (email: firstname.lastname@example.org).
As data volumes in data centers continue to grow, cloud storage systems face ongoing challenges in saving storage capacity and in moving big data within an acceptable time frame. In this paper, we present Boafft, a cloud storage system with distributed deduplication. Boafft achieves scalable throughput and capacity by using multiple data servers to deduplicate data in parallel, with minimal loss of deduplication ratio. First, Boafft uses an efficient data routing algorithm based on data similarity, which reduces network overhead by quickly identifying the storage location for incoming data. Second, Boafft maintains an in-memory similarity index in each data server, which avoids a large number of random disk reads and writes and thereby accelerates local deduplication. Third, Boafft builds a hot fingerprint cache in each data server based on access frequency, which improves the deduplication ratio. Our comparative analysis with EMC's stateful routing algorithm shows that Boafft achieves a comparable deduplication ratio with lower network bandwidth overhead. Moreover, Boafft makes better use of storage space, delivers higher read/write bandwidth, and achieves good load balance.
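The similarity-based routing idea above can be sketched minimally: hash each chunk of a superchunk to a fingerprint and route the superchunk by its minimum fingerprint, so that similar superchunks (which tend to share their smallest fingerprint, as in min-hashing) land on the same data server. This is a hedged illustration, not Boafft's actual algorithm; the chunking, fingerprint width, and server-selection rule here are all assumptions for the sketch.

```python
import hashlib

def chunk_fingerprint(chunk: bytes) -> int:
    """Hash a chunk to a fixed-width integer fingerprint (first 8 bytes of SHA-1)."""
    return int.from_bytes(hashlib.sha1(chunk).digest()[:8], "big")

def route_superchunk(chunks: list, num_servers: int) -> int:
    """Route a superchunk (a list of chunk payloads) to a data server.

    The representative fingerprint is the minimum chunk fingerprint;
    similar superchunks usually share it, so they route to the same
    server and can be deduplicated locally there.
    """
    representative = min(chunk_fingerprint(c) for c in chunks)
    return representative % num_servers
```

Routing is deterministic and stateless: two clients holding the same superchunk compute the same target server without consulting a central index, which is what keeps the network overhead of routing low.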
file system, big data, cloud storage, data deduplication, data routing
S. Luo, G. Zhang, C. Wu, S. Khan and K. Li, "Boafft: Distributed Deduplication for Big Data Storage in the Cloud," in IEEE Transactions on Cloud Computing.