Data de-duplication is a simple compression method that has become very popular in storage archival and backup. It has the advantage of direct, random access to any piece ("chunk") of a file in one table lookup; that is not the case with differential file compression, the other common storage archival method. The compression efficiency (chunk matching) of de-duplication improves for smaller chunk sizes; however, the sequence of hashes replacing the de-duplicated object (file) grows significantly. We propose a simple scheme to shrink the list of hashes generated during de-duplication of an object. The shrunken list is orders of magnitude smaller than what a customary compression algorithm (gzip) achieves, and this has a significant impact on overall de-duplication efficiency.
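The trade-off described above can be illustrated with a minimal sketch of plain de-duplication. It is not the scheme proposed in the paper. Fixed-size chunking, SHA-1 as the hash, an in-memory dictionary as the chunk table, and the names `deduplicate`, `read_chunk`, and `restore` are all assumptions made for illustration.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int, store: dict) -> list:
    """Split data into fixed-size chunks and return its list of chunk hashes.

    Assumptions for illustration: fixed-size chunks, SHA-1 digests, and a
    plain dict as the chunk table (hash -> chunk bytes).
    """
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha1(chunk).digest()  # 20-byte hash per chunk
        store.setdefault(digest, chunk)        # keep each distinct chunk once
        recipe.append(digest)                  # the object becomes a hash list
    return recipe


def read_chunk(recipe: list, store: dict, index: int) -> bytes:
    """Random access to one chunk: a single lookup in the chunk table."""
    return store[recipe[index]]


def restore(recipe: list, store: dict) -> bytes:
    """Rebuild the original object from its list of hashes."""
    return b"".join(store[digest] for digest in recipe)
```

Calling `deduplicate` on the same data with a smaller `chunk_size` returns a proportionally longer list of hashes. For example, a 4096-byte object yields 4 hashes at 1024-byte chunks and 16 hashes at 256-byte chunks. This list is the overhead that the proposed scheme aims to shrink.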
Index Terms:
Data De-duplication, cryptographic hashes compression
Citation:
Subashini Balachandran, Cornel Constantinescu, "Sequence of Hashes Compression in Data De-duplication," Data Compression Conference (DCC 2008), p. 505, 2008