Issue No.06 - June (2011 vol.60)
pp: 824-840
Jaehong Min, Hanyang University, Seoul
Youjip Won, Hanyang University, Seoul
In this work, we focus on optimizing the deduplication system by adjusting the pertinent factors in fingerprint lookup and chunking, which we identify as the key ingredients of efficient deduplication. For efficient fingerprint lookup, we propose a fingerprint management scheme called LRU-based Index Partitioning. For efficient chunking, we propose the Incremental Modulo-K (INC-K) algorithm, an optimized version of Rabin's algorithm in which we significantly reduce the number of arithmetic operations by exploiting the algebraic nature of modulo arithmetic. LRU-based Index Partitioning uses the notion of a tablet and enforces access locality of the fingerprint lookup when storing fingerprints. We maintain tablets in an LRU manner to exploit the temporal locality of the fingerprint lookup. To preserve access correlation across the tablets, we apply prefetching in maintaining the tablet list. We propose Context-aware chunking to maximize chunking speed and deduplication ratio. We develop a prototype backup system and perform a comprehensive analysis of various factors and their relationships: average chunk size, chunking speed, deduplication ratio, tablet management algorithms, and overall backup speed. When the average chunk size is increased from 4 KB to 10 KB, chunking time increases by 34.3 percent, the deduplication ratio decreases by 0.66 percent, and the overall backup speed increases by 50 percent (from 51.4 MB/sec to 77.8 MB/sec).
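The abstract does not give the details of INC-K, so the following is only a generic sketch of the underlying idea: a window signature taken modulo K can be updated incrementally as the window slides, rather than recomputed from scratch. The base, modulus, and function names are assumptions chosen for illustration and are not taken from the paper.

```python
# Generic rolling-hash sketch (NOT the paper's INC-K algorithm).
# BASE and MOD are assumed values picked only for illustration.
BASE = 257
MOD = (1 << 31) - 1  # plays the role of the modulus K


def direct_hash(window: bytes) -> int:
    """Recompute the polynomial signature of a window from scratch."""
    h = 0
    for b in window:
        h = (h * BASE + b) % MOD
    return h


def rolling_hashes(data: bytes, w: int) -> list[int]:
    """Signatures of every w-byte window, using one O(1) update per shift."""
    if w <= 0 or len(data) < w:
        return []
    top = pow(BASE, w - 1, MOD)  # weight of the byte that leaves the window
    h = direct_hash(data[:w])
    out = [h]
    for i in range(w, len(data)):
        # Drop the outgoing byte, shift by BASE, add the incoming byte.
        h = ((h - data[i - w] * top) * BASE + data[i]) % MOD
        out.append(h)
    return out
```

Each shift costs a constant number of multiplications and one reduction, whereas recomputing the signature costs work proportional to the window size. A chunker built on this would declare a chunk boundary whenever the signature matches a chosen bit pattern.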
Deduplication, chunking, backup, index partitioning, fingerprint lookup.
Jaehong Min, Youjip Won, "Efficient Deduplication Techniques for Modern Backup Operation", IEEE Transactions on Computers, vol.60, no. 6, pp. 824-840, June 2011, doi:10.1109/TC.2010.263