Preventing Silent Data Corruptions from Propagating During Data Reconstruction
December 2010 (vol. 59 no. 12)
pp. 1611-1624
Mingqiang Li, Tsinghua University, Beijing
Jiwu Shu, Tsinghua University, Beijing
One recent technical challenge facing the designers of erasure-coded storage systems is how to prevent silent data corruptions from propagating during data reconstruction. This paper proposes a new technique that exploits erasure-coded storage systems to cope with silent data corruptions during data reconstruction. To develop a data reconstruction method that prevents silent data corruptions from propagating, we first define the consistency of a strip group and then study how silent data corruptions affect that consistency. Based on the conclusions of this study, we develop an efficient adaptive method for data reconstruction in the presence of silent data corruptions. We then analyze the performance of the new method probabilistically; the results show that its overall performance impact is negligible in practical systems. We also compare techniques for coping with silent data corruptions in erasure-coded storage systems. The comparison shows that the technique based on our data reconstruction method is a better choice for coping with silent data corruptions when periodic validation is used in an erasure-coded storage system.
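
To make the problem concrete, the following minimal sketch shows how a silent corruption can propagate during reconstruction. It assumes a single XOR parity strip per strip group, which is far simpler than the general erasure codes the paper addresses. The function names (`xor_strips`, `is_consistent`, `reconstruct`) and the parity-equation notion of consistency are illustrative assumptions, not the paper's definitions or its adaptive method.

```python
from functools import reduce


def xor_strips(strips):
    """Byte-wise XOR of equal-length strips."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))


def is_consistent(group):
    """Assumed notion of consistency: the XOR of all strips
    (data plus parity) is all zeros."""
    return not any(xor_strips(group))


def reconstruct(survivors):
    """Rebuild the single missing strip from all surviving strips."""
    return xor_strips(survivors)


# A strip group with three data strips and one XOR parity strip.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_strips(data)
healthy_group = data + [parity]

# One strip silently changes on disk; no error is reported by the device.
corrupted_strip = b"CCCX"
corrupted_group = [data[0], data[1], corrupted_strip, parity]

# While every strip is readable, the parity equation exposes the corruption.
healthy_ok = is_consistent(healthy_group)
corrupted_ok = is_consistent(corrupted_group)

# Now the disk holding data[1] fails. Reconstruction from clean survivors
# is correct, but reconstruction that includes the corrupted strip silently
# produces wrong data: the corruption has propagated.
rebuilt_clean = reconstruct([data[0], data[2], parity])
rebuilt_bad = reconstruct([data[0], corrupted_strip, parity])
```

With only one parity strip, the survivors of a failure carry no spare redundancy, so the bad rebuild cannot be detected after the fact. Codes that tolerate more failures leave some redundancy among the survivors, which is what makes a consistency check before reconstruction possible in principle.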

Index Terms:
Data reconstruction, erasure code, silent data corruption, storage system.
Citation:
Mingqiang Li, Jiwu Shu, "Preventing Silent Data Corruptions from Propagating During Data Reconstruction," IEEE Transactions on Computers, vol. 59, no. 12, pp. 1611-1624, Dec. 2010, doi:10.1109/TC.2010.36