Reliability for Networked Storage Nodes
May/June 2011 (vol. 8 no. 3)
pp. 404-418
KK Rao, IBM Almaden Research Center, San Jose, CA
James Lee Hafner, IBM Almaden Research Center, San Jose, CA
Richard A. Golding, IBM Almaden Research Center, San Jose, CA
High-end enterprise storage has traditionally consisted of monolithic systems with customized hardware, multiple redundant components and paths, and no single point of failure. Distributed storage systems built from networked storage nodes offer several advantages over monolithic systems, such as lower cost and increased scalability. To achieve the reliability goals associated with enterprise-class storage systems, redundancy must be distributed across the collection of nodes so that the system tolerates both node and drive failures. In this paper, we present alternatives for distributing this redundancy, along with models to determine the reliability of such systems. We specify a reliability target and determine the configurations that meet this target. Further, we perform sensitivity analyses in which selected parameters are varied to observe their effect on reliability.

Index Terms:
Mass storage, fault tolerance, modeling techniques, redundant design, distributed systems.
KK Rao, James Lee Hafner, Richard A. Golding, "Reliability for Networked Storage Nodes," IEEE Transactions on Dependable and Secure Computing, vol. 8, no. 3, pp. 404-418, May-June 2011, doi:10.1109/TDSC.2010.21