Issue No.03 - March (2009 vol.58)
pp: 289-299
Jon Elerath, Network Appliance, Sunnyvale
Michael Pecht, University of Maryland, College Park
Abstract - The statistical bases for current models of RAID reliability are reviewed, and a highly accurate alternative is provided and justified. This new model corrects statistical errors associated with the pervasive assumption that system (RAID group) times to failure follow a homogeneous Poisson process, and corrects errors associated with assuming that times to failure and times to restore are exponentially distributed. Statistical justification for the new model uses theory for the reliability of repairable systems. Four critical component distributions are developed from field data: times to catastrophic failure, times to reconstruction and restoration, times to read errors, and times to disk data scrubs. Model results have been verified and predict between 2 and 1,500 times as many double disk failures as estimates made using the mean time to data loss (MTTDL) method. Model results are compared to system-level field data for RAID groups of 14 drives and show excellent correlation and greater accuracy than MTTDL estimates.
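For orientation, the MTTDL baseline that the abstract argues against can be sketched in a few lines. This is not the authors' model: it is the textbook approximation for a single-parity RAID group, which assumes constant (exponential) failure and repair rates, exactly the assumptions the abstract criticizes. The function names and the MTTF, MTTR, and fleet-size values below are hypothetical, chosen only for illustration; the group size of 14 drives is taken from the abstract.

```python
HOURS_PER_YEAR = 8760


def mttdl_single_parity(n_drives, mttf_hours, mttr_hours):
    """Textbook MTTDL approximation for one single-parity RAID group.

    Assumes exponentially distributed times to failure and to restore,
    with mttr_hours much smaller than mttf_hours:
        MTTDL ~= MTTF^2 / (N * (N - 1) * MTTR)
    """
    if n_drives < 2:
        raise ValueError("a redundant group needs at least two drives")
    return mttf_hours ** 2 / (n_drives * (n_drives - 1) * mttr_hours)


def expected_double_failures(n_groups, years, mttdl_hours):
    """Expected double disk failures for a fleet, treating data-loss
    events as a homogeneous Poisson process with rate 1 / MTTDL."""
    return n_groups * years * HOURS_PER_YEAR / mttdl_hours


# Hypothetical inputs: 14-drive groups, 1,000,000-hour MTTF, 24-hour restore.
mttdl = mttdl_single_parity(n_drives=14, mttf_hours=1_000_000, mttr_hours=24)
events = expected_double_failures(n_groups=1000, years=10, mttdl_hours=mttdl)
print(f"MTTDL: {mttdl:.3e} hours")
print(f"Expected double disk failures (1000 groups, 10 years): {events:.2f}")
```

The abstract's claim is that a model built on field-derived distributions, including latent read errors and scrubbing, predicts 2 to 1,500 times as many double disk failures as a calculation of this kind.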
Index Terms - Hardware reliability, redundant design, reliability, testing, and fault-tolerance
Jon Elerath, Michael Pecht, "A Highly Accurate Method for Assessing Reliability of Redundant Arrays of Inexpensive Disks (RAID)," IEEE Transactions on Computers, vol. 58, no. 3, pp. 289-299, March 2009, doi:10.1109/TC.2008.163