Issue No.10 - October (2010 vol.59)
pp: 1350-1362
Mingqiang Li , Tsinghua University, Beijing
Jiwu Shu , Tsinghua University, Beijing
ABSTRACT
Large-scale erasure-coded storage systems suffer from a serious performance problem: the read-modify-write operations involved in small writes cause both I/O congestion and disk media access congestion. Existing techniques based on the conventional disk provide only limited performance improvement. This paper presents a new Disk Architecture with Composite Operation (DACO), whose disk media access interface consists of three kinds of operations: READ, WRITE, and Composite Operation (CO). The CO uses a sector-based pipeline to implement block-level data modify operations and can therefore replace the read-modify-write operations involved in small writes. When the DACO is adopted in a large-scale erasure-coded storage system with fault tolerance t, each small-write operation saves t I/Os and t disk media access operations. This fundamentally alleviates both I/O congestion and disk media access congestion and can thus remarkably improve the performance of large-scale erasure-coded storage systems. A simulation study shows that the DACO provides significant performance improvement, reducing the average I/O response time by up to 31.16 percent even in the worst case, where t = 1. This paper also discusses important implementation issues of the DACO and investigates the additional cost it involves.
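To make the small-write arithmetic concrete, the following is a minimal sketch, not the authors' implementation. It assumes the simplest case of a single XOR parity block (t = 1) to show a read-modify-write parity update, and it assumes that a conventional small write costs 2(t + 1) I/Os (one read and one write on the data disk and on each of t parity disks). Only the saving of t I/Os per small write comes from the abstract; the function names and the 2(t + 1) baseline are illustrative assumptions.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def rmw_parity_update(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
    """Read-modify-write for one XOR parity block (assumed t = 1 case).

    new_parity = old_parity XOR old_data XOR new_data
    """
    delta = xor_blocks(old_data, new_data)
    return xor_blocks(old_parity, delta)


def conventional_small_write_ios(t: int) -> int:
    """Assumed baseline: a read and a write on the data disk and on each of t parity disks."""
    return 2 * (t + 1)


def daco_small_write_ios(t: int) -> int:
    """Baseline minus the t I/Os that the abstract says are saved per small write."""
    return conventional_small_write_ios(t) - t


if __name__ == "__main__":
    for t in (1, 2, 3):
        print(t, conventional_small_write_ios(t), daco_small_write_ios(t))
```

Under these assumptions, the relative saving is smallest at t = 1, which is consistent with the abstract calling t = 1 the worst case.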
INDEX TERMS
Disk architecture, erasure code, small-write problem, storage system.
CITATION
Mingqiang Li, Jiwu Shu, "DACO: A High-Performance Disk Architecture Designed Specially for Large-Scale Erasure-Coded Storage Systems," IEEE Transactions on Computers, vol. 59, no. 10, pp. 1350-1362, October 2010, doi:10.1109/TC.2010.22