Issue No. 6, June 2008 (vol. 19), pp. 750-763
ABSTRACT
An efficient and distributed scheme for file mapping or file lookup is critical to decentralizing metadata management across a group of metadata servers. This paper presents a novel technique called Hierarchical Bloom filter Arrays (HBA) to map filenames to the metadata servers holding their metadata. Two levels of probabilistic arrays, namely, Bloom filter arrays, with different levels of accuracy, are used on each metadata server. One array, with lower accuracy, represents the distribution of the entire metadata and trades accuracy for significantly reduced memory overhead, while the other, with higher accuracy, caches partial distribution information and exploits the temporal locality of file access patterns. Both arrays are replicated to all metadata servers to support fast local lookups. We evaluate HBA through extensive trace-driven simulations and an implementation in Linux. Simulation results show the HBA design to be highly effective and efficient in improving the performance and scalability of file systems in clusters with 1,000 to 10,000 nodes (or superclusters) and with petabytes of data or more. Our implementation indicates that HBA can reduce the metadata operation time of a single-metadata-server architecture by a factor of up to 43.9 when the system is configured with 16 metadata servers.
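
To make the two-level scheme concrete, the following Python sketch shows how such a lookup might be structured. It is a minimal illustration, not the authors' implementation: the names (BloomFilter, HBALookup), the filter sizes, and the hash counts are assumptions chosen for readability. Each metadata server holds one Bloom filter per server at each level, probes the small high-accuracy cache level first, and falls back to the coarse level that summarizes the entire namespace.

    import hashlib

    class BloomFilter:
        # A basic Bloom filter: an m-bit array probed by k hash functions.
        def __init__(self, num_bits, num_hashes):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray((num_bits + 7) // 8)

        def _positions(self, key):
            # Derive k bit positions from salted SHA-256 digests of the key.
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.num_bits

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def may_contain(self, key):
            # May return false positives, never false negatives.
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(key))

    class HBALookup:
        # Two arrays of Bloom filters, one filter per metadata server
        # (MDS), replicated on every MDS so lookups stay local.
        def __init__(self, num_servers):
            # Level 1: small, high-accuracy filters covering recently
            # accessed files (exploits temporal locality).
            self.level1 = [BloomFilter(num_bits=1 << 16, num_hashes=8)
                           for _ in range(num_servers)]
            # Level 2: larger but lower-accuracy filters summarizing the
            # entire namespace (trades accuracy for memory overhead).
            self.level2 = [BloomFilter(num_bits=1 << 20, num_hashes=4)
                           for _ in range(num_servers)]

        def lookup(self, filename):
            # Return candidate MDS ids, probing the accurate level first.
            hits = [i for i, bf in enumerate(self.level1)
                    if bf.may_contain(filename)]
            if hits:
                return hits
            return [i for i, bf in enumerate(self.level2)
                    if bf.may_contain(filename)]

Because Bloom filters admit false positives, a real system must still verify a candidate server and handle the case where both levels miss or return a wrong hit with a fallback query; that path, along with filter updates and replication, is omitted from this sketch.
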
INDEX TERMS
Distributed file systems, Distributed systems, Parallel systems, Storage management, File systems management
CITATION
Yifeng Zhu, Hong Jiang, Jun Wang, Feng Xian, "HBA: Distributed Metadata Management for Large Cluster-Based Storage Systems," IEEE Transactions on Parallel & Distributed Systems, vol. 19, no. 6, pp. 750-763, June 2008, doi:10.1109/TPDS.2007.70788
REFERENCES
[1] P.J. Braam, “Lustre White Paper,” http://www.lustre.org/docs/whitepaper.pdf, 2005.
[2] S.A. Brandt, L. Xue, E.L. Miller, and D.D.E. Long, “Efficient Metadata Management in Large Distributed File Systems,” Proc. 20th IEEE Mass Storage Symp./11th NASA Goddard Conf. Mass Storage Systems and Technologies (MSS/MSST '03), pp. 290-298, Apr. 2003.
[3] P.H. Carns, W.B. Ligon III, R.B. Ross, and R. Thakur, “PVFS: A Parallel File System for Linux Clusters,” Proc. Fourth Ann. Linux Showcase and Conf., pp. 317-327, 2000.
[4] S. Ghemawat, H. Gobioff, and S.-T. Leung, “The Google File System,” Proc. 19th ACM Symp. Operating Systems Principles (SOSP'03), pp. 29-43, 2003.
[5] H. Tang and T. Yang, “An Efficient Data Location Protocol for Self-Organizing Storage Clusters,” Proc. ACM/IEEE Conf. Supercomputing (SC '03), p. 53, Nov. 2003.
[6] Y. Zhu and H. Jiang, “CEFT: A Cost-Effective, Fault-Tolerant Parallel Virtual File System,” J. Parallel and Distributed Computing, vol. 66, no. 2, pp. 291-306, Feb. 2006.
[7] N.J. Boden, D. Cohen, R.E. Felderman, A.E. Kulawik, C.L. Seitz, J.N. Seizovic, and W.-K. Su, “Myrinet: A Gigabit-per-Second Local Area Network,” IEEE Micro, vol. 15, no. 1, pp. 29-36, 1995.
[8] D.H. Carrere, “Linux Local and Wide Area Network Adapter Support,” Int'l J. Network Management, vol. 10, no. 2, pp. 103-112, 2000.
[9] C. Eddington, “InfiniBridge: An InfiniBand Channel Adapter with Integrated Switch,” IEEE Micro, vol. 22, no. 2, pp. 48-56, 2002.
[10] Y. Zhu, H. Jiang, X. Qin, D. Feng, and D. Swanson, “Exploiting Redundancy to Boost Performance in a RAID-10 Style Cluster-Based File System,” Cluster Computing: The J. Networks, Software Tools and Applications, vol. 9, no. 4, pp. 433-447, Oct. 2006.
[11] M. Vilayannur, A. Sivasubramaniam, M. Kandemir, R. Thakur, and R. Ross, “Discretionary Caching for I/O on Clusters,” Proc. Third IEEE/ACM Int'l Symp. Cluster Computing and the Grid (CCGRID '03), pp. 96-103, May 2003.
[12] W.B. Ligon III and R.B. Ross, “Server-Side Scheduling in Cluster Parallel I/O Systems,” Calculateurs Paralleles, special issue on parallel I/O for cluster computing, Oct. 2001.
[13] Y. Zhu, H. Jiang, X. Qin, D. Feng, and D. Swanson, “Scheduling for Improved Write Performance in a Cost-Effective, Fault-Tolerant Parallel Virtual File System (CEFT-PVFS),” Proc. Fourth LCI Int'l Conf. Linux Clusters, June 2003.
[14] J. Wu, P. Wyckoff, and D. Pandac, “PVFS over InfiniBand: Design and Performance Evaluation,” Proc. Int'l Conf. Parallel Processing (ICPP '03), pp. 125-132, Oct. 2003.
[15] D. Roselli, J.R. Lorch, and T.E. Anderson, “A Comparison of File System Workloads,” Proc. Ann. Usenix Technical Conf., June 2000.
[16] Y. Zhu, H. Jiang, and J. Wang, “Hierarchical Bloom Filter Arrays (HBA): A Novel, Scalable Metadata Management System for Large Cluster-Based Storage,” Proc. IEEE Int'l Conf. Cluster Computing (CLUSTER '04), pp. 165-174, Sept. 2004.
[17] Y. Zhu, H. Jiang, X. Qin, and D. Swanson, “A Case Study of Parallel I/O for Biological Sequence Search on Linux Clusters,” Proc. IEEE Int'l Conf. Cluster Computing (CLUSTER '03), pp. 308-315, Dec. 2003.
[18] F. Schmuck and R. Haskin, “GPFS: A Shared-Disk File System for Large Computing Clusters,” Proc. First Usenix Conf. File and Storage Technologies (FAST '02), pp. 231-244, Jan. 2002.
[19] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, C. Wells, and B. Zhao, “Oceanstore: An Architecture for Global-Scale Persistent Storage,” Proc. Ninth Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS '00), pp. 190-201, 2000.
[20] D. Nagle, D. Serenyi, and A. Matthews, “The Panasas ActiveScale Storage Cluster: Delivering Scalable High Bandwidth Storage,” Proc. ACM/IEEE Conf. Supercomputing (SC '04), p. 53, 2004.
[21] T. Anderson, M. Dahlin, J. Neefe, D. Patterson, D. Roselli, and R. Wang, “Serverless Network File Systems,” Proc. 15th ACM Symp. Operating System Principles (SOSP '95), pp. 109-126, Dec. 1995.
[22] B. Pawlowski, C. Juszczak, P. Staubach, C. Smith, D. Lebel, and D. Hitz, “NFS Version 3: Design and Implementation,” Proc. Usenix Summer Technical Conf., pp. 137-151, 1994.
[23] J.H. Morris, M. Satyanarayanan, M.H. Conner, J.H. Howard, D.S. Rosenthal, and F.D. Smith, “Andrew: A Distributed Personal Computing Environment,” Comm. ACM, vol. 29, no. 3, pp. 184-201, 1986.
[24] M. Satyanarayanan, J.J. Kistler, P. Kumar, M.E. Okasaki, E.H. Siegel, and D.C. Steere, “Coda: A Highly Available File System for Distributed Workstation Environments,” IEEE Trans. Computers, vol. 39, no. 4, Apr. 1990.
[25] V. Cate and T. Gross, “Combining the Concepts of Compression and Caching for a Two-Level File System,” Proc. Fourth Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS '91), pp. 200-211, Apr. 1991.
[26] R. Floyd, “Short-Term File Reference Patterns in a Unix Environment,” Technical Report TR-177, Computer Science Dept., Univ. of Rochester, Mar. 1986.
[27] E. Riedel, M. Kallahalla, and R. Swaminathan, “A Framework for Evaluating Storage System Security,” Proc. First Usenix Conf. File and Storage Technology (FAST '02), pp. 15-30, Mar. 2002.
[28] C.H. Staelin, “High Performance File System Design,” PhD dissertation, Dept. Computer Science, Princeton Univ., Oct. 1991.
[29] S.A. Weil, K.T. Pollack, S.A. Brandt, and E.L. Miller, “Dynamic Metadata Management for Petabyte-Scale File Systems,” Proc. ACM/IEEE Conf. Supercomputing (SC '04), p. 4, 2004.
[30] P.F. Corbett and D.G. Feitelson, “The Vesta Parallel File System,” ACM Trans. Computer Systems, vol. 14, no. 3, pp. 225-264, 1996.
[31] P.J. Braam and P.A. Nelson, “Removing Bottlenecks in Distributed Filesystems: Coda and InterMezzo as Examples,” Proc. Linux Expo, May 1999.
[32] T.E. Anderson, M.D. Dahlin, J.M. Neefe, D.A. Patterson, D.S. Roselli, and R.Y. Wang, “Serverless Network File Systems,” ACM Trans. Computer Systems, vol. 14, no. 1, pp. 41-79, 1996.
[33] M.N. Nelson, B.B. Welch, and J.K. Ousterhout, “Caching in the Sprite Network File System,” ACM Trans. Computer Systems, vol. 6, no. 1, pp. 134-154, 1988.
[34] M.K. McKusick, W.N. Joy, S.J. Leffler, and R.S. Fabry, “A Fast File System for Unix,” ACM Trans. Computer Systems, vol. 2, no. 3, pp. 181-197, 1984.
[35] M. Rosenblum and J.K. Ousterhout, “The Design and Implementation of a Log-Structured File System,” Proc. 13th ACM Symp. Operating Systems Principles (SOSP '91), pp. 1-15, 1991.
[36] B.H. Bloom, “Space/Time Trade-Offs in Hash Coding with Allowable Errors,” Comm. ACM, vol. 13, no. 7, pp. 422-426, 1970.
[37] L. Fan, P. Cao, J. Almeida, and A.Z. Broder, “Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol,” IEEE/ACM Trans. Networking, vol. 8, no. 3, pp. 281-293, 2000.
[38] A. Broder and M. Mitzenmacher, “Network Applications of Bloom Filters: A Survey,” Proc. 40th Ann. Allerton Conf. Comm., Control and Computing, Oct. 2002.
[39] S. Dharmapurikar, P. Krishnamurthy, and D.E. Taylor, “Longest Prefix Matching Using Bloom Filters,” Proc. ACM SIGCOMM '03, pp. 201-212, Aug. 2003.
[40] J.K. Mullin, “A Second Look at Bloom Filters,” Comm. ACM, vol. 26, no. 8, pp. 570-571, 1983.
[41] M.V. Ramakrishna, “Practical Performance of Bloom Filters and Parallel Free-Text Searching,” Comm. ACM, vol. 32, no. 10, pp. 1237-1239, 1989.
[42] F. Wang, Q. Xin, B. Hong, S.A. Brandt, E.L. Miller, D.D.E. Long, and T.T. McLarty, “File System Workload Analysis for Large Scale Scientific Computing Applications,” Proc. 21st IEEE Mass Storage Symp./12th NASA Goddard Conf. Mass Storage Systems and Technologies (MSS/MSST '04), Apr. 2004.
[43] M.P. Mesnier, M. Wachs, R.R. Sambasivan, J. Lopez, J. Hendricks, G.R. Ganger, and D. O'Hallaron, “//TRACE: Parallel Trace Replay with Approximate Causal Events,” Proc. Fifth Usenix Conf. File and Storage Technologies (FAST '07), pp. 153-167, Feb. 2007.
[44] C. Perkins, “IP Encapsulation within IP,” IETF RFC 2003, Oct. 1996.
[45] L. Aversa and A. Bestavros, “Load Balancing a Cluster of Web Servers Using Distributed Packet Rewriting,” technical report, 1999.
[46] W.W. Hsu and A.J. Smith, “The Performance Impact of I/O Optimizations and Disk Improvements,” IBM J. Research and Development, vol. 48, no. 2, pp. 255-289, 2004.
[47] C. Ruemmler and J. Wilkes, “Unix Disk Access Patterns,” Proc. Usenix Winter Technical Conf., pp. 405-420, 1993.
[48] XFS: A High-Performance Journaling Filesystem, http://oss.sgi.com/projects/xfs/, Feb. 2007.
[49] Red Hat Global File System, http://www.redhat.com/software/rha/gfs/, Feb. 2007.
[50] P. Gu, Y. Zhu, H. Jiang, and J. Wang, “Nexus: A Novel Weighted-Graph-Based Group Prefetching Algorithm for Metadata Servers in Petabyte Scale Storage Systems,” Proc. Sixth IEEE Int'l Symp. Cluster Computing and the Grid (CCGRID '06), pp. 409-416, May 2006.