Design and Evaluation of MPI File Domain Partitioning Methods under Extent-Based File Locking Protocol
February 2011 (vol. 22, no. 2)
pp. 260-272
Wei-keng Liao, Northwestern University, Evanston
MPI collective I/O has been an effective method for parallel shared-file access and for maintaining the canonical order of structured data in files. Its implementation commonly uses a two-phase I/O strategy that partitions a file into disjoint file domains, assigns each domain to a unique process, redistributes the I/O data based on their locations in the domains, and has each process perform I/O for its assigned domain. The quality of the partitioning determines the maximum performance achievable on the underlying file system, because shared-file I/O has long been impeded by the cost of the file system's data consistency control, particularly by conflicting locks. This paper proposes several file domain partitioning methods designed to reduce lock conflicts under the extent-based file locking protocol. Experiments with four I/O benchmarks on the IBM GPFS and Lustre parallel file systems show that the partitioning method producing the fewest lock conflicts achieves the highest performance. The benefit of removing conflicting locks can be so significant that write bandwidth differences of more than thirty times are observed between the best and worst methods.
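
For illustration only, the sketch below (not taken from the paper) shows the basic idea behind aligning file domain boundaries to lock extents: the aggregate access range of a collective write is divided among processes with domain boundaries rounded to an assumed lock-extent (file stripe) size, so that no two processes request locks on the same extent. The file_domain() helper, the 1 MiB stripe size, and the 10 MiB access range are all assumptions chosen for the example, not the paper's actual partitioning methods.

/* Minimal, illustrative sketch of stripe-aligned file domain partitioning.
 * The aggregate access range [start, end) is split among nproc processes,
 * with each domain boundary rounded to a multiple of the assumed
 * lock-extent (stripe) size, so no lock unit is shared by two processes. */
#include <stdio.h>

static void file_domain(long long start, long long end, int nproc, int rank,
                        long long stripe, long long *fd_start, long long *fd_end)
{
    long long nstripes = (end - start + stripe - 1) / stripe;  /* stripes covering the range */
    long long per_proc = (nstripes + nproc - 1) / nproc;       /* stripes per process, rounded up */

    *fd_start = start + (long long)rank * per_proc * stripe;
    *fd_end   = *fd_start + per_proc * stripe;
    if (*fd_end   > end) *fd_end   = end;                      /* last domain may be shorter */
    if (*fd_start > end) *fd_start = end;                      /* trailing ranks get empty domains */
}

int main(void)
{
    long long s, e;
    int nproc = 4;
    for (int r = 0; r < nproc; r++) {
        /* 10 MiB aggregate range, 1 MiB assumed lock-extent size */
        file_domain(0, 10 << 20, nproc, r, 1 << 20, &s, &e);
        printf("rank %d: [%lld, %lld)\n", r, s, e);
    }
    return 0;
}

Because every domain boundary falls on a stripe multiple, each process's lock requests cover extents that no other process touches; avoiding such shared lock extents is the lock-conflict reduction that the proposed partitioning methods target.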

Index Terms:
Parallel I/O, MPI-IO, parallel file system, file locking, Lustre, GPFS.
Citation:
Wei-keng Liao, "Design and Evaluation of MPI File Domain Partitioning Methods under Extent-Based File Locking Protocol," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 2, pp. 260-272, Feb. 2011, doi:10.1109/TPDS.2010.74