Issue No. 6 - June 2011 (vol. 22), pp. 946-959
Florin Isaila , University Carlos III of Madrid, Leganés
Javier Garcia Blas , University Carlos III of Madrid, Leganés
Jesus Carretero , University Carlos III of Madrid, Leganés
Robert Latham , Argonne National Laboratory, Argonne
Robert Ross , Argonne National Laboratory, Argonne
Parallel applications currently suffer from a significant imbalance between computational power and available I/O bandwidth. Additionally, the hierarchical organization of current Petascale systems increases the latency of the I/O subsystem: file accesses pipeline data through several networks with increasing latencies and a higher probability of congestion. Future Exascale systems are likely to share this trait. This paper presents a scalable parallel I/O software system designed to transparently hide the latency of file system accesses from applications on these platforms. Our solution exploits the hierarchy of networks involved in file accesses to maximize the degree of overlap between computation, file I/O-related communication, and file system access. We describe and evaluate a two-level hierarchy for Blue Gene systems consisting of client-side and I/O-node-side caching. Our file cache management modules coordinate the staging of data between application and storage through the Blue Gene networks. The experimental results demonstrate that our architecture achieves significant performance improvements through a high degree of overlap between computation, communication, and file I/O.
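The central latency-hiding idea of the abstract, buffering writes in a client-side cache and letting a background stager push them toward storage while the application keeps computing, can be illustrated with a toy sketch. This is not the paper's implementation: all names here (`WriteBackCache`, `backing_store`, `sync`) are invented for the example, a Python dict stands in for the file system, and a single worker thread stands in for the I/O-node staging path.

```python
import threading
import queue

class WriteBackCache:
    """Toy client-side write-back cache (illustrative only).

    write() buffers data and returns immediately; a background
    thread flushes buffered writes to the backing store, so
    computation overlaps with I/O, as in latency-hiding staging.
    """

    def __init__(self, backing_store):
        self.backing_store = backing_store  # dict standing in for a file system
        self._pending = queue.Queue()
        self._flusher = threading.Thread(target=self._flush_loop, daemon=True)
        self._flusher.start()

    def write(self, offset, data):
        # Returns immediately: the actual store happens asynchronously.
        self._pending.put((offset, data))

    def _flush_loop(self):
        while True:
            offset, data = self._pending.get()
            self.backing_store[offset] = data  # stand-in for a file-system write
            self._pending.task_done()

    def sync(self):
        # Block until all buffered writes reach the backing store,
        # loosely analogous to a sync/close barrier in MPI-IO.
        self._pending.join()

store = {}
cache = WriteBackCache(store)
for i in range(4):
    cache.write(i * 4096, f"block-{i}")  # returns at once; flush overlaps compute
cache.sync()
print(sorted(store.items()))
```

In the paper's architecture this role is split across two cache levels (compute node and I/O node), each staging data one hop further toward the file system; the sketch collapses both into one queue for brevity.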
MPI-IO, parallel I/O, parallel file systems, supercomputers.
Florin Isaila, Javier Garcia Blas, Jesus Carretero, Robert Latham, Robert Ross, "Design and Evaluation of Multiple-Level Data Staging for Blue Gene Systems", IEEE Transactions on Parallel & Distributed Systems, vol.22, no. 6, pp. 946-959, June 2011, doi:10.1109/TPDS.2010.127
[1] International Exascale Software Project, http:/, 2010.
[2] H. Abbasi, M. Wolf, G. Eisenhauer, S. Klasky, K. Schwan, and F. Zheng, “Datastager: Scalable Data Staging Services for Petascale Applications,” Proc. 18th ACM Int'l Symp. High Performance Distributed Computing (HPDC '09), pp. 39-48, 2009.
[3] N. Ali, P. Carns, K. Iskra, D. Kimpe, S. Lang, R. Latham, R. Ross, L. Ward, and P. Sadayappan, “Scalable I/O Forwarding Framework for High-Performance Computing Systems,” Proc. IEEE Conf. Cluster Computing, Sept. 2009.
[4] J.G. Blas, F. Isaila, J. Carretero, R. Latham, and R.B. Ross, “Multiple-Level MPI File Write-Back and Prefetching for Blue Gene Systems,” Proc. European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, pp. 164-173, 2009.
[5] J.G. Blas, F. Isaila, D.E. Singh, and J. Carretero, “View-Based Collective I/O for MPI-IO,” Proc. IEEE Int'l Symp. Cluster Computing and the Grid (CCGRID), pp. 409-416, 2008.
[6] S. Byna, Y. Chen, X.-H. Sun, R. Thakur, and W. Gropp, “Parallel I/O Prefetching Using MPI File Caching and I/O Signatures,” Proc. ACM/IEEE Conf. Supercomputing (SC '08), pp. 1-12, 2008.
[7] F. Chang and G. Gibson, “Automatic I/O Hint Generation through Speculative Execution,” Proc. Symp. Operating Systems Design and Implementation (OSDI), 1999.
[8] Y. Chen, S. Byna, X.-H. Sun, R. Thakur, and W. Gropp, “Hiding I/O Latency with Pre-Execution Prefetching for Parallel Applications,” Proc. ACM/IEEE Conf. Supercomputing (SC '08), pp. 1-10, 2008.
[9] J. del Rosario, R. Bordawekar, and A. Choudhary, “Improved Parallel I/O via a Two-Phase Run-Time Access Strategy,” Proc. IPPS Workshop Input/Output in Parallel Computer Systems, 1993.
[10] F. Isaila, J. Garcia Blas, J. Carretero, R. Latham, S. Lang, and R. Ross, “Latency Hiding File I/O for Blue Gene Systems,” Proc. IEEE Int'l Symp. Cluster Computing and the Grid (CCGRID) '09, 2009.
[11] FUSE Homepage, http:/, 2009.
[12] ZeptoOS Project, http://www-unix.mcs.anl.gov/zeptoos/, 2008.
[13] Top 500 List, http:/, 2010.
[14] The Portable Operating System Interface, http://www.unix-systems.org/, 1995.
[15] F. Isaila, G. Malpohl, V. Olaru, G. Szeder, and W. Tichy, “Integrating Collective I/O and Cooperative Caching into the ‘Clusterfile’ Parallel File System,” Proc. ACM Int'l Conf. Supercomputing (ICS), pp. 315-324, 2004.
[16] K. Iskra, J.W. Romein, K. Yoshii, and P. Beckman, “ZOID: I/O-Forwarding Infrastructure for Petascale Architectures,” Proc. Symp. Principles and Practice of Parallel Programming (PPoPP '08), pp. 153-162, 2008.
[17] M. Kallahalla and P. Varman, “PC-OPT: Optimal Offline Prefetching and Caching for Parallel I/O Systems,” IEEE Trans. Computers, vol. 51, no. 11, pp. 1333-1344, Nov. 2002.
[18] D. Kotz, “Disk-Directed I/O for MIMD Multiprocessors,” Proc. First USENIX Symp. Operating Systems Design and Implementation, 1994.
[19] W.K. Liao, K. Coloma, A. Choudhary, L. Ward, E. Russel, and S. Tideman, “Collective Caching: Application-Aware Client-Side File Caching,” Proc. 14th Int'l Symp. High Performance Distributed Computing (HPDC), July 2005.
[20] W.K. Liao, K. Coloma, A.N. Choudhary, and L. Ward, “Cooperative Write-Behind Data Buffering for MPI I/O,” Proc. European Parallel Virtual Machine and Message Passing Interface Conf. (PVM/MPI), pp. 102-109, 2005.
[21] W. Ligon and R. Ross, “An Overview of the Parallel Virtual File System,” Proc. Extreme Linux Workshop, June 1999.
[22] X. Ma, M. Winslett, J. Lee, and S. Yu, “Improving MPI-IO Output Performance with Active Buffering Plus Threads,” Proc. Int'l Symp. Parallel and Distributed Processing (IPDPS), pp. 22-26, 2003.
[23] J. Moreira et al., “Designing a Highly-Scalable Operating System: The Blue Gene/L Story,” Proc. ACM/IEEE Conf. Supercomputing (SC '06), 2006.
[24] N. Nieuwejaar, D. Kotz, A. Purakayastha, C. Ellis, and M. Best, “File Access Characteristics of Parallel Scientific Workloads,” IEEE Trans. Parallel and Distributed Systems, vol. 7, no. 10, pp. 1075-1089, Oct. 1996.
[25] C.M. Patrick, S. Son, and M. Kandemir, “Comparative Evaluation of Overlap Strategies with Study of I/O Overlap in MPI-IO,” ACM SIGOPS Operating Systems Rev., vol. 42, pp. 43-49, 2008.
[26] R.H. Patterson, G.A. Gibson, E. Ginting, D. Stodolsky, and J. Zelenka, “Informed Prefetching and Caching,” SIGOPS Operating Systems Rev., vol. 29, no. 5, pp. 79-95, 1995.
[27] J.-P. Prost, R. Treumann, R. Hedges, B. Jia, and A. Koniges, “MPI-IO/GPFS, an Optimized Implementation of MPI-IO on Top of GPFS,” Proc. 2001 ACM/IEEE Conf. Supercomputing (SC '01), p. 17, 2001.
[28] I. Raicu, Z. Zhang, M. Wilde, I. Foster, P. Beckman, K. Iskra, and B. Clifford, “Toward Loosely Coupled Programming on Petascale Systems,” Proc. 2008 ACM/IEEE Conf. Supercomputing (SC '08), pp. 1-12, 2008.
[29] P.C. Roth, “Characterizing the I/O Behavior of Scientific Applications on the Cray XT,” Proc. Second Int'l Workshop Petascale Data Storage (PDSW '07), pp. 50-55, 2007.
[30] H. Yu, R.K. Sahoo et al., “High Performance File I/O for the Blue Gene/L Supercomputer,” Proc. Int'l Symp. High-Performance Computer Architecture (HPCA), pp. 187-196, 2006.
[31] H. Shan, K. Antypas, and J. Shalf, “Characterizing and Predicting the I/O Performance of HPC Applications Using a Parameterized Synthetic Benchmark,” Proc. 2008 ACM/IEEE Conf. Supercomputing (SC '08), pp. 1-12, 2008.
[32] R. Thakur, W. Gropp, and E. Lusk, “Data Sieving and Collective I/O in ROMIO,” Proc. Seventh Symp. Frontiers of Massively Parallel Computation, pp. 182-189, Feb. 1999.
[33] M. Wilde, I. Foster, K. Iskra, P. Beckman, Z. Zhang, A. Espinosa, M. Hategan, B. Clifford, and I. Raicu, “Parallel Scripting for Applications at the Petascale and Beyond,” Computer, vol. 42, no. 11, pp. 50-60, Nov. 2009.
[34] P. Wong and R.F. Van der Wijngaart, “NAS Parallel Benchmarks I/O Version 2.4,” technical report, NASA Ames Research Center, 2003.
[35] W. Yu and J. Vetter, “ParColl: Partitioned Collective I/O on the Cray XT,” Proc. Int'l Conf. Parallel Processing (ICPP), pp. 562-569, 2008.
[36] W. Yu, J.S. Vetter, and R.S. Canon, “OPAL: An Open-Source MPI-IO Library over Cray XT,” Proc. Int'l Workshop Storage Network Architecture and Parallel I/Os (SNAPI '07), pp. 41-46, 2007.