Compiler-Directed Collective-I/O
December 2001 (vol. 12 no. 12)
pp. 1318-1331

Current approaches to parallel I/O demand extensive user effort to obtain acceptable performance. This is due in part to the difficulty of understanding the characteristics of a wide variety of I/O devices and in part to the inherent complexity of I/O software. While parallel I/O systems provide users with environments where persistent data sets can be shared between parallel processors, the ultimate performance of I/O-intensive codes depends largely on the relation between the data access patterns exhibited by parallel processors and the storage patterns of data in files and on disks. In cases where access patterns and storage patterns match, we can exploit parallel I/O hardware by allowing each processor to perform independent parallel I/O. To maintain acceptable performance when data access patterns and storage patterns do not match, several I/O optimization techniques have been developed in recent years. Collective I/O is one such technique: it enables each processor to perform I/O on behalf of other processors if doing so improves the overall performance. While it is generally accepted that collective I/O and its variants can bring impressive improvements in I/O performance, it is difficult for the programmer to use collective I/O in an optimal manner. In this paper, we propose and evaluate a compiler-directed collective I/O approach that detects opportunities for collective I/O and automatically inserts the necessary I/O calls in the code. An important characteristic of the approach is that, instead of applying collective I/O indiscriminately, it uses collective I/O selectively, only in cases where independent parallel I/O would not be possible or would lead to an excessive number of I/O calls. The approach involves compiler-directed access pattern and storage pattern detection schemes that work in a multiple-application environment.
We have implemented the necessary algorithms in a source-to-source translator and within a stand-alone tool. Our experimental results on an SGI/Cray Origin 2000 multiprocessor machine demonstrate that our compiler-directed collective I/O scheme performs very well on different setups built using nine applications from several scientific benchmarks. We have also observed that the I/O performance of our approach is only 5.23 percent worse than an optimal scheme.
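
The selective policy described above can be illustrated with a toy model. The following Python sketch is not the paper's algorithm or implementation; it is a minimal illustration under stated assumptions: a square n x n array stored in row-major order in a single file, one I/O call per contiguous run of file offsets, and a hypothetical threshold `max_calls` above which independent I/O is considered too expensive. The names `contiguous_runs`, `file_offsets`, and `choose_io` are invented for this example.

```python
def contiguous_runs(offsets):
    """Count maximal runs of consecutive file offsets.

    Assumption: each contiguous run costs one independent I/O call.
    """
    runs = 0
    prev = None
    for off in sorted(offsets):
        if prev is None or off != prev + 1:
            runs += 1
        prev = off
    return runs


def file_offsets(rows, cols, n_cols):
    """Row-major file offsets of the elements at rows x cols."""
    return [r * n_cols + c for r in rows for c in cols]


def choose_io(n, procs, access, max_calls=1):
    """Pick an I/O strategy for an n x n array stored row-major.

    access is "row_block" or "column_block": how the array is divided
    among processors. max_calls is a hypothetical per-processor budget
    of independent I/O calls. Returns (strategy, worst_case_calls).
    """
    block = n // procs
    worst = 0
    for p in range(procs):
        mine = range(p * block, (p + 1) * block)
        if access == "row_block":
            offs = file_offsets(mine, range(n), n)
        elif access == "column_block":
            offs = file_offsets(range(n), mine, n)
        else:
            raise ValueError("unknown access pattern: " + access)
        worst = max(worst, contiguous_runs(offs))
    strategy = "independent" if worst <= max_calls else "collective"
    return strategy, worst


if __name__ == "__main__":
    # Access pattern matches the storage pattern: one call per processor.
    print(choose_io(8, 4, "row_block"))
    # Access pattern conflicts with the storage pattern: many small calls.
    print(choose_io(8, 4, "column_block"))
```

In this model, row-block access to a row-major file needs a single contiguous read per processor, so independent I/O suffices. Column-block access breaks each processor's data into one small run per row, which is the kind of mismatch where letting processors read contiguous chunks on behalf of one another, then exchanging data in memory, can reduce the number of I/O calls.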

Index Terms:
Optimizing compilers, parallel I/O, collective I/O, data-intensive applications, file layouts
M. Kandemir, "Compiler-Directed Collective-I/O," IEEE Transactions on Parallel and Distributed Systems, vol. 12, no. 12, pp. 1318-1331, Dec. 2001, doi:10.1109/71.970566