A High-Performance Application Data Environment for Large-Scale Scientific Computations
December 2003 (vol. 14 no. 12)
pp. 1262-1274

Abstract: Effective high-level data management is becoming increasingly important as more scientific applications manipulate huge amounts of secondary-storage and tertiary-storage data using parallel processors. A major shortcoming of current solutions is that they either require a deep understanding of specific data storage architectures and file layouts to obtain the best performance (as in high-performance storage management systems and parallel file systems), or sacrifice significant performance in exchange for ease of use and portability (as in traditional database management systems). In this paper, we discuss the design, implementation, and evaluation of a novel application development environment for scientific computations. This environment includes a number of components that let programmers code and run their applications with little programming effort while still harnessing the computational and storage power available on parallel architectures.

Index Terms:
Data intensive computing, access pattern, storage pattern, MDMS.
Xiaohui Shen, Wei-keng Liao, Alok Choudhary, Gokhan Memik, Mahmut Kandemir, "A High-Performance Application Data Environment for Large-Scale Scientific Computations," IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 12, pp. 1262-1274, Dec. 2003, doi:10.1109/TPDS.2003.1255638