Issue No. 01 - Jan. 2013 (vol. 24)
pp: 158-169
Saba Sehrish , University of Central Florida, Orlando
Grant Mackey , University of Central Florida, Orlando
Pengju Shang , University of Central Florida, Orlando
Jun Wang , University of Central Florida, Orlando
John Bent , Los Alamos National Laboratory, Los Alamos
Current high-performance computing (HPC) applications have seen explosive growth in data size in recent years. Many application scientists have initiated efforts to integrate data-intensive computing into computation-intensive HPC facilities, particularly for data analytics. We have observed several scientific applications that must migrate their data from an HPC storage system to a data-intensive one for analytics. Because a gap exists between the data semantics of HPC storage and data-intensive systems, the migrated data must be further refined and reorganized before existing data-intensive tools such as MapReduce can be used to analyze it. This reorganization requires at least two complete scans through the data set and at least one MapReduce program to prepare the data before analysis. Running multiple MapReduce phases imposes significant overhead on the application in the form of excessive I/O operations: for every MapReduce phase, a distributed read and a distributed write on the file system must be performed. Our contribution is a MapReduce-based framework for HPC analytics that eliminates the multiple scans and reduces the number of data-preprocessing MapReduce programs. We also implement a data-centric scheduler that further improves the performance of HPC analytics MapReduce programs by maintaining data locality. We have added expressiveness to the MapReduce language so that application scientists can specify the logical semantics of their data such that 1) the data can be analyzed without running multiple data-preprocessing MapReduce programs, and 2) the data can be simultaneously reorganized as it is migrated to the data-intensive file system.
Using our augmented MapReduce system, MapReduce with Access Patterns (MRAP), we have demonstrated up to a 33 percent throughput improvement in one real application, and up to 70 percent in an I/O kernel of another. Our scheduling results show up to a 49 percent improvement for an I/O kernel of a prevalent HPC analysis application.
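The data-centric scheduling idea described above can be illustrated with a minimal sketch. This is not the paper's implementation; it is a generic greedy data-locality scheduler, and all names (`schedule`, the task/block/slot dictionaries) are hypothetical. Each map task is assigned, when possible, to a node that already holds a replica of its input block, so the distributed read stays local; only when no replica node has a free slot does the task fall back to a remote (non-local) node.

```python
# Hedged sketch of data-locality-aware task placement, in the spirit of
# a data-centric MapReduce scheduler. Names and structures are
# illustrative assumptions, not the MRAP API.

def schedule(tasks, block_locations, slots):
    """Greedily place map tasks on nodes holding their input blocks.

    tasks:           {task_id: block_id}
    block_locations: {block_id: [nodes holding a replica]}
    slots:           {node: free map slots}  (mutated)
    Returns          {task_id: (node, is_local)}
    """
    assignment = {}
    deferred = []
    for task, block in tasks.items():
        for node in block_locations.get(block, []):
            if slots.get(node, 0) > 0:          # local slot available
                slots[node] -= 1
                assignment[task] = (node, True)
                break
        else:
            deferred.append(task)               # no local slot free
    for task in deferred:
        node = max(slots, key=slots.get)        # least-loaded node
        slots[node] -= 1
        assignment[task] = (node, False)        # remote (non-local) read
    return assignment


tasks = {"t1": "b1", "t2": "b2", "t3": "b1"}
block_locations = {"b1": ["n1"], "b2": ["n2"]}
slots = {"n1": 1, "n2": 1, "n3": 1}
placement = schedule(tasks, block_locations, slots)
```

In this toy run, t1 and t2 land on the nodes holding their blocks (local reads), while t3 is forced remote because b1's only replica node is already full; the throughput gains reported above come from maximizing the first case and minimizing the second.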
Index Terms—Semantics, distributed databases, layout, scheduling, pattern matching, kernel, processor scheduling, HPC analytics framework, data-intensive systems, MapReduce
Saba Sehrish, Grant Mackey, Pengju Shang, Jun Wang, John Bent, "Supporting HPC Analytics Applications with Access Patterns Using Data Restructuring and Data-Centric Scheduling Techniques in MapReduce", IEEE Transactions on Parallel & Distributed Systems, vol.24, no. 1, pp. 158-169, Jan. 2013, doi:10.1109/TPDS.2012.88
[1] http://hadoop.apache.org/core/, 2012.
[2] , 2012.
[3], 2012.
[4], 2012.
[5] roskies20090130.ppt, 2011.
[6] http://www.hdfgroup.org/hdf5/, 2011.
[7], 2012.
[8] 2011ucsc-soe-11-04.pdf, 2012.
[9] /, 2012.
[10] MPI-2: Extensions to the Message-Passing Interface, July 1997.
[11] J. Bent, G. Gibson, G. Grider, B. McClelland, P. Nowoczynski, J. Nunez, M. Polte, and M. Wingate, "PLFS: A Checkpoint Filesystem for Parallel Applications," Proc. ACM/IEEE Conf. Supercomputing, Nov. 2009.
[12] D. Borthakur, "The Hadoop Distributed File System: Architecture and Design."
[13] T.H. Cormen, C. Stein, R.L. Rivest, and C.E. Leiserson, Introduction to Algorithms. McGraw-Hill Higher Education, 2001.
[14] J. Dean, "Experiences with MapReduce, An Abstraction for Large-Scale Computation," Proc. 15th Int'l Conf. Parallel Architectures and Compilation Techniques (PACT '06), p. 1, 2006.
[15] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Proc. Sixth Conf. Symp. Operating Systems Design and Implementation (OSDI '04), pp. 10-10, 2004.
[16] J. Ekanayake, S. Pallickara, and G. Fox, "MapReduce for Data Intensive Scientific Analyses," Proc. IEEE Fourth Int'l Conf. eScience (eScience '08), pp. 277-284, 2008.
[17] J. Ekanayake, T. Gunarathne, G. Fox, A.S. Balkir, C. Poulain, N. Araujo, and R. Barga, "DryadLINQ for Scientific Analyses," Proc. IEEE Fifth Int'l Conf. e-Science (E-SCIENCE '09), pp. 329-336, 2009.
[18] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System," Proc. 19th ACM Symp. Operating Systems Principles (SOSP '03), pp. 29-43, 2003.
[19] W. Gropp, R. Thakur, and E. Lusk, Using MPI-2: Advanced Features of the Message Passing Interface. MIT Press, 1999.
[20] C. He, Y. Lu, and D. Swanson, "Matchmaking: A New MapReduce Scheduling Technique," Proc. IEEE Third Int'l Conf. Cloud Computing Technology and Science (CloudCom), pp. 40-47, 2011.
[21] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks," Proc. ACM SIGOPS/EuroSys European Conf. Computer Systems (EuroSys '07), pp. 59-72, 2007.
[22] M. Isard and Y. Yu, "Distributed Data-Parallel Computing Using a High-Level Programming Language," Proc. 35th SIGMOD Int'l Conf. Management of Data (SIGMOD '09), pp. 987-994, 2009.
[23] Y.C. Kwon, D. Nunley, J.P. Gardner, M. Balazinska, B. Howe, and S. Loebman, "Scalable Clustering Algorithm for N-Body Simulations in a Shared-Nothing Cluster," technical report, Univ. of Washington, 2009.
[24] J. Li, W.K. Liao, A. Choudhary, R. Ross, R. Thakur, W. Gropp, R. Latham, A. Siegel, B. Gallagher, and M. Zingale, "Parallel NetCDF: A High-Performance Scientific I/O Interface," Proc. ACM/IEEE Conf. Supercomputing, p. 39, Nov. 2003.
[25] X. Liu, J. Han, Y. Zhong, C. Han, and X. He, "Implementing WebGIS on Hadoop: A Case Study of Improving Small File I/O Performance on HDFS," Proc. IEEE Int'l Conf. Cluster Computing and Workshops (CLUSTER '09), 2009.
[26] A. Matsunaga, M. Tsugawa, and J. Fortes, "CloudBLAST: Combining MapReduce and Virtualization on Distributed Resources for Bioinformatics Applications," Proc. IEEE Fourth Int'l Conf. eScience (ESCIENCE '08), pp. 222-229, 2008.
[27] L. Meyer, J. Annis, M. Wilde, M. Mattoso, and I. Foster, "Planning Spatial Workflows to Optimize Grid Performance," Proc. ACM Symp. Applied Computing (SAC '06), pp. 786-790, 2006.
[28] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, "Pig Latin: A Not-So-Foreign Language for Data Processing," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '08), pp. 1099-1110, 2008.
[29] I. Raicu, I.T. Foster, Y. Zhao, P. Little, C.M. Moretti, A. Chaudhary, and D. Thain, "The Quest for Scalable Support of Data-Intensive Workloads in Distributed Systems," Proc. 18th ACM Int'l Symp. High Performance Distributed Computing (HPDC '09), pp. 207-216, 2009.
[30] I. Raicu, Y. Zhao, C. Dumitrescu, I. Foster, and M. Wilde, "Falkon: A Fast and Light-Weight Task Execution Framework," Proc. ACM/IEEE Conf. Supercomputing (SC '07), pp. 1-12, 2007.
[31] I. Raicu, Y. Zhao, I.T. Foster, and A. Szalay, "Accelerating Large-Scale Data Exploration through Data Diffusion," Proc. Int'l Workshop Data-Aware Distributed Computing (DADC '08), pp. 9-18, 2008.
[32] K. Ranganathan and I. Foster, "Decoupling Computation and Data Scheduling in Distributed Data-Intensive Applications," Proc. IEEE 11th Int'l Symp. High Performance Distributed Computing (HPDC '02), p. 352, 2002.
[33] E. Santos-Neto, W. Cirne, F. Brasileiro, A. Lima, R. Lima, and C. Grande, "Exploiting Replication and Data Reuse to Efficiently Schedule Data-Intensive Applications on Grids," Proc. 10th Workshop Job Scheduling Strategies for Parallel Processing, pp. 210-232, 2004.
[34] M.C. Schatz, "CloudBurst: Highly Sensitive Read Mapping with MapReduce," Bioinformatics, vol. 25, no. 11, pp. 1363-1369, 2009.
[35] X. Shen and A. Choudhary, "DPFS: A Distributed Parallel File System," Proc. Int'l Conf. Parallel Processing, vol. 0:0533, 2001.
[36] V. Springel, S.D.M. White, A. Jenkins, C.S. Frenk, N. Yoshida, L. Gao, J. Navarro, R. Thacker, D. Croton, J. Helly, J.A. Peacock, S. Cole, P. Thomas, H. Couchman, A. Evrard, J. Colberg, and F. Pearce, "Simulations of the Formation, Evolution and Clustering of Galaxies and Quasars," Nature, vol. 435, no. 7042, pp. 629-636, June 2005.
[37] R. Thakur, W. Gropp, and E. Lusk, "Data Sieving and Collective I/O in ROMIO," Proc. Seventh Symp. Frontiers of Massively Parallel Computation (FRONTIERS '99), pp. 182-189, 1999.
[38] S. Viswanathan, B. Veeravalli, D. Yu, and T.G. Robertazzi, "Design and Analysis of a Dynamic Scheduling Strategy with Resource Estimation for Large-Scale Grid Systems," Proc. IEEE/ACM Fifth Int'l Workshop Grid Computing (GRID '04), pp. 163-170, 2004.
[39] S. Yamagiwa and K. Wada, "Performance Impact on Resource Sharing among Multiple CPU- and GPU-Based Applications," Int'l J. Parallel, Emergent and Distributed Systems, vol. 26, no. 4, pp. 313-329, 2011.
[40] Y. Yu, M. Isard, D. Fetterly, M. Budiu, I. Erlingsson, P.K. Gunda, and J. Currey, "DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language," Proc. Eighth USENIX Conf. Operating Systems Design and Implementation (OSDI), pp. 1-14, 2008.
[41] M. Zaharia, A. Konwinski, A.D. Joseph, R. Katz, and I. Stoica, "Improving MapReduce Performance in Heterogeneous Environments," Proc. Eighth USENIX Conf. Operating Systems Design and Implementation (OSDI '08), pp. 29-42, 2008.
[42] Y. Zhao, M. Hategan, B. Clifford, I. Foster, G. von Laszewski, V. Nefedova, I. Raicu, T. Stef-Praun, and M. Wilde, "Swift: Fast, Reliable, Loosely Coupled Parallel Computation," Proc. IEEE Congress Services, pp. 199-206, July 2007.