Issue No.08 - August (2011 vol.22)
pp: 1307-1322
Henry M. Monti, Virginia Polytechnic Institute and State University, Blacksburg
Ali R. Butt, Virginia Polytechnic Institute and State University, Blacksburg
Sudharshan S. Vazhkudai, Oak Ridge National Laboratory, Oak Ridge
Modern High-Performance Computing (HPC) centers are facing a data deluge from emerging scientific applications. Supporting large data entails a significant commitment of the center's high-throughput storage system, the scratch space. However, the scratch space is typically managed using simple “purge policies,” without sophisticated end-user data services to balance resource consumption and user serviceability. End-user data services such as offloading are performed using point-to-point transfers that cannot reconcile the center's purge deadlines with users' delivery deadlines, cannot adapt to changing dynamics in the end-to-end data path, and are not fault-tolerant. Such inefficiencies can be prohibitive to sustaining high performance. In this paper, we address the above issues by designing a framework for the timely, decentralized offload of application result data. Our framework uses an overlay of user-specified intermediate and landmark sites to orchestrate a decentralized, fault-tolerant delivery. We have implemented our techniques within a production job scheduler (PBS) and a data transfer tool (BitTorrent). Our evaluation, using both a real implementation and supercomputer job log-driven simulations, shows that the offloading times can be significantly reduced (by 90.4 percent for a 5 GB data transfer) and that the exposure window can be minimized while also meeting center-user service level agreements.
High-performance data management, HPC center serviceability, offloading, end-user data delivery, peer-to-peer.
Henry M. Monti, Ali R. Butt, Sudharshan S. Vazhkudai, "Timely Result-Data Offloading for Improved HPC Center Scratch Provisioning and Serviceability," IEEE Transactions on Parallel & Distributed Systems, vol. 22, no. 8, pp. 1307-1322, August 2011, doi:10.1109/TPDS.2010.190