Timely Result-Data Offloading for Improved HPC Center Scratch Provisioning and Serviceability
August 2011 (vol. 22 no. 8)
pp. 1307-1322
Henry M. Monti, Virginia Polytechnic Institute and State University, Blacksburg
Ali R. Butt, Virginia Polytechnic Institute and State University, Blacksburg
Sudharshan S. Vazhkudai, Oak Ridge National Laboratory, Oak Ridge
Modern High-Performance Computing (HPC) centers are facing a data deluge from emerging scientific applications. Supporting large data sets requires a significant commitment of the center's high-throughput storage system, the scratch space. However, the scratch space is typically managed using simple “purge policies,” without sophisticated end-user data services to balance resource consumption and user serviceability. End-user data services such as offloading are performed using point-to-point transfers that cannot reconcile the center's purge deadlines with users' delivery deadlines, cannot adapt to changing dynamics in the end-to-end data path, and are not fault-tolerant. Such inefficiencies can be prohibitive to sustaining high performance. In this paper, we address these issues by designing a framework for the timely, decentralized offload of application result data. Our framework uses an overlay of user-specified intermediate and landmark sites to orchestrate a decentralized, fault-tolerant delivery. We have implemented our techniques within a production job scheduler (PBS) and a data transfer tool (BitTorrent). Our evaluation, using both a real implementation and supercomputer job log-driven simulations, shows that offloading times can be significantly reduced (by 90.4 percent for a 5 GB data transfer) and that the exposure window can be minimized while still meeting center-user service level agreements.
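
The tension between a purge deadline and a point-to-point transfer can be illustrated with a small back-of-the-envelope sketch. This is not the paper's algorithm: the `Site` type, the bandwidth figures, and the idealized assumption that data is split across intermediate sites in proportion to link bandwidth (with no bottleneck at the center's egress) are all hypothetical and chosen only to make the arithmetic concrete.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Site:
    """Hypothetical offload target; bandwidth is an assumed center-to-site rate."""
    name: str
    bandwidth_mbps: float  # megabits per second


def transfer_seconds(size_gb: float, bandwidth_mbps: float) -> float:
    """Time to move size_gb (decimal gigabytes) over one link."""
    return size_gb * 8000.0 / bandwidth_mbps  # 1 GB = 8000 megabits


def offload_seconds(size_gb: float, sites: List[Site]) -> float:
    """Idealized time to clear scratch when data is split across all sites
    in proportion to bandwidth, so every link finishes at the same moment."""
    aggregate = sum(s.bandwidth_mbps for s in sites)
    return transfer_seconds(size_gb, aggregate)


def meets_purge_deadline(size_gb: float, sites: List[Site],
                         purge_deadline_s: float) -> bool:
    """True if the data leaves scratch before the center would purge it."""
    return offload_seconds(size_gb, sites) <= purge_deadline_s


# Illustrative numbers only (not taken from the paper's evaluation).
direct = [Site("user-site", 100.0)]
overlay = direct + [Site("intermediate-a", 300.0), Site("intermediate-b", 400.0)]
```

Under these made-up numbers, a 5 GB result set needs 400 seconds over the direct link but 50 seconds when the two intermediate sites also receive data, so a 300-second purge deadline is met only with the overlay. A real system would also have to account for the second hop from the intermediate sites to the user and for changing link conditions.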

Index Terms:
High-performance data management, HPC center serviceability, offloading, end-user data delivery, peer-to-peer.
Henry M. Monti, Ali R. Butt, Sudharshan S. Vazhkudai, "Timely Result-Data Offloading for Improved HPC Center Scratch Provisioning and Serviceability," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 8, pp. 1307-1322, Aug. 2011, doi:10.1109/TPDS.2010.190