Issue No.08 - August (2011 vol.22)
Henry M. Monti, Virginia Polytechnic Institute and State University, Blacksburg
Ali R. Butt, Virginia Polytechnic Institute and State University, Blacksburg
Sudharshan S. Vazhkudai, Oak Ridge National Laboratory, Oak Ridge
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TPDS.2010.190
Modern High-Performance Computing (HPC) centers are facing a data deluge from emerging scientific applications. Supporting large data entails a significant commitment of the center's high-throughput storage system, the scratch space. However, the scratch space is typically managed using simple “purge policies,” without sophisticated end-user data services to balance resource consumption and user serviceability. End-user data services such as offloading are performed using point-to-point transfers that cannot reconcile the center's purge deadlines with users' delivery deadlines, cannot adapt to changing dynamics in the end-to-end data path, and are not fault-tolerant. Such inefficiencies can be prohibitive to sustaining high performance. In this paper, we address these issues by designing a framework for the timely, decentralized offload of application result data. Our framework uses an overlay of user-specified intermediate and landmark sites to orchestrate a decentralized, fault-tolerant delivery. We have implemented our techniques within a production job scheduler (PBS) and a data transfer tool (BitTorrent). Our evaluation, using both a real implementation and supercomputer job log-driven simulations, shows that the offloading times can be significantly reduced (90.4 percent for a 5 GB data transfer) and that the exposure window can be minimized while also meeting center-user service level agreements.
High-performance data management, HPC center serviceability, offloading, end-user data delivery, peer-to-peer.
Henry M. Monti, Ali R. Butt, Sudharshan S. Vazhkudai, "Timely Result-Data Offloading for Improved HPC Center Scratch Provisioning and Serviceability", IEEE Transactions on Parallel & Distributed Systems, vol. 22, no. 8, pp. 1307-1322, August 2011, doi:10.1109/TPDS.2010.190