Issue No. 08 - August (2011 vol. 22)
ISSN: 1045-9219
pp: 1307-1322
Henry M. Monti , Virginia Polytechnic Institute and State University, Blacksburg
Sudharshan S. Vazhkudai , Oak Ridge National Laboratory, Oak Ridge
Ali R. Butt , Virginia Polytechnic Institute and State University, Blacksburg
ABSTRACT
Modern High-Performance Computing (HPC) centers are facing a data deluge from emerging scientific applications. Supporting large data entails a significant commitment of the center's high-throughput storage system, the scratch space. However, scratch space is typically managed using simple “purge policies,” without sophisticated end-user data services that balance resource consumption and user serviceability. End-user data services such as offloading are performed using point-to-point transfers that cannot reconcile the center's purge deadlines with users' delivery deadlines, cannot adapt to changing dynamics in the end-to-end data path, and are not fault-tolerant. Such inefficiencies can be prohibitive to sustaining high performance. In this paper, we address these issues by designing a framework for the timely, decentralized offload of application result data. Our framework uses an overlay of user-specified intermediate and landmark sites to orchestrate a decentralized, fault-tolerant delivery. We have implemented our techniques within a production job scheduler (PBS) and a data transfer tool (BitTorrent). Our evaluation, using both a real implementation and supercomputer job-log-driven simulations, shows that offloading times can be reduced significantly (by 90.4 percent for a 5 GB data transfer) and that the exposure window can be minimized while meeting center-user service level agreements.
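The framework's core decision, choosing an offload route through intermediate and landmark sites that beats the scratch purge deadline, can be illustrated with a small sketch. The Python snippet below is a hypothetical, simplified model, not the authors' PBS/BitTorrent implementation; the node names, bandwidth estimates, and first-fit path selection are assumptions made purely for illustration.

# A minimal sketch (illustrative assumptions only): pick a staged offload path
# through user-specified intermediate nodes that meets the scratch purge deadline.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Hop:
    name: str
    mb_per_s: float       # estimated throughput on this hop (MB/s) -- assumed figure
    available: bool = True

def offload_time(data_gb: float, path: List[Hop]) -> float:
    """Estimated end-to-end transfer time in seconds, assuming hops are traversed sequentially."""
    data_mb = data_gb * 1024
    return sum(data_mb / hop.mb_per_s for hop in path)

def pick_path(data_gb: float, purge_deadline_s: float,
              candidates: List[List[Hop]]) -> Optional[List[Hop]]:
    """Return the first candidate path (center -> intermediates -> destination)
    whose hops are all up and whose estimated time beats the purge deadline."""
    for path in candidates:
        if all(hop.available for hop in path) and offload_time(data_gb, path) <= purge_deadline_s:
            return path
    return None  # no path meets the deadline; the result data stays exposed on scratch

if __name__ == "__main__":
    primary  = [Hop("center", 120), Hop("campus-cache", 80), Hop("user-desktop", 40)]
    fallback = [Hop("center", 120), Hop("landmark-site", 60), Hop("user-desktop", 40)]
    chosen = pick_path(data_gb=5.0, purge_deadline_s=600.0, candidates=[primary, fallback])
    print("selected path:", [hop.name for hop in chosen] if chosen else "none meets deadline")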
INDEX TERMS
High-performance data management, HPC center serviceability, offloading, end-user data delivery, peer-to-peer.
CITATION
Henry M. Monti, Sudharshan S. Vazhkudai, Ali R. Butt, "Timely Result-Data Offloading for Improved HPC Center Scratch Provisioning and Serviceability", IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 8, pp. 1307-1322, August 2011, doi:10.1109/TPDS.2010.190