Issue No. 08 - August (2011 vol. 22)
ISSN: 1045-9219
pp: 1307-1322
Henry M. Monti , Virginia Polytechnic Institute and State University, Blacksburg
Ali R. Butt , Virginia Polytechnic Institute and State University, Blacksburg
Sudharshan S. Vazhkudai , Oak Ridge National Laboratory, Oak Ridge
Modern High-Performance Computing (HPC) centers are facing a data deluge from emerging scientific applications. Supporting large data sets entails a significant commitment of the center's high-throughput storage system, the scratch space. However, scratch space is typically managed using simple "purge policies," without sophisticated end-user data services that balance resource consumption and user serviceability. End-user data services such as offloading are performed using point-to-point transfers that cannot reconcile the center's purge deadlines with users' delivery deadlines, cannot adapt to changing dynamics in the end-to-end data path, and are not fault-tolerant. Such inefficiencies can be prohibitive to sustaining high performance. In this paper, we address these issues by designing a framework for the timely, decentralized offload of application result data. Our framework uses an overlay of user-specified intermediate and landmark sites to orchestrate decentralized, fault-tolerant delivery. We have implemented our techniques within a production job scheduler (PBS) and data transfer tool (BitTorrent). Our evaluation, using both a real implementation and supercomputer job-log-driven simulations, shows that offloading times can be significantly reduced (by 90.4 percent for a 5 GB data transfer) and that the exposure window can be minimized while also meeting center-user service level agreements.
High-performance data management, HPC center serviceability, offloading, end-user data delivery, peer-to-peer.
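The abstract's core idea, choosing an offload route through intermediate sites so that both the center's purge deadline and the user's delivery deadline are met while minimizing the data's exposure on scratch, can be illustrated with a minimal sketch. This is not the paper's actual decentralized BitTorrent-based protocol; the path list, bandwidth estimates, and the `pick_path` helper are hypothetical, and the bottleneck-bandwidth model is a simplifying assumption.

```python
# Illustrative sketch (hypothetical helper, not the paper's algorithm):
# given candidate overlay paths with per-hop bandwidth estimates, pick the
# path whose estimated transfer time satisfies both the center's purge
# deadline and the user's delivery deadline, preferring the fastest such
# path to shrink the exposure window on scratch.

def pick_path(data_gb, paths, purge_deadline_s, delivery_deadline_s):
    """paths: list of (name, [hop bandwidths in Gb/s]).

    The effective rate of a path is assumed to be its slowest hop
    (bottleneck model). Returns (name, transfer_time_s) or None if no
    path meets both deadlines.
    """
    best = None
    for name, hop_bw in paths:
        rate = min(hop_bw)                # bottleneck hop governs throughput
        t = data_gb * 8 / rate            # GB -> Gb, then seconds at `rate`
        if t <= purge_deadline_s and t <= delivery_deadline_s:
            if best is None or t < best[1]:
                best = (name, t)
    return best

# Example: a 5 GB result file, a slow direct link vs. a staged route
# through a user-specified intermediate site.
choice = pick_path(
    5,
    [("direct", [0.1]), ("via-intermediate", [1.0, 0.5])],
    purge_deadline_s=300,
    delivery_deadline_s=600,
)
print(choice)  # the direct path misses the purge deadline (400 s > 300 s)
```

A real offloader would, as the abstract notes, also re-plan as end-to-end path conditions change and tolerate intermediate-site failures; this sketch only captures the deadline-aware path selection step.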

H. M. Monti, S. S. Vazhkudai and A. R. Butt, "Timely Result-Data Offloading for Improved HPC Center Scratch Provisioning and Serviceability," in IEEE Transactions on Parallel & Distributed Systems, vol. 22, no. 8, pp. 1307-1322, 2011.