Issue No.12 - December (2006 vol.55)
Eitan Frachtenberg, IEEE
Fabrizio Petrini, IEEE
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TC.2006.206
Although clusters are a popular form of high-performance computing, they remain more difficult to manage than sequential systems—or even symmetric multiprocessors. In this paper, we identify a small set of primitive mechanisms that are sufficiently general to be used as building blocks to solve a variety of resource-management problems. We then present STORM, a resource-management environment that embodies these mechanisms in a scalable, low-overhead, and efficient implementation. The key innovation behind STORM is a modular software architecture that reduces all resource management functionality to a small number of highly scalable mechanisms. These mechanisms simplify the integration of resource management with low-level network features. As a result of this design, STORM can launch large, parallel applications an order of magnitude faster than the best time reported in the literature and can gang-schedule a parallel application as fast as the node OS can schedule a sequential application. This paper describes the mechanisms and algorithms behind STORM and presents a detailed performance model that shows that STORM's performance can scale to thousands of nodes.
Hardware/software interface, system architectures, integration, and modeling, network operating systems, supercomputers.
Eitan Frachtenberg, Fabrizio Petrini, Juan Fernández, Scott Pakin, "STORM: Scalable Resource Management for Large-Scale Parallel Computers", IEEE Transactions on Computers, vol. 55, no. 12, pp. 1572-1587, December 2006, doi:10.1109/TC.2006.206