Fourth IEEE International Conference on Cluster Computing (CLUSTER'02)
Scalable Resource Management in High Performance Computers
Chicago, Illinois
September 23-26, 2002
ISBN: 0-7695-1745-5
Eitan Frachtenberg, Los Alamos National Laboratory
Fabrizio Petrini, Los Alamos National Laboratory
Juan Fernandez, Los Alamos National Laboratory
Salvador Coll, Los Alamos National Laboratory

Clusters of workstations have emerged as an important platform for building cost-effective, scalable, and highly available computers. Although many hardware solutions are available today, the largest challenge in making large-scale clusters usable lies in the system software. In this paper we present STORM, a resource management tool designed to provide scalability, low overhead, and the flexibility necessary to efficiently support and analyze a wide range of job-scheduling algorithms. STORM achieves these goals by using a small set of primitive mechanisms that are common in modern high-performance interconnects. The architecture of STORM is based on three main technical innovations. First, part of the scheduler runs in the thread processor located on the network interface. Second, we use highly scalable hardware collectives both to implement control heartbeats and to distribute the binary of a parallel job in near-constant time. Third, we use an I/O bypass protocol that allows fast data movement from the file system to the communication buffers in the network interface and vice versa.

The experimental results show that STORM can launch a job with a binary of 12 MB on a 64-processor, 32-node cluster in less than 250 ms. This paper provides experimental and analytical evidence that these results scale to a much larger number of nodes. To the best of our knowledge, STORM significantly outperforms existing production schedulers in launching jobs, performing resource management tasks, and gang-scheduling tasks.
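
To put the launch figure in perspective, the following minimal sketch works out the data rates it implies. It uses only the numbers quoted above (12 MB, 250 ms, 32 nodes); the function name is hypothetical and not part of STORM. Because the launch completes in less than 250 ms and includes more than the transfer itself, the results should be read as lower bounds, not as measured bandwidths.

```python
def launch_throughput(binary_mb, launch_s, nodes):
    """Return (per-node, aggregate) delivery rates in MB/s.

    Assumes every node receives one full copy of the binary within
    launch_s seconds. Hypothetical helper, for illustration only.
    """
    per_node = binary_mb / launch_s   # rate at which each node receives the binary
    aggregate = per_node * nodes      # total data delivered across the cluster per second
    return per_node, aggregate


# Figures from the abstract: 12 MB binary, under 250 ms, 32 nodes.
per_node, aggregate = launch_throughput(12, 0.250, 32)
print(f"per node:  at least {per_node:.0f} MB/s")
print(f"aggregate: at least {aggregate:.0f} MB/s")
```

If the binary is distributed with a hardware collective, as the abstract describes, the aggregate figure is data delivered rather than data injected by the source, which sends the binary once.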

Index Terms:
Cluster Computing, Resource Management, Job Scheduling, Gang Scheduling, Parallel Architectures, Quadrics Interconnect, I/O bypass
Citation:
Eitan Frachtenberg, Fabrizio Petrini, Juan Fernandez, Salvador Coll, "Scalable Resource Management in High Performance Computers," Fourth IEEE International Conference on Cluster Computing (CLUSTER'02), p. 305, 2002