Monitoring High Performance Networks in Large-scale Clusters
Sixth IEEE International Symposium on Cluster Computing and the Grid Workshops (CCGRIDW'06), Singapore, May 16-19, 2006. ISBN: 0-7695-2585-7
The number of large-scale clusters is rising. They are integrated into Grids or become key components of larger structures. As more users and projects rely on HPC clusters, high availability and security become requirements for their fast-growing adoption and use. In this paper, we focus on high performance networks, on top of which all HPC clusters are built. We demonstrate that classical instrumentation is inefficient in an HPC environment: it does not scale or causes a significant loss of performance. Based on this fact, we highlight properties of clusters: nodes have assigned roles and are coupled at various levels. Moreover, we study the main characteristics of resource usage for each type of node and propose an instrumentation that can be effectively deployed. It results in fine-grained mechanisms adapted to the system architecture and performance constraints. Relevant information is collected over time. Two properties are verified online and dynamically: coherency and containment. Each induces a type of verification, and both aim at reducing the recovery time from failure and the security risk of a whole cluster. We illustrate our methodology on the QsNet network and provide a way to increase the safety of high performance networks and clusters.
Citation:
Fabrice Gadaud, "Monitoring High Performance Networks in Large-scale Clusters," ccgrid, vol. 2, pp. 32, Sixth IEEE International Symposium on Cluster Computing and the Grid Workshops (CCGRIDW'06), 2006