Issue No. 09 - September (2009 vol. 20)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TPDS.2008.228
Salvador Coll, Universidad Politecnica de Valencia, Valencia
Francisco J. Mora, Universidad Politecnica de Valencia, Valencia
Jose Duato, Universidad Politecnica de Valencia, Valencia
Fabrizio Petrini, IBM TJ Watson Research Center, Yorktown Heights
This article presents an efficient and scalable mechanism to overcome the limitations of collective communication in switched interconnection networks in the presence of faults. As supercomputing moves toward massively parallel computers with many thousands of components, reliability becomes a challenge. In such a scenario, fat-tree networks that provide hardware support for collective communication suffer serious performance degradation in the presence of even a single faulty node. This paper describes a new mechanism to provide high-performance collective communication in such situations. The feasibility of the proposed technique is formally demonstrated. We present the design of a new hardware-based routing algorithm for multicast that is at the base of our proposal. The proposed mechanism is implemented and experimentally evaluated. Our experimental results show that hardware-based multicast trees provide an efficient and scalable solution for collective communication in fat-tree networks, significantly outperforming traditional solutions.
Multicast, data communications, interprocessor communications, network communication, network problems, trees.
F. Petrini, S. Coll, J. Duato and F. J. Mora, "Efficient and Scalable Hardware-Based Multicast in Fat-Tree Networks," in IEEE Transactions on Parallel & Distributed Systems, vol. 20, no. 9, pp. 1285-1298, 2009.