J. M. Montañana, J. Flich, A. Robles, P. López, J. Duato, "A Transition-Based Fault-Tolerant Routing Methodology for InfiniBand Networks," Parallel and Distributed Processing Symposium, International, vol. 9, p. 186a, 18th International Parallel and Distributed Processing Symposium (IPDPS'04) - Workshop 8, IEEE Computer Society, Los Alamitos, CA, USA, 2004. ISBN 0-7695-2132-0. DOI: 10.1109/IPDPS.2004.1303198.
Clusters of PCs are currently considered a cost-effective alternative to large parallel computers. As the number of components in these systems grows, the probability of faults increases dramatically, so it is critical to keep the system running even in the presence of faults. The interconnection network plays a key role in system performance. InfiniBand (IBA) is a new standard interconnect suitable for clusters. Most of the fault-tolerant routing strategies proposed for massively parallel computers cannot be applied to IBA because its routing and virtual channel transitions are deterministic, which prevents packets from routing around faults.
One possible approach to fault tolerance in IBA is to use several disjoint paths between every source-destination pair of nodes and to select the appropriate path at the source host. This requires a routing algorithm that provides enough disjoint paths while still guaranteeing deadlock freedom. In this paper we address this issue by proposing a simple and effective fault-tolerant routing methodology for IBA networks that can be applied to any network topology and that balances the degree of fault tolerance against the network resources devoted to it. Preliminary results show that the proposed methodology scales well and tolerates up to three faults in 2D tori and five in 3D tori using only two virtual channels.
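The idea of selecting a path at the source host can be illustrated with a minimal sketch. It is not the methodology proposed in the paper: it assumes a k x k 2D torus, builds candidate paths by plain dimension-order routing in every direction combination, and picks the shortest candidate that avoids a known set of faulty links. The names `candidate_paths`, `links`, and `select_path` are hypothetical, the candidates are alternatives rather than guaranteed disjoint paths, and deadlock freedom and virtual channels are not modelled.

```python
from itertools import product


def ring_steps(a, b, k, direction):
    """Coordinates visited moving from a to b on a ring of size k (a excluded)."""
    steps = []
    while a != b:
        a = (a + direction) % k
        steps.append(a)
    return steps


def candidate_paths(src, dst, k):
    """Yield distinct dimension-order paths (lists of nodes) on a k x k torus.

    Each candidate fixes a dimension order (X then Y, or Y then X) and a
    travel direction (+1 or -1) for each dimension.
    """
    seen = set()
    for order, dx, dy in product(("xy", "yx"), (1, -1), (1, -1)):
        x, y = src
        path = [src]
        for dim in order:
            if dim == "x":
                for nx in ring_steps(x, dst[0], k, dx):
                    path.append((nx, y))
                x = dst[0]
            else:
                for ny in ring_steps(y, dst[1], k, dy):
                    path.append((x, ny))
                y = dst[1]
        key = tuple(path)
        if key not in seen:
            seen.add(key)
            yield path


def links(path):
    """Return the set of undirected links traversed by a path."""
    return {frozenset(pair) for pair in zip(path, path[1:])}


def select_path(src, dst, k, faulty_links):
    """Pick the shortest candidate path that uses no faulty link, or None."""
    faulty = {frozenset(link) for link in faulty_links}
    usable = [p for p in candidate_paths(src, dst, k) if not (links(p) & faulty)]
    return min(usable, key=len) if usable else None


if __name__ == "__main__":
    # One faulty link next to the source: the source falls back to another path.
    print(select_path((0, 0), (2, 1), 4, [((0, 0), (1, 0))]))
```

In this sketch the source host simply filters out candidates that cross a faulty link. A real IBA deployment would also have to ensure that the set of paths in use cannot form a deadlock, which is the problem the paper addresses.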
