Issue No.06 - June (2009 vol.20)
pp: 802-817
Crispín Gómez Requena , Universidad Politécnica de Valencia, Valencia
María Engracia Gómez Requena , Universidad Politécnica de Valencia, Valencia
Pedro Juan López Rodríguez , Universidad Politécnica de Valencia, Valencia
José Francisco Duato Marín , Universidad Politécnica de Valencia, Valencia
ABSTRACT
Fault tolerance in the interconnection network of large clusters of PCs is an issue of growing importance, since their increasing size also increases the failure probability. The fat-tree topology is commonly used in these machines, since it has become very popular among high-speed interconnect manufacturers. This paper proposes a new distributed fault-tolerant routing methodology for fat trees. Unlike previous proposals, it does not require additional network hardware, and its memory requirements, switch hardware, and routing delay scale well with the network size. Moreover, it nullifies only the strictly necessary paths, allowing adaptive routing through the healthy paths. The methodology is based on enhancing the Interval Routing scheme with exclusion intervals. An exclusion interval is associated with each switch output port and represents the nodes that are unreachable from that port after a fault. We propose a methodology to identify the links where the exclusion intervals must be updated after a fault, the values to write to them, and a very efficient mechanism to distribute the required information through the network without stopping system activity. Our methodology can tolerate a high number of network failures with low performance degradation. Moreover, it can achieve zero packet loss during the update period.
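The role of an exclusion interval in a port's routing decision can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `OutputPort` type, the port names, the node numbering, and the choice of a single inclusive exclusion interval per port are all assumptions made for illustration. It only shows the idea stated in the abstract, that a port remains a routing option for a destination unless the destination falls inside that port's exclusion interval.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # inclusive [low, high] range of node IDs (assumed encoding)


@dataclass
class OutputPort:
    name: str                             # hypothetical port label
    reachable: Interval                   # nodes routable through this port
    excluded: Optional[Interval] = None   # nodes unreachable from this port after a fault


def in_interval(node: int, interval: Interval) -> bool:
    low, high = interval
    return low <= node <= high


def eligible_ports(ports: List[OutputPort], dest: int) -> List[str]:
    """Return the names of the ports that can still deliver a packet to dest."""
    return [
        p.name
        for p in ports
        if in_interval(dest, p.reachable)
        and not (p.excluded is not None and in_interval(dest, p.excluded))
    ]


# Hypothetical switch with two upward ports that both reach nodes 0..15.
ports = [OutputPort("up0", (0, 15)), OutputPort("up1", (0, 15))]
before_fault = eligible_ports(ports, 9)

# Assume a fault makes nodes 8..11 unreachable through up0 only.
ports[0].excluded = (8, 11)
after_fault = eligible_ports(ports, 9)   # up0 is masked for destination 9
unaffected = eligible_ports(ports, 3)    # destination 3 keeps both options
```

In this sketch only the affected port is masked, and only for the affected destinations, so adaptive routing can still choose among the remaining healthy ports.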
INDEX TERMS
Fault tolerance, fat trees, adaptive routing, dynamic fault model, memory-effective routing.
CITATION
Crispín Gómez Requena, María Engracia Gómez Requena, Pedro Juan López Rodríguez, José Francisco Duato Marín, "FT²EI: A Dynamic Fault-Tolerant Routing Methodology for Fat Trees with Exclusion Intervals", IEEE Transactions on Parallel & Distributed Systems, vol.20, no. 6, pp. 802-817, June 2009, doi:10.1109/TPDS.2008.130