18th International Parallel and Distributed Processing Symposium (IPDPS'04) - Papers
Fast and Scalable MPI-Level Broadcast Using InfiniBand's Hardware Multicast Support
Santa Fe, New Mexico
April 26-April 30
ISBN: 0-7695-2132-0
Jiuxing Liu, Ohio State University
Amith R Mamidala, Ohio State University
Dhabaleswar K Panda, Ohio State University

Modern high-performance applications require efficient and scalable collective communication operations. Currently, most collective operations are implemented on top of point-to-point operations. In this paper, we propose using hardware multicast in InfiniBand to design fast and scalable broadcast operations in MPI. InfiniBand supports multicast only with the Unreliable Datagram (UD) transport service, which makes it hard for an upper layer such as MPI to use directly. To bridge the semantic gap between MPI_Bcast and InfiniBand hardware multicast, we have designed and implemented a substrate on top of InfiniBand that provides reliability, in-order delivery, and large message handling. By using a sliding-window based design, we improve MPI_Bcast latency by moving most of the substrate's overhead off the communication critical path. Optimizations such as a new co-root based scheme and delayed ACK further balance and reduce the overhead. We have also addressed many detailed design issues, such as buffer management, efficient handling of out-of-order and duplicate messages, timeout and retransmission, flow control, and RDMA-based ACK communication.
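The receiver-side bookkeeping implied by in-order delivery over an unreliable transport can be illustrated with a small sketch. This is not the paper's implementation: the class name `ReceiveWindow`, the window size, and the use of a cumulative ACK value are all assumptions made for illustration, and the co-root scheme, delayed ACK, and RDMA-based ACK path are not modeled. The sketch only shows how a sliding window can discard duplicates, buffer out-of-order packets, and release data in sequence.

```python
from dataclasses import dataclass, field


@dataclass
class ReceiveWindow:
    """Hypothetical receiver-side sliding window over packet sequence numbers."""
    size: int                                    # assumed window size, in packets
    next_seq: int = 0                            # lowest sequence number not yet delivered
    pending: dict = field(default_factory=dict)  # buffered out-of-order packets

    def on_packet(self, seq: int, payload: bytes) -> list:
        """Accept one packet; return the payloads that can now be delivered in order."""
        if seq < self.next_seq or seq in self.pending:
            return []  # duplicate: already delivered or already buffered
        if seq >= self.next_seq + self.size:
            return []  # beyond the window: drop and rely on retransmission
        self.pending[seq] = payload
        delivered = []
        # Release the longest contiguous run starting at next_seq.
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered

    def ack_value(self) -> int:
        """Cumulative ACK (an assumption): all packets below this number were delivered."""
        return self.next_seq
```

For example, with `win = ReceiveWindow(size=4)`, a packet with sequence number 1 that arrives first is buffered and nothing is delivered; when packet 0 arrives, both payloads are released in order; a later retransmitted copy of packet 1 is ignored as a duplicate.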

Our performance evaluation shows that, on an 8-node cluster testbed, hardware multicast based designs can improve MPI broadcast latency by up to 58% and broadcast throughput by up to 112%. The proposed solutions are also much more tolerant of process skew than the current point-to-point based implementation. We have also developed an analytical model for our multicast based schemes and validated it against experimental numbers. The model shows that, with the new designs, one can achieve an MPI broadcast latency of 20.0 µs for small messages and 40.0 µs for a one-MTU message (around 1836 bytes of data payload) on a 1024-node cluster.

Citation:
Jiuxing Liu, Amith R Mamidala, Dhabaleswar K Panda, "Fast and Scalable MPI-Level Broadcast Using InfiniBand's Hardware Multicast Support," ipdps, vol. 1, p. 10b, 18th International Parallel and Distributed Processing Symposium (IPDPS'04) - Papers, 2004