Issue No. 01 - January (2008 vol. 19)
The processing elements of many modern tightly coupled multicomputers are connected via mesh or toroidal networks. Such interconnects are simple and highly scalable, but suffer from high fragmentation, low utilization, and insufficient fault tolerance when the resources allocated to each job are dedicated. High-dimensional interconnects may be more efficient in certain cases, but they are based on complex and expensive components and scale poorly. We present a novel hardware/software architectural approach that detaches the processing elements of the system from the interconnect and augments the traditional toroidal topology to provide additional connectivity options and additional link redundancy. We explore the properties of the new "multi-toroidal" topology and the improvements it offers in resource utilization and failure tolerance. We present the results of extensive simulation studies showing that, for practically important types of workloads, resource utilization may be increased by 50%, and in certain cases by as much as 100%, compared to toroidal machines, and is, in fact, close to the theoretically optimal case of a full crossbar interconnect. The combined hardware/software architectural innovation yields a significant improvement in resource utilization over and above the state of the art in scheduling algorithm research. In addition, multi-toroidal multicomputers can operate under link failure rates of 0.002 failures per week, a rate that would shut down toroidal machines. A variant of the multi-toroidal architecture is implemented in the Blue Gene/L supercomputer.
parallel architectures, scheduling and task partitioning, network topology
Oleg Goldshmidt, Yariv Aridor, Yevgeny Kliteynik, Edi Shmueli, Tamar Domany, Jose E. Moreira, "Multitoroidal Interconnects For Tightly Coupled Supercomputers", IEEE Transactions on Parallel & Distributed Systems, vol. 19, no. 1, pp. 52-65, January 2008, doi:10.1109/TPDS.2007.1118