2008 International Conference on Parallel Architectures and Compilation Techniques (PACT) (2008)
Toronto, ON, Canada
Oct. 25, 2008 to Oct. 29, 2008
DOI Bookmark: http://doi.ieeecomputersociety.org/
Costin Iancu, Lawrence Berkeley National Laboratory, CA, USA
Steven Hofmeyr, Lawrence Berkeley National Laboratory, CA, USA
“Vector” style communication operations transfer multiple disjoint memory regions within one logical step. These operations are widely used in applications and improve application performance, and their behavior has been studied and optimized using different implementation techniques across a large variety of systems. In this paper we present a methodology for selecting the best-performing implementation of a vector operation from multiple alternatives. Our approach is designed for systems with wide SMP nodes, where we believe that most published studies fail to predict performance correctly. With the emergence of multi-core processors, we believe that techniques similar to ours will be incorporated into communication libraries or language runtimes for performance reasons. The methodology relies on exploring the application space and classifying the regions within this space where a particular implementation method performs best. We use micro-benchmarks to measure the performance of an implementation at a given point in the application space and then compose profiles that compare the performance of two given implementations. These profiles capture an empirical upper bound on the performance degradation of a given protocol under heavy node load. At runtime, the application selects the implementation according to these performance profiles. Our approach provides performance portability, and using our dynamic multi-protocol selection we improved the performance of a NAS Parallel Benchmarks workload by 22% on a large-scale IBM cluster. Very positive results have also been obtained on large-scale InfiniBand and Cray XT systems. This work indicates that perhaps the most important factor for application performance on wide SMP systems is the successful management of load on the Network Interface Cards.
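The profile-then-select idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of (number of regions, region size) as the axes of the application space, and the nearest-lower-grid-point lookup are all assumptions made for illustration, and the sketch ignores node load, which the paper's profiles account for.

```python
from bisect import bisect_right


def build_profile(timings_a, timings_b):
    """Compare two implementations point by point (hypothetical helper).

    timings_a / timings_b map (num_regions, region_bytes) -> seconds, as a
    micro-benchmark might report them over the same full grid of points.
    Returns a dict mapping each point to the faster implementation,
    labeled "A" or "B".
    """
    return {
        point: "A" if timings_a[point] <= timings_b[point] else "B"
        for point in timings_a
    }


def _floor_value(sorted_values, x):
    """Largest measured value <= x, or the smallest one if x is below all."""
    i = bisect_right(sorted_values, x)
    return sorted_values[max(i - 1, 0)]


def select_implementation(profile, num_regions, region_bytes):
    """Runtime lookup: snap the request to a measured grid point.

    Assumption: the profile covers a full rectangular grid, so every
    (regions, size) combination formed from measured values is present.
    """
    regions = sorted({p[0] for p in profile})
    sizes = sorted({p[1] for p in profile})
    key = (_floor_value(regions, num_regions), _floor_value(sizes, region_bytes))
    return profile[key]
```

In this sketch the profile is built once, offline, from micro-benchmark output, and the runtime cost of a selection is two binary searches and a dictionary lookup. A real library would likely interpolate between points or store region boundaries instead of a dense grid.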
Latency Hiding, Parallel Programming, Program Transformations, Performance Portability, Communication Code Generation
C. Iancu and S. Hofmeyr, "Runtime optimization of vector operations on large scale SMP clusters," 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT), Toronto, ON, Canada, 2008, pp. 122-132.