2006 International Conference on Parallel Processing Workshops (ICPPW'06)
Using Overdecomposition to Overlap Communication Latencies with Computation and Take Advantage of SMT Processors
Columbus, Ohio
August 14-August 18
ISBN: 0-7695-2637-3
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICPPW.2006.77
Parallel programs running on clusters are typically decomposed and mapped to run with one thread per processor, each thread working on its own disjoint subset of the data. Using a microbenchmark and the NAS benchmarks, we evaluate the performance improvements and limitations of overdecomposition, which maps multiple threads to each processor to overlap computation with communication. The experiment platform is a cluster of Pentium 4 simultaneous multithreading (SMT) processor nodes interconnected through Gigabit Ethernet. Microbenchmark results demonstrate execution time improvements of up to a factor of 1.8. For the NAS benchmarks, however, overdecomposition and SMT provide only slight performance gains, and sometimes cause significant performance loss. We evaluated how sensitive the improvements and limitations are to problem size, to communication structure, and to whether SMT is enabled. We found that performance improvements are limited by communication dependencies in the applications that restrict thread-level parallelism, by an increase in cache misses, or by increased system activity. Our study contributes a better understanding of these limitations.
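The overlap idea in the abstract can be illustrated with a minimal sketch. It is not the authors' code or benchmark: the names `worker` and `run`, the simulated latency, and the sum-of-squares workload are all assumptions made for illustration. A `time.sleep` call stands in for a blocking receive over the network, and the threads stand in for the overdecomposed work units mapped to one processor. Because CPython executes the computation of only one thread at a time, the threads behave roughly like several threads sharing a single processor: while one waits for its "message", another can compute.

```python
import threading
import time


def worker(chunk, steps, latency, partial, idx):
    """Alternate between a simulated blocking receive and local computation."""
    total = 0
    for _ in range(steps):
        time.sleep(latency)                  # assumption: stand-in for communication latency
        total += sum(x * x for x in chunk)   # assumption: stand-in for local computation
    partial[idx] = total


def run(data, threads_per_cpu, steps=5, latency=0.01):
    """Split data across threads_per_cpu threads; return (result, elapsed seconds)."""
    n = threads_per_cpu
    size = (len(data) + n - 1) // n          # ceiling division so no element is dropped
    chunks = [data[i * size:(i + 1) * size] for i in range(n)]
    partial = [0] * n
    threads = [
        threading.Thread(target=worker, args=(chunks[i], steps, latency, partial, i))
        for i in range(n)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial), time.perf_counter() - start


if __name__ == "__main__":
    data = list(range(200_000))
    for n in (1, 2, 4):
        result, elapsed = run(data, n)
        print(f"{n} thread(s): result={result}, elapsed={elapsed:.3f}s")
```

The result is the same for every thread count; only the elapsed time changes. In this toy model any gain comes from hiding the simulated latency, and it does not capture the limiting factors the abstract reports, such as communication dependencies between threads, additional cache misses, or increased system activity.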
Citation:
Lars Ailo Bongo, Brian Vinter, Otto J. Anshus, Tore Larsen, John Markus Bjørndalen, "Using Overdecomposition to Overlap Communication Latencies with Computation and Take Advantage of SMT Processors," icppw, pp.239-247, 2006 International Conference on Parallel Processing Workshops (ICPPW'06), 2006