2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT) (2010)
Vienna, Austria
Sept. 11, 2010 to Sept. 15, 2010
ISBN: 978-1-5090-5032-1
pp: 537-538
Rajesh Bordawekar, IBM Watson Research Center, Hawthorne, NY 10532, USA
Uday Bondhugula, IBM Watson Research Center, Yorktown Heights, NY 10598, USA
Ravi Rao, IBM Watson Research Center, Yorktown Heights, NY 10598, USA
ABSTRACT
In this paper, we evaluate the performance of a real-world image processing application that uses a cross-correlation algorithm to compare a given image with a reference one. We implement this algorithm on an nVidia GTX 285 GPU using CUDA, and also parallelize it for the Intel Xeon (Nehalem) and IBM Power7 processors, using both manual and automatic techniques. The manually parallelized versions use Pthreads and OpenMP with SSE and VSX vector intrinsics, while a state-of-the-art optimization framework based on the polyhedral model is used for automatic compiler parallelization and optimization. The best-performing versions on the Power7, Nehalem, and GTX 285 run in 1.02 s, 1.82 s, and 1.22 s, respectively. The performance of this algorithm on the nVidia GPU suffers from (1) a smaller shared memory, (2) unaligned device memory access patterns, (3) expensive atomic operations, and (4) weaker single-thread performance. These results conclusively demonstrate that, under certain conditions, it is possible for a FLOP-intensive structured application running on a multi-core processor to match or even beat the performance of an equivalent GPU version.
INDEX TERMS
Performance Evaluation, GPU, Multi-core, Parallel Programming
CITATION
Rajesh Bordawekar, Uday Bondhugula, Ravi Rao, "Believe it or not! multi-core CPUs can match GPU performance for a FLOP-intensive application!", 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 537-538, 2010.