2015 International Conference on Parallel Architecture and Compilation (PACT) (2015)
San Francisco, CA, USA
Oct. 18, 2015 to Oct. 21, 2015
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/PACT.2015.56
Nested thread-level parallelism (TLP) is pervasive in real applications. For example, roughly three quarters (14 out of 19) of the applications in the Rodinia benchmark suite for heterogeneous accelerators contain kernels with nested thread-level parallelism. Efficiently mapping this nested parallelism to GPU threads during C-to-CUDA compilation (OpenACC in this paper) is becoming increasingly important. The mapping problem is twofold: choosing suitable execution models and devising efficient mapping strategies for the nested parallelism.
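To make the term concrete, the following minimal sketch shows what a kernel with nested thread-level parallelism can look like in C with OpenACC directives. It is an illustration only and not the vectorization approach proposed in the paper: the function name `row_sums` and its parameters are hypothetical, and the gang/vector clauses reflect one common way such a loop nest is annotated. A compiler without OpenACC support ignores the pragmas and runs the loops sequentially.

```c
/* Hypothetical example: sum each row of an n-by-m row-major matrix.
   Names and clause choices are illustrative assumptions. */
void row_sums(const float *a, float *out, int n, int m)
{
    /* Outer level of parallelism: rows are independent, so iterations are
       distributed across gangs (commonly mapped to CUDA thread blocks). */
    #pragma acc parallel loop gang copyin(a[0:n*m]) copyout(out[0:n])
    for (int i = 0; i < n; ++i) {
        float sum = 0.0f;

        /* Inner (nested) level of parallelism: the elements of one row are
           reduced in parallel by the vector lanes (commonly CUDA threads). */
        #pragma acc loop vector reduction(+:sum)
        for (int j = 0; j < m; ++j) {
            sum += a[i * m + j];
        }

        out[i] = sum;
    }
}
```

How the inner loop is assigned to GPU threads is exactly the kind of mapping decision the abstract refers to: if `m` is small or varies from row to row, a fixed gang/vector assignment may leave many threads idle.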
Graphics processing units, Message systems, Parallel processing, Parallel architectures, Software engineering, Benchmark testing, Kernel
Shixiong Xu, David Gregg, "An Efficient Vectorization Approach to Nested Thread-level Parallelism for CUDA GPUs", 2015 International Conference on Parallel Architecture and Compilation (PACT), pp. 488-489, 2015, doi:10.1109/PACT.2015.56