2012 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (2012)
Shanghai, China
May 21, 2012 to May 25, 2012
ISBN: 978-1-4673-0974-5
pp: 2404-2413
Dynamic scheduling and varying decomposition granularity are well-known techniques for achieving high performance in parallel computing. Heterogeneous clusters with highly data-parallel processors, such as GPUs, present unique problems for the application of these techniques. These systems reveal a dichotomy between grain sizes: decompositions ideal for the CPUs may yield insufficient data-parallelism for accelerators, and decompositions targeted at the GPU may decrease performance on the CPU. This problem is typically ameliorated by statically scheduling a fixed amount of work for agglomeration. However, determining the ideal amount of work to compose requires experimentation because it varies between architectures and problem configurations. This paper describes a novel methodology for dynamically agglomerating work units at runtime and scheduling them on accelerators. This approach is demonstrated in the context of two applications: an n-body particle simulation, which offloads particle interaction work, and a parallel dense LU solver, which relocates DGEMM kernels to the GPU. In both cases, dynamic agglomeration yields results comparable to or better than statically scheduling the work across a variety of system configurations.
Graphics processing unit, Kernel, Arrays, Dynamic scheduling, Grain size, Runtime, adaptive runtime, dynamic scheduling, accelerator, GPGPU, CUDA, agglomeration
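
The contrast the abstract draws between a fixed, statically chosen batch size and agglomeration decided at runtime can be illustrated with a small sketch. This is not the paper's implementation: the names `WorkUnit`, `DynamicAgglomerator`, `max_batch_elems`, and `next_batch` are hypothetical, and the policy shown (when the accelerator becomes idle, combine whatever fine-grained units are pending, up to a capacity cap) is only one plausible reading of "dynamically agglomerating work units at runtime".

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class WorkUnit:
    """One fine-grained piece of work, e.g. a single CPU-sized task."""
    unit_id: int
    size: int  # hypothetical metric: data-parallel elements in this unit


@dataclass
class DynamicAgglomerator:
    """Collects fine-grained units and combines them into accelerator batches.

    Assumption: the batch size is not fixed in advance. Each batch is simply
    whatever has accumulated when the accelerator asks for more work, limited
    by max_batch_elems (a hypothetical cap, e.g. device memory).
    """
    max_batch_elems: int
    pending: Deque[WorkUnit] = field(default_factory=deque)

    def submit(self, unit: WorkUnit) -> None:
        """Queue a unit produced by the CPU-side decomposition."""
        self.pending.append(unit)

    def next_batch(self) -> List[WorkUnit]:
        """Called when the accelerator is idle; returns units in FIFO order.

        At least one unit is always taken, even if it alone exceeds the cap,
        so an oversized unit cannot block the queue.
        """
        batch: List[WorkUnit] = []
        total = 0
        while self.pending:
            head = self.pending[0]
            if batch and total + head.size > self.max_batch_elems:
                break
            batch.append(self.pending.popleft())
            total += head.size
        return batch


if __name__ == "__main__":
    agg = DynamicAgglomerator(max_batch_elems=100)
    for i in range(5):
        agg.submit(WorkUnit(unit_id=i, size=40))
    # Each call models one kernel launch over the agglomerated units.
    while True:
        batch = agg.next_batch()
        if not batch:
            break
        print([u.unit_id for u in batch])
```

In this sketch the batch size adapts to how quickly units arrive relative to how quickly the accelerator drains them: a fast device sees small batches, a busy one sees larger batches. A static scheme would instead wait for a preset number of units before each launch.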
