2014 23rd International Conference on Parallel Architecture and Compilation (PACT) (2014)
Aug. 23, 2014 to Aug. 27, 2014
D. Anoushe Jamshidi , Advanced Computer Architecture Laboratory, University of Michigan - Ann Arbor, MI
Mehrzad Samadi , Advanced Computer Architecture Laboratory, University of Michigan - Ann Arbor, MI
Scott Mahlke , Advanced Computer Architecture Laboratory, University of Michigan - Ann Arbor, MI
To achieve high performance on many-core architectures like GPUs, it is crucial to efficiently utilize the available memory bandwidth. Currently, it is common to use fast, on-chip scratchpad memories, such as the shared memory on GPUs' shader cores, to buffer data for computation. This buffering, however, suffers from several sources of inefficiency that prevent it from making the best use of the available memory resources: shader resources are consumed by repeated, regular address calculations; data must be shuffled multiple times through a physically unified on-chip memory; and all threads are forced to synchronize at the pace of the slowest threads to ensure read-after-write (RAW) consistency. To address these inefficiencies, we propose Data-Parallel DMA, or D2MA. D2MA is a reimagining of traditional DMA that addresses the challenges of extending DMA to thousands of concurrently executing threads. D2MA decouples address generation from the shader's computational resources, provides a more direct and efficient path for data to travel from global memory into shared memory, and introduces a novel dynamic synchronization scheme that is transparent to the programmer. These advancements allow D2MA to achieve speedups as high as 2.29× and to reduce the time spent buffering data by 81% on average.
Instruction sets, Graphics processing units, Hardware, Bandwidth, Performance evaluation, Memory management, Throughput Processing, GPUs, DMA, Software-managed Caches, Shared Memory, Dynamic Management
D. Anoushe Jamshidi, Mehrzad Samadi, Scott Mahlke, "D2MA: Accelerating coarse-grained data transfer for GPUs", 2014 23rd International Conference on Parallel Architecture and Compilation (PACT), pp. 431-442, 2014, doi:10.1145/2628071.2628072