Heterogeneous systems typically consist of a CPU, a GPU, and other accelerator devices, all integrated in a single platform with a shared high-bandwidth memory system. The specialized accelerators deliver both power and performance benefits. While the shared memory system eliminates data copies between the CPU and accelerator-owned memory, the accelerators’ differing memory access models (for cache coherency, for example) may still result in “expensive” software synchronization overhead when passing data between system components.
In this post, we’ll see how the Heterogeneous System Architecture (HSA) specification addresses the key issues in using accelerator execution units for processing. In a later post, we’ll discuss how programmers can implement the standard.
Intermediaries can diminish gains
Typically, accelerators rely on a software driver application programming interface (API) as an intermediary. This greatly increases the overhead of using accelerator functions, and it forces the application to queue up enough work for the accelerator to amortize that software overhead. While GPUs have demonstrated extraordinary gains in compute performance, in the range of several teraflops, the additional overhead caused by the software APIs diminishes some of these gains.
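To make the amortization point concrete, here is a minimal sketch of a fixed per-dispatch overhead spread across a batch of work-items. The function name and the timing figures are hypothetical and chosen only for illustration; they are not measurements from any real driver or device.

```python
def dispatch_efficiency(work_items: int, per_item_us: float, overhead_us: float) -> float:
    """Fraction of one dispatch's total time that is spent on useful work.

    Assumes a simple model: a fixed software overhead per dispatch plus a
    constant cost per work-item. Real systems are more complicated.
    """
    useful_us = work_items * per_item_us
    return useful_us / (useful_us + overhead_us)


if __name__ == "__main__":
    # Hypothetical figures: 0.01 us per work-item, 10 us of API overhead.
    for n in (100, 1_000, 10_000, 100_000):
        print(n, round(dispatch_efficiency(n, 0.01, 10.0), 3))
```

Under this model, small batches spend most of their time in the software path, so the application has to batch aggressively before the accelerator pays off. Shrinking the fixed overhead lowers the batch size at which offloading becomes worthwhile.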
The need for a heterogeneous computing standard was driven by the popularity of general-purpose GPUs (GPGPUs), but it also applies to other programmable accelerators. One of the first computing platforms and programming models for GPGPUs was Compute Unified Device Architecture (CUDA), developed by NVIDIA. However, CUDA is a proprietary interface. Several industry players have introduced their own APIs for heterogeneous computing, such as DirectCompute for GPGPU computing on Windows and Renderscript for heterogeneous computing on Android. The Khronos Group, in turn, introduced OpenCL, an open standard framework for heterogeneous computing that supports both task-level and data-level parallelism and allows platform control and program execution on compute devices.
Yet a pervasive challenge in heterogeneous computing is memory organization and the need to copy data structures between various local memories. OpenCL, for instance, doesn’t have a memory model as well defined as HSA’s: version 1.2 still depends on memory handles, and a large part of the 2.0 implementations depend on runtime-allocated memory. Few OpenCL 2.x implementations (AMD’s is one of them) support shared virtual memory (SVM) at all, especially the fine-grained access that is equivalent to the HSA paradigm, and AMD’s implementation relies on HSA features to provide it. The HSA specification addresses this problem while aiming to become a royalty-free industry standard for heterogeneous computing.
The HSA specifications define platform requirements for virtual memory, memory coherency, architected dispatch mechanisms, and power-efficient signals. The architecture uses accelerators called kernel agents to reduce or eliminate software overhead in performance-critical dispatch paths. Together, these definitions dramatically reduce the overhead and latency of dispatching work to the accelerator.
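The idea of an architected dispatch path can be sketched as a toy simulation: the application writes a packet into a shared ring buffer, publishes it by updating a doorbell value, and later checks a completion signal, with no driver call in between. This is only an illustrative model. The class and field names (`Signal`, `DispatchPacket`, `UserModeQueue`, `doorbell`) are hypothetical and do not reproduce the HSA runtime API, which is a C interface, and the "agent" here simply runs on the host.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Signal:
    """Toy completion signal: a shared integer that the agent decrements."""
    value: int = 1


@dataclass
class DispatchPacket:
    kernel: Callable[[int], None]  # work to run once per work-item
    grid_size: int                 # number of work-items
    completion: Signal             # decremented when the kernel finishes


class UserModeQueue:
    """Fixed-size ring buffer the producer fills without a driver call."""

    def __init__(self, size: int) -> None:
        self.slots: List[Optional[DispatchPacket]] = [None] * size
        self.write_index = 0  # next slot the producer will fill
        self.read_index = 0   # next slot the agent will consume
        self.doorbell = -1    # last write index published to the agent

    def enqueue(self, packet: DispatchPacket) -> None:
        if self.write_index - self.read_index >= len(self.slots):
            raise RuntimeError("queue full")
        self.slots[self.write_index % len(self.slots)] = packet
        self.doorbell = self.write_index  # "ring" the doorbell
        self.write_index += 1

    def agent_process(self) -> None:
        """Simulated kernel agent: drain every packet the doorbell published."""
        while self.read_index <= self.doorbell:
            packet = self.slots[self.read_index % len(self.slots)]
            for item in range(packet.grid_size):
                packet.kernel(item)
            packet.completion.value -= 1  # signal completion to the producer
            self.read_index += 1


def demo():
    out = [0] * 4
    done = Signal()
    queue = UserModeQueue(size=2)

    def square(i: int) -> None:
        out[i] = i * i

    queue.enqueue(DispatchPacket(square, len(out), done))
    queue.agent_process()
    return out, done.value
```

In this sketch, `demo()` enqueues one packet and lets the simulated agent run it; the producer learns the work is finished by reading the signal value rather than by calling back into a driver.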
The design allows high-level compilers and data-parallel and managed runtimes to target the accelerator hardware directly, without the translation steps typically needed to go through a high-level API at dispatch time. The architecture also allows compute kernels running on the accelerator to efficiently call back to the host for OS services such as file I/O and networking, which typically would not be available to them. This lets the accelerator operate as a true peer of the host CPU.
We’ve outlined some features of the HSA system architecture, including the use of agents, kernels, and the runtime, but we’ve yet to address the programmer’s model for using the architecture. In our next post, we’ll dive deeper into the programmer’s model and the HSA Runtime Specification. We’ll also discuss mapping HSA agent concepts to modern DSP accelerators, using HSAIL implementations of Finite Impulse Response (FIR) filters to remove memory loads.