HSA Helping to Deliver More Immersive and Better Use Cases
By Espen Oybo, GPU Product Manager, ARM

At ARM, when we design products, we consider real-world use cases first and foremost. If we do everything possible to ensure our latest CPUs, GPUs and other IP are optimized for tomorrow’s use cases, end users will greatly benefit. Take the recent trend towards virtual reality on mobile phones: long before it ever reached the public, our engineers had been working on it for some time. If we didn’t design components that could interact efficiently with each other (the video decoder, GPU and display processor in this example), the battery would drain far more quickly, and it would take much longer for virtual cinema to become a reality for the user.

Why do we talk about the use cases of tomorrow and not those of today? It is simple really: as an IP provider rather than, for example, a device manufacturer, we design hardware and software that will go into mobile devices one or two years down the line. We therefore don’t have the luxury of reacting to the latest trends; we have to be able to foresee them. Luckily this is an industry-wide challenge, and through organizations like the HSA Foundation we are able to help define best practices for hardware vendors to follow.

What makes the HSA Foundation unique in this context is that it addresses the system level while leaving (most of) the more intricate details of the individual IP components, such as the GPU, to other industry bodies to define. Other standards bodies, like Khronos, have typically built APIs covering very detailed operations of specific components; think of OpenGL ES 3.2 or OpenCL 1.2. These APIs focus on a single component (the GPU), enabling it to be used to its best advantage on its own, but not always considering the context of the wider system. That said, Khronos has also recognized the need for a system-wide view, and next-generation APIs like OpenCL 2.x and Vulkan arguably bring a wider system view into the equation.

Let’s take a look at another demanding use case: a next-generation augmented reality (AR) app running on a smartphone. Think of an app that takes input from the camera, recognizes objects in the scene and then draws 3D elements on top of it before displaying the result on the screen. A modern System on Chip (SoC) might consist of several CPUs, a GPU, a display processor, a video processor, assertive display (to handle light correction in the display according to ambient light), an ISP (to handle image stabilization and correction of the raw camera input) and a computer vision processor to help recognize objects and faces in the scene. In addition, there are advanced components like the interconnect and memory controllers which are transparent to the software developer, but are nonetheless critical to system performance and efficiency.

Execution Flow

Figure 1 – AR application execution flow

If our hypothetical AR app wants to utilize the modern SoC described above as efficiently as possible, it needs to use several of the compute components in a pipelined fashion. The app itself executes on one of the CPUs; an image is captured from the image sensor and fed into the ISP for processing. After this, it might be picked up by the computer vision processor and the GPU for object recognition. The output of this stage is fed back to the CPU, which then initiates 3D drawing on the GPU, which in turn hands its output to the display processor to apply assertive colouring and copy it to the display. What is obvious from this example is that the time and power spent on sharing data between the components (image data from a modern sensor can be substantial) might make up a significant proportion of the total execution time. See Figure 1 for a graphical representation of the execution flow. Also, since most of the processing happens outside the CPU, having CPU operations stall other components is sub-optimal. Before HSA was around, optimizing this data movement around the SoC was difficult and there were no standards to adhere to.
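To put a rough number on that data movement, consider a quick back-of-the-envelope calculation. The figures below (a 1080p RGBA frame, five hand-offs per frame, 30 frames per second) are purely illustrative assumptions, not measurements of any particular SoC or of the exact flow in Figure 1:

/* Hypothetical back-of-the-envelope sketch: the cost of copying a full
 * camera frame between pipeline stages when no memory is shared.
 * Frame size, hand-off count and frame rate are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    const double frame_bytes = 1920.0 * 1080.0 * 4.0; /* one 1080p RGBA frame, ~8.3 MB */
    const int hand_offs = 5;  /* sensor -> ISP -> vision -> CPU -> GPU -> display */
    const int fps = 30;

    double bytes_per_second = frame_bytes * hand_offs * fps;
    printf("Copy traffic: %.2f GB/s\n", bytes_per_second / 1e9);
    /* Roughly 1.24 GB/s of pure copy traffic before any real work is done,
     * which is why zero-copy sharing between IP blocks matters so much. */
    return 0;
}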

HSA specifies a memory model that greatly increases the efficiency of data sharing across the system. Shared virtual memory, discussed in greater depth in this blog post, is something of a revelation to developers in this context. The fact that another component can see the same virtual address space as the CPU makes it much easier to develop applications that use multiple compute IP blocks, as data synchronization is now handled by hardware. Cache coherency adds further benefit by greatly reducing the amount of control data going out to external memory.
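HSA itself defines the memory model rather than a host-side API, but OpenCL 2.0’s shared virtual memory exposes the same idea and makes a convenient illustration. The minimal sketch below assumes a context, command queue and kernel have already been created, and that the device supports fine-grained buffer SVM (an optional OpenCL 2.0 feature); the names process_frame and camera_frame are hypothetical. The point is that the CPU and GPU work on the same pointer, with no intermediate buffer objects and no copies between them:

#include <CL/cl.h>
#include <string.h>

/* Hypothetical helper: ctx, queue and kernel are assumed to have been
 * created earlier. Fine-grained SVM lets both CPU and GPU dereference
 * the same virtual address. */
void process_frame(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
                   const unsigned char *camera_frame, size_t frame_bytes)
{
    /* One allocation, visible at the same virtual address to CPU and GPU. */
    unsigned char *frame = clSVMAlloc(ctx,
                                      CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                                      frame_bytes, 0);

    memcpy(frame, camera_frame, frame_bytes);   /* CPU writes the input once */

    /* The GPU kernel receives the raw pointer: no cl_mem buffer, no copy. */
    clSetKernelArgSVMPointer(kernel, 0, frame);
    size_t global = frame_bytes;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clFinish(queue);

    /* The CPU reads the GPU's results through the very same pointer. */
    unsigned char first_byte = frame[0];
    (void)first_byte;

    clSVMFree(ctx, frame);
}

With coarse-grained SVM the same pattern applies, with map and unmap calls marking the windows in which the CPU accesses the buffer.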

Is data sharing all there is to HSA though?

Not at all, but it is a large part of it. Controlling tasks and scheduling work are other major components of HSA, and this is where user mode queues and device-side enqueue come into play. The goal of user mode queues is to let a user-space application launch work directly on a compute IP block without involving the OS, which translates into reduced latency. Device-side enqueue allows a compute IP block to initiate more work for itself (or for other compute IP blocks) without any CPU interaction at all. The benefit is obvious, as unnecessary interactions between components are eliminated.
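To make user mode queues concrete, here is a hedged sketch against the HSA runtime C API (hsa.h). It assumes hsa_init has been called, an agent has been selected, and that kernel_object and kernarg_address come from a previously finalized code object; error handling is omitted and the dispatch sizes are arbitrary. The key observation is that the dispatch packet is written and the doorbell rung entirely from user space, with no OS call on the critical path:

#include <hsa.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical dispatch helper: agent, kernel_object and kernarg_address
 * are assumed to exist already (finalized code object, allocated kernargs). */
void dispatch(hsa_agent_t agent, uint64_t kernel_object, void *kernarg_address)
{
    hsa_queue_t *queue;
    hsa_queue_create(agent, 256, HSA_QUEUE_TYPE_SINGLE, NULL, NULL,
                     UINT32_MAX, UINT32_MAX, &queue);

    hsa_signal_t completion;
    hsa_signal_create(1, 0, NULL, &completion);

    /* Claim a packet slot in the user-space ring buffer. */
    uint64_t index = hsa_queue_add_write_index_relaxed(queue, 1);
    hsa_kernel_dispatch_packet_t *pkt =
        (hsa_kernel_dispatch_packet_t *)queue->base_address + (index % queue->size);

    memset(pkt, 0, sizeof(*pkt));
    pkt->setup = 1 << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;  /* 1D dispatch */
    pkt->workgroup_size_x = 64;
    pkt->workgroup_size_y = 1;
    pkt->workgroup_size_z = 1;
    pkt->grid_size_x = 1024;
    pkt->grid_size_y = 1;
    pkt->grid_size_z = 1;
    pkt->kernel_object = kernel_object;
    pkt->kernarg_address = kernarg_address;
    pkt->completion_signal = completion;

    /* Publish the packet header last, then ring the doorbell, all from user space
     * (GCC/Clang atomic builtin used for the release store). */
    uint16_t header = (HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE) |
                      (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE) |
                      (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE);
    __atomic_store_n(&pkt->header, header, __ATOMIC_RELEASE);
    hsa_signal_store_relaxed(queue->doorbell_signal, (hsa_signal_value_t)index);

    /* Wait for the GPU (or other agent) to decrement the completion signal. */
    hsa_signal_wait_acquire(completion, HSA_SIGNAL_CONDITION_LT, 1,
                            UINT64_MAX, HSA_WAIT_STATE_BLOCKED);

    hsa_signal_destroy(completion);
    hsa_queue_destroy(queue);
}

Device-side enqueue works on the same principle, except that the packet is written by a kernel running on the compute IP itself rather than by the CPU.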

As we’ve seen, it is clear what the HSA Foundation brings to the table. It lays down ground rules for IP and system designers, like ARM, so that end users can enjoy more immersive and better use cases for longer. This kind of standardization can only mean good things for the industry and the end user.