“The goal of a programmer in a modern computing environment is not just to take advantage of processors with multiple or even distributed cores. It is actually to be able to write scalable applications that can take advantage of any amount of parallel or heterogeneous hardware” [1]. The quest for scaling requires attention to many factors, such as reducing data movement, serial bottlenecks (including locking), and other forms of overhead. “Ultimately it is up to the diligence and intelligence of the software developer to produce a good algorithm design” [2].
The programming model can help facilitate that design, and this blog focuses on data reorganization, which is the larger context. “This has two benefits: Data can be laid out in memory for better vectorization and data locality, and data and computation can be offloaded to attached co-processors with no changes to the code. It has the disadvantage that extra code is required to move data in and out of the data space, and extra data movement may be required” [2]. “The performance bottleneck in many applications is just as likely, if not more likely, to be due to data movement as it is to be computation. For data-intensive applications, it is often a good idea to design the data movement first and the computation around the chosen data movements. Some common applications are in fact mostly data reorganization such as searching and sorting” [3].
Assuming we want to keep our goal of portability and performance, and to achieve it in Standard C++, we can look at what all these frameworks have demonstrated. Not surprisingly, they each solve a piece of the problem, and in some cases the same problem in similar ways.
It turns out that to maintain performance portability, you usually must deal with what I call the “four horsemen of heterogeneous computing”. I have worked on integrating several GPU programming models into mainstream languages, starting with IBM’s Cell, then OpenMP, then Khronos SYCL/OpenCL, and ISO C++, and these are the prevailing problems of our age.
In every one of those languages, the same four issues have repeatedly appeared as the key challenge in getting performance:
- “Data movement
- Data layout
- Data affinity
- Data locality” [4]
There are other problems, such as how to expose the parallelism, how to choose where things run (the execution space), and where they are stored (the memory space). These horsemen, however, affect how portable, efficient and performant your model is, and it is interesting that they are solved by some aspect of every framework.
Data movement can be the most expensive. So much so that most languages offer you a way not to offload the kernel if you don’t believe the balance of work against the cost of data movement makes sense. To put numbers on it: a 64-bit double-precision operation costs 20 picojoules (20×10⁻¹² J), but moving four 64-bit values across a DRAM die costs 16 nanojoules (16×10⁻⁹ J), i.e. 800 times as much as the single 64-bit operation. Off-die movement is even more expensive, as it has to go through an interconnect, with possible cache broadcast and full snoop depending on the type of architecture and data.
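The factor of 800 follows directly from the two energy figures quoted above. A minimal sketch of the arithmetic in C++; the constant and function names are illustrative only and do not come from any library:

```cpp
// Energy figures as quoted in the text, in joules.
constexpr double kDoubleOpJoules = 20e-12;  // one 64-bit double-precision operation
constexpr double kDramMoveJoules = 16e-9;   // moving four 64-bit values across a DRAM die

// How many double-precision operations cost as much energy as one such move.
constexpr double ops_per_dram_move() {
  return kDramMoveJoules / kDoubleOpJoules;
}
```

Read the other way around, a kernel would need to perform on the order of 800 double-precision operations per DRAM move before compute energy even matches the energy spent moving the data.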
Data movement comes in two forms: explicit and implicit. Most programming models are explicit; only two are implicit. It turns out there is a need for both. You need explicit data movement if you are serving as a runtime, or need fine-grained control over how, when, and where data moves.
Implicit data movement uses a call graph defined by read/write relationships to describe what data needs to be present, just in time. It gives the runtime greater freedom in scheduling, and can potentially scale better.
If you control the entire compilation toolchain, and you can see all the code used in the program, you might be able to infer some of the necessary data movement in simple cases. However, you will end up with situations that are only known at runtime, for example a copy that happens only if the result of an operation is an odd number. Then you’ll get extra copies just to cover all cases.
OpenMP and most programming models use explicit data movement. In my 2015 LLVM talk on OpenMP 4.0 accelerator support, I showed a slide of the OpenMP directives that support data movement; most of those directives serve explicit data movement.
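As a rough illustration of explicit data movement in OpenMP, the sketch below uses the standard `target` construct with `map` clauses to state what is copied in each direction, and an `if` clause to skip offloading when the problem is small. The function name `saxpy` and the cutoff `kOffloadThreshold` are hypothetical; a real threshold would have to be measured on the target system.

```cpp
#include <cstddef>

// Hypothetical cutoff: below this size we assume the transfer costs more
// than the offloaded work saves.
constexpr std::size_t kOffloadThreshold = 1 << 16;

// Computes y = a * x + y, offloading only when the problem is large enough.
void saxpy(float a, const float* x, float* y, std::size_t n) {
  // map(to: ...)     copies x from host to device before the region
  // map(tofrom: ...) copies y to the device before and back to the host after
  // if(...)          runs the region on the host when the condition is false
  #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n]) if(n >= kOffloadThreshold)
  #pragma omp parallel for
  for (std::size_t i = 0; i < n; ++i) {
    y[i] = a * x[i] + y[i];
  }
}
```

Without an OpenMP-enabled compiler the pragmas are ignored and the loop simply runs on the host, which keeps the example portable. Every transfer is spelled out by the programmer, which is exactly what a runtime layered on top would want.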
Implicit data movement was pioneered by C++ AMP, but the language was relatively proprietary and never caught on. SYCL follows this implicit data movement model. In fact, SYCL goes further by separating the storage and access of data through the use of buffers and accessors. SYCL provides data dependency tracking based on accessors, optimized for the scheduling of tasks.
A SYCL accessor allows you to specify where you want your data to be stored or allocated on the device.
Accessors also allow you to specify the access patterns such as read, write or read/write.
SYCL accessors and buffers use a data dependency graph, a Directed Acyclic Graph (DAG), that is familiar to many in high performance computing for ordering input and output in a concurrent environment. The benefit of data dependency graphs is that they let you describe your problem in terms of relationships, and remove the need to enqueue explicit copies or write complex event handling. They also allow the runtime to optimize data movement by copying data to a device pre-emptively or just in time before kernel execution, while avoiding unnecessary copies back to the host after execution on the device. Unnecessary data movement costs execution performance.
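To make the idea of deriving a dependency graph from declared access modes concrete, here is a toy model in plain C++. It is not SYCL code and not how any SYCL runtime is implemented; the types `Access`, `Use`, `Task` and the function `derive_dependencies` are hypothetical names chosen for illustration. Each task lists the buffers it touches and how, and an edge is added whenever two tasks touch the same buffer and at least one of them writes it.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

enum class Access { Read, Write, ReadWrite };

// One declared use of a buffer by a task, loosely analogous to an accessor.
struct Use {
  int buffer;   // hypothetical buffer id
  Access mode;  // how the task touches that buffer
};

// A task is described only by the data it touches.
using Task = std::vector<Use>;

inline bool writes(Access a) { return a != Access::Read; }

// Returns edges (earlier, later): task `later` must wait for task `earlier`.
// Tasks are assumed to be listed in submission order, so the graph is acyclic.
// Transitively implied edges are kept; no reduction is attempted.
std::vector<std::pair<std::size_t, std::size_t>>
derive_dependencies(const std::vector<Task>& tasks) {
  std::vector<std::pair<std::size_t, std::size_t>> edges;
  for (std::size_t later = 1; later < tasks.size(); ++later) {
    for (std::size_t earlier = 0; earlier < later; ++earlier) {
      bool conflict = false;
      for (const Use& a : tasks[earlier]) {
        for (const Use& b : tasks[later]) {
          if (a.buffer == b.buffer && (writes(a.mode) || writes(b.mode))) {
            conflict = true;
          }
        }
      }
      if (conflict) edges.emplace_back(earlier, later);
    }
  }
  return edges;
}
```

For example, if task 0 writes buffer 0, task 1 writes buffer 1, and task 2 reads both and writes buffer 2, the only edges are from task 0 to task 2 and from task 1 to task 2. Tasks 0 and 1 share no data, so a scheduler would be free to run them concurrently, and no explicit copies or events had to be written by the user.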
Not all programs have complex dependencies (or how I learned to stop worrying and enjoy pointers)
The SYCL DAG execution model is very powerful and allows programmers to express complex dependence patterns. However, not all programs have complex dependencies, and for many simple programs an in-order queue is sufficient. Indeed, some folks have tried to build SYCL on top of other heterogeneous frameworks such as HPX or Kokkos and found that the implicit data movement, which is fantastic for users, actually gets in the way when the other runtime wishes to manage the data movement explicitly.
Even when you just want to take a simple C++ program with pointers and SYCL-ify it, you end up adding lots of buffers and accessors while turning all the user’s pointers into buffers. This is a pain point that we will address in the upcoming SYCL 2020 release, where we will enable programmers to use pointers and simplify programmer control.
Supporting pointers in SYCL requires the following:
- Single address space across host and devices thus allowing pointer-based data structures
- Pointer-based memory management
- Explicit data movement along with implicit data movement
Together, these support a feature in the upcoming SYCL 2020 known as Unified Shared Memory (USM), something I talked about in my SYCLcon 2020 keynote. Its name is derived from an open standard in OpenMP 5.0. USM allows for DAG scheduling APIs for pointer-based kernels, as well as a performance hint called prefetch, which gives fine-grained control over performance.
USM needs to support several use cases. These include explicit data movement for device allocations, which is done with memcpy calls (queue::memcpy in SYCL 2020) that copy data between host and device.
Implicit data movement is reserved for shared and host memory allocations. Host memory allocations do not migrate between host and devices, but are accessible by many devices. Shared allocations are accessible by the host and at least one device, and may migrate between host and device. In these cases, the programmer just reads and writes through shared pointers, and the driver, rather than the SYCL runtime, handles the data movement.
USM allows the system software to migrate data between CPU and accelerator with no need for explicit copies. It also allows access to pointer-based data structures on both CPU and accelerator because of the single address space. This results in a much simpler programming model when users or other frameworks need it.
In the meantime, SYCL continues to support implicit data movement to manage complex data dependencies, so both forms of data movement are available to suit your program structure. The foreseeable future will see more solid support for both implicit and explicit data movement in SYCL.
You can find out more on the SYCL community website with links to learning resources and videos. If you want to see the full SYCL specification including details on Unified Shared Memory you can find this on the Khronos website.
- James Reinders, Jeff Lait, Erwin Coumans, George ElKoura, Martin Watt. “Multithreading for visual effects”, ACM SIGGRAPH 2015 Courses (SIGGRAPH ’15), 2015
- Gordon Brown, Ruyman Reyes, Michael Wong. “Towards Heterogeneous and Distributed Computing in C++”, Proceedings of the International Workshop on OpenCL (IWOCL ’19), 2019