, Advanced Micro Devices
Pages: 24-25
Abstract—The massively multithreaded systems space will continue to expand as users clamor for more powerful systems and more exciting applications while system designers contend with power and energy constraints.
The sequential von Neumann execution model has dominated the computing landscape for more than 50 years. Applications and programming models have focused on single-thread execution, while hardware designers have strived to increase single-thread performance.
Parallel computers were developed for limited domains, and over the past 20 years they have been constructed largely by aggregating many high-performance serial processors and programmed at a coarse grain with models such as MPI or OpenMP. Unfortunately, the rate of improvement in peak single-thread performance has slowed dramatically, primarily due to power constraints.
Because future computer systems will continue to be power- and energy-constrained, they must derive performance increases by exploiting ever higher degrees of parallelism. Developers are adapting conventional CPUs by adding more processor cores per chip and often more threads per processor core as well. For example, recent Intel Xeon processors have as many as 10 cores per chip and two threads per core for a total of 20 simultaneously executing threads per chip. Other commercial products aim for even higher core counts, ranging from 50 to 100 [1,2]. Even mobile phone chips are appearing with two and four CPU cores.
While these modest thread-count increases provide limited system performance and efficiency improvements, systems employing vastly larger numbers of threads (on the order of tens of thousands) can deliver substantially greater benefits on suitably parallel applications. Such throughput-oriented computing systems accept lower single-thread performance in exchange for parallelism across a massive number of threads, and thereby achieve high overall performance.
While multithreading for throughput processing has a long history [3], it has traditionally been confined to specific domains such as graphics and a subset of high-performance computing (HPC). Recently, though, as programmers and system designers seek continued performance improvements, they are turning to massively multithreaded systems as a key enabling technology to address a wide range of problems, motivated in part by the ready availability of such engines in the graphics processing units (GPUs) of most modern systems.
These efforts make massively multithreaded architectures relevant to a much broader set of designers and developers than in the past. However, this push toward generality also exacerbates the already significant algorithmic and programming challenges that any parallel system faces.
This issue focuses on forward-looking research into more effective, more general, and more scalable massively multithreaded programs and systems. The basics of contemporary GPU architectures and programming systems have been well covered elsewhere [4-6], and we refer readers to these articles for additional background.
"Algorithm and Data Optimization Techniques for Scaling to Massively Threaded Systems," by researchers from the University of Illinois at Urbana-Champaign and KLA-Tencor, presents a collection of patterns commonly used to improve scalability on modern GPU architectures. These transformations highlight how the architectural features of GPUs require a different perspective on performance, with a focus on coordinating memory access patterns across threads to improve locality and optimize memory hierarchy utilization. The commonality of these transformations across a range of applications indicates that, once programmers have shifted their focus, they can apply these lessons across multiple algorithms and applications.
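To make the point about coordinating memory access patterns across threads concrete, the following is a minimal sketch of our own and is not taken from the article. It counts how many aligned memory segments a group of threads touches under two data layouts. The group size of 32 threads, the 128-byte segment size, the eight-field record, and the helper name `segments_touched` are all illustrative assumptions, not properties of any particular GPU.

```python
def segments_touched(addresses, segment_bytes=128):
    """Count the distinct aligned memory segments covering the byte addresses."""
    return len({addr // segment_bytes for addr in addresses})


WARP = 32         # assumed number of threads issuing a memory request together
FIELD_BYTES = 4   # one 32-bit value per field
FIELDS = 8        # hypothetical record with eight fields

# Array of structures: thread t reads field 0 of record t,
# so consecutive threads are FIELDS * FIELD_BYTES bytes apart.
aos = [t * FIELDS * FIELD_BYTES for t in range(WARP)]

# Structure of arrays: field 0 of every record is stored contiguously,
# so consecutive threads read adjacent 4-byte words.
soa = [t * FIELD_BYTES for t in range(WARP)]

aos_segments = segments_touched(aos)
soa_segments = segments_touched(soa)
print(aos_segments, soa_segments)
```

Under these assumptions the array-of-structures layout spreads one request over eight segments, while the structure-of-arrays layout fits in a single segment. This is the kind of cross-thread locality such layout transformations aim to recover.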
In "A GPU Task-Parallel Model with Dependency Resolution," researchers from the University of California, Davis, and Microsoft present a novel approach for using GPUs to address irregular workloads with interthread dependences. The authors describe a lightweight task-parallel runtime system that employs work queues and a custom scheduler, prototyped on top of contemporary commercial GPU systems. The execution model is a task graph that the system can unroll statically or dynamically and map to a task queue. Dependences are maintained in a lookup table that records the relationships between tasks and the depth of each task in the program graph. This article demonstrates how massively threaded hardware can decompose and execute both coding and search algorithms.
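The general idea of resolving dependences through a lookup table and a work queue can be sketched sequentially as follows. This is a CPU-side illustration under our own assumptions, not the authors' GPU runtime or scheduler: the function `run_task_graph` and the tables `pending`, `successors`, and `depth` are hypothetical names, and "executing" a task here only records its name.

```python
from collections import deque


def run_task_graph(deps):
    """Run a task graph in dependency order using a work queue.

    deps maps every task name to the list of tasks it depends on;
    each task must appear as a key. Returns (order, depth).
    """
    # Lookup tables: unresolved-predecessor count and successors per task.
    pending = {task: len(preds) for task, preds in deps.items()}
    successors = {task: [] for task in deps}
    for task, preds in deps.items():
        for pred in preds:
            successors[pred].append(task)

    depth = {task: 0 for task in deps}
    queue = deque(task for task in deps if pending[task] == 0)  # ready tasks
    order = []
    while queue:
        task = queue.popleft()
        order.append(task)  # stand-in for executing the task
        for succ in successors[task]:
            depth[succ] = max(depth[succ], depth[task] + 1)
            pending[succ] -= 1
            if pending[succ] == 0:  # all predecessors done: task is ready
                queue.append(succ)
    if len(order) != len(deps):
        raise ValueError("cycle detected in task graph")
    return order, depth


order, depth = run_task_graph({"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]})
print(order, depth)
```

In this sketch a task enters the queue only when its count of unresolved predecessors reaches zero. A parallel runtime would presumably need to make that decrement atomic so that many threads can retire tasks concurrently.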
In "Can GPGPU Programming Be Liberated from the Data-Parallel Bottleneck?", researchers at Advanced Micro Devices describe a programming system designed to enable efficient expression and execution of a broader range of parallelism on GPUs. The proposed heterogeneous parallel primitives (HPP) model provides first-class support for both data parallelism and task parallelism. HPP resembles conventional shared-memory models but includes features such as lightweight tasks, distributed data structures, dataflow-oriented producer/consumer channels, and barrier primitives to enforce different types of bulk-synchronous behavior. While HPP is still in the prototype stage, the authors envision how programming systems might evolve to address a wide range of parallel behaviors.
"Designing Next-Generation Massively Multithreaded Architectures for Irregular Applications," from researchers at Pacific Northwest National Laboratory, shifts from software to hardware to consider future directions for these architectures. The authors examine the Cray XMT, a massively multithreaded supercomputer designed for irregular data-intensive applications in the HPC space. Through simulation, they investigate potential architectural enhancements that developers could apply to new generations of XMT-like systems to improve their performance and scalability.
We are confident that the massively multithreaded systems space will continue to expand as users clamor for more powerful systems and more exciting applications while system designers contend with power and energy constraints. The work presented in this issue provides just a glimpse at the exciting developments under way in this area.