# Guest Editors' Introduction: Special Issue on High-Performance Computing with Accelerators

David Kaeli, IEEE
Volodymyr Kindratenko, IEEE

Pages: pp. 3-6

## Introduction

It is our honor to serve as guest editors of this special issue of the IEEE Transactions on Parallel and Distributed Systems (TPDS) on the use of accelerators in high-performance computing. There is a renewed interest in designing and building computer systems based on special-purpose chip architectures. Many research groups have already deployed experimental systems in which Field-Programmable Gate Arrays (FPGAs), Graphics Processing Units (GPUs), the Cell Broadband Engine (Cell/B.E.), and ClearSpeed, to name a few, are used as coprocessors or application-specific accelerators to speed up the execution of computationally intensive codes, and the community is starting to use AMD's Fusion and NVIDIA's Fermi chips. A few of these efforts have resulted in the deployment of large-scale systems, such as Los Alamos National Laboratory's RoadRunner, which is based on AMD Opteron nodes accelerated with IBM's PowerXCell processor, and the Chinese National University of Defense Technology's Tianhe-1 system, which combines Intel Xeon nodes with AMD GPU accelerators. Both of these systems currently rank near the top of the list of the most powerful supercomputers (TOP-500): #3 and #7, respectively. High-performance computer designers are turning toward accelerators to increase performance, reduce power requirements, and enable the most challenging applications. To realize the potential of these new systems, however, much remains to be done on the software side, as the scientific computing community is only beginning to understand the intricacies of, and interactions between, the new hardware, execution models, software architectures, development processes, and the application transformations necessary to utilize the available resources effectively.

## Overview of the Special Issue on Accelerators

We are pleased to present this special issue containing 12 high-quality contributions that discuss a range of accelerator architectures and the applications that can take advantage of them. The issue covers a variety of accelerator topics that highlight how these devices can be used effectively to break down computational barriers in challenging applications. Independent of the architecture employed, we must first identify both task-level parallelism and data-level parallelism in order to map a common set of operations to parallel hardware.

FPGAs have been used successfully to offload computational kernels—the function of the FPGA tends to be statically defined, and the accelerator hardware is typically memory mapped. This model of computation has been able to achieve application speedups of 5X-10X in many common applications (e.g., FFTs, convolution), though it is limited by off-chip pin bandwidth and can only change function through entire device reprogramming.

GPUs were originally designed to render graphics; the computer gaming community has driven the rapid advancement in the performance and core densities of high-end GPUs. Given that these devices were designed for a very different class of computing task (i.e., graphics rendering), and given the lack of general-purpose programming languages for the GPU, high-performance computing manufacturers and integrators were reluctant to install these devices in their systems. It was not until much more mature programming environments, such as NVIDIA's CUDA and Khronos' OpenCL, became available that we began to see the widespread adoption of GPUs as a standard element in most high-performance systems.

In terms of hardware architectures, the Cell/B.E. (jointly designed by Sony, Toshiba, and IBM) provides a rather unique design point, with a main Power Processing Engine (PPE) core connected by an Element Interconnect Bus to eight Synergistic Processing Elements (SPEs) that are essential for acceleration. The design provides for communications through a cache-coherent direct memory access (DMA) protocol. While the Cell/B.E. has been used in a number of acceleration configurations, the Los Alamos RoadRunner system provides the most impressive demonstration of its capability as the first supercomputer to achieve petaflop performance.

Graphics processing units (GPUs) are quickly becoming the most commonly deployed accelerator platform in the high-performance computing community. Both AMD and NVIDIA provide very impressive solutions. Only recently have these manufacturers recognized that there is a growing market for these cards in the high-performance computing domain, and they are now producing cards that lack a video output (i.e., they are compute engines, not rendering engines). The communication between the CPU (the host) and the GPU (the device) is through a custom device driver. When a CUDA or OpenCL code gets compiled, the code generated is sent as a string to the AMD or NVIDIA device driver, where the code is just-in-time compiled for the specific GPU installed. GPUs can provide impressive speedups for a variety of applications; typical speedups range from 50X to 500X.

Independent of which accelerator platform is chosen, the underlying execution characteristics of the target applications play a critical role in our ability to utilize an accelerator effectively. As we will see in the articles included in this special issue, some applications are limited by the amount of divergent and unpredictable control flow that may be fundamental to the algorithm being used. Other applications may possess irregular memory access patterns, which can limit memory efficiency.
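To make the point about memory access patterns concrete, the following is a minimal sketch of a simplified transaction-counting model. It assumes a group of 32 threads, 4-byte elements, and 128-byte aligned memory segments; these numbers and the function names are illustrative assumptions, not properties of any particular device discussed in this issue.

```python
def warp_addresses(stride_elems, threads=32, elem_bytes=4):
    """Byte addresses touched when thread t reads element t * stride_elems.

    All parameters are hypothetical defaults chosen for illustration.
    """
    return [tid * stride_elems * elem_bytes for tid in range(threads)]


def segments_touched(addresses, segment_bytes=128):
    """Count the distinct aligned segments covered by a set of byte addresses.

    In this simplified model, each distinct segment costs one memory
    transaction, so fewer segments means better memory efficiency.
    """
    return len({addr // segment_bytes for addr in addresses})


# Unit stride: neighboring threads read neighboring elements.
unit_stride = segments_touched(warp_addresses(1))

# Large stride: every thread lands in a different segment.
large_stride = segments_touched(warp_addresses(32))

print("unit stride :", unit_stride, "segment(s)")
print("stride of 32:", large_stride, "segment(s)")
```

Under these assumptions the unit-stride pattern is served by a single segment, while the strided pattern touches one segment per thread, which is the kind of gap that irregular access patterns can open up.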

## In This Issue

This special issue contains a collection of papers that nicely cover the hardware accelerator design space, especially if we consider the recent focus on many-core graphics processors and gaming chipsets, as well as applications that exploit these devices.

Engineers and scientists enjoy the ease of programming within an interpreted language framework. These environments allow them to carry out computations quickly without significant programming effort. Examples of these environments include Matlab and Octave, which are especially well suited for working with matrix data. In “Accelerating the Execution of Matrix Languages on the Cell Broadband Engine Architecture,” Raymes Khoury, Bernd Burgstaller, and Bernhard Scholz look at offloading Octave computations onto an IBM PowerXCell processor. They report speedups of up to 12X over a dual-core x86 CPU. This paper provides a typical example of the kind of benefits that can be enjoyed by moving computation to a many-core accelerator.

In “Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed-Precision Multigrid,” Dominik Göddeke and Robert Strzodka extend previous work on mixed precision iterative solvers for sparse linear equation systems by implementing the entire algorithm on the GPU. The authors present a new GPU implementation of a cyclic reduction scheme for solving tridiagonal systems in parallel and use it as a line relaxation smoother for the GPU-based multigrid solver. They show that the resulting mixed precision schemes are always faster than double precision alone and outperform tuned CPU solvers by nearly an order of magnitude.
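For readers unfamiliar with cyclic reduction, the following is a minimal serial sketch of the general technique in Python. It is not the authors' GPU implementation and does not show the mixed-precision or multigrid parts; the function name and the test system are assumptions made for illustration. The loop over odd-indexed equations at each level has independent iterations, which is the property that makes the method attractive on parallel hardware.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a tridiagonal system a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i].

    a[0] and c[-1] must be 0. Each level eliminates the even-indexed
    unknowns, recursively solves the half-size system for the odd-indexed
    ones, then back-substitutes. No pivoting is performed, so the sketch
    assumes a well-conditioned (e.g., diagonally dominant) system.
    """
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]

    ra, rb, rc, rd = [], [], [], []
    for i in range(1, n, 2):  # iterations are independent of one another
        alpha = a[i] / b[i - 1]
        if i + 1 < n:
            gamma = c[i] / b[i + 1]
            next_a, next_c, next_d = a[i + 1], c[i + 1], d[i + 1]
        else:
            gamma = next_a = next_c = next_d = 0.0
        ra.append(-alpha * a[i - 1])
        rb.append(b[i] - alpha * c[i - 1] - gamma * next_a)
        rc.append(-gamma * next_c)
        rd.append(d[i] - alpha * d[i - 1] - gamma * next_d)

    x = [0.0] * n
    x[1::2] = cyclic_reduction(ra, rb, rc, rd)

    for i in range(0, n, 2):  # back-substitute the even-indexed unknowns
        left = a[i] * x[i - 1] if i > 0 else 0.0
        right = c[i] * x[i + 1] if i + 1 < n else 0.0
        x[i] = (d[i] - left - right) / b[i]
    return x


def make_system(x_true):
    """Build a diagonally dominant test system with a known solution."""
    n = len(x_true)
    a = [0.0] + [-1.0] * (n - 1)
    b = [4.0] * n
    c = [-1.0] * (n - 1) + [0.0]
    d = []
    for i in range(n):
        left = a[i] * x_true[i - 1] if i > 0 else 0.0
        right = c[i] * x_true[i + 1] if i + 1 < n else 0.0
        d.append(left + b[i] * x_true[i] + right)
    return a, b, c, d


x_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
x = cyclic_reduction(*make_system(x_true))
print("max error:", max(abs(u - v) for u, v in zip(x, x_true)))
```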

In “A Framework for Evaluating High-Level Design Methodologies for High-Performance Reconfigurable Computers,” Esam El-Araby, Saumil G. Merchant, and Tarek El-Ghazawi consider the impact of the programming model on our ability to effectively exploit accelerators provided on the reconfigurable Cray XD1 system. In this paper, the authors use examples of imperative programming, functional programming, and dataflow programming, and develop a set of metrics that captures both the programming effort involved and the resulting performance improvements achieved when utilizing.

In the paper “Hybrid Core Acceleration of UWB SIRE Radar Signal Processing,” Song Jun Park, James A. Ross, Dale R. Shires, David A. Richie, Brian J. Henz, and Lam H. Nguyen leverage both NVIDIA and AMD GPUs to explore the range of GPU accelerator options for speeding up a key signal processing application used in military radar systems. The goal of this study was real-time obstacle detection, which was only achieved through the adoption of a GPU in their final design.

A Quantum Monte Carlo application has been redesigned and implemented to work on several application accelerators in “Comparing Hardware Accelerators in Scientific Applications: A Case Study.” Rick Weber, Akila Gothandaraman, Robert J. Hinde, and Gregory D. Peterson present design methodologies and demonstrate performance improvements of the code on NVIDIA GPUs, ATI graphics accelerators, and Xilinx FPGAs as compared with a baseline CPU implementation. The authors also consider OpenCL multicore and GPU implementations and demonstrate the OpenCL application portability between these platforms, albeit at a performance cost.

In “Accelerating Pairwise Computations on Cell Processors,” an open-source software library for accelerating pairwise computations on the Cell processor is presented. Such computations are needed in many application domains; in the general case, they have a computational complexity of $O(n^2)$. For sufficiently large $n$, a Cell implementation of pairwise computations becomes a challenge due to the limited amount of memory available in the synergistic processing elements. Abhinav Sarje, Jaroslaw Zola, and Srinivas Aluru present a scheduling algorithm, based on a particular tile decomposition scheme, that maximizes data reuse on Cell, and demonstrate the performance of their implementation on several applications.
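The general idea of tiling an all-pairs computation can be sketched as follows. This is a generic illustration rather than the authors' scheduling algorithm; the function name and the `tile` parameter, which stands in for the number of elements that fit in a small local memory, are hypothetical.

```python
from itertools import combinations


def tiled_pairs(n, tile):
    """Yield every unordered pair (i, j) with i < j exactly once, tile by tile.

    Processing one tile requires at most 2 * tile input elements to be
    resident at a time, regardless of n.
    """
    for row_start in range(0, n, tile):
        rows = range(row_start, min(row_start + tile, n))
        for col_start in range(row_start, n, tile):
            cols = range(col_start, min(col_start + tile, n))
            for i in rows:
                for j in cols:
                    if i < j:  # only filters inside diagonal tiles
                        yield i, j


# Example: sum of absolute differences over all pairs of a small data set.
data = [3, 8, 1, 9, 4, 7, 2, 6, 5, 0]
tiled_total = sum(abs(data[i] - data[j]) for i, j in tiled_pairs(len(data), 4))
direct_total = sum(abs(p - q) for p, q in combinations(data, 2))
print(tiled_total == direct_total)
```

Because the block of rows stays fixed while the inner loop walks across column tiles, the row elements are reused for every column tile, which is the kind of data reuse a limited local store rewards.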

A high-level directive-based language for CUDA programming is presented in “hiCUDA: High-Level GPGPU Programming.” Tianyi David Han and Tarek S. Abdelrahman describe a set of new directives (pragmas) and present a prototype compiler that translates a C program with the hiCUDA directives to a CUDA program. A set of CUDA benchmarks is used to demonstrate the effectiveness of the developed compiler.

In “Design and Performance Evaluation of Image Processing Algorithms on GPUs,” In Kyu Park, Nitin Singhal, Man Hee Lee, Sungdae Cho, and Chris W. Kim consider a range of image processing algorithms running on NVIDIA GPUs. They compare algorithms taken from four different domains: 3D imaging, feature extraction, image compression, and computational photography. The result is a general set of metrics for evaluating the suitability of image processing algorithms for a GPU.

In the article titled “Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures,” Byunghyun Jang, Dana Schaa, Perhaad Mistry, and David Kaeli demonstrate that by applying a set of profile-guided memory transformations on a GPU, very significant performance benefits can be reaped. The paper presents a model for characterizing the memory access patterns present in the data structures accessed in a kernel, and then applies transformations that improve both our ability to use vectorization (for AMD GPUs) and memory coalescing (for NVIDIA GPUs).

In “Automatic Generation of Multicore Chemical Kernels,” John C. Linford, John Michalakes, Manish Vachharajani, and Adrian Sandu extend a kinetic preprocessor tool, used in atmospheric modeling codes to generate chemical kinetics code for scalar architectures, so that it also generates numerical solvers for chemical reaction network problems targeting NVIDIA GPUs, the Cell processor, and multicore processors. The article presents a comparative performance analysis of chemical kernels from two atmospheric community codes with the kernels generated by the proposed tool across different accelerator platforms.

In “Accelerating Wavelet Lifting on Graphics Hardware Using CUDA,” Wladimir J. van der Laan, Andrei C. Jalba, and Jos B.T.M. Roerdink present a fast wavelet lifting scheme, implemented on NVIDIA GPUs, that is extendable to any number of dimensions. The authors also provide a theoretical performance model that is in good agreement with the experimental observations.
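As a reminder of what a lifting step looks like, here is a minimal one-dimensional sketch using the integer Haar wavelet, chosen only because it is the shortest example; it is not the wavelet family or the GPU implementation from the paper, and the function names are assumptions. A lifting transform splits the signal into even and odd samples, predicts the odd samples from the even ones, and then updates the even samples, and each step can be undone exactly.

```python
def haar_lift_forward(x):
    """One level of the integer Haar transform via lifting (even-length input)."""
    even, odd = x[0::2], x[1::2]
    # Predict: the detail is what the even neighbor fails to predict.
    detail = [o - e for e, o in zip(even, odd)]
    # Update: adjust the even samples so they carry the (floored) pair average.
    approx = [e + (d >> 1) for e, d in zip(even, detail)]
    return approx, detail


def haar_lift_inverse(approx, detail):
    """Undo the lifting steps in reverse order to recover the signal exactly."""
    even = [s - (d >> 1) for s, d in zip(approx, detail)]
    odd = [d + e for e, d in zip(even, detail)]
    out = [0] * (2 * len(even))
    out[0::2], out[1::2] = even, odd
    return out


signal = [5, 7, 2, 9, -3, 4, 8, 8]
approx, detail = haar_lift_forward(signal)
print(approx)   # [6, 5, 0, 8]
print(detail)   # [2, 7, 7, 0]
print(haar_lift_inverse(approx, detail) == signal)  # True
```

Each output element depends only on a fixed, small neighborhood of inputs, so all elements of a lifting step can be computed independently; a multidimensional transform applies the same one-dimensional steps along each axis in turn.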

In “Assessing Accelerator-Based HPC Reverse Time Migration,” Mauricio Araya-Polo, Javier Cabezas, Mauricio Hanzich, Miquel Pericas, Félix Rubio, Isaac Gelado, Muhammad Shafiq, Enric Morancho, Nacho Navarro, Eduard Ayguade, José María Cela, and Mateo Valero present an implementation of the Reverse Time Migration seismic imaging technique on Cell, NVIDIA GPU, and SGI RC100 FPGA platforms, and compare development methodologies and performance improvements across these platforms. They also provide a performance prediction analysis for the Convey HC-1 FPGA-based system. Three-dimensional stencil computations are at the core of the Reverse Time Migration algorithm; their efficient implementation on the three accelerator-based platforms is the main subject of this paper.

We received a large number of high-quality submissions and it has been a challenge to select a subset of the best papers for inclusion in this special issue. We would like to thank the reviewers whose thorough and thoughtful reviews made this task much easier.

David Kaeli