Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques (2013)
Edinburgh, United Kingdom
Sept. 7, 2013 to Sept. 11, 2013
Venkatraman Govindaraju , Dept. of Comput. Sci., Univ. of Wisconsin - Madison, Madison, WI, USA
Tony Nowatzki , Dept. of Comput. Sci., Univ. of Wisconsin - Madison, Madison, WI, USA
Karthikeyan Sankaralingam , Dept. of Comput. Sci., Univ. of Wisconsin - Madison, Madison, WI, USA
Modern microprocessors exploit data-level parallelism through in-core data-parallel accelerators in the form of short vector ISA extensions such as SSE/AVX and NEON. Although these ISA extensions have existed for decades, compilers do not generate good quality, high-performance vectorized code without significant programmer intervention and manual optimization. The fundamental problem is that the architecture is too rigid, which overly complicates the compiler's role and simultaneously restricts the types of codes that the compiler can profitably map to these data-parallel accelerators. We take a fundamentally new approach that first makes the architecture more flexible and exposes this flexibility to the compiler. Counter-intuitively, increasing the complexity of the accelerator's interface to the compiler enables a more robust and efficient system that supports many types of codes. This system also enables the performance of auto-acceleration to be comparable to that of manually-optimized implementations. To address the challenges of compiling for flexible accelerators, we propose a variant of the Program Dependence Graph called the Access Execute Program Dependence Graph to capture spatio-temporal aspects of memory accesses and computations. We implement a compiler that uses this representation and evaluate it by considering both a suite of kernels developed and tuned for SSE, and “challenge” data-parallel applications, the Parboil benchmarks. We show that our compiler, which targets the DySER accelerator, provides high-quality code for the kernels and full applications, commonly reaching within 30% of manually-optimized implementations and outperforming compiler-produced SSE code by 1.8×.
Vectors, Computer architecture, Program processors, Acceleration, Hardware, Ports (Computers), Optimization
V. Govindaraju, T. Nowatzki and K. Sankaralingam, "Breaking SIMD shackles with an exposed flexible microarchitecture and the access execute PDG," Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT), Edinburgh, United Kingdom, 2013, pp. 341-351.