2013 25th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) (2013)
Porto de Galinhas, Pernambuco, Brazil
Oct. 23, 2013 to Oct. 26, 2013
This paper presents preliminary performance comparisons of parallel applications developed natively for the Intel Xeon Phi accelerator using three different parallel programming environments and their associated runtime systems. We compare Intel OpenMP, Intel CilkPlus and XKaapi on the same benchmark suite, and we compare results between an Intel Xeon Phi coprocessor and a Sandy Bridge Xeon-based machine. Our benchmark suite is composed of three computing kernels: a Fibonacci computation that lets us study the overhead and scalability of the runtime system, an NQueens application that generates irregular and dynamic tasks, and a Cholesky factorization algorithm. We also compare our Cholesky factorization with the parallel algorithm provided by the Intel MKL library for Intel Xeon Phi. The performance evaluation shows that our XKaapi data-flow parallel programming environment has the lowest overhead of the three and is highly competitive with the native OpenMP and CilkPlus environments on Xeon Phi. Moreover, its efficient handling of data-flow dependencies between tasks lets XKaapi expose more parallelism for some applications, such as the Cholesky factorization. In that case, we observe substantial gains over the state-of-the-art MKL with up to 180 hardware threads, including a 47% performance increase with 60 hardware threads.
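To illustrate why a Fibonacci kernel is a useful probe of runtime overhead, here is a minimal sketch of a task-parallel recursive Fibonacci written with standard OpenMP tasks. It is not the benchmark code used in the paper, and it does not show the XKaapi or CilkPlus APIs; the function names `fib_task` and `fib_parallel` are hypothetical. Each task does almost no work beyond creating another task, so the running time is dominated by task creation and scheduling costs.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch only (hypothetical names, not the paper's code).
// Each recursive call spawns one child task, so nearly all of the
// running time is spent in the runtime's task management.
std::uint64_t fib_task(int n) {
    assert(n >= 0);  // precondition: non-negative index
    if (n < 2) return static_cast<std::uint64_t>(n);

    std::uint64_t x = 0, y = 0;

    // Spawn fib(n-1) as a child task; x is written by the child.
    #pragma omp task shared(x) firstprivate(n)
    x = fib_task(n - 1);

    // Compute fib(n-2) in the current task.
    y = fib_task(n - 2);

    // Wait for the child task before combining the results.
    #pragma omp taskwait
    return x + y;
}

// Entry point: open a parallel region and let a single thread create
// the root of the task tree; the other threads execute spawned tasks.
std::uint64_t fib_parallel(int n) {
    std::uint64_t result = 0;
    #pragma omp parallel
    #pragma omp single
    result = fib_task(n);
    return result;
}
```

Compiled without OpenMP support, the pragmas are ignored and the code runs sequentially with the same result, which gives a simple baseline for estimating the overhead added by the tasking runtime.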
accelerators, data-flow programming, work stealing, runtime systems, Intel Xeon Phi
J. V. Lima, F. Broquedis, T. Gautier and B. Raffin, "Preliminary Experiments with XKaapi on Intel Xeon Phi Coprocessor," 2013 25th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Porto de Galinhas, Pernambuco, Brazil, 2014, pp. 105-112.