2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid) (2015)
Shenzhen, China
May 4, 2015 to May 7, 2015
ISBN: 978-1-4799-8006-2
pp: 713-716
ABSTRACT
Intel Initial Many-Core Instructions (IMCI) for Xeon Phi introduce hardware-implemented Gather and Scatter (G/S) instructions, which load/store the contents of SIMD registers from/to non-contiguous memory locations. However, they can be one of the key performance bottlenecks on Xeon Phi. Modeling G/S can provide insight into performance on Xeon Phi, but the existing solution requires a hand-written assembly implementation. We therefore modeled G/S with hardware performance counters, which can be profiled with tools such as PAPI. We used Address Generation Interlock (AGI) events as the number of G/S operations, estimated the average latency of G/S with VPU_DATA_READ, and combined the two to model the total latency of G/S. We applied our model to a 3D 7-point stencil, and the result showed that G/S accounted for nearly 40% of total kernel time. We also validated the model by implementing a G/S-free version with intrinsics. The contribution of this work is a performance model for G/S built with hardware counters. We believe the model is generally applicable to CPUs as well.
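The way the abstract combines the two quantities can be illustrated with a minimal sketch. It assumes the simplest possible combination: total G/S latency is the G/S count multiplied by the average G/S latency, and the share of kernel time is that product divided by the kernel's cycle count. The abstract does not give the exact formula or say how the average latency is derived from VPU_DATA_READ, so the function names, parameters, and input values below are hypothetical placeholders, not the authors' implementation or measured data.

```python
def gs_total_latency(num_gs, avg_latency_cycles):
    """Estimated total cycles spent in gather/scatter.

    Assumption: total = count * average latency. num_gs stands in for
    the profiled AGI event count; avg_latency_cycles stands in for the
    average latency estimated from VPU_DATA_READ.
    """
    if num_gs < 0 or avg_latency_cycles < 0:
        raise ValueError("inputs must be non-negative")
    return num_gs * avg_latency_cycles


def gs_fraction(num_gs, avg_latency_cycles, kernel_cycles):
    """Share of total kernel cycles attributed to gather/scatter."""
    if kernel_cycles <= 0:
        raise ValueError("kernel_cycles must be positive")
    return gs_total_latency(num_gs, avg_latency_cycles) / kernel_cycles


if __name__ == "__main__":
    # Made-up illustrative inputs, not measurements from the paper.
    share = gs_fraction(num_gs=1_000_000,
                        avg_latency_cycles=20.0,
                        kernel_cycles=50_000_000)
    print(f"Estimated G/S share of kernel time: {share:.0%}")
```

In practice the three inputs would come from a counter-profiling tool such as PAPI; the sketch deliberately leaves that step out because the event configuration is not described in the abstract.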
INDEX TERMS
Hardware, Radiation detectors, Mathematical model, Kernel, Three-dimensional displays, Analytical models, Solid modeling
CITATION

J. Lin, A. Nukada and S. Matsuoka, "Modeling Gather and Scatter with Hardware Performance Counters for Xeon Phi," 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Shenzhen, China, 2015, pp. 713-716.
doi:10.1109/CCGrid.2015.59