2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid) (2015)
May 4, 2015 to May 7, 2015
Intel Initial Many-Core Instructions (IMCI) for Xeon Phi introduces hardware-implemented Gather and Scatter (G/S) load/store contents of SIMD registers from/to non-contiguous memory locations. However, they can be one of key performance bottlenecks for Xeon Phi. Modelling G/S can provide insights to the performance on Xeon Phi, however, the existing solution needs a hand-written assembly implementation. Therefore, we modeled G/S with hardware performance counters which can be profiled by the tools like PAPI. We profiled Address Generation Interlock (AGI) events as the number of G/S, estimated the average latency of G/S with VPU_DATA_READ, and combined them to model the total latencies of G/S. We applied our model to the 3D 7-point stencil and the result showed G/S spent nearly 40% of total kernel time. We also validated the model by implementing a G/S- free version with intrinsics. The contribution of the work is a performance model for G/S built with hardware counters. We believe the model can be generally applicable to CPU as well.
Hardware, Radiation detectors, Mathematical model, Kernel, Three-dimensional displays, Analytical models, Solid modeling
J. Lin, A. Nukada and S. Matsuoka, "Modeling Gather and Scatter with Hardware Performance Counters for Xeon Phi," 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)(CCGRID), Shenzhen, China, 2015, pp. 713-716.