2015 International Conference on Parallel Architecture and Compilation (PACT) (2015)
San Francisco, CA, USA
Oct. 18, 2015 to Oct. 21, 2015
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/PACT.2015.24
Atomic operations (atomics) such as Compare-and-Swap (CAS) or Fetch-and-Add (FAA) are ubiquitous in parallel programming. Yet, performance tradeoffs between these opera-tions and various characteristics of such systems, such as the structure of caches, are unclear and have not been thoroughly analyzed. In this paper we establish an evaluation methodology, develop a performance model, and present a set of detailed benchmarks for latency and bandwidth of different atomics. Weconsider various state-of-the-art x86 architectures: Intel Haswell, Xeon Phi, Ivy Bridge, and AMD Bulldozer. The results unveil surprising performance relationships between the considered atomics and architectural properties such as the coherence state of the accessed cache lines. One key finding is that all the tested atomics have comparable latency and bandwidth even if they are characterized by different consensus numbers. Another insight is that the design of atomics prevents any instruction-level parallelism even if there are no dependencies between the issued operations. Finally, we discuss solutions to the discovered performance issues in the analyzed architectures. Our analysis can be used for making better design and algorithmic decisions in parallel programming on various architectures deployed in both off-the-shelf machines and large compute systems.
Benchmark testing, Land vehicles, Protocols, Bridges, Multicore processing, Bandwidth,Parallel programming, Read-modify-write, Atomic Operations, CPU architecture
Hermann Schweizer, Maciej Besta, Torsten Hoefler, "Evaluating the Cost of Atomic Operations on Modern Architectures", 2015 International Conference on Parallel Architecture and Compilation (PACT), vol. 00, no. , pp. 445-456, 2015, doi:10.1109/PACT.2015.24