Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques (2013)
Edinburgh, United Kingdom
Sept. 7, 2013 to Sept. 11, 2013
ISSN: 1089-795X
ISBN: 978-1-4799-1018-2
pp: 299-308
Lei Fang , Dept. of ISEE, Zhejiang Univ., Hangzhou, China
Peng Liu , Dept. of ISEE, Zhejiang Univ., Hangzhou, China
Qi Hu , Dept. of ISEE, Zhejiang Univ., Hangzhou, China
Michael C. Huang , Dept. of ECE, Univ. of Rochester, Rochester, NY, USA
Guofan Jiang , IBM China Syst. & Technol. Lab., Shanghai, China
ABSTRACT
Mainstream chip multiprocessors already include a significant number of cores, which makes straightforward snooping-based cache coherence less appropriate. Further increases in core count will almost certainly require more sophisticated tracking of data sharing to minimize unnecessary messages and cache snooping. Directory-based coherence has been the standard solution for large-scale shared-memory multiprocessors and is a clear candidate for on-chip coherence maintenance. A vanilla directory design, however, uses storage inefficiently to keep coherence metadata. The result is a high storage overhead at larger scales. Reducing this overhead saves resources that can be redeployed for other purposes. In this paper, we exploit familiar characteristics of coherence metadata from novel angles and propose two practical techniques to increase the expressiveness of directory entries, particularly for chip multiprocessors. First, it is well known that the vast majority of cache lines have a small number of sharers. We exploit a related fact with a subtle but important difference: a significant portion of directory entries only need to track one node. We can thus use a hybrid representation of the sharer list for the whole set. Second, contiguous memory regions often share the same coherence characteristics and can be tracked by a single entry. We propose a multi-granular mechanism that does not rely on any profiling, compiler, or OS support to identify such regions. Moreover, it allows line and region entries to co-exist in the same locations, thus making regions more applicable. We show that both techniques improve the expressiveness of directory entries and, when combined, can reduce directory storage by more than an order of magnitude with negligible loss of precision.
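The storage argument in the abstract can be illustrated with a little arithmetic. The sketch below is not the paper's design: it only compares a full bit-vector sharer list (one bit per node) with a single-node pointer, and computes an average entry size when some fraction of entries needs to track just one node. The function names and the example fraction are assumptions made for illustration, and the sketch ignores any extra bits needed to distinguish the two entry formats.

```python
def full_vector_bits(num_nodes: int) -> int:
    """Bits for a full-map sharer list: one presence bit per node."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be positive")
    return num_nodes


def pointer_bits(num_nodes: int) -> int:
    """Bits needed to name exactly one node out of num_nodes."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be positive")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 2, without floats.
    return max(1, (num_nodes - 1).bit_length())


def average_entry_bits(num_nodes: int, single_node_fraction: float) -> float:
    """Average sharer-list bits per entry under a hypothetical mix.

    single_node_fraction is the (assumed) share of entries that track a
    single node and can use a pointer; the rest use a full bit vector.
    """
    if not 0.0 <= single_node_fraction <= 1.0:
        raise ValueError("single_node_fraction must be in [0, 1]")
    return (single_node_fraction * pointer_bits(num_nodes)
            + (1.0 - single_node_fraction) * full_vector_bits(num_nodes))


if __name__ == "__main__":
    nodes = 64
    for fraction in (0.0, 0.5, 0.9):  # illustrative values, not measured data
        print(nodes, fraction, average_entry_bits(nodes, fraction))
```

With 64 nodes, a pointer takes 6 bits against 64 for a full vector, so the average entry shrinks as the single-node fraction grows. The fractions used here are placeholders, not results from the paper.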
INDEX TERMS
Vectors, Coherence, System-on-chip, Tiles, Indexes, Optimization, Target tracking, polyhedral model, communication optimization, data movement, distributed memory, heterogeneous architectures
CITATION
Lei Fang, Peng Liu, Qi Hu, Michael C. Huang, Guofan Jiang, "Generating efficient data movement code for heterogeneous architectures with distributed-memory", Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, pp. 299-308, 2013, doi:10.1109/PACT.2013.6618826