2009 18th International Conference on Parallel Architectures and Compilation Techniques (2009)
Raleigh, North Carolina, USA
Sept. 12, 2009 to Sept. 16, 2009
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/PACT.2009.31
Bulk memory copying and initialization is one of the most ubiquitous operations performed in current computer systems by both user applications and Operating Systems. While many current systems rely on a loop of loads and stores, there are proposals to introduce a single instruction to perform bulk memory copying. While such an instruction can improve performance due to generating fewer TLB and cache accesses, and requiring fewer pipeline resources, in this paper we show that the key to significantly improving the performance is removing pipeline and cache bottlenecks of the code that follows the instructions. We show that the bottlenecks arise due to (1) the pipeline clogged by the copying instruction, (2) lengthened critical path due to dependent instructions stalling while waiting for the copying to complete, and (3) the inability to specify (separately) the cacheability of the source and destination regions. We propose FastBCI, an architecture support that achieves the granularity efficiency of a bulk copying/ initialization instruction, but without its pipeline and cache bottlenecks. When applied to OS kernel buffer management, we show that on average FastBCI achieves anywhere between 23% to 32% speedup ratios, which is roughly 3x-4x of an alternative scheme, and 1.5x-2x of a highly optimistic DMA with zero setup and interrupt overheads.
memory copying, memory initialization, cache affinity, cache neutral, early retirement
R. Iyer, L. Zhao, Y. Solihin and X. Jiang, "Architecture Support for Improving Bulk Memory Copying and Initialization Performance," 2009 18th International Conference on Parallel Architectures and Compilation Techniques(PACT), Raleigh, North Carolina, USA, 2009, pp. 169-180.