2015 International Conference on Parallel Architecture and Compilation (PACT) (2015)
San Francisco, CA, USA
Oct. 18, 2015 to Oct. 21, 2015
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/PACT.2015.27
Coherence misses are an important factor in limiting the scalability of multi-threaded shared memory applications on chip multiprocessors (CMPs) that are envisaged to contain dozens of cores in the imminent future. This paper proposes a novel approach to tackling this problem by leveraging the growingly important paradigm of approximate computing. Many applications are either tolerant to slight errors in the output or, if stringent, have in-built resiliency to tolerate some errors in the execution. The approximate computing paradigm suggests breaking conventional barriers of mandating stringent correctness on the hardware, allowing more flexibility in the performance-power-reliability design space. Taking the multi-threaded applications in the SPLASH-2 benchmark suite, we note that nearly all these applications have such inherent resiliency and/or tolerance to slight errors in the output. Based on this observation, we propose to approximate coherence-related load misses by returning stale values, i.e., the version at the time of the invalidation. We show that returning such values from the invalidated lines already present in d-L1 offers only limited scope for improvement since those lines get evicted fairly soon due to the high pressure on d-L1. Instead, we propose a very small (8 lines) Stale Victim Cache (SVC), to hold such lines upon d-L1 eviction. While this does offer significant improvement, there is the possibility of data getting very stale in such a structure, making it highly sensitive to the choice of what data to keep, and for how long. To address these concerns, we propose to time-out these lines from the SVC to limit their staleness in a mechanism called SVC+TB. We show that SVC+TB provides as much as 28.6% speedup in some SPLASH-2 applications, with an average speedup between 10-15% across the entire suite, becoming comparable to an ideal execution that does not incur coherence misses. Further, the consequent approximations have little impact on the correctness, allowing all of them to complete. There were no errors, because of inherent application resilience, in eleven applications, and the maximum error was at most 0.08% across the entire suite.
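The idea of a small stale victim cache with time-outs can be illustrated with a short software model. This is only a sketch under stated assumptions, not the hardware design from the paper: the class name `StaleVictimCache`, the FIFO replacement policy, the `timeout` value, and the choice to measure age from the moment a line enters the structure are all hypothetical. The abstract specifies only the 8-line capacity and that lines are timed out to limit staleness.

```python
from collections import OrderedDict


class StaleVictimCache:
    """Toy model of a small buffer of invalidated lines evicted from d-L1.

    Assumptions (not from the paper): FIFO replacement, and a line's age
    is counted in abstract time units from its insertion.
    """

    def __init__(self, capacity=8, timeout=1000):
        self.capacity = capacity      # 8 lines, as stated in the abstract
        self.timeout = timeout        # hypothetical staleness bound
        self.lines = OrderedDict()    # addr -> (stale_value, inserted_at)

    def insert(self, addr, stale_value, now):
        """Hold an invalidated line when d-L1 evicts it."""
        self.lines.pop(addr, None)            # refresh an existing entry
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)    # drop the oldest entry (FIFO)
        self.lines[addr] = (stale_value, now)

    def lookup(self, addr, now):
        """On a coherence load miss, return a stale value or None.

        None means the load must be serviced as an ordinary miss.
        """
        entry = self.lines.get(addr)
        if entry is None:
            return None
        value, inserted_at = entry
        if now - inserted_at > self.timeout:
            del self.lines[addr]              # too stale: time the line out
            return None
        return value


svc = StaleVictimCache(capacity=8, timeout=100)
svc.insert(0x40, 7, now=0)
fresh_enough = svc.lookup(0x40, now=50)    # within the time-out window
too_stale = svc.lookup(0x40, now=200)      # past the time-out window
```

In this model a hit returns the value the line held when it was invalidated, and a timed-out or absent line falls back to a regular coherence miss, which is the trade-off between speedup and staleness the abstract describes.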
Coherence, Hardware, Static VAr compensators, Prefetching, Benchmark testing, Contracts
P. V. Rengasamy, A. Sivasubramaniam, M. T. Kandemir and C. R. Das, "Exploiting Staleness for Approximating Loads on CMPs," 2015 International Conference on Parallel Architecture and Compilation (PACT), San Francisco, CA, USA, 2015, pp. 343-354.