Parallel and Distributed Processing Symposium, International (2007)
Long Beach, CA, USA
Mar. 26, 2007 to Mar. 30, 2007
ISBN: 1-4244-0909-8
pp: 346
Konrad Malkowski, Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, 16802. Tel: 814-865-9505, E-mail: malkowsk@cse.psu.edu
Greg Link, Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, 16802. Tel: 814-865-9505, E-mail: link@cse.psu.edu
Padma Raghavan, Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, 16802. Tel: 814-865-9505, E-mail: raghavan@cse.psu.edu
Mary Jane Irwin, Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, 16802. Tel: 814-865-9505, E-mail: mji@cse.psu.edu
ABSTRACT
Modern CPUs operate at GHz frequencies, but memory access latencies remain relatively large, on the order of hundreds of cycles. Deeper cache hierarchies with larger caches can mask these latencies for codes with good data locality and reuse, such as structured dense matrix computations. However, cache hierarchies do not necessarily benefit sparse scientific computing codes, which tend to have limited data locality and reuse. We therefore propose a new memory architecture with a Load Miss Predictor (LMP), which includes a data bypass cache and a predictor table. The LMP reduces access latencies by determining whether a load should bypass the main cache hierarchy and issue an early load to main memory. Our architecture uses the L2 and lower-level caches as a victim cache for data removed from our bypass cache. We use cycle-accurate simulations with SimpleScalar and Wattch to show that our LMP improves the performance of sparse codes, our application domain of interest, by 14% on average, with a 13.6% increase in power. When the LMP is used with dynamic voltage and frequency scaling (DVFS), performance can be improved by 8.7% with system power savings of 7.3% and an energy reduction of 17.3% at 1800 MHz relative to the base system at 2000 MHz. Alternatively, our LMP can be used to improve the performance of SPEC benchmarks by an average of 2.9% at the cost of a 7.1% increase in average power.
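To make the idea of a predictor table concrete, the following is a minimal illustrative sketch and not the authors' design, which the abstract does not specify. It assumes a table of 2-bit saturating counters indexed by the load instruction's program counter; the class name `LoadMissPredictor`, the table size, the threshold, and the `route_load` helper are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class LoadMissPredictor:
    """Hypothetical predictor table: 2-bit saturating counters indexed by load PC."""
    entries: int = 1024      # table size (assumed, not from the paper)
    threshold: int = 2       # counter value at or above which a miss is predicted
    table: list = field(default_factory=list)

    def __post_init__(self):
        # Every counter starts at 0, i.e. "predict cache hit".
        self.table = [0] * self.entries

    def _index(self, pc: int) -> int:
        # Drop the low 2 bits (assumed 4-byte instructions), then wrap to table size.
        return (pc >> 2) % self.entries

    def predict_miss(self, pc: int) -> bool:
        return self.table[self._index(pc)] >= self.threshold

    def update(self, pc: int, missed: bool) -> None:
        # Train on the observed outcome; counters saturate at 0 and 3.
        i = self._index(pc)
        if missed:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def route_load(lmp: LoadMissPredictor, pc: int) -> str:
    """Send a load down the bypass path if a miss is predicted, else to the caches."""
    return "bypass" if lmp.predict_miss(pc) else "cache"
```

In this sketch, a load whose recent history shows repeated misses is routed to the bypass path, which stands in for issuing an early load to main memory; a single later hit lowers the counter below the threshold and returns the load to the normal cache hierarchy.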
CITATION

P. Raghavan, K. Malkowski, M. J. Irwin and G. Link, "Load Miss Prediction - Exploiting Power Performance Trade-offs," 2007 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Long Beach, CA, USA, 2007, pp. 346.
doi:10.1109/IPDPS.2007.370536