Issue No. 3, May-June 2013 (vol. 33)
pp. 78-85
Timothy G. Rogers, University of British Columbia
Mike O'Connor, AMD Research
Tor M. Aamodt, University of British Columbia
Highly multithreaded architectures introduce another dimension to fine-grained hardware cache management. The order in which the system's threads issue instructions can significantly impact the access stream seen by the caching system. This article studies a set of economically important server applications and presents the cache-conscious wavefront scheduling (CCWS) hardware mechanism, which uses feedback from the memory system to guide the issue-level thread scheduler and shape the access pattern seen by the first-level cache.
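The claim that thread issue order shapes the access stream seen by the cache can be illustrated with a toy model. The sketch below is not the CCWS mechanism: it assumes a small fully associative LRU cache, threads that each re-read a private set of lines, and a fixed (static) limit on how many threads may issue, whereas CCWS adjusts scheduling using feedback from the memory system. All names and parameters (`lru_hits`, `throttled`, `CAPACITY`, and so on) are hypothetical and chosen only for illustration.

```python
from collections import OrderedDict


def lru_hits(accesses, capacity):
    """Count hits for an access stream on a fully associative LRU cache."""
    cache = OrderedDict()
    hits = 0
    for line in accesses:
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used line
            cache[line] = True
    return hits


def thread_stream(tid, lines_per_thread, passes):
    """Assumed workload: each thread re-reads its own private lines."""
    base = tid * lines_per_thread
    return [base + i for _ in range(passes) for i in range(lines_per_thread)]


def round_robin(streams):
    """Issue one access per thread in turn (all streams have equal length)."""
    return [s[i] for i in range(len(streams[0])) for s in streams]


def throttled(streams, active):
    """Let only `active` threads issue at a time; run each group to completion."""
    order = []
    for start in range(0, len(streams), active):
        order.extend(round_robin(streams[start:start + active]))
    return order


# Hypothetical parameters: 8 threads, 4 lines each, 4 passes, 8-line cache.
NUM_THREADS, LINES, PASSES, CAPACITY = 8, 4, 4, 8
streams = [thread_stream(t, LINES, PASSES) for t in range(NUM_THREADS)]
total = NUM_THREADS * LINES * PASSES

rr_hits = lru_hits(round_robin(streams), CAPACITY)
th_hits = lru_hits(throttled(streams, active=2), CAPACITY)
print(f"round-robin: {rr_hits}/{total} hits")
print(f"throttled:   {th_hits}/{total} hits")
```

Both schedules issue exactly the same accesses; only the order differs. With round-robin issue, the combined working set of all eight threads exceeds the cache, so each line is evicted before its owner returns to it. Limiting issue to two threads at a time keeps the active working set within the cache, so every access after the first pass of each group hits.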
Computer architecture, Multithreading, Hardware, Memory management, Scheduling, Parallel processing, Cache storage, Program processors, parallel processors, cache-conscious wavefront scheduling, GPU, memory systems, thread scheduling, SIMD processors, CCWS, cache, locality
Timothy G. Rogers, Mike O'Connor, Tor M. Aamodt, "Cache-Conscious Thread Scheduling for Massively Multithreaded Processors," IEEE Micro, vol. 33, no. 3, pp. 78-85, May-June 2013, doi:10.1109/MM.2013.24