Issue No.03 - May/June (2012 vol.32)
pp: 7-16
Inderpreet Singh, University of British Columbia
Andrew Brownsword, Electronic Arts
Tor M. Aamodt, University of British Columbia
ABSTRACT
Programming GPUs is challenging for applications with irregular fine-grained communication between threads. To improve the programmability of GPUs and thus extend their usage to a wider range of applications, the authors propose to enable transactional memory (TM) on GPUs via Kilo TM, a novel hardware TM system that scales to thousands of concurrent transactions.
INDEX TERMS
SIMD processors, hardware-software interface, parallel processors, transactional memory, GPU, KILO TM, fine-grained communication
CITATION
Inderpreet Singh, Andrew Brownsword, Tor M. Aamodt, "Kilo TM: Hardware Transactional Memory for GPU Architectures", IEEE Micro, vol.32, no. 3, pp. 7-16, May/June 2012, doi:10.1109/MM.2012.16