Displaying 1-26 out of 26 total
Hardware Support for Prescient Instruction Prefetch
Found in: High-Performance Computer Architecture, International Symposium on
By Tor M. Aamodt, Paul Chow, Per Hammarlund, Hong Wang, John P. Shen
Issue Date:February 2004
pp. 84
This paper proposes and evaluates hardware mechanisms for supporting prescient instruction prefetch — an approach to improving single-threaded application performance by using helper threads to perform instruction prefetch. We demonstrate the need for enab...
 
Characterizing and evaluating a key-value store application on heterogeneous CPU-GPU systems
Found in: Performance Analysis of Systems and Software, IEEE International Symposium on
By Tayler H. Hetherington, Timothy G. Rogers, Lisa Hsu, Mike O'Connor, Tor M. Aamodt
Issue Date:April 2012
pp. 88-98
The recent use of graphics processing units (GPUs) in several top supercomputers demonstrates their ability to consistently deliver positive results in high-performance computing (HPC). GPU support for significant amounts of parallelism would seem to make t...
 
Cache coherence for GPU architectures
Found in: 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA)
By Inderpreet Singh, Arrvindh Shriraman, Wilson W. L. Fung, Mike O'Connor, Tor M. Aamodt
Issue Date:February 2013
pp. 578-590
While scalable coherence has been extensively studied in the context of general purpose chip multiprocessors (CMPs), GPU architectures present a new set of challenges. Introducing conventional directory protocols adds unnecessary coherence traffic overhead...
 
Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs
Found in: Microarchitecture, IEEE/ACM International Symposium on
By Xi E. Chen, Tor M. Aamodt
Issue Date:November 2008
pp. 59-70
As the number of transistors integrated on a chip continues to increase, a growing challenge is accurately modeling performance in the early stages of processor design. Analytical models have been employed to rapidly search for higher performance designs, ...
 
Cache-Conscious Thread Scheduling for Massively Multithreaded Processors
Found in: IEEE Micro
By Timothy G. Rogers, Mike O'Connor, Tor M. Aamodt
Issue Date:May 2013
pp. 78-85
Highly multithreaded architectures introduce another dimension to fine-grained hardware cache management. The order in which the system's threads issue instructions can significantly impact the access stream seen by the caching system. This article studies...
 
Cache-Conscious Wavefront Scheduling
Found in: 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
By Timothy G. Rogers, Mike O'Connor, Tor M. Aamodt
Issue Date:December 2012
pp. 72-83
This paper studies the effects of hardware thread scheduling on cache management in GPUs. We propose Cache-Conscious Wavefront Scheduling (CCWS), an adaptive hardware mechanism that makes use of a novel intra-wavefront locality detector to capture locali...
 
Kilo TM: Hardware Transactional Memory for GPU Architectures
Found in: IEEE Micro
By Wilson W.L. Fung, Inderpreet Singh, Andrew Brownsword, Tor M. Aamodt
Issue Date:May 2012
pp. 7-16
Programming GPUs is challenging for applications with irregular fine-grained communication between threads. To improve the programmability of GPUs and thus extend their usage to a wider range of applications, the authors propose to enable transactional mem...
 
Modeling Cache Contention and Throughput of Multiprogrammed Manycore Processors
Found in: IEEE Transactions on Computers
By Xi E. Chen, Tor M. Aamodt
Issue Date:July 2012
pp. 913-927
This paper proposes an analytical model for accurately predicting the impact of contention on cache miss rates. The focus is multiprogrammed workloads running on multithreaded manycore architectures. This work addresses a key challenge facing earlier cache...
 
Throughput-Effective On-Chip Networks for Manycore Accelerators
Found in: Microarchitecture, IEEE/ACM International Symposium on
By Ali Bakhoda, John Kim, Tor M. Aamodt
Issue Date:December 2010
pp. 421-432
As the number of cores and threads in manycore compute accelerators such as Graphics Processing Units (GPU) increases, so does the importance of on-chip interconnection network design. This paper explores throughput-effective network-on-chips (NoC) for fu...
 
Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow
Found in: Microarchitecture, IEEE/ACM International Symposium on
By Wilson W.L. Fung, Ivan Sham, George Yuan, Tor M. Aamodt
Issue Date:December 2007
pp. 407-420
Recent advances in graphics processing units (GPUs) have resulted in massively parallel hardware that is easily programmable and widely available in commodity desktop computer systems. GPUs typically use single-instruction, multiple-data (SIMD) pipelines t...
 
Thread block compaction for efficient SIMT control flow
Found in: High-Performance Computer Architecture, International Symposium on
By Wilson W. L. Fung, Tor M. Aamodt
Issue Date:February 2011
pp. 25-36
Manycore accelerators such as graphics processor units (GPUs) organize processing units into single-instruction, multiple data
 
Cache Coherence for GPU Architectures
Found in: IEEE Micro
By Inderpreet Singh, Arrvindh Shriraman, Wilson W.L. Fung, Mike O'Connor, Tor M. Aamodt
Issue Date:May 2014
pp. 69-79
GPUs have become an attractive target for accelerating parallel applications and delivering significant speedups and energy-efficiency gains over multicore CPUs. Programming GPUs, however, remains challenging because existing GPUs lack the well-defined mem...
 
A scalable multi-path microarchitecture for efficient GPU control flow
Found in: 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)
By Ahmed ElTantawy, Jessica Wenjie Ma, Mike O'Connor, Tor M. Aamodt
Issue Date:February 2014
pp. 248-259
Graphics processing units (GPUs) are increasingly used for non-graphics computing. However, applications with divergent control flow incur performance degradation on current GPUs. These GPUs implement the SIMT execution model by serializing the execution o...
   
Progressive-BackSpace: Efficient Predecessor Computation for Post-Silicon Debug
Found in: 2012 13th International Workshop on Microprocessor Test and Verification (MTV)
By Johnny J.W. Kuan, Tor M. Aamodt
Issue Date:December 2012
pp. 70-75
As microprocessors become more complex, finding errors in their design becomes more difficult. Most design errors are caught before the chip is fabricated, however, some make it into the fabricated design. One challenge in determining what is wrong with a ...
 
Energy efficient GPU transactional memory via space-time optimizations
Found in: Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46)
By Tor M. Aamodt, Wilson W. L. Fung
Issue Date:December 2013
pp. 408-420
Many applications with regular parallelism have been shown to benefit from using Graphics Processing Units (GPUs). However, employing GPUs for applications with irregular parallelism tends to be a risky process, involving significant effort from the progra...
     
Divergence-aware warp scheduling
Found in: Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46)
By Mike O'Connor, Timothy G. Rogers, Tor M. Aamodt
Issue Date:December 2013
pp. 99-110
This paper uses hardware thread scheduling to improve the performance and energy efficiency of divergent applications on GPUs. We propose Divergence-Aware Warp Scheduling (DAWS), which introduces a divergence-based cache footprint predictor to estimate how...
     
Designing on-chip networks for throughput accelerators
Found in: ACM Transactions on Architecture and Code Optimization (TACO)
By Ali Bakhoda, John Kim, Tor M. Aamodt
Issue Date:September 2013
pp. 1-35
As the number of cores and threads in throughput accelerators such as Graphics Processing Units (GPU) increases, so does the importance of on-chip interconnection network design. This article explores throughput-effective Network-on-Chips (NoC) for future ...
     
GPUWattch: enabling energy optimizations in GPGPUs
Found in: Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13)
By Ahmed ElTantawy, Jingwen Leng, Nam Sung Kim, Syed Gilani, Tayler Hetherington, Tor M. Aamodt, Vijay Janapa Reddi
Issue Date:June 2013
pp. 487-498
General-purpose GPUs (GPGPUs) are becoming prevalent in mainstream computing, and performance per watt has emerged as a more crucial evaluation metric than peak performance. As such, GPU architects require robust tools that will enable them to quickly expl...
     
GPUDet: a deterministic GPU architecture
Found in: Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems (ASPLOS '13)
By Hadi Jooybar, Joseph Devietti, Mike O'Connor, Tor M. Aamodt, Wilson W.L. Fung
Issue Date:March 2013
pp. 1-12
Nondeterminism is a key challenge in developing multithreaded applications. Even with the same input, each execution of a multithreaded program may produce a different output. This behavior complicates debugging and limits one's ability to test for correct...
     
Hardware transactional memory for GPU architectures
Found in: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44 '11)
By Andrew Brownsword, Inderpreet Singh, Tor M. Aamodt, Wilson W. L. Fung
Issue Date:December 2011
pp. 296-307
Graphics processor units (GPUs) are designed to efficiently exploit thread level parallelism (TLP), multiplexing execution of 1000s of concurrent threads on a relatively smaller set of single-instruction, multiple-thread (SIMT) cores to hide various long l...
     
On-chip network design considerations for compute accelerators
Found in: Proceedings of the 19th international conference on Parallel architectures and compilation techniques (PACT '10)
By Ali Bakhoda, John Kim, Tor M. Aamodt
Issue Date:September 2010
pp. 535-536
There has been little work investigating the overall performance impact of on-chip communication in manycore compute accelerators. In this paper we evaluate performance of a GPU-like compute accelerator running CUDA workloads and consisting of compute node...
     
Complexity effective memory access scheduling for many-core accelerator architectures
Found in: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42)
By Ali Bakhoda, George L. Yuan, Tor M. Aamodt
Issue Date:December 2009
pp. 34-44
Modern DRAM systems rely on memory controllers that employ out-of-order scheduling to maximize row access locality and bank-level parallelism, which in turn maximizes DRAM bandwidth. This is especially important in graphics processing unit (GPU) architectu...
     
Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware
Found in: ACM Transactions on Architecture and Code Optimization (TACO)
By George Yuan, Ivan Sham, Tor M. Aamodt, Wilson W. L. Fung
Issue Date:June 2009
pp. 1-37
Recent advances in graphics processing units (GPUs) have resulted in massively parallel hardware that is easily programmable and widely available in today's desktop and notebook computer systems. GPUs typically use single-instruction, multiple-data (SIMD) ...
     
Pangaea: a tightly-coupled IA32 heterogeneous chip multiprocessor
Found in: Proceedings of the 17th international conference on Parallel architectures and compilation techniques (PACT '08)
By Ankur Khandelwal Groen, Anne Bracy, Ethan Schuchman, Gautham Chinya, Henry Wong, Hong Jiang, Hong Wang, Jamison D. Collins, Perry H. Wang, Tor M. Aamodt
Issue Date:October 2008
pp. 133-133
Moore's Law and the drive towards performance efficiency have led to the on-chip integration of general-purpose cores with special-purpose accelerators. Pangaea is a heterogeneous CMP design for non-rendering workloads that integrates IA32 CPU cores with n...
     
Compile-time and instruction-set methods for improving floating- to fixed-point conversion accuracy
Found in: ACM Transactions on Embedded Computing Systems (TECS)
By Paul Chow, Tor M. Aamodt
Issue Date:April 2008
pp. 1-27
This paper proposes and evaluates compile time and instruction-set techniques for improving the accuracy of signal-processing algorithms run on fixed-point embedded processors. These techniques are proposed in the context of a profile guided floating- to f...
     
A framework for modeling and optimization of prescient instruction prefetch
Found in: Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems (SIGMETRICS '03)
By Antonio González, Hong Wang, John P. Shen, Paul Chow, Pedro Marcuello, Per Hammarlund, Tor M. Aamodt
Issue Date:June 2003
pp. 13-24
This paper describes a framework for modeling macroscopic program behavior and applies it to optimizing prescient instruction prefetch, a novel technique that uses helper threads to improve single-threaded application performance by performing judicious an...
     