Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers
Found in: High-Performance Computer Architecture, International Symposium on
By Santhosh Srinath, Onur Mutlu, Hyesoon Kim, Yale N. Patt
Issue Date:February 2007
pp. 63-74
High performance processors employ hardware data prefetching to reduce the negative performance impact of large main memory latencies. While prefetching improves performance substantially on many programs, it can significantly reduce performance on others....
 
Diverge-Merge Processor: Generalized and Energy-Efficient Dynamic Predication
Found in: IEEE Micro
By Hyesoon Kim, José A. Joao, Onur Mutlu, Yale N. Patt
Issue Date:January 2007
pp. 94-104
The branch misprediction penalty is a major performance limiter and a major cause of wasted energy in high-performance processors. The diverge-merge processor reduces this penalty by dynamically predicating a wide range of hard-to-predict branches at runti...
 
Wish Branches: Enabling Adaptive and Aggressive Predicated Execution
Found in: IEEE Micro
By Hyesoon Kim, Onur Mutlu, Yale N. Patt, Jared Stark
Issue Date:January 2006
pp. 48-58
The goal of wish branches is to use predicated execution for hard-to-predict dynamic branches, and branch prediction for easy-to-predict dynamic branches, thereby obtaining the best of both worlds. Wish loops, one class of wish branches, use predication to...
 
Predicting Potential Speedup of Serial Code via Lightweight Profiling and Emulations with Memory Performance Model
Found in: 2012 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)
By Minjang Kim, Pranith Kumar, Hyesoon Kim, Bevin Brett
Issue Date:May 2012
pp. 1318-1329
We achieve very small runtime overhead: approximately a 1.2-10 times slowdown and moderate memory consumption. We demonstrate the effectiveness of Parallel Prophet in eight benchmarks in the OmpSCR and NAS Parallel benchmarks by comparing our predictions ...
 
TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture
Found in: High-Performance Computer Architecture, International Symposium on
By Jaekyu Lee, Hyesoon Kim
Issue Date:February 2012
pp. 1-12
Combining CPUs and GPUs on the same chip has become a popular architectural trend. However, these heterogeneous architectures put more pressure on shared resource management. In particular, managing the last-level cache (LLC) is very critical to performanc...
 
Many-Thread Aware Prefetching Mechanisms for GPGPU Applications
Found in: Microarchitecture, IEEE/ACM International Symposium on
By Jaekyu Lee, Nagesh B. Lakshminarayana, Hyesoon Kim, Richard Vuduc
Issue Date:December 2010
pp. 213-224
We consider the problem of how to improve memory latency tolerance in massively multithreaded GPGPUs when the thread-level parallelism of an application is not sufficient to hide memory latency. One solution used in conventional CPU systems is prefetching,...
 
Age based scheduling for asymmetric multiprocessors
Found in: SC Conference
By Nagesh B. Lakshminarayana, Jaekyu Lee, Hyesoon Kim
Issue Date:November 2009
pp. 1-12
Asymmetric (or Heterogeneous) Multiprocessors are becoming popular in the current era of multi-cores due to their power efficiency and potential performance and energy efficiency. However, scheduling of multithreaded applications in Asymmetric Multiprocess...
 
Virtual Program Counter (VPC) Prediction: Very Low Cost Indirect Branch Prediction Using Conditional Branch Prediction Hardware
Found in: IEEE Transactions on Computers
By Hyesoon Kim, José A. Joao, Onur Mutlu, Chang Joo Lee, Yale N. Patt, Robert Cohn
Issue Date:September 2009
pp. 1153-1170
Indirect branches have become increasingly common in modular programs written in modern object-oriented languages and virtual-machine-based runtime systems. Unfortunately, the prediction accuracy of indirect branches has not improved as much as that of con...
 
Profile-assisted Compiler Support for Dynamic Predication in Diverge-Merge Processors
Found in: Code Generation and Optimization, IEEE/ACM International Symposium on
By Hyesoon Kim, José A. Joao, Onur Mutlu, Yale N. Patt
Issue Date:March 2007
pp. 367-378
Dynamic predication has been proposed to reduce the branch misprediction penalty due to hard-to-predict branch instructions. A recently proposed dynamic predication architecture, the diverge-merge processor (DMP), provides large performance improv...
 
Dynamic Predication of Indirect Jumps
Found in: IEEE Computer Architecture Letters
By J.A. Joao, O. Mutlu, Hyesoon Kim, Y.N. Patt
Issue Date:February 2007
pp. 1-1
Indirect jumps are used to implement increasingly-common programming language constructs such as virtual function calls, switch-case statements, jump tables, and interface calls. Unfortunately, the prediction accuracy of indirect jumps has remained low bec...
 
2D-Profiling: Detecting Input-Dependent Branches with a Single Input Data Set
Found in: Code Generation and Optimization, IEEE/ACM International Symposium on
By Hyesoon Kim, M. Aater Suleman, Onur Mutlu, Yale N. Patt
Issue Date:March 2006
pp. 159-172
Static compilers use profiling to predict run-time program behavior. Generally, this requires multiple input sets to capture wide variations in run-time behavior. This is expensive in terms of resources and compilation time. We introduce a new mechanism, 2...
 
Address-Value Delta (AVD) Prediction: Increasing the Effectiveness of Runahead Execution by Exploiting Regular Memory Allocation Patterns
Found in: Microarchitecture, IEEE/ACM International Symposium on
By Onur Mutlu, Hyesoon Kim, Yale N. Patt
Issue Date:November 2005
pp. 233-244
While runahead execution is effective at parallelizing independent long-latency cache misses, it is unable to parallelize dependent long-latency cache misses. To overcome this limitation, this paper proposes a novel technique, address-value delta ...
 
Techniques for Efficient Processing in Runahead Execution Engines
Found in: Computer Architecture, International Symposium on
By Onur Mutlu, Hyesoon Kim, Yale N. Patt
Issue Date:June 2005
pp. 370-381
Runahead execution is a technique that improves processor performance by pre-executing the running application instead of stalling the processor when a long-latency cache miss occurs. Previous research has shown that this technique significantly i...
 
On Reusing the Results of Pre-Executed Instructions in a Runahead Execution Processor
Found in: IEEE Computer Architecture Letters
By Onur Mutlu, Hyesoon Kim, Jared Stark, Yale N. Patt
Issue Date:January 2005
pp. N/A
Previous research on runahead execution took it for granted as a prefetch-only technique. Even though the results of instructions independent of an L2 miss are correctly computed during runahead mode, previous approaches discarded those results instead of ...
 
Wrong Path Events: Exploiting Unusual and Illegal Program Behavior for Early Misprediction Detection and Recovery
Found in: Microarchitecture, IEEE/ACM International Symposium on
By David N. Armstrong, Hyesoon Kim, Onur Mutlu, Yale N. Patt
Issue Date:December 2004
pp. 119-128
Control and data speculation are widely used to improve processor performance. Correct speculation can reduce execution time, but incorrect speculation can lead to increased execution time and greater energy consumption. This paper p...
 
Cache Filtering Techniques to Reduce the Negative Impact of Useless Speculative Memory References on Processor Performance
Found in: Computer Architecture and High Performance Computing, Symposium on
By Onur Mutlu, Hyesoon Kim, David N. Armstrong, Yale N. Patt
Issue Date:October 2004
pp. 2-9
High-performance processors employ aggressive speculation and prefetching techniques to increase performance. Speculative memory references caused by these techniques sometimes bring data into the caches that are not needed by correct execution. This paper...
 
SD3: An Efficient Dynamic Data-Dependence Profiling Mechanism
Found in: IEEE Transactions on Computers
By Minjang Kim, Nagesh B. Lakshminarayana, Hyesoon Kim, Chi-Keung Luk
Issue Date:December 2013
pp. 2516-2530
As multicore processors are deployed in mainstream computing, the need for software tools to help parallelize programs is increasing dramatically. Data-dependence profiling is an important program analysis technique to exploit parallelism in serial program...
 
CHiP: A Profiler to Measure the Effect of Cache Contention on Scalability
Found in: 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)
By Bevin Brett, Pranith Kumar, Minjang Kim, Hyesoon Kim
Issue Date:May 2013
pp. 1565-1574
Programmers are looking for ways to exploit the multi-core processors which have become commonplace today. One of the options available is to parallelize the existing serial programs using frameworks like OpenMP etc. However, such parallelization does not ...
 
OpenCL Performance Evaluation on Modern Multi Core CPUs
Found in: 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)
By Joo Hwan Lee, Kaushik Patel, Nimit Nigania, Hyojong Kim, Hyesoon Kim
Issue Date:May 2013
pp. 1177-1185
Utilizing heterogeneous platforms for computation has become a general trend making the portability issue important. OpenCL (Open Computing Language) serves the purpose by enabling portable execution on heterogeneous architectures. However, unpredictable p...
 
SD3: A Scalable Approach to Dynamic Data-Dependence Profiling
Found in: Microarchitecture, IEEE/ACM International Symposium on
By Minjang Kim, Hyesoon Kim, Chi-Keung Luk
Issue Date:December 2010
pp. 535-546
As multicore processors are deployed in mainstream computing, the need for software tools to help parallelize programs is increasing dramatically. Data-dependence profiling is an important technique to exploit parallelism in programs. More specifically, ma...
 
A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch
Found in: 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
By Jaewoong Sim, Gabriel H. Loh, Hyesoon Kim, Mike O'Connor, Mithuna Thottethodi
Issue Date:December 2012
pp. 247-257
Die-stacking technology allows conventional DRAM to be integrated with processors. While numerous opportunities to make use of such stacked DRAM exist, one promising way is to use it as a large cache. Although previous studies show that DRAM caches can del...
 
DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function
Found in: IEEE Computer Architecture Letters
By Nagesh B. Lakshminarayana, Jaekyu Lee, Hyesoon Kim, Jinwoo Shin
Issue Date:July 2012
pp. 33-36
GPGPU architectures (applications) have several different characteristics compared to traditional CPU architectures (applications): highly multithreaded architectures and SIMD-execution behavior are the two important characteristics of GPGPU computing. In ...
 
Address-Value Delta (AVD) Prediction: A Hardware Technique for Efficiently Parallelizing Dependent Cache Misses
Found in: IEEE Transactions on Computers
By Onur Mutlu, Hyesoon Kim, Yale N. Patt
Issue Date:December 2006
pp. 1491-1508
While runahead execution is effective at parallelizing independent long-latency cache misses, it is unable to parallelize dependent long-latency cache misses. To overcome this limitation, this paper proposes a novel hardware technique, address-value delta ...
 
Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance
Found in: IEEE Micro
By Onur Mutlu, Hyesoon Kim, Yale N. Patt
Issue Date:January 2006
pp. 10-20
Several simple techniques can make runahead execution more efficient by reducing the number of instructions executed and thereby reducing the additional energy consumption typically associated with runahead execution.
 
An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors
Found in: IEEE Transactions on Computers
By Onur Mutlu, Hyesoon Kim, David N. Armstrong, Yale N. Patt
Issue Date:December 2005
pp. 1556-1571
High-performance, out-of-order execution processors spend a significant portion of their execution time on the incorrect program path even though they employ aggressive branch prediction algorithms. Although memory references generated on the wrong path do...
 
Hardware Support for Safe Execution of Native Client Applications
Found in: IEEE Computer Architecture Letters
By Dilan Manatunga, Joo Lee, Hyesoon Kim
Issue Date:March 2014
pp. 1
Over the past few years, there has been vast growth in the area of the web browser as an applications platform. One example of this trend is Google’s Native Client (NaCl) platform, which is a software-fault isolation mechanism that allows the running of na...
 
Spare register aware prefetching for graph algorithms on GPUs
Found in: 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)
By Nagesh B. Lakshminarayana, Hyesoon Kim
Issue Date:February 2014
pp. 614-625
More and more graph algorithms are being GPU enabled. Graph algorithm implementations on GPUs have irregular control flow and are memory-intensive with many irregular/data-dependent memory accesses. Due to these factors graph algorithms on GPUs have low ex...
   
Adaptive virtual channel partitioning for network-on-chip in heterogeneous architectures
Found in: ACM Transactions on Design Automation of Electronic Systems (TODAES)
By Hyesoon Kim, Si Li, Jaekyu Lee, Sudhakar Yalamanchili
Issue Date:October 2013
pp. 1-28
Current heterogeneous chip-multiprocessors (CMPs) integrate a GPU architecture on a die. However, the heterogeneity of this architecture inevitably exerts different pressures on shared resource management due to differing characteristics of CPU and GPU cor...
     
FLEXclusion: balancing cache capacity and on-chip bandwidth via flexible exclusion
Found in: Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA '12)
By Hyesoon Kim, Jaekyu Lee, Jaewoong Sim, Moinuddin K. Qureshi
Issue Date:June 2012
pp. 321-332
Exclusive last-level caches (LLCs) reduce memory accesses by effectively utilizing cache capacity. However, they require excessive on-chip bandwidth to support frequent insertions of cache lines on eviction from upper-level caches. Non-inclusive caches, on...
     
When Prefetching Works, When It Doesn’t, and Why
Found in: ACM Transactions on Architecture and Code Optimization (TACO)
By Jaekyu Lee, Richard Vuduc, Hyesoon Kim
Issue Date:March 2012
pp. 1-29
In emerging and future high-end processor systems, tolerating increasing cache miss latency and properly managing memory bandwidth will be critical to achieving high performance. Prefetching, in both hardware and software, is among our most important avail...
     
A performance analysis framework for identifying potential benefits in GPGPU applications
Found in: Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming (PPoPP '12)
By Aniruddha Dasgupta, Jaewoong Sim, Richard Vuduc, Hyesoon Kim
Issue Date:February 2012
pp. 11-22
Tuning code for GPGPU and other emerging many-core platforms is a challenge because few models or tools can precisely pinpoint the root cause of performance bottlenecks. In this paper, we present a performance analysis framework that can help shed light on...
     
Design space exploration of the turbo decoding algorithm on GPUs
Found in: Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems (CASES '10)
By Dongwon Lee, Hyesoon Kim, Marilyn Wolf
Issue Date:October 2010
pp. 217-226
In this paper, we explore the design space of the Turbo decoding algorithm on GPUs and find a performance bottleneck. We consider three axes for the design space exploration: a radix degree, a parallelization method, and the number of sub-frames per thread...
     
An integrated GPU power and performance model
Found in: Proceedings of the 37th annual international symposium on Computer architecture (ISCA '10)
By Hyesoon Kim, Sunpyo Hong
Issue Date:June 2010
pp. 72-ff
GPU architectures are increasingly important in the multi-core era due to their high number of parallel processors. Performance optimization for multi-core processors has been a challenge for programmers. Furthermore, optimizing for power consumption is ev...
     
Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping
Found in: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (Micro-42)
By Chi-Keung Luk, Hyesoon Kim, Sunpyo Hong
Issue Date:December 2009
pp. 45-55
Heterogeneous multiprocessors are increasingly important in the multi-core era due to their potential for high performance and energy efficiency. In order for software to fully realize this potential, the step that maps computations to processing elements ...
     
Age based scheduling for asymmetric multiprocessors
Found in: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09)
By Hyesoon Kim, Jaekyu Lee, Nagesh B. Lakshminarayana
Issue Date:November 2009
pp. 1-12
Asymmetric (or Heterogeneous) Multiprocessors are becoming popular in the current era of multi-cores due to their power efficiency and potential performance and energy efficiency. However, scheduling of multithreaded applications in Asymmetric Multiprocess...
     
An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness
Found in: Proceedings of the 36th annual international symposium on Computer architecture (ISCA '09)
By Hyesoon Kim, Sunpyo Hong
Issue Date:June 2009
pp. 70-73
GPU architectures are increasingly important in the multi-core era due to their high number of parallel processors. Programming thousands of massively parallel threads is a big challenge for software engineers, but understanding the performance bottlenecks...
     
Improving the performance of object-oriented languages with dynamic predication of indirect jumps
Found in: Proceedings of the 13th international conference on Architectural support for programming languages and operating systems (ASPLOS XIII)
By Hyesoon Kim, Jose A. Joao, Onur Mutlu, Rishi Agarwal, Yale N. Patt
Issue Date:March 2008
pp. 1-1
Indirect jump instructions are used to implement increasingly-common programming constructs such as virtual function calls, switch-case statements, jump tables, and interface calls. The performance impact of indirect jumps is likely to increase because ind...
     
Understanding the effects of wrong-path memory references on processor performance
Found in: Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture (WMPI '04)
By David N. Armstrong, Hyesoon Kim, Onur Mutlu, Yale N. Patt
Issue Date:June 2004
pp. 56-64
High-performance out-of-order processors spend a significant portion of their execution time on the incorrect program path even though they employ aggressive branch prediction algorithms. Although memory references generated on the wrong path do not change...
     