Thread-aware dynamic shared cache compression in multi-core processors
Found in: Computer Design, International Conference on
By Yuejian Xie, Gabriel H. Loh
Issue Date:October 2011
pp. 135-141
When a program's working set exceeds the size of its last-level cache, performance may suffer due to the resulting off-chip memory accesses. Cache compression can increase the effective cache size and therefore reduce misses, but compression also introduce...
 
A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch
Found in: 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
By Jaewoong Sim, Gabriel H. Loh, Hyesoon Kim, Mike O'Connor, Mithuna Thottethodi
Issue Date:December 2012
pp. 247-257
Die-stacking technology allows conventional DRAM to be integrated with processors. While numerous opportunities to make use of such stacked DRAM exist, one promising way is to use it as a large cache. Although previous studies show that DRAM caches can del...
 
A Simple Divide-and-Conquer Approach for Neural-Class Branch Prediction
Found in: Parallel Architectures and Compilation Techniques, International Conference on
By Gabriel H. Loh
Issue Date:September 2005
pp. 243-254
The continual demand for greater performance and growing concerns about the power consumption in high-performance microprocessors make the branch predictor a critical component of modern microarchitectures. Recent research in applying machine learn...
 
A Segmented Bloom Filter Algorithm for Efficient Predictors
Found in: Computer Architecture and High Performance Computing, Symposium on
By M. Breternitz, Gabriel H. Loh, Bryan Black, Jeffrey Rupley, Peter G. Sassone, Wesley Attrot, Youfeng Wu
Issue Date:November 2008
pp. 123-130
Bloom Filters are a technique to reduce the effects of conflicts/interference in hash table-like structures. Conventional hash tables store information in a single location which is susceptible to destructive interference through hash conflicts. A Bloom Fi...
 
Fire-and-Forget: Load/Store Scheduling with No Store Queue at All
Found in: Microarchitecture, IEEE/ACM International Symposium on
By Samantika Subramaniam, Gabriel H. Loh
Issue Date:December 2006
pp. 273-284
Modern processors use CAM-based load and store queues (LQ/SQ) to support out-of-order memory scheduling and store-to-load forwarding. However, the LQ and SQ scale poorly for the sizes required for large-window, high-ILP processors. Past research ...
 
Supporting Very Large DRAM Caches with Compound-Access Scheduling and MissMap
Found in: IEEE Micro
By Gabriel H. Loh, Mark D. Hill
Issue Date:May 2012
pp. 70-78
This work efficiently enables conventional block sizes for very large die-stacked DRAM caches with two innovations: it makes hits faster with compound-access scheduling and misses faster with a MissMap. The combination of these mechanisms enables the new o...
 
3D Stacked Microprocessor: Are We There Yet?
Found in: IEEE Micro
By Gabriel H. Loh, Yuan Xie
Issue Date:May 2010
pp. 60-64
Editors' Note: We live in a 3D world. It is hard to imagine a large city, such as New York City, with only single-level structures. There would be no skyscrapers, no mixed-use, no live-work. It would be a long walk (or drive) betw...
 
3D-Stacked Memory Architectures for Multi-core Processors
Found in: Computer Architecture, International Symposium on
By Gabriel H. Loh
Issue Date:June 2008
pp. 453-464
Three-dimensional integration enables stacking memory directly on top of a microprocessor, thereby significantly reducing wire delay between the two. Previous studies have examined the performance benefits of such an approach, but all of these works only ...
 
Thermal Herding: Microarchitecture Techniques for Controlling Hotspots in High-Performance 3D-Integrated Processors
Found in: High-Performance Computer Architecture, International Symposium on
By Kiran Puttaswamy, Gabriel H. Loh
Issue Date:February 2007
pp. 193-204
3D integration technology greatly increases transistor density while providing faster on-chip communication. 3D implementations of processors can simultaneously provide both latency and power benefits due to reductions in critical wires. However, 3D stacki...
 
Controlling the Power and Area of Neural Branch Predictors for Practical Implementation in High-Performance Processors
Found in: Computer Architecture and High Performance Computing, Symposium on
By Daniel A. Jimenez, Gabriel H. Loh
Issue Date:October 2006
pp. 55-62
Neural-inspired branch predictors achieve very low branch misprediction rates. However, previously proposed implementations have a variety of characteristics that make them challenging to implement in future high-performance processors. In particular, the ...
 
Circuits for Wide-Window Superscalar Processors
Found in: Computer Architecture, International Symposium on
By Rahul Sami, Bradley C. Kuszmaul, Gabriel H. Loh, Dana S. Henry
Issue Date:June 2000
pp. 236
Our program benchmarks and simulations of novel circuits indicate that large-window processors are feasible. Using our redesigned superscalar components, a large-window processor implemented in today's technology can achieve an increase of 10-60% (geometri...
 
Top Picks from the 2012 Computer Architecture Conferences
Found in: IEEE Micro
By Babak Falsafi, Gabriel H. Loh
Issue Date:May 2013
pp. 4-7
This special issue is the tenth in an important tradition in the computer architecture community: IEEE Micro's Top Picks from the Computer Architecture Conferences. This tradition provides a means for sharing a sample of the best papers published in comput...
 
3D-Integrated SRAM Components for High-Performance Microprocessors
Found in: IEEE Transactions on Computers
By Kiran Puttaswamy, Gabriel H. Loh
Publication Date: July 2009
pp. 1369-1381
3D integration is an emergent technology that has the potential to greatly increase device density while simultaneously providing faster on-chip communication. 3D fabrication involves stacking two or more die connected with a very high density and low-late...
 
Processor Design in 3D Die-Stacking Technologies
Found in: IEEE Micro
By Gabriel H. Loh, Yuan Xie, Bryan Black
Issue Date:May 2007
pp. 31-48
Three-dimensional die-stacking integration stacks multiple layers of processed silicon with a very high-density, low-latency layer-to-layer interconnect. After presenting a brief background on 3D die-stacking technology, this article gives multiple case st...
 
Adaptive Caches: Effective Shaping of Cache Behavior to Workloads
Found in: Microarchitecture, IEEE/ACM International Symposium on
By Ranjith Subramanian, Yannis Smaragdakis, Gabriel H. Loh
Issue Date:December 2006
pp. 385-396
We present and evaluate the idea of adaptive processor cache management. Specifically, we describe a novel and general scheme by which we can combine any two cache management algorithms (e.g., LRU, LFU, FIFO, Random) and adaptively switch between them, clo...
 
Die Stacking (3D) Microarchitecture
Found in: Microarchitecture, IEEE/ACM International Symposium on
By Bryan Black, Murali Annavaram, Ned Brekelbaum, John DeVale, Lei Jiang, Gabriel H. Loh, Don McCauley, Pat Morrow, Donald W. Nelson, Daniel Pantuso, Paul Reed, Jeff Rupley, Sadasivan Shankar, John Shen, Clair Webb
Issue Date:December 2006
pp. 469-479
3D die stacking is an exciting new technology that increases transistor density by vertically integrating two or more die with a dense, high-speed interface. The result of 3D die stacking is a significant reduction of interconnect both within a di...
 
Implementing Register Files for High-Performance Microprocessors in a Die-Stacked (3D) Technology
Found in: VLSI, IEEE Computer Society Annual Symposium on
By Kiran Puttaswamy, Gabriel H. Loh
Issue Date:March 2006
pp. 384-392
3D integration is a new technology that will greatly increase transistor density while providing faster on-chip communication. 3D integration stacks multiple die connected with a very high-density and low-latency interface which provides increased device d...
 
Implementing Caches in a 3D Technology for High Performance Processors
Found in: Computer Design, International Conference on
By Kiran Puttaswamy, Gabriel H. Loh
Issue Date:October 2005
pp. 525-532
3D integration is an emergent technology that has the potential to greatly increase device density while simultaneously providing faster on-chip communication. 3D fabrication involves stacking two or more die connected with a very high-density and...
 
Exploiting Data-Width Locality to Increase Superscalar Execution Bandwidth
Found in: Microarchitecture, IEEE/ACM International Symposium on
By Gabriel H. Loh
Issue Date:November 2002
pp. 395
In a 64-bit processor, many of the data values actually used in computations require much narrower data-widths. In this study, we demonstrate that instruction data-widths exhibit very strong temporal locality and describe mechanisms to accurately predict d...
 
Predicting Conditional Branches With Fusion-Based Hybrid Predictors
Found in: Parallel Architectures and Compilation Techniques, International Conference on
By Gabriel H. Loh, Dana S. Henry
Issue Date:September 2002
pp. 165
Researchers have studied hybrid branch predictors that leverage the strengths of multiple stand-alone predictors. The common theme among the proposed techniques is a selection mechanism that chooses a prediction from among several component predictors. We ...
 
A Configurable and Strong RAS Solution for Die-Stacked DRAM Caches
Found in: IEEE Micro
By Jaewoong Sim, Gabriel H. Loh, Vilas Sridharan, Mike O'Connor
Issue Date:May 2014
pp. 80-90
The resiliency problem of die-stacked memory will become important because of its lack of serviceability. This article details how to provide practical and cost-effective reliability, availability, and serviceability support for die-stacked DRAM cache arch...
 
Increasing TLB reach by exploiting clustering in page translations
Found in: 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)
By Binh Pham, Abhishek Bhattacharjee, Yasuko Eckert, Gabriel H. Loh
Issue Date:February 2014
pp. 558-567
The steadily increasing sizes of main memory capacities require corresponding increases in the processor's translation lookaside buffer (TLB) resources to avoid performance bottlenecks. Large operating system page sizes can mitigate the bottleneck with a s...
   
Design and Analysis of 3D-MAPS (3D Massively Parallel Processor with Stacked Memory)
Found in: IEEE Transactions on Computers
By Dae Hyun Kim, Krit Athikulwongse, Michael B. Healy, Mohammad M. Hossain, Moongon Jung, Ilya Khorosh, Gokul Kumar, Young-Joon Lee, Dean L. Lewis, Tzu-Wei Lin, Chang Liu, Shreepad Panth, Mohit Pathak, Minzhen Ren, Guanhao Shen, Taigon Song, Dong Hyuk Woo, Xin Zhao, Joungho Kim, Ho Choi, Gabriel H. Loh, Hsien-Hsin Lee, Sung Kyu Lim
Issue Date:October 2013
pp. 1
This paper describes the architecture, design, analysis, and simulation and measurement results of the 3D-MAPS (3D massively parallel processor with stacked memory) chip built with a 1.5V, 130nm process technology and a two-tier 3D stacking technology usin...
 
Optimizing GPU energy efficiency with 3D die-stacking graphics memory and reconfigurable memory interface
Found in: ACM Transactions on Architecture and Code Optimization (TACO)
By Gabriel H. Loh, Guangyu Sun, Jishen Zhao, Yuan Xie
Issue Date:December 2013
pp. 1-25
The performance of graphics processing unit (GPU) systems is improving rapidly to accommodate the increasing demands of graphics and high-performance computing applications. With such a performance improvement, however, power consumption of GPU systems is ...
     
Resilient die-stacked DRAM caches
Found in: Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13)
By Gabriel H. Loh, Jaewoong Sim, Mike O'Connor, Vilas Sridharan
Issue Date:June 2013
pp. 416-427
Die-stacked DRAM can provide large amounts of in-package, high-bandwidth cache storage. For server and high-performance computing markets, however, such DRAM caches must also provide sufficient support for reliability and fault tolerance. While conventiona...
     
Energy-efficient GPU design with reconfigurable in-package graphics memory
Found in: Proceedings of the 2012 ACM/IEEE international symposium on Low power electronics and design (ISLPED '12)
By Gabriel H. Loh, Guangyu Sun, Jishen Zhao, Yuan Xie
Issue Date:July 2012
pp. 403-408
We propose an energy-efficient reconfigurable in-package graphics memory design that integrates wide-interface graphics DRAMs with GPU on a silicon interposer. We reduce the memory power consumption by scaling down the supply voltage and frequency while ma...
     
Staged memory scheduling: achieving high performance and scalability in heterogeneous systems
Found in: Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA '12)
By Gabriel H. Loh, Kevin Kai-Wei Chang, Lavanya Subramanian, Onur Mutlu, Rachata Ausavarungnirun
Issue Date:June 2012
pp. 416-427
When multiple processor (CPU) cores and a GPU integrated together on the same chip share the off-chip main memory, requests from the GPU can heavily interfere with requests from the CPU cores, leading to low system performance and starvation of CPU cores. ...
     
Efficiently enabling conventional block sizes for very large die-stacked DRAM caches
Found in: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44 '11)
By Gabriel H. Loh, Mark D. Hill
Issue Date:December 2011
pp. 454-464
Die-stacking technology enables multiple layers of DRAM to be integrated with multicore processors. A promising use of stacked DRAM is as a cache, since its capacity is insufficient to be all of main memory (for all but some embedded systems). However, a 1...
     
A register-file approach for row buffer caches in die-stacked DRAMs
Found in: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44 '11)
By Gabriel H. Loh
Issue Date:December 2011
pp. 351-361
Die-stacked DRAMs have been proposed that combine multiple layers of dense memory cells with a base logic layer to implement peripheral circuitry (decoders, sense amps), interface logic, and test structures. Even after implementing these various features, ...
     
Preventing PCM banks from seizing too much power
Found in: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44 '11)
By Andrew Hay, Doug Burger, Gabriel H. Loh, Karin Strauss, Timothy Sherwood
Issue Date:December 2011
pp. 186-195
Widespread adoption of Phase Change Memory (PCM) requires solutions to several problems recently addressed in the literature, including limited endurance, increased write latencies, and system-level changes required to exploit non-volatility. One important...
     
Use ECP, not ECC, for hard failures in resistive memories
Found in: Proceedings of the 37th annual international symposium on Computer architecture (ISCA '10)
By Doug Burger, Gabriel H. Loh, Karin Strauss, Stuart Schechter
Issue Date:June 2010
pp. 72-ff
As leakage and other charge storage limitations begin to impair the scalability of DRAM, non-volatile resistive memories are being developed as a potential replacement. Unfortunately, current error correction techniques are poorly suited to this emerging c...
     
Extending the effectiveness of 3D-stacked DRAM caches with an adaptive multi-queue policy
Found in: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (Micro-42)
By Gabriel H. Loh
Issue Date:December 2009
pp. 201-212
3D-integration is a promising technology to help combat the "Memory Wall" in future multi-core processors. Past work has considered using 3D-stacked DRAM as a large last-level cache (LLC). While significant performance benefits can be gained with such an a...
     
Design and optimization of the store vectors memory dependence predictor
Found in: ACM Transactions on Architecture and Code Optimization (TACO)
By Gabriel H. Loh, Samantika Subramaniam
Issue Date:October 2009
pp. 1-33
Allowing loads that do not violate memory ordering to issue out of order with respect to earlier unresolved store addresses is very important for extracting parallelism in large-window superscalar processors. Previous research has proposed memory dependenc...
     
PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches
Found in: Proceedings of the 36th annual international symposium on Computer architecture (ISCA '09)
By Gabriel H. Loh, Yuejian Xie
Issue Date:June 2009
pp. 70-73
Many multi-core processors employ a large last-level cache (LLC) shared among the multiple cores. Past research has demonstrated that sharing-oblivious cache management policies (e.g., LRU) can lead to poor performance and fairness when the multiple cores ...
     
A modular 3d processor for flexible product design and technology migration
Found in: Proceedings of the 2008 conference on Computing frontiers (CF '08)
By Gabriel H. Loh
Issue Date:May 2008
pp. 353-358
The current methodology used in mass-market processor design is to create a single base microarchitecture (e.g., Intel's "Core" or AMD's "K8") that is used throughout all of the PC market segments from laptops to servers. To differentiate the products,...
     
Static strands: Safely exposing dependence chains for increasing embedded power efficiency
Found in: ACM Transactions on Embedded Computing Systems (TECS)
By D. Scott Wills, Gabriel H. Loh, Peter G. Sassone
Issue Date:September 2007
pp. 24-es
Modern embedded processors are designed to maximize execution efficiency: the amount of performance achieved per unit of energy dissipated while meeting minimum performance levels. To increase this efficiency, we propose utilizing static strands, dependen...
     
Scalability of 3D-integrated arithmetic units in high-performance microprocessors
Found in: Proceedings of the 44th annual conference on Design automation (DAC '07)
By Gabriel H. Loh, Kiran Puttaswamy
Issue Date:June 2007
pp. 622-625
Three-Dimensional integration provides a simultaneous improvement in wire-related delay and power consumption of microprocessor circuits. Prior work has looked at the performance, power, and area benefits of the 3D integration technology. In this paper, we...
     
Matrix scheduler reloaded
Found in: Proceedings of the 34th annual international symposium on Computer architecture (ISCA '07)
By Bryan Black, Edward Brekelbaum, Gabriel H. Loh, Jeff Rupley, Peter G. Sassone
Issue Date:June 2007
pp. 335-346
From multiprocessor scale-up to cache sizes to the number of reorder-buffer entries, microarchitects wish to reap the benefits of more computing resources while staying within power and latency bounds. This tension is quite evident in schedulers, which nee...
     
Entropy-based low power data TLB design
Found in: Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems (CASES '06)
By Chinnakrishnan Ballapuram, Gabriel H. Loh, Hsien-Hsin S. Lee, Kiran Puttaswamy
Issue Date:October 2006
pp. 304-311
The Translation Look-aside Buffer (TLB), a content addressable memory, consumes significant power due to the associative search mechanism it uses in the virtual to physical address translation. Based on our analysis of the TLB accesses, we make two observa...
     
Design space exploration for 3D architectures
Found in: ACM Journal on Emerging Technologies in Computing Systems (JETC)
By Bryan Black, Gabriel H. Loh, Kerry Bernstein, Yuan Xie
Issue Date:April 2006
pp. 65-103
As technology scales, interconnects have become a major performance bottleneck and a major source of power consumption for microprocessors. Increasing interconnect costs make it necessary to consider alternate ways of building modern microprocessors. One p...
     
Dynamic instruction schedulers in a 3-dimensional integration technology
Found in: Proceedings of the 16th ACM Great Lakes symposium on VLSI (GLSVLSI '06)
By Gabriel H. Loh, Kiran Puttaswamy
Issue Date:April 2006
pp. 153-158
We present the design of high-performance and energy-efficient dynamic instruction schedulers in a 3-Dimensional integration technology. Based on a previous observation that the critical path latency of a conventional dynamic scheduler is greatly affected ...
     
Thermal analysis of a 3D die-stacked high-performance microprocessor
Found in: Proceedings of the 16th ACM Great Lakes symposium on VLSI (GLSVLSI '06)
By Gabriel H. Loh, Kiran Puttaswamy
Issue Date:April 2006
pp. 19-24
3-dimensional integrated circuit (3D IC) technology places circuit blocks in the vertical dimension in addition to the conventional horizontal plane. Compared to conventional planar ICs, 3D ICs have shorter latencies as well as lower power consumption, due...
     
Static strands: safely collapsing dependence chains for increasing embedded power efficiency
Found in: Proceedings of the 2005 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems (LCTES'05)
By D. Scott Wills, Gabriel H. Loh, Peter G. Sassone
Issue Date:June 2005
pp. 127-136
Modern embedded processors are designed to maximize execution efficiency: the amount of performance achieved per unit of energy dissipated while meeting minimum performance levels. To increase this efficiency, we propose utilizing static strands, dependence...
     
Circuits for wide-window superscalar processors
Found in: Proceedings of the 27th annual international symposium on Computer architecture (ISCA '00)
By Bradley C. Kuszmaul, Dana S. Henry, Gabriel H. Loh, Rahul Sami
Issue Date:June 2000
pp. 125-131
Our program benchmarks and simulations of novel circuits indicate that large-window processors are feasible. Using our redesigned superscalar components, a large-window processor implemented in today's technology can achieve an increase of 10-60% (geometri...
     
A comparison of scalable superscalar processors
Found in: Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures (SPAA '99)
By Bradley C. Kuszmaul, Dana S. Henry, Gabriel H. Loh
Issue Date:June 1999
pp. 126-137
We consider the problem of sorting a file of N records on the D-disk model of parallel I/O [VS94] in which there are two sources of parallelism. Records are transferred to and from disk concurrently in blocks of B contiguous records. In each I/O operation, up ...
     