iCFP: Tolerating All-Level Cache Misses in In-Order Processors
Found in: IEEE Micro
By Andrew Hilton, Santosh Nagarakatte, Amir Roth
Issue Date:January 2010
pp. 12-19
In-order continual flow pipeline (iCFP) is an in-order pipeline that allows execution to flow around data cache misses. When a cache miss occurs, iCFP executes and speculatively retires miss-independent instructions. It saves miss-dependent instru...
 
Low-Overhead Interactive Debugging via Dynamic Instrumentation with DISE
Found in: High-Performance Computer Architecture, International Symposium on
By Marc L. Corliss, E Christopher Lewis, Amir Roth
Issue Date:February 2005
pp. 303-314
Breakpoints, watchpoints, and conditional variants of both are essential debugging primitives, but their natural implementations often degrade performance significantly. Slowdown arises because the debugger, the tool implementing the breakpoint/watchpoint i...
 
Flexible register management using reference counting
Found in: High-Performance Computer Architecture, International Symposium on
By Steven Battle, Andrew D. Hilton, Mark Hempstead, Amir Roth
Issue Date:February 2012
pp. 1-12
Conventional out-of-order processors that use a unified physical register file allocate and reclaim registers explicitly using a free list that operates as a circular queue. We describe and evaluate a more flexible register management scheme -- reference c...
 
CPROB: Checkpoint Processing with Opportunistic Minimal Recovery
Found in: Parallel Architectures and Compilation Techniques, International Conference on
By Andrew Hilton, Neeraj Eswaran, Amir Roth
Issue Date:September 2009
pp. 159-168
CPR (Checkpoint Processing and Recovery) is a physical register management scheme that supports a larger instruction window and higher average IPC than conventional ROB-style register management. It does so by restricting mis-speculation recovery to checkp...
 
NoSQ: Store-Load Communication without a Store Queue
Found in: Microarchitecture, IEEE/ACM International Symposium on
By Tingting Sha, Milo M. K. Martin, Amir Roth
Issue Date:December 2006
pp. 285-296
This paper presents NoSQ (short for No Store Queue), a microarchitecture that performs store-load communication without a store queue and without executing stores in the out-of-order engine. NoSQ implements store-load communication using speculati...
 
Serialization-Aware Mini-Graphs: Performance with Fewer Resources
Found in: Microarchitecture, IEEE/ACM International Symposium on
By Anne Bracy, Amir Roth
Issue Date:December 2006
pp. 171-184
Instruction aggregation, the grouping of multiple operations into a single processing unit, is a technique that has recently been used to amplify the bandwidth and capacity of critical processor structures. This amplification can be used to improve...
 
Scalable Store-Load Forwarding via Store Queue Index Prediction
Found in: Microarchitecture, IEEE/ACM International Symposium on
By Tingting Sha, Milo M.K. Martin, Amir Roth
Issue Date:November 2005
pp. 159-170
Conventional processors use a fully-associative store queue (SQ) to implement store-load forwarding. Associative search latency does not scale well to capacities and bandwidths required by wide-issue, large window processors. In this work, we impr...
 
Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization
Found in: Computer Architecture, International Symposium on
By Amir Roth
Issue Date:June 2005
pp. 458-468
The load-store unit is a performance critical component of a dynamically-scheduled processor. It is also a complex and non-scalable component. Several recently proposed techniques use some form of speculation to simplify the load-store unit and ch...
 
Energy-Effectiveness of Pre-Execution and Energy-Aware P-Thread Selection
Found in: Computer Architecture, International Symposium on
By Vlad Petric, Amir Roth
Issue Date:June 2005
pp. 322-333
Pre-execution removes the microarchitectural latency of...
 
Dataflow Mini-Graphs: Amplifying Superscalar Capacity and Bandwidth
Found in: Microarchitecture, IEEE/ACM International Symposium on
By Anne Bracy, Prashant Prahlad, Amir Roth
Issue Date:December 2004
pp. 18-29
A mini-graph is a dataflow graph that has an arbitrary internal size and shape but the interface of a singleton instruction: two register inputs, one register output, a maximum of one memory operation, and a maximum of one (terminal) control transfer...
 
DISE: A Programmable Macro Engine for Customizing Applications
Found in: Computer Architecture, International Symposium on
By Marc L. Corliss, E Christopher Lewis, Amir Roth
Issue Date:June 2003
pp. 362
Dynamic Instruction Stream Editing (DISE) is a cooperative software-hardware scheme for efficiently adding customization functionality (e.g., safety/security checking, profiling, dynamic code decompression, and dynamic optimization) to an application. In DISE...
 
Three Extensions To Register Integration
Found in: Microarchitecture, IEEE/ACM International Symposium on
By Vlad Petric, Anne Bracy, Amir Roth
Issue Date:November 2002
pp. 37
Register integration (or just integration) is a register renaming discipline that implements instruction reuse via physical register sharing. Initially developed to perform squash reuse, the integration mechanism can exploit more reuse scenarios. Here, we ...
 
Speculative Multithreaded Processors
Found in: Computer
By Gurindar S. Sohi, Amir Roth
Issue Date:April 2001
pp. 66-73
Although novel functionality in the 1990s played a dominant role in processor design, the authors predict that implementation will dominate over functionality. Designing, debugging, and verifying monolithic designs that use hundreds of millions of...
 
Speculative Data-Driven Multithreading
Found in: High-Performance Computer Architecture, International Symposium on
By Amir Roth, Gurindar S. Sohi
Issue Date:January 2001
pp. 37
Mispredicted branches and loads that miss in the cache cause the majority of retirement stalls experienced by sequential processors; we call these critical instructions. Despite their importance, a sequential processor has difficulty prioritizing ...
 
Effective Jump-Pointer Prefetching for Linked Data Structures
Found in: Computer Architecture, International Symposium on
By Amir Roth, Gurindar S. Sohi
Issue Date:May 1999
pp. 111
Current techniques for prefetching linked data structures (LDS) exploit the work available in one loop iteration or recursive call to overlap pointer chasing latency. Jump-pointers, which provide direct access to non-adjacent nodes, can be used for prefetc...
 
New Methods for Exploiting Program Structure and Behavior in Computer Architecture
Found in: Innovative Architecture for Future Generation High-Performance Processors and Systems, International Workshop on
By Amir Roth, Gurindar S. Sohi
Issue Date:October 1998
pp. 71
Micro-architectural techniques of the next decade will have to be more efficient and scalable in order to handle growing workloads and longer communication and memory latencies. We believe that information about program structure, the data and control rela...
 
Exploiting Dead Value Information
Found in: Microarchitecture, IEEE/ACM International Symposium on
By Milo M. Martin, Amir Roth, Charles N. Fischer
Issue Date:December 1997
pp. 125
We describe Dead Value Information (DVI) and introduce three new optimizations which exploit it. DVI provides assertions that certain register values are dead, meaning they will not be read before being overwritten. The processor can use DVI to track dead ...
 
Near term computing opportunities in building energy efficiency
Found in: 2012 International Green Computing Conference (IGCC)
By Amir Roth
Issue Date:June 2012
pp. 1-5
Computing is poised to make major contributions to the global sustainability effort by engaging the fields of energy efficiency, renewable energy generation, and energy delivery. This paper discusses three near term opportunities for computing in the field...
 
SMT-Directory: Efficient Load-Load Ordering for SMT
Found in: IEEE Computer Architecture Letters
By Andrew Hilton, Amir Roth
Issue Date:January 2010
pp. 25-28
Memory models like SC, TSO, and PC enforce load-load ordering, requiring that loads from any thread appear to occur in program order to all other threads. Out-of-order execution can violate load-load ordering. Multi-processors with out-of-order cores detec...
 
NoSQ: Store-Load Communication without a Store Queue
Found in: IEEE Micro
By Tingting Sha, Milo M.K. Martin, Amir Roth
Issue Date:January 2007
pp. 106-113
The NoSQ microarchitecture performs store-load communication without a store queue and without executing stores in the out-of-order engine. It uses speculative memory bypassing for all in-flight store-load communication, enabled by a 99.8 percent accurate ...
 
A Quantitative Framework for Automated Pre-Execution Thread Selection
Found in: Microarchitecture, IEEE/ACM International Symposium on
By Amir Roth, Gurindar S. Sohi
Issue Date:November 2002
pp. 430
Pre-execution attacks cache misses for which address prediction driven prefetching fails. In pre-execution, copies of cache miss computations are isolated from the main program and launched as separate threads called p-threads whenever the processor antici...
 
Decoupled store completion/silent deterministic replay: enabling scalable data memory for CPR/CFP processors
Found in: Proceedings of the 36th annual international symposium on Computer architecture (ISCA '09)
By Amir Roth, Andrew Hilton
Issue Date:June 2009
pp. 70-73
CPR/CFP (Checkpoint Processing and Recovery/Continual Flow Pipeline) support an adaptive instruction window that scales to tolerate last-level cache misses. CPR/CFP scale the register file by aggressively reclaiming the destination registers of many in-fli...
     
Ginger: control independence using tag rewriting
Found in: Proceedings of the 34th annual international symposium on Computer architecture (ISCA '07)
By Amir Roth, Andrew D. Hilton
Issue Date:June 2007
pp. 436-447
The negative performance impact of branch mis-predictions can be reduced by exploiting control independence (CI). When a branch mis-predicts, the wrong-path instructions up to the point where control converges with the correct path are selectively squashed...
     
The implementation and evaluation of dynamic code decompression using DISE
Found in: ACM Transactions on Embedded Computing Systems (TECS)
By Amir Roth, E. Christopher Lewis, Marc L. Corliss
Issue Date:February 2005
pp. 38-72
Code compression coupled with dynamic decompression is an important technique for both embedded and general-purpose microprocessors. Postfetch decompression, in which decompression is performed after the compressed instructions have been fetched, allows th...
     
A DISE implementation of dynamic code decompression
Found in: Proceedings of the 2003 ACM SIGPLAN conference on Language, compiler, and tool for embedded systems (LCTES '03)
By Amir Roth, E. Christopher Lewis, Marc L. Corliss
Issue Date:June 2003
pp. 694-699
Code compression coupled with dynamic decompression is an important technique for both embedded and general-purpose microprocessors. Post-fetch decompression, in which decompression is performed after the compressed instructions have been fetched, allows t...
     
Register integration: a simple and efficient implementation of squash reuse
Found in: Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture (MICRO 33)
By Amir Roth, Gurindar S. Sohi
Issue Date:December 2000
pp. 223-234
Recent research has suggested that the branch history register need not contain the outcomes of the most recent branches in order for the Two-Level Adaptive Branch Predictor to work well. From this result, it is tempting to conclude that the branch history...
     
Improving virtual function call target prediction via dependence-based pre-computation
Found in: Proceedings of the 13th international conference on Supercomputing (ICS '99)
By Amir Roth, Andreas Moshovos, Gurindar S. Sohi
Issue Date:June 1999
pp. 356-364
To minimize the amount of computation and storage for parallel sparse factorization, sparse matrices have to be reordered prior to factorization. We show that none of the popular ordering heuristics proposed before, namely, multiple minimum degree and nest...
     
Dependence based prefetching for linked data structures
Found in: Proceedings of the eighth international conference on Architectural support for programming languages and operating systems (ASPLOS-VIII)
By Amir Roth, Andreas Moshovos, Gurindar S. Sohi
Issue Date:October 1998
pp. 205-209
We introduce a dynamic scheme that captures the access patterns of linked data structures and can be used to predict future accesses with high accuracy. Our technique exploits the dependence relationships that exist between loads that produce addresses and...
     