Displaying 1-24 out of 24 total
Using Virtual Load/Store Queues (VLSQs) to Reduce the Negative Effects of Reordered Memory Instructions
Found in: International Symposium on High-Performance Computer Architecture
By Aamer Jaleel, Bruce Jacob
Issue Date: February 2005
pp. 191-200
The use of large instruction windows coupled with aggressive out-of-order execution and prefetching capabilities has provided significant improvements in processor performance. In this paper, we quantify the effects of increased out-of-order aggressiveness on a proc...
 
CoLT: Coalesced Large-Reach TLBs
Found in: 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
By Binh Pham, Viswanathan Vaidyanathan, Aamer Jaleel, Abhishek Bhattacharjee
Issue Date: December 2012
pp. 258-269
Translation Lookaside Buffers (TLBs) are critical to system performance, particularly as applications demand larger working sets and with the adoption of virtualization. Architectural support for superpages has previously been proposed to improve TLB per...
 
Achieving Non-Inclusive Cache Performance with Inclusive Caches: Temporal Locality Aware (TLA) Cache Management Policies
Found in: IEEE/ACM International Symposium on Microarchitecture
By Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon C. Steely Jr., Joel Emer
Issue Date: December 2010
pp. 151-162
Inclusive caches are commonly used by processors to simplify cache coherence. However, the trade-off has been lower performance compared to non-inclusive and exclusive caches. Contrary to conventional wisdom, we show that the limited performance of inclusi...
 
Analyzing Parallel Programs with Pin
Found in: Computer
By Moshe (Maury) Bach, Mark Charney, Robert Cohn, Elena Demikhovsky, Tevi Devor, Kim Hazelwood, Aamer Jaleel, Chi-Keung Luk, Gail Lyons, Harish Patil, Ady Tal
Issue Date: March 2010
pp. 34-41
No summary available.
 
Understanding the Memory Behavior of Emerging Multi-core Workloads
Found in: International Symposium on Parallel and Distributed Computing
By Junmin Lin, Yu Chen, Wenlong Li, Aamer Jaleel, Zhizhong Tang
Issue Date: July 2009
pp. 153-160
This paper characterizes the memory behavior on emerging RMS (recognition, mining, and synthesis) workloads for future multi-core processors. As multi-core processors proliferate across different application domains, and the number of on-die cores continue...
 
Set-Dueling-Controlled Adaptive Insertion for High-Performance Caching
Found in: IEEE Micro
By Moinuddin K. Qureshi, Aamer Jaleel, Yale N. Patt, Simon C. Steely Jr., Joel Emer
Issue Date: January 2008
pp. 91-98
The commonly used LRU replacement policy causes thrashing for memory-intensive workloads. A simple mechanism that dynamically changes the insertion policy used by LRU replacement reduces cache misses by 21 percent and requires a total storage overhead of l...
 
Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads and Scaling
Found in: International Symposium on High-Performance Computer Architecture
By Brinda Ganesh, Aamer Jaleel, David Wang, Bruce Jacob
Issue Date: February 2007
pp. 109-120
Performance gains in memory have traditionally been obtained by increasing memory bus widths and speeds. The diminishing returns of such techniques have led to the proposal of an alternate architecture, the Fully-Buffered DIMM. This new standard replaces t...
 
In-Line Interrupt Handling and Lock-Up Free Translation Lookaside Buffers (TLBs)
Found in: IEEE Transactions on Computers
By Aamer Jaleel, Bruce Jacob
Issue Date: May 2006
pp. 559-574
The effects of the general-purpose precise interrupt mechanisms in use for the past few decades have received very little attention. When modern out-of-order processors handle interrupts precisely, they typically begin by flushing the pipeline to make the ...
 
In-Line Interrupt Handling for Software-Managed TLBs
Found in: International Conference on Computer Design
By Aamer Jaleel, Bruce Jacob
Issue Date: September 2001
p. 62
The general-purpose precise interrupt mechanism, which has long been used to handle exceptional conditions that occur infrequently, is now being used increasingly often to handle conditions that are neither exceptional nor infrequent. One example...
 
Efficient Spatial Processing Element Control via Triggered Instructions
Found in: IEEE Micro
By Angshuman Parashar, Michael Pellauer, Michael Adler, Bushra Ahsan, Neal Crago, Daniel Lustig, Vladimir Pavlov, Antonia Zhai, Mohit Gambhir, Aamer Jaleel, Randy Allmon, Rachid Rayess, Stephen Maresh, Joel Emer
Issue Date: May 2014
pp. 120-137
In this article, the authors present triggered instructions, a novel control paradigm for arrays of processing elements (PEs) aimed at exploiting spatial parallelism. Triggered instructions completely eliminate the program counter and allow programs to tra...
 
Sandbox Prefetching: Safe run-time evaluation of aggressive prefetchers
Found in: 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)
By Seth H. Pugsley, Zeshan Chishti, Chris Wilkerson, Peng-fei Chuang, Robert L. Scott, Aamer Jaleel, Shih-Lien Lu, Kingsum Chow, Rajeev Balasubramonian
Issue Date: February 2014
pp. 626-637
Memory latency is a major factor in limiting CPU performance, and prefetching is a well-known method for hiding memory latency. Overly aggressive prefetching can waste scarce resources such as memory bandwidth and cache capacity, limiting or even hurting p...
   
Undersubscribed threading on clustered cache architectures
Found in: 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)
By Wim Heirman, Trevor E. Carlson, Kenzo Van Craeynest, Ibrahim Hur, Aamer Jaleel, Lieven Eeckhout
Issue Date: February 2014
pp. 678-689
Recent many-core processors such as Intel's Xeon Phi and GPGPUs specialize in running highly scalable parallel applications at high performance while simultaneously embracing energy efficiency as a first-order design constraint. The traditional belief is t...
   
Fairness-aware scheduling on single-ISA heterogeneous multi-cores
Found in: 2013 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT)
By Kenzo Van Craeynest, Shoaib Akram, Wim Heirman, Aamer Jaleel, Lieven Eeckhout
Issue Date: September 2013
pp. 177-187
Single-ISA heterogeneous multi-cores consisting of small (e.g., in-order) and big (e.g., out-of-order) cores dramatically improve energy- and power-efficiency by scheduling workloads on the most appropriate core type. A significant body of recent work has ...
   
Using in-flight chains to build a scalable cache coherence protocol
Found in: ACM Transactions on Architecture and Code Optimization (TACO)
By Aamer Jaleel, Carl Beckmann, Joel Emer, Samantika Subramaniam, Simon C. Steely, Tryggve Fossum, Will Hasenplaugh
Issue Date: December 2013
pp. 1-24
As microprocessor designs integrate more cores, scalability of cache coherence protocols becomes a challenging problem. Most directory-based protocols avoid races by using blocking tag directories that can impact the performance of parallel applications. I...
     
Triggered instructions: a control paradigm for spatially-programmed architectures
Found in: Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13)
By Aamer Jaleel, Angshuman Parashar, Antonia Zhai, Bushra Ahsan, Daniel Lustig, Joel Emer, Michael Adler, Michael Pellauer, Mohit Gambhir, Neal Crago, Rachid Rayess, Randy Allmon, Stephen Maresh, Vladimir Pavlov
Issue Date: June 2013
pp. 142-153
In this paper, we present triggered instructions, a novel control paradigm for arrays of processing elements (PEs) aimed at exploiting spatial parallelism. Triggered instructions completely eliminate the program counter and allow programs to transition con...
     
Scheduling heterogeneous multi-cores through Performance Impact Estimation (PIE)
Found in: Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA '12)
By Aamer Jaleel, Joel Emer, Kenzo Van Craeynest, Lieven Eeckhout, Paolo Narvaez
Issue Date: June 2012
pp. 213-224
Single-ISA heterogeneous multi-core processors are typically composed of small (e.g., in-order) power-efficient cores and big (e.g., out-of-order) high-performance cores. The effectiveness of heterogeneous multi-cores depends on how well a scheduler can ma...
     
CRUISE: cache replacement and utility-aware scheduling
Found in: Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '12)
By Samantika Subramaniam, Simon C. Steely, Aamer Jaleel, Hashem H. Najaf-abadi, Joel Emer
Issue Date: March 2012
pp. 249-260
When several applications are co-scheduled to run on a system with multiple shared LLCs, there is opportunity to improve system performance. This opportunity can be exploited by the hardware, software, or a combination of both hardware and software. The so...
     
The gradient-based cache partitioning algorithm
Found in: ACM Transactions on Architecture and Code Optimization (TACO)
By Aamer Jaleel, Pritpal S. Ahuja, Joel Emer, Simon Steely Jr., William Hasenplaugh
Issue Date: January 2012
pp. 1-21
This paper addresses the problem of partitioning a cache between multiple concurrent threads and in the presence of hardware prefetching. Cache replacement designed to preserve temporal locality (e.g., LRU) will allocate cache resources proportional to the...
     
Non-monopolizable caches: Low-complexity mitigation of cache side channel attacks
Found in: ACM Transactions on Architecture and Code Optimization (TACO)
By Dmitry Ponomarev, Nael Abu-Ghazaleh, Aamer Jaleel, Jason Loew, Leonid Domnitser
Issue Date: January 2012
pp. 1-21
We propose a flexibly-partitioned cache design that either drastically weakens or completely eliminates cache-based side channel attacks. The proposed Non-Monopolizable (NoMo) cache dynamically reserves cache lines for active threads and prevents other co-...
     
PACMan: prefetch-aware cache management for high performance caching
Found in: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44 '11)
By Aamer Jaleel, Joel Emer, Carole-Jean Wu, Margaret Martonosi, Simon C. Steely
Issue Date: December 2011
pp. 442-453
Hardware prefetching and last-level cache (LLC) management are two independent mechanisms to mitigate the growing latency to memory. However, the interaction between LLC management and hardware prefetching has received very little attention. This paper cha...
     
SHiP: signature-based hit predictor for high performance caching
Found in: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44 '11)
By Aamer Jaleel, Carole-Jean Wu, Margaret Martonosi, Simon C. Steely, Joel Emer, Will Hasenplaugh
Issue Date: December 2011
pp. 430-441
The shared last-level caches in CMPs play an important role in improving application performance and reducing off-chip memory bandwidth requirements. In order to use LLCs more efficiently, recent research has shown that changing the re-reference prediction...
     
High performance cache replacement using re-reference interval prediction (RRIP)
Found in: Proceedings of the 37th annual international symposium on Computer architecture (ISCA '10)
By Aamer Jaleel, Joel Emer, Kevin B. Theobald, Simon C. Steely
Issue Date: June 2010
pp. 72 ff.
Practical cache replacement policies attempt to emulate optimal replacement by predicting the re-reference interval of a cache block. The commonly used LRU replacement policy always predicts a near-immediate re-reference interval on cache hits and misses. ...
     
Adaptive insertion policies for managing shared caches
Found in: Proceedings of the 17th international conference on Parallel architectures and compilation techniques (PACT '08)
By Aamer Jaleel, Joel Emer, Julien Sebot, Moinuddin Qureshi, Simon Steely, William Hasenplaugh
Issue Date: October 2008
p. 133
Chip Multiprocessors (CMPs) allow different applications to concurrently execute on a single chip. When applications with differing demands for memory compete for a shared cache, the conventional LRU replacement policy can significantly degrade cache perfo...
     
Adaptive insertion policies for high performance caching
Found in: Proceedings of the 34th annual international symposium on Computer architecture (ISCA '07)
By Aamer Jaleel, Joel Emer, Moinuddin K. Qureshi, Simon C. Steely, Yale N. Patt
Issue Date: June 2007
pp. 381-391
The commonly used LRU replacement policy is susceptible to thrashing for memory-intensive workloads that have a working set greater than the available cache size. For such applications, the majority of lines traverse from the MRU position to the LRU positi...
     