Three-Dimensional Memory Vectorization for High Bandwidth Media Memory Systems
Found in: Microarchitecture, IEEE/ACM International Symposium on
By Jesus Corbal, Roger Espasa, Mateo Valero
Issue Date:November 2002
pp. 149
Vector processors have good performance, cost and adaptability when targeting multimedia applications. However, for a significant number of media programs, conventional memory configurations fail to deliver enough memory references per cycle to feed the SI...
 
Microarchitectural Support for Speculative Register Renaming
Found in: Parallel and Distributed Processing Symposium, International
By Jesus Alastruey, Teresa Monreal, Victor Vinals, Mateo Valero
Issue Date:March 2007
pp. 47
This paper proposes and evaluates a new microarchitecture for out-of-order processors that supports speculative renaming. We call speculative renaming the speculative omission of physical register allocation along with the speculative early release of p...
 
Kilo-Instruction Processors: Overcoming the Memory Wall
Found in: IEEE Micro
By Adrián Cristal, Oliverio J. Santana, Francisco Cazorla, Marco Galluzzi, Tanausú Ramírez, Miquel Pericas, Mateo Valero
Issue Date:May 2005
pp. 48-57
Kilo-instruction processors are a new type of out-of-order superscalar processor that overlaps long memory access delays by maintaining thousands of in-flight instructions, in a scalable, efficient manner.
 
Exploiting a New Level of DLP in Multimedia Applications
Found in: Microarchitecture, IEEE/ACM International Symposium on
By Jesus Corbal, Mateo Valero, Roger Espasa
Issue Date:November 1999
pp. 72
This paper proposes and evaluates MOM: a novel ISA paradigm targeted at multimedia applications. By fusing conventional vector ISA approaches together with more recent SIMD-like (Single Instruction Multiple Data) ISAs (such as MMX), we have developed a new...
 
Command Vector Memory Systems: High Performance at Low Cost
Found in: Parallel Architectures and Compilation Techniques, International Conference on
By Jesus Corbal, Roger Espasa, Mateo Valero
Issue Date:October 1998
pp. 68
The focus of this paper is on designing both a low cost and high performance, high bandwidth vector memory system that takes advantage of modern commodity SDRAM memory chips. To successfully extract the full bandwidth from SDRAM parts, we propose a new mem...
 
HPC System Software for Regular and Irregular Parallel Applications
Found in: 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)
By Alessandro Morari, Mateo Valero
Issue Date:May 2013
pp. 2242-2245
The upcoming generation of system software for High Performance Computing is expected to provide a richer set of functionalities without compromising application performance. This Ph.D. thesis addresses the problem of designing scalable system software for...
 
Efficient Sorting on the Tilera Manycore Architecture
Found in: 2012 24th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)
By Alessandro Morari, Antonino Tumeo, Oreste Villa, Simone Secchi, Mateo Valero
Issue Date:October 2012
pp. 171-178
We present an efficient implementation of the radix sort algorithm for the Tilera TILEPro64 processor. The TILEPro64 is one of the first successful commercial manycore processors. It is composed of 64 tiles interconnected through multiple fast Networks-on-...
 
Evaluating the Impact of TLB Misses on Future HPC Systems
Found in: 2012 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)
By Alessandro Morari, Roberto Gioiosa, Robert W. Wisniewski, Bryan S. Rosenburg, Todd A. Inglett, Mateo Valero
Issue Date:May 2012
pp. 1010-1021
TLB misses have been considered an important source of system overhead and one of the causes that limit scalability on large supercomputers. This assumption led to HPC lightweight kernel designs that usually statically map page table entries to TLB entrie...
 
Keynotes
Found in: 2011 Symposium on Computational Systems (WSCAD-SSC 2011)
By Jack Dongarra, Jeannette Wing, Mateo Valero
Issue Date:October 2011
pp. ix-xii
These keynote speeches discuss the following: Architecture-aware algorithms and software for peta and exascale computing; What's hot in computing; and Towards exaflop supercomputers.
   
STM2: A Parallel STM for High Performance Simultaneous Multithreading Systems
Found in: Parallel Architectures and Compilation Techniques, International Conference on
By Gokcen Kestor, Roberto Gioiosa, Tim Harris, Osman S. Unsal, Adrian Cristal, Ibrahim Hur, Mateo Valero
Issue Date:October 2011
pp. 221-231
Extracting high performance from modern chip multithreading (CMT) processors is a complex task, especially for large CMT systems. Programmers must efficiently parallelize performance-critical software while avoiding deadlocks and race conditions. Transacti...
 
A Quantitative Analysis of OS Noise
Found in: Parallel and Distributed Processing Symposium, International
By Alessandro Morari, Roberto Gioiosa, Robert W. Wisniewski, Francisco J. Cazorla, Mateo Valero
Issue Date:May 2011
pp. 852-863
Operating system noise is a well-known problem that may limit application scalability on large-scale machines, significantly reducing their performance. Though the problem is well studied, much of the previous work has been qualitative. We have developed a...
 
IA^3: An Interference Aware Allocation Algorithm for Multicore Hard Real-Time Systems
Found in: Real-Time and Embedded Technology and Applications Symposium, IEEE
By Marco Paolieri, Eduardo Quiñones, Francisco J. Cazorla, Robert I. Davis, Mateo Valero
Issue Date:April 2011
pp. 280-290
In multicore processors, the execution environment is defined as the environment in which tasks run and it is determined by the hardware resources they get and the workload with which they are executed. Thus, different execution environments lead to differ...
 
Multicore: The View from Europe
Found in: IEEE Micro
By Mateo Valero, Nacho Navarro
Issue Date:September 2010
pp. 2-4
In 2004, the European Commission funded the HiPEAC Network of Excellence to improve research in the fields of computer architecture and compilation in Europe. In response to the paradigm shift to multicore-based computers, the European Commission ...
 
Designing OS for HPC Applications: Scheduling
Found in: Cluster Computing, IEEE International Conference on
By Roberto Gioiosa, Sally A. McKee, Mateo Valero
Issue Date:September 2010
pp. 78-87
Operating systems have historically been implemented as independent layers between hardware and applications. User programs communicate with the OS through a set of well defined system calls, and do not have direct access to the hardware. The OS, in turn, ...
 
Optimizing job performance under a given power constraint in HPC centers
Found in: International Conference on Green Computing
By Maja Etinski, Julita Corbalan, Jesus Labarta, Mateo Valero
Issue Date:August 2010
pp. 257-267
Never-ending striving for performance has resulted in a tremendous increase in power consumption of HPC centers. Power budgeting has become very important for several reasons such as reliability, operating costs and limited power draw due to the existing ...
 
ITCA: Inter-task Conflict-Aware CPU Accounting for CMPs
Found in: Parallel Architectures and Compilation Techniques, International Conference on
By Carlos Luque, Miquel Moreto, Francisco J. Cazorla, Roberto Gioiosa, Alper Buyuktosunoglu, Mateo Valero
Issue Date:September 2009
pp. 203-213
Chip-MultiProcessor (CMP) architectures are becoming more and more popular as an alternative to the traditional processors that only extract instruction-level parallelism from an application. CMPs introduce complexities when accounting CPU utilization. Thi...
 
A distributed processor state management architecture for large-window processors
Found in: Microarchitecture, IEEE/ACM International Symposium on
By Isidro Gonzalez, Marco Galluzzi, Alex Veidenbaum, Marco A. Ramirez, Adrian Cristal, Mateo Valero
Issue Date:November 2008
pp. 11-22
Processor architectures with large instruction windows have been proposed to expose more instruction-level parallelism (ILP) and increase performance. Some of the proposed architectures replace a re-order buffer (ROB) with a check-pointing mechanism and an...
 
A dynamic scheduler for balancing HPC applications
Found in: SC Conference
By Carlos Boneti, Roberto Gioiosa, Francisco J. Cazorla, Mateo Valero
Issue Date:November 2008
pp. 1-12
Load imbalance causes significant performance degradation in High Performance Computing applications. In our previous work we showed that load imbalance can be alleviated by modern MT processors that provide mechanisms for controlling the allocation of proc...
 
Selection of the Register File Size and the Resource Allocation Policy on SMT Processors
Found in: Computer Architecture and High Performance Computing, Symposium on
By Jesús Alastruey, Teresa Monreal, Francisco Cazorla, Víctor Viñals, Mateo Valero
Issue Date:November 2008
pp. 63-70
The performance impact of the Physical Register File(PRF) size on Simultaneous Multithreading processors has not been extensively studied in spite of being a critical shared resource. In this paper we analyze the effect on performance of the PRF size for a...
 
A Two-Level Load/Store Queue Based on Execution Locality
Found in: Computer Architecture, International Symposium on
By Miquel Pericàs, Adrian Cristal, Francisco J. Cazorla, Ruben González, Alex Veidenbaum, Daniel A. Jiménez, Mateo Valero
Issue Date:June 2008
pp. 25-36
Multicore processors have emerged as a powerful platform on which to efficiently exploit thread-level parallelism (TLP). However, due to Amdahl’s Law, such designs will be increasingly limited by the remaining sequential components of applications. To over...
 
Software-Controlled Priority Characterization of POWER5 Processor
Found in: Computer Architecture, International Symposium on
By Carlos Boneti, Francisco J. Cazorla, Roberto Gioiosa, Alper Buyuktosunoglu, Chen-Yong Cher, Mateo Valero
Issue Date:June 2008
pp. 415-426
Due to the limitations of instruction-level parallelism, thread-level parallelism has become a popular way to improve processor performance. One example is the IBM POWER5™ processor, a two-context simultaneous-multithreaded dual-core chip. In each SMT cor...
 
Balancing HPC applications through smart allocation of resources in MT processors
Found in: Parallel and Distributed Processing Symposium, International
By Carlos Boneti, Roberto Gioiosa, Francisco J. Cazorla, Julita Corbalan, Jesus Labarta, Mateo Valero
Issue Date:April 2008
pp. 1-12
Many studies have shown that load imbalancing causes significant performance degradation in High Performance Computing (HPC) applications. Nowadays, Multi-Threaded (MT) processors are widely used in HPC for their good performance/energy consumption and pe...
 
A Flexible Heterogeneous Multi-Core Architecture
Found in: Parallel Architectures and Compilation Techniques, International Conference on
By Miquel Pericas, Adrian Cristal, Francisco J. Cazorla, Ruben Gonzalez, Daniel A. Jimenez, Mateo Valero
Issue Date:September 2007
pp. 13-24
Multi-core processors naturally exploit thread-level parallelism (TLP). However, extracting instruction-level parallelism (ILP) from individual applications or threads is still a challenge as application mixes in this environment are nonuniform. Thus, ...
 
FAME: FAirly MEasuring Multithreaded Architectures
Found in: Parallel Architectures and Compilation Techniques, International Conference on
By Javier Vera, Francisco J. Cazorla, Alex Pajuelo, Oliverio J. Santana, Enrique Fernandez, Mateo Valero
Issue Date:September 2007
pp. 305-316
Nowadays, multithreaded architectures are becoming more and more popular. In order to evaluate their behavior, several methodologies and metrics have been proposed. A methodology defines when the measurements of a given workload execution are taken. A metr...
 
Explaining Dynamic Cache Partitioning Speed Ups
Found in: IEEE Computer Architecture Letters
By Miquel Moreto Planas, Francisco Cazorla, Alex Ramirez, Mateo Valero
Issue Date:January 2007
pp. 1-4
Cache Partitioning has been proposed as an interesting alternative to traditional eviction policies of shared cache levels in modern CMP architectures: throughput is improved at the expense of a reasonable cost. However, these new policies present differen...
 
Architectural impact of stateful networking applications
Found in: Symposium On Architecture For Networking And Communications Systems
By Javier Verdú, Jorge García, Mario Nemirovsky, Mateo Valero
Issue Date:October 2005
pp. 11-18
The explosive and robust growth of the Internet owes a lot to the
 
A Vector-μSIMD-VLIW Architecture for Multimedia Applications
Found in: Parallel Processing, International Conference on
By Esther Salamí, Mateo Valero
Issue Date:June 2005
pp. 69-77
Media processing has motivated strong changes in the focus and design of processors. These applications are composed of heterogeneous regions of code, some of them with high levels of DLP and other ones with only modest amounts of ILP. A common approach to...
 
A Complexity-Effective Simultaneous Multithreading Architecture
Found in: Parallel Processing, International Conference on
By Carmelo Acosta, Ayose Falcón, Alex Ramirez, Mateo Valero
Issue Date:June 2005
pp. 157-164
Different applications may exhibit radically different behaviors and thus have very different requirements in terms of hardware support. In Simultaneous Multithreading (SMT) architectures, the hardware is shared among multiple running applications...
 
Control-Flow Independence Reuse via Dynamic Vectorization
Found in: Parallel and Distributed Processing Symposium, International
By Alex Pajuelo, Antonio González, Mateo Valero
Issue Date:April 2005
pp. 21a
Current processors exploit out-of-order execution and branch prediction to improve instruction level parallelism. When a branch prediction is wrong, processors flush the pipeline and squash all the speculative work. However, control-flow independent instru...
 
Effective Instruction Prefetching via Fetch Prestaging
Found in: Parallel and Distributed Processing Symposium, International
By Ayose Falcón, Alex Ramirez, Mateo Valero
Issue Date:April 2005
pp. 20b
As technological process shrinks and clock rate increases, instruction caches can no longer be accessed in one cycle. Alternatives are implementing smaller caches (with higher miss rate) or large caches with a pipelined access (with higher branch mispredic...
 
Dynamically Controlled Resource Allocation in SMT Processors
Found in: Microarchitecture, IEEE/ACM International Symposium on
By Francisco J. Cazorla, Alex Ramirez, Mateo Valero, Enrique Fernández
Issue Date:December 2004
pp. 171-182
SMT processors increase performance by executing instructions from several threads simultaneously. These threads use the resources of the processor better by sharing them but, at the same time, threads are competing for these resources. The way critical re...
 
Implicit vs. Explicit Resource Allocation in SMT Processors
Found in: Digital Systems Design, Euromicro Symposium on
By Francisco J. Cazorla, Peter M. W. Knijnenburg, Rizos Sakellariou, Enrique Fernandez, Alex Ramirez, Mateo Valero
Issue Date:September 2004
pp. 44-51
In a Simultaneous Multithreaded (SMT) architecture, the front end of a superscalar is adapted in order to be able to fetch from several threads while the back end is shared among the threads. In this paper, we describe different reso...
 
Prophet/Critic Hybrid Branch Prediction
Found in: Computer Architecture, International Symposium on
By Ayose Falcón, Jared Stark, Alex Ramirez, Konrad Lai, Mateo Valero
Issue Date:June 2004
pp. 250
This paper introduces the prophet/critic hybrid conditional branch predictor, which has two component predictors that play the role of either prophet or critic. The prophet is a conventional predictor that uses branch history to predict the direction of th...
 
DCache Warn: An I-Fetch Policy to Increase SMT Efficiency
Found in: Parallel and Distributed Processing Symposium, International
By Francisco J. Cazorla, Alex Ramirez, Mateo Valero, Enrique Fernández
Issue Date:April 2004
pp. 74a
Simultaneous Multithreading (SMT) processors increase performance by executing instructions from multiple threads simultaneously. These threads share the processor's resources, but also compete for them. In this environment, a thread missing in th...
 
A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors
Found in: High-Performance Computer Architecture, International Symposium on
By Ayose Falcón, Alex Ramirez, Mateo Valero
Issue Date:February 2004
pp. 244
Simultaneous Multithreading (SMT) is an architectural technique that allows for the parallel execution of several threads simultaneously. Fetch performance has been identified as the most important bottleneck for SMT processors. The commonly adopt...
 
Out-of-Order Commit Processors
Found in: High-Performance Computer Architecture, International Symposium on
By Adrian Cristal, Daniel Ortega, Josep Llosa, Mateo Valero
Issue Date:February 2004
pp. 48
Modern out-of-order processors tolerate long latency memory operations by supporting a large number of in-flight instructions. This is particularly useful in numerical applications where branch speculation is normally not a problem and where the c...
 
Reducing Fetch Architecture Complexity Using Procedure Inlining
Found in: Interaction between Compilers and Computer Architecture, Annual Workshop on
By Oliverio J. Santana, Alex Ramirez, Mateo Valero
Issue Date:February 2004
pp. 97-106
Fetch engine performance is seriously limited by the branch prediction table access latency. This fact has led to the development of hardware mechanisms, like prediction overriding, aimed to tolerate this latency. However, prediction overriding r...
 
Direct Instruction Wakeup for Out-of-Order Processors
Found in: Innovative Architecture for Future Generation High-Performance Processors and Systems, International Workshop on
By Marco A. Ramírez, Adrian Cristal, Alexander V. Veidenbaum, Luis Villa, Mateo Valero
Issue Date:January 2004
pp. 2-9
Instruction queues consume a significant amount of power in high-performance processors, primarily due to instruction wakeup logic access to the queue structures. The wakeup logic delay is also a critical timing parameter. This paper proposes a new queue o...
 
Design and Implementation of High-Performance Memory Systems for Future Packet Buffers
Found in: Microarchitecture, IEEE/ACM International Symposium on
By Jorge García, Jesús Corbal, Llorenç Cerdà, Mateo Valero
Issue Date:December 2003
pp. 373
In this paper we address the design of a future high-speed router that supports line rates as high as OC-3072 (160 Gb/s), around one hundred ports and several service classes. Building such a high-speed router would raise many technological problems, one o...
 
Latency Tolerant Branch Predictors
Found in: Innovative Architecture for Future Generation High-Performance Processors and Systems, International Workshop on
By Oliverio J. Santana, Alex Ramirez, Mateo Valero
Issue Date:July 2003
pp. 30
The access latency of branch predictors is a well known problem of fetch engine design. Prediction overriding techniques are commonly accepted to overcome this problem. However, prediction overriding requires a complex recovery mechanism to discard the wro...
 
Hierarchical Clustered Register File Organization for VLIW Processors
Found in: Parallel and Distributed Processing Symposium, International
By Javier Zalamea, Josep Llosa, Eduard Ayguadé, Mateo Valero
Issue Date:April 2003
pp. 77a
Technology projections indicate that wire delays will become one of the biggest constraints in future microprocessor designs. To avoid long wire delays and therefore long cycle times, processor cores must be partitioned into components so that most of the ...
 
A Case for Resource-conscious Out-of-order Processors
Found in: IEEE Computer Architecture Letters
By Adrián Cristal, José F. Martínez, Josep Llosa, Mateo Valero
Issue Date:January 2003
pp. N/A
Modern out-of-order processors tolerate long-latency memory operations by supporting a large number of in-flight instructions. This is achieved in part through proper sizing of critical resources, such as register files or instruction queues. In light of t...
 
Fetching instruction streams
Found in: Microarchitecture, IEEE/ACM International Symposium on
By Alex Ramirez, Oliverio J. Santana, Josep L. Larriba-Pey, Mateo Valero
Issue Date:November 2002
pp. 371
Fetch performance is a very important factor because it effectively limits the overall processor performance. However, there is little performance advantage in increasing front-end performance beyond what the back-end can consume. For each processor design...
 
Cost-Effective Compiler Directed Memory Prefetching and Bypassing
Found in: Parallel Architectures and Compilation Techniques, International Conference on
By Daniel Ortega, Eduard Ayguadé, Jean-Loup Baer, Mateo Valero
Issue Date:September 2002
pp. 189
Ever increasing memory latencies and deeper pipelines push memory farther from the processor. Prefetching techniques aim to bridge these two gaps by fetching data in advance to both the L1 cache and the register file. Our main contribution in this paper...
 
Hardware Schemes for Early Register Release
Found in: Parallel Processing, International Conference on
By Teresa Monreal, Víctor Viñals, Antonio González, Mateo Valero
Issue Date:August 2002
pp. 5
Register files are becoming one of the critical components of current out-of-order processors in terms of delay and power consumption, since their potential to exploit instruction-level parallelism is quite related to the size and number of ports of the re...
 
Speculative Dynamic Vectorization
Found in: Computer Architecture, International Symposium on
By Alex Pajuelo, Antonio Gonzalez, Mateo Valero
Issue Date:May 2002
pp. 271
Traditional vector architectures have shown to be very effective for regular codes where the compiler can detect data-level parallelism. However, this SIMD parallelism is also present in irregular or pointer-rich codes, for which the compiler is quite limi...
 
On the Efficiency of Reductions in μ-SIMD Media Extensions
Found in: Parallel Architectures and Compilation Techniques, International Conference on
By Jesus Corbal, Roger Espasa, Mateo Valero
Issue Date:September 2001
pp. 83
Many important multimedia applications contain a significant fraction of reduction operations. Although, in general, multimedia applications are characterized by having high amounts of Data Level Parallelism, reductions and accumulations are dif...
 
Guest Editors' Introduction: Early 21st Century Processors
Found in: Computer
By Sriram Vajapeyam, Mateo Valero
Issue Date:April 2001
pp. 47-50
The computer architecture arena faces exciting challenges as it attempts to meet the design goals and constraints that new markets, changing applications, and fast-moving semiconductor technology impose.
 
The Effect of Code Reordering on Branch Prediction
Found in: Parallel Architectures and Compilation Techniques, International Conference on
By Alex Ramirez, Josep L. Larriba-Pey, Mateo Valero
Issue Date:October 2000
pp. 189
Branch prediction accuracy is a very important factor for superscalar processor performance. The ability to predict the outcome of a branch allows the processor to effectively use a large instruction window, and extract a larger amount of Instruction Level...
 
Multiple-Banked Register File Architectures
Found in: Computer Architecture, International Symposium on
By Mateo Valero, Antonio González, Nigel P. Topham, José-Lorenzo Cruz
Issue Date:June 2000
pp. 316
The register file access time is one of the critical delays in current superscalar processors. Its impact on processor performance is likely to increase in future processor generations, as they are expected to increase the issue width (which implies more r...
 