Parallel Programming Models for Heterogeneous Multicore Architectures
Found in: IEEE Micro
By Roger Ferrer, Pieter Bellens, Vicenc Beltran, Marc Gonzalez, Xavier Martorell, Rosa M. Badia, Eduard Ayguade, Jae-Seung Yeom, Scott Schneider, Konstantinos Koukos, Michail Alvanos, Dimitrios S. Nikolopoulos, Angelos Bilas
Issue Date:September 2010
pp. 42-53
This article evaluates the scalability and productivity of six parallel programming models for heterogeneous architectures, and finds that task-based models using code and data annotations require the minimum programming effort while sustaining ne...
 
Hardware-software coherence protocol for the coexistence of caches and local memories
Found in: 2012 SC - International Conference for High Performance Computing, Networking, Storage and Analysis
By Lluc Alvarez, Lluis Vilanova, Marc Gonzalez, Xavier Martorell, Nacho Navarro, Eduard Ayguade
Issue Date:November 2012
pp. 1-11
Cache coherence protocols limit the scalability of chip multiprocessors. One solution is to introduce a local memory alongside the cache hierarchy, forming a hybrid memory system. Local memories are more power-efficient than caches and they do not generate...
 
Accelerating Boosting-Based Face Detection on GPUs
Found in: 2012 41st International Conference on Parallel Processing (ICPP)
By David Oro, Carles Fernández, Carlos Segura, Xavier Martorell, Javier Hernando
Issue Date:September 2012
pp. 309-318
The goal of face detection is to determine the presence of faces in arbitrary images, along with their locations and dimensions. As it happens with any graphics workloads, these algorithms benefit from data-level parallelism. Existing parallelization effor...
 
Productive Programming of GPU Clusters with OmpSs
Found in: 2012 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)
By Javier Bueno, Judit Planas, Alejandro Duran, Rosa M. Badia, Xavier Martorell, Eduard Ayguade, Jesus Labarta
Issue Date:May 2012
pp. 557-568
Clusters of GPUs are emerging as a new computational scenario. Programming them requires the use of hybrid models that increase the complexity of the applications, reducing the productivity of programmers. We present the implementation of OmpSs for cluster...
 
DMA++: On the Fly Data Realignment for On-Chip Memories
Found in: IEEE Transactions on Computers
By Nikola Vujic, Felipe Cabarcas, Marc Gonzalez, Alex Ramirez, Xavier Martorell, Eduard Ayguade
Issue Date:February 2012
pp. 237-250
Multimedia extensions based on Single-Instruction Multiple-Data (SIMD) units are widespread. They have been used, for some time, in processors and accelerators (e.g., the Cell SPEs). SIMD units usually have significant memory alignment constraints in order...
 
Automatic Prefetch and Modulo Scheduling Transformations for the Cell BE Architecture
Found in: IEEE Transactions on Parallel and Distributed Systems
By Nikola Vujic, Marc Gonzàlez, Xavier Martorell, Eduard Ayguadé
Issue Date:April 2010
pp. 494-505
Ease of programming is one of the main requirements for the broad acceptance of multicore systems without hardware support for transparent data transfer between local and global memories. Software cache is a robust approach to provide the user with a trans...
 
Performance-Driven Processor Allocation
Found in: IEEE Transactions on Parallel and Distributed Systems
By Julita Corbalan, Xavier Martorell, Jesus Labarta
Issue Date:July 2005
pp. 599-611
In current multiprogrammed multiprocessor systems, to take into account the performance of parallel applications is critical to decide an efficient processor allocation. In this paper, we present the Performance-Driven ...
 
Optimizing NANOS OpenMP for the IBM Cyclops Multithreaded Architecture
Found in: Parallel and Distributed Processing Symposium, International
By David Ródenas, Xavier Martorell, Eduard Ayguadé, Jesús Labarta, George Almási, Calin Cascaval, José Castaños, José Moreira
Issue Date:April 2005
pp. 110
In this paper, we present two approaches to improve the execution of OpenMP applications on the IBM Cyclops multithreaded architecture. Both solutions are independent and they are focused to obtain better performance through a better management of the cach...
 
Employing Nested OpenMP for the Parallelization of Multi-Zone Computational Fluid Dynamics Applications
Found in: Parallel and Distributed Processing Symposium, International
By Eduard Ayguade, Marc Gonzalez, Xavier Martorell, Gabriele Jost
Issue Date:April 2004
pp. 6a
In this paper we describe the parallelization of the multi-zone code versions of the NAS Parallel Benchmarks employing multi-level OpenMP parallelism. For our study we use the NanosCompiler, which supports nesting of OpenMP directives and provides clauses ...
 
Enabling Dual-Core Mode in BlueGene/L: Challenges and Solutions
Found in: Computer Architecture and High Performance Computing, Symposium on
By George Almási, Leonardo R. Bachega, Siddhartha Chatterjee, Manish Gupta, Derek Lieber, Xavier Martorell, José E. Moreira
Issue Date:November 2003
pp. 19
BlueGene/L is a massively parallel computer system with 65,536 dual-processor compute nodes. The peak performance of BlueGene/L is in excess of 360 TFLOP/s if both processor cores in a node are used for computation. The main challenge of deploying this dua...
 
Application/Kernel Cooperation Towards the Efficient Execution of Shared-Memory Parallel Java Codes
Found in: Parallel and Distributed Processing Symposium, International
By Jordi Guitart, Xavier Martorell, Jordi Torres, Eduard Ayguadé
Issue Date:April 2003
pp. 38a
In this paper we propose mechanisms to improve the performance of parallel Java applications executing on multiprogrammed shared-memory multiprocessors. The proposal is based on a dialog between each Java application and the underlying execution environmen...
 
Applying Interposition Techniques for Performance Analysis of OpenMP Parallel Applications
Found in: Parallel and Distributed Processing Symposium, International
By Marc Gonzalez, Albert Serra, Xavier Martorell, Jose Oliver, Eduard Ayguade, Jesus Labarta, Nacho Navarro
Issue Date:May 2000
pp. 235
Tuning parallel applications requires the use of effective tools for detecting performance bottlenecks. Along a parallel program execution, many individual situations of performance degradation may arise. We believe that an exhaustive and time-aware tracin...
 
CUDAlign 3.0: Parallel Biological Sequence Comparison in Large GPU Clusters
Found in: 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)
By Edans F. de O. Sandes, Guillermo Miranda, Alba C.M.A. de Melo, Xavier Martorell, Eduard Ayguade
Issue Date:May 2014
pp. 160-169
This paper proposes and evaluates a parallel strategy to execute the exact Smith-Waterman (SW) biological sequence comparison algorithm for huge DNA sequences in multi-GPU platforms. In our strategy, the computation of a single huge SW matrix is spread ove...
 
Hardware-Software Coherence Protocol for the Coexistence of Caches and Local Memories
Found in: IEEE Transactions on Computers
By Lluc Alvarez, Lluis Vilanova, Marc Gonzalez, Xavier Martorell, Nacho Navarro, Eduard Ayguade
Issue Date:October 2013
pp. 1
Cache coherence protocols limit the scalability of multicore and manycore architectures and are responsible for an important amount of the power consumed in the chip. A good way to alleviate these problems is to introduce a local memory alongside the cache...
 
Exploiting Multiple Levels of Parallelism in OpenMP: A Case Study
Found in: Parallel Processing, International Conference on
By Eduard Ayguade, Xavier Martorell, Jesus Labarta, Marc Gonzalez, Nacho Navarro
Issue Date:September 1999
pp. 172
Most current shared-memory parallel programming environments are based on thread packages that allow the exploitation of a single level of parallelism. These thread packages do not enable the spawning of new parallelism from a previously activated parallel...
 
Fine-grain parallel megabase sequence comparison with multiple heterogeneous GPUs
Found in: Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming (PPoPP '14)
By Alba C.M.A. Melo, Edans F. de O. Sandes, Eduard Ayguade, Guillermo Miranda, Xavier Martorell
Issue Date:February 2014
pp. 383-384
This paper proposes and evaluates a parallel strategy to execute the exact Smith-Waterman (SW) algorithm for megabase DNA sequences in heterogeneous multi-GPU platforms. In our strategy, the computation of a single huge SW matrix is spread over multiple GP...
     
Improving performance of all-to-all communication through loop scheduling in PGAS environments
Found in: Proceedings of the 27th international ACM conference on International conference on supercomputing (ICS '13)
By Ettore Tiotto, Gabriel Tanase, Xavier Martorell
Issue Date:June 2013
pp. 457-458
We present a decomposition method for the parallelization of multi-dimensional FFTs with two distinguishing features: adaptive decomposition and transpose order awareness for achieving minimal communication volume. Based on a row-wise decomposition that tr...
     
Implementing OmpSs support for regions of data in architectures with multiple address spaces
Found in: Proceedings of the 27th international ACM conference on International conference on supercomputing (ICS '13)
By Eduard Ayguadé, Javier Bueno, Jesús Labarta, Rosa M. Badia, Xavier Martorell
Issue Date:June 2013
pp. 359-368
The need for features for managing complex data accesses in modern programming models has increased due to the emerging hardware architectures. HPC hardware has moved towards clusters of accelerators and/or multicores, architectures with a complex memory h...
     
Improving communication in PGAS environments: static and dynamic coalescing in UPC
Found in: Proceedings of the 27th international ACM conference on International conference on supercomputing (ICS '13)
By Ettore Tiotto, Michail Alvanos, Xavier Martorell
Issue Date:June 2013
pp. 129-138
The goal of Partitioned Global Address Space (PGAS) languages is to improve programmer productivity in large scale parallel machines. However, PGAS programs may have many fine-grained shared accesses that lead to performance degradation. Manual code transf...
     
Implementation of a hierarchical N-body simulator using the Ompss programming model
Found in: Proceedings of the first workshop on Irregular applications: architectures and algorithm (IAAA '11)
By Miquel Pericas, Xavier Martorell, Yoav Etsion
Issue Date:November 2011
pp. 23-30
Many HPC algorithms are highly irregular. They have input-dependent control flow and operate on pointer-based data structures such as trees, graphs, or linked lists. This irregularity makes it challenging to parallelize such algorithms in order to efficien...
     
Design space exploration for aggressive core replication schemes in CMPs
Found in: Proceedings of the 20th international symposium on High performance distributed computing (HPDC '11)
By Eduard Ayguade, Lluc Alvarez, Marc Gonzalez, Nacho Navarro, Ramon Bertran, Xavier Martorell
Issue Date:June 2011
pp. 269-270
Chip multiprocessors (CMPs) are the dominating architectures nowadays. There is a big variety of designs in current CMPs, with different number of cores and memory subsystems. This is because they are used in a wide spectrum of domains, each of them with t...
     
Poster: programming clusters of GPUs with OMPSs
Found in: Proceedings of the international conference on Supercomputing (ICS '11)
By Alejandro Duran, Eduard Ayguade, Javier Bueno, Jesus Labarta, Rosa M. Badia, Xavier Martorell
Issue Date:May 2011
pp. 378-378
OmpSs is a programming model that provides an environment to develop parallel applications for cluster environments with heterogeneous architectures. Based on OpenMP and StarSs, it offers a set of compiler directives that can be used to annotate a sequenti...
     
Hybrid access-specific software cache techniques for the cell BE architecture
Found in: Proceedings of the 17th international conference on Parallel architectures and compilation techniques (PACT '08)
By Alexandre E. Eichenberger, Eduard Ayguade, Kathryn O'Brien, Kevin O'Brien, Marc Gonzalez, Nikola Vujic, Tao Zhang, Tong Chen, Xavier Martorell, Zehra Sura
Issue Date:October 2008
pp. 133-133
Ease of programming is one of the main impediments for the broad acceptance of multi-core systems with no hardware support for transparent data transfer between local and global memories. Software cache is a robust approach to provide the user with a trans...
     
Evaluation of the memory page migration influence in the system performance: the case of the SGI O2000
Found in: Proceedings of the 17th annual international conference on Supercomputing (ICS '03)
By Jesus Labarta, Julita Corbalan, Xavier Martorell
Issue Date:June 2003
pp. 121-129
Current shared-memory multiprocessor CC-NUMA architectures provide a global address space to applications by hardware. However, even though the memory is virtually shared, it is actually physically distributed. Since memory nodes are distributed across the...
     
Improving Gang Scheduling through job performance analysis and malleability
Found in: Proceedings of the 15th international conference on Supercomputing (ICS '01)
By Jesus Labarta, Julita Corbalan, Xavier Martorell
Issue Date:June 2001
pp. 303-311
The OpenMP programming model provides parallel applications a very important feature: job malleability. Job malleability is the capacity of an application to dynamically adapt its parallelism to the number of processors allocated to it. We believe that job...
     
Kernel-level scheduling for the nano-threads programming model
Found in: Proceedings of the 12th international conference on Supercomputing (ICS '98)
By Dimitrios S. Nikolopoulos, Eleftherios D. Polychronopoulos, Jesus Labarta, Nacho Navarro, Theodore S. Papatheodorou, Xavier Martorell
Issue Date:July 1998
pp. 337-344
To minimize the amount of computation and storage for parallel sparse factorization, sparse matrices have to be reordered prior to factorization. We show that none of the popular ordering heuristics proposed before, namely, multiple minimum degree and nest...
     