Displaying 1-50 out of 80 total
A Scalable, Non-blocking Approach to Transactional Memory
Found in: High-Performance Computer Architecture, International Symposium on
By Hassan Chafi, Jared Casper, Brian D. Carlstrom, Austen McDonald, Chi Cao Minh, Woongki Baek, Christos Kozyrakis, Kunle Olukotun
Issue Date:February 2007
pp. 97-108
Transactional Memory (TM) provides mechanisms that promise to simplify parallel programming by eliminating the need for locks and their associated problems (deadlock, livelock, priority inversion, convoying). For TM to be adopted in the long term, not onl...
 
Transactional Coherence and Consistency: Simplifying Parallel Hardware and Software
Found in: IEEE Micro
By Lance Hammond, Brian D. Carlstrom, Vicky Wong, Michael Chen, Christos Kozyrakis, Kunle Olukotun
Issue Date:November 2004
pp. 92-103
TCC simplifies parallel hardware and software design by eliminating the need for conventional cache coherence and consistency models and letting programmers parallelize a wide range of applications with a simple, lock-free transactional model.
 
A Single-Chip Multiprocessor
Found in: Computer
By Lance Hammond, Basem A. Nayfeh, Kunle Olukotun
Issue Date:September 1997
pp. 79-85
These Stanford University researchers present the case for billion-transistor processor architectures that will consist of chip multiprocessors (CMPs): multiple (four to 16) simple, fast processors on one chip. In their proposal, each processor is...
 
A Heterogeneous Parallel Framework for Domain-Specific Languages
Found in: Parallel Architectures and Compilation Techniques, International Conference on
By Kevin J. Brown, Arvind K. Sujeeth, Hyouk Joong Lee, Tiark Rompf, Hassan Chafi, Martin Odersky, Kunle Olukotun
Issue Date:October 2011
pp. 89-100
Computing systems are becoming increasingly parallel and heterogeneous, and therefore new applications must be capable of exploiting parallelism in order to continue achieving high performance. However, targeting these emerging devices often requires using...
 
Efficient Parallel Graph Exploration on Multi-Core CPU and GPU
Found in: Parallel Architectures and Compilation Techniques, International Conference on
By Sungpack Hong, Tayo Oguntebi, Kunle Olukotun
Issue Date:October 2011
pp. 78-88
Graphs are a fundamental data representation that has been used extensively in various domains. In graph-based applications, a systematic exploration of the graph such as a breadth-first search (BFS) often serves as a key component in the processing of the...
 
Runtime automatic speculative parallelization
Found in: Code Generation and Optimization, IEEE/ACM International Symposium on
By Ben Hertzberg, Kunle Olukotun
Issue Date:April 2011
pp. 64-73
We present Runtime Automatic Speculative Parallelization (RASP), a technique for the dynamic extraction of speculative threads from a running application in a user-transparent fashion. By leveraging the idle cores in a CMP to analyze, optimize, and partici...
 
FARM: A Prototyping Environment for Tightly-Coupled, Heterogeneous Architectures
Found in: Field-Programmable Custom Computing Machines, Annual IEEE Symposium on
By Tayo Oguntebi, Sungpack Hong, Jared Casper, Nathan Bronson, Christos Kozyrakis, Kunle Olukotun
Issue Date:May 2010
pp. 221-228
Computer architectures are increasingly turning to parallelism and heterogeneity as solutions for boosting performance in the face of power constraints. As this trend continues, the challenges of simulating and evaluating these architectures have grown. Ha...
 
A Large-Scale Architecture for Restricted Boltzmann Machines
Found in: Field-Programmable Custom Computing Machines, Annual IEEE Symposium on
By Sang Kyun Kim, Peter Leonard McMahon, Kunle Olukotun
Issue Date:May 2010
pp. 201-208
Deep Belief Nets (DBNs) are an emerging application in the machine learning domain, which use Restricted Boltzmann Machines (RBMs) as their basic building block. Although small scale DBNs have shown great potential, the computational cost of RBM training h...
 
Implementing and Evaluating a Model Checker for Transactional Memory Systems
Found in: Engineering of Complex Computer Systems, IEEE International Conference on
By Woongki Baek, Nathan Bronson, Christos Kozyrakis, Kunle Olukotun
Issue Date:March 2010
pp. 117-126
Transactional Memory (TM) is a promising technique that addresses the difficulty of parallel programming. Since TM takes responsibility for all concurrency control, TM systems are highly vulnerable to subtle correctness errors. Due to the difficulty of ful...
 
The OpenTM Transactional Application Programming Interface
Found in: Parallel Architectures and Compilation Techniques, International Conference on
By Woongki Baek, Chi Cao Minh, Martin Trautmann, Christos Kozyrakis, Kunle Olukotun
Issue Date:September 2007
pp. 376-387
Transactional Memory (TM) simplifies parallel programming by supporting atomic and isolated execution of user-identified tasks. To date, TM programming has required the use of libraries that make it difficult to achieve scalable performance with code that ...
 
Architectural Semantics for Practical Transactional Memory
Found in: Computer Architecture, International Symposium on
By Austen McDonald, JaeWoong Chung, Brian D. Carlstrom, Chi Cao Minh, Hassan Chafi, Christos Kozyrakis, Kunle Olukotun
Issue Date:June 2006
pp. 53-65
Transactional Memory (TM) simplifies parallel programming by allowing for parallel execution of atomic tasks. Thus far, TM systems have focused on implementing transactional state buffering and conflict resolution. Missing is a robust hardware/software int...
 
Characterization of TCC on Chip-Multiprocessors
Found in: Parallel Architectures and Compilation Techniques, International Conference on
By Austen McDonald, JaeWoong Chung, Hassan Chafi, Chi Cao Minh, Brian D. Carlstrom, Lance Hammond, Christos Kozyrakis, Kunle Olukotun
Issue Date:September 2005
pp. 63-74
Transactional Coherence and Consistency (TCC) is a novel coherence scheme for shared memory multiprocessors that uses programmer-defined transactions as the fundamental unit of parallel work, synchronization, coherence, and consistency. TCC has th...
 
An Application Analysis Framework For Polymorphic Chip Multiprocessors
Found in: Parallel and Distributed Processing Symposium, International
By Ayodele Thomas, Kunle Olukotun
Issue Date:April 2005
pp. 109b
The SAPIENT parallel analysis framework facilitates the efficient transformation of sequential applications into multilevel parallel applications that can be executed on polymorphic chip multiprocessor architectures. We demonstrate how application characte...
 
The Jrpm System for Dynamically Parallelizing Java Programs
Found in: Computer Architecture, International Symposium on
By Michael K. Chen, Kunle Olukotun
Issue Date:June 2003
pp. 434
We describe the Java runtime parallelizing machine (Jrpm), a complete system for parallelizing sequential programs automatically. Jrpm is based on a chip multiprocessor (CMP) with thread-level speculation (TLS) support. CMPs have low sharing and communicat...
 
TEST: A Tracer for Extracting Speculative Threads
Found in: Code Generation and Optimization, IEEE/ACM International Symposium on
By Michael Chen, Kunle Olukotun
Issue Date:March 2003
pp. 301
Thread-level speculation (TLS) allows sequential programs to be arbitrarily decomposed into threads that can be safely executed in parallel. A key challenge for TLS processors is choosing thread decompositions that speed up the program. Current techniques f...
 
The Stanford Hydra CMP
Found in: IEEE Micro
By Lance Hammond, Benedict A. Hubbert, Michael Siu, Manohar K. Prabhu, Michael Chen, Kunle Olukotun
Issue Date:March 2000
pp. 71-84
Chip multiprocessors offer an economical, scalable architecture for future microprocessors. Thread-level speculation support allows them to speed up past software.
 
JMTP: An Architecture for Exploiting Concurrency in Embedded Java Applications with Real-time Considerations
Found in: Computer-Aided Design, International Conference on
By Rachid Helaihel, Kunle Olukotun
Issue Date:November 1999
pp. 551
Using Java in embedded systems is plagued by problems of limited runtime performance and unpredictable runtime behavior. The Java Multi-Threaded Processor (JMTP) provides solutions to these problems. The JMTP architecture is a single chip containing an off...
 
Java as a Specification Language for Hardware-Software Systems
Found in: Computer-Aided Design, International Conference on
By Rachid Helaihel, Kunle Olukotun
Issue Date:November 1997
pp. 690
The specification language is a critical component of the hardware-software co-design process since it is used for functional validation and as a starting point for hardware-software partitioning and co-synthesis. This paper proposes the Java programming...
 
The Benefits of Clustering in Shared Address Space Multiprocessors: An Applications-Driven Investigation
Found in: SC Conference
By Andrew Erlichson, Basem A. Nayfeh, Jaswinder P. Singh, Kunle Olukotun
Issue Date:December 1995
pp. 60
Clustering processors together at a level of the memory hierarchy in shared address space multiprocessors appears to be an attractive technique from several standpoints: Resources are shared, packaging technologies are exploited, and processors within a cl...
 
A Software-Hardware Cosynthesis Approach to Digital System Simulation
Found in: IEEE Micro
By Kunle A. Olukotun, Rachid Helaihel, Jeremy Levitt, Ricardo Ramirez
Issue Date:August 1994
pp. 48-58
Our approach to digital system simulation compiles a high-level system model into a high-performance simulator that consists of software and hardware components. The target architecture for the simulation compiler is a tightly coupled processor an...
 
Implementing Domain-Specific Languages for Heterogeneous Parallel Computing
Found in: IEEE Micro
By HyoukJoong Lee, Kevin J. Brown, Arvind K. Sujeeth, Hassan Chafi, Kunle Olukotun, Tiark Rompf, Martin Odersky
Issue Date:September 2011
pp. 42-53
Domain-specific languages offer a solution to the performance and the productivity issues in heterogeneous computing systems. The Delite compiler framework simplifies the process of building embedded parallel DSLs. DSL developers can implement domain-speci...
 
Panel statement
Found in: 25th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2011)
By Per Stenström, Doug Burger, Wen-mei Hwu, Vipin Kumar, Kunle Olukotun, David Padua, Burton Smith
Issue Date:May 2011
pp. 877
No summary available.
 
Ubiquitous Parallel Computing from Berkeley, Illinois, and Stanford
Found in: IEEE Micro
By Bryan Catanzaro, Armando Fox, Kurt Keutzer, David Patterson, Bor-Yiing Su, Marc Snir, Kunle Olukotun, Pat Hanrahan, Hassan Chafi
Issue Date:March 2010
pp. 41-55
The ParLab at Berkeley, UPCRC-Illinois, and the Pervasive Parallel Laboratory at Stanford are studying how to make parallel programming succeed given industry's recent shift to multicore computing. All three centers assume that future microprocess...
 
Transactional Memory: The Hardware-Software Interface
Found in: IEEE Micro
By Austen McDonald, Brian D. Carlstrom, JaeWoong Chung, Chi Cao Minh, Hassan Chafi, Christos Kozyrakis, Kunle Olukotun
Issue Date:January 2007
pp. 67-76
This comprehensive architecture supports nested transactions, transaction handling, and two-phase commit. The result is a seamless integration of transactional memory with modern programming languages and runtime environments.
 
Niagara: A 32-Way Multithreaded Sparc Processor
Found in: IEEE Micro
By Poonacha Kongetira, Kathirgamar Aingaran, Kunle Olukotun
Issue Date:March 2005
pp. 21-29
The Niagara processor implements a thread-rich architecture designed to provide a high-performance solution for commercial server applications. The hardware supports 32 threads with a memory subsystem consisting of an on-board crossbar, level-2 cache, and ...
 
The Jrpm System for Dynamically Parallelizing Sequential Java Programs
Found in: IEEE Micro
By Michael K. Chen, Kunle Olukotun
Issue Date:November 2003
pp. 26-35
As instruction-level parallelism with a single thread of control approaches its performance limits, designers must find other architectural improvements to speed up program execution. The Jrpm system takes advantage of recent developments to enabl...
 
High Bandwidth On-Chip Cache Design
Found in: IEEE Transactions on Computers
By Kenneth M. Wilson, Kunle Olukotun
Issue Date:April 2001
pp. 292-307
In this paper, we evaluate the performance of high bandwidth cache organizations employing multiple cache ports, multiple cycle hit times, and cache port efficiency enhancements, such as load all and line buffer, to fin...
 
DCP: an Algorithm for Datapath/Control Partitioning of Synthesizable RTL Models
Found in: Computer Design, International Conference on
By Victor J. Lam, Kunle A. Olukotun
Issue Date:October 1998
pp. 442
Currently, the majority of ASIC and custom chip implementations go through a process by which a cycle-accurate synthesizable RTL model is refined into an RT/gate-level model that has been partitioned into datapath and control. This partitioning is usually ...
 
Digital System Simulation: Methodologies and Examples
Found in: Design Automation Conference
By Mark Heinrich, David Ofelt, Kunle Olukotun
Issue Date:June 1998
pp. 658-663
Two major trends in the digital design industry are the increase in system complexity and the increasing importance of short design times. The rise in design complexity is motivated by consumer demand for higher performance products as well as increases in...
 
A Quantitative Analysis of Reconfigurable Coprocessors for Multimedia Applications
Found in: Field-Programmable Custom Computing Machines, Annual IEEE Symposium on
By Takashi Miyamori, Kunle Olukotun
Issue Date:April 1998
pp. 2
Recently, computer architectures that combine a reconfigurable (or retargetable) coprocessor with a general-purpose microprocessor have been proposed. These architectures are designed to exploit large amounts of fine grain parallelism in applications. In t...
 
Verifying Correct Pipeline Implementation for Microprocessors
Found in: Computer-Aided Design, International Conference on
By Jeremy Levitt, Kunle Olukotun
Issue Date:November 1997
pp. 162
We introduce a general, automatic verification technique for pipelined designs. The technique is based on a scalable, formal methodology for analyzing pipelines. The key advantages to our technique are: it specifically targets pipeline control, making it m...
 
Multilevel Optimization of Pipelined Caches
Found in: IEEE Transactions on Computers
By Kunle Olukotun, Trevor N. Mudge, Richard B. Brown
Issue Date:October 1997
pp. 1093-1102
This paper formulates and shows how to solve the problem of selecting the cache size and depth of cache pipelining that maximizes the performance of a given instruction-set architecture. The solution combines trace-driv...
 
The Hierarchical Multi-Bank DRAM: A High-Performance Architecture for Memory Integrated with Processors
Found in: Advanced Research in VLSI, Conference on
By Tadaaki Yamauchi, Lance Hammond, Kunle Olukotun
Issue Date:September 1997
pp. 303
A microprocessor integrated with DRAM on the same die has the potential to improve system performance by reducing the memory latency and improving the memory bandwidth. However, a high performance microprocessor will typically send more accesses than the D...
 
A Scalable Formal Verification Methodology for Pipelined Microprocessors
Found in: Design Automation Conference
By Jeremy Levitt, Kunle Olukotun
Issue Date:June 1996
pp. 558-563
We describe a novel, formal verification technique for proving the correctness of a pipelined microprocessor that focuses specifically on pipeline control logic. We iteratively deconstruct a pipeline by merging adjacent pipeline stages, allowing for the ve...
 
A General Method for Compiling Event-Driven Simulations
Found in: Design Automation Conference
By Jeremy R. Levitt, Kunle Olukotun, Monica S. Lam, Robert S. French
Issue Date:June 1995
pp. 151-156
We present a new approach to event-driven simulation that does not use a centralized run-time event queue, yet is capable of handling arbitrary models, including those with unclocked feedback and nonunit delay. The elimination of the event queue significan...
 
Transactional Memory Coherence and Consistency
Found in: Computer Architecture, International Symposium on
By Lance Hammond, Vicky Wong, Mike Chen, Brian D. Carlstrom, John D. Davis, Ben Hertzberg, Manohar K. Prabhu, Honggo Wijaya, Christos Kozyrakis, Kunle Olukotun
Issue Date:June 2004
pp. 102
In this paper, we propose a new shared memory model: Transactional memory Coherence and Consistency (TCC). TCC provides a model in which atomic transactions are always the basic unit of parallel work, communication, memory coherence, and memory reference co...
 
Efficient State Representation for Symbolic Simulation
Found in: Design Automation Conference
By Valeria Bertacco, Kunle Olukotun
Issue Date:June 2002
pp. 99
Symbolic simulation is attracting increasing interest for the validation of digital circuits. It allows the verification engineer to explore all, or a major portion of the circuit's state space without having to design specific and time-consuming test stim...
 
Designing High Bandwidth On-Chip Caches
Found in: Computer Architecture, International Symposium on
By Kunle Olukotun, Kenneth M. Wilson
Issue Date:June 1997
pp. 121
In this paper we evaluate the performance of high bandwidth caches that employ multiple ports, multiple cycle hit times, on-chip DRAM, and a line buffer to find the organization that provides the best processor performance. Processor performance is measure...
 
Increasing Cache Port Efficiency for Dynamic Superscalar Microprocessors
Found in: Computer Architecture, International Symposium on
By Kunle Olukotun, Mendel Rosenblum, Kenneth M. Wilson
Issue Date:May 1996
pp. 147
The memory bandwidth demands of modern microprocessors require the use of a multi-ported cache to achieve peak performance. However, multi-ported caches are costly to implement. In this paper we propose techniques for improving the bandwidth of a single ca...
 
Evaluation of Design Alternatives for a Multiprocessor Microprocessor
Found in: Computer Architecture, International Symposium on
By Kunle Olukotun, Lance Hammond, Basem A. Nayfeh
Issue Date:May 1996
pp. 67
In the future, advanced integrated circuit processing and packaging technology will allow for several design options for multiprocessor microprocessors. In this paper we consider three architectures: shared-primary cache, shared-secondary cache, and shared...
 
Towards soft optimization techniques for parallel cognitive applications
Found in: Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures (SPAA '07)
By Chi Cao Minh, Christos Kozyrakis, JaeWoong Chung, Kunle Olukotun, Woongki Baek
Issue Date:June 2007
pp. 59-60
The Cell Broadband Engine™ is a new heterogeneous multi-core processor from IBM, Sony, and Toshiba. It contains eight co-processors, called Synergistic Processing Elements (SPEs), which operate directly on distinct 256 KB local stores, and also have ...
     
Simplifying Scalable Graph Processing with a Domain-Specific Language
Found in: Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO '14)
By Jennifer Widom, Kunle Olukotun, Semih Salihoglu, Sungpack Hong
Issue Date:February 2014
pp. 208-218
Large-scale graph processing, with its massive data sets, requires distributed processing. However, conventional frameworks for distributed graph processing, such as Pregel, use non-traditional programming models that are well-suited for parallelism and sc...
     
Beyond parallel programming with domain specific languages
Found in: Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming (PPoPP '14)
By Kunle Olukotun
Issue Date:February 2014
pp. 179-180
Today, almost all computer architectures are parallel and heterogeneous; a combination of multiple CPUs, GPUs and specialized processors. This creates a challenging problem for application developers who want to develop high performance programs without th...
     
On fast parallel detection of strongly connected components (SCC) in small-world graphs
Found in: Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis (SC '13)
By Sungpack Hong, Kunle Olukotun, Nicole C. Rodia
Issue Date:November 2013
pp. 1-11
Detecting strongly connected components (SCCs) in a directed graph is a fundamental graph analysis algorithm that is used in many science and engineering domains. Traditional approaches in parallel SCC detection, however, show limited performance and poor ...
     
Forge: generating a high performance DSL implementation from a declarative specification
Found in: Proceedings of the 12th international conference on Generative programming: concepts & experiences (GPCE '13)
By HyoukJoong Lee, Kevin J. Brown, Kunle Olukotun, Arvind K. Sujeeth, Austin Gibbons, Martin Odersky, Tiark Rompf
Issue Date:October 2013
pp. 145-154
Domain-specific languages provide a promising path to automatically compile high-level code to parallel, heterogeneous, and distributed hardware. However, in practice high performance DSLs still require considerable software expertise to develop and force ...
     
Optimizing data structures in high-level programs: new directions for extensible compilers based on staging
Found in: Proceedings of the 40th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages (POPL '13)
By Arvind K. Sujeeth, HyoukJoong Lee, Kevin J. Brown, Kunle Olukotun, Manohar Jonnalagedda, Martin Odersky, Nada Amin, Tiark Rompf, Vojin Jovanovic
Issue Date:January 2013
pp. 497-510
High level data structures are a cornerstone of modern programming and at the same time stand in the way of compiler optimizations. In order to reason about user- or library-defined data structures compilers need to be extensible. Common mechanisms to exte...
     
High performance embedded domain specific languages
Found in: Proceedings of the 17th ACM SIGPLAN international conference on Functional programming (ICFP '12)
By Kunle Olukotun
Issue Date:September 2012
pp. 139-140
Today, all high-performance computer architectures are parallel and heterogeneous; a combination of multiple CPUs, GPUs and specialized processors. This creates a complex programming problem for application developers. Domain-specific languages (DSLs) are ...
     
Green-Marl: a DSL for easy and efficient graph analysis
Found in: Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '12)
By Kunle Olukotun, Eric Sedlar, Hassan Chafi, Sungpack Hong
Issue Date:March 2012
pp. 349-362
The increasing importance of graph-data based applications is fueling the need for highly efficient and parallel implementations of graph analysis software. In this paper we describe Green-Marl, a domain-specific language (DSL) whose high level language co...
     
Hardware/software co-design for high performance computing: challenges and opportunities
Found in: Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis (CODES/ISSS '10)
By Kunle Olukotun, Richard C. Murphy, Stephen Poole, Sudip Dosanjh, X. Sharon Hu
Issue Date:October 2010
pp. 63-64
This special session aims to introduce to the hardware/software codesign community challenges and opportunities in designing high performance computing (HPC) systems. Though embedded system design and HPC system design have traditionally been considered as...
     
Language virtualization for heterogeneous parallel computing
Found in: Proceedings of the ACM international conference on Object oriented programming systems languages and applications (OOPSLA '10)
By Adriaan Moors, Arvind K. Sujeeth, Hassan Chafi, Kunle Olukotun, Martin Odersky, Pat Hanrahan, Tiark Rompf, Zach DeVito
Issue Date:October 2010
pp. 835-847
As heterogeneous parallel systems become dominant, application developers are being forced to turn to an incompatible mix of low level programming models (e.g. OpenMP, MPI, CUDA, OpenCL). However, these models do little to shield developers from the difficu...
     