Second Life and the New Generation of Virtual Worlds
Found in: Computer
By Sanjeev Kumar, Jatin Chhugani, Changkyu Kim, Daehyun Kim, Anthony Nguyen, Pradeep Dubey, Christian Bienia, Youngmin Kim
Issue Date:September 2008
pp. 46-53
Unlike online games, metaverses present a single seamless, persistent world where users can transparently roam around without predefined objectives. An analysis of Second Life illustrates the demands such applications place on clients, servers, and the net...
 
Atomic Vector Operations on Chip Multiprocessors
Found in: Computer Architecture, International Symposium on
By Sanjeev Kumar, Daehyun Kim, Mikhail Smelyanskiy, Yen-Kuang Chen, Jatin Chhugani, Christopher J. Hughes, Changkyu Kim, Victor W. Lee, Anthony D. Nguyen
Issue Date:June 2008
pp. 441-452
The current trend is for processors to deliver dramatic improvements in parallel performance while only modestly improving serial performance. Parallel performance is harvested through vector/SIMD instructions as well as multithreading (through both multit...
 
Large-scale energy-efficient graph traversal: A path to efficient data-intensive supercomputing
Found in: 2012 SC - International Conference for High Performance Computing, Networking, Storage and Analysis
By Nadathur Satish, Changkyu Kim, Jatin Chhugani, Pradeep Dubey
Issue Date:November 2012
pp. 1-11
Graph traversal is a widely used algorithm in a variety of fields, including social networks, business analytics, and high-performance computing among others. There has been a push for HPC machines to be rated not just in Petaflops, but also in "GigaT...
 
Billion-particle SIMD-friendly two-point correlation on large-scale HPC cluster systems
Found in: 2012 SC - International Conference for High Performance Computing, Networking, Storage and Analysis
By Jatin Chhugani, Changkyu Kim, Hemant Shukla, Jongsoo Park, Pradeep Dubey, John Shalf, Horst D. Simon
Issue Date:November 2012
pp. 1-11
Two-point Correlation Function (TPCF) is widely used in astronomy to characterize the distribution of matter/energy in the Universe, and help derive the physics that can trace back to the creation of the universe. However, it is prohibitively slow for curr...
 
DySER: Unifying Functionality and Parallelism Specialization for Energy-Efficient Computing
Found in: IEEE Micro
By Venkatraman Govindaraju, Chen-Han Ho, Tony Nowatzki, Jatin Chhugani, Nadathur Satish, Karthikeyan Sankaralingam, Changkyu Kim
Issue Date:September 2012
pp. 38-51
The DySER (Dynamically Specializing Execution Resources) architecture supports both functionality specialization and parallelism specialization. By dynamically specializing frequently executing regions and applying parallelism mechanisms, DySER provides ef...
 
Fast and Efficient Graph Traversal Algorithm for CPUs: Maximizing Single-Node Efficiency
Found in: 2012 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)
By Jatin Chhugani, Nadathur Satish, Changkyu Kim, Jason Sewall, Pradeep Dubey
Issue Date:May 2012
pp. 378-389
Graph-based structures are being increasingly used to model data and relations among data in a number of fields. Graph-based databases are becoming more popular as a means to better represent such data. Graph traversal is a key component in graph algorithm...
 
Efficient shared cache management through sharing-aware replacement and streaming-aware insertion policy
Found in: Parallel and Distributed Processing Symposium, International
By Yu Chen, Wenlong Li, Changkyu Kim, Zhizhong Tang
Issue Date:May 2009
pp. 1-11
Multi-core processors with shared caches are now commonplace. However, prior works on shared cache management primarily focused on multi-programmed workloads. These schemes consider how to partition the cache space given that simultaneously-running applica...
 
Composable Lightweight Processors
Found in: Microarchitecture, IEEE/ACM International Symposium on
By Changkyu Kim, Simha Sethumadhavan, M.S. Govindan, Nitya Ranganathan, Divya Gulati, Doug Burger, Stephen W. Keckler
Issue Date:December 2007
pp. 381-394
Modern chip multiprocessors (CMPs) are designed to exploit both instruction-level parallelism (ILP) within processors and thread-level parallelism (TLP) within and across processors. However, the number of processors and the granularity of each processor...
 
On-Chip Interconnection Networks of the TRIPS Chip
Found in: IEEE Micro
By Paul Gratz, Changkyu Kim, Karthikeyan Sankaralingam, Heather Hanson, Premkishore Shivakumar, Stephen W. Keckler, Doug Burger
Issue Date:September 2007
pp. 41-50
The TRIPS chip prototypes two networks on chip to demonstrate the viability of a routed interconnection fabric for memory and operand traffic. In a 170-million-transistor custom ASIC chip, these NoCs provide system performance within 28 percent of ideal no...
 
A NUCA Substrate for Flexible CMP Cache Sharing
Found in: IEEE Transactions on Parallel and Distributed Systems
By Jaehyuk Huh, Changkyu Kim, Hazim Shafi, Lixin Zhang, Doug Burger, Stephen W. Keckler
Issue Date:August 2007
pp. 1028-1040
We propose an organization for the on-chip memory system of a chip multiprocessor in which 16 processors share a 16-Mbyte pool of 64 level-2 (L2) cache banks. The L2 cache is organized as a nonuniform cache architecture...
 
Distributed Microarchitectural Protocols in the TRIPS Prototype Processor
Found in: Microarchitecture, IEEE/ACM International Symposium on
By Karthikeyan Sankaralingam, Ramadass Nagarajan, Robert McDonald, Rajagopalan Desikan, Saurabh Drolia, M.S. Govindan, Paul Gratz, Divya Gulati, Heather Hanson, Changkyu Kim, Haiming Liu, Nitya Ranganathan, Simha Sethumadhavan, Sadia Sharif, Premkishore Shiva
Issue Date:December 2006
pp. 480-491
Growing on-chip wire delays will cause many future microarchitectures to be distributed, in which hardware resources within a single processor become nodes on one or more switched micronetworks. Since large processor cores will require multiple clock cycle...
 
Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture
Found in: IEEE Micro
By Karthikeyan Sankaralingam, Ramadass Nagarajan, Haiming Liu, Changkyu Kim, Jaehyuk Huh, Doug Burger, Stephen W. Keckler, Charles Moore
Issue Date:November 2003
pp. 46-51
The TRIPS architecture seeks to deliver system-level configurability to applications and runtime systems. It does so by employing the concept of polymorphism, which permits the runtime system to configure the hardware execution resources to match ...
 
Nonuniform Cache Architectures for Wire-Delay Dominated On-Chip Caches
Found in: IEEE Micro
By Changkyu Kim, Doug Burger, Stephen W. Keckler
Issue Date:November 2003
pp. 99-107
Nonuniform cache access designs solve the on-chip wire delay problem for future large integrated caches. By embedding a network in the cache, NUCA designs let data migrate within the cache, clustering the working set nearest the processor.
 
Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture
Found in: Computer Architecture, International Symposium on
By Karthikeyan Sankaralingam, Ramadass Nagarajan, Haiming Liu, Changkyu Kim, Jaehyuk Huh, Doug Burger, Stephen W. Keckler, Charles R. Moore
Issue Date:June 2003
pp. 422
This paper describes the polymorphous TRIPS architecture which can be configured for different granularities and types of parallelism. TRIPS contains mechanisms that enable the processing cores and the on-chip memory system to be configured and combined in...
 
3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs
Found in: SC Conference
By Anthony Nguyen, Nadathur Satish, Jatin Chhugani, Changkyu Kim, Pradeep Dubey
Issue Date:November 2010
pp. 1-13
Stencil computation sweeps over a spatial grid over multiple time steps to perform nearest-neighbor computations. The bandwidth-to-compute requirement for a large class of stencil kernels is very high, and their performance is bound by the available memory...
 
Performance and Energy Implications of Many-Core Caches for Throughput Computing
Found in: IEEE Micro
By C J Hughes, Changkyu Kim, Yen-Kuang Chen
Issue Date:November 2010
pp. 25-35
Processors that target throughput computing often have many cores, which stresses the cache hierarchy. Logically centralized, shared data storage is needed for many-core chips to provide high cache throughput for heavily read-write shared lines. Techniques...
 
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU
Found in: Proceedings of the 37th annual international symposium on Computer architecture (ISCA '10)
By Anthony D. Nguyen, Changkyu Kim, Daehyun Kim, Jatin Chhugani, Michael Deisher, Mikhail Smelyanskiy, Nadathur Satish, Per Hammarlund, Pradeep Dubey, Ronak Singhal, Srinivas Chennupaty, Victor W. Lee
Issue Date:June 2010
pp. 72-ff
Recent advances in computing have led to an explosion in the amount of data being generated. Processing the ever-growing data in a timely manner has made throughput computing an important aspect for emerging applications. Our analysis of a set of important...
     
Locality-aware task management for unstructured parallelism: a quantitative limit study
Found in: Proceedings of the 25th ACM symposium on Parallelism in algorithms and architectures (SPAA '13)
By Changkyu Kim, Christopher J. Hughes, Christos Kozyrakis, Richard M. Yoo, Yen-Kuang Chen
Issue Date:July 2013
pp. 315-325
As we increase the number of cores on a processor die, the on-chip cache hierarchies that support these cores are getting larger, deeper, and more complex. As a result, non-uniform memory access effects are now prevalent even on a single chip. To reduce ex...
     
Can traditional programming bridge the Ninja performance gap for parallel computing applications?
Found in: Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA '12)
By Changkyu Kim, Hideki Saito, Jatin Chhugani, Mikhail Smelyanskiy, Milind Girkar, Nadathur Satish, Pradeep Dubey, Rakesh Krishnaiyer
Issue Date:June 2012
pp. 440-451
Current processor trends of integrating more cores with wider SIMD units, along with a deeper and complex memory hierarchy, have made it increasingly more challenging to extract performance from applications. It is believed by some that traditional approac...
     
Designing fast architecture-sensitive tree search on modern multicore/many-core processors
Found in: ACM Transactions on Database Systems (TODS)
By Anthony D. Nguyen, Changkyu Kim, Eric Sedlar, Jatin Chhugani, Nadathur Satish, Pradeep Dubey, Scott A. Brandt, Tim Kaldewey, Victor W. Lee
Issue Date:December 2011
pp. 1-34
In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous computing power by integrating multiple cores, each with wide vector units. There has been much work to exploit modern processor architectures ...
     
Moguls: a model to explore the memory hierarchy for bandwidth improvements
Found in: Proceeding of the 38th annual international symposium on Computer architecture (ISCA '11)
By Changkyu Kim, Christopher J. Hughes, Cong Xu, Guangyu Sun, Jishen Zhao, Yen-Kuang Chen, Yuan Xie
Issue Date:June 2011
pp. 377-388
In recent years, the increasing number of processor cores and limited increases in main memory bandwidth have led to the problem of the bandwidth wall, where memory bandwidth is becoming a performance bottleneck. This is especially true for emerging latenc...
     
ClearPath: highly parallel collision avoidance for multi-agent simulation
Found in: Proceedings of the 2009 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (SCA '09)
By Changkyu Kim, Dinesh Manocha, Jatin Chhugani, Ming Lin, Nadathur Satish, Pradeep Dubey, Stephen J. Guy
Issue Date:August 2009
pp. 177-187
We present a new local collision avoidance algorithm between multiple agents for real-time simulations. Our approach extends the notion of velocity obstacles from robotics and formulates the conditions for collision free navigation as a quadratic optimizat...
     
Multitasking workload scheduling on flexible-core chip multiprocessors
Found in: Proceedings of the 17th international conference on Parallel architectures and compilation techniques (PACT '08)
By Changkyu Kim, Divya P. Gulati, Doug Burger, Simha Sethumadhavan, Stephen W. Keckler
Issue Date:October 2008
pp. 133-133
While technology trends have ushered in the age of chip multiprocessors (CMP), a fundamental question is what size to make each core. Most current commercial designs are symmetric CMPs (SCMP) in which each core is identical and range from a simple RISC pro...
     
An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches
Found in: Proceedings of the 10th international conference on architectural support for programming languages and operating systems (ASPLOS-X) (ASPLOS '02)
By Changkyu Kim, Doug Burger, Stephen W. Keckler
Issue Date:October 2002
pp. 205-209
Growing wire delays will force substantive changes in the designs of large caches. Traditional cache architectures assume that each level in the cache hierarchy has a single, uniform access time. Increases in on-chip communication delays will make the hit ...
     