A Versatile Performance and Energy Simulation Tool for Composite GPU Global Memory
Found in: 2013 IEEE 21st International Symposium on Modelling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS)
By Bin Wang,Yizheng Jiao,Weikuan Yu,Xipeng Shen,Dong Li,Jeffrey S. Vetter
Issue Date:August 2013
pp. 298-302
As a cost-effective compute device, Graphic Processing Unit (GPU) has been widely embraced in the field of high performance computing. GPU is characterized by its massive thread-level parallelism and high memory bandwidth. Although GPU has exhibited tremen...
 
Profmig: A framework for flexible migration of program profiles across software versions
Found in: 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)
By Mingzhou Zhou,Bo Wu,Yufei Ding,Xipeng Shen
Issue Date:February 2013
pp. 1-12
Offline program profiling is costly, especially when software update is frequent. In this paper, we initiate a systematic exploration in cross-version program profile migration, which tries to effectively reuse the valid part of the behavior profiles of an...
 
Enhancing Data Locality for Dynamic Simulations through Asynchronous Data Transformations and Adaptive Control
Found in: Parallel Architectures and Compilation Techniques, International Conference on
By Bo Wu,Eddy Z. Zhang,Xipeng Shen
Issue Date:October 2011
pp. 243-252
Many dynamic simulation programs contain complex, irregular memory reference patterns, and require runtime optimizations to enhance data locality. Current approaches periodically stop the execution of an application to reorder the computation or data based...
 
Correctly Treating Synchronizations in Compiling Fine-Grained SPMD-Threaded Programs for CPU
Found in: Parallel Architectures and Compilation Techniques, International Conference on
By Ziyu Guo,Eddy Zheng Zhang,Xipeng Shen
Issue Date:October 2011
pp. 310-319
Automatic compilation for multiple types of devices is important, especially given the current trends towards heterogeneous computing. This paper concentrates on some issues in compiling fine-grained SPMD-threaded code (e.g., GPU CUDA code) for multicore C...
 
The Significance of CMP Cache Sharing on Contemporary Multithreaded Applications
Found in: IEEE Transactions on Parallel and Distributed Systems
By Eddy Zheng Zhang,Yunlian Jiang,Xipeng Shen
Issue Date:February 2012
pp. 367-374
Cache sharing on modern Chip Multiprocessors (CMPs) reduces communication latency among corunning threads, and also causes interthread cache contention. Most previous studies on the influence of cache sharing have concentrated on the design or management o...
 
The Complexity of Optimal Job Co-Scheduling on Chip Multiprocessors and Heuristics-Based Solutions
Found in: IEEE Transactions on Parallel and Distributed Systems
By Yunlian Jiang, Kai Tian, Xipeng Shen, Jinghe Zhang, Jie Chen, Rahul Tripathi
Issue Date:July 2011
pp. 1192-1205
In Chip Multiprocessors (CMPs) architecture, it is common that multiple cores share some on-chip cache. The sharing may cause cache thrashing and contention among co-running jobs. Job co-scheduling is an approach to tackling the problem by assigning jobs t...
 
Speculation with Little Wasting: Saving Cost in Software Speculation through Transparent Learning
Found in: Parallel and Distributed Systems, International Conference on
By Yunlian Jiang, Feng Mao, Xipeng Shen
Issue Date:December 2009
pp. 543-550
Software speculation has shown promise in parallelizing programs with coarse-grained dynamic parallelism. However, most speculation systems use offline profiling for the selection of speculative regions. The mismatch with the input-sensitivity of dynamic p...
 
A cross-input adaptive framework for GPU program optimizations
Found in: Parallel and Distributed Processing Symposium, International
By Yixun Liu,Eddy Z. Zhang,Xipeng Shen
Issue Date:May 2009
pp. 1-10
Recent years have seen a trend in using graphic processing units (GPU) as accelerators for general-purpose computing. The inexpensive, single-chip, massively parallel architecture of GPU has evidentially brought factors of speedup to many numerical applica...
 
Cross-Input Learning and Discriminative Prediction in Evolvable Virtual Machines
Found in: Code Generation and Optimization, IEEE/ACM International Symposium on
By Feng Mao, Xipeng Shen
Issue Date:March 2009
pp. 92-101
Modern languages like Java and C# rely on dynamic optimizations in virtual machines for better performance. Current dynamic optimizations are reactive. Their performance is constrained by the dependence on runtime sampling and the partial knowledge of the ...
 
Adaptive Software Speculation for Enhancing the Cost-Efficiency of Behavior-Oriented Parallelization
Found in: Parallel Processing, International Conference on
By Yunlian Jiang, Xipeng Shen
Issue Date:September 2008
pp. 270-278
Recently, software speculation has shown promising results in parallelizing complex sequential programs by exploiting dynamic high-level parallelism. The speculation however is cost-inefficient. Failed speculations may cause unnecessary shared resource con...
 
Adaptive speculation in behavior-oriented parallelization
Found in: Parallel and Distributed Processing Symposium, International
By Yunlian Jiang, Xipeng Shen
Issue Date:April 2008
pp. 1-5
Behavior-oriented parallelization is a technique for parallelizing complex sequential programs that have dynamic parallelism. Although the technique shows promising results, the software speculation mechanism it uses is not cost-efficient. Failed speculati...
 
Bridging Inputs and Program Dynamic Behavior
Found in: Parallel Architectures and Compilation Techniques, International Conference on
By Xipeng Shen, Feng Mao
Issue Date:September 2007
pp. 426
Program optimizations have evolved from static to dynamic. However, runtime optimization often suffers from not knowing global behavior of a program's execution, and not affording sophisticated program analysis. On the other hand, offline profiling techniq...
   
Miss Rate Prediction Across Program Inputs and Cache Configurations
Found in: IEEE Transactions on Computers
By Yutao Zhong, Steven G. Dropsho, Xipeng Shen, Ahren Studer, Chen Ding
Issue Date:March 2007
pp. 328-343
Improving cache performance requires understanding cache behavior. However, measuring cache performance for one or two data input sets provides little insight into how cache behavior varies across all data input sets and all cache configurations. This pape...
 
A Key-based Adaptive Transactional Memory Executor
Found in: Parallel and Distributed Processing Symposium, International
By Tongxin Bai, Xipeng Shen, Chengliang Zhang, William N. Scherer III, Chen Ding, Michael L. Scott
Issue Date:March 2007
pp. 308
Software transactional memory systems enable a programmer to easily write concurrent data structures such as lists, trees, hashtables, and graphs, where non-conflicting operations proceed in parallel. Many of these structures take the abstract form of a di...
 
Adaptive Data Partition for Sorting Using Probability Distribution
Found in: Parallel Processing, International Conference on
By Xipeng Shen, Chen Ding
Issue Date:August 2004
pp. 250-257
Many computing problems benefit from dynamic partition of data into smaller chunks with better parallelism and locality. However, it is difficult to partition all types of inputs with the same high efficiency. This paper presents a new partition method in ...
 
Exploring hybrid memory for GPU energy efficiency through software-hardware co-design
Found in: 2013 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT)
By Bin Wang,Bo Wu,Dong Li,Xipeng Shen,Weikuan Yu,Yizheng Jiao,Jeffrey S. Vetter
Issue Date:September 2013
pp. 93-102
Hybrid memory designs, such as DRAM plus Phase Change Memory (PCM), have shown some promise for alleviating power and density issues faced by traditional memory systems. But previous studies have concentrated on CPU systems with a modest level of paralleli...
   
Challenging the "embarrassingly sequential": parallelizing finite state machine-based computations through principled speculation
Found in: Proceedings of the 19th international conference on Architectural support for programming languages and operating systems (ASPLOS '14)
By Bo Wu, Xipeng Shen, Zhijia Zhao
Issue Date:March 2014
pp. 543-558
Finite-State Machine (FSM) applications are important for many domains. But FSM computation is inherently sequential, making such applications notoriously difficult to parallelize. Most prior methods address the problem through speculations on simple heuri...
     
Finding the limit: examining the potential and complexity of compilation scheduling for JIT-based runtime systems
Found in: Proceedings of the 19th international conference on Architectural support for programming languages and operating systems (ASPLOS '14)
By Mingzhou Zhou, Sarah Eisenstat, Xipeng Shen, Yufei Ding, Zhijia Zhao
Issue Date:March 2014
pp. 607-622
This work aims to find out the full potential of compilation scheduling for JIT-based runtime systems. Compilation scheduling determines the order in which the compilation units (e.g., functions) in a program are to be compiled or recompiled. It decides wh...
     
HPar: A practical parallel parser for HTML--taming HTML complexities for parallel parsing
Found in: ACM Transactions on Architecture and Code Optimization (TACO)
By Dave Herman, Jianhua Sun, Michael Bebenita, Xipeng Shen, Zhijia Zhao
Issue Date:December 2013
pp. 1-25
Parallelizing HTML parsing is challenging due to the complexities of HTML documents and the inherent dependencies in its parsing algorithm. As a result, despite numerous studies in parallel parsing, HTML parsing remains sequential today. It forms one of th...
     
Do computer programs have to be as dumb as they are?: input-centric dynamic program optimizations
Found in: Proceedings of the 7th ACM workshop on Virtual machines and intermediate languages (VMIL '13)
By Xipeng Shen
Issue Date:October 2013
pp. 41-42
Looking around this world, we see that a fledgling can fly faster and faster, a pupil can calculate quicker and quicker, and a graduate student can write papers better and better. But since the birth of computers, it has been the case that after the releas...
     
Software-level scheduling to exploit non-uniformly shared data cache on GPGPU
Found in: Proceedings of the ACM SIGPLAN Workshop on Memory Systems Performance and Correctness (MSPC '13)
By Bo Wu, Weilin Wang, Xipeng Shen
Issue Date:June 2013
pp. 1-2
Data cache is introduced to GPUs to mitigate the irregular memory access problem. But few studies have investigated how to exploit its full potential. In this work, we consider some important GPU applications that feature data sharing across thread blocks....
     
Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on GPU
Found in: Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming (PPoPP '13)
By Bo Wu, Eddy Zheng Zhang, Xipeng Shen, Yunlian Jiang, Zhijia Zhao
Issue Date:February 2013
pp. 57-68
The performance of Graphic Processing Units (GPU) is sensitive to irregular memory references. Some recent work shows the promise of data reorganization for eliminating non-coalesced memory accesses that are caused by irregular references. However, all pre...
     
Exploiting inter-sequence correlations for program behavior prediction
Found in: Proceedings of the ACM international conference on Object oriented programming systems languages and applications (OOPSLA '12)
By Bo Wu, Raul Silvera, Xipeng Shen, Yaoqing Gao, Yunlian Jiang, Zhijia Zhao
Issue Date:October 2012
pp. 851-866
Prediction of program dynamic behaviors is fundamental to program optimizations, resource management, and architecture reconfigurations. Most existing predictors are based on locality of program behaviors, subject to some inherent limitations. In this pape...
     
Speculative parallelization needs rigor: probabilistic analysis for optimal speculation of finite-state machine applications
Found in: Proceedings of the 21st international conference on Parallel architectures and compilation techniques (PACT '12)
By Bo Wu, Xipeng Shen, Zhijia Zhao
Issue Date:September 2012
pp. 433-434
Software speculative parallelization has shown effectiveness in parallelizing certain applications. Prior techniques have mainly relied on simple exploitation of heuristics for speculation. In this work, we introduce probabilistic analysis into the design ...
     
One stone two birds: synchronization relaxation and redundancy removal in GPU-CPU translation
Found in: Proceedings of the 26th ACM international conference on Supercomputing (ICS '12)
By Bo Wu, Xipeng Shen, Ziyu Guo
Issue Date:June 2012
pp. 25-36
As an approach to promoting whole-system synergy on a heterogeneous computing system, compilation of fine-grained SPMD-threaded code (e.g., GPU CUDA code) for multicore CPU has drawn some recent attention. This paper concentrates on two important sources o...
     
An input-centric paradigm for program dynamic optimizations
Found in: Proceedings of the ACM international conference on Object oriented programming systems languages and applications (OOPSLA '10)
By Eddy Z. Zhang, Kai Tian, Xipeng Shen, Yunlian Jiang
Issue Date:October 2010
pp. 125-139
Accurately predicting program behaviors (e.g., locality, dependency, method calling frequency) is fundamental for program optimizations and runtime adaptations. Despite decades of remarkable progress, prior studies have not systematically exploited program...
     
Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs?
Found in: Proceedings of the 15th ACM SIGPLAN symposium on Principles and practice of parallel computing (PPoPP '10)
By Eddy Z. Zhang, Xipeng Shen, Yunlian Jiang
Issue Date:January 2010
pp. 203-212
Most modern Chip Multiprocessors (CMP) feature shared cache on chip. For multithreaded applications, the sharing reduces communication latency among co-running threads, but also results in cache contention. A number of studies have examined the influence o...
     
Program locality analysis using reuse distance
Found in: ACM Transactions on Programming Languages and Systems (TOPLAS)
By Chen Ding, Xipeng Shen, Yutao Zhong
Issue Date:August 2009
pp. 1-39
On modern computer systems, the memory performance of an application depends on its locality. For a single execution, locality-correlated measures like average miss rate or working-set size have long been analyzed using reuse distance: the number of d...
     
A study on optimally co-scheduling jobs of different lengths on chip multiprocessors
Found in: Proceedings of the 6th ACM conference on Computing frontiers (CF '09)
By Kai Tian, Xipeng Shen, Yunlian Jiang
Issue Date:May 2009
pp. 227-227
Cache sharing in Chip Multiprocessors brings cache contention among corunning processes, which often causes considerable degradation of program performance and system fairness. Recent studies have seen the effectiveness of job co-scheduling in alleviating ...
     
Influence of program inputs on the selection of garbage collectors
Found in: Proceedings of the 2009 ACM SIGPLAN/SIGOPS international conference on Virtual execution environments (VEE '09)
By Eddy Z. Zhang, Feng Mao, Xipeng Shen
Issue Date:March 2009
pp. 1-24
Many studies have shown that the best performer among a set of garbage collectors tends to be different for different applications. Researchers have proposed application-specific selection of garbage collectors. In this work, we concentrate on a second dim...
     
Analysis and approximation of optimal co-scheduling on chip multiprocessors
Found in: Proceedings of the 17th international conference on Parallel architectures and compilation techniques (PACT '08)
By Jie Chen, Rahul Tripathi, Xipeng Shen, Yunlian Jiang
Issue Date:October 2008
pp. 133-133
Cache sharing among processors is important for Chip Multiprocessors to reduce inter-thread latency, but also brings cache contention, degrading program performance considerably. Recent studies have shown that job co-scheduling can effectively alleviate th...
     
Analysis of input-dependent program behavior using active profiling
Found in: Proceedings of the 2007 workshop on Experimental computer science (ExpCS '07)
By Chen Ding, Chengliang Zhang, Michael L. Scott, Mitsunori Ogihara, Sandhya Dwarkadas, Xipeng Shen
Issue Date:June 2007
pp. 5-es
Utility programs, which perform similar and largely independent operations on a sequence of inputs, include such common applications as compilers, interpreters, and document parsers; databases; and compression and encoding tools. The repetitive behavior of...
     
Software behavior oriented parallelization
Found in: Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation (PLDI '07)
By Chen Ding, Chengliang Zhang, Chris Tice, Kirk Kelsey, Ruke Huang, Xipeng Shen
Issue Date:June 2007
pp. 223-234
Many sequential applications are difficult to parallelize because of unpredictable control flow, indirect data access, and input-dependent parallelism. These difficulties led us to build a software system for behavior oriented parallelization (BOP), which ...
     
Locality approximation using time
Found in: Proceedings of the 34th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages (POPL '07)
By Brian Meeker, Chen Ding, Jonathan Shaw, Xipeng Shen
Issue Date:January 2007
pp. 949-984
Reuse distance (i.e. LRU stack distance) precisely characterizes program locality and has been a basic tool for memory system research since the 1970s. However, the high cost of measuring has restricted its practical uses in performance debugging, locality...
     
Program-level adaptive memory management
Found in: Proceedings of the 2006 international symposium on Memory management (ISMM '06)
By Chen Ding, Chengliang Zhang, Kirk Kelsey, Matthew Hertz, Mitsunori Ogihara, Xipeng Shen
Issue Date:June 2006
pp. 174-183
Most applications' performance is impacted by the amount of available memory. In a traditional application, which has a fixed working set size, increasing memory has a beneficial effect up until the application's working set is met. In the presence of garb...
     
Gated memory control for memory monitoring, leak detection and garbage collection
Found in: Proceedings of the 2005 workshop on Memory system performance (MSP '05)
By Chen Ding, Chengliang Zhang, Mitsunori Ogihara, Xipeng Shen
Issue Date:June 2005
pp. 62-67
In the past, program monitoring often operates at the code level, performing checks at function and loop boundaries. Recent research shows that profiling analysis can identify high-level phases in complex binary code. Examples are time steps in scientific ...
     
Locality phase prediction
Found in: Proceedings of the 11th international conference on Architectural support for programming languages and operating systems (ASPLOS-XI)
By Chen Ding, Xipeng Shen, Yutao Zhong
Issue Date:October 2004
pp. 97-105
As computer memory hierarchy becomes adaptive, its performance increasingly depends on forecasting the dynamic program locality. This paper presents a method that predicts the locality phases of a program by a combination of locality profiling and run-time...
     