Optimizing Web Browser on Many-Core Architectures
Found in: Parallel and Distributed Computing Applications and Technologies, International Conference on
By Lingjun Fan,Weisong Shi,Shibin Tang,Chenggang Yan,Dongrui Fan
Issue Date:October 2011
pp. 173-178
As more and more Web applications emerge on the server end today, the Web browser on the client end has become a host for a variety of applications rather than just rendering static Web pages. This leads to more and more performance requirements for a Web browser, f...
 
Godson-T: An Efficient Many-Core Processor Exploring Thread-Level Parallelism
Found in: IEEE Micro
By Dongrui Fan,Hao Zhang,Da Wang,Xiaochun Ye,Fenglong Song,Guojie Li,Ninghui Sun
Publication Date: April 2012
pp. N/A
Godson-T is a research many-core processor designed for parallel scientific computing. It delivers efficient performance and flexible programmability simultaneously. On the one hand, Godson-T has many features to achieve high efficiency for on-chip resourc...
 
Energy-Performance Modeling and Optimization of Parallel Computing in On-Chip Networks
Found in: 2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom)
By Shuai Zhang, Zhiyong Liu, Dongrui Fan, Fenglong Song, Mingzhe Zhang
Issue Date:July 2013
pp. 879-886
This paper discusses the energy-performance trade-off of networks-on-chip with real parallel applications. First, we propose an accurate energy-performance analytical model that conducts and analyzes the impacts of both frequency-independent and frequency-depend...
 
Auto-Tuning GEMV on Many-Core GPU
Found in: 2012 IEEE 18th International Conference on Parallel and Distributed Systems (ICPADS)
By Weizhi Xu,Zhiyong Liu,Jun Wu,Xiaochun Ye,Shuai Jiao,Da Wang,Fenglong Song,Dongrui Fan
Issue Date:December 2012
pp. 30-36
GPUs provide powerful computing ability especially for data parallel algorithms. However, the complexity of the GPU system makes the optimization of even a simple algorithm difficult. Different parallel algorithms or optimization methods on a GPU often lea...
 
Self-Correction Trace Model: A Full-System Simulator for Optical Network-on-Chip
Found in: 2012 26th IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
By Mingzhe Zhang,Liqiang He,Dongrui Fan
Issue Date:May 2012
pp. 242-247
The improvement of the emerging technology involves the nanophotonic into the on-chip interconnection, which provides a large communication capability for the future large-scale CMP processor. As an important way to the architecture research, full-system ...
 
Godson-T: An Efficient Many-Core Processor Exploring Thread-Level Parallelism
Found in: IEEE Micro
By Dongrui Fan,Hao Zhang,Da Wang,Xiaochun Ye,Fenglong Song,Guojie Li,Ninghui Sun
Issue Date:March 2012
pp. 38-47
Godson-T is a research many-core processor designed for parallel scientific computing that delivers efficient performance and flexible programmability simultaneously. It also has many features to achieve high efficiency for on-chip resource utilization, su...
 
Extendable pattern-oriented optimization directives
Found in: Code Generation and Optimization, IEEE/ACM International Symposium on
By Huimin Cui, Jingling Xue, Lei Wang, Yang Yang, Xiaobing Feng, Dongrui Fan
Issue Date:April 2011
pp. 107-118
Current programming models and compiler technologies for multi-core processors do not exploit well the performance benefits obtainable by applying algorithm-specific, i.e., semantic-specific optimizations to a particular application. In this work, we propo...
 
Minimal Multi-threading: Finding and Removing Redundant Instructions in Multi-threaded Processors
Found in: Microarchitecture, IEEE/ACM International Symposium on
By Guoping Long, Diana Franklin, Susmit Biswas, Pablo Ortiz, Jason Oberg, Dongrui Fan, Frederic T. Chong
Issue Date:December 2010
pp. 337-348
Parallelism is the key to continued performance scaling in modern microprocessors. Yet we observe that this parallelism can often contain a surprising amount of instruction redundancy. We propose to exploit this redundancy to improve performance and decrea...
 
P-GAS: Parallelizing a Cycle-Accurate Event-Driven Many-Core Processor Simulator Using Parallel Discrete Event Simulation
Found in: Parallel and Distributed Simulation, Workshop on
By Huiwei Lv, Yuan Cheng, Lu Bai, Mingyu Chen, Dongrui Fan, Ninghui Sun
Issue Date:May 2010
pp. 1-8
Multi-core processors are commonly available now, but most traditional computer architectural simulators still use single-thread execution. In this paper we use parallel discrete event simulation (PDES) to speed up a cycle-accurate event-driven many-core pr...
 
A Fast Linear-Space Sequence Alignment Algorithm with Dynamic Parallelization Framework
Found in: Computer and Information Technology, International Conference on
By Xiaochun Ye, Dongrui Fan, Wei Lin
Issue Date:October 2009
pp. 274-279
Exact pairwise sequence alignment algorithms using dynamic programming require quadratic space and time, and this makes these algorithms impractical for large-scale sequences. In this paper, we propose and evaluate a new Anti-Diagonal based Parallel Linear...
 
Software and Hardware Cooperate for 1-D FFT Algorithm Optimization on Multicore Processors
Found in: Computer and Information Technology, International Conference on
By Yongbin Zhou, Junchao Zhang, Dongrui Fan
Issue Date:October 2009
pp. 86-91
Multicore architecture is becoming a promising way to sustain Moore’s Law and is bringing a revolution in both research and industry, which results in a new design space for software and architecture. Fast Fourier Transform (FFT), computing intensive and bandwidth intensive, ...
 
Design of New Hash Mapping Functions
Found in: Computer and Information Technology, International Conference on
By Fenglong Song, Zhiyong Liu, Dongrui Fan, Junchao Zhang, Lei Yu, Nan Yuan, Wei Lin
Issue Date:October 2009
pp. 45-50
Conflicts can severely decrease computer performance: bank conflicts reduce the bandwidth of interleaved multibank memory systems, and conflict misses reduce effective on-chip capacity, which in turn incurs further conflict misses. Conflicts can be avoid...
 
A Low-Complexity Synchronization Based Cache Coherence Solution for Many Cores
Found in: Computer and Information Technology, International Conference on
By Wei Lin, Dongrui Fan, He Huang, Nan Yuan, Xiaochun Ye
Issue Date:October 2009
pp. 69-75
Computer architectures make a dramatic turn away from improving single-processor performance towards improved parallel performance through integrating many cores in one chip. However, providing directory based coherence protocols for these platforms is too...
 
GFFC: The Global Feedback Based Flow Control in the NoC Design for Many-core Processor
Found in: Network and Parallel Computing Workshops, IFIP International Conference on
By Xu Wang, Ge Gan, Dongrui Fan, Shuxu Guo
Issue Date:October 2009
pp. 227-232
GFFC (Global Feedback based Flow Control) is proposed to be used in NoC design for many-core processor. GFFC is designed based on two fundamental principles: (a) when network congestion occurs, the packet sender that causes the congestion needs to know thi...
 
Data Management: The Spirit to Pursuit Peak Performance on Many-Core Processor
Found in: Parallel and Distributed Processing with Applications, International Symposium on
By Yongbin Zhou, Junchao Zhang, Shuai Zhang, Nan Yuan, Dongrui Fan
Issue Date:August 2009
pp. 559-564
To date, most many-core prototypes employ tiled topologies connected through on-chip networks. The throughput and latency of the on-chip networks usually become the bottleneck to achieving peak performance, especially for communication intensive applica...
 
A Synchronization-Based Alternative to Directory Protocol
Found in: Parallel and Distributed Processing with Applications, International Symposium on
By He Huang, Lei Liu, Nan Yuan, Wei Lin, Fenglong Song, Junchao Zhang, Dongrui Fan
Issue Date:August 2009
pp. 175-181
The efficient support of cache coherence is extremely important to design and implement many-core processors. In this paper, we propose a synchronization-based coherence (SBC) protocol to efficiently support cache coherence for shared memory many-core arch...
 
Evaluation Method of Synchronization for Shared-Memory On-Chip Many-Core Processor
Found in: Parallel and Distributed Processing with Applications, International Symposium on
By Fenglong Song, Zhiyong Liu, Dongrui Fan, He Huang, Nan Yuan, Lei Yu, Junchao Zhang
Issue Date:August 2009
pp. 571-576
On-chip many-core architecture is an emerging and promising computation platform. High speed on-chip communication and abundant chipped resources are two outstanding advantages of this architecture, which provide an opportunity to implement efficient synch...
 
Study on Fine-Grained Synchronization in Many-Core Architecture
Found in: Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing, ACIS International Conference on
By Lei Yu, Zhiyong Liu, Dongrui Fan, Fenglong Song, Junchao Zhang, Nan Yuan
Issue Date:May 2009
pp. 524-529
Synchronization between threads has a serious impact on the performance of many-core architectures. When communication is frequent, coarse-grained synchronization brings significant overhead. Thus, coarse-grained synchronization is not suitable for this s...
 
A Quantitative Study of the On-Chip Network and Memory Hierarchy Design for Many-Core Processor
Found in: Parallel and Distributed Systems, International Conference on
By Xu Wang, Ge Gan, Joseph Manzano, Dongrui Fan, Shuxu Guo
Issue Date:December 2008
pp. 689-696
In this paper, we will study the on-chip network and memory hierarchy design of the Godson-T - a homogeneous many-core processor. Godson-T has 64 cores (with private L1 cache), and 16 global L2 cache banks. All these on-chip units are connected by a 2D $8\...
 
Location Consistency Model Revisited: Problem, Solution and Prospects
Found in: Parallel and Distributed Computing Applications and Technologies, International Conference on
By Guoping Long, Nan Yuan, Dongrui Fan
Issue Date:December 2008
pp. 91-98
Location Consistency (LC) is a weak memory consistency model which is defined entirely on partial order execution semantics of parallel programs. Compared with sequential consistency (SC), LC is scalable and provides ample theoretical parallelism. This mak...
 
Efficient Parallelization of a Protein Sequence Comparison Algorithm on Manycore Architecture
Found in: Parallel and Distributed Computing Applications and Technologies, International Conference on
By Xiaochun Ye, Van Hoa Nguyen, Dominique Lavenier, Dongrui Fan
Issue Date:December 2008
pp. 167-170
This paper introduces the Godson-T manycore architecture and demonstrates the efficiency of its synchronization mechanism through a computation intensive bioinformatics application: the comparison of protein banks. The parallel part of the protein sequence...
 
Simplified Multi-Ported Cache in High Performance Processor
Found in: Networking, Architecture, and Storage, International Conference on
By Hao Zhang, Dongrui Fan
Issue Date:July 2007
pp. 9-14
The memory bandwidth demands of modern microprocessors require the use of a multi-ported cache to achieve peak performance. However, multi-ported caches are costly to implement. In this paper we propose a technique for using a simplified dual-ported cache in...
 
Design and Implementation of Floating Point Stack on General RISC Architecture
Found in: Parallel, Distributed, and Network-Based Processing, Euromicro Conference on
By Xuehai Qian, He Huang, Hao Zhang, Guoping Long, Junchao Zhang, Dongrui Fan
Issue Date:February 2007
pp. 238-245
This paper presents a framework for implementing the X86 FP stack used in an x86-compliant processor based on a general RISC architecture. Architectural supports are added to a typical RISC architecture to maintain the FP stack status. Some speculative tec...
 
A Path-Adaptive Opto-electronic Hybrid NoC for Chip Multi-processor
Found in: 2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom)
By Mingzhe Zhang, Da Wang, Xiaochun Ye, Liqiang He, Dongrui Fan, Zhiyong Liu
Issue Date:July 2013
pp. 1198-1205
The continuous development of manufacturing allows optical components to be integrated in a chip, which provides a feasible solution for the communication between the cores in manycore processors. Considering the limitation of manufacture technology and the cha...
 
Low power cache architectures with hybrid approach of filtering unnecessary way accesses
Found in: Proceedings of the 2013 International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM '13)
By Dongrui Fan, Lingjun Fan, Shinan Wang, Weisong Shi, Yasong Zheng
Issue Date:February 2013
pp. 93-99
Power has been a big issue in processor design for several years. As caches account for more and more CPU die area and power, this paper presents using filtering unnecessary way accesses to reduce dynamic power consumption of unified L2 cache shared by ins...
     
Extendable pattern-oriented optimization directives
Found in: ACM Transactions on Architecture and Code Optimization (TACO)
By Dongrui Fan, Huimin Cui, Jingling Xue, Lei Wang, Xiaobing Feng, Yang Yang
Issue Date:September 2012
pp. 1-37
Algorithm-specific, that is, semantic-specific optimizations have been observed to bring significant performance gains, especially for a diverse set of multi/many-core architectures. However, current programming models and compiler technologies for the sta...
     
Performance analysis and optimization of molecular dynamics simulation on Godson-T many-core processor
Found in: Proceedings of the 8th ACM International Conference on Computing Frontiers (CF '11)
By Aiichiro Nakano, Dongrui Fan, Fenglong Song, Guangming Tan, Hao Zhang, Liu Peng, Priya Vashishta, Rajiv K. Kalia
Issue Date:May 2011
pp. 1-10
Molecular dynamics (MD) simulation has broad applications, but its irregular memory-access pattern makes performance optimization a challenge. This paper presents a joint application/architecture study to enhance on-chip parallelism of MD on Godson-T-like...
     
Architectural support for cilk computations on many-core architectures
Found in: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming (PPoPP '09)
By Dongrui Fan, Guoping Long, Junchao Zhang
Issue Date:February 2009
pp. 283-284
The X10 programming language is organized around the notion of places (an encapsulation of data and activities operating on the data), partitioned global address space (PGAS), and asynchronous computation and communication. This paper introduces an express...
     
Experience on optimizing irregular computation for memory hierarchy in manycore architecture
Found in: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming (PPoPP '08)
By Andrew Russo, Dongrui Fan, Guang R. Gao, Guangming Tan, Junchao Zhang
Issue Date:February 2008
pp. 26-33
As modern supercomputing systems reach peta-flop performance they grow in both size and complexity, becoming increasingly vulnerable to failures. Checkpointing is a popular technique for tolerating such failures. Although a variety of automated system-leve...
     