Issue No.12 - Dec. (2012 vol.61)
pp: 1711-1723
Ankit Sethia, University of Michigan, Ann Arbor
Ganesh Dasika, ARM, Austin
Trevor Mudge, University of Michigan, Ann Arbor
Scott Mahlke, University of Michigan, Ann Arbor
Rapid advances in the computational capabilities of the graphics processing unit (GPU), together with the deployment of general programming models for these devices, have made the vision of a desktop supercomputer a reality. It is now possible to assemble a system that provides several TFLOPS of performance on scientific applications for the cost of a high-end laptop computer. While these devices have clearly changed the landscape of computing, two central problems arise. First, GPUs are designed and optimized for graphics applications, so their delivered performance is far below peak for more general scientific and mathematical applications. Second, GPUs are power-hungry devices that often consume 100-300 watts, which restricts the scalability of the solution and requires expensive cooling. To combat these challenges, this paper presents the PEPSC architecture, which is customized for the domain of data-parallel, dense-matrix-style scientific applications where power efficiency is the central focus. PEPSC combines a 2D single-instruction multiple-data (SIMD) datapath, an intelligent dynamic prefetching mechanism, and a configurable SIMD control approach to increase execution efficiency over conventional GPUs. A single PEPSC core has a peak performance of 120 GFLOPS while consuming 2 W of power when executing modern scientific applications, which represents an increase in computation efficiency of more than 10X over existing GPUs.
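The efficiency figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function and variable names are hypothetical, it assumes that "computation efficiency" means peak GFLOPS divided by watts, and it treats "more than 10X" as a lower bound on the gain. The abstract gives no per-GPU efficiency number, so the GPU value here is only the ceiling implied by that bound.

```python
# Figures quoted in the abstract for a single PEPSC core.
PEAK_GFLOPS = 120.0
POWER_WATTS = 2.0
CLAIMED_GAIN = 10.0  # "more than 10X", treated here as a lower bound


def gflops_per_watt(gflops: float, watts: float) -> float:
    """Computation efficiency, assumed to be throughput divided by power."""
    if watts <= 0:
        raise ValueError("power must be positive")
    return gflops / watts


pepsc_efficiency = gflops_per_watt(PEAK_GFLOPS, POWER_WATTS)

# If PEPSC is more than 10X as efficient, the GPU baseline must sit
# below this value (an implied ceiling, not a measured number).
implied_gpu_ceiling = pepsc_efficiency / CLAIMED_GAIN

print(f"PEPSC: {pepsc_efficiency:.0f} GFLOPS/W")
print(f"Implied GPU baseline: below {implied_gpu_ceiling:.0f} GFLOPS/W")
```

Under these assumptions, the quoted core works out to 60 GFLOPS/W, which implies a GPU baseline below 6 GFLOPS/W.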
Graphics processing unit, Benchmark testing, Computer architecture, Energy efficiency, Energy management, Low power electronics, Parallel processing, scientific computing, Low-power design, hardware, SIMD processors, processor architectures, parallel processors, Graphics Processing Unit (GPU), throughput computing
Ankit Sethia, Ganesh Dasika, Trevor Mudge, Scott Mahlke, "A Customized Processor for Energy Efficient Scientific Computing", IEEE Transactions on Computers, vol.61, no. 12, pp. 1711-1723, Dec. 2012, doi:10.1109/TC.2012.144