Comparing Hardware Accelerators in Scientific Applications: A Case Study
January 2011 (vol. 22 no. 1)
pp. 58-68
Rick Weber, University of Tennessee, Knoxville
Akila Gothandaraman, University of Pittsburgh, Pittsburgh
Robert J. Hinde, University of Tennessee, Knoxville
Gregory D. Peterson, University of Tennessee, Knoxville
Multicore processors and a variety of accelerators have allowed scientific applications to scale to larger problem sizes. We present a performance, design methodology, platform, and architectural comparison of several application accelerators executing a Quantum Monte Carlo application. We compare the application's performance and programmability on a variety of platforms including CUDA with Nvidia GPUs, Brook+ with ATI graphics accelerators, OpenCL running on both multicore and graphics processors, C++ running on multicore processors, and a VHDL implementation running on a Xilinx FPGA. We show that OpenCL provides application portability between multicore processors and GPUs, but may incur a performance cost. Furthermore, we illustrate that graphics accelerators can make simulations involving large numbers of particles feasible.

Index Terms:
Accelerator, OpenCL, FPGA, GPU, multicore, CUDA, computational science.
Citation:
Rick Weber, Akila Gothandaraman, Robert J. Hinde, Gregory D. Peterson, "Comparing Hardware Accelerators in Scientific Applications: A Case Study," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 1, pp. 58-68, Jan. 2011, doi:10.1109/TPDS.2010.125