Comparing Hardware Accelerators in Scientific Applications: A Case Study
January 2011 (vol. 22 no. 1)
pp. 58-68
Rick Weber, University of Tennessee, Knoxville
Akila Gothandaraman, University of Pittsburgh, Pittsburgh
Robert J. Hinde, University of Tennessee, Knoxville
Gregory D. Peterson, University of Tennessee, Knoxville
Multicore processors and a variety of accelerators have allowed scientific applications to scale to larger problem sizes. We present a performance, design methodology, platform, and architectural comparison of several application accelerators executing a Quantum Monte Carlo application. We compare the application's performance and programmability on a variety of platforms including CUDA with Nvidia GPUs, Brook+ with ATI graphics accelerators, OpenCL running on both multicore and graphics processors, C++ running on multicore processors, and a VHDL implementation running on a Xilinx FPGA. We show that OpenCL provides application portability between multicore processors and GPUs, but may incur a performance cost. Furthermore, we illustrate that graphics accelerators can make simulations involving large numbers of particles feasible.

Index Terms:
Accelerator, OpenCL, FPGA, GPU, multicore, CUDA, computational science.
Rick Weber, Akila Gothandaraman, Robert J. Hinde, Gregory D. Peterson, "Comparing Hardware Accelerators in Scientific Applications: A Case Study," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 1, pp. 58-68, Jan. 2011, doi:10.1109/TPDS.2010.125