Issue No.03 - March (2014 vol.25)
pp: 695-705
Kentaro Sano, Tohoku University, Sendai
Yoshiaki Hatsuda, Kobo Co. Ltd, Kawaguchi
Satoru Yamamoto, Tohoku University, Sendai
ABSTRACT
Stencil computation is one of the important kernels in scientific computing. However, its sustained performance is limited by memory bandwidth, especially on multicore microprocessors and graphics processing units (GPUs), because stencil kernels have low operational intensity. In this paper, we present a custom computing machine (CCM), called a scalable streaming-array (SSA), for high-performance stencil computation with multiple field-programmable gate arrays (FPGAs). We design SSA around a domain-specific programmable concept, in which a CCM is programmable with only the minimum functionality required for an algorithm domain. We employ deep pipelining over successive iterations to achieve linear scalability across multiple devices with a constant memory bandwidth. A prototype implementation using nine FPGAs shows good agreement with a performance model and achieves 260 and 236 GFlop/s for 2D and 3D Jacobi computation, respectively, which are 87.4 and 83.9 percent of the peak, with a memory bandwidth of only 2.0 GB/s. We also evaluate the performance of SSA for state-of-the-art FPGAs.
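The idea of pipelining over successive iterations, and the figures quoted above, can be illustrated with a minimal Python sketch. It is not the authors' hardware design: the four-neighbor averaging stencil, the fixed boundary cells, and every function name below are assumptions made for illustration. The sketch chains several Jacobi sweeps so that the grid is read from memory once and written once regardless of the chain depth, and then derives two numbers from the abstract's own figures: the peak implied by each sustained rate and the floating-point operations performed per byte of memory traffic.

```python
def jacobi_step(grid):
    """One 2D Jacobi sweep (assumed stencil: mean of the four nearest neighbors)."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # boundary cells are copied unchanged (assumption)
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return new


def pipelined_sweeps(grid, depth):
    """Apply `depth` sweeps back to back, as a software stand-in for a chain of stages.

    Only the input and the final output would touch external memory, so the
    memory traffic does not grow with `depth` while the arithmetic does.
    """
    for _ in range(depth):
        grid = jacobi_step(grid)
    return grid


def implied_peak(sustained_gflops, fraction_of_peak):
    """Peak rate implied by a sustained rate and its stated fraction of peak."""
    return sustained_gflops / fraction_of_peak


def flops_per_byte(sustained_gflops, bandwidth_gbytes_per_s):
    """Floating-point operations performed per byte moved to or from memory."""
    return sustained_gflops / bandwidth_gbytes_per_s


if __name__ == "__main__":
    grid = [[1.0, 1.0, 1.0],
            [1.0, 0.0, 1.0],
            [1.0, 1.0, 1.0]]
    print(pipelined_sweeps(grid, depth=3))
    # Figures taken from the abstract: 260 GFlop/s at 87.4 percent, 2.0 GB/s.
    print(round(implied_peak(260.0, 0.874), 1), "GFlop/s implied peak (2D)")
    print(round(implied_peak(236.0, 0.839), 1), "GFlop/s implied peak (3D)")
    print(flops_per_byte(260.0, 2.0), "flop per byte (2D)")
```

With the abstract's figures, the 2D case works out to 130 floating-point operations per byte of memory traffic, which is why lengthening the chain of sweeps, rather than widening the memory interface, is what lets throughput grow with the number of devices.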
INDEX TERMS
Field programmable gate arrays, Arrays, Bandwidth, Scalability, Hardware, Computational modeling, high-performance computation, Scalable streaming-array, stencil computation, custom computing machine, FPGA
CITATION
Kentaro Sano, Yoshiaki Hatsuda, Satoru Yamamoto, "Multi-FPGA Accelerator for Scalable Stencil Computation with Constant Memory Bandwidth", IEEE Transactions on Parallel & Distributed Systems, vol.25, no. 3, pp. 695-705, March 2014, doi:10.1109/TPDS.2013.51