Issue No.01 - January (2011 vol.22)

pp: 119-131

John C. Linford, Virginia Polytechnic Institute and State University, Blacksburg

John Michalakes, National Center for Atmospheric Research, Boulder

Manish Vachharajani, University of Colorado, Boulder

Adrian Sandu, Virginia Polytechnic Institute and State University, Blacksburg

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TPDS.2010.106

ABSTRACT

This work presents the Kinetics PreProcessor: Accelerated (KPPA), a general analysis and code generation tool that achieves significantly reduced time-to-solution for chemical kinetics kernels on three multicore platforms: NVIDIA GPUs using CUDA, the Cell Broadband Engine, and Intel Quad-Core Xeon CPUs. A comparative performance analysis of chemical kernels from WRF-Chem and the Community Multiscale Air Quality Model (CMAQ) is presented for each platform in double and single precision on coarse and fine grids. We introduce the multicore architecture parameterization that KPPA uses to generate a chemical kernel for these platforms and describe a code generation system that produces highly tuned platform-specific code. Compared to state-of-the-art serial implementations, speedups exceeding 25× are regularly observed, with a maximum observed speedup of 41.1× in single precision.

INDEX TERMS

KPPA, multicore, NVIDIA CUDA, Cell Broadband Engine, OpenMP, chemical kinetics, atmospheric modeling, Kinetics PreProcessor, WRF-Chem, CMAQ.

CITATION

John C. Linford, John Michalakes, Manish Vachharajani, Adrian Sandu, "Automatic Generation of Multicore Chemical Kernels," *IEEE Transactions on Parallel & Distributed Systems*, vol. 22, no. 1, pp. 119-131, January 2011, doi:10.1109/TPDS.2010.106