A Compute Unified System Architecture for Graphics Clusters Incorporating Data Locality
July/August 2009 (vol. 15 no. 4)
pp. 605-617
Christoph Müller, Visualisierungsinstitut der Universität Stuttgart
Steffen Frey, Visualisierungsinstitut der Universität Stuttgart
Magnus Strengert, Visualisierungsinstitut der Universität Stuttgart
Carsten Dachsbacher, Visualisierungsinstitut der Universität Stuttgart
Thomas Ertl, Visualisierungsinstitut der Universität Stuttgart
We present a development environment for distributed GPU computing that targets multi-GPU systems as well as graphics clusters. Our system is based on CUDA and logically extends its parallel programming model for graphics processors to higher levels of parallelism, namely the PCI bus and network interconnects. The extended API mimics the full function set of current graphics hardware, including the concept of global memory, on all distribution layers, while the underlying communication mechanisms are handled transparently for the application developer. To allow for high scalability, in particular in network-interconnected environments, we introduce an automatic GPU-accelerated scheduling mechanism that is aware of data locality. This way, the overall amount of transmitted data can be substantially reduced, which leads to better GPU utilization and faster execution. We evaluate the performance and scalability of our system for bus-level and especially network-level parallelism on typical multi-GPU systems and graphics clusters.
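
To make the idea of locality-aware scheduling concrete, the following is a minimal CPU-side sketch in Python. It is not the scheduler described in the article (which is GPU-accelerated and whose details are not given in this abstract); it only illustrates the general principle of assigning work to the node that already holds most of the required data. The names `Node`, `transfer_cost`, `schedule`, and `round_robin`, as well as the block sizes, are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical model of one cluster node (not an API from the article)."""
    name: str
    resident: set = field(default_factory=set)  # ids of data blocks held locally
    load: int = 0                                # tasks assigned so far


def transfer_cost(blocks, node, block_size):
    """Bytes that must be sent to `node` before it can process `blocks`."""
    return sum(block_size[b] for b in blocks if b not in node.resident)


def _assign(blocks, node, block_size):
    """Run a task on `node`: pay for missing blocks, then mark them resident."""
    cost = transfer_cost(blocks, node, block_size)
    node.resident |= set(blocks)
    node.load += 1
    return cost


def schedule(tasks, nodes, block_size):
    """Greedy locality-aware assignment (illustrative assumption).

    Each task goes to the node that needs the least new data; ties are
    broken in favour of the node with the lower current load.
    Returns (task -> node name, total bytes transferred).
    """
    assignment, total = {}, 0
    for task_id, blocks in tasks.items():
        best = min(nodes, key=lambda n: (transfer_cost(blocks, n, block_size), n.load))
        total += _assign(blocks, best, block_size)
        assignment[task_id] = best.name
    return assignment, total


def round_robin(tasks, nodes, block_size):
    """Locality-unaware baseline: tasks are dealt out to nodes in turn."""
    assignment, total = {}, 0
    for i, (task_id, blocks) in enumerate(tasks.items()):
        node = nodes[i % len(nodes)]
        total += _assign(blocks, node, block_size)
        assignment[task_id] = node.name
    return assignment, total


if __name__ == "__main__":
    block_size = {"A": 100, "B": 100, "C": 100, "D": 100}
    tasks = {"t0": ["A", "B"], "t1": ["A", "B"], "t2": ["C", "D"], "t3": ["C", "D"]}
    print(schedule(tasks, [Node("n0"), Node("n1")], block_size))
    print(round_robin(tasks, [Node("n0"), Node("n1")], block_size))
```

In this toy setup the locality-aware variant transfers 400 bytes in total, whereas the round-robin baseline transfers 800, because the baseline sends the same blocks to both nodes. The greedy rule only uses load as a tie-breaker, so a real scheduler would also have to weigh load balance against transfer volume.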

Index Terms:
GPU computing, graphics clusters, parallel programming.
Christoph Müller, Steffen Frey, Magnus Strengert, Carsten Dachsbacher, Thomas Ertl, "A Compute Unified System Architecture for Graphics Clusters Incorporating Data Locality," IEEE Transactions on Visualization and Computer Graphics, vol. 15, no. 4, pp. 605-617, July-Aug. 2009, doi:10.1109/TVCG.2008.188