Issue No.06 - June (2012 vol.61)
pp: 804-816
Lin Shi, Hunan University, Changsha
Hao Chen, Hunan University, Changsha
Jianhua Sun, Hunan University, Changsha
Kenli Li, Hunan University, Changsha
This paper describes vCUDA, a general-purpose graphics processing unit (GPGPU) computing solution for virtual machines (VMs). vCUDA allows applications executing within VMs to leverage hardware acceleration, which can benefit the performance of a class of high-performance computing (HPC) applications. The key ideas in our design are API call interception and redirection, and a dedicated RPC system for VMs. With API interception and redirection, Compute Unified Device Architecture (CUDA) applications in VMs can transparently access a graphics hardware device and achieve high computing performance. With the dedicated RPC system, vCUDA achieved near-native performance in the current study. We carried out a detailed performance analysis of our framework. Using a number of unmodified official examples from the CUDA SDK and third-party applications in the evaluation, we observed that CUDA applications running with vCUDA incurred a very low performance penalty compared with the native environment, demonstrating the viability of the vCUDA architecture.
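The interception-and-redirection idea can be illustrated with a toy sketch. The code below is not vCUDA's implementation: every name in it (`HostBackend`, `GuestShim`, `host_dispatch`, and the `mem_alloc`, `mem_write`, `mem_read` calls) is hypothetical, the "device" is a plain byte buffer, and the "RPC transport" is an in-process function call standing in for a VM-to-host channel. It only shows the general pattern: a guest-side shim exposes the same call names as the backend, serializes each call, and forwards it to a host-side dispatcher that runs it where the device lives.

```python
import pickle


class HostBackend:
    """Hypothetical stand-in for the host-side component that owns the device."""

    def __init__(self):
        self._buffers = {}      # handle -> bytearray, pretending to be device memory
        self._next_handle = 1

    def mem_alloc(self, nbytes):
        handle = self._next_handle
        self._next_handle += 1
        self._buffers[handle] = bytearray(nbytes)
        return handle

    def mem_write(self, handle, data):
        buf = self._buffers[handle]
        buf[:len(data)] = data  # overwrite the first len(data) bytes
        return len(data)

    def mem_read(self, handle, nbytes):
        return bytes(self._buffers[handle][:nbytes])


# Only these call names may be invoked remotely (assumption of this sketch).
ALLOWED_CALLS = {"mem_alloc", "mem_write", "mem_read"}


def host_dispatch(backend, request_bytes):
    """Host side: decode a request, run the named call, encode the result."""
    name, args = pickle.loads(request_bytes)
    if name not in ALLOWED_CALLS:
        raise ValueError("unknown call: " + name)
    result = getattr(backend, name)(*args)
    return pickle.dumps(result)


class GuestShim:
    """Guest side: looks like the backend API but forwards every call."""

    def __init__(self, transport):
        self._transport = transport   # callable: request bytes -> reply bytes
        self.calls_forwarded = 0

    def __getattr__(self, name):
        # Reached only for names not defined on the shim, i.e. the API calls.
        def forward(*args):
            self.calls_forwarded += 1
            reply = self._transport(pickle.dumps((name, args)))
            return pickle.loads(reply)
        return forward


if __name__ == "__main__":
    backend = HostBackend()
    api = GuestShim(lambda req: host_dispatch(backend, req))
    handle = api.mem_alloc(16)
    api.mem_write(handle, b"vcuda")
    print(api.mem_read(handle, 5))  # prints b'vcuda'
```

The application code in the `__main__` section never touches `HostBackend` directly, which is the sense in which redirection is transparent to it. In a real system the cost of the transport step dominates, which is why the abstract pairs interception with a dedicated RPC system.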
CUDA, virtual machine, GPGPU, RPC, virtualization.
Lin Shi, Hao Chen, Jianhua Sun, Kenli Li, "vCUDA: GPU-Accelerated High-Performance Computing in Virtual Machines", IEEE Transactions on Computers, vol. 61, no. 6, pp. 804-816, June 2012, doi:10.1109/TC.2011.112