Issue No.11 - November (2010 vol.59)
pp: 1466-1479
Heeseung Jo, Korea Advanced Institute of Science and Technology, Daejeon
Hwanju Kim, Korea Advanced Institute of Science and Technology, Daejeon
Jae-Wan Jang, Korea Advanced Institute of Science and Technology, Daejeon
Joonwon Lee, SungKyunKwan University, Suwon
Seungryoul Maeng, Korea Advanced Institute of Science and Technology, Daejeon
ABSTRACT
In a consolidated server system that uses virtualization, physical device accesses from guest virtual machines (VMs) must be coordinated. A separate driver VM is usually assigned to this task, both to enhance reliability and to reuse existing device drivers. Because it handles all I/O requests, this driver VM must be highly reliable. This paper describes a mechanism that detects faults in the driver VM and recovers from them, enhancing the reliability of the whole system. The mechanism is transparent: guest VMs do not observe the fault, and the driver VM recovers and continues its I/O operations. Fault detection is based on progress monitoring, is isolated from fault contamination, and incurs low monitoring overhead. When a fault occurs, the system recovers by switching from the faulted driver VM to another one. By fully exploiting the I/O structure of the virtualized system, recovery completes without service disconnection or data loss and with negligible delay.
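The abstract gives no implementation detail, so the following is only a minimal sketch of the general idea of progress-based fault detection with failover to a standby driver. It is not the authors' implementation. Every name in it (`DriverVM`, `Monitor`, `stall_limit`, `poll`) is hypothetical, and the stall threshold and the replay of pending requests are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DriverVM:
    """Hypothetical stand-in for a driver VM; not the paper's data structure."""
    name: str
    healthy: bool = True
    completed: int = 0                       # progress counter read by the monitor
    log: List[int] = field(default_factory=list)

    def service(self, request_id: int) -> bool:
        """Handle one I/O request; a faulted driver makes no progress."""
        if not self.healthy:
            return False
        self.completed += 1
        self.log.append(request_id)
        return True


class Monitor:
    """Lives outside the driver VMs, so a driver fault cannot corrupt its state."""

    def __init__(self, primary: DriverVM, backup: DriverVM, stall_limit: int = 3):
        self.active = primary
        self.backup = backup
        self.stall_limit = stall_limit       # assumed threshold, in polling periods
        self.pending: List[int] = []         # submitted but not yet completed
        self.last_seen = primary.completed
        self.stalls = 0

    def submit(self, request_id: int) -> None:
        """Forward a guest request and remember it until it completes."""
        self.pending.append(request_id)
        if self.active.service(request_id):
            self.pending.remove(request_id)

    def poll(self) -> bool:
        """Run one monitoring period; return True if a failover happened."""
        progressed = self.active.completed != self.last_seen
        self.last_seen = self.active.completed
        # No fault is suspected while work completes or nothing is outstanding.
        if progressed or not self.pending:
            self.stalls = 0
            return False
        self.stalls += 1
        if self.stalls < self.stall_limit:
            return False
        self._failover()
        return True

    def _failover(self) -> None:
        """Switch to the standby driver and replay unfinished requests in order."""
        self.active, self.backup = self.backup, self.active
        self.last_seen = self.active.completed
        self.stalls = 0
        for request_id in list(self.pending):
            if self.active.service(request_id):
                self.pending.remove(request_id)
```

In this sketch the guest only ever calls `submit`, so it never learns which driver served a request. Requests outstanding at the time of the fault are replayed on the standby driver, which is one simple way to avoid losing them.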
INDEX TERMS
Reliability, disconnected operation, fault tolerance, high availability.
CITATION
Heeseung Jo, Hwanju Kim, Jae-Wan Jang, Joonwon Lee, Seungryoul Maeng, "Transparent Fault Tolerance of Device Drivers for Virtual Machines", IEEE Transactions on Computers, vol.59, no. 11, pp. 1466-1479, November 2010, doi:10.1109/TC.2010.61