Issue No.12 - December (2011 vol.60)
pp: 1718-1729
Jun Zhu, Peking University, Beijing
Zhefu Jiang, Peking University, Beijing
Zhen Xiao, Peking University, Beijing
Xiaoming Li, State Key Laboratory of Software Development Environment, MOST, China
Hypervisor-based fault tolerance (HBFT), which synchronizes state between the primary VM and the backup VM at short intervals of tens to hundreds of milliseconds, is an emerging approach to sustaining mission-critical applications. Built on virtualization technology, HBFT provides an economical and transparent fault-tolerant solution. However, these advantages currently come at the cost of substantial performance overhead during failure-free operation, especially for memory-intensive applications. This paper presents an in-depth examination of HBFT and of options for improving its performance. Based on the behavior of memory accesses across checkpointing epochs, we introduce two optimizations for the memory tracking mechanism: read-fault reduction and write-fault prediction. These two optimizations improve performance by 31 percent and 21 percent, respectively, for some applications. We then present software superpage, which efficiently maps large memory regions between virtual machines (VMs). Our optimizations improve the performance of HBFT by a factor of 1.4 to 2.2, reaching about 60 percent of the performance of the native VM.
Virtualization, hypervisor, checkpoint, recovery, fault tolerance.
Jun Zhu, Zhefu Jiang, Zhen Xiao, Xiaoming Li, "Optimizing the Performance of Virtual Machine Synchronization for Fault Tolerance", IEEE Transactions on Computers, vol. 60, no. 12, pp. 1718-1729, December 2011, doi:10.1109/TC.2010.224