A Comprehensive Model for Software Rejuvenation
April-June 2005 (vol. 2 no. 2)
pp. 124-137
Recently, the phenomenon of software aging, in which the state of a software system degrades with time, has been reported. This phenomenon, which may eventually lead to system performance degradation and/or crash/hang failure, is the result of exhaustion of operating system resources, data corruption, and numerical error accumulation. To counteract software aging, a technique called software rejuvenation has been proposed, which essentially involves occasionally terminating an application or a system, cleaning its internal state and/or its environment, and restarting it. Since rejuvenation incurs an overhead, an important research issue is to determine optimal times to initiate this action. In this paper, we first describe how to include faults attributed to software aging in the framework of Gray's software fault classification (deterministic and transient), and study the treatment and recovery strategies for each of the fault classes. We then construct a semi-Markov reward model based on workload and resource usage data collected from the UNIX operating system. We identify different workload states using statistical cluster analysis, and estimate transition probabilities and sojourn time distributions from the data. Corresponding to each resource, a reward function is then defined for the model based on the rate of resource depletion in each state. The model is then solved to obtain estimated times to exhaustion for each resource. The results from the semi-Markov reward model are then fed into a higher-level availability model that accounts for failure followed by reactive recovery, as well as proactive recovery. This comprehensive model is then used to derive optimal rejuvenation schedules that maximize availability or minimize downtime cost.
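As a rough illustration of the measurement-based step described in the abstract, the following Python sketch estimates transition probabilities and mean sojourn times from an observed sequence of workload states, then combines them with per-state depletion rates to approximate a time to exhaustion. It is a minimal sketch under stated assumptions, not the authors' solution method: it uses only the long-run average depletion rate (a first-order approximation) rather than solving the semi-Markov reward model, and the function names, the state labels, and the input format are all hypothetical.

```python
from collections import defaultdict


def estimate_smp(observations):
    """Estimate embedded-chain transition probabilities and mean sojourn times.

    observations: list of (state, sojourn_time) pairs in visit order
    (hypothetical input format; states would come from cluster analysis).
    Assumes every state has at least one observed outgoing transition.
    """
    counts = defaultdict(lambda: defaultdict(int))
    total_time = defaultdict(float)
    visits = defaultdict(int)
    for i, (state, duration) in enumerate(observations):
        total_time[state] += duration
        visits[state] += 1
        if i + 1 < len(observations):
            counts[state][observations[i + 1][0]] += 1
    states = sorted(visits)
    P = {}
    for s in states:
        n = sum(counts[s].values())
        P[s] = {t: (counts[s][t] / n if n else 0.0) for t in states}
    mean_sojourn = {s: total_time[s] / visits[s] for s in states}
    return states, P, mean_sojourn


def stationary(states, P, iters=1000):
    """Stationary distribution of the embedded chain.

    Uses a damped ("lazy") power iteration so that periodic chains
    also converge; assumes the chain is irreducible.
    """
    pi = {s: 1.0 / len(states) for s in states}
    for _ in range(iters):
        step = {t: sum(pi[s] * P[s][t] for s in states) for t in states}
        pi = {s: 0.5 * pi[s] + 0.5 * step[s] for s in states}
    return pi


def time_to_exhaustion(observations, depletion_rate, available):
    """First-order estimate: available resource / long-run depletion rate.

    depletion_rate: resource units consumed per unit time in each state
    (plays the role of the reward rate in this sketch).
    """
    states, P, h = estimate_smp(observations)
    pi = stationary(states, P)
    # Long-run fraction of time in state s is pi[s] * h[s] / sum_j pi[j] * h[j].
    weight = sum(pi[s] * h[s] for s in states)
    avg_rate = sum(pi[s] * h[s] * depletion_rate[s] for s in states) / weight
    return available / avg_rate
```

For example, with two alternating states where "low" lasts 10 time units at a depletion rate of 1 and "high" lasts 5 time units at a rate of 4, the long-run rate is 2 units per unit time, so 1000 units of a resource would last about 500 time units under this approximation.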

Index Terms:
Availability, measurement-based dependability evaluation, semi-Markov reward models, software aging, software rejuvenation, workload characterization.
Citation:
Kalyanaraman Vaidyanathan, Kishor S. Trivedi, "A Comprehensive Model for Software Rejuvenation," IEEE Transactions on Dependable and Secure Computing, vol. 2, no. 2, pp. 124-137, April-June 2005, doi:10.1109/TDSC.2005.15