Analysis of Restart Mechanisms in Software Systems
August 2006 (vol. 32 no. 8)
pp. 547-558
Restarts or retries are a common phenomenon in computing systems, for instance, in preventive maintenance, software rejuvenation, or when a failure is suspected. Typically, one sets a time-out to trigger the restart. We analyze and optimize time-out strategies for scenarios in which the expected required remaining time of a task is not always decreasing with the time invested in it. Examples of such tasks include the download of Web pages, randomized algorithms, distributed queries, and jobs subject to network or other failures. Assuming the independence of the completion time of successive tries, we derive computationally attractive expressions for the moments of the completion time, as well as for the probability that a task is able to meet a deadline. These expressions facilitate efficient algorithms to compute optimal restart strategies and are promising candidates for pragmatic online optimization of restart timers.
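
As a rough illustration of the kind of time-out optimization the abstract describes, the sketch below estimates the expected completion time under a fixed restart time-out from a sample of observed completion times. It relies on the standard identity E[T] = E[min(X, tau)] / P(X <= tau), which holds when successive tries are independent and a restart costs nothing; it is not taken from the paper, whose expressions may differ. The function names and the sample data are hypothetical.

```python
def expected_completion_time(samples, timeout):
    """Estimate the expected completion time when every try is aborted
    and restarted after `timeout`.

    Assumptions (not from the paper): tries are independent, restarts
    have zero overhead, and `samples` are i.i.d. completion times of a
    single try without restart.
    """
    n = len(samples)
    successes = sum(1 for x in samples if x <= timeout)
    if successes == 0:
        return float("inf")  # no observed try finishes within this time-out
    # E[min(X, timeout)]: time spent per try, finished or aborted.
    mean_truncated = sum(min(x, timeout) for x in samples) / n
    # Divide by the estimated per-try success probability P(X <= timeout).
    return mean_truncated / (successes / n)


def best_timeout(samples):
    """Return the observed completion time that minimizes the estimate.

    For an empirical distribution the estimate only increases between
    consecutive sample values, so checking the sample values suffices.
    """
    candidates = sorted(set(samples))
    return min(candidates, key=lambda t: expected_completion_time(samples, t))


# Hypothetical data: nine fast tries (1 s) and one very slow try (100 s).
samples = [1.0] * 9 + [100.0]
no_restart = sum(samples) / len(samples)
with_restart = expected_completion_time(samples, best_timeout(samples))
print(no_restart, with_restart)
```

With this made-up sample, restarting after the fast mode pays off because a long-running try is more likely to be stuck than close to finishing, which is the situation the abstract refers to when the expected remaining time does not decrease with the time already invested.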

Index Terms:
Restart, software rejuvenation, time-out, fault-tolerant systems, performance and reliability modeling, completion time, adaptive systems, self-management.
Aad P.A. van Moorsel, Katinka Wolter, "Analysis of Restart Mechanisms in Software Systems," IEEE Transactions on Software Engineering, vol. 32, no. 8, pp. 547-558, Aug. 2006, doi:10.1109/TSE.2006.73