Analysis and optimization of service availability in a HA cluster with load-dependent machine availability
September 2007 (vol. 18 no. 9)
pp. 1307-1319
Calculations of service availability of a High-Availability (HA) cluster are usually based on the assumption of load-independent machine availabilities. In this paper, we study the issues and show how the service availabilities can be calculated under the assumption that machine availabilities are load-dependent. We present a Markov chain analysis to derive the steady-state service availabilities of an HA cluster with load-dependent machine availability. We show that, with load-dependent machine availability, the attained service availability is policy-dependent. After formulating the problem as a Markov Decision Process, we determine the optimal policy that achieves the maximum service availabilities using the method of policy iteration. Two greedy assignment algorithms are studied: least-load and FDL-based, where least-load corresponds to some load-balancing algorithms. We carry out analysis and simulations on two cases of load profiles: in the first profile, a single machine has the capacity to host all services in the HA cluster; in the second profile, a single machine does not have enough capacity to host all services. We show that the service availabilities achieved under the first load profile are the same, while the service availabilities achieved under the second load profile are different. Because the service availabilities differ under the second load profile, we investigate how the distribution of service availabilities across the services can be controlled by adjusting the rewards vector.
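As a rough illustration of why load-dependent machine availability makes the result depend on the assignment policy, the sketch below models a single machine as a two-state (up/down) Markov chain and computes its steady-state availability. It is not the paper's model: the exponential failure-rate form, the numeric rates, and the function names `failure_rate` and `machine_availability` are all assumptions chosen for illustration, and failover between machines is ignored.

```python
import math


def failure_rate(load, base_rate=0.001, sensitivity=2.0):
    """Hypothetical load-dependent failure rate (failures per hour).

    The exponential form and both parameters are assumptions, not
    values taken from the paper.
    """
    return base_rate * math.exp(sensitivity * load)


def machine_availability(load, repair_rate=0.1):
    """Steady-state probability that a two-state (up/down) chain is up.

    Balance equation: pi_up * failure = pi_down * repair, with
    pi_up + pi_down = 1, so pi_up = repair / (failure + repair).
    """
    lam = failure_rate(load)
    return repair_rate / (lam + repair_rate)


# Two services, each adding a (hypothetical) load of 0.4 to its host.
packed = machine_availability(0.8)  # both services on one machine
spread = machine_availability(0.4)  # one service per machine

print(f"packed on one machine: {packed:.6f}")
print(f"spread over two:       {spread:.6f}")
```

Under these assumptions the heavier-loaded machine is less available, so the same set of services sees different availability depending on where they are placed. With a load-independent failure rate the two placements would give the same per-machine availability.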

Index Terms:
High Availability, cluster computing, Markov chains, Markov decision processes, dynamic programming, neuro-dynamic programming
Chee-Wei Ang, Chen-Khong Tham, "Analysis and optimization of service availability in a HA cluster with load-dependent machine availability," IEEE Transactions on Parallel and Distributed Systems, vol. 18, no. 9, pp. 1307-1319, Sept. 2007, doi:10.1109/TPDS.2007.1071