Maximizing Service Reliability in Distributed Computing Systems with Random Node Failures: Theory and Implementation
October 2010 (vol. 21 no. 10)
pp. 1531-1544
Jorge E. Pezoa, University of New Mexico, Albuquerque
Sagar Dhakal, Naval Research Laboratory, Washington DC
Majeed M. Hayat, University of New Mexico, Albuquerque
In distributed computing systems (DCSs) where server nodes can fail permanently with nonzero probability, the system performance can be assessed by means of the service reliability, defined as the probability of serving all the tasks queued in the DCS before all the nodes fail. This paper presents a rigorous probabilistic framework to analytically characterize the service reliability of a DCS in the presence of communication uncertainties and stochastic topological changes due to node deletions. The framework considers a system composed of heterogeneous nodes with stochastic service and failure times and a communication network imposing random tangible delays. The framework also permits arbitrarily specified, distributed load-balancing actions to be taken by the individual nodes in order to improve the service reliability. The presented analysis is based upon a novel use of the concept of stochastic regeneration, which is exploited to derive a system of difference-differential equations characterizing the service reliability. The theory is further utilized to optimize certain load-balancing policies for maximal service reliability; the optimization is carried out by means of an algorithm that scales linearly with the number of nodes in the system. The analytical model is validated using both Monte Carlo simulations and experimental data collected from a DCS testbed.
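
The service-reliability definition above can be illustrated with a small Monte Carlo estimate. The sketch below is not the paper's regeneration-based model: it assumes exponentially distributed service and failure times with positive rates, no load balancing, no communication delays, and that tasks stranded on a failed node are lost. Under those assumptions each node must finish its own queue before it fails, so the reliability has the closed form of a product of (mu / (mu + lambda))^m terms, which the simulation can be checked against. All function names and parameter values are hypothetical.

```python
import random


def closed_form_reliability(queues, service_rates, failure_rates):
    """Product over nodes of P(node serves its m tasks before failing).

    Assumes exponential service (rate mu) and failure (rate lam) times and
    no task transfers, so each node contributes (mu / (mu + lam)) ** m.
    """
    reliability = 1.0
    for m, mu, lam in zip(queues, service_rates, failure_rates):
        reliability *= (mu / (mu + lam)) ** m
    return reliability


def simulate_once(queues, service_rates, failure_rates, rng):
    """Return True if every node finishes its queue before it fails."""
    for m, mu, lam in zip(queues, service_rates, failure_rates):
        failure_time = rng.expovariate(lam)  # time of permanent failure
        busy_time = sum(rng.expovariate(mu) for _ in range(m))  # time to serve m tasks
        if busy_time > failure_time:
            return False  # assumption: stranded tasks are lost
    return True


def monte_carlo_reliability(queues, service_rates, failure_rates,
                            trials=20000, seed=0):
    """Estimate service reliability as the fraction of successful runs."""
    rng = random.Random(seed)
    successes = sum(
        simulate_once(queues, service_rates, failure_rates, rng)
        for _ in range(trials)
    )
    return successes / trials


if __name__ == "__main__":
    # Hypothetical two-node system: initial queues, service rates, failure rates.
    queues = [5, 3]
    service_rates = [2.0, 1.0]
    failure_rates = [0.05, 0.1]
    print("closed form:", closed_form_reliability(queues, service_rates, failure_rates))
    print("monte carlo:", monte_carlo_reliability(queues, service_rates, failure_rates))
```

Moving tasks from a failure-prone node to a more reliable one changes the per-node exponents in the product, which is the kind of trade-off a load-balancing policy would tune. Accounting for transfer delays, as the paper does, requires a richer model than this sketch.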

Index Terms:
Renewal theory, queuing theory, reliability, distributed computing, load balancing.
Jorge E. Pezoa, Sagar Dhakal, Majeed M. Hayat, "Maximizing Service Reliability in Distributed Computing Systems with Random Node Failures: Theory and Implementation," IEEE Transactions on Parallel and Distributed Systems, vol. 21, no. 10, pp. 1531-1544, Oct. 2010, doi:10.1109/TPDS.2010.34