Reliability of Heterogeneous Distributed Computing Systems in the Presence of Correlated Failures
April 2014 (vol. 25 no. 4)
pp. 1034-1043
Majeed M. Hayat, Dept. of Electr. & Comput. Eng., Univ. of New Mexico, Albuquerque, NM, USA
Jorge E. Pezoa, Dept. of Electr. & Comput. Eng., Univ. of New Mexico, Albuquerque, NM, USA
While the reliability of distributed-computing systems (DCSs) has been widely studied under the assumption that computing elements (CEs) fail independently, the impact of correlated failures of CEs on the reliability remains an open question. Here, the problem of modeling and assessing the impact of stochastic, correlated failures on the service reliability of applications running on DCSs is tackled. The service reliability is modeled using an integrated analytical and Monte Carlo (MC) approach. The analytical component of the model comprises a generalization of a previously developed model for reliability of non-Markovian DCSs to a setting where specific patterns of simultaneous failures in CEs are allowed. The analytical model is complemented by an MC-based procedure to draw correlated-failure patterns using the recently reported concept of probabilistic shared risk groups (PSRGs). The reliability model is further utilized to develop and optimize a novel class of dynamic task reallocation (DTR) policies that maximize the reliability of DCSs in the presence of correlated failures. Theoretical predictions, MC simulations, and results from an emulation testbed show that the reliability can be improved when DTR policies correctly account for correlated failures. The impact of correlated failures of CEs on the reliability and the key dependence of DTR policies on the type of correlated failures are also investigated.
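To illustrate what drawing correlated-failure patterns from probabilistic shared risk groups could look like, here is a minimal Python sketch. It is not the procedure from the paper. It assumes a simple PSRG reading in which each risk event occurs independently with some probability and, given the event, each CE in the group fails with its own conditional probability. The function names, the data layout, and all probabilities are hypothetical and chosen only for illustration.

```python
import random


def draw_failure_pattern(groups, rng):
    """Draw one set of simultaneously failed CEs.

    groups: list of (event_prob, {ce_id: conditional_fail_prob}) pairs.
    This layout is an assumption of the sketch, not taken from the paper.
    """
    failed = set()
    for event_prob, members in groups:
        # The shared risk event either occurs or does not occur.
        if rng.random() < event_prob:
            # Given the event, each member CE fails with its own probability.
            for ce, cond_prob in members.items():
                if rng.random() < cond_prob:
                    failed.add(ce)
    return frozenset(failed)


def estimate_pattern_probs(groups, trials, seed=0):
    """Estimate the probability of each failure pattern by Monte Carlo."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(trials):
        pattern = draw_failure_pattern(groups, rng)
        counts[pattern] = counts.get(pattern, 0) + 1
    return {pattern: n / trials for pattern, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical example: two overlapping risk groups over three CEs.
    example_groups = [
        (0.10, {"ce1": 0.9, "ce2": 0.8}),
        (0.05, {"ce2": 0.7, "ce3": 0.6}),
    ]
    probs = estimate_pattern_probs(example_groups, trials=10000)
    for pattern, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        print(sorted(pattern), round(p, 4))
```

Because ce2 belongs to both groups in the example, its failures are correlated with those of ce1 and ce3, which is the kind of dependence an independent-failure model would miss. Estimated pattern probabilities of this sort could then feed a reliability model that conditions on simultaneous failures.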
Index Terms:
Servers, Reliability, Analytical models, Correlation, Computational modeling, Vectors, shared risk group, Distributed computing, load balancing, reliability, non-Markovian process, spatially correlated failures
Citation:
Majeed M. Hayat, Jorge E. Pezoa, "Reliability of Heterogeneous Distributed Computing Systems in the Presence of Correlated Failures," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 4, pp. 1034-1043, April 2014, doi:10.1109/TPDS.2013.78