Issue No. 04 - April (2014 vol. 25)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TPDS.2013.78
Jorge E. Pezoa, Dept. of Electr. & Comput. Eng., Univ. of New Mexico, Albuquerque, NM, USA
Majeed M. Hayat, Dept. of Electr. & Comput. Eng., Univ. of New Mexico, Albuquerque, NM, USA
While the reliability of distributed-computing systems (DCSs) has been widely studied under the assumption that computing elements (CEs) fail independently, the impact of correlated failures of CEs on the reliability remains an open question. Here, the problem of modeling and assessing the impact of stochastic, correlated failures on the service reliability of applications running on DCSs is tackled. The service reliability is modeled using an integrated analytical and Monte-Carlo (MC) approach. The analytical component of the model comprises a generalization of a previously developed model for reliability of non-Markovian DCSs to a setting where specific patterns of simultaneous failures in CEs are allowed. The analytical model is complemented by an MC-based procedure to draw correlated-failure patterns using the recently reported concept of probabilistic shared risk groups (PSRGs). The reliability model is further utilized to develop and optimize a novel class of dynamic task reallocation (DTR) policies that maximize the reliability of DCSs in the presence of correlated failures. Theoretical predictions, MC simulations, and results from an emulation testbed show that the reliability can be improved when DTR policies correctly account for correlated failures. The impact of correlated failures of CEs on the reliability and the key dependence of DTR policies on the type of correlated failures are also investigated.
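The idea of drawing correlated-failure patterns from probabilistic shared risk groups can be illustrated with a minimal Python sketch. It is not the authors' procedure. It assumes a simple reading of a PSRG: each group has a probability that its shared risk event occurs, and, given the event, each member CE fails with its own conditional probability. The names `PSRG`, `draw_failure_pattern`, and `estimate_joint_failure` and all numeric values are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class PSRG:
    """Hypothetical probabilistic shared risk group (assumed structure)."""
    event_prob: float        # probability that the shared risk event occurs
    member_fail_prob: dict   # CE id -> failure probability given the event


def draw_failure_pattern(groups, rng):
    """Draw one set of simultaneously failed CEs from a list of PSRGs."""
    failed = set()
    for group in groups:
        # The shared event is drawn once per group; this is what couples
        # the failures of the CEs that belong to the same group.
        if rng.random() < group.event_prob:
            for ce, p in group.member_fail_prob.items():
                if rng.random() < p:
                    failed.add(ce)
    return frozenset(failed)


def estimate_joint_failure(groups, ce_a, ce_b, n_samples=20000, seed=0):
    """Monte Carlo estimate of the probability that both CEs fail together."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        if {ce_a, ce_b} <= draw_failure_pattern(groups, rng):
            hits += 1
    return hits / n_samples


if __name__ == "__main__":
    # Two CEs behind one shared risk (illustrative numbers only).
    groups = [PSRG(event_prob=0.5, member_fail_prob={"ce1": 1.0, "ce2": 1.0})]
    print(estimate_joint_failure(groups, "ce1", "ce2"))
```

In this toy configuration each CE fails with marginal probability 0.5, so independent failures would give a joint failure probability of 0.25, while the shared event pushes the joint probability toward 0.5. That gap is the kind of correlation effect a reliability model would need to account for.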
Servers, Reliability, Analytical models, Correlation, Computational modeling, Vectors, shared risk group, Distributed computing, load balancing, reliability, non-Markovian process, spatially correlated failures
Jorge E. Pezoa, Majeed M. Hayat, "Reliability of Heterogeneous Distributed Computing Systems in the Presence of Correlated Failures", IEEE Transactions on Parallel & Distributed Systems, vol. 25, no. 4, pp. 1034-1043, April 2014, doi:10.1109/TPDS.2013.78