Reliability and availability in the presence of faults are critical in the design of large multicomputer systems. However, predicting the reliability and availability of such systems is very difficult. In this paper, we study the reliability and availability of large multicomputer systems under a more realistic model in which each network node has an independent failure probability. We mainly consider large mesh-connected multicomputer systems, and the metric is the connectivity probability of the network. In [8], we proved that if the node failure probability is fixed, then the connectivity probability of mesh networks can be arbitrarily small when the network size is sufficiently large. Thus, it is of practical importance for multicomputer system manufacturers to determine the upper bound on the node failure probability when the required probability of network connectivity and the network size are given. We develop a novel technique to formally derive lower bounds on the connectivity probability of mesh networks. Our study shows that mesh networks of practical size can tolerate a large number of faulty nodes and are thus reliable enough for multicomputer systems. For example, we formally prove that as long as the node failure probability is bounded by 0.09% (note that with today's VLSI technology, building network nodes with a failure probability under 0.09% is achievable), mesh networks of up to a million nodes remain connected with a probability larger than 99%. The results on mesh network reliability and availability are obtained through formal and thorough mathematical proofs.
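The failure model described above can be illustrated with a small Monte Carlo sketch. This is not the formal lower-bound technique of the paper; it only gives an empirical estimate for small meshes. It assumes a two-dimensional square mesh and takes "connected" to mean that the non-faulty nodes form a single connected component; the abstract fixes neither point. The function names `surviving_nodes_connected` and `estimate_connectivity` are hypothetical.

```python
import random
from collections import deque


def surviving_nodes_connected(alive):
    """Return True if the non-faulty nodes of a 2D mesh form one component.

    `alive` is a list of rows of booleans (True = working node).
    Assumption: a mesh with no working node counts as disconnected.
    """
    rows, cols = len(alive), len(alive[0])
    live = [(r, c) for r in range(rows) for c in range(cols) if alive[r][c]]
    if not live:
        return False
    # Breadth-first search from one working node over mesh links.
    seen = {live[0]}
    queue = deque([live[0]])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and alive[nr][nc] and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return len(seen) == len(live)


def estimate_connectivity(side, p_fail, trials=200, seed=0):
    """Estimate the connectivity probability of a side x side mesh.

    Each node fails independently with probability `p_fail`.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        alive = [[rng.random() >= p_fail for _ in range(side)]
                 for _ in range(side)]
        hits += surviving_nodes_connected(alive)
    return hits / trials


if __name__ == "__main__":
    # 0.09% node failure probability on a small 32 x 32 mesh.
    print(estimate_connectivity(32, 0.0009))
```

Simulation at this scale cannot confirm the million-node claim with any rigor, which is why a formally derived lower bound is needed; the sketch only makes the model concrete.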
Citation:
Gaocai Wang, Jianer Chen, Guojun Wang, Songqiao Chen, "Probability Model for Faults in Large-Scale Multicomputer Systems," 12th Asian Test Symposium (ATS'03), p. 452, 2003.