Issue No.08 - August (2010 vol.21)
Ricardo Fernández-Pascual , Universidad de Murcia, Murcia
Manuel E. Acacio , Universidad de Murcia, Murcia
José Duato , Universidad Politécnica de Valencia, Valencia
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TPDS.2009.148
The importance of transient faults is predicted to grow due to current technology trends of increased scale of integration. One of the components that will be significantly affected by transient faults is the interconnection network of chip multiprocessors (CMPs). Unlike other authors, we propose to deal with these faults efficiently by using fault-tolerant cache coherence protocols that ensure the correct execution of programs even when not all messages are correctly delivered. We describe the extensions made to a directory-based cache coherence protocol to provide fault tolerance, and we provide a modified set of token counting rules that are useful for designing fault-tolerant token-based cache coherence protocols. We compare the directory-based fault-tolerant protocol with a token-based fault-tolerant one. We also show how to adjust the fault tolerance parameters to achieve the desired level of fault tolerance, and we measure the overhead incurred in supporting very high fault rates. Simulation results using a set of scientific, multimedia, and commercial applications show that the fault tolerance measures have virtually no impact on execution time with respect to a non-fault-tolerant protocol. Additionally, our protocols can support very high rates of transient faults at the cost of slightly increased network traffic.
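To illustrate why a lost message is a problem for token-based coherence in particular, the following minimal sketch models the baseline token counting idea (a fixed number of tokens per memory block, at least one token to read, all tokens to write). It is not the modified rule set proposed in the paper, which the abstract does not spell out. The names `Node`, `send_tokens`, `tokens_conserved`, and the value of `TOTAL_TOKENS` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass

TOTAL_TOKENS = 4  # assumption: one token per node in a 4-node CMP


@dataclass
class Node:
    tokens: int = 0

    def can_read(self) -> bool:
        # Baseline token counting: at least one token is needed to read.
        return self.tokens >= 1

    def can_write(self) -> bool:
        # Baseline token counting: all tokens are needed to write.
        return self.tokens == TOTAL_TOKENS


def send_tokens(src: Node, dst: Node, count: int, delivered: bool = True) -> None:
    """Move tokens from src to dst; an undelivered message loses them in flight."""
    if count < 0 or count > src.tokens:
        raise ValueError("invalid token count")
    src.tokens -= count
    if delivered:
        dst.tokens += count


def tokens_conserved(nodes) -> bool:
    # The invariant that baseline token counting relies on.
    return sum(n.tokens for n in nodes) == TOTAL_TOKENS


nodes = [Node(TOTAL_TOKENS), Node(), Node(), Node()]

# Fault-free transfer: node 1 gains read permission, the invariant holds.
send_tokens(nodes[0], nodes[1], 1)
ok_after_delivery = tokens_conserved(nodes)

# Transient fault: the message carrying one token is never delivered.
send_tokens(nodes[0], nodes[2], 1, delivered=False)
ok_after_loss = tokens_conserved(nodes)
```

After the undelivered transfer, the token total drops below `TOTAL_TOKENS`, so no node can ever collect all tokens and write permission becomes unobtainable. This is the kind of failure that a fault-tolerant protocol has to detect and recover from.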
fault tolerance, cache coherence, transient faults, interconnection network.
Ricardo Fernández-Pascual, Manuel E. Acacio, José Duato, "Dealing with Transient Faults in the Interconnection Network of CMPs at the Cache Coherence Level", IEEE Transactions on Parallel & Distributed Systems, vol. 21, no. 8, pp. 1117-1131, August 2010, doi:10.1109/TPDS.2009.148