Issue No.08 - August (2010 vol.21)
pp: 1117-1131
Ricardo Fernández-Pascual, Universidad de Murcia, Murcia
Manuel E. Acacio, Universidad de Murcia, Murcia
José Duato, Universidad Politécnica de Valencia, Valencia
ABSTRACT
The importance of transient faults is predicted to grow as current technology trends increase the scale of integration. One of the components that will be significantly affected by transient faults is the interconnection network of chip multiprocessors (CMPs). To deal with these faults efficiently, and in contrast to other authors, we propose fault-tolerant cache coherence protocols that ensure the correct execution of programs even when not all messages are correctly delivered. We describe the extensions made to a directory-based cache coherence protocol to provide fault tolerance, and we provide a modified set of token counting rules that are useful for designing fault-tolerant token-based cache coherence protocols. We compare the directory-based fault-tolerant protocol with a token-based fault-tolerant one. We also show how to adjust the fault tolerance parameters to achieve the desired level of fault tolerance, and we measure the overhead incurred in supporting very high fault rates. Simulation results using a set of scientific, multimedia, and commercial applications show that the fault tolerance measures have virtually no impact on execution time with respect to a non-fault-tolerant protocol. Additionally, our protocols can support very high rates of transient faults at the cost of slightly increased network traffic.
INDEX TERMS
fault tolerance, cache coherence, transient faults, interconnection network.
CITATION
Ricardo Fernández-Pascual, Manuel E. Acacio, José Duato, "Dealing with Transient Faults in the Interconnection Network of CMPs at the Cache Coherence Level", IEEE Transactions on Parallel & Distributed Systems, vol. 21, no. 8, pp. 1117-1131, August 2010, doi:10.1109/TPDS.2009.148