Issue No.08 - August (2008 vol.19)
pp: 1044-1056
It is widely accepted that transient failures will appear more frequently in chips designed in the near future due to several factors, such as increased integration scale. At the same time, chip multiprocessors (CMPs), which integrate several processor cores in a single chip, are currently the best alternative for making efficient use of the increasing number of transistors that can be placed on a single die. Hence, new techniques are needed to deal with these faults in order to build sufficiently reliable CMPs. In this work, we present a coherence protocol aimed at dealing with transient failures that affect the interconnection network of a CMP; that is, we assume that the network is no longer reliable. In particular, our proposal extends a token-based cache coherence protocol so that no data can be lost and no deadlock can occur due to any dropped message. Using the GEMS full-system simulator, we compare our proposal against TokenCMP. We show that, in the absence of failures, our proposal does not introduce overhead in terms of increased execution time over TokenCMP. Additionally, our protocol can tolerate message loss rates much higher than those likely to be found in the real world without increasing execution time by more than 15%.
Reliability, Testing, and Fault-Tolerance, Shared memory, Multi-core/single-chip multiprocessors
Ricardo Fernández-Pascual, José M. García, Manuel E. Acacio, José Duato, "Extending the TokenCMP Cache Coherence Protocol for Low Overhead Fault Tolerance in CMP Architectures", IEEE Transactions on Parallel & Distributed Systems, vol.19, no. 8, pp. 1044-1056, August 2008, doi:10.1109/TPDS.2007.70803