Issue No.02 - March/April (2011 vol.8)
pp: 207-217
Jacques Henri Collet, CNRS, Université de Toulouse, Toulouse
Piotr Zajac, CNRS, Université de Toulouse, Toulouse and Technical University of Lodz, Lodz
Mihalis Psarakis, University of Piraeus, Piraeus
Dimitris Gizopoulos, University of Piraeus, Piraeus
ABSTRACT
We study chip self-organization and fault tolerance at the architectural level to improve dependable continuous operation of multicore arrays in massively defective nanotechnologies. Architectural self-organization results from the conjunction of self-diagnosis and self-disconnection mechanisms (to identify and isolate most permanently faulty or inaccessible cores and routers), plus self-discovery of routes to maintain the communication in the array. In the methodology presented in this work, chip self-diagnosis is performed in three steps, following an ascending order of complexity: interconnects are tested first, then routers through mutual test, and cores in the last step. The mutual testing of routers is especially important as faulty routers are disconnected by good ones with no assumption on the behavior of defective elements. Moreover, the disconnection of faulty routers is not physical (“hard”) but logical (“soft”) in that a good router simply stops communicating with any adjacent router diagnosed as defective. There is no physical reconfiguration in the chip and no need for spare elements. Ultimately, the multicore array may be viewed as a black box, which incorporates protection mechanisms and self-organizes, while the external control reduces to a simple chip validation test which, in the simplest cases, amounts to counting the number of valid and accessible cores.
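The soft-disconnection and validation ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a 2D mesh with one core per router, a single I/O port at a corner, and a mutual test in which a good router always detects a faulty neighbor. The function name `count_accessible_cores` and all of its parameters are hypothetical, and route discovery is modeled here as a plain breadth-first search.

```python
from collections import deque


def count_accessible_cores(rows, cols, faulty_routers, faulty_cores, io_port=(0, 0)):
    """Count good cores reachable from io_port through good routers only.

    Assumptions (not from the paper): 2D mesh, one core per router,
    and perfect detection of faulty neighbors by good routers.
    """
    def neighbors(node):
        r, c = node
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                yield (nr, nc)

    # Soft disconnection: each good router keeps a logical link only to
    # neighbors it diagnoses as good. Faulty routers get no entry at all,
    # so nothing is assumed about how they behave.
    accepted = {
        (r, c): {n for n in neighbors((r, c)) if n not in faulty_routers}
        for r in range(rows)
        for c in range(cols)
        if (r, c) not in faulty_routers
    }

    if io_port not in accepted:
        return 0  # the I/O router itself is faulty, so nothing is accessible

    # Route discovery, modeled as a breadth-first search from the I/O port.
    seen = {io_port}
    queue = deque([io_port])
    while queue:
        node = queue.popleft()
        for n in accepted[node]:
            if n not in seen:
                seen.add(n)
                queue.append(n)

    # Chip validation: a core counts only if its router is reachable
    # and the core itself passed its self-test.
    return sum(1 for node in seen if node not in faulty_cores)
```

For example, `count_accessible_cores(3, 3, {(1, 1)}, {(2, 2)})` counts the good cores on a 3x3 mesh whose central router is faulty and whose corner core failed its test. A good core behind a faulty router is never counted, which matches the abstract's notion of "valid and accessible" cores.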
INDEX TERMS
Multicore architectures, multiprocessors, fault diagnosis, fault tolerance, massively defective nanotechnologies.
CITATION
Jacques Henri Collet, Piotr Zajac, Mihalis Psarakis, Dimitris Gizopoulos, "Chip Self-Organization and Fault Tolerance in Massively Defective Multicore Arrays", IEEE Transactions on Dependable and Secure Computing, vol.8, no. 2, pp. 207-217, March/April 2011, doi:10.1109/TDSC.2009.53