Issue No.05 - May (2010 vol.59)
pp: 651-665
Omer Khan, University of Massachusetts Amherst, Amherst
As the semiconductor industry continues its relentless push toward nano-CMOS technologies, device reliability and the occurrence of hard errors have emerged as dominant concerns in multicores. Although regular memory structures are protected against hard errors using error-correcting codes or spare rows and columns, many of the structures within the cores are left unprotected. Even if the location of hard errors is known a priori, disabling faulty cores results in a substantial performance loss. Several proposed techniques use microarchitectural redundancy to allow defective cores to continue operation. These techniques are attractive but limited, either by the cost of added redundancy that offers no benefit to an error-free core, or by coverage that is restricted to the natural redundancy the microarchitecture offers. We propose to exploit the intercore redundancy in chip multiprocessors for hard-error tolerance. Our scheme combines hardware reconfiguration, which lets cores operate with reduced functionality, with a runtime layer of software (the microvisor) that manages the mapping of threads to cores. The microvisor observes the changing phase behavior of threads and initiates thread relocation to match the computational demands of threads to the capabilities of cores. Our results show that in the presence of degraded cores, the microvisor mitigates performance losses by an average of two percent.
Chip multiprocessor (CMP), hard-error tolerance, hardware/software codesign, hypervisor, virtualization.
Omer Khan, "Thread Relocation: A Runtime Architecture for Tolerating Hard Errors in Chip Multiprocessors," IEEE Transactions on Computers, vol. 59, no. 5, pp. 651-665, May 2010, doi:10.1109/TC.2009.76
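The thread-to-core matching described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the unit names, the per-phase instruction-mix fractions, the `mismatch` cost, and the exhaustive search are all assumptions made for illustration. The idea shown is only that a runtime layer might compare each thread's current phase demands against the functional units each degraded core still has, and pick the mapping with the least unmet demand.

```python
from itertools import permutations


def mismatch(demand, capabilities):
    """Fraction of a thread's instruction mix that needs units the core lacks.

    `demand` maps a (hypothetical) unit name to the fraction of instructions
    in the current phase that use it; `capabilities` is the set of units
    still functional on the core.
    """
    return sum(frac for unit, frac in demand.items() if unit not in capabilities)


def relocate(thread_demands, core_capabilities):
    """Return (mapping, cost) minimizing total mismatch by exhaustive search.

    Exhaustive search is only practical for a handful of cores; it stands in
    here for whatever policy a real runtime layer would use.
    """
    threads = list(thread_demands)
    cores = list(core_capabilities)
    best_cost, best_map = float("inf"), None
    for perm in permutations(cores, len(threads)):
        cost = sum(
            mismatch(thread_demands[t], core_capabilities[c])
            for t, c in zip(threads, perm)
        )
        if cost < best_cost:
            best_cost, best_map = cost, dict(zip(threads, perm))
    return best_map, best_cost


# Hypothetical example: core1 has lost its floating-point unit.
cores = {"core0": {"int", "fp", "mul"}, "core1": {"int", "mul"}}
# Hypothetical phase profiles: t1 is currently floating-point heavy.
threads = {"t0": {"int": 0.9, "fp": 0.1}, "t1": {"int": 0.4, "fp": 0.6}}

mapping, cost = relocate(threads, cores)
print(mapping, cost)
```

In this example the floating-point-heavy thread ends up on the fully functional core, and the mostly-integer thread runs on the degraded one. Rerunning `relocate` whenever a thread's phase profile changes would correspond, loosely, to the relocation step the abstract describes.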