Issue No.01 - January (2011 vol.60)
pp: 5-19
Shantanu Gupta, University of Michigan, Ann Arbor
Shuguang Feng, University of Michigan, Ann Arbor
Amin Ansari, University of Michigan, Ann Arbor
Scott Mahlke, University of Michigan, Ann Arbor
CMOS scaling has long been a source of dramatic performance gains. However, shrinking semiconductor feature sizes have led to higher operating temperatures and current densities. Because most wearout mechanisms depend strongly on these parameters, significantly higher failure rates are projected for future technology generations. Consequently, fault tolerance, traditionally a concern of the high-end server market, is now receiving attention in mainstream computing systems. The popular solution has been redundancy at a coarse granularity, such as dual or triple modular redundancy. In this work, we challenge the practice of coarse-granularity redundancy by identifying its inability to scale to high failure rate scenarios and by investigating the advantages of finer-grained configurations. To this end, this paper presents and evaluates a highly reconfigurable CMP architecture, named StageNet (SN), that is designed with reliability as a first-class design criterion. SN relies on a reconfigurable network of replicated processor pipeline stages to maximize the useful lifetime of a chip, gracefully degrading performance toward the end of life. Our results show that the proposed SN architecture can perform 40 percent more cumulative work than a traditional CMP over 12 years of its lifetime.
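The contrast between coarse- and fine-grained redundancy can be illustrated with a minimal sketch. It is not the paper's evaluation model: the stage names, the fault lists, and both function names are hypothetical, and the sketch assumes that any surviving stage can be connected to any other surviving stage at no cost, ignoring interconnect failures and performance overhead.

```python
from collections import Counter

# Hypothetical stage names; the abstract does not list the actual stages.
STAGES = ("fetch", "decode", "issue", "execute")


def working_cores_coarse(num_cores, failed):
    """Core-level isolation: a core is lost once any one of its stages fails.

    `failed` is an iterable of (core_index, stage_name) pairs.
    """
    dead_cores = {core for core, _stage in failed}
    return num_cores - len(dead_cores)


def working_pipelines_fine(num_cores, failed):
    """Stage-level reconfiguration: surviving stages from different cores
    are combined, so the number of usable pipelines is limited by the
    stage type with the fewest survivors.
    """
    lost = Counter(stage for _core, stage in set(failed))
    return min(num_cores - lost[stage] for stage in STAGES)


if __name__ == "__main__":
    # Three faults, each in a different core and a different stage type.
    faults = [(0, "fetch"), (1, "decode"), (2, "execute")]
    print(working_cores_coarse(4, faults))    # cores left under core-level isolation
    print(working_pipelines_fine(4, faults))  # pipelines left under stage-level sharing
```

In this toy scenario, three faults spread over three cores leave one working core under core-level isolation, while stage-level sharing still forms three pipelines. When all faults hit the same stage type, the two schemes lose the same number of pipelines.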
Reliability, fault tolerance, multicore, CMP, wearout.
Shantanu Gupta, Shuguang Feng, Amin Ansari, Scott Mahlke, "StageNet: A Reconfigurable Fabric for Constructing Dependable CMPs", IEEE Transactions on Computers, vol.60, no. 1, pp. 5-19, January 2011, doi:10.1109/TC.2010.205