Issue No. 06 - November/December 2005 (vol. 25)
John Lach , University of Virginia
Zhijian Lu , University of Virginia
Kevin Skadron , University of Virginia
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/MM.2005.114
Most existing integrated circuit (IC) reliability models assume a uniform, typically worst-case, operating temperature, yet temporal and spatial temperature variations affect expected device lifetime. As a result, design decisions and dynamic thermal management (DTM) techniques based on worst-case models are pessimistic, producing excessive design margins and unnecessary runtime engagement of cooling mechanisms (with their associated performance penalties). By leveraging a reliability model that accounts for temperature gradients (dramatically improving interconnect lifetime prediction accuracy) and by modeling expected lifetime as a resource consumed over time at a temperature- and voltage-dependent rate, substantial design margin can be reclaimed and runtime penalties avoided while still meeting expected lifetime requirements. In this paper, we evaluate the potential benefits and implementations of this technique by tracking the expected lifetime of a system under different workloads while accounting for the impact of dynamic voltage and temperature variations. Simulation results show that our dynamic reliability management (DRM) techniques reduce the performance penalty incurred by pessimistic DTM by 40% in general-purpose computing and increase quality of service (QoS) by 10% for servers, all while preserving the expected IC lifetime reliability.
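The core idea of treating lifetime as a bankable resource can be sketched in a few lines of code. The sketch below is an illustration, not the authors' implementation: it assumes an Arrhenius-style temperature dependence of electromigration lifetime (the temperature term of Black's equation), and the activation energy (0.9 eV) and worst-case reference temperature (358 K) are placeholder values chosen for the example.

```python
import math

K_BOLTZMANN = 8.617e-5  # Boltzmann constant in eV/K

def consumption_rate(temp_k, ref_temp_k=358.0, ea_ev=0.9):
    """Relative lifetime-consumption rate vs. the worst-case design temperature.

    Uses the temperature term of an Arrhenius/Black-style model,
    MTTF ~ exp(Ea / (k*T)): a rate below 1 means lifetime is being
    'banked' relative to the worst-case budget; above 1, it is being
    drawn down faster than budgeted.
    """
    return math.exp((ea_ev / K_BOLTZMANN) * (1.0 / ref_temp_k - 1.0 / temp_k))

class ReliabilityBank:
    """Tracks consumed lifetime; DTM engages only when the bank is overdrawn."""

    def __init__(self, ref_temp_k=358.0, ea_ev=0.9):
        self.ref_temp_k = ref_temp_k
        self.ea_ev = ea_ev
        self.consumed = 0.0  # lifetime consumed, as a fraction of the budget
        self.elapsed = 0.0   # wall-clock time elapsed, same units

    def step(self, temp_k, dt_frac):
        """Advance by dt_frac of the lifetime budget at temperature temp_k."""
        self.consumed += dt_frac * consumption_rate(
            temp_k, self.ref_temp_k, self.ea_ev)
        self.elapsed += dt_frac

    def must_throttle(self):
        # Engage cooling only when consumption has outpaced the elapsed budget,
        # i.e., when no banked lifetime remains to spend.
        return self.consumed > self.elapsed
```

For example, a workload that runs 20 K below the reference temperature for a while accumulates enough banked lifetime that a subsequent hot phase above the reference temperature need not trigger throttling, whereas a worst-case DTM policy would throttle immediately on crossing the thermal threshold.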
Keywords: Performability, Modeling, Dynamic thermal/reliability management, Analytical and simulation techniques, Electromigration
John Lach, Zhijian Lu, Kevin Skadron, "Improved Thermal Management with Reliability Banking", IEEE Micro, vol.25, no. 6, pp. 40-49, November/December 2005, doi:10.1109/MM.2005.114