Issue No. 06 - November/December (2005 vol. 25)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/MM.2005.114
Zhijian Lu, University of Virginia
John Lach, University of Virginia
Mircea R. Stan, University of Virginia
Kevin Skadron, University of Virginia
Most existing integrated circuit (IC) reliability models assume a uniform, typically worst-case, operating temperature, but temporal and spatial temperature variations affect expected device lifetime. As a result, design decisions and dynamic thermal management (DTM) techniques based on worst-case models are pessimistic: they lead to excessive design margins and to unnecessary runtime engagement of cooling mechanisms, with the associated performance penalties. By leveraging a reliability model that accounts for temperature gradients (dramatically improving interconnect lifetime prediction accuracy) and modeling expected lifetime as a resource that is consumed over time at a temperature- and voltage-dependent rate, substantial design margin can be reclaimed and runtime penalties avoided while still meeting expected lifetime requirements. In this paper, we evaluate the potential benefits and implementations of this technique by tracking the expected lifetime of a system under different workloads while accounting for the impact of dynamic voltage and temperature variations. Simulation results show that our dynamic reliability management (DRM) techniques reduce the performance penalty by 40% relative to that incurred by pessimistic DTM in general-purpose computing and provide a 10% increase in quality of service (QoS) for servers, all while preserving the expected IC lifetime reliability.
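The idea of lifetime as a resource consumed at a temperature-dependent rate can be illustrated with a minimal sketch. It is not the model from the article: it assumes a simple Arrhenius-style acceleration factor of the kind used in Black's equation for electromigration, ignores voltage dependence and spatial temperature gradients, and uses an illustrative activation energy of 0.9 eV. The function names `consumption_rate` and `banked_lifetime` are hypothetical.

```python
import math

BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def consumption_rate(temp_k, ref_temp_k, activation_energy_ev=0.9):
    """Lifetime consumption rate at temp_k relative to the reference temperature.

    Assumption: Arrhenius-style temperature acceleration only; the 0.9 eV
    activation energy is an illustrative value, not taken from the article.
    Returns 1.0 at the reference temperature, above 1.0 when hotter,
    below 1.0 when cooler.
    """
    exponent = (activation_energy_ev / BOLTZMANN_EV) * (1.0 / ref_temp_k - 1.0 / temp_k)
    return math.exp(exponent)


def banked_lifetime(trace, ref_temp_k, activation_energy_ev=0.9):
    """Return the banked lifetime balance, in seconds at the reference rate.

    trace is an iterable of (duration_s, temp_k) intervals. The budget assumes
    the chip ages at the reference (worst-case) rate the whole time; the
    balance is the budget minus what was actually consumed. A positive balance
    means lifetime has been saved relative to the worst-case assumption.
    """
    budget = 0.0
    consumed = 0.0
    for duration_s, temp_k in trace:
        budget += duration_s
        consumed += duration_s * consumption_rate(temp_k, ref_temp_k, activation_energy_ev)
    return budget - consumed


if __name__ == "__main__":
    # Hypothetical workload: one hour cool, ten minutes above the reference.
    trace = [(3600.0, 340.0), (600.0, 365.0)]
    print(banked_lifetime(trace, ref_temp_k=358.15))
```

Under these assumptions, a management policy could tolerate short excursions above the reference temperature for as long as the balance stays positive, and engage cooling only when the bank is close to empty.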
Performability, Modeling, Dynamic thermal/reliability management, Analytical and simulation techniques, Electromigration
K. Skadron, M. R. Stan, Z. Lu and J. Lach, "Improved Thermal Management with Reliability Banking," in IEEE Micro, vol. 25, no. 6, pp. 40-49, 2005.