2016 International Conference on Parallel Architecture and Compilation Techniques (PACT) (2016)
Sept. 11, 2016 to Sept. 15, 2016
Jingweijia Tan, ECE Department, University of Houston, TX 77004, United States
Shuaiwen Leon Song, HPC Group, Pacific Northwest National Lab, Richland, WA 99354, United States
Kaige Yan, ECE Department, University of Houston, TX 77004, United States
Xin Fu, ECE Department, University of Houston, TX 77004, United States
Andres Marquez, HPC Group, Pacific Northwest National Lab, Richland, WA 99354, United States
Darren Kerbyson, HPC Group, Pacific Northwest National Lab, Richland, WA 99354, United States
Supply voltage reduction is an effective approach to significantly reduce GPU energy consumption. As the largest on-chip storage structure, the GPU register file becomes the reliability hotspot that prevents further supply voltage reduction below the safe limit (V_min) due to process variation effects. This work addresses the reliability challenge of the GPU register file at low supply voltages, which is an essential first step for aggressive supply voltage reduction of the entire GPU chip. To better understand the reliability issues posed by undervolting and its energy-saving potential, we first rigorously model and analyze the process variation impact on the GPU register file at different voltages. By further analyzing the GPU architecture, we make a key observation: GPU registers spend a long time holding useless data (i.e., dead time), which provides a unique opportunity to enhance register reliability. We then propose GR-Guard, an architectural solution that leverages long register dead time to enable reliable operation from an unreliable register file at low voltages. GR-Guard is both effective and low-cost, and does not affect normal (i.e., non-faulty) register accesses. Experimental results show that for a 28nm baseline GPU under aggressive voltage reduction, GR-Guard can maintain register file reliability with less than 2% overall performance degradation, while achieving an average of 31% energy reduction across various applications.
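To make the notion of register dead time concrete, the following is a minimal sketch of how it could be estimated from a register access trace. It is an illustration only, not the paper's methodology: the trace format, the function name `dead_time_fraction`, and the working definition (a register is dead before its first write, between the last access to a value and the write that overwrites it, and after its final access) are all assumptions made here.

```python
from collections import defaultdict

def dead_time_fraction(trace, total_cycles):
    """Estimate the fraction of cycles each register holds useless data.

    trace: list of (cycle, register, op) tuples sorted by cycle, where op
    is "W" (write) or "R" (read). This format is hypothetical.
    """
    last_access = {}          # register -> cycle of its most recent access
    dead = defaultdict(int)   # register -> accumulated dead cycles

    for cycle, reg, op in trace:
        if op == "W":
            # Assumption: the old value was dead from its last access
            # (or from cycle 0 if never accessed) until this overwrite.
            dead[reg] += cycle - last_access.get(reg, 0)
        last_access[reg] = cycle

    # Assumption: after its final access, a value is never used again.
    for reg, cycle in last_access.items():
        dead[reg] += total_cycles - cycle

    return {reg: dead[reg] / total_cycles for reg in last_access}


# Hypothetical trace: r1 is live only during cycles 10-20 and 90-95.
trace = [(10, "r1", "W"), (20, "r1", "R"), (90, "r1", "W"), (95, "r1", "R")]
fractions = dead_time_fraction(trace, total_cycles=100)
```

In this toy trace the register is useful for only 15 of 100 cycles, which is the kind of long idle window the abstract describes as an opportunity for enhancing reliability.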
Registers, Graphics processing units, Reliability, Computer architecture, Circuit faults, Analytical models, Transistors
J. Tan, S. L. Song, K. Yan, X. Fu, A. Marquez and D. Kerbyson, "Combating the reliability challenge of GPU register file at low supply voltage," 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT), Haifa, Israel, 2016, pp. 3-15.