Issue No. 03 - May-June (2013 vol. 33)
ISSN: 0272-1732
pp: 58-66
Siva Kumar Sastry Hari, University of Illinois at Urbana-Champaign
Sarita V. Adve, University of Illinois at Urbana-Champaign
Helia Naeimi, Intel
ABSTRACT
Future microprocessors need low-cost solutions for reliable operation in the presence of failure-prone devices. A promising approach is to detect hardware faults by deploying low-cost software-level symptom monitors. However, there remains a nonnegligible risk that several faults might escape these detectors to produce silent data corruptions (SDCs). Evaluating and bounding SDCs is, therefore, crucial for low-cost resiliency solutions. The authors present Relyzer, an approach that can systematically analyze all application fault sites and identify virtually all SDC-causing program locations. Instead of performing fault injections on all possible application-level fault sites, which is impractical, Relyzer carefully picks a small subset. It employs novel fault-pruning techniques that reduce the number of fault sites by either predicting their outcomes or showing them equivalent to others. Results show that 99.78 percent of faults are pruned across 12 studied workloads, reducing the complete application resiliency evaluation time by 2 to 6 orders of magnitude. Relyzer, for the first time, achieves the capability to list virtually all SDC-vulnerable program locations, which is critical in designing low-cost application-centric resiliency solutions. Relyzer also opens new avenues of research in designing error-resilient programming models as well as even faster (and simpler) evaluation methodologies.
INDEX TERMS
Computer architecture, Microprocessors, Fault diagnosis, Hardware, Costs, Computer programs, computer architecture, low-cost hardware resiliency, silent data corruption, transient faults
CITATION
Siva Kumar Sastry Hari, Sarita V. Adve, Helia Naeimi, Pradeep Ramachandran, "Relyzer: Application Resiliency Analyzer for Transient Faults", IEEE Micro, vol. 33, no. 3, pp. 58-66, May-June 2013, doi:10.1109/MM.2013.30