Issue No.03 - May-June (2013 vol.33)
pp: 58-66
Siva Kumar Sastry Hari, University of Illinois at Urbana-Champaign
Sarita V. Adve, University of Illinois at Urbana-Champaign
Helia Naeimi, Intel
ABSTRACT
Future microprocessors need low-cost solutions for reliable operation in the presence of failure-prone devices. A promising approach is to detect hardware faults by deploying low-cost software-level symptom monitors. However, there remains a nonnegligible risk that several faults might escape these detectors to produce silent data corruptions (SDCs). Evaluating and bounding SDCs is, therefore, crucial for low-cost resiliency solutions. The authors present Relyzer, an approach that can systematically analyze all application fault sites and identify virtually all SDC-causing program locations. Instead of performing fault injections on all possible application-level fault sites, which is impractical, Relyzer carefully picks a small subset. It employs novel fault-pruning techniques that reduce the number of fault sites by either predicting their outcomes or showing them equivalent to others. Results show that 99.78 percent of faults are pruned across 12 studied workloads, reducing the complete application resiliency evaluation time by 2 to 6 orders of magnitude. Relyzer, for the first time, achieves the capability to list virtually all SDC-vulnerable program locations, which is critical in designing low-cost application-centric resiliency solutions. Relyzer also opens new avenues of research in designing error-resilient programming models as well as even faster (and simpler) evaluation methodologies.
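The two pruning ideas named in the abstract (predicting a fault's outcome without injecting it, and treating groups of faults as equivalent so that only one representative is injected) can be illustrated with a minimal sketch. This is not Relyzer's implementation: the abstract does not describe its heuristics, so the `FaultSite` fields, the equivalence key `(pc, path)`, and the `prune` and `pruned_fraction` helpers below are all hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class FaultSite(NamedTuple):
    """One candidate fault-injection point (all fields are illustrative assumptions)."""
    pc: int                          # static instruction address
    instance: int                    # which dynamic execution of that instruction
    path: str                        # assumed signature of the control flow that follows
    predicted: Optional[str] = None  # outcome known without injection, if any


def prune(sites):
    """Split sites into those resolved by prediction and one pilot per equivalence class.

    Returns (predicted, pilots), where pilots maps each representative site
    to the number of sites it stands in for.
    """
    predicted = [s for s in sites if s.predicted is not None]
    classes = defaultdict(list)
    for s in sites:
        if s.predicted is None:
            # Assumption: sites sharing an instruction and a following path behave alike.
            classes[(s.pc, s.path)].append(s)
    pilots = {members[0]: len(members) for members in classes.values()}
    return predicted, pilots


def pruned_fraction(sites):
    """Fraction of sites that never need an actual fault injection."""
    _, pilots = prune(sites)
    return 1 - len(pilots) / len(sites)


# Toy workload: ten fault sites, two of which have predictable outcomes.
sites = (
    [FaultSite(0x10, i, "A") for i in range(6)]
    + [FaultSite(0x10, i, "B") for i in range(6, 8)]
    + [FaultSite(0x20, 0, "A", predicted="detected"),
       FaultSite(0x20, 1, "A", predicted="masked")]
)
predicted, pilots = prune(sites)
print(len(predicted), "predicted;", len(pilots), "pilots to inject")
print("pruned fraction:", pruned_fraction(sites))
```

In this toy case only two of the ten sites require an injection, and each pilot's result is weighted by the size of its class. For scale, the 99.78 percent pruning rate reported in the abstract would leave about 0.22 percent of fault sites, or roughly 1 in 450, to be injected.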
INDEX TERMS
Computer architecture, Microprocessors, Fault diagnosis, Hardware, Costs, Computer programs, low-cost hardware resiliency, silent data corruption, transient faults
CITATION
Siva Kumar Sastry Hari, Sarita V. Adve, Helia Naeimi, Pradeep Ramachandran, "Relyzer: Application Resiliency Analyzer for Transient Faults", IEEE Micro, vol.33, no. 3, pp. 58-66, May-June 2013, doi:10.1109/MM.2013.30