37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07)
Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance
Edinburgh, UK
June 25-June 28
ISBN: 0-7695-2855-4
Alex Shye, U. of Colorado at Boulder, USA
Tipp Moseley, U. of Colorado at Boulder, USA
Vijay Janapa Reddi, Harvard University, USA
Joseph Blomstedt, U. of Colorado at Boulder, USA
Daniel A. Connors, U. of Colorado at Boulder, USA
Transient faults are emerging as a critical concern in the reliability of general-purpose microprocessors. As architectural trends point towards multi-threaded multi-core designs, there is substantial interest in adapting such parallel hardware resources for transient fault tolerance. This paper proposes a software-based multi-core alternative for transient fault tolerance using process-level redundancy (PLR). PLR creates a set of redundant processes per application process and systematically compares the processes to guarantee correct execution. Redundancy at the process level allows the operating system to freely schedule the processes across all available hardware resources. PLR's software-centric approach to transient fault tolerance shifts the focus from ensuring correct hardware execution to ensuring correct software execution. As a result, PLR ignores many benign faults that do not propagate to affect program correctness. A real PLR prototype for running single-threaded applications is presented and evaluated for fault coverage and performance. On a 4-way SMP machine, PLR provides improved performance over existing software transient fault tolerance techniques with 16.9% overhead for fault detection on a set of optimized SPEC2000 binaries.
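The core idea of the abstract, running redundant copies of a process and comparing them, can be illustrated with a minimal sketch. This is not the authors' PLR prototype: the function names `run_redundant` and `vote` are hypothetical, and the sketch assumes that comparing only each replica's exit code and final standard output is an adequate stand-in for PLR's comparison mechanism, which the abstract does not detail. The operating system remains free to schedule the replicas on any available core.

```python
import subprocess
import sys
from collections import Counter


def run_redundant(cmd, replicas=3):
    """Launch `replicas` copies of `cmd` concurrently and collect
    (exit_code, stdout_bytes) for each one. Hypothetical helper."""
    # Start every replica before waiting, so the OS can run them in parallel.
    procs = [subprocess.Popen(cmd, stdout=subprocess.PIPE) for _ in range(replicas)]
    results = []
    for proc in procs:
        out, _ = proc.communicate()
        results.append((proc.returncode, out))
    return results


def vote(results):
    """Compare replica results. Returns (majority_result, fault_detected).
    Raises RuntimeError when no strict majority exists."""
    counts = Counter(results)
    winner, votes = counts.most_common(1)[0]
    if votes <= len(results) // 2:
        raise RuntimeError("no majority among replicas")
    # Any disagreement at all means some replica diverged.
    return winner, len(counts) > 1


if __name__ == "__main__":
    command = [sys.executable, "-c", "print(sum(range(10)))"]
    (exit_code, output), fault = vote(run_redundant(command))
    print(exit_code, output.decode().strip(), "fault detected:", fault)
```

With two replicas a mismatch can only be detected; three or more replicas also allow the majority result to be kept. A fault that never changes a replica's output is never flagged by this check, which loosely mirrors the abstract's point that benign faults are ignored.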
Citation:
Alex Shye, Tipp Moseley, Vijay Janapa Reddi, Joseph Blomstedt, Daniel A. Connors, "Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance," dsn, pp.297-306, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07), 2007