Soft Failures in Large Datacenters
RapidPost
ISSN: 1556-6056
Sriram Sankar, Microsoft Corporation, Redmond
Sudhanva Gurumurthi, University of Virginia, Charlottesville
A major problem in managing large-scale datacenters is diagnosing and fixing machine failures. Most large datacenter deployments have a management infrastructure that diagnoses failure causes and manages the assets fixed as part of the repair process. Previous studies count only actual hardware replacements when calculating annualized failure rate (AFR) and component reliability. In this paper, we show that service availability is significantly impacted by soft failures, and that this class of failures is becoming an important concern in large datacenters that operate with minimal human intervention. Soft failures in the datacenter do not involve actual hardware replacements, but they still result in service downtime and are equally important because they disrupt normal service operation. We show failure trends observed in a real, large datacenter deployment of commodity servers. Finally, we motivate the need for datacenter and server architects to modify conventional designs to reduce soft failures and increase service availability.
Index Terms:
C.4 Performance of Systems, C.5.5 Servers
Citation:
Sriram Sankar, Sudhanva Gurumurthi, "Soft Failures in Large Datacenters," IEEE Computer Architecture Letters, 26 Aug. 2013. IEEE Computer Society Digital Library. IEEE Computer Society, <http://doi.ieeecomputersociety.org/10.1109/L-CA.2013.25>