ISSN: 1556-6056
Sriram Sankar, Microsoft Corporation, Redmond
Sudhanva Gurumurthi, University of Virginia, Charlottesville
A major problem in managing large-scale datacenters is diagnosing and fixing machine failures. Most large datacenter deployments have a management infrastructure that diagnoses the causes of failures and manages the assets that were fixed as part of the repair process. Previous studies count only actual hardware replacements when calculating annualized failure rate (AFR) and component reliability. In this paper, we show that service availability is significantly impacted by soft failures, and that this class of failures is becoming increasingly important in large datacenters that operate with minimal human intervention. Soft failures do not involve actual hardware replacement, but they still result in service downtime and are equally important because they disrupt normal service operation. We present failure trends observed in a real, large datacenter deployment of commodity servers. Finally, we motivate the need for datacenter and server architects to modify conventional designs to reduce soft failures and increase service availability.
C.4 Performance of Systems, C.5.5 Servers
Sriram Sankar, Sudhanva Gurumurthi, "Soft Failures in Large Datacenters", IEEE Computer Architecture Letters, doi:10.1109/L-CA.2013.25
