Issue No. 02 - July-Dec. (2014 vol. 13)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/L-CA.2013.25
Sriram Sankar, Microsoft Corporation
Sudhanva Gurumurthi, AMD Research
A major problem in managing large-scale datacenters is diagnosing and fixing machine failures. Most large datacenter deployments have a management infrastructure that helps diagnose failure causes and manage assets that were fixed as part of the repair process. Previous studies count only actual hardware replacements when calculating Annualized Failure Rate (AFR) and component reliability. In this paper, we show that service availability is significantly affected by soft failures, and that this class of failures is becoming an important issue in large datacenters that operate with minimal human intervention. Soft failures in the datacenter do not require actual hardware replacements, but they still result in service downtime and are equally important because they disrupt normal service operation. We show failure trends observed in a large datacenter deployment of commodity servers and motivate the need to modify conventional datacenter designs to help reduce soft failures and increase service availability.
Data centers, Client-server systems, Hard disks, Transient analysis, Maintenance engineering, Large-scale systems, Market research, C.4 Performance of Systems, C.5.5 Servers, Management, Datacenter, Reliability, Characterization
Sriram Sankar, Sudhanva Gurumurthi, "Soft Failures in Large Datacenters", IEEE Computer Architecture Letters, vol. 13, no. 2, pp. 105-108, July-Dec. 2014, doi:10.1109/L-CA.2013.25