Data Center Total Cost of Ownership: Buyer Beware! Memory Errors Add Up
SEP 11, 2016 20:53 PM
Cloud computing has given rise to the commodity ‘white box’ server: no frills, low cost, and built from commodity hardware. Higher volumes and lower margins have pushed many server vendors to ship systems of a quality they would not have shipped in the past. Validation budgets have been slashed, and as a result data centers are seeing increases in memory failures. A case in point is a recent study of the Facebook server fleet. The study revealed that if a Facebook DIMM receives 100 soft errors (single-bit, correctable) in a week's time, it is replaced. A quick calculation from the error rates and the number of servers shows that DIMMs are being replaced every hour at Facebook! Clearly memory errors are not scaling well. If memory fails, it can cause corruption, resulting in data loss or a server crash.
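To make the replacement rule concrete, here is a minimal sketch of how a "100 correctable errors in a week" policy could be tracked per DIMM. The class name, the rolling (rather than calendar) week, and the timestamp format are assumptions made for illustration; the study does not describe how Facebook actually implements the rule.

```python
from collections import deque

WEEK_SECONDS = 7 * 24 * 3600  # length of the observation window


class DimmErrorTracker:
    """Hypothetical per-DIMM tracker for correctable (soft) errors.

    Assumes a rolling one-week window and a threshold of 100 errors,
    taken from the figure quoted in the article.
    """

    def __init__(self, threshold=100, window=WEEK_SECONDS):
        self.threshold = threshold
        self.window = window
        self.events = deque()  # timestamps (seconds) of recent errors

    def record(self, timestamp):
        """Record one correctable error.

        Returns True when the DIMM has reached the threshold within
        the window and should be flagged for replacement.
        """
        self.events.append(timestamp)
        # Drop errors that have aged out of the rolling window.
        while self.events and self.events[0] <= timestamp - self.window:
            self.events.popleft()
        return len(self.events) >= self.threshold
```

With these assumed defaults, a DIMM logging one error per hour is flagged on its 100th error, while a DIMM logging one error per day never accumulates enough errors inside a single window to be flagged.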

Some IT managers are under the mistaken impression that if the system implements ECC they can buy cheap memory, since the error-correcting code will save the day and correct the failures. ECC is a well-established technique for Single Error Correction and Double Error Detection (SECDED), and most data center servers implement it. However, ECC does not mean that your memory will produce fewer errors; it only allows the system to keep operating in the face of some errors. Every time a single-bit error is detected and corrected there is a performance hit, since it takes time to recover. If the system detects an excessive number of ECC errors, it spends most of its time logging the errors and trying to recover. This becomes a performance issue, and the system eventually crashes if the number of errors grows too large.

The quality of the memory is important. If data centers install low-quality memory, they will be more expensive to operate due to frequent DIMM replacements and lower performance. However, the quality of the memory is not the only culprit when it comes to memory errors. The system itself, due to design flaws or mis-programming by the BIOS, can use the wrong timings when accessing the memory. This is a little-known cause that most data centers miss. They simply blame the memory and replace the DIMM, only to have the problem repeat itself several months later on the same system.

Another source of memory failures, publicized over the past year, is row hammer. Row hammer occurs when software repeatedly accesses a single location, such as a semaphore used for inter-processor or task-to-task communication. Repeated activation of this frequently accessed ‘aggressor’ row disturbs the charge stored in adjacent rows, causing bit flips. This has also become a large security concern, as several researchers have demonstrated the ability to gain unauthorized kernel access by exploiting the repeatability of the failing bits. Row hammer failures have been shown to occur in both DDR3 and DDR4 systems.
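The SECDED behavior described in this section can be illustrated with a toy example. The sketch below uses an extended Hamming(8,4) code that protects 4 data bits with 4 check bits. This is an assumption chosen for brevity: server memory controllers typically protect 64 data bits with 8 check bits, and their exact codes are vendor specific. The function names are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1..7 hold a
    Hamming(7,4) codeword with check bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4,5,6,7
    code[0] = sum(code[1:]) % 2            # makes total parity even
    return code


def decode(code):
    """Return (status, nibble) for a received 8-bit codeword.

    status is 'ok', 'corrected' (single-bit error fixed) or
    'uncorrectable' (double-bit error detected, data not trusted).
    """
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of positions of all set bits
    overall = sum(code) % 2          # 1 means an odd number of flips

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Single-bit error: the syndrome is the failing position
        # (0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

Flipping any one bit of an encoded word yields a 'corrected' result with the original data, while flipping any two bits yields 'uncorrectable'. The code keeps the system running through single-bit errors but does nothing to reduce how often they occur, which is the point made above about memory quality.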

Everyone is too busy shipping products into those growing data centers. No one is tracking what quality of hardware should be matched with which critical applications. This leaves data centers vulnerable to increased cost due to failures, and to unhappy customers due to performance issues. To date, no industry standards group or trade association has come up with a testing standard for server DDR memory. Because Intel dominated the server industry in years past, it ran its own internal testing, which produced an ‘Approved Vendors List’ for its various platforms. However, even that testing has proven ineffective, as many of Facebook’s servers use those approved DIMMs.

An industry testing standard that raises the bar on DDR memory quality needs to be created. JEDEC, the industry standards group that writes the world’s memory standards, has not firmly addressed quality and testing standards. This may change in the future, as several T&M vendors have started to be more assertive in the standards groups, proposing measurement procedures and related documents. However, testing adds cost, and adding testing requirements is often met with resistance. A case in point is the Open Compute Project Compliance and Interoperability effort. Robust testing for server memory has been proposed by this author but systematically rejected because of cost. In fact, the OCP Compliance Labs for servers have never really taken off. Facebook has gifted its server designs to OCP but has not required its own server suppliers to go through robust compliance testing facilitated by an OCP lab.

The IT industry needs to assert its buying power.  Don’t let the ‘white box’ vendors pull the wool over your eyes.  Demand higher quality memory subsystems and ask for validation testing documents.  Benchmark the vendors against each other and give your business to those vendors who deliver a cost/quality ratio that meets your reliability and serviceability goals.

Barbara P. Aichinger is co-founder and Vice President of FuturePlus Systems Corporation. FuturePlus was founded in 1991 and has its corporate offices in Bedford, NH. Barbara holds a BS and MS in Electrical Engineering. She is a frequent public speaker on DDR memory and other computer bus technologies. She is a member of the JEDEC JC42, JC45 and JC40 committees and often evangelizes quality and test issues. She is also part of the Open Compute Project Compliance and Interoperability Committee. FuturePlus Systems is a long-time trusted name in test equipment for the computer industry. Its most recent product, the DDR Detective®, can detect JEDEC specification violations and performance problems in addition to being used for general debug and design validation.

