Data Storage Reliability in the IoT Era
Guest Editor’s Introduction • Cecilia Metra • August 2017
Translations by Osvaldo Perez and Tiejun Huang
Microelectronic technology’s ongoing scaling according to Moore’s Law enables increasing microprocessor performance and complexity, paving the way for innovative applications that were unthinkable just a few years ago. Today, we are surrounded by electronic devices that exchange data with one another via the Internet, creating the Internet of Things (IoT). Many companies and analysts forecast huge growth in the IoT and the data it generates, with predictions ranging from 20 to 30 billion connected devices by 2020. Storing such a huge amount of data reliably will be a challenge.
Even with new memory technologies, structures, and architectures that provide higher speed and density, reliability remains a major concern. Parameter variations occur during the manufacturing process, and environmentally induced faults can affect reliable operation in the field. Transient faults (TFs) caused by radiation particles can affect memory elements (such as latches and flip-flops) or memory arrays, corrupting the stored data.
The articles and videos in this August 2017 Computing Now theme explore reliability challenges for memory elements and arrays, as well as some approaches for solving them.
Challenges and Solutions
Designers must adopt proper approaches — either to avoid data corruption or to restore the correct data — in memory elements and arrays.
To preserve the correctness of the data stored in memory elements, designers can adopt robust latch and flip-flop designs. Numerous such schemes have been proposed in the literature, each offering a different level of robustness at a different cost in terms of performance, power consumption, and area overhead.
To restore the correct data in memory arrays, error correcting codes (ECCs) are usually employed. ECCs range from simple Single Error Correcting/Double Error Detecting (SEC/DED) codes to ECCs that are capable of correcting more than a single error. The latter are important for scaled-down technologies and high-density memory arrays, in which TFs are likely to simultaneously affect more than a single memory cell, thus creating multiple bit upsets (MBUs). However, adopting these more powerful ECCs usually implies high area overhead and a non-negligible impact on performance, due mainly to the greater number of check bits to be stored and to the more complex encoding and decoding structures.
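As a concrete illustration of the SEC/DED behavior described above, the following minimal sketch implements an extended Hamming (8,4) code in Python. The word size, the function names encode and decode, and the status strings are assumptions chosen for brevity; real memory arrays typically protect much wider words in hardware.

```python
def encode(data):
    """Encode 4 data bits (list of 0/1) into an 8-bit extended Hamming codeword."""
    d1, d2, d3, d4 = data
    # Parity bits sit at positions 1, 2, and 4 (1-indexed) of the 7-bit word.
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p4 = d2 ^ d3 ^ d4
    word = [p1, p2, d1, p4, d2, d3, d4]
    # The extra overall parity bit is what adds double-error detection.
    overall = 0
    for bit in word:
        overall ^= bit
    return word + [overall]


def decode(code):
    """Return (data, status); data is None when a double error is detected."""
    word = list(code[:7])
    # The syndrome is the XOR of the positions of all set bits; it is 0 for
    # a valid word and equals the error position after a single bit flip.
    syndrome = 0
    for position, bit in enumerate(word, start=1):
        if bit:
            syndrome ^= position
    # Parity over all 8 bits: 1 means an odd number of bits were flipped.
    overall = 0
    for bit in code:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        if syndrome != 0:
            word[syndrome - 1] ^= 1  # correct the single flipped bit
        # syndrome == 0 here means only the overall parity bit was hit
        status = "corrected"
    else:
        # Nonzero syndrome with even overall parity: two bits were flipped.
        return None, "double-error"
    return [word[2], word[4], word[5], word[6]], status


codeword = encode([1, 0, 1, 1])

single = codeword.copy()
single[4] ^= 1                      # one upset
print(decode(single))               # ([1, 0, 1, 1], 'corrected')

double = codeword.copy()
double[1] ^= 1                      # two upsets in the same word
double[5] ^= 1
print(decode(double))               # (None, 'double-error')
```

A single upset is corrected, whereas two upsets in the same word are only detected. This is why MBUs call for either stronger codes or additional measures.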
Memory interleaving, which maps physically adjacent memory cells to different logical memory words, can be adopted together with SEC/DED codes to protect memory arrays against MBUs. With memory interleaving, errors affecting two or more physically adjacent cells manifest themselves as single errors in two or more different memory words, and thus SEC/DED codes can correct them. However, interleaving generally requires rather complex, expensive decoding circuitry, and it cannot guarantee error correction when two errors affect the same memory word.
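The effect of interleaving on a burst of adjacent upsets can be sketched in a few lines of Python. The column-to-word mapping below (word index = column modulo the interleaving degree) is one simple, assumed layout, and the function names are hypothetical; actual arrays may use different mappings.

```python
def physical_to_logical(column, degree):
    """Map a physical column to (logical word index, bit index within the word)."""
    return column % degree, column // degree


def errors_per_word(upset_columns, degree):
    """Count how many upset cells land in each logical word."""
    counts = {}
    for column in upset_columns:
        word, _ = physical_to_logical(column, degree)
        counts[word] = counts.get(word, 0) + 1
    return counts


burst = [5, 6, 7]  # three physically adjacent cells hit by one particle

# Degree 1 means no interleaving: all three errors fall in the same word.
print(errors_per_word(burst, degree=1))   # {0: 3}

# Degree 4: each error falls in a different word, so SEC/DED can correct each.
print(errors_per_word(burst, degree=4))   # {1: 1, 2: 1, 3: 1}

# A burst wider than the interleaving degree still puts two errors in one word.
print(errors_per_word(range(4, 9), degree=4))   # {0: 2, 1: 1, 2: 1, 3: 1}
```

The last case shows the limitation noted above: once a burst spans more cells than the interleaving degree, some word receives two errors and SEC/DED can no longer correct it.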
The five articles in this month’s theme provide a comprehensive reference for the theoretical and practical aspects of innovative approaches for reliable data storage.
“High Performance Robust Latches,” an article that I coauthored with Martin Omaña and Daniele Rossi, proposes a new high-performance robust latch called the HiPeR latch, which is insensitive to TFs affecting its internal and output nodes regardless of the radiation particle’s energy. A modified version, called the HiPeR-CG latch, is suitable for clock gating to reduce power consumption. We argue that both latches are faster than the latches previously presented in the literature, and that they provide better or comparable robustness at comparable or lower area and power costs, making them particularly suitable for microprocessor critical data paths.
In “A Novel Scheme for Tolerating Single Event/Multiple Bit Upsets (SEU/MBU) in Non-Volatile Memories,” Wei Wei and his colleagues address the problem of tolerating SEUs and MBUs in SRAMs. First, they review three previously published designs for non-volatile SRAM cells that provide non-volatile operation through a single resistive element with a good SEU tolerance. Then, they propose a novel scheme for tolerating MBUs by utilizing non-volatile storage. The scheme relies on added coding circuitry for detection and a “restore” operation that retrieves the correct datum from the non-volatile storage. The authors state that the proposed scheme significantly reduces delays and better detects and corrects large numbers of SEUs and MBUs compared to a six transistor (6T)-based scheme.
Large SRAM structures, such as the last-level cache (LLC), are aggressively sized for high density and are consequently vulnerable to process variations. Alexandra Ferreron and her colleagues propose an LLC that enables reliable operation at low voltages with conventional SRAM cells in “Concertina: Squeezing in Cache Content to Operate at Near-Threshold Voltage.” Because LLCs often contain large amounts of null data, the authors’ LLC, called Concertina, compresses cache blocks and allocates them to cache entries with faulty cells. To distribute blocks among cache entries, it implements a compression- and fault-aware insertion/replacement technique that reduces the LLC miss rate.
“Accurate Model for Application Failure Due to Transient Faults in Caches” proposes a solution for evaluating cache reliability, expressed by the failure in time (FIT) metric, in the presence of multi-bit faults. Authors Mehrtash Manoochehri and Michel Dubois introduce the PARMA+ model, which enables FIT rate estimates under all possible sequences of multi-bit faults with very high accuracy and low simulation times. They argue that PARMA+ can model the FIT rate of a cache equipped with major existing reliability features, such as bit-interleaving, early write-back, scrubbing, and various common error-protection schemes. Furthermore, it can model faults with any set of patterns and any cache configuration, including low-power techniques such as Dynamic Voltage and Frequency Scaling (DVFS).
Current microprocessors can face as much as 12.5 percent space overhead for ECCs. In “Smart ECC Allocation Cache Utilizing Cache Data Space,” Jeongkyu Hong and Soontae Kim reduce that overhead with the SEA (Smart ECC Allocation) cache, which locates ECCs in the cache data space and dynamically modulates the number of ECC check bits according to the program behavior. Experimental results confirm that the proposed scheme reduces LLC power consumption by seven percent and reduces space overheads in conventional ECC schemes without noticeable reliability and performance degradation.
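One common configuration that yields exactly 12.5 percent is a SEC/DED code protecting 64-bit words with 8 check bits. Whether this is the configuration the authors have in mind is an assumption. The short Python sketch below, with hypothetical function names, computes the check-bit count from the standard Hamming bound plus one overall parity bit.

```python
def secded_check_bits(data_bits):
    """Check bits needed by a Hamming SEC code plus one overall parity bit (DED)."""
    r = 1
    # Hamming bound for single-error correction: 2**r >= data_bits + r + 1.
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1  # the extra bit provides double-error detection


def storage_overhead(data_bits):
    """Check bits expressed as a fraction of the data bits they protect."""
    return secded_check_bits(data_bits) / data_bits


for width in (32, 64, 128):
    print(width, secded_check_bits(width), storage_overhead(width))
# 64-bit words need 8 check bits, an overhead of 0.125 (12.5 percent)
```

Wider words amortize the check bits better, but a single code word then has to tolerate upsets across more cells.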
The Industry Perspective
This month’s theme also includes two videos, in which two industry experts offer deep technical insights on memory reliability (in alphabetical order):
- Rob Aitken, from ARM
- Yervant Zorian, from Synopsys
The IoT will enable electronic objects to exchange huge amounts of data; storing it in a reliable way will be challenging. We hope that this Computing Now issue highlights the major challenges in reliable data storage and stimulates further research in the field.
Cecilia Metra is a candidate for President-Elect 2018 (President 2019) of the IEEE Computer Society. She is the incoming editor-in-chief of the IEEE Transactions on Emerging Topics in Computing, and she was the editor-in-chief of Computing Now (2012-2016). She is the 2017 vice president of Computer Society Member and Geographic Activities, and she was the vice president of Computer Society Technical and Conference Activities. She is a full professor at the University of Bologna, Italy, from which she has a PhD in electronic engineering and computer science. Metra has served on the editorial board and advisory board of many publications, including IEEE Transactions on Computers, IEEE’s The Institute, and IEEE Design & Test. She has contributed to numerous IEEE international conferences and has published extensively on design for test and reliability of integrated systems. She is an IEEE Fellow, an IEEE CS Golden Core Member, and a member of the IEEE honor society IEEE-HKN. Contact her at firstname.lastname@example.org.