2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA) (2015)
Burlingame, CA, USA
Feb. 7, 2015 to Feb. 11, 2015
Young Hoon Son, Seoul National University
Sukhan Lee, Seoul National University
O Seongil, Seoul National University
Sanghyuk Kwon, Seoul National University
Nam Sung Kim, University of Wisconsin-Madison
Jung Ho Ahn, Seoul National University
Although aggressive technology scaling has allowed manufacturers to integrate gigabits of cells into a cost-sensitive main memory DRAM device, these cells have become more defect-prone. With increased cell failure rates, conventional solutions such as populating spare DRAM rows and relying on error-correcting codes (ECCs) have shown limited success due to high area overhead, the latency penalties of data coding, and interference between ECC within a device (in-DRAM ECC) and ECC across devices (rank-level ECC). In this paper, we propose CiDRA, a cache-inspired DRAM resilience architecture, which substantially reduces the area and latency overheads of correcting bit errors at random locations caused by these faulty cells. We place a small SRAM cache within a DRAM device and redirect accesses to addresses that include faulty cells to the cache data array. This CiDRA cache is paired with a Bloom filter to minimize the energy overhead of accessing the cache tags on every DRAM access, and it is partitioned into small pieces, each associated with the I/O pads, for better area efficiency. The cache and the DRAM banks are accessed in parallel, and the banks are much slower. Consequently, the cache and filter are not on the critical path for normal DRAM accesses and incur no latency overhead. We also enhance the traditional in-DRAM ECC with error position bits and the appropriate error-detecting capability while preventing interference with the traditional rank-level ECC scheme. By combining this enhanced in-DRAM ECC with the cache and Bloom filter, CiDRA becomes more area efficient because the in-DRAM ECC corrects most bit errors, which are sporadic, while the cache handles the remaining, relatively few pathological cases.
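The access path described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware design: the class names `BloomFilter` and `FaultRemapDevice`, the filter size, the number of hash functions, and the use of SHA-256 as a hash are all hypothetical choices made for illustration. The model shows the one property the abstract relies on: a Bloom filter never reports a faulty address as absent, so the cache tags need to be consulted only when the filter reports a possible match.

```python
import hashlib


class BloomFilter:
    """Simple Bloom filter; sizes and hash choice are illustrative assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, addr):
        # Derive num_hashes bit positions from the address.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, addr):
        for pos in self._positions(addr):
            self.bits[pos] = True

    def might_contain(self, addr):
        # False means "definitely not present"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(addr))


class FaultRemapDevice:
    """Toy model: addresses with faulty cells are served from a small cache."""

    def __init__(self, faulty_addrs):
        self.dram = {}                    # stands in for the DRAM banks
        self.cache = {}                   # stands in for the SRAM data array
        self.faulty = set(faulty_addrs)   # stands in for the cache tag array
        self.filter = BloomFilter()
        for addr in self.faulty:
            self.filter.add(addr)
        self.tag_lookups = 0              # counts tag accesses the filter let through

    def _is_remapped(self, addr):
        if not self.filter.might_contain(addr):
            return False                  # filter miss: skip the tag lookup entirely
        self.tag_lookups += 1             # possible match: consult the tags
        return addr in self.faulty

    def write(self, addr, data):
        if self._is_remapped(addr):
            self.cache[addr] = data
        else:
            self.dram[addr] = data

    def read(self, addr):
        if self._is_remapped(addr):
            return self.cache.get(addr)
        return self.dram.get(addr)


dev = FaultRemapDevice(faulty_addrs=[0x10, 0x200])
dev.write(0x10, 7)   # faulty address: stored in the cache
dev.write(0x11, 9)   # healthy address: stored in DRAM
```

In hardware the cache and the banks would be accessed in parallel, with the slower bank access hiding the filter and tag check; the sequential `if` in this model captures only the selection of which data is returned, not the timing.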
Random access memory, Error correction codes, Resilience, Arrays, Circuit faults, Decoding, Bandwidth
Y. H. Son, S. Lee, O. Seongil, S. Kwon, N. S. Kim and J. H. Ahn, "CiDRA: A cache-inspired DRAM resilience architecture," 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), Burlingame, CA, USA, 2015, pp. 502-513.