Combining Low IO-Operations During Data Recovery with Low Parity Overhead in Two-Failure Tolerant Archival Storage Systems
2015 IEEE 21st Pacific Rim International Symposium on Dependable Computing (PRDC) (2015)
Nov. 18, 2015 to Nov. 20, 2015
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/PRDC.2015.16
Archival data storage systems contain data that must be preserved over long periods of time but that are unlikely to be accessed often during their lifetime. The best strategy for such systems is to keep their disks powered off unless they have to be powered up to access their contents, to reconstruct lost data, or to perform other disk maintenance tasks. Of all such tasks, reconstructing data after a disk failure is the one likely to have the highest energy footprint and the most impact on the overall power consumption of the array, because it typically involves powering up all the disks belonging to the same reliability stripe as the failed disk and keeping them running for a considerable time at each occurrence. We investigate two two-failure tolerant disk layouts whose parity overhead is lower than the number of disks read (and hence powered on) to recover data on lost drives would suggest. Our first organization is a flat XOR code that organizes the data disks into a rectangle with fewer rows than columns, and adds a simple parity disk to each row and column. Recovery from a disk failure prefers columns when reconstructing lost data, and thereby requires fewer reads than the parity overhead would normally suggest. Our second layout is based on the most basic pyramid code. We can view this layout as an example of a RAID Level 6 variant. In this variant, a stripe has a Q-parity calculated from the data disks in the stripe, but the data disks are also organized into smaller groups where each group has a separate P-parity calculated as the exclusive-or of the data disks in the group. We compare the two layouts by measuring their robustness to data loss, their one-year survival rate, and the expected number of disks that must be involved to recover from both single and multiple disk failures.
Our results show that rectangular layouts are significantly more reliable than layouts based on the most basic Pyramid codes, but that they also require more disk accesses to recover from disk failures.
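The trade-off described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names, the definition of parity overhead as parity disks divided by total disks, and the example dimensions are all assumptions made for illustration, and the read counts cover only the simplest case of a single failed data disk.

```python
from functools import reduce
from operator import xor


def rectangular_layout(rows, cols):
    """Counts for a rows x cols grid of data disks with one XOR parity
    disk per row and one per column (hypothetical helper)."""
    data = rows * cols
    parity = rows + cols
    return {
        "data_disks": data,
        "parity_disks": parity,
        # Assumed definition: fraction of all disks that hold parity.
        "parity_overhead": parity / (data + parity),
        # Rebuilding one data disk through its column reads the other
        # rows - 1 data disks in that column plus the column parity disk.
        "reads_via_column": rows,
        # Rebuilding through its row reads cols - 1 data disks plus the
        # row parity disk.
        "reads_via_row": cols,
    }


def pyramid_layout(groups, group_size):
    """Counts for a stripe of groups * group_size data disks with one
    P-parity per group and one Q-parity per stripe (hypothetical helper)."""
    data = groups * group_size
    parity = groups + 1
    return {
        "data_disks": data,
        "parity_disks": parity,
        "parity_overhead": parity / (data + parity),
        # One failed data disk is rebuilt inside its group: the other
        # group_size - 1 data disks plus the group's P-parity.
        "reads_single_data_failure": group_size,
    }


def rebuild_from_column(column_blocks, column_parity, lost_row):
    """Recover the block at lost_row as the XOR of the surviving blocks
    in the column and the column parity (blocks modeled as integers)."""
    survivors = [b for i, b in enumerate(column_blocks) if i != lost_row]
    return reduce(xor, survivors, column_parity)


if __name__ == "__main__":
    print(rectangular_layout(3, 5))
    print(pyramid_layout(2, 6))
    column = [0b1010, 0b0110, 0b1111]
    parity = reduce(xor, column)
    print(bin(rebuild_from_column(column, parity, lost_row=1)))
```

With fewer rows than columns, the column path reads fewer disks than the row path, which is why recovery prefers columns. The sketch does not model double failures, survival rates, or the Q-parity computation.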
Layout, Reliability, Arrays, Organizations, Maintenance engineering, Data storage systems
T. Schwarz, A. Amer and J. Paris, "Combining Low IO-Operations During Data Recovery with Low Parity Overhead in Two-Failure Tolerant Archival Storage Systems," 2015 IEEE 21st Pacific Rim International Symposium on Dependable Computing (PRDC), Zhangjiajie, China, 2016, pp. 235-244.