Parallel and Distributed Processing Symposium, International (2007)
Long Beach, CA, USA
Mar. 26, 2007 to Mar. 30, 2007
Kirsten Hildrum, IBM T.J. Watson Research Center
Fred Douglis, IBM T.J. Watson Research Center
Joel L. Wolf, IBM T.J. Watson Research Center
Philip Yu, IBM T.J. Watson Research Center
Lisa Fleischer, Dartmouth College
Akshay Katta, Amazon Corporation
We consider storage in an extremely large-scale distributed computer system designed for stream processing applications. In such systems, incoming data and intermediate results may need to be stored to enable future analyses. The quantity of such data would overwhelm even the largest storage system, so a mechanism is needed to keep the most useful data. One recently introduced approach is to employ retention value functions, which effectively assign each data object a value that changes over time. Storage space is then reclaimed automatically by deleting the data of lowest current value. In such large systems, there will naturally be multiple file systems available, each with different properties. Choosing the right file system for a given incoming data stream presents a challenge. In this paper, we provide a novel and effective scheme for optimizing the placement of data within a distributed storage subsystem employing retention value functions. The goal is to keep the data of highest overall value while simultaneously balancing the read load to the file system.
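The reclamation idea described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's placement scheme; it only shows, under simplifying assumptions, how a store might delete the objects of lowest current value when space is needed. The names `StoredObject`, `Store`, `reclaim`, and `put`, the unit-free sizes, and the example value functions are all hypothetical and chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class StoredObject:
    """A data object with a retention value function (hypothetical model)."""
    name: str
    size: int
    created: float
    value_fn: Callable[[float], float]  # maps age to retention value

    def current_value(self, now: float) -> float:
        return self.value_fn(now - self.created)


class Store:
    """A single fixed-capacity store that evicts lowest-value data first."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.objects: Dict[str, StoredObject] = {}

    def free(self) -> int:
        return self.capacity - sum(o.size for o in self.objects.values())

    def reclaim(self, needed: int, now: float) -> List[str]:
        """Delete objects in order of increasing current value
        until at least `needed` units are free."""
        evicted: List[str] = []
        by_value = sorted(self.objects.values(),
                          key=lambda o: o.current_value(now))
        for obj in by_value:
            if self.free() >= needed:
                break
            del self.objects[obj.name]
            evicted.append(obj.name)
        return evicted

    def put(self, obj: StoredObject, now: float) -> List[str]:
        """Store `obj`, reclaiming space first; returns evicted names.
        Assumption: incoming data is always admitted if it can fit at all."""
        if obj.size > self.capacity:
            raise ValueError("object larger than store capacity")
        evicted = self.reclaim(obj.size, now)
        self.objects[obj.name] = obj
        return evicted


# Usage: "a" decays quickly with age, "b" keeps a constant value.
store = Store(capacity=10)
store.put(StoredObject("a", 4, 0.0, lambda age: 10 * math.exp(-age)), now=0.0)
store.put(StoredObject("b", 4, 0.0, lambda age: 5.0), now=0.0)
evicted = store.put(StoredObject("c", 4, 5.0, lambda age: 1.0), now=5.0)
```

In this sketch, object "a" starts with the highest value but has decayed below "b" by the time "c" arrives, so "a" is the one reclaimed. Choosing among multiple file systems and balancing read load, which the paper addresses, are deliberately left out.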
F. Douglis, K. Hildrum, J. L. Wolf, L. Fleischer, A. Katta and P. Yu, "Storage Optimization for Large-Scale Distributed Stream Processing Systems," 2007 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Long Beach, CA, USA, 2007, p. 443.