Programming support and adaptive checkpointing for high-throughput data services with log-based recovery
2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN) (2010)
Chicago, IL, USA
June 28, 2010 to July 1, 2010
Jingyu Zhou, Computer Science Department, Shanghai Jiao Tong University, China
Caijie Zhang, Google Inc., Mountain View, CA 94043, USA
Hong Tang, Yahoo Inc., Sunnyvale, CA 94089, USA
Jiesheng Wu, Microsoft, Redmond, WA 98052, USA
Tao Yang, University of California, Santa Barbara, 93106, USA
Many large-scale data mining and offline processing applications are organized as network services that run continuously or for long periods. To sustain high throughput, these services often keep their data in memory, which leaves that data vulnerable to failures. Their availability requirements are less stringent than those of online services exposed to millions of users, but these data-intensive applications still need persistent data in order to survive failures. This paper presents SLACH, programming and runtime support for building multi-threaded, high-throughput persistent services. To make in-memory objects persistent, SLACH uses application-assisted logging and checkpointing for log-based recovery while maximizing throughput and concurrency. SLACH adaptively adjusts the checkpointing frequency based on log growth and throughput demand, balancing runtime overhead against recovery speed. The paper describes the design and API of SLACH, its adaptive checkpoint control, and our experience and experiments using SLACH at Ask.com.
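The general idea of log-based recovery with load-aware checkpointing can be illustrated with a minimal Python sketch. This is not the SLACH API, which the abstract does not show: the class name LoggedStore, its parameters (max_log_records, busy_threshold), the load argument, and the specific trigger rule are all assumptions made for illustration. The sketch logs each update before applying it in memory, defers checkpoints while the service reports high load, and forces one once the log grows to twice the target length so that recovery time stays bounded.

```python
import json
import os


class LoggedStore:
    """Toy in-memory key/value store with an operation log and checkpoints.

    Hypothetical illustration only; names and policy are assumptions,
    not the interface or algorithm described in the paper.
    """

    def __init__(self, directory, max_log_records=1000, busy_threshold=0.8):
        self.log_path = os.path.join(directory, "ops.log")
        self.ckpt_path = os.path.join(directory, "state.ckpt")
        self.max_log_records = max_log_records  # target log length
        self.busy_threshold = busy_threshold    # load above this defers checkpoints
        self.data = {}
        self.log_records = 0
        self._recover()
        self.log = open(self.log_path, "a")

    def _recover(self):
        # Recovery: load the last checkpoint, then replay the log on top of it.
        if os.path.exists(self.ckpt_path):
            with open(self.ckpt_path) as f:
                self.data = json.load(f)
        if os.path.exists(self.log_path):
            with open(self.log_path) as f:
                for line in f:
                    key, value = json.loads(line)
                    self.data[key] = value  # puts are idempotent, so replay is safe
                    self.log_records += 1

    def put(self, key, value, load=0.0):
        # Write the log record to stable storage before updating memory.
        self.log.write(json.dumps([key, value]) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        self.data[key] = value
        self.log_records += 1
        if self.should_checkpoint(load):
            self.checkpoint()

    def should_checkpoint(self, load):
        # Assumed policy: always checkpoint once the log is twice the target
        # length; otherwise checkpoint at the target length only when idle.
        if self.log_records >= 2 * self.max_log_records:
            return True
        return self.log_records >= self.max_log_records and load < self.busy_threshold

    def checkpoint(self):
        # Write the snapshot to a temporary file, then rename it atomically.
        tmp = self.ckpt_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.data, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.ckpt_path)
        # The snapshot now covers every logged update, so the log can be reset.
        self.log.close()
        self.log = open(self.log_path, "w")
        self.log_records = 0

    def close(self):
        self.log.close()
```

A longer log means fewer checkpoints and less runtime overhead but a slower replay after a crash; the two thresholds in should_checkpoint are one simple way to express that trade-off. A crash between the rename and the log reset is harmless here because replaying the old log over the new snapshot reapplies the same values.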
Jingyu Zhou, Caijie Zhang, Hong Tang, Jiesheng Wu and Tao Yang, "Programming support and adaptive checkpointing for high-throughput data services with log-based recovery," 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN), Chicago, IL, USA, 2010, pp. 91-100.