Batch-processing frameworks such as Hadoop MapReduce are difficult to scale across multiple clouds because of inter-cloud data transfer latencies and synchronization overheads during the shuffle phase. This prevents the MapReduce framework from guaranteeing performance under variable load surges without over-provisioning the internal cloud (IC). We propose BStream, a cloud-bursting framework that couples stream processing in the external cloud (EC) with Hadoop in the IC to realize inter-cloud MapReduce. Stream processing in the EC enables pipelined uploading, processing, and downloading of data to minimize network latencies. We use this framework to guarantee the service-level objective (SLO) of meeting job deadlines. BStream uses an analytical model to minimize EC usage and to burst only when necessary. We propose different checkpointing strategies that overlap output transfer with input transfer and processing while simultaneously reducing the computation involved in merging the results from the EC and the IC. Checkpointing further reduces job completion time. We experimentally compare BStream with related work and illustrate the benefits of using stream processing and checkpointing strategies in the EC. Lastly, we characterize the operational regime of BStream.
Janakiram Dharanipragada, "Extending MapReduce across Clouds with BStream", IEEE Transactions on Cloud Computing, no. 1, pp. 1, PrePrints, doi:10.1109/TCC.2014.2316810