Data Placement and Task Scheduling Optimization for Data Intensive Scientific Workflow in Multiple Data Centers Environment
2014 Second International Conference on Advanced Cloud and Big Data (CBD) (2014)
Nov. 20, 2014 to Nov. 22, 2014
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/CBD.2014.19
Running data-intensive scientific workflows across multiple data centers faces a massive data-transfer problem, which leads to low efficiency in practical workflow applications for scientists. By considering data size and data dependency, we propose a k-means-based initial data placement strategy that places the most closely related initial data sets into the same data center at the workflow preparation stage. During workflow execution, by analyzing the interdependencies between data sets and tasks, we adopt a multilevel task replication strategy to reduce the volume of intermediate data transfer. The simulation results show that the proposed strategies can effectively reduce data transfer among data centers and improve the performance of running data-intensive scientific workflows.
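The abstract's initial-placement idea, clustering related data sets so that each cluster maps to one data center, can be sketched with a plain k-means loop. The feature encoding below (normalized data size plus task-sharing overlap with other data sets) and all names and values are illustrative assumptions, not the authors' actual formulation:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: cluster data-set feature vectors into k groups,
    one group per data center. Illustrative sketch only."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each data set to the nearest center
        # (squared Euclidean distance).
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assign

# Hypothetical feature vectors for five initial data sets:
# (normalized size, task overlap with group A, task overlap with group B).
datasets = [
    [0.9, 1.0, 0.1],
    [0.8, 0.9, 0.2],
    [0.2, 0.1, 0.9],
    [0.3, 0.2, 0.8],
    [0.5, 0.5, 0.5],
]
# With k = 2 data centers, strongly related data sets end up co-located.
placement = kmeans(datasets, k=2)
```

Under this toy encoding, the first two (mutually dependent) data sets land in one data center and the next two in the other, which is the co-location effect the paper's strategy aims for; the paper's actual feature construction from data size and dependency is not reproduced here.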
Data transfer, Distributed databases, Scheduling, Algorithm design and analysis, Data models, Big data, Processor scheduling
M. Wang, J. Zhang, F. Dong and J. Luo, "Data Placement and Task Scheduling Optimization for Data Intensive Scientific Workflow in Multiple Data Centers Environment," 2014 Second International Conference on Advanced Cloud and Big Data (CBD), Huangshan, China, 2014, pp. 77-84.