Parallel and Distributed Processing Symposium, International (2009)
May 23, 2009 to May 29, 2009
Doruk Bozdag, The Ohio State University, Dept. of Biomedical Informatics, Columbus, 43210, USA
Catalin C. Barbacioru, Applied Biosystems, 850 Lincoln Center Drive, Foster City, CA 94404, USA
Umit V. Catalyurek, The Ohio State University, Dept. of Biomedical Informatics, Columbus, 43210, USA
With the advent of next-generation high-throughput sequencing instruments, large volumes of short sequence data are being generated at an unprecedented rate. Processing and analyzing these massive datasets requires overcoming several challenges, including the mapping of the generated short sequences to a reference genome. On large-scale datasets, this computationally intensive process takes on the order of days using existing sequential techniques. In this work, we propose six parallelization methods to speed up short sequence mapping and to reduce the execution time for such large datasets to just a few hours. We compare these methods and give a theoretical cost model for each. Experimental results on real datasets demonstrate the effectiveness of the parallel methods and indicate that the cost models enable accurate estimation of parallel execution time. Based on these cost models, we implemented a selection function that predicts the best method for a given scenario. To the best of our knowledge, this is the first study on the parallelization of the short sequence mapping problem.
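The idea of a selection function driven by cost models can be illustrated with a minimal sketch. The abstract does not give the six methods or their cost formulas, so everything below is an assumption made for illustration: the two method names, the `Scenario` fields, and the constants in the cost functions are hypothetical placeholders, not the authors' models. The only part taken from the abstract is the general pattern of evaluating each method's predicted cost for a scenario and choosing the cheapest.

```python
from typing import Callable, Dict, NamedTuple


class Scenario(NamedTuple):
    """Hypothetical description of one mapping job (not from the paper)."""
    num_reads: int      # number of short sequences to map
    genome_size: int    # reference genome length in bases
    num_procs: int      # number of processors available


CostModel = Callable[[Scenario], float]


def cost_partition_reads(s: Scenario) -> float:
    # Placeholder model: each processor works on the whole genome
    # but only on its share of the reads. Constants are made up.
    return s.genome_size * 1e-8 + (s.num_reads / s.num_procs) * 1e-5


def cost_partition_genome(s: Scenario) -> float:
    # Placeholder model: each processor works on a slice of the genome
    # but must process every read. Constants are made up.
    return (s.genome_size / s.num_procs) * 1e-8 + s.num_reads * 1e-5


# Hypothetical registry; the paper describes six methods, two are sketched here.
MODELS: Dict[str, CostModel] = {
    "partition_reads": cost_partition_reads,
    "partition_genome": cost_partition_genome,
}


def select_method(models: Dict[str, CostModel], scenario: Scenario) -> str:
    """Return the name of the method with the lowest predicted cost."""
    return min(models, key=lambda name: models[name](scenario))


many_reads = Scenario(num_reads=10_000_000, genome_size=3_000_000_000, num_procs=16)
few_reads = Scenario(num_reads=1_000, genome_size=3_000_000_000, num_procs=16)
```

Under these placeholder models, `select_method(MODELS, many_reads)` favors splitting the reads, while `select_method(MODELS, few_reads)` favors splitting the genome. With real cost models, the same argmin structure would apply unchanged.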
D. Bozdag, U. V. Catalyurek and C. C. Barbacioru, "Parallel short sequence mapping for high throughput genome sequencing," 2009 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), Rome, 2009, pp. 1-10.