Shifting the bioinformatics computing paradigm: A case study in parallelizing genome annotation using MAKER and Work Queue
2012 IEEE 2nd International Conference on Computational Advances in Bio and medical Sciences (ICCABS) (2012)
Las Vegas, NV, USA
Feb. 23, 2012 to Feb. 25, 2012
Andrew Thrasher, Department of Computer Science and Engineering, University of Notre Dame, IN, USA
Douglas Thain, Department of Computer Science and Engineering, University of Notre Dame, IN, USA
Scott Emrich, Department of Computer Science and Engineering, University of Notre Dame, IN, USA
Zachary Musgrave, Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, USA
Next-generation sequencing technologies have enabled various entities, ranging from large sequencing centers to individual laboratories, to sequence organisms of choice and analyze them on demand. Sequencing and analysis, however, are only part of the equation: to learn about a certain organism, scientists need to annotate it. Each of these problems is highly parallel at a basic level of computation; however, only a few applications support single parallelization frameworks such as MPI. Because of the overall increasing demand for computational analysis and the inherent parallelism available in these problems, applications should easily run on clusters, clouds, and/or grids (even simultaneously); this would enable labs of various sizes to harness the computing power available to them without forcing them to invest in a particular type of batch system. Here we describe modifications made to one particular tool, MAKER. MAKER is a tool for genome annotation that is provided as both a serial application and an MPI application. We modify it to run without MPI and to utilize a wide variety of distributed computing platforms. Further, our proposed parallel framework allows for easy explicit data transfer, which helps overcome a major limitation of bioinformatics tools that generally rely on a shared filesystem. The distributed computing framework we chose can be used, even during early stages of development, to run bioinformatics tools on clusters, grids, and clouds. We present an evaluation of our modifications using the Caenorhabditis japonica genome, which comprises 180 megabases of data, and achieve a speedup of 45× using 50 workers.
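To put the reported speedup in context, the short sketch below computes parallel efficiency, taken here as speedup divided by the number of workers. The function name `parallel_efficiency` is a hypothetical helper introduced only for illustration; the abstract itself reports just the two figures used as inputs (45× and 50 workers) and does not state an efficiency value.

```python
def parallel_efficiency(speedup: float, workers: int) -> float:
    """Return speedup divided by worker count (1.0 would be ideal linear scaling)."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return speedup / workers


# Figures reported in the abstract: a 45x speedup using 50 workers.
efficiency = parallel_efficiency(45.0, 50)
print(f"parallel efficiency: {efficiency:.0%}")  # prints "parallel efficiency: 90%"
```

Under this definition, the reported run corresponds to roughly 90% of ideal linear scaling.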
S. Emrich, A. Thrasher, Z. Musgrave and D. Thain, "Shifting the bioinformatics computing paradigm: A case study in parallelizing genome annotation using MAKER and Work Queue," 2012 IEEE 2nd International Conference on Computational Advances in Bio and medical Sciences (ICCABS), Las Vegas, NV, USA, 2012, pp. 1-6.