Issue No.12 - Dec. (2012 vol.23)
pp: 2189-2197
Christopher Moretti, Princeton University, Princeton
Andrew Thrasher, University of Notre Dame, Notre Dame
Li Yu, University of Notre Dame, Notre Dame
Michael Olson, University of Notre Dame, Notre Dame
Scott Emrich, University of Notre Dame, Notre Dame
Douglas Thain, University of Notre Dame, Notre Dame
ABSTRACT
Bioinformatics researchers need efficient means to process large collections of genomic sequence data. One application of interest, genome assembly, has great potential for parallelization; however, most previous attempts at parallelization require uncommon high-end hardware. This paper introduces the Scalable Assembler at Notre Dame (SAND) framework that can achieve significant speedup using large numbers of commodity machines harnessed from clusters, clouds, and grids. SAND interfaces with the Celera open-source assembly toolkit, replacing two independent sequential modules with scalable parallel alternatives: the candidate selector exploits distributed memory capacity, and the sequence aligner exploits distributed computing capacity. For large problems, these modules provide robust task and data management while also achieving speedup with high efficiency. We show results for several data sets ranging from 738 thousand to over 320 million alignments using resources ranging from a small cluster to more than a thousand nodes spanning three institutions.
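The two-stage structure described in the abstract (select candidate pairs, then align them as independent tasks) can be illustrated with a small sketch. This is not SAND's code or API: the shared k-mer filter, the exact suffix-prefix overlap scorer, and every function name below are assumptions chosen for illustration, and the batches are run sequentially here rather than dispatched to remote workers.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k=4):
    """Hypothetical candidate selector: pair reads that share any k-mer.

    The abstract does not say how candidates are chosen; k-mer sharing
    is an assumption made for this sketch.
    """
    index = defaultdict(set)
    for read_id, seq in reads.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(read_id)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return sorted(pairs)


def suffix_prefix_overlap(a, b):
    """Length of the longest exact suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def make_tasks(pairs, batch_size):
    """Split candidate pairs into independent batches, one per task."""
    return [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]


def run_task(reads, batch):
    """Score every pair in one batch; tasks share no mutable state."""
    results = []
    for a, b in batch:
        score = max(suffix_prefix_overlap(reads[a], reads[b]),
                    suffix_prefix_overlap(reads[b], reads[a]))
        results.append((a, b, score))
    return results


def align_all(reads, k=4, batch_size=2):
    """Run both stages; a real deployment would ship each batch to a worker."""
    tasks = make_tasks(candidate_pairs(reads, k), batch_size)
    results = []
    for batch in tasks:
        results.extend(run_task(reads, batch))
    return results


if __name__ == "__main__":
    toy_reads = {"r1": "ACGTACGG", "r2": "ACGGTTCA", "r3": "TTTTTTTT"}
    print(align_all(toy_reads))
```

Because each batch depends only on the reads it names, the batches can be executed in any order and on any machine, which is the property that makes the alignment stage suitable for large pools of commodity nodes.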
INDEX TERMS
Bioinformatics, Genomics, Cloud computing, Distributed processing, Random access memory, Biomedical informatics, genome assembly, Distributed systems, bioinformatics
CITATION
Christopher Moretti, Andrew Thrasher, Li Yu, Michael Olson, Scott Emrich, Douglas Thain, "A Framework for Scalable Genome Assembly on Clusters, Clouds, and Grids", IEEE Transactions on Parallel & Distributed Systems, vol.23, no. 12, pp. 2189-2197, Dec. 2012, doi:10.1109/TPDS.2012.80