Issue No.01 - January (2008 vol.19)
pp: 15-23
ABSTRACT
Bioinformatics databases used for sequence comparison and sequence alignment are growing exponentially. This has popularized programs that carry out database searches. Current implementations of sequence alignment methods based on hidden Markov models (HMMs) have proven to be computationally intensive and hence amenable to architectures with multiple processors. In this paper, we describe a modified version of the original parallel implementation of hidden Markov models on a massively parallel system. This is part of the HMMER bioinformatics code. HMMER 2.3.2 uses profile hidden Markov models for sensitive database searching based on statistical descriptions of a sequence family's consensus [1]. Two of the nine programs, namely hmmsearch and hmmpfam, were further parallelized to take advantage of a large number of processors. For our study, we start by porting the parallel virtual machine (PVM) versions of these two programs, currently available as part of the HMMER suite of programs. We report the performance of these non-optimized versions as baselines. Our work also includes the introduction of an alternate sequence file indexing, a multiple master configuration, dynamic data collection, and finally load balancing via the indexed sequence files. This set of optimizations constitutes our modified version for massively parallel systems. Our results show parallel performance improvements of more than one order of magnitude (16x) for hmmsearch and hmmpfam.
INDEX TERMS
Hidden Markov models, HMMER, massively parallel systems, multiple master parallelization, parallel implementation, genomic sequence-search, bioinformatics.
CITATION
Karl Jiang, Oystein Thorsen, Amanda Peters, Brian Smith, Carlos P. Sosa, "An Efficient Parallel Implementation of the Hidden Markov Methods for Genomic Sequence-Search on a Massively Parallel System", IEEE Transactions on Parallel & Distributed Systems, vol.19, no. 1, pp. 15-23, January 2008, doi:10.1109/TPDS.2007.70712
REFERENCES
[1] R. Durbin, S.R. Eddy, A. Krogh, and G.J. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge Univ. Press, 1998.
[2] G.D. Schuler, Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, A.D. Baxevanis and B.F. Francis Ouellette, eds. Wiley-Liss, Inc., 1998.
[3] D.A. Benson, I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, and D.L. Wheeler, “GenBank,” Nucleic Acids Research, vol. 34, pp. D16-D20, 2006.
[4] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman, “Basic Local Alignment Search Tool,” J. Molecular Biology, vol. 215, pp. 403-410, 1990.
[5] S. Altschul, T.L. Madden, A.A. Schaffer, J. Zheng, Z. Zhang, M. Miller, and D.L. Lipman, “Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs,” Nucleic Acids Research, vol. 25, pp. 3389-3402, 1997.
[6] W.R. Pearson and D.J. Lipman, “Improved Tools for Biological Sequence Comparison,” Proc. Nat'l Academy of Sciences, vol. 85, pp. 2444-2448, 1988.
[7] L.R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,” Proc. IEEE, vol. 77, pp. 257-286, 1989.
[8] S.R. Eddy, “Multiple Alignment Using Hidden Markov Models,” Proc. Third Int'l Conf. Intelligent Systems Molecular Biology (ISMB '95), vol. 3, pp. 114-120, 1995.
[9] S.R. Eddy, “Hidden Markov Models,” Current Opinion in Structural Biology, vol. 6, pp. 361-365, 1996.
[10] S.R. Eddy, “Profile Hidden Markov Models,” Bioinformatics, vol. 14, pp. 755-763, 1998.
[11] P. Baldi, Y. Chauvin, T. Hunkapiller, and M.A. McClure, “Hidden Markov Models of Biological Primary Sequence Information,” Proc. Nat'l Academy of Sciences, vol. 91, pp. 1059-1063, 1994.
[12] A. Krogh, M. Brown, I.S. Mian, K. Sjolander, and D. Haussler, “Hidden Markov Models in Computational Biology: Applications to Protein Modeling,” J. Molecular Biology, vol. 235, pp. 1501-1531, 1994.
[13] S.R. Eddy, HMMER User's Guide: Biological Sequence Analysis Using Profile Hidden Markov Models, Version 2.3.2, http://hmmer.wustl.edu/, Oct. 1998.
[14] M.Y. Galperin, “The Molecular Biology Database Collection: 2006 Update,” Nucleic Acids Research, vol. 34, pp. D3-D5, 2006.
[15] M.K. Gardner, W.-C. Feng, J. Archuleta, H. Lin, and X. Ma, “Parallel Genomic Sequence-Searching on an Ad-Hoc grid: Experiences, Lessons Learned and Implications,” Proc. ACM/IEEE Supercomputing Conf. (SC '06), 2006.
[16] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam, PVM: Parallel Virtual Machine. A User's Guide and Tutorial for Networked Parallel Computing, J. Kowalik, ed. MIT Press, 1994.
[17] A. Darling, L. Carey, and W.-C. Feng, “The Design, Implementation and Evaluation of mpiBLAST,” Proc. ClusterWorld Conf. and Expo, 2003.
[18] H. Lin, X. Ma, P. Chandramohan, A. Geist, and N. Samatova, “Efficient Data Access for Parallel BLAST,” Proc. 19th Int'l Parallel and Distributed Processing Symp. (IPDPS '05), 2005.
[19] R. Bjornson, A. Sherman, S. Weston, N. Willard, and J. Wing, “TurboBLAST: A Parallel Implementation of BLAST Built on the TurboHub,” Proc. 16th Int'l Parallel and Distributed Processing Symp. (IPDPS '02), 2002.
[20] D.R. Mathog, “Parallel BLAST on Split Databases,” Bioinformatics, vol. 19, pp. 1865-1866, 2003.
[21] K. Hokamp, D.C. Shields, K.H. Wolfe, and D.R. Caffrey, “Wrapping Up BLAST and Other Applications for Use on Unix Clusters,” Bioinformatics, vol. 19, pp. 441-442, 2003.
[22] C. Oehmen and J. Nieplocha, “ScalaBLAST: A Scalable Implementation of BLAST for High-Performance Data-Intensive Bioinformatics Analysis,” IEEE Trans. Parallel and Distributed Systems, vol. 17, pp. 740-749, 2006.
[23] R.C. Braun, K.T. Pedretti, T.L. Casavant, T.E. Scheetz, C.L. Birkett, and C.A. Roberts, “Parallelization of Local BLAST Service on Workstation Clusters,” Future Generation Computer Systems, vol. 17, pp. 745-754, 2001.
[24] J.D. Grant, R.L. Dunbrack, F.J. Manion, and M.F. Ochs, “BeoBLAST: Distributed BLAST and PSI-BLAST on a Beowulf Cluster,” Bioinformatics, vol. 18, pp. 765-766, 2002.
[25] H. Stockinger, M. Pagni, L. Cerutti, and L. Falquet, “Grid Approach to Embarrassingly Parallel CPU-Intensive Bioinformatics Problems,” Proc. Second IEEE Int'l Conf. e-Science and Grid Computing (e-Science '06), 2006.
[26] W. Zhu, Y. Niu, J. Lu, and G.R. Gao, “Implementing Parallel HMM-pfam on the EARTH Multithreaded Architecture,” Computational Systems Bioinformatics, 2003.
[27] A. Ching, W.-C. Feng, H. Lin, X. Ma, and A. Choudhary, “Exploring I/O for Parallel Sequence-Search Tools with S3aSim,” Proc. 15th IEEE Int'l Symp. High Performance Distributed Computing, pp. 229-240, 2006.
[28] J. Nieplocha, R. Harrison, and R. Littlefield, “Global Arrays: A Nonuniform Memory Access Programming Model for High-Performance Computing,” J. Supercomputing, vol. 10, pp. 197-220, 1996.
[29] U. Srinivasan, P.-S. Chen, Q. Diao, C.-C. Lim, E. Li, Y. Chen, R. Ju, and Y. Zhang, “Characterization and Analysis of HMMER and SVM-RFE Parallel Bioinformatics Applications,” Proc. IEEE Int'l Symp. Workload Characterization (IISWC '05), pp. 87-98, 2005.
[30] Message Passing Interface Forum. MPI: Message-Passing Interface Standard, June 1995.
[31] O. Lascu, N. Allsopp, P. Vezolle, J. Follows, M. Hennecke, F. Ishibashi, M. Paolini, S. Prakash, H. Reddy, C.P. Sosa, A. Tabary, and D. Quintero, Unfolding the IBM Server Blue Gene Solution. IBM Corp., Int'l Technical Support Organization, http://www.redbooks.ibm.com/abstracts/sg246686.html?Open, 2005.
[32] C.H. Wu, R. Apweiler, A. Bairoch, D.A. Natale, W.C. Barker, B. Boeckmann, S. Ferro, E. Gasteiger, H. Huang, R. Lopez, M. Magrane, M.J. Martin, R. Mazumder, C. O'Donovan, N. Redaschi, and B. Suzek, “The Universal Protein Resource (UniProt): An Expanding Universe of Protein Information,” Nucleic Acids Research, vol. 34, pp. D187-D191, 2006.
[33] A. Bateman, E. Birney, L. Cerruti, R. Durbin, L. Etwiller, S.R. Eddy, S. Griffiths-Jones, K.L. Howe, and E.L. Sonnhammer, “The Pfam Protein Families Database,” Nucleic Acids Research, vol. 30, pp. 276-280, 2002.
[34] E. Birney, D. Andrews, M. Caccamo, Y. Chen, L. Clarke, G. Coates, T. Cox, F. Cunningham, V. Curwen, T. Cutts, T. Down, R. Durbin, X.M. Fernandez-Suarez, P. Flicek, S. Gräf, M. Hammond, J. Herrero, K. Howe, V. Iyer, K. Jekosch, A. Kähäri, A. Kasprzyk, D. Keefe, F. Kokocinski, E. Kulesha, D. London, I. Longden, C. Melsopp, P. Meidl, B. Overduin, A. Parker, G. Proctor, A. Prlic, M. Rae, D. Rios, S. Redmond, M. Schuster, I. Sealy, S. Searle, J. Severin, G. Slater, D. Smedley, J. Smith, A. Stabenau, J. Stalker, S. Trevanion, A. Ureta-Vidal, J. Vogel, S. White, C. Woodwark, and T.J.P. Hubbard, “Ensembl 2006,” Nucleic Acids Research, vol. 34, pp. D556-D561, 2006.
[35] O. Thorsen, B. Smith, C.P. Sosa, K. Jiang, H. Lin, A. Peters, and W.-C. Feng, “Parallel Genomic Sequence-Search on a Massively Parallel System,” to be published, 2007.
[36] J.P. Walters, J. Landman, and V. Chaudhary, “Optimized Cluster-Enabled HMMER Searches,” to be published in Grids for Bioinformatics and Computational Biology, E.-G. Talbi and A. Zomaya, eds., John Wiley & Sons, 2007.