Issue No.10 - Oct. (2012 vol.23)
pp: 1923-1933
Changjun Wu , Xerox Research Center, Webster
Ananth Kalyanaraman , Washington State University, Pullman
William R. Cannon , Pacific Northwest National Laboratory, Richland
ABSTRACT
Detecting sequence homology between protein sequences is a fundamental problem in computational molecular biology, with pervasive application in nearly all analyses that aim to structurally and functionally characterize protein molecules. While detecting the homology between two protein sequences is relatively inexpensive, detecting pairwise homology across a large number of protein sequences can become computationally prohibitive for modern inputs, often requiring millions of CPU hours. Yet there is currently no robust support for parallelizing this kernel. In this paper, we identify the key characteristics that make this problem particularly hard to parallelize, and then propose a new parallel algorithm suited to detecting homology on large data sets using distributed memory parallel computers. Our method, called pGraph, is a novel hybrid between the hierarchical multiple-master/worker model and the producer-consumer model, and is designed to break the irregularities imposed by alignment computation and work generation. Experimental results show that pGraph achieves linear scaling on a 2,048-processor distributed memory cluster for inputs ranging from 20,000 to 2,560,000 sequences. In addition to demonstrating strong scaling, we present an extensive report on the performance of the various system components and related parametric studies.
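The producer-consumer half of the design named in the abstract can be illustrated with a small sketch. This is not the pGraph implementation: it runs threads inside a single process rather than on a distributed memory machine, it omits the hierarchical master/worker layer, and the producer simply enumerates every pair of sequences, which is an assumption made for brevity. The function names, scoring parameters, and buffer size are all hypothetical. The point is only to show how a bounded buffer decouples irregular work generation from irregular alignment cost.

```python
import queue
import threading


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score with a linear gap penalty.

    The scoring values are illustrative placeholders, not a substitution matrix.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def build_homology_graph(seqs, threshold, n_consumers=4, buffer_size=64):
    """Return the set of index pairs (i, j), i < j, scoring at least `threshold`."""
    pairs = queue.Queue(maxsize=buffer_size)  # bounded buffer between the two roles
    edges = set()
    edges_lock = threading.Lock()
    done = None  # sentinel telling a consumer to stop

    def producer():
        # Assumption: brute-force enumeration of all pairs.
        for i in range(len(seqs)):
            for j in range(i + 1, len(seqs)):
                pairs.put((i, j))  # blocks while the buffer is full
        for _ in range(n_consumers):
            pairs.put(done)

    def consumer():
        while True:
            item = pairs.get()
            if item is done:
                break
            i, j = item
            if local_alignment_score(seqs[i], seqs[j]) >= threshold:
                with edges_lock:
                    edges.add((i, j))

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(n_consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return edges
```

Calling `build_homology_graph(["ACDEFGHIK", "ACDEFGHIK", "WWWWWWWWW"], threshold=10)` links only the two identical sequences. Because the queue is bounded, a fast producer stalls instead of flooding memory, and consumers that draw expensive alignments do not hold up the others.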
INDEX TERMS
Protein sequence, Computational modeling, Amino acids, DNA, Image edge detection, Dynamic programming, producer-consumer model, Parallel protein sequence homology detection, parallel sequence graph construction, hierarchical master-worker paradigm
CITATION
Changjun Wu, Ananth Kalyanaraman, William R. Cannon, "pGraph: Efficient Parallel Construction of Large-Scale Protein Sequence Homology Graphs", IEEE Transactions on Parallel & Distributed Systems, vol.23, no. 10, pp. 1923-1933, Oct. 2012, doi:10.1109/TPDS.2012.19