Issue No.03 - March (2009 vol.20)
pp: 303-315
ABSTRACT
As genome sequence databases grow in size, the accuracy and speed of sequence similarity detection become more important. A growing number of methods are used to detect sequence similarity. Meanwhile, demand for genome sequence search and alignment services is also increasing. Scaling up the computer systems that host these methods and serve requests to them in a timely manner is a challenge. Traditional clusters, which are used in most scientific centers, cannot cope with this challenge. This paper tackles the problem in a novel way: it treats sequence search requests as content requests to both genome databases and similarity detection methods, so scaling up the computer systems that serve this content becomes a process of constructing a content distribution network. The paper gives a decentralized method to dynamically construct content distribution networks for a variety of genome sequence similarity detection services. It also provides a scheduling algorithm for using content nodes efficiently. Our simulation study shows that scalability and high content node utilization can be achieved in such a system while the cost of achieving them remains reasonable.
INDEX TERMS
Simulation, Performance Analysis and Design Aids, Distributed architectures, Distributed networks, Distributed applications, Data models, Hash-table representations, Optimization
CITATION
B.B. Zhou, C. Wang, "A Decentralized Method for Scaling Up Genome Similarity Search Services", IEEE Transactions on Parallel & Distributed Systems, vol. 20, no. 3, pp. 303-315, March 2009, doi:10.1109/TPDS.2008.95