IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2010, vol. 7, Issue No. 04, October-December

pp: 669-680

Jianjun Zhou , University of Alberta, Edmonton

Zhipeng Cai , University of Alberta, Edmonton

Lusheng Wang , City University of Hong Kong, Hong Kong

Guohui Lin , University of Alberta, Edmonton

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TCBB.2008.99

ABSTRACT

Modern biological applications often require comparing the similarity of two objects, and such comparisons, for example whole-genome pairwise alignment and protein 3D structure alignment, are often computationally very expensive. Nevertheless, quickly identifying the closest neighbors of a newly obtained sequence or structure in a very large database can provide timely hints about its function and more. This paper presents a substantial speedup technique for the well-studied k-nearest neighbor (k-nn) search, based on the novel concepts of virtual pivots and partial pivots, which avoids a significant number of the expensive distance computations. The new method dynamically locates virtual pivots according to the query, with increasing pruning ability. On a database of 10,000 gene sequences, and with the same or less database preprocessing effort, the new method used no more than 40 percent of the distance computations per query required by the second-best method among several of the best-known k-nn search methods, including M-Tree, OMNI, SA-Tree, and LAESA. We demonstrate the method on two biological sequence data sets, one of which is for HIV-1 viral strain computational genotyping.
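
To make the idea of triangle inequality pruning concrete, the following is a minimal sketch of generic pivot-based k-nn search with a fixed set of pivots, in the spirit of LAESA-style methods. It is not the virtual-pivot or partial-pivot method of the paper. The function names (`hamming`, `build_pivot_table`, `knn_with_pivots`) are hypothetical, and the Hamming distance is only an assumed stand-in for an expensive metric such as an alignment distance.

```python
import heapq
from typing import Callable, List, Sequence, Tuple


def hamming(a: str, b: str) -> int:
    """Toy metric on equal-length strings (assumed stand-in for a costly distance)."""
    return sum(x != y for x, y in zip(a, b))


def build_pivot_table(
    db: Sequence[str], pivots: Sequence[str], dist: Callable[[str, str], float]
) -> List[List[float]]:
    """Preprocessing: store d(pivot, object) for every pivot and database object."""
    return [[dist(p, o) for o in db] for p in pivots]


def knn_with_pivots(
    query: str,
    db: Sequence[str],
    pivots: Sequence[str],
    table: List[List[float]],
    dist: Callable[[str, str], float],
    k: int,
) -> Tuple[List[Tuple[float, int]], int]:
    """Return the k nearest (distance, index) pairs and the number of
    distance computations performed at query time."""
    n_dist = 0
    q_to_pivot = []
    for p in pivots:
        q_to_pivot.append(dist(query, p))
        n_dist += 1

    # Triangle inequality: |d(q, p) - d(p, o)| <= d(q, o) for every pivot p,
    # so the maximum over pivots is a lower bound on d(q, o).
    lower = [
        max(abs(qp - row[i]) for qp, row in zip(q_to_pivot, table))
        for i in range(len(db))
    ]

    # Visit objects in order of increasing lower bound.
    order = sorted(range(len(db)), key=lambda i: lower[i])

    heap: List[Tuple[float, int]] = []  # max-heap on distance via negation
    for i in order:
        # Once k candidates are known, any object whose lower bound is at
        # least the current k-th distance cannot improve the result.
        if len(heap) == k and lower[i] >= -heap[0][0]:
            break
        d = dist(query, db[i])
        n_dist += 1
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, i))

    return sorted((-nd, i) for nd, i in heap), n_dist
```

The sketch spends one distance computation per pivot up front and then skips every database object whose lower bound already rules it out. How many computations are saved depends on the data and on the choice of pivots, which is the aspect the paper's query-dependent virtual pivots aim to improve.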

INDEX TERMS

Nearest neighbor search, metric space, triangle inequality pruning, virtual pivot, partial pivot, HIV-1 computational genotyping.

CITATION

Jianjun Zhou, Zhipeng Cai, Lusheng Wang, Guohui Lin, "Finding the Nearest Neighbors in Biological Databases Using Less Distance Computations," *IEEE/ACM Transactions on Computational Biology and Bioinformatics*, vol. 7, no. 4, pp. 669-680, October-December 2010, doi:10.1109/TCBB.2008.99