IEEE Computer Society Bioinformatics Conference (CSB'02)
Accelerating Approximate Subsequence Search on Large Protein Sequence Databases
Stanford, California
August 14-August 16
ISBN: 0-7695-1653-X
Jiong Yang, IBM T.J. Watson Research
Wei Wang, University of North Carolina at Chapel Hill
Yi Xia, University of California at Los Angeles
Philip S. Yu, IBM T.J. Watson Research
Bioinformatics has become an active research area in recent years. The number of mapped sequences doubles every fourteen months. BLAST has been widely employed for retrieving sequences that have portion(s) similar to a given sequence. However, BLAST has to scan the entire database every time a query is issued, which can be very time-consuming, especially when the database is large. In this paper, we study the problem of how to build a persistent index structure for protein sequences that supports approximate matching. The suffix tree has been proposed as a solution for indexing sequence databases and has been used to organize DNA sequences (Hunt et al. 2001). Unfortunately, it suffers from a "memory bottleneck" that prevents it from being applied efficiently to a large database. Performance degrades even further for protein databases because of the larger fanout at each node. Here, we employ an indexing structure, called the BASS-tree, to support approximate matching in sublinear time on a large protein database. We call this the sequence approximate match (SAM) indexing method. The search for approximate matches can be quickly directed to the portion of the database with a high potential of matching. Our experiments demonstrate a potential performance improvement of an order of magnitude over alternative methods such as the BLAST algorithm and the suffix tree.
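To make the cost of scanning concrete, the following is a minimal sketch of a full-database scan for approximate subsequence matches. It is neither BLAST nor the SAM index: it assumes, purely for illustration, that similarity is measured by Hamming distance (number of mismatched positions), and the names `hamming`, `linear_scan`, and `toy_db` are hypothetical. The point is only that every window of every sequence is visited for each query, so the work grows linearly with database size.

```python
from typing import Iterator, List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def linear_scan(
    database: List[str], query: str, max_mismatches: int
) -> Iterator[Tuple[int, int, int]]:
    """Yield (sequence index, offset, mismatches) for each window within the threshold.

    Assumption: Hamming distance stands in for the similarity measure.
    Every window of every sequence is examined on every query.
    """
    m = len(query)
    for seq_id, seq in enumerate(database):
        for offset in range(len(seq) - m + 1):
            d = hamming(seq[offset:offset + m], query)
            if d <= max_mismatches:
                yield seq_id, offset, d


# Toy database of made-up amino-acid strings (illustrative only).
toy_db = ["MKVLAAGIV", "GGKVLSAGA", "TTTTTTTT"]
hits = list(linear_scan(toy_db, "KVLAAG", max_mismatches=1))
print(hits)
```

A persistent index of the kind the abstract describes aims to avoid visiting most of these windows by directing the search to the promising portion of the database.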
Citation:
Jiong Yang, Wei Wang, Yi Xia, Philip S. Yu, "Accelerating Approximate Subsequence Search on Large Protein Sequence Databases," IEEE Computer Society Bioinformatics Conference (CSB'02), p. 207, 2002