Issue No.02 - March/April (2012 vol.9)
pp: 548-559
Biing-Feng Wang, National Tsing Hua University, Hsinchu
This paper addresses the problem of finding all nested common intervals of two general sequences. Depending on how duplicate genes are to be treated, Blin et al. introduced three models for defining nested common intervals of two sequences: the uniqueness, the free-inclusion, and the bijection models. We consider all three models. For the uniqueness and the bijection models, we give $O(n + N_{\rm out})$-time algorithms, where $N_{\rm out}$ denotes the size of the output. For the free-inclusion model, we give an $O(n^{1+\varepsilon} + N_{\rm out})$-time algorithm, where $\varepsilon > 0$ is an arbitrarily small constant. We also present an upper bound on the size of the output for each model. For the uniqueness and the free-inclusion models, we show that $N_{\rm out} = O(n^{2})$. Let $C = \sum_{g \in \Gamma} o_{1}(g)\,o_{2}(g)$, where $\Gamma$ is the set of distinct genes, and $o_{1}(g)$ and $o_{2}(g)$ are, respectively, the numbers of copies of $g$ in the two given sequences. For the bijection model, we show that $N_{\rm out} = O(Cn)$. We also study the problem of finding all approximate nested common intervals of two sequences under the bijection model. An $O(\delta n + N_{\rm out})$-time algorithm is presented, where $\delta$ denotes the maximum number of allowed gaps. In addition, we show that for this problem $N_{\rm out}$ is $O(\delta n^{3})$.
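As a small illustration of the quantity C defined in the abstract, the following Python sketch computes the sum over distinct genes of the product of their copy counts in the two sequences. The function name `copy_product_sum` and the representation of sequences as lists of gene labels are assumptions made for this example; they are not taken from the paper, and the sketch does not implement any of the paper's algorithms.

```python
from collections import Counter


def copy_product_sum(seq1, seq2):
    """Return C = sum over distinct genes g of o1(g) * o2(g).

    o1(g) and o2(g) are the numbers of copies of g in seq1 and seq2.
    (Hypothetical helper; sequences are assumed to be iterables of
    hashable gene labels.)
    """
    o1 = Counter(seq1)
    o2 = Counter(seq2)
    # A gene missing from either sequence contributes 0 to the sum,
    # so iterating over the genes present in both is sufficient.
    return sum(o1[g] * o2[g] for g in o1.keys() & o2.keys())


if __name__ == "__main__":
    s1 = ["a", "b", "a", "c", "b"]  # a: 2 copies, b: 2, c: 1
    s2 = ["b", "a", "c", "a", "d"]  # a: 2 copies, b: 1, c: 1, d: 1
    print(copy_product_sum(s1, s2))  # 2*2 + 2*1 + 1*1 = 7
```

When both sequences are permutations of the same gene set, every product is 1, so C equals the number of distinct genes.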
Algorithms, data structures, common intervals, comparative genomics, conserved gene clusters.
Biing-Feng Wang, "Output-Sensitive Algorithms for Finding the Nested Common Intervals of Two General Sequences", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.9, no. 2, pp. 548-559, March/April 2012, doi:10.1109/TCBB.2011.112
[1] M.P. Béal, A. Bergeron, S. Corteel, and M. Raffinot, “An Algorithmic View of Gene Teams,” Theoretical Computer Science, vol. 320, nos. 2/3, pp. 395-418, 2004.
[2] J.L. Bentley and H.A. Maurer, “Efficient Worst-Case Data Structures for Range Searching,” Acta Informatica, vol. 13, pp. 155-168, 1980.
[3] A. Bergeron, Y. Gingras, and C. Chauve, “Formal Models of Gene Clusters,” Bioinformatics Algorithms: Techniques and Applications, Chapter 8, I. Mandoiu and A. Zelikovsky, eds., pp. 177-202, Wiley, 2008.
[4] A. Bergeron and J. Stoye, “On the Similarity of Sets of Permutations and Its Applications to Genome Comparison,” J. Computational Biology, vol. 13, pp. 1340-1354, 2006.
[5] G. Blin, D. Faye, and J. Stoye, “Finding Nested Common Intervals Efficiently,” J. Computational Biology, vol. 17, no. 9, pp. 1183-1194, 2010.
[6] S. Böcker, K. Jahn, J. Mixtacki, and J. Stoye, “Computation of Median Gene Clusters,” J. Computational Biology, vol. 16, pp. 1085-1099, 2009.
[7] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein, Introduction to Algorithms, second ed. McGraw-Hill, 2001.
[8] T. Dandekar, B. Snel, M. Huynen, and P. Bork, “Conservation of Gene Order: A Fingerprint for Proteins that Physically Interact,” Trends in Biochemical Sciences, vol. 23, pp. 324-328, 1998.
[9] G. Didier, “Common Intervals of Two Sequences,” Proc. Third Int'l Workshop Algorithms in Bioinformatics, pp. 17-24, 2003.
[10] G. Didier, T. Schmidt, J. Stoye, and D. Tsur, “Character Sets of Strings,” J. Discrete Algorithms, vol. 5, pp. 330-340, 2007.
[11] M.D. Ermolaeva, O. White, and S.L. Salzberg, “Prediction of Operons in Microbial Genomes,” Nucleic Acids Research, vol. 29, no. 5, pp. 1216-1221, 2001.
[12] X. He and M.H. Goldwasser, “Identifying Conserved Gene Clusters in the Presence of Homology Families,” J. Computational Biology, vol. 12, no. 6, pp. 638-656, 2005.
[13] S. Heber and J. Stoye, “Finding All Common Intervals of k Permutations,” Proc. 12th Ann. Symp. Combinatorial Pattern Matching, pp. 207-218, 2001.
[14] R. Hoberman and D. Durand, “The Incompatible Desiderata of Gene Cluster Properties,” Proc. RECOMB '05 Int'l Workshop Comparative Genomics, pp. 73-87, 2005.
[15] K. Jahn, “Efficient Computation of Approximate Gene Clusters Based on Reference Occurrences,” Proc. Eighth RECOMB Comparative Genomics Satellite Workshop, pp. 264-277, 2010.
[16] U. Kurzik-Dumke and A. Zengerle, “Identification of a Novel Drosophila Melanogaster Gene, Angel, a Member of a Nested Gene Cluster at Locus 59F4,5,” Biochimica et Biophysica Acta, vol. 1308, pp. 177-181, 1996.
[17] W.C. Lathe III, B. Snel, and P. Bork, “Gene Context Conservation of a Higher Order than Operons,” Trends in Biochemical Sciences, vol. 25, pp. 474-479, 2000.
[18] J. Lawrence, “Selfish Operons: The Evolutionary Impact of Gene Clustering in Prokaryotes and Eukaryotes,” Current Opinion in Genetics & Development, vol. 9, no. 6, pp. 642-648, 1999.
[19] N. Luc, J.-L. Risler, A. Bergeron, and M. Raffinot, “Gene Teams: A New Formalization of Gene Clusters for Comparative Genomics,” Computational Biology and Chemistry, vol. 27, no. 1, pp. 59-67, 2003.
[20] R. Overbeek, M. Fonstein, M. D'Souza, G.D. Pusch, and N. Maltsev, “The Use of Gene Clusters to Infer Functional Coupling,” Proc. Nat'l Academy of Sciences USA, vol. 96, no. 6, pp. 2896-2901, 1999.
[21] F.P. Preparata and M.I. Shamos, Computational Geometry: An Introduction. Springer, 1985.
[22] S. Rahmann and G.W. Klau, “Integer Linear Programs for Discovering Approximate Gene Clusters,” Proc. Workshop Algorithms in Bioinformatics (WABI), pp. 298-309, 2006.
[23] T. Schmidt and J. Stoye, “Quadratic Time Algorithms for Finding Common Intervals in Two and More Sequences,” Proc. 15th Ann. Symp. Combinatorial Pattern Matching (CPM), pp. 347-359, 2004.
[24] B. Snel, P. Bork, and M.A. Huynen, “The Identification of Functional Modules from the Genomic Association of Genes,” Proc. Nat'l Academy of Sciences USA, vol. 99, no. 9, pp. 5890-5895, 2002.
[25] T. Uno and M. Yagiura, “Fast Algorithms to Enumerate All Common Intervals of Two Permutations,” Algorithmica, vol. 26, no. 2, pp. 290-309, 2000.
[26] B.-F. Wang and C.-H. Lin, “Improved Algorithms for Finding Gene Teams and Constructing Gene Team Trees,” IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 8, no. 5, pp. 1258-1272, Sept./Oct. 2011.
[27] B.-F. Wang, C.-C. Kuo, S.-J. Liu, and C.-H. Lin, “A New Efficient Algorithm for the Gene Team Problem on General Sequences,” IEEE/ACM Trans. Computational Biology and Bioinformatics, submitted for publication.