An Algorithm for Finding the Largest Approximately Common Substructures of Two Trees
August 1998 (vol. 20 no. 8)
pp. 889-895

Abstract—Ordered, labeled trees are trees in which each node has a label and the left-to-right order of its children (if it has any) is fixed. Such trees have many applications in vision, pattern recognition, molecular biology and natural language processing. We consider a substructure of an ordered labeled tree T to be a connected subgraph of T. Given two ordered labeled trees T1 and T2 and an integer d, the largest approximately common substructure problem is to find a substructure U1 of T1 and a substructure U2 of T2 such that U1 is within edit distance d of U2 and where there does not exist any other substructure V1 of T1 and V2 of T2 such that V1 and V2 satisfy the distance constraint and the sum of the sizes of V1 and V2 is greater than the sum of the sizes of U1 and U2. We present a dynamic programming algorithm to solve this problem, which runs as fast as the fastest known algorithm for computing the edit distance of two trees when the distance allowed in the common substructures is a constant independent of the input trees. To demonstrate the utility of our algorithm, we discuss its application to discovering motifs in multiple RNA secondary structures (which are ordered labeled trees).
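To make the problem statement concrete, the following is a minimal sketch of the special case d = 0 only, where U1 and U2 must be identical ordered labeled trees. It is not the dynamic programming algorithm of the paper, which handles a general distance bound d. The names `Node`, `rooted_match`, and `largest_exact_common` are hypothetical, and the sketch assumes that every edit operation has positive cost, so that edit distance 0 means exact equality.

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class Node:
    """Ordered labeled tree node; children are kept in left-to-right order."""
    label: str
    children: tuple = ()


@lru_cache(maxsize=None)
def rooted_match(u: Node, v: Node) -> int:
    """Size of the largest identical connected substructure rooted at u and v.

    Returns 0 when the labels differ. Otherwise both roots are kept, and the
    children are aligned in order (a weighted longest-common-subsequence),
    because a connected substructure may drop whole child subtrees but must
    preserve the left-to-right order of the children it keeps.
    """
    if u.label != v.label:
        return 0
    a, b = u.children, v.children
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = max(
                dp[i - 1][j],  # drop child i of u
                dp[i][j - 1],  # drop child j of v
                dp[i - 1][j - 1] + rooted_match(a[i - 1], b[j - 1]),
            )
    return 1 + dp[len(a)][len(b)]


def nodes(t: Node):
    """Yield every node of t in preorder."""
    yield t
    for c in t.children:
        yield from nodes(c)


def largest_exact_common(t1: Node, t2: Node) -> int:
    """Maximum of |U1| + |U2| over identical substructures (d = 0 case)."""
    best = max(rooted_match(u, v) for u in nodes(t1) for v in nodes(t2))
    return 2 * best


# T1 = a(b(d), c, e) and T2 = a(b(d, x), e) share the substructure a(b(d), e).
t1 = Node("a", (Node("b", (Node("d"),)), Node("c"), Node("e")))
t2 = Node("a", (Node("b", (Node("d"), Node("x"))), Node("e")))
result = largest_exact_common(t1, t2)
```

Here the shared substructure a(b(d), e) has four nodes in each tree, so the objective (the sum of the two sizes) is 8. Allowing d > 0 would additionally permit relabelings, insertions, and deletions inside the matched substructures, which this sketch does not attempt.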

Index Terms:
Computational biology, dynamic programming, pattern matching, pattern recognition, trees.
Citation:
Jason T.L. Wang, Bruce A. Shapiro, Dennis Shasha, Kaizhong Zhang, Kathleen M. Currey, "An Algorithm for Finding the Largest Approximately Common Substructures of Two Trees," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 889-895, Aug. 1998, doi:10.1109/34.709622