Discovering Frequent Agreement Subtrees from Phylogenetic Data
January 2008 (vol. 20 no. 1)
pp. 68-82
We study a new data mining problem concerning the discovery of frequent agreement subtrees (FASTs) from a set of phylogenetic trees. A phylogenetic tree, or phylogeny, is an unordered tree in which the order among siblings is unimportant. Each leaf in the tree has a label representing a taxon (species or organism) name, whereas internal nodes are unlabeled. The tree may have a root, representing the common ancestor of all species in the tree, or may be unrooted. An unrooted phylogeny arises when there is insufficient evidence to infer a common ancestor of the taxa in the tree. The FAST problem addressed here is a natural extension of the maximum agreement subtree (MAST) problem, which is widely studied in the computational phylogenetics community. The paper establishes a framework for tackling the FAST problem for both rooted and unrooted phylogenetic trees using data mining techniques. We first develop a novel canonical form for rooted trees together with a phylogeny-aware tree expansion scheme for generating candidate subtrees level by level. We then present an efficient heuristic to find all frequent agreement subtrees in a given set of rooted trees through an Apriori-like approach. We show the correctness and completeness of the proposed method. Finally, we discuss extensions of the techniques to unrooted trees. Experimental results demonstrate that the proposed methods work well and can find interesting patterns in both synthetic data and real phylogenetic trees.
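To make the notions in the abstract concrete, the following is a minimal Python sketch of agreement subtrees and level-wise mining on rooted phylogenies. It is an illustration under stated assumptions, not the paper's method: trees are assumed to be nested tuples with taxon names as string leaves, the canonical form here is a simple recursive sort (not the canonical form developed in the paper), and candidates are enumerated by taxon subsets rather than by the paper's tree expansion scheme. The names `restrict`, `canonical`, `support_count`, and `mine` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def leaves(tree):
    """Return the set of taxon labels at the leaves of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def restrict(tree, taxa):
    """Restrict a rooted tree to `taxa`, suppressing nodes left with one child."""
    if isinstance(tree, str):
        return tree if tree in taxa else None
    kept = [c for c in (restrict(child, taxa) for child in tree) if c is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)

def canonical(tree):
    """Illustrative canonical form: recursively sort siblings by their repr."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canonical(child) for child in tree), key=repr))

def support_count(pattern, trees):
    """Count the trees whose restriction to the pattern's taxa matches the pattern."""
    taxa, target = leaves(pattern), canonical(pattern)
    return sum(1 for t in trees
               if taxa <= leaves(t) and canonical(restrict(t, taxa)) == target)

def mine(trees, min_count):
    """Level-wise (Apriori-like) search over taxon sets of growing size."""
    leaf_sets = [leaves(t) for t in trees]
    all_taxa = sorted(set().union(*leaf_sets))
    frequent = {}  # canonical pattern -> number of supporting trees
    level = [frozenset(pair) for pair in combinations(all_taxa, 2)]
    while level:
        survivors = set()
        for taxa in level:
            counts = Counter(canonical(restrict(t, taxa))
                             for t, ls in zip(trees, leaf_sets) if taxa <= ls)
            for pattern, n in counts.items():
                if n >= min_count:
                    frequent[pattern] = n
                    survivors.add(taxa)
        # Join step: grow surviving taxon sets by one taxon, then prune any
        # candidate that has a non-surviving subset one taxon smaller.
        joined = {a | b for a in survivors for b in survivors
                  if len(a | b) == len(a) + 1}
        level = [c for c in joined if all(c - {x} in survivors for x in c)]
    return frequent

# Three toy phylogenies; the first two differ only in sibling order.
trees = [
    (("a", "b"), ("c", "d")),
    (("b", "a"), ("d", "c")),
    (("a", "c"), ("b", "d")),
]
print(support_count((("a", "b"), "c"), trees))  # 2
print(len(mine(trees, min_count=2)))            # 11
```

The pruning step relies on the fact that if a set of trees agrees on a taxon set, it also agrees on every subset of that taxon set, so a taxon set can only yield a frequent pattern if all of its subsets do. This brute-force enumeration is exponential in the number of taxa and is meant only to show the definitions at work on small inputs.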

Index Terms:
Data mining, Bioinformatics (genome or protein) databases
Citation:
Sen Zhang, Jason T.L. Wang, "Discovering Frequent Agreement Subtrees from Phylogenetic Data," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 1, pp. 68-82, Jan. 2008, doi:10.1109/TKDE.2007.190676