Efficiently Mining Frequent Trees in a Forest: Algorithms and Applications
August 2005 (vol. 17 no. 8)
pp. 1021-1035
Mining frequent trees is useful in domains such as bioinformatics, Web mining, and the mining of semistructured data. We formulate the problem of mining (embedded) subtrees in a forest of rooted, labeled, and ordered trees. We present TreeMiner, a novel algorithm that discovers all frequent subtrees in a forest using a new data structure called the scope-list. We contrast TreeMiner with a pattern-matching tree mining algorithm (PatternMatcher), and we also compare it with TreeMinerD, which counts only distinct occurrences of a pattern. We conduct detailed experiments to test the performance and scalability of these methods. We also use tree mining to analyze RNA structure and phylogenetics data sets from the bioinformatics domain.
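
The abstract names the scope-list but does not define it. As a rough illustration only, the sketch below assumes that a node's scope is the interval from its own preorder position to the preorder position of its rightmost descendant, so that an ancestor-descendant (embedded) relation reduces to interval containment. The names `Node`, `compute_scopes`, `is_ancestor`, and `support` are hypothetical, and the code handles only two-node patterns; it is not the TreeMiner algorithm itself.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node of a rooted, labeled, ordered tree (hypothetical helper type)."""
    label: str
    children: list["Node"] = field(default_factory=list)


def compute_scopes(root):
    """Return [(label, (lo, hi)), ...] in preorder.

    Assumption: lo is the node's preorder index and hi is the preorder
    index of its rightmost descendant (hi == lo for a leaf).
    """
    scopes = []

    def visit(node):
        lo = len(scopes)
        scopes.append(None)            # reserve the slot for this node
        for child in node.children:    # children are visited left to right
            visit(child)
        hi = len(scopes) - 1           # last index assigned inside this subtree
        scopes[lo] = (node.label, (lo, hi))

    visit(root)
    return scopes


def is_ancestor(a, b):
    """True if the node with scope a is a proper ancestor of the node with scope b."""
    return a[0] < b[0] <= a[1]


def support(forest, anc_label, desc_label):
    """Count the trees in which some anc_label node is an ancestor of some desc_label node."""
    count = 0
    for root in forest:
        scopes = compute_scopes(root)
        found = any(
            la == anc_label and lb == desc_label and is_ancestor(sa, sb)
            for la, sa in scopes
            for lb, sb in scopes
        )
        count += found
    return count


# Toy forest: A(B(C), D), A(C), B(A)
t1 = Node("A", [Node("B", [Node("C")]), Node("D")])
t2 = Node("A", [Node("C")])
t3 = Node("B", [Node("A")])
forest = [t1, t2, t3]

print(compute_scopes(t1))
print(support(forest, "A", "C"))
```

In the first tree, C is a grandchild of A rather than a child, so the pattern "A above C" is matched only in the embedded sense. This sketch counts at most one match per tree; counting every occurrence of a pattern, as opposed to only distinct ones, is a separate design choice that it does not model.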

Index Terms:
Frequent tree mining, rooted, ordered, labeled trees, subtree enumeration, pattern matching, RNA structure, phylogenetic trees, data mining.
Citation:
Mohammed J. Zaki, "Efficiently Mining Frequent Trees in a Forest: Algorithms and Applications," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 8, pp. 1021-1035, Aug. 2005, doi:10.1109/TKDE.2005.125