| | This Article | |
| |
| |
| | Share | |
| |
| |
| | Bibliographic References | |
| |
| |
| | Add to: | |
| |
Digg
Furl
Spurl
Blink
Simpy
Google
Del.icio.us
Y!MyWeb
| |
| | Search | |
| |
| |
| | |
An Efficient Algorithm for Discovering Frequent Subgraphs
September 2004 (vol. 16 no. 9)
pp. 1038-1051
Over the years, frequent itemset discovery algorithms have been used to find interesting patterns in various application areas. However, as data mining techniques are being increasingly applied to nontraditional domains, existing frequent pattern discovery approaches cannot be used. This is because the transaction framework that is assumed by these algorithms cannot be used to effectively model the data sets in these domains. An alternate way of modeling the objects in these data sets is to represent them using graphs. Within that model, one way of formulating the frequent pattern discovery problem is that of discovering subgraphs that occur frequently over the entire set of graphs. In this paper, we present a computationally efficient algorithm, called FSG, for finding all frequent subgraphs in large graph data sets. We experimentally evaluate the performance of FSG using a variety of real and synthetic data sets. Our results show that despite the underlying complexity associated with frequent subgraph discovery, FSG is effective in finding all frequently occurring subgraphs in data sets containing more than 200,000 graph transactions and scales linearly with respect to the size of the data set.
[1] R.C. Agarwal, C.C. Aggarwal, and V.V.V. Prasad, A Tree Projection Algorithm for Generation of Frequent Item Sets, J. Parallel and Distributed Computing, vol. 61, no. 3, pp. 350-371, 2001.
[2] R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules Proc. 20th Int'l Conf. Very Large Data Bases (VLDB), pp. 487-499, Sept. 1994.
[3] Y. Amit and A. Kong, Graphical Templates for Model Registration IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, no. 3, pp. 225-236, 1996.
[4] T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Sakamoto, and S. Arikawa, Efficient Substructure Discovery from Large Semi-Structured Data Proc. Second SIAM Int'l Conf. Data Mining (SDM '02), pp. 158-174, 2002.
[5] C. Borgelt and M.R. Berthold, Mining Molecular Fragments: Finding Relevant Substructures of Molecules Proc. 2002 IEEE Int'l Conf. Data Mining (ICDM), 2002.
[6] C.-W.K. Chen and D.Y.Y. Yun, Unifying Graph-Matching Problem with a Practical Solution Proc. Int'l Conf. Systems, Signals, Control, Computers, Sept. 1998.
[7] G. Cong, L. Yi, B. Liu, and K. Wang, Discovering Frequent Substructures from Hierarchical Semi-Structured Data Proc. Second SIAM Int'l Conf. Data Mining (SDM-2002), 2002.
[8] D.J. Cook and L.B. Holder, Graph-Based Data Mining IEEE Intelligent Systems, vol. 15, no. 2, pp. 32-41, 2000.
[9] L. Dehaspe and L. De Raedt, Mining Association Rules in Multiple Relations Proc. Seventh Int'l Workshop Inductive Logic Programming, pp. 125-132, 1997.
[10] L. Dehaspe, H. Toivonen, and R.D. King, Finding Frequent Substructures in Chemical Compounds Proc. Fourth ACMSIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD-98), pp. 30-36, 1998.
[11] M. Deshpande, M. Kuramochi, and G. Karypis, Automated Approaches for Classifying Structures Proc. Second Workshop Data Mining in Bioinformatics (BIOKDD '02), 2002.
[12] J.D. Dixon and B. Mortimer, Permutation Groups Graduate Texts in Math., vol. 163, Springer-Verlag, 1996.
[13] B. Dunkel and N. Soparkar, Data Organizatinon and Access for Efficient Data Mining Proc. 15th IEEE Int'l Conf. Data Eng., Mar. 1999.
[14] D. Dupplaw and P.H. Lewis, Content-Based Image Retrieval with Scale-Spaced Object Trees Proc. SPIE: Storage and Retrieval for Media Databases, vol. 3972, pp. 253-261, 2000.
[15] S. Fortin, The Graph Isomorphism Problem Technical Report TR96-20, Dept. of Computing Science, Univ. of Alberta, 1996.
[16] M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. New York: W.H. Freeman and Company, 1979.
[17] S. Ghazizadeh and S. Chawathe, SEuS: Structure Extraction Using Summaries Proc. Fifth Int'l Conf. Discovery Science, 2002.
[18] B. Goethals, Efficient Frequent Pattern Mining PhD thesis, Univ. of Limburg, Diepenbeek, Belgium, Dec. 2002.
[19] J. Gonzalez, L.B. Holder, and D.J. Cook, Application of Graph-Based Concept Learning to the Predictive Toxicology Domain Proc. Predictive Toxicology Challenge Workshop, 2001.
[20] J. Han, J. Pei, and Y. Yin, Mining Frequent Patterns without Candidate Generation Proc. ACM SIGMOD Int'l Conf. Management of Data, May 2000.
[21] C. Hansch, P.P. Maolney, T. Fujita, and R.M. Muir, Correlation of Biological Activity of Phenoxyacetic Acids with Hammett Substituent Constants and Partition Coefficients Nature, vol. 194, pp. 178-180, 1962.
[22] J. Hipp, U. Güntzer, and G. Nakhaeizadeh, Algorithms for Association Rule Mining A General Survey and Comparison SIGKDD Explorations, vol. 2, no. 1, pp. 58-64, July 2000.
[23] L.B. Holder, D.J. Cook, and S. Djoko, Substructure Discovery in the SUBDUE System Proc. AAAI Workshop Knowledge Discovery in Databases, pp. 169-180, 1994.
[24] A. Inokuchi, T. Washio, and H. Motoda, An Apriori-Based Algorithm for Mining Frequent Substructures from Graph Data Proc. Fourth European Conf. Principles and Practice of Knowledge Discovery in Databases (PKDD '00), pp. 13-23, Sept. 2000.
[25] A. Inokuchi, T. Washio, K. Nishimura, and H. Motoda, A Fast Algorithm for Mining Frequent Connected Subgraphs Technical Report RT0448, IBM Research, Tokyo Research Laboratory, 2002.
[26] H. Kälviäinen and E. Oja, Comparisons of Attributed Graph Matching Algorithms for Computer Vision Proc. STEP-90, Finnish Artificial Intelligence Symp., pp. 354-368, June 1990.
[27] R.D. King, S.H. Muggleton, A. Srinivasan, and M.J.E. Sternberg, Structure-Activity Relationships Derived by Machine Learning: The Use of Atoms and Their Bond Connectivities to Predict Mutagenicity by Inductive Logic Programming Proc. Nat'l Academy of Sciences, pp. 438-442, 1996.
[28] S. Kramer, L. De Raedt, and C. Helma, Molecular Feature Mining in HIV Data Proc. Seventh ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD-01), pp. 136-143, 2001.
[29] M. Kuramochi and G. Karypis, Frequent Subgraph Discovery Proc. 2001 IEEE Int'l Conf. Data Mining (ICDM), Nov. 2001.
[30] T. Leung, M. Burl, and P. Perona, "Finding Faces in Cluttered Scenes Using Labeled Random Graph Matching," Int'l Conf. Computer Vision, 1995, pp. 637-644.
[31] B.D. McKay, Nauty Users Guide http://cs.anu.edu.au/~bdmnauty/, 2003.
[32] B.D. McKay, Practical Graph Isomorphism Congressus Numerantium, vol. 30, pp. 45-87, 1981.
[33] S.H. Muggleton, Inverse Entailment and Progol New Generation Computing, special issue on inductive logic programming, vol. 13, nos. 3-4, pp. 245-286, 1995.
[34] S.H. Muggleton, Scientific Knowledge Discovery Using Inductive Logic Programming Comm. ACM, vol. 42, no. 11, pp. 42-46, 1999.
[35] C.R. Palmer, P.B. Gibbons, and C. Faloutsos, ANF: A Fast and Scalable Tool for Data Mining in Massive Graphs Proc. Eighth ACM SIGKDD Int'l Conf. Knowlege Discovery and Data Mining (KDD '02), July 2002.
[36] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth Proc. 2001 Int'l Conf. Data Eng. (ICDE '01), pp. 215-226, 2001.
[37] E.G.M. Petrakis and C. Faloutsos, “Similarity Searching in Medical Image Databases,” IEEE Trans. Knowledge and Data Eng., vol. 9, no. 3, pp. 435-447, May/June 1997.
[38] J.R. Quinlan, Learning Logical Definitions from Relations Machine Learning, vol. 5, pp. 239-266, 1990.
[39] A. Savasere, E. Omiecinski, and S.B. Navathe, An Efficient Algorithm for Mining Association Rules in Large Databases Proc. 21st Int'l Conf. Very Large Data Bases (VLDB), pp. 432-444, 1995.
[40] P. Shenoy, J.R. Haritsa, S. Sundarshan, G. Bhalotia, M. Bawa, and D. Shah, Turbo-Charging Vertical Mining of Large Databases Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 22-33, May 2000.
[41] R. Srikant and R. Agrawal, Mining Sequential Patterns: Generalizations and Performance Improvements Proc. Fifth Int'l Conf. Extending Database Technology (EDBT), pp. 3-17, 1996.
[42] A. Srinivasan and R.D. King, Feature Construction with Inductive Logic Programming: A Study of Quantitative Predictions of Biological Activity Aided by Structural Attributes Data Mining and Knowledge Discovery, vol. 3, no. 1, pp. 37-57, 1999.
[43] A. Srinivasan, R.D. King, S.H. Muggleton, and M. Sternberg, The Predictive Toxicology Evaluation Challenge Proc. 15th Int'l Joint Conf. Artificial Intelligence (IJCAI), pp. 1-6, 1997.
[44] A. Srinivasan, R.D. King, S.H. Muggleton, and M.J.E. Sternberg, Carcinogenesis Predictions Using ILP Proc. Seventh Int'l Workshop Inductive Logic Programming, pp. 273-287, 1997.
[45] X. Yan and J. Han, gSpan: Graph-Based Substructure Pattern Mining Proc. 2002 IEEE Int'l Conf. Data Mining (ICDM), 2002.
[46] K. Yoshida and H. Motoda, CLIP: Concept Learning from Inference Patterns Artificial Intelligence, vol. 75, no. 1, pp. 63-92, 1995.
[47] M.J. Zaki, Scalable Algorithms for Association Mining IEEE Trans. Knowledge and Data Eng., vol. 12, no. 2, pp. 372-390, 2000.
[48] M.J. Zaki, Efficiently Mining Frequent Trees in a Forest Proc. Eighth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD-2002), July 2002.
[49] M.J. Zaki and K. Gouda, Fast Vertical Mining Using Diffsets Technical Report 01-1, Dept. of Computer Science, Rensselaer Polytechnic Inst., 2001.
Index Terms:
Data mining, scientific data sets, frequent pattern discovery, chemical compound data sets.
Citation:
Michihiro Kuramochi, George Karypis, "An Efficient Algorithm for Discovering Frequent Subgraphs," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1038-1051, Sept. 2004, doi:10.1109/TKDE.2004.33