IEEE/ACM Transactions on Computational Biology and Bioinformatics

Issue No.02 - April-June (2010 vol.7)

pp: 197-207

Jun (Luke) Huan, University of Kansas, Lawrence

Aaron Smalter, University of Kansas, Lawrence

Gerald Lushington, University of Kansas, Lawrence

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TCBB.2009.80

ABSTRACT

Graph data mining is an active research area. Graphs are general modeling tools for organizing information from heterogeneous sources and have been applied in many scientific, engineering, and business fields. With the rapid accumulation of graph data, building highly accurate predictive models for graph data has emerged as a new challenge that has not been fully explored in the data mining community. In this paper, we present a novel technique called the graph pattern diffusion (GPD) kernel. Our idea is to leverage existing frequent pattern discovery methods and to explore the application of kernel classifiers (e.g., support vector machines) in building highly accurate graph classifiers. In our method, we first identify all frequent patterns in a graph database. We then map these subgraphs to graphs in the graph database and use a process we call "pattern diffusion" to label nodes in the graphs. Finally, we design a graph alignment algorithm to compute the inner product of two graphs. We have tested our algorithm on a number of chemical structure data sets. The experimental results demonstrate that our method is significantly better than competing methods, such as kernel functions based on paths, cycles, and subgraphs.
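
The three steps named in the abstract (pattern occurrences, diffusion of pattern labels over nodes, and alignment of two graphs) can be illustrated with a small sketch. It is not the paper's algorithm, whose details are not given here. It assumes that the occurrences of frequent patterns are already known (the mining step is omitted), uses a simple mass-preserving diffusion rule chosen for illustration, and scores the alignment by brute force, which is only feasible for very small graphs. All function names and parameters (`seed_vectors`, `diffuse`, `align_kernel`, `steps`, `decay`) are hypothetical.

```python
from itertools import permutations


def seed_vectors(adj, pattern_occurrences):
    """Give each node one 0/1 entry per pattern (assumed input format:
    pattern_occurrences[k] is the set of nodes covered by pattern k)."""
    return {
        v: [1.0 if v in occ else 0.0 for occ in pattern_occurrences]
        for v in adj
    }


def diffuse(adj, vectors, steps=2, decay=0.5):
    """Illustrative diffusion rule: at each step a node keeps (1 - decay)
    of its values and splits the rest evenly among its neighbors."""
    cur = {v: list(vec) for v, vec in vectors.items()}
    for _ in range(steps):
        # Isolated nodes have nowhere to send mass, so they keep everything.
        nxt = {
            v: [((1.0 - decay) if adj[v] else 1.0) * x for x in vec]
            for v, vec in cur.items()
        }
        for u, vec in cur.items():
            for v in adj[u]:
                share = decay / len(adj[u])
                for k, x in enumerate(vec):
                    nxt[v][k] += share * x
        cur = nxt
    return cur


def align_kernel(vecs_a, vecs_b):
    """Score two graphs by the best one-to-one node alignment, where a
    matched pair contributes the dot product of its node vectors.
    Brute force over all alignments: factorial cost, toy sizes only."""
    a, b = list(vecs_a.values()), list(vecs_b.values())
    if len(a) > len(b):
        a, b = b, a  # always map the smaller graph into the larger one
    best = 0.0
    for perm in permutations(range(len(b)), len(a)):
        score = sum(
            sum(x * y for x, y in zip(a[i], b[j]))
            for i, j in enumerate(perm)
        )
        best = max(best, score)
    return best


# Toy data: a 3-node path and a single edge, with one pattern (an edge)
# assumed to occur on nodes {0, 1} of each graph.
g1_adj = {0: [1], 1: [0, 2], 2: [1]}
g2_adj = {0: [1], 1: [0]}
g1_vecs = diffuse(g1_adj, seed_vectors(g1_adj, [{0, 1}]))
g2_vecs = diffuse(g2_adj, seed_vectors(g2_adj, [{0, 1}]))
similarity = align_kernel(g1_vecs, g2_vecs)
```

For graphs of realistic size, the brute-force search would need to be replaced by a polynomial-time assignment solver. The resulting similarity values could then be passed to a kernel classifier such as a support vector machine as a precomputed kernel matrix.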

INDEX TERMS

Graph classification, graph alignment, frequent subgraph mining.

CITATION

Jun (Luke) Huan, Aaron Smalter, Gerald Lushington, "GPD: A Graph Pattern Diffusion Kernel for Accurate Graph Classification with Applications in Cheminformatics," *IEEE/ACM Transactions on Computational Biology and Bioinformatics*, vol. 7, no. 2, pp. 197-207, April-June 2010, doi:10.1109/TCBB.2009.80