Frequent Substructure-Based Approaches for Classifying Chemical Compounds
August 2005 (vol. 17 no. 8)
pp. 1036-1050
Computational techniques that build models to correctly assign chemical compounds to various classes of interest have many applications in pharmaceutical research and are used extensively at various phases of the drug development process. These techniques are used to solve a number of classification problems, such as predicting whether a chemical compound has the desired biological activity, determining whether it is toxic or nontoxic, and filtering out drug-like compounds from large compound libraries. This paper presents a substructure-based classification algorithm that decouples the substructure discovery process from the classification model construction and uses frequent subgraph discovery algorithms to find all topological and geometric substructures present in the data set. The advantage of this approach is that, during classification model construction, all relevant substructures are available, allowing the classifier to intelligently select the most discriminating ones. Computational scalability is ensured by the use of highly efficient frequent subgraph discovery algorithms coupled with aggressive feature selection. Experimental evaluation on eight different classification problems shows that our approach is computationally scalable and, on average, outperforms existing schemes by 7 percent to 35 percent.
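The decoupled pipeline described in the abstract (discover frequent substructures first, then select discriminating ones as classifier features) can be illustrated with a minimal Python sketch. It is not the authors' implementation: as a simplifying assumption it enumerates only labeled paths of up to two bonds rather than general topological or geometric subgraphs, and it scores features by the difference in per-class frequency rather than by the paper's feature selection scheme. The function names, the toy molecules, and their class labels are all hypothetical.

```python
from collections import Counter

def path_fragments(atoms, bonds, max_bonds=2):
    """Return the set of labeled paths with at most max_bonds bonds.

    Assumption: paths stand in for general subgraphs to keep the sketch short.
    """
    adj = {i: [] for i in range(len(atoms))}
    for i, j, bond in bonds:
        adj[i].append((j, bond))
        adj[j].append((i, bond))
    found = set()

    def extend(path, labels):
        # Canonical form: the smaller of the label sequence and its reverse,
        # so the same path read from either end maps to one fragment.
        found.add(min(tuple(labels), tuple(reversed(labels))))
        if (len(labels) - 1) // 2 == max_bonds:
            return
        for nxt, bond in adj[path[-1]]:
            if nxt not in path:
                extend(path + [nxt], labels + [bond, atoms[nxt]])

    for start in range(len(atoms)):
        extend([start], [atoms[start]])
    return found

def frequent_fragments(molecules, min_support=0.5, max_bonds=2):
    """Keep fragments that occur in at least min_support of the molecules."""
    counts = Counter()
    for atoms, bonds in molecules:
        counts.update(path_fragments(atoms, bonds, max_bonds))
    n = len(molecules)
    return sorted(f for f, c in counts.items() if c / n >= min_support)

def select_discriminating(molecules, labels, fragments, top_k=2, max_bonds=2):
    """Rank fragments by |frequency in class 1 - frequency in class 0|."""
    frag_sets = [path_fragments(a, b, max_bonds) for a, b in molecules]
    pos = [s for s, y in zip(frag_sets, labels) if y == 1]
    neg = [s for s, y in zip(frag_sets, labels) if y == 0]

    def score(f):
        p = sum(f in s for s in pos) / len(pos)
        q = sum(f in s for s in neg) / len(neg)
        return abs(p - q)

    return sorted(fragments, key=lambda f: (-score(f), f))[:top_k]

def featurize(molecule, fragments, max_bonds=2):
    """Binary vector: 1 if the fragment occurs in the molecule, else 0."""
    atoms, bonds = molecule
    present = path_fragments(atoms, bonds, max_bonds)
    return [int(f in present) for f in fragments]

# Toy data (hypothetical): (atoms, bonds) pairs with made-up class labels.
molecules = [
    (["C", "C", "O"], [(0, 1, "-"), (1, 2, "-")]),
    (["C", "C", "O"], [(0, 1, "-"), (1, 2, "=")]),
    (["C", "O"], [(0, 1, "-")]),
    (["C", "O"], [(0, 1, "=")]),
]
labels = [1, 0, 1, 0]

frequent = frequent_fragments(molecules, min_support=0.5)
selected = select_discriminating(molecules, labels, frequent, top_k=2)
vectors = [featurize(m, selected) for m in molecules]
print(selected)
print(vectors)
```

The resulting binary vectors are the kind of input that could then be passed to a separate classifier such as an SVM, which is the decoupling the abstract emphasizes: substructure discovery does not depend on the class labels, while selection and model construction do.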

Index Terms:
Classification, chemical compounds, virtual screening, graphs, SVM.
Citation:
Mukund Deshpande, Michihiro Kuramochi, Nikil Wale, George Karypis, "Frequent Substructure-Based Approaches for Classifying Chemical Compounds," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 8, pp. 1036-1050, Aug. 2005, doi:10.1109/TKDE.2005.127