Issue No.05 - Sept.-Oct. (2012 vol.9)
pp: 1301-1313
Hossam M. Ashtawy, Dept. of Electr. & Comput. Eng., Michigan State Univ., East Lansing, MI, USA
Nihar R. Mahapatra, Dept. of Electr. & Comput. Eng., Michigan State Univ., East Lansing, MI, USA
ABSTRACT
Accurately and efficiently predicting the binding affinities of large sets of protein-ligand complexes is a key challenge in computational biomolecular science, with applications in drug discovery, chemical biology, and structural biology. Since a scoring function (SF) is used to score, rank, and identify drug leads, the fidelity with which it predicts the affinity of a ligand candidate for a protein's binding site has a significant bearing on the accuracy of virtual screening. Despite intense efforts in developing conventional SFs, which are either force-field based, knowledge-based, or empirical, their limited ranking accuracy has been a major roadblock toward cost-effective drug discovery. Therefore, in this work, we explore a range of novel SFs employing different machine-learning (ML) approaches in conjunction with a variety of physicochemical and geometrical features characterizing protein-ligand complexes. We assess the ranking accuracies of these new ML-based SFs as well as those of conventional SFs in the context of the 2007 and 2010 PDBbind benchmark data sets on both diverse and protein-family-specific test sets. We also investigate the influence of the size of the training data set and the type and number of features used on ranking accuracy. Within clusters of protein-ligand complexes with different ligands bound to the same target protein, we find that the best ML-based SF is able to rank the ligands correctly based on their experimentally determined binding affinities 62.5 percent of the time and identify the top binding ligand 78.1 percent of the time. For this SF, the Spearman correlation coefficient between ranks of ligands ordered by predicted and experimentally determined binding affinities is 0.771. Given the challenging nature of the ranking problem and that SFs are used to screen millions of ligands, this represents a significant improvement over the best conventional SF we studied, for which the corresponding ranking performance values are 57.8 percent, 73.4 percent, and 0.677.
INDEX TERMS
proteins, biochemistry, biology computing, drugs, learning (artificial intelligence), molecular biophysics, Spearman correlation coefficient, comparative assessment, machine-learning-based scoring functions, protein-ligand binding affinity prediction, protein-ligand complexes, computational biomolecular science, drug discovery, physicochemical feature, geometrical feature, 2010 PDBbind benchmark data sets, protein-family-specific test sets, 2007 PDBbind benchmark data sets, training data set, feature extraction, training, databases, accuracy, three dimensional displays, virtual screening, machine learning, protein-ligand binding affinity, ranking power, scoring function
CITATION
Hossam M. Ashtawy, Nihar R. Mahapatra, "A Comparative Assessment of Ranking Accuracies of Conventional and Machine-Learning-Based Scoring Functions for Protein-Ligand Binding Affinity Prediction", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 5, pp. 1301-1313, Sept.-Oct. 2012, doi:10.1109/TCBB.2012.36