Detection of Outlier Residues for Improving Interface Prediction in Protein Heterocomplexes
July-Aug. 2012 (vol. 9 no. 4)
pp. 1155-1165
Limsoon Wong, Sch. of Comput., Nat. Univ. of Singapore, Singapore, Singapore
Peng Chen, Inst. of Intell. Machines, Hefei, China
Jinyan Li, Sch. of Comput., Nat. Univ. of Singapore, Singapore, Singapore
Sequence-based understanding and identification of protein binding interfaces is a challenging research topic due to the complexity of protein systems and the imbalanced distribution between interface and noninterface residues. This paper presents an outlier detection approach to address the redundancy problem in protein interaction data; the cleaned training data are then used to improve prediction performance. We use three novel measures to describe the extent to which a residue is considered an outlier relative to the other residues: the distance of a residue instance from the center instance of all residue instances with the same class label (Dist), the probability of the class label of the residue instance (PCL), and the importance of within-class and between-class (IWB) residue instances. Outlier scores are computed by integrating the three factors; instances with a sufficiently large score are treated as outliers and removed. The data sets without outliers are taken as input for a support vector machine (SVM) ensemble. The SVM ensemble trained on data with outliers removed performs better than the same ensemble trained on data that retain them. Our method is also more accurate than many methods from the literature on benchmark data sets. In our empirical studies, we found that some outlier interface residues are in fact near noninterface regions, and some outlier noninterface residues are close to interface regions.
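
As a rough illustration of the outlier-removal idea described in the abstract, the following Python sketch scores each residue instance and drops the high-scoring ones. Only the Dist factor follows the abstract's wording directly. The abstract gives no formulas for PCL or IWB, so `label_agreement` (a k-nearest-neighbour label fraction) is an assumed stand-in for PCL, IWB is omitted, and the multiplicative combination in `outlier_scores` and the `threshold` parameter are hypothetical choices rather than the authors' method.

```python
import math


def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dist_scores(features, labels):
    """Dist factor: distance of each instance from the center of its own class."""
    centers = {
        lab: centroid([f for f, l in zip(features, labels) if l == lab])
        for lab in set(labels)
    }
    return [euclidean(f, centers[l]) for f, l in zip(features, labels)]


def label_agreement(features, labels, k=3):
    """Assumed stand-in for PCL: fraction of the k nearest neighbours
    that share the instance's class label."""
    out = []
    for i, f in enumerate(features):
        neighbours = sorted(
            (j for j in range(len(features)) if j != i),
            key=lambda j: euclidean(f, features[j]),
        )[:k]
        out.append(sum(labels[j] == labels[i] for j in neighbours) / k)
    return out


def outlier_scores(features, labels, k=3):
    """Hypothetical combination: an instance far from its class center
    whose neighbours mostly carry the other label gets a high score."""
    dist = dist_scores(features, labels)
    max_d = max(dist) or 1.0  # avoid division by zero if all distances are 0
    agree = label_agreement(features, labels, k)
    return [(d / max_d) * (1.0 - a) for d, a in zip(dist, agree)]


def remove_outliers(features, labels, threshold, k=3):
    """Keep only instances whose outlier score does not exceed the threshold."""
    scores = outlier_scores(features, labels, k)
    keep = [i for i, s in enumerate(scores) if s <= threshold]
    return [features[i] for i in keep], [labels[i] for i in keep]
```

In a pipeline of the kind the abstract outlines, the filtered features and labels returned by `remove_outliers` would then be passed to the classifier training step (an SVM ensemble in the paper). The threshold would need to be tuned on held-out data.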

Index Terms:
support vector machines, benchmark testing, bioinformatics, molecular biophysics, proteins, noninterface region, interface prediction, protein heterocomplex, sequence-based understanding, protein binding interface, protein systems, redundancy problem, protein interaction data, residue distance, residue instance probability, support vector machine ensemble, SVM ensemble, benchmark data sets, outlier interface residues, training, training data, educational institutions, vectors, outlier detection, protein-protein interaction
Limsoon Wong, Peng Chen, Jinyan Li, "Detection of Outlier Residues for Improving Interface Prediction in Protein Heterocomplexes," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 4, pp. 1155-1165, July-Aug. 2012, doi:10.1109/TCBB.2012.58