Issue No.05 - Sept.-Oct. (2013 vol.10)
pp: 1201-1210
Pavel P. Kuksa, NEC Laboratories America Inc, Princeton
String kernel-based machine learning methods have been highly successful in the analysis of structured and sequential data. They often achieve state-of-the-art performance on sequence analysis tasks such as biological sequence classification, remote homology detection, and protein superfamily and fold prediction. However, typical string kernel methods operate on discrete 1D string data (e.g., DNA or amino acid sequences). In this paper, we address multiclass biological sequence classification using multivariate representations in the form of sequences of feature vectors (as in biological sequence profiles, or sequences of individual amino acid physicochemical descriptors) and a class of multivariate string kernels that exploit these representations. On three protein sequence classification tasks, the proposed multivariate representations and kernels yield significant improvements of 15-20 percent over existing state-of-the-art sequence classification methods.
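To make the distinction between a discrete 1D string kernel and a kernel over sequences of feature vectors concrete, here is a minimal Python sketch. The spectrum kernel (an inner product of k-mer counts) is a standard string kernel. The `binarize` quantizer, its `thresholds` parameter, and the per-dimension sum in `multivariate_kernel` are illustrative assumptions only, and should not be read as the construction used in the paper.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every contiguous length-k window of seq (a str or a tuple of symbols)."""
    return Counter(tuple(seq[i:i + k]) for i in range(len(seq) - k + 1))


def spectrum_kernel(x, y, k):
    """Spectrum kernel: inner product of the two k-mer count vectors."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    return sum(n * cy[kmer] for kmer, n in cx.items())


def binarize(vectors, thresholds):
    """Hypothetical quantizer: turn a sequence of feature vectors into one
    0/1 symbol sequence per feature dimension (1 if value > threshold)."""
    return [tuple(int(v[d] > t) for v in vectors) for d, t in enumerate(thresholds)]


def multivariate_kernel(X, Y, k, thresholds):
    """Assumed combination rule: sum of per-dimension spectrum kernels
    computed on the quantized sequences."""
    qx, qy = binarize(X, thresholds), binarize(Y, thresholds)
    return sum(spectrum_kernel(a, b, k) for a, b in zip(qx, qy))
```

For example, `spectrum_kernel("ABAB", "BABA", 2)` compares two discrete strings, while `multivariate_kernel(X, Y, 2, thresholds)` accepts sequences of real-valued vectors such as per-residue physicochemical descriptors. Because the result is a sum of inner products, it remains a valid positive semidefinite kernel under these assumptions.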
Sequential analysis, Kernel, Amino acids, Protein sequence, Quantization, Machine learning, Kernel methods, Biological sequence classification
Pavel P. Kuksa, "Biological Sequence Classification with Multivariate String Kernels", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol.10, no. 5, pp. 1201-1210, Sept.-Oct. 2013, doi:10.1109/TCBB.2013.15