DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TCBB.2013.15
Pavel P. Kuksa, NEC Laboratories America Inc, Princeton
String kernel-based machine learning methods have been highly successful in the analysis of structured and sequential data. They often achieve state-of-the-art performance on sequence analysis tasks such as biological sequence classification, remote homology detection, or protein superfamily and fold prediction. However, typical string kernel methods rely on analysis of discrete one-dimensional (1D) string data (e.g., DNA or amino acid sequences). In this work, we address multi-class biological sequence classification problems using multivariate representations in the form of sequences of feature vectors (as in biological sequence profiles, or sequences of individual amino acid physico-chemical descriptors) and a class of multivariate string kernels that exploit these representations. On a number of protein sequence classification tasks, the proposed multivariate representations and kernels show significant 15-20% improvements compared to existing state-of-the-art sequence classification methods.
multivariate string kernels, biological sequence classification, kernel methods, string kernels, multivariate sequence representations
P. P. Kuksa, "Biological Sequence Analysis with Multivariate String Kernels," in IEEE/ACM Transactions on Computational Biology and Bioinformatics.