Issue No. 04 - October-December (2008 vol. 5)

ISSN: 1545-5963

pp: 546-556

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TCBB.2008.57

John Kececioglu, University of Arizona, Tucson

Eagu Kim, University of Arizona, Tucson

ABSTRACT

When aligning biological sequences, the choice of parameter values for the alignment scoring function is critical. Small changes in gap penalties, for example, can yield radically different alignments. A rigorous way to compute parameter values that are appropriate for aligning biological sequences is through inverse parametric sequence alignment. Given a collection of examples of biologically correct alignments, this is the problem of finding parameter values that make the scores of the example alignments close to those of optimal alignments for their sequences. We extend prior work on inverse parametric alignment to partial examples, which contain regions where the alignment is left unspecified, and to an improved formulation based on minimizing the average error between the score of an example and the score of an optimal alignment. Experiments on benchmark biological alignments show we can find parameters that generalize across protein families and that boost the accuracy of multiple sequence alignment by as much as 25%.
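To make the quantity being minimized concrete, here is a minimal sketch of the average error between example alignments and optimal alignments. It assumes a toy scoring scheme with only three parameters (`match`, `mismatch`, and a linear `gap` penalty) and uses the absolute score difference as the error. The function names and this error measure are illustrative assumptions, since the abstract does not give the paper's exact formulation, and the sketch only evaluates a candidate parameter choice; it does not perform the linear-programming search itself.

```python
from typing import List, Tuple


def alignment_score(row_a: str, row_b: str,
                    match: float, mismatch: float, gap: float) -> float:
    """Score a given alignment (two gapped rows of equal length), column by column."""
    assert len(row_a) == len(row_b), "alignment rows must have equal length"
    score = 0.0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score += gap          # column containing a gap
        elif x == y:
            score += match        # identical residues
        else:
            score += mismatch     # substitution
    return score


def optimal_score(a: str, b: str,
                  match: float, mismatch: float, gap: float) -> float:
    """Score of an optimal global alignment (Needleman-Wunsch, maximizing)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # a[i-1] against a gap
                            curr[j - 1] + gap))  # b[j-1] against a gap
        prev = curr
    return prev[-1]


def average_error(examples: List[Tuple[str, str]],
                  match: float, mismatch: float, gap: float) -> float:
    """Mean of (optimal score - example score) over all example alignments.

    Each term is non-negative, because the example is itself one feasible
    alignment of its sequences.
    """
    total = 0.0
    for row_a, row_b in examples:
        a = row_a.replace("-", "")
        b = row_b.replace("-", "")
        total += (optimal_score(a, b, match, mismatch, gap)
                  - alignment_score(row_a, row_b, match, mismatch, gap))
    return total / len(examples)
```

For the single example alignment `("ACGT-", "A-GTT")` with `match=1` and `mismatch=-1`, a gap penalty of `-2` gives an average error of `1.0`, while a gap penalty of `-0.5` gives `0.0`. In this toy setting, the second parameter choice makes the example alignment optimal, which is the kind of fit inverse parametric alignment seeks.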

INDEX TERMS

Pattern matching, Analysis of Algorithms and Problem Complexity, Linear programming, Biology and genetics

CITATION

John Kececioglu, Eagu Kim, "Learning Scoring Schemes for Sequence Alignment from Partial Examples",

*IEEE/ACM Transactions on Computational Biology and Bioinformatics*, vol. 5, no. 4, pp. 546-556, October-December 2008, doi:10.1109/TCBB.2008.57