TY - JOUR

T1 - Learning scoring schemes for sequence alignment from partial examples

AU - Kim, Eagu

AU - Kececioglu, John

N1 - Funding Information:
The authors wish to thank Chuong Do for helpful discussions, and Travis Wheeler for assistance with using Opal [36]. We also thank the anonymous referees for suggestions that improved the paper. This research was supported by the US National Science Foundation through Grant DBI-0317498. An earlier version of this paper appeared in [21].

PY - 2008/10

Y1 - 2008/10

N2 - When aligning biological sequences, the choice of scoring scheme is critical. Even small changes in gap penalties, for example, can yield radically different alignments. A rigorous way to learn parameter values that are appropriate for biological sequences is through inverse parametric sequence alignment. Given a collection of examples of biologically correct reference alignments, this is the problem of finding parameter values that make the scores of the reference alignments be as close as possible to those of optimal alignments of their sequences. We extend prior work on inverse parametric alignment to partial examples, which contain regions where the reference alignment is not specified, and to an improved formulation based on minimizing the average error between the scores of the reference alignments and the scores of optimal alignments. Experiments on benchmark biological alignments show we can learn scoring schemes that generalize across protein families, and that boost the accuracy of multiple sequence alignment by as much as 25 percent.

AB - When aligning biological sequences, the choice of scoring scheme is critical. Even small changes in gap penalties, for example, can yield radically different alignments. A rigorous way to learn parameter values that are appropriate for biological sequences is through inverse parametric sequence alignment. Given a collection of examples of biologically correct reference alignments, this is the problem of finding parameter values that make the scores of the reference alignments be as close as possible to those of optimal alignments of their sequences. We extend prior work on inverse parametric alignment to partial examples, which contain regions where the reference alignment is not specified, and to an improved formulation based on minimizing the average error between the scores of the reference alignments and the scores of optimal alignments. Experiments on benchmark biological alignments show we can learn scoring schemes that generalize across protein families, and that boost the accuracy of multiple sequence alignment by as much as 25 percent.

KW - Affine gap penalties

KW - Cutting plane algorithms

KW - Inverse parametric sequence alignment

KW - Linear programming

KW - Substitution score matrices

UR - http://www.scopus.com/inward/record.url?scp=55949093159&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=55949093159&partnerID=8YFLogxK

U2 - 10.1109/TCBB.2008.57

DO - 10.1109/TCBB.2008.57

M3 - Article

C2 - 18989042

AN - SCOPUS:55949093159

SN - 1545-5963

VL - 5

SP - 546

EP - 556

JO - IEEE/ACM Transactions on Computational Biology and Bioinformatics

JF - IEEE/ACM Transactions on Computational Biology and Bioinformatics

IS - 4

M1 - 4540089

ER -