TY - GEN
T1 - Simple and fast inverse alignment
AU - Kececioglu, John
AU - Kim, Eagu
PY - 2006
Y1 - 2006
N2 - For as long as biologists have been computing alignments of sequences, the question of what values to use for scoring substitutions and gaps has persisted. While some choices for substitution scores are now common, largely due to convention, there is no standard for choosing gap penalties. An objective way to resolve this question is to learn the appropriate values by solving the Inverse String Alignment Problem: given examples of correct alignments, find parameter values that make the examples be optimal-scoring alignments of their strings. We present a new polynomial-time algorithm for Inverse String Alignment that is simple to implement, fast in practice, and for the first time can learn hundreds of parameters simultaneously. The approach is also flexible: minor modifications allow us to solve inverse unique alignment (find parameter values that make the examples be the unique optimal alignments of their strings), and inverse near-optimal alignment (find parameter values that make the example alignments be as close to optimal as possible). Computational results with an implementation for global alignment show that, for the first time, we can find best-possible values for all 212 parameters of the standard protein-sequence scoring-model from hundreds of alignments in a few minutes of computation.
AB - For as long as biologists have been computing alignments of sequences, the question of what values to use for scoring substitutions and gaps has persisted. While some choices for substitution scores are now common, largely due to convention, there is no standard for choosing gap penalties. An objective way to resolve this question is to learn the appropriate values by solving the Inverse String Alignment Problem: given examples of correct alignments, find parameter values that make the examples be optimal-scoring alignments of their strings. We present a new polynomial-time algorithm for Inverse String Alignment that is simple to implement, fast in practice, and for the first time can learn hundreds of parameters simultaneously. The approach is also flexible: minor modifications allow us to solve inverse unique alignment (find parameter values that make the examples be the unique optimal alignments of their strings), and inverse near-optimal alignment (find parameter values that make the example alignments be as close to optimal as possible). Computational results with an implementation for global alignment show that, for the first time, we can find best-possible values for all 212 parameters of the standard protein-sequence scoring-model from hundreds of alignments in a few minutes of computation.
KW - Affine gap penalties
KW - Cutting plane algorithms
KW - Linear programming
KW - Parametric sequence alignment
KW - Sequence analysis
KW - Substitution score matrices
KW - Supervised learning
UR - http://www.scopus.com/inward/record.url?scp=33745804251&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33745804251&partnerID=8YFLogxK
U2 - 10.1007/11732990_37
DO - 10.1007/11732990_37
M3 - Conference contribution
AN - SCOPUS:33745804251
SN - 3540332952
SN - 9783540332954
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 441
EP - 455
BT - Research in Computational Molecular Biology - 10th Annual International Conference, RECOMB 2006, Proceedings
T2 - 10th Annual International Conference on Research in Computational Molecular Biology, RECOMB 2006
Y2 - 2 April 2006 through 5 April 2006
ER -