TY - JOUR
T1 - Learning Parameter-Advising Sets for Multiple Sequence Alignment
AU - Deblasio, Dan
AU - Kececioglu, John
N1 - Funding Information:
The authors would like to thank William Pearson for pointing them to the VTML family [21] of substitution scoring matrices. This work was supported by a US National Science Foundation Grant IIS-1217886 to J.K., and a PhD fellowship to D.D. from the University of Arizona IGERT in Comparative Genomics through US National Science Foundation Grant DGE-0654435. A preliminary conference version of this paper appeared as [1]. Dan DeBlasio is the corresponding author.
Publisher Copyright:
© 2004-2012 IEEE.
PY - 2017/9/1
Y1 - 2017/9/1
N2 - While the multiple sequence alignment output by an aligner strongly depends on the parameter values used for the alignment scoring function (such as the choice of gap penalties and substitution scores), most users rely on the single default parameter setting provided by the aligner. A different parameter setting, however, might yield a much higher-quality alignment for the specific set of input sequences. The problem of picking a good choice of parameter values for specific input sequences is called parameter advising. A parameter advisor has two ingredients: (i) a set of parameter choices to select from, and (ii) an estimator that provides an estimate of the accuracy of the alignment computed by the aligner using a parameter choice. The parameter advisor picks the parameter choice from the set whose resulting alignment has highest estimated accuracy. In this paper, we consider for the first time the problem of learning the optimal set of parameter choices for a parameter advisor that uses a given accuracy estimator. The optimal set is one that maximizes the expected true accuracy of the resulting parameter advisor, averaged over a collection of training data. While we prove that learning an optimal set for an advisor is NP-complete, we show there is a natural approximation algorithm for this problem, and prove a tight bound on its approximation ratio. Experiments with an implementation of this approximation algorithm on biological benchmarks, using various accuracy estimators from the literature, show it finds sets for advisors that are surprisingly close to optimal. Furthermore, the resulting parameter advisors are significantly more accurate in practice than simply aligning with a single default parameter choice.
AB - While the multiple sequence alignment output by an aligner strongly depends on the parameter values used for the alignment scoring function (such as the choice of gap penalties and substitution scores), most users rely on the single default parameter setting provided by the aligner. A different parameter setting, however, might yield a much higher-quality alignment for the specific set of input sequences. The problem of picking a good choice of parameter values for specific input sequences is called parameter advising. A parameter advisor has two ingredients: (i) a set of parameter choices to select from, and (ii) an estimator that provides an estimate of the accuracy of the alignment computed by the aligner using a parameter choice. The parameter advisor picks the parameter choice from the set whose resulting alignment has highest estimated accuracy. In this paper, we consider for the first time the problem of learning the optimal set of parameter choices for a parameter advisor that uses a given accuracy estimator. The optimal set is one that maximizes the expected true accuracy of the resulting parameter advisor, averaged over a collection of training data. While we prove that learning an optimal set for an advisor is NP-complete, we show there is a natural approximation algorithm for this problem, and prove a tight bound on its approximation ratio. Experiments with an implementation of this approximation algorithm on biological benchmarks, using various accuracy estimators from the literature, show it finds sets for advisors that are surprisingly close to optimal. Furthermore, the resulting parameter advisors are significantly more accurate in practice than simply aligning with a single default parameter choice.
KW - Multiple sequence alignment
KW - accuracy estimation
KW - alignment scoring functions
KW - parameter advising
KW - parameter values
UR - http://www.scopus.com/inward/record.url?scp=85018521836&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85018521836&partnerID=8YFLogxK
U2 - 10.1109/TCBB.2015.2430323
DO - 10.1109/TCBB.2015.2430323
M3 - Article
C2 - 28991725
AN - SCOPUS:85018521836
SN - 1545-5963
VL - 14
SP - 1028
EP - 1041
JO - IEEE/ACM Transactions on Computational Biology and Bioinformatics
JF - IEEE/ACM Transactions on Computational Biology and Bioinformatics
IS - 5
M1 - 7102700
ER -