TY - JOUR
T1 - Model selection for mixtures of mutagenetic trees
AU - Yin, Junming
AU - Beerenwinkel, Niko
AU - Rahnenführer, Jörg
AU - Lengauer, Thomas
N1 - Funding Information:
Part of this work has been performed in the context of the BioSapiens Network of Excellence (EU contract no. LSHG-CT-2003-503265). Financial support was provided by an IMPRS scholarship (J.Y.), by Deutsche Forschungsgemeinschaft under grant No. HO 1582/1-3 and BE 3217/1-1 (N.B.) and by BMBF grant No. 01GR0453 (J.R.). We would like to thank Michael I. Jordan and Simon Lacoste-Julien for helpful discussion.
PY - 2006
Y1 - 2006
N2 - The evolution of drug resistance in HIV is characterized by the accumulation of resistance-associated mutations in the HIV genome. Mutagenetic trees, a family of restricted Bayesian tree models, have been applied to infer the order and rate of occurrence of these mutations. Understanding and predicting this evolutionary process is an important prerequisite for the rational design of antiretroviral therapies. In practice, mixture models of K mutagenetic trees provide more flexibility and are often more appropriate for modelling observed mutational patterns. Here, we investigate the model selection problem for K-mutagenetic trees mixture models. We evaluate several classical model selection criteria including cross-validation, the Bayesian Information Criterion (BIC), and the Akaike Information Criterion. We also use the empirical Bayes method by constructing a prior probability distribution for the parameters of a mutagenetic trees mixture model and deriving the posterior probability of the model. In addition to the model dimension, we consider the redundancy of a mixture model, which is measured by comparing the topologies of trees within a mixture model. Based on the redundancy, we propose a new model selection criterion, which is a modification of the BIC. Experimental results on simulated and on real HIV data show that the classical criteria tend to select models with far too many tree components. Only cross-validation and the modified BIC recover the correct number of trees and the tree topologies most of the time. At the same optimal performance, the runtime of the new BIC modification is about one order of magnitude lower. Thus, this model selection criterion can also be used for large data sets for which cross-validation becomes computationally infeasible.
AB - The evolution of drug resistance in HIV is characterized by the accumulation of resistance-associated mutations in the HIV genome. Mutagenetic trees, a family of restricted Bayesian tree models, have been applied to infer the order and rate of occurrence of these mutations. Understanding and predicting this evolutionary process is an important prerequisite for the rational design of antiretroviral therapies. In practice, mixture models of K mutagenetic trees provide more flexibility and are often more appropriate for modelling observed mutational patterns. Here, we investigate the model selection problem for K-mutagenetic trees mixture models. We evaluate several classical model selection criteria including cross-validation, the Bayesian Information Criterion (BIC), and the Akaike Information Criterion. We also use the empirical Bayes method by constructing a prior probability distribution for the parameters of a mutagenetic trees mixture model and deriving the posterior probability of the model. In addition to the model dimension, we consider the redundancy of a mixture model, which is measured by comparing the topologies of trees within a mixture model. Based on the redundancy, we propose a new model selection criterion, which is a modification of the BIC. Experimental results on simulated and on real HIV data show that the classical criteria tend to select models with far too many tree components. Only cross-validation and the modified BIC recover the correct number of trees and the tree topologies most of the time. At the same optimal performance, the runtime of the new BIC modification is about one order of magnitude lower. Thus, this model selection criterion can also be used for large data sets for which cross-validation becomes computationally infeasible.
KW - BIC
KW - Empirical Bayes
KW - Mixtures of mutagenetic trees
KW - Model selection
UR - http://www.scopus.com/inward/record.url?scp=85045416681&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85045416681&partnerID=8YFLogxK
U2 - 10.2202/1544-6115.1164
DO - 10.2202/1544-6115.1164
M3 - Article
C2 - 17049028
AN - SCOPUS:85045416681
SN - 1544-6115
VL - 5
SP - i-23
JO - Statistical Applications in Genetics and Molecular Biology
JF - Statistical Applications in Genetics and Molecular Biology
IS - 1
M1 - 17
ER -