TY - JOUR
T1 - Phylogenomics with incomplete taxon coverage
T2 - The limits to inference
AU - Sanderson, Michael J.
AU - McMahon, Michelle M.
AU - Steel, Mike
N1 - Funding Information:
This research was supported by a US NSF grant to MJS and MMM. MS thanks the Allan Wilson Centre for Molecular Ecology and Evolution, and the NZ Royal Society (James Cook Fellowship). We thank Tandy Warnow for discussion.
PY - 2010
Y1 - 2010
N2 - Background. Phylogenomic studies based on multi-locus sequence data sets are usually characterized by partial taxon coverage, in which sequences for some loci are missing for some taxa. The impact of missing data has been widely studied in phylogenetics, but it has proven difficult to distinguish effects due to error in tree reconstruction from effects due to missing data per se. We approach this problem using a explicitly phylogenomic criterion of success, decisiveness, which refers to whether the pattern of taxon coverage allows for uniquely defining a single tree for all taxa. Results. We establish theoretical bounds on the impact of missing data on decisiveness. Results are derived for two contexts: a fixed taxon coverage pattern, such as that observed from an already assembled data set, and a randomly generated pattern derived from a process of sampling new data, such as might be observed in an ongoing comparative genomics sequencing project. Lower bounds on how many loci are needed for decisiveness are derived for the former case, and both lower and upper bounds for the latter. When data are not decisive for all trees, we estimate the probability of decisiveness and the chances that a given edge in the tree will be distinguishable. Theoretical results are illustrated using several empirical examples constructed by mining sequence databases, genomic libraries such as ESTs and BACs, and complete genome sequences. Conclusion. Partial taxon coverage among loci can limit phylogenomic inference by making it impossible to distinguish among multiple alternative trees. However, even though lack of decisiveness is typical of many sparse phylogenomic data sets, it is often still possible to distinguish a large fraction of edges in the tree.
AB - Background. Phylogenomic studies based on multi-locus sequence data sets are usually characterized by partial taxon coverage, in which sequences for some loci are missing for some taxa. The impact of missing data has been widely studied in phylogenetics, but it has proven difficult to distinguish effects due to error in tree reconstruction from effects due to missing data per se. We approach this problem using a explicitly phylogenomic criterion of success, decisiveness, which refers to whether the pattern of taxon coverage allows for uniquely defining a single tree for all taxa. Results. We establish theoretical bounds on the impact of missing data on decisiveness. Results are derived for two contexts: a fixed taxon coverage pattern, such as that observed from an already assembled data set, and a randomly generated pattern derived from a process of sampling new data, such as might be observed in an ongoing comparative genomics sequencing project. Lower bounds on how many loci are needed for decisiveness are derived for the former case, and both lower and upper bounds for the latter. When data are not decisive for all trees, we estimate the probability of decisiveness and the chances that a given edge in the tree will be distinguishable. Theoretical results are illustrated using several empirical examples constructed by mining sequence databases, genomic libraries such as ESTs and BACs, and complete genome sequences. Conclusion. Partial taxon coverage among loci can limit phylogenomic inference by making it impossible to distinguish among multiple alternative trees. However, even though lack of decisiveness is typical of many sparse phylogenomic data sets, it is often still possible to distinguish a large fraction of edges in the tree.
UR - http://www.scopus.com/inward/record.url?scp=77952490857&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77952490857&partnerID=8YFLogxK
U2 - 10.1186/1471-2148-10-155
DO - 10.1186/1471-2148-10-155
M3 - Article
C2 - 20500873
AN - SCOPUS:77952490857
SN - 1471-2148
VL - 10
JO - BMC Evolutionary Biology
JF - BMC Evolutionary Biology
IS - 1
M1 - 155
ER -