TY - JOUR

T1 - Accuracy in Near-Perfect Virus Phylogenies

AU - Wertheim, Joel O.

AU - Steel, Mike

AU - Sanderson, Michael J.

N1 - Publisher Copyright:
© 2021 The Author(s) 2021. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For permissions, please email: journals.permissions@oup.com.

PY - 2022/3/1

Y1 - 2022/3/1

N2 - Phylogenetic trees from real-world data often include short edges with very few substitutions per site, which can lead to partially resolved trees and poor accuracy. Theory indicates that the number of sites needed to accurately reconstruct a fully resolved tree grows at a rate proportional to the inverse square of the length of the shortest edge. However, when inferred trees are partially resolved due to short edges, "accuracy"should be defined as the rate of discovering false splits (clades on a rooted tree) relative to the actual number found. Thus, accuracy can be high even if short edges are common. Specifically, in a "near-perfect"parameter space in which trees are large, the tree length $\xi$ (the sum of all edge lengths) is small, and rate variation is minimal, the expected false positive rate is less than $\xi/3$; the exact value depends on tree shape and sequence length. This expected false positive rate is far below the false negative rate for small $\xi$ and often well below 5% even when some assumptions are relaxed. We show this result analytically for maximum parsimony and explore its extension to maximum likelihood using theory and simulations. For hypothesis testing, we show that measures of split "support"that rely on bootstrap resampling consistently imply weaker support than that implied by the false positive rates in near-perfect trees. The near-perfect parameter space closely fits several empirical studies of human virus diversification during outbreaks and epidemics, including Ebolavirus, Zika virus, and SARS-CoV-2, reflecting low substitution rates relative to high transmission/sampling rates in these viruses.[Ebolavirus; epidemic; HIV; homoplasy; mumps virus; perfect phylogeny; SARS-CoV-2; virus; West Nile virus; Yule-Harding model; Zika virus.]

AB - Phylogenetic trees from real-world data often include short edges with very few substitutions per site, which can lead to partially resolved trees and poor accuracy. Theory indicates that the number of sites needed to accurately reconstruct a fully resolved tree grows at a rate proportional to the inverse square of the length of the shortest edge. However, when inferred trees are partially resolved due to short edges, "accuracy"should be defined as the rate of discovering false splits (clades on a rooted tree) relative to the actual number found. Thus, accuracy can be high even if short edges are common. Specifically, in a "near-perfect"parameter space in which trees are large, the tree length $\xi$ (the sum of all edge lengths) is small, and rate variation is minimal, the expected false positive rate is less than $\xi/3$; the exact value depends on tree shape and sequence length. This expected false positive rate is far below the false negative rate for small $\xi$ and often well below 5% even when some assumptions are relaxed. We show this result analytically for maximum parsimony and explore its extension to maximum likelihood using theory and simulations. For hypothesis testing, we show that measures of split "support"that rely on bootstrap resampling consistently imply weaker support than that implied by the false positive rates in near-perfect trees. The near-perfect parameter space closely fits several empirical studies of human virus diversification during outbreaks and epidemics, including Ebolavirus, Zika virus, and SARS-CoV-2, reflecting low substitution rates relative to high transmission/sampling rates in these viruses.[Ebolavirus; epidemic; HIV; homoplasy; mumps virus; perfect phylogeny; SARS-CoV-2; virus; West Nile virus; Yule-Harding model; Zika virus.]

UR - http://www.scopus.com/inward/record.url?scp=85124499074&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85124499074&partnerID=8YFLogxK

U2 - 10.1093/sysbio/syab069

DO - 10.1093/sysbio/syab069

M3 - Article

C2 - 34398231

AN - SCOPUS:85124499074

VL - 71

SP - 426

EP - 438

JO - Systematic Biology

JF - Systematic Biology

SN - 1063-5157

IS - 2

ER -