Incomplete taxa, incomplete characters, and phylogenetic accuracy: Is there a missing data problem?

Research output: Contribution to journal › Article › peer-review

137 Scopus citations


The problem of missing data is often considered to be the most significant obstacle in reconstructing the phylogeny of fossil taxa and their relationships to extant taxa. In this paper, I review the results of recent simulation studies and present new results that explore how missing data affect phylogenetic accuracy, which is defined here as the success of a method at reconstructing the true phylogeny. Missing data cells are typically added to a phylogenetic analysis in the form of incomplete taxa (e.g., highly fragmentary fossil taxa) or incomplete characters (e.g., a set of DNA sequence or soft anatomical characters in an analysis including living and fossil taxa). These two types of incomplete data affect phylogenetic analyses in two very different ways, suggesting that there is not a single “missing data problem.” Recent simulation results show that including incomplete taxa is a problem of including too few characters rather than too many missing data cells—if enough characters are scored in these taxa, even the relationships of highly incomplete taxa (e.g., 95% missing data) can be accurately reconstructed. Including incomplete characters is largely a problem of taxon sampling. Adding incomplete characters can improve accuracy under many conditions, but inadequate taxon sampling in these characters can lead to problems of long branch attraction (which causes methods to reconstruct an incorrect tree). New simulation results show that highly incomplete taxa may have little impact on the relationships estimated for the complete taxa. Thus, adding highly incomplete taxa may not adversely affect relationships among the complete taxa. However, these added taxa may be unable to improve accuracy for the complete taxa if they are too incomplete. These results suggest that analyses which combine data from fossils and molecular data sets can be successful, despite large amounts of missing data. 
The accuracy of these analyses will depend on adequate sampling of characters for fossil taxa and adequate sampling of taxa for molecular data sets.
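The distinction drawn above between the proportion of missing cells and the number of characters actually scored can be illustrated with simple counting. The sketch below is not taken from the paper: the two taxa, the matrix sizes, and the function names are hypothetical, and it assumes the "?" symbol commonly used for missing entries in character matrices. It only shows that a taxon with 95% missing data in a large matrix can still carry more scored characters than a half-complete taxon in a small one.

```python
MISSING = "?"  # assumed missing-data symbol; common in character matrices


def scored_count(row):
    """Number of characters with an observed (non-missing) state."""
    return sum(1 for state in row if state != MISSING)


def missing_fraction(row):
    """Proportion of cells in the row that are missing."""
    return 1 - scored_count(row) / len(row)


# Hypothetical taxa, invented for illustration only.
# A fragmentary fossil scored for 100 of 2000 characters (95% missing).
fragmentary_fossil = ["0"] * 100 + [MISSING] * 1900
# A taxon scored for 50 of 100 characters (50% missing).
small_matrix_taxon = ["1"] * 50 + [MISSING] * 50

for name, row in [
    ("fragmentary_fossil", fragmentary_fossil),
    ("small_matrix_taxon", small_matrix_taxon),
]:
    print(
        f"{name}: {scored_count(row)} scored characters, "
        f"{missing_fraction(row):.0%} missing"
    )
```

Whether 100 scored characters is enough for accurate placement depends on the conditions explored in the simulations; the counts here are only meant to make the two quantities easy to tell apart.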

Original language: English (US)
Pages (from-to): 297-310
Number of pages: 14
Journal: Journal of Vertebrate Paleontology
Issue number: 2
State: Published - Jun 2003

ASJC Scopus subject areas

  • Palaeontology

