Incorporating hierarchical characters into phylogenetic analysis

Cite this dataset

St. John, Katherine; Hopkins, Melanie (2021). Incorporating hierarchical characters into phylogenetic analysis [Dataset]. Dryad.


Popular optimality criteria for phylogenetic trees focus on sequences of characters that are applicable to all the taxa. As studies grow in breadth, some characters may be applicable to a portion of the taxa and inapplicable to others. Past work has explored the limitations of treating inapplicable characters as missing data, noting that this strategy may favor trees where internal nodes are assigned impossible states, where the arrangement of taxa within subclades is unduly influenced by variation in distant parts of the tree, and/or where taxa that otherwise share most primary characters are grouped distantly. Approaches that avoid the first two problems have recently been proposed. Here, we propose an alternative approach that avoids all three problems. In the spirit of maximum parsimony, the proposed criterion seeks the phylogenetic tree with the minimal changes across any tree branch, but where changes are defined in terms of dissimilarity metrics that weigh the effects of inapplicable characters. The approach can accommodate binary, multistate, ordered, unordered, and polymorphic characters. We give a polynomial-time algorithm, inspired by Fitch's algorithm, to score trees under a family of dissimilarity metrics, and prove its correctness. We show that the resulting optimality criterion is computationally hard, by reduction from the NP-hard maximum parsimony optimality criterion. We demonstrate our approach using synthetic and empirical data sets and compare the results with other recently proposed methods for choosing optimal phylogenetic trees when the data include inapplicable characters.
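For readers unfamiliar with the classical algorithm that the abstract cites as inspiration, the following is a minimal sketch of standard Fitch scoring for a single unordered character on a rooted binary tree. It is not the algorithm proposed in this work, which scores trees under dissimilarity metrics. The function name `fitch_score`, the nested-tuple tree encoding, and the example taxa and states are all assumptions made for illustration and do not come from the data set.

```python
def fitch_score(tree, states):
    """Classical Fitch bottom-up pass for one unordered character.

    tree:   a leaf name (str) or a 2-tuple of subtrees (hypothetical encoding).
    states: dict mapping each leaf name to the set of states observed there.
    Returns (candidate state set at this node, number of changes below it).
    """
    if isinstance(tree, str):
        # Leaf: the candidate set is the observed state set, with no changes.
        return set(states[tree]), 0

    left, right = tree
    left_set, left_cost = fitch_score(left, states)
    right_set, right_cost = fitch_score(right, states)

    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # The children disagree: take the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1


# Illustrative four-taxon tree (((A, B), C), D) with a binary character.
tree = ((("A", "B"), "C"), "D")
observed = {"A": {0}, "B": {1}, "C": {1}, "D": {0}}
_, changes = fitch_score(tree, observed)

# Treating a taxon as "missing" is commonly modeled by allowing every state.
as_missing = dict(observed, B={0, 1})
_, changes_missing = fitch_score(tree, as_missing)
```

In this sketch, recoding taxon B as missing lets it take whichever state is cheapest, which illustrates in a small way why the missing-data treatment of inapplicable characters can influence the score.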


There are three different data sets:

* The first is a character matrix of brachiopods (Cusask et al., 1999), recoded using redundant coding (the original matrix was coded using a mixture of non-additive binary coding and composite coding). A file of character dependencies has also been added.

* The second is a character matrix of myriapods (Fernandez et al., 2016) that had been previously modified for use as a case study for managing inapplicable characters for disparity analysis (Hopkins and St. John, 2018). A file of character dependencies has also been added.

* The third consists of synthetic data sets on 8 taxa with 13 and 20 characters. A file of character dependencies has also been included.

Supplementary material, including data files and additional discussion and figures, is in SOM.pdf.


National Science Foundation

Simons Foundation