Skip to main content
Dryad logo

Supplementary material for: Building alternative consensus trees and supertrees using k-means and Robinson and Foulds distance

Citation

Makarenkov, Vladimir (2021), Supplementary material for: Building alternative consensus trees and supertrees using k-means and Robinson and Foulds distance, Dryad, Dataset, https://doi.org/10.5061/dryad.ffbg79ctw

Abstract

Each gene has its own evolutionary history which can substantially differ from the evolutionary histories of other genes. For example, some individual genes or operons can be affected by specific horizontal gene transfer and recombination events. Thus, the evolutionary history of each gene should be represented by its own phylogenetic tree which may display different evolutionary patterns from the species tree that accounts for the main patterns of vertical descent. The output of traditional consensus tree or supertree inference methods is a unique consensus tree or supertree. Here, we describe a new efficient method for inferring multiple alternative consensus trees and supertrees to best represent the most important evolutionary patterns of a given set of phylogenetic trees (i.e. additive trees or X-trees). We show how a specific version of the popular k-means clustering algorithm, based on some interesting properties of the Robinson and Foulds topological distance, can be used to partition a given set of trees into one (when the data are homogeneous) or multiple (when the data are heterogeneous) cluster(s) of trees. We adapt the popular Caliński-Harabasz, Silhouette, Ball and Hall, and Gap cluster validity indices to tree clustering with k-means. A special attention is paid to the relevant but very challenging problem of inferring alternative supertrees, built from phylogenies constructed for different, but mutually overlapping, sets of taxa. The use of the Euclidean approximation in the objective function of the method makes it faster than the existing tree clustering techniques, and thus perfectly suitable for the analysis of large genomic datasets.