Phylogenomics reveals an almost perfect polytomy among the almost ungulates (Paenungulata)
Data files
Nov 29, 2023 version, files total 914.49 MB
- All_gene_trees.tar.gz (50.63 MB)
- completed_iqtree_elephant_tree_test.tar.gz (687.44 MB)
- README.md (491 B)
- treefile_with_ensembl_all.treefile (176.42 MB)
Abstract
Phylogenetic studies have resolved most relationships among Eutherian Orders. However, the branching order of elephants (Proboscidea), hyraxes (Hyracoidea), and sea cows (Sirenia) (i.e., the Paenungulata) has remained uncertain since at least 1758, when Linnaeus grouped elephants and manatees into a single Order (Bruta) to the exclusion of hyraxes. Subsequent morphological, molecular, and large-scale phylogenomic datasets have reached conflicting conclusions on the branching order within Paenungulates. We use a phylogenomic dataset of alignments from 13,388 protein-coding genes across 261 Eutherian mammals to infer phylogenetic relationships within Paenungulates. We find that gene trees almost equally support the three alternative resolutions of Paenungulate relationships, and that despite strong support for a Proboscidea+Hyracoidea split in the multispecies coalescent (MSC) tree, there is significant evidence for gene tree uncertainty, incomplete lineage sorting, and introgression among Proboscidea, Hyracoidea, and Sirenia. Indeed, only 8-10% of genes have statistically significant phylogenetic signal to reject the hypothesis of a Paenungulate polytomy. These data indicate little support for any resolution of the branching order of Proboscidea, Hyracoidea, and Sirenia within Paenungulata and suggest that Paenungulata may be as close to a real, or at least unresolvable, polytomy as possible.
README: Phylogenomics reveals an almost perfect polytomy among the almost ungulates (Paenungulata)
https://doi.org/10.5061/dryad.j0zpc86n3
Generated trees with IQTREE-2.2.2
- All_gene_trees.tar.gz - 13,388 gene trees generated by IQTREE
- completed_iqtree_elephant_tree_test.tar.gz - 13,388 gene trees with alternative branch options generated by IQTREE, as well as other output files from IQTREE
- treefile_with_ensembl_all.treefile - TreeFile containing all generated trees
Methods
Phylogenomic Analyses
We used IQTREE2 v.2.2.2 COVID-edition (Nguyen et al. 2015) to infer maximum likelihood phylogenetic gene trees from all 13,491 genes (nucleotide sequences) after the best-fitting model was identified for each gene by ModelFinder v.1.42 (Kalyaanamoorthy et al. 2017) with the -m MFP option; note that tree searches failed for 103 genes, so our final dataset includes 13,388 gene trees. Branch supports were assessed with the Shimodaira-Hasegawa-like approximate likelihood ratio (SH-like aLRT) test with 1000 replicates (Anisimova et al. 2006; Guindon et al. 2010) with the -alrt 1000 option. We also used IQTREE2 to perform tree topology tests with the RELL approximation (Kishino et al. 1990), including the bootstrap proportion, the Kishino-Hasegawa (Kishino and Hasegawa 1989) and Shimodaira-Hasegawa tests (Shimodaira and Hasegawa 1999), the weighted Kishino-Hasegawa and Shimodaira-Hasegawa tests, expected likelihood weights (Strimmer and Rambaut 2002), and the approximately unbiased test (Shimodaira 2002) with the -zb 10000 -au -zw options; a tree with Paenungulates as a polytomy was compared to the three alternate resolutions of this clade (Figure 3A). The 13,388 gene trees were used to infer a species tree under the multispecies coalescent (MSC) model with ASTRAL-III v.5.6.3 (Zhang et al. 2018); support for the ASTRAL species tree was also assessed with the gene-tree bootstrap approach (Simmons et al. 2019), which generated 100 pseudoreplicate datasets of the 13,388 gene trees followed by ASTRAL species tree inference on each of the 100 datasets using the msc_tree_resampling.pl script (Simmons et al. 2019). Gene and site concordance factors (Minh et al. 2020) were inferred with IQTREE2 v.2.2.2 COVID-edition, using the updated maximum likelihood-based method (--scfl option) for site concordance factors (Mo et al. 2022), and the ASTRAL species tree.
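The IQTREE2 options named above can be gathered into per-gene command lines. The following is a minimal sketch only, not the commands actually run for this dataset: the executable name `iqtree2`, the file names, the output prefixes, and the reuse of `-m MFP` for the topology tests are all assumptions made for illustration. The `-s`, `-z`, and `--prefix` flags are standard IQ-TREE 2 options for the alignment, the candidate tree file, and the output prefix, but they are not stated in the text above.

```python
from __future__ import annotations

from pathlib import Path


def gene_tree_command(alignment: Path, prefix: str, exe: str = "iqtree2") -> list[str]:
    """Build an argument list for ML gene-tree inference on one gene alignment."""
    return [
        exe,                   # executable name is an assumption
        "-s", str(alignment),  # nucleotide alignment for a single gene
        "-m", "MFP",           # ModelFinder, then tree search under the selected model
        "-alrt", "1000",       # SH-like aLRT branch support with 1000 replicates
        "--prefix", prefix,    # output prefix (hypothetical naming scheme)
    ]


def topology_test_command(
    alignment: Path, candidate_trees: Path, prefix: str, exe: str = "iqtree2"
) -> list[str]:
    """Build an argument list for the tree topology tests on one gene alignment."""
    return [
        exe,
        "-s", str(alignment),
        "-m", "MFP",                 # assumption: model selection repeated for the tests
        "-z", str(candidate_trees),  # file holding the polytomy and the three resolutions
        "-zb", "10000",              # 10,000 RELL replicates
        "-au",                       # approximately unbiased test
        "-zw",                       # weighted KH and SH tests
        "--prefix", prefix,
    ]


if __name__ == "__main__":
    # Hypothetical file names; print the commands rather than running them.
    aln = Path("gene_0001.fasta")
    print(" ".join(gene_tree_command(aln, "gene_0001")))
    print(" ".join(topology_test_command(aln, Path("paenungulata_candidates.trees"), "gene_0001_test")))
```

Returning argument lists rather than shell strings keeps the sketch safe to pass to `subprocess.run` without quoting issues, should the commands be executed.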
Characterizing incomplete lineage sorting and introgression
We used two methods to detect patterns of introgression between Paenungulate lineages based on the distributions of gene tree topologies and branch lengths for triplets of lineages. If the species tree is ((A, B), C), as in the Paratethytheria resolution inferred with the ASTRAL species tree, these tests can detect introgression between A (Proboscidea) and C (Sirenia), and between B (Hyracoidea) and C (Sirenia). The discordant-count test (DCT) compares the number of genes supporting each of the two possible discordant gene trees, ((A, C), B) and (A, (B, C)); in the absence of ancestral population structure, gene genealogies from loci experiencing ILS will show either topology with equal probability, and ILS alone is not expected to bias the count towards one of the topologies (Huson et al. 2005; Suvorov et al. 2022). Introgression, however, will lead to a statistically significant difference in the number of gene trees, which can be evaluated with a χ² test (Lanfear 2018; Suvorov et al. 2022); if there is introgression between A and C, there will be an excess of gene trees with the ((A, C), B) topology (Suvorov et al. 2022). The branch-length test (BLT) examines branch lengths to estimate the age of the most recent coalescence event (measured in substitutions per site); introgression leads to more recent coalescences than expected under the species tree topology with complete lineage sorting, while ILS shows older coalescence events (Green et al. 2010; Suvorov et al. 2022).
ILS alone does not result in different coalescence times between the two discordant topologies, and this forms the null hypothesis for the BLT: the distributions of branch lengths of gene trees supporting the ((A, C), B) and (A, (B, C)) topologies should be similar (Suvorov et al. 2022). In the presence of introgression, these branch length distributions will be skewed: ((A, C), B) < (A, (B, C)) suggests introgression consistent with discordant topology ((A, C), B), whereas ((A, C), B) > (A, (B, C)) suggests introgression consistent with discordant topology (A, (B, C)). We implemented two versions of this test, one that does not scale branch lengths by total tree length and one that does (Suvorov et al. 2022); for the former, we tested the statistical significance of differences between the distributions of branch lengths with a Kruskal-Wallis one-way ANOVA, while for the latter we used a Mann-Whitney U test (Suvorov et al. 2022). P-values were corrected for multiple testing within the DCT and BLT with Holm's method (Holm 1979). For the scaled BLT and DCT with all trios within Paenungulates, we used the blt_dct_test.r script (Suvorov et al. 2022).
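To make the logic of the discordant-count test and the Holm correction concrete, here is a minimal standard-library sketch. It is not the blt_dct_test.r script: it assumes the χ² test is a one-degree-of-freedom goodness-of-fit test against equal expected counts for the two discordant topologies, and the gene tree counts in the usage example are hypothetical, not results from this dataset.

```python
import math


def dct_chi_square(n_ac_b: int, n_a_bc: int) -> tuple[float, float]:
    """Chi-square statistic and p-value (1 df) for equal counts of the two
    discordant topologies ((A, C), B) and (A, (B, C))."""
    total = n_ac_b + n_a_bc
    if total == 0:
        raise ValueError("at least one discordant gene tree is required")
    expected = total / 2  # under ILS alone both topologies are equally likely
    chi2 = ((n_ac_b - expected) ** 2 + (n_a_bc - expected) ** 2) / expected
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def holm_adjust(p_values: list[float]) -> list[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[i] = running_max
    return adjusted


if __name__ == "__main__":
    # Hypothetical discordant gene tree counts for three trios.
    trios = {"trio_1": (150, 100), "trio_2": (120, 118), "trio_3": (90, 130)}
    raw = [dct_chi_square(*counts)[1] for counts in trios.values()]
    for name, p, p_adj in zip(trios, raw, holm_adjust(raw)):
        print(f"{name}: p = {p:.4g}, Holm-adjusted p = {p_adj:.4g}")
```

A significant excess of one discordant topology in this sketch would point to introgression between the corresponding pair of lineages, whereas roughly equal counts are what ILS alone predicts.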