Artifactual orthologs and the need for diligent data exploration in complex phylogenomic datasets: A museomic case study from the Andean flora
Data files
Jan 08, 2024 version files 11.24 MB
- alignments.zip
- gene_trees.zip
- README.md
- species_trees.zip
Abstract
The Andes mountains of western South America are a globally important biodiversity hotspot, yet there is a paucity of resolved phylogenies for plant clades from this region. Filling an important gap in our understanding of the world’s richest flora, we present the first phylogeny of Freziera (Pentaphylacaceae), an Andean-centered, cloud forest radiation. Our dataset was obtained via hybrid-enriched target sequence capture of Angiosperms353 universal loci for 50 of the ca. 75 spp., obtained almost entirely from herbarium specimens. We identify high phylogenomic complexity in Freziera, including a significant proportion of paralogous loci and a high degree of gene tree discordance. Via gene tree filtering, by-eye observation of gene trees, and detailed examination of warnings from recently improved assembly pipelines, we found that cryptic paralogs (i.e., the presence of only one copy of a multi-copy gene due to assembly errors) were a major source of gene tree heterogeneity that had a negative impact on phylogenetic inference and support. These cryptic paralogs likely result from limitations in data collection that are common in museomics, combined with a history of genome duplication; they may be common in plant phylogenomic datasets. After accounting for cryptic paralogs as a source of gene tree error, we identified a significant, but non-specific, signal of introgression using Patterson’s D and f4 statistics. Despite phylogenomic complexity, we were able to resolve Freziera into nine well-supported subclades whose histories have been shaped by myriad evolutionary processes, including incomplete lineage sorting, historical gene flow, and gene duplication. Our results highlight the complexities of plant phylogenomics and point to the need to test for multiple sources of gene tree discordance via careful examination of empirical datasets.
README: Strong phylogenetic signal despite high phylogenomic complexity in an Andean plant radiation (Freziera, Pentaphylacaceae)
https://doi.org/10.5061/dryad.v9s4mw72k
Description of the data and file structure
Data files are organized into three main directories: alignments, gene trees and species trees. Below is a summary of the file structure in which each folder is noted by a bullet. Names of folders are bolded; descriptions of the folders are provided after the folder name.
- **alignments**: .nexus files, DNA sequence data
  - **Hybpiper_v1.3.1**: alignments of supercontigs assembled by HybPiper v1.3.1 for loci without paralog warnings (318 loci). Sequences identified by TreeShrink as producing abnormally long branch lengths have been removed from alignments ("treeshrunk").
    - **paralogs**: alignments of supercontigs assembled by HybPiper v1.3.1 for loci with paralog warnings (31 loci).
  - **Hybpiper_v2.0.1**: alignments of supercontigs assembled by HybPiper v2.0.1 for loci without long paralog warnings (269 loci). Sequences identified by TreeShrink as producing abnormally long branch lengths have been removed from alignments ("treeshrunk").
    - **paralogs**: alignments of supercontigs assembled by HybPiper v2.0.1 for loci with long paralog warnings (84 loci).
- **gene trees**: .tre files, Newick format
  - **datasets**: multiphylo files with filtered HybPiper v1.3.1 gene trees for each of the 11 datasets used; input files for ASTRAL species tree analyses.
    - **datasets_with_HybPiper2_gene_trees**: multiphylo files with filtered HybPiper v2.0.1 gene trees for the 2 datasets that used thresholds based on HybPiper v2.0.1 assembly.
  - **Hybpiper_v1.3.1**: gene trees inferred using IQ-TREE from alignments of supercontigs assembled by HybPiper v1.3.1 for loci without paralog warnings. Gene trees with <25 ingroup samples have been removed, leaving 313 gene trees.
  - **Hybpiper_v2.0.1**: gene trees inferred using IQ-TREE from alignments of supercontigs assembled by HybPiper v2.0.1 for loci without paralog warnings. Gene trees with <25 ingroup samples have been removed, leaving 267 gene trees.
- **species trees**: .tre files, Newick format; output files from ASTRAL analyses of the 11 datasets used.
  - **datasets_with_HybPiper2_gene_trees**: output files from ASTRAL analyses of the 2 datasets that used HybPiper v2.0.1 gene trees.
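
The gene tree folders above are described as excluding trees with fewer than 25 ingroup samples. As a rough illustration of that kind of filter (not the authors' actual script), the following Python sketch counts leaf labels in each line of a multi-tree Newick file and keeps trees that meet a minimum ingroup count. The function names, the `outgroups` set, and the assumption that leaf labels are unquoted and contain no structural characters are all hypothetical; the outgroup sample names are not listed in this README and would need to be supplied by the user.

```python
import re

# A leaf label is the text directly after "(" or "," up to the next
# structural character. Internal node labels (e.g. support values)
# follow ")" and are therefore not matched.
_LEAF = re.compile(r"(?<=[(,])\s*([^\s(),:;\[\]][^(),:;\[\]]*)")
# Bracketed Newick comments such as [&support=90] are stripped first.
_COMMENT = re.compile(r"\[[^\]]*\]")


def leaf_names(newick):
    """Return the leaf labels of one Newick string (unquoted labels assumed)."""
    cleaned = _COMMENT.sub("", newick)
    return [name.strip() for name in _LEAF.findall(cleaned)]


def filter_gene_trees(newick_lines, outgroups, min_ingroup=25):
    """Keep trees with at least `min_ingroup` leaves that are not outgroups.

    `newick_lines` is an iterable of strings, one tree per line;
    `outgroups` is a user-supplied set of outgroup sample names (assumption).
    """
    kept = []
    for line in newick_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the multi-tree file
        ingroup = [n for n in leaf_names(line) if n not in outgroups]
        if len(ingroup) >= min_ingroup:
            kept.append(line)
    return kept
```

To apply it, read a multi-tree .tre file line by line, pass the lines together with the set of outgroup names, and write the returned list back out with one tree per line. For anything beyond a quick check, a dedicated tree library is likely a safer choice than this regular-expression approach, which does not handle quoted labels.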