Data and code from: Unravelling complex hybrid and polyploid evolutionary relationships using phylogenetic placement of homologous gene copies from target enrichment data
Data files
May 08, 2026 version files (1.29 GB total)
- ParalogPhylogenomics.tar (1.29 GB)
- README.md (4.45 KB)
Abstract
Phylogenomic datasets comprising hundreds of genes have become the standard for plant systematics and phylogenetics. However, large-scale phylogenomic studies often exclude polyploids and hybrids due to the challenges in assessing the origin of duplicated loci and incorporating them into tree reconstruction methods. Using a newly generated target enrichment dataset of 1081 genes from 452 samples from the Brassicaceae tribe Arabideae, including many hybrid and high ploidy taxa, we developed a novel approach to disentangle the evolutionary history of this phylogenetically and taxonomically challenging clade. Our approach extends beyond commonly used gene tree-species tree reconciliation techniques by using phylogenetic placement, a method adopted from metagenomics, of gene copies into a diploid tree. We show how it allows for the simultaneous assessment of the origins of ancient and recent hybrids and autopolyploids, and the detection of nested polyploidization events. Additionally, we demonstrate how synonymous substitution rates provide further evidence for the mode of polyploidization, specifically to distinguish between allo- and autopolyploidization, and to identify hybridization events involving a ghost lineage. Our approach can serve as an exploratory tool for large and complex phylogenomic datasets and can aid in identifying polyploid and hybrid clades for further analysis with specialized methods.
https://doi.org/10.5061/dryad.xksn02vqn
Description of the data and file structure
ParalogPhylogenomics.tar contains all alignments, gene trees, species trees, and phylogenetic placement results for homeologs in three datasets: 1) Full Arabideae set, 2) Selected exons Arabideae set, and 3) Arabidopsis test set.
1) Full Arabideae set:
- Folder Alignments_all: 994 nucleotide alignments of all homologous gene copies from the selected samples
- Folder Genetrees_all: 994 gene trees in Newick format with all homologous gene copies from the selected samples
- Folder Genetrees_diploids: 994 gene trees in Newick format with all homologous gene copies from diploid samples
- Folder Placement: 994 jplace files with phylogenetic placements from the RAxML Evolutionary Placement Algorithm (EPA); all homologous gene copies from "non-diploid" samples placed into the diploid gene trees
- Folder Speciestrees: ASTRAL-pro output files; complete sample set and diploid subset
- File 4_Summary_table.csv: summary output from HybPhaser
- File allsamples.smudgeplot.ploidy.txt: smudgeplot output used to infer ploidy from k-mers
- File Allsamples_details_updated.txt: accession data used for analyses in R
- File Arabideae_PPG.R: R script to analyze the data
- File diploid.selected.txt: list of diploid samples
- File genes.selected.txt: list of selected genes after filtering
- File hybpiper_stats.tsv: output file from HybPiper
- File nondiploids.selected.txt: list of nondiploid samples
- File paralog_report.tsv: output file from HybPiper
- File samples.excluded.txt: list of samples excluded after filtering
- File samples.selected.txt: list of samples selected after filtering
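The jplace files in the Placement folders are JSON documents, so they can be inspected with standard tools. The following is a minimal Python sketch, assuming the files follow the common jplace layout: a `fields` list naming the columns of each placement row, and a `placements` list whose entries carry `p` rows and `n` or `nm` names. The function names and the in-memory example record are hypothetical and are not taken from the dataset.

```python
import json


def best_placements(jplace):
    """Return {query_name: (edge_num, like_weight_ratio)} for each query's top placement."""
    fields = jplace["fields"]
    edge_idx = fields.index("edge_num")
    lwr_idx = fields.index("like_weight_ratio")
    best = {}
    for entry in jplace["placements"]:
        # Keep the placement row with the highest likelihood weight ratio.
        top = max(entry["p"], key=lambda row: row[lwr_idx])
        # Query names are stored under "n" (names) or "nm" (name, multiplicity pairs).
        names = entry.get("n") or [pair[0] for pair in entry.get("nm", [])]
        if isinstance(names, str):
            names = [names]
        for name in names:
            best[name] = (top[edge_idx], top[lwr_idx])
    return best


def load_jplace(path):
    """Read one jplace file from disk and return its parsed JSON content."""
    with open(path) as handle:
        return json.load(handle)


# Hypothetical record for illustration only; not data from the archive.
example = json.loads("""
{
  "tree": "((A:0.1{0},B:0.1{1}):0.05{2},C:0.2{3}){4};",
  "placements": [
    {"p": [[1, -1000.0, 0.8, 0.01, 0.02], [2, -1001.5, 0.2, 0.02, 0.03]],
     "n": ["sampleX_copy1"]},
    {"p": [[3, -990.0, 1.0, 0.05, 0.01]], "n": ["sampleX_copy2"]}
  ],
  "metadata": {"invocation": "hypothetical"},
  "version": 3,
  "fields": ["edge_num", "likelihood", "like_weight_ratio",
             "distal_length", "pendant_length"]
}
""")

print(best_placements(example))
```

For a file from the archive, `best_placements(load_jplace(path))` would give the same kind of summary, where `path` points to one of the extracted jplace files. The edge numbers refer to the `{n}` labels in the `tree` string of that same file.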
2) Selected exons Arabideae set:
- Folder Alignments_all: 567 nucleotide alignments of all homologous gene copies from the selected samples
- Folder Genetrees_all: 567 gene trees in Newick format with all homologous gene copies from the selected samples
- Folder Genetrees_diploids: 567 gene trees in Newick format with all homologous gene copies from diploid samples, plus a text file listing the trees that were rerooted manually
- Folder Placement: 567 jplace files with phylogenetic placements from the RAxML Evolutionary Placement Algorithm (EPA); all homologous gene copies from "non-diploid" samples placed into the diploid gene trees
- Folder Speciestrees: ASTRAL-pro output files; complete sample set and diploid subset
- File 4_Summary_table_Arabideae_exons.csv: summary output from HybPhaser
- File allsamples.smudgeplot.ploidy.txt: smudgeplot output used to infer ploidy from k-mers
- File Allsamples_details_updated.txt: accession data used for analyses in R
- File Arabideae_exons.R: R script to analyze the data
- File diploid.selected.txt: list of diploid samples
- File exons.selected.txt: list of selected exons after filtering
- File heatmap.reorder.txt: file used for reordering samples for heatmap plotting
- File nondiploids.selected.txt: list of nondiploid samples
- File samples.selected.txt: list of samples selected after filtering
3) Arabidopsis test set:
- Folder Alignments_all: 994 nucleotide alignments of all homologous gene copies from the selected samples
- Folder Genetrees_all: 994 gene trees in Newick format with all homologous gene copies from the selected samples
- Folder Genetrees_diploids: 994 gene trees in Newick format with all homologous gene copies from diploid samples
- Folder Placement: 994 jplace files with phylogenetic placements from the RAxML Evolutionary Placement Algorithm (EPA); all homologous gene copies from "non-diploid" samples placed into the diploid gene trees
- Folder Speciestrees: ASTRAL-pro output files; complete sample set and diploid subset
- File 4_Summary_table.csv: summary output from HybPhaser
- File Arabidopsis.details.txt: accession data used for analyses in R
- File Arabidopsis.diploids.txt: list of diploid samples
- File Arabidopsis.nondiploids.txt: list of nondiploid samples
- File Arabideae_PPG.R: R script to analyze the data
- File Arabidopsis_stats.tsv: output file from HybPiper
- File genes.selected.txt: list of selected genes after filtering
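The file counts listed above (for example, 994 or 567 files per folder) can be checked without unpacking the 1.29 GB archive. The following is a minimal Python sketch using the standard `tarfile` module. The function name is hypothetical, and the internal directory layout of the archive is an assumption, so the keys of the returned counter depend on how the folders are actually named inside the tar file.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def count_files_by_folder(tar_path):
    """Count regular files per parent directory inside a tar archive."""
    counts = Counter()
    with tarfile.open(tar_path) as archive:
        for member in archive:
            if member.isfile():
                # Group each file under the directory that directly contains it.
                counts[str(PurePosixPath(member.name).parent)] += 1
    return counts


# Example call, assuming the archive sits in the working directory:
# for folder, n_files in sorted(count_files_by_folder("ParalogPhylogenomics.tar").items()):
#     print(folder, n_files)
```

Comparing the printed counts with the numbers in the folder descriptions gives a quick check that the download is complete.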
