A comparison of phylogenomic inference pipelines for low-coverage whole-genome sequencing in Formica ants
Data files
Version of Jan 07, 2025; files total 5.66 GB
36Assembled.Genomes.zip (1.94 GB)
DatingDivergence.zip (54.10 KB)
FinalDatasets.rename.zip (3.70 GB)
FinalTrees.rename.zip (48.60 KB)
Partitioned_likelihood_5kwinodws-12datasets_AMAS_results.zip (17.27 MB)
README.md (1.67 KB)
RF_distances-Ave_Bootstrap_6datasets.zip (1.05 MB)
Abstract
A rapid proliferation in the availability of whole genome sequences (WGS), often with relatively low read depth, offers an unprecedented opportunity for phylogenomic advances using publicly available data, but there are several key challenges in applying these data. Using low-coverage WGS data for Formica ants, we conducted detailed comparisons of two analytical pipelines (reference-based vs. de novo genome assembly), four types of datasets (5kbp-window, ultra-conserved element [UCE], single-copy ortholog [BUSCO] and mitogenome), and a series of analytical procedures (e.g., concatenation vs. coalescent analyses) to identify which are robust to typical WGS data. The results show that, at the shallow phylogenetic scale of closely related species, 5kbp-windows from the reference-based pipeline and UCEs from the de novo assemblies are more advantageous than BUSCOs in recovering informative markers for phylogenetic inference. Compared to concatenation analyses, coalescent analyses often resulted in disparate deeper relationships in the phylogeny. This study uncovers obvious mito-nuclear discordance and demonstrates genome-wide gene conflicts in phylogenetic signals, both pointing to possible incomplete lineage sorting and/or hybridization during the early, rapid radiation of Formica ants. Divergence dating analyses show that different types of data often resulted in inconsistent time estimates, with older ages estimated for deep nodes using the mitogenomic and 5kbp-window datasets. Taxon sampling that covers the diversity of a lineage is essential to accurately estimate its divergence time. The strengths and weaknesses of different analytical pipelines and strategies are discussed. Findings from this study provide valuable insights for large-scale phylogenomic projects using WGS data.
README: A comparison of phylogenomic inference pipelines for low-coverage whole-genome sequencing in Formica ants
Description of the data and file structure
36Assembled.Genomes.zip
: contains genome assemblies for 36 Formica taxa (named as "[taxon name].fa").
FinalDatasets.rename.zip
: contains all the final concatenated matrices (named as "[dataset].phy.rename" or "[dataset].fas.rename") and their corresponding partition files (named as "[dataset].partition").
FinalTrees.rename.zip
: contains the trees in Newick format from all phylogenetic analyses. The ML results from the concatenated datasets are named as "[dataset].MFM.10ML.[best run number]UFboot.suptree.rename.tre"; the ASTRAL results from the species tree analyses are named as "Astral.[dataset][bootstrap cutoff]_100boots.rename.tre".
Partitioned_likelihood_5kwinodws-12datasets_AMAS_results.zip
: provides statistics of 5kbp-windows and ΔGLS from partitioned likelihood analyses (named as "Partitioned_likelihood_analyses_5kwindow_results.xlsx"), as well as the AMAS statistics for 12 different datasets (named as "[dataset].AMAS.csv").
RF_distances-Ave_Bootstrap_6datasets.zip
: provides normalized RF distances of the gene trees to the T1 topology (named as "normalized_rf_distances[dataset].csv") and the average bootstrap supports of gene trees (named as "[dataset]avg_bootstrap.csv") for 6 different datasets.
DatingDivergence.zip
: provides divergence dating results from BEAST and MCMCtree analyses on different datasets. The BEAST results are named as "MaxCre.Mean.[number of runs].BEAST.[dataset].tre"; the MCMCtree results are named as "MCMCtree.[dataset].APL.run1.tre".
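The naming conventions above make it possible to sort the contents of each archive by file type without unpacking it. The following is a minimal Python sketch, not part of the deposited data or the authors' workflow: the function names `classify_member` and `summarize_archive` and the category labels are hypothetical, and only the file suffixes are taken from the README. It assumes the suffixes appear exactly as written and that each archive is a standard zip file.

```python
import zipfile
from collections import defaultdict

# Suffixes come from the README naming conventions; the labels are
# illustrative assumptions. More specific suffixes are listed first so that
# ".AMAS.csv" is matched before the generic ".csv".
SUFFIX_LABELS = {
    ".fa": "genome assembly",
    ".phy.rename": "concatenated matrix (PHYLIP)",
    ".fas.rename": "concatenated matrix (FASTA)",
    ".partition": "partition file",
    ".tre": "tree (Newick)",
    ".AMAS.csv": "AMAS statistics",
    ".csv": "other CSV table",
    ".xlsx": "spreadsheet",
}


def classify_member(name):
    """Return a label for one archive member based on its file suffix."""
    base = name.rsplit("/", 1)[-1]  # drop any folder prefix inside the zip
    for suffix, label in SUFFIX_LABELS.items():
        if base.endswith(suffix):
            return label
    return "other"


def summarize_archive(zip_source):
    """Group the members of a zip archive (path or file object) by label."""
    groups = defaultdict(list)
    with zipfile.ZipFile(zip_source) as zf:
        for name in zf.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            groups[classify_member(name)].append(name)
    return dict(groups)
```

For example, calling `summarize_archive("FinalTrees.rename.zip")` on a downloaded copy would return a dictionary mapping each label to the member files that carry the matching suffix.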