Data from: Phylogenomics of a genus of ‘Great Speciators’ reveals rampant incomplete lineage sorting, gene flow, and mitochondrial discordance in island systems
Data files
Nov 04, 2025 version files (5.02 GB)
- convert_vcf_to_nexus.rb (3.83 KB)
- datasets.zip (5.02 GB)
- drop_fasta_N.py (970 B)
- drop_fasta_X.py (985 B)
- README.md (6.26 KB)
- trees-from-fig-2.zip (31.14 KB)
Abstract
The flora and fauna of island systems, especially those in the Indo-Pacific, are renowned for their high diversification rates and outsized contribution to the development of evolutionary theories. The total diversity of geographic radiations of many Indo-Pacific fauna is often incompletely sampled in phylogenetic studies due to the difficulty in obtaining single-island endemic forms across the Pacific and the relatively poor performance of degraded DNA when using museum specimens for inference of evolutionary relationships. New methods for the production and analysis of genome-wide datasets sourced from degraded DNA are facilitating insights into the complex evolutionary histories of these influential island faunas. Here, we leverage whole genome resequencing (20X average coverage) and extensive sampling of all taxonomic diversity within Todiramphus kingfishers, a rapid radiation of largely island-endemic ‘Great Speciators.’ We find that whole genome datasets do not outright resolve the evolutionary relationships of this clade: four types of molecular markers (UCEs, BUSCOs, SNPs, and mtDNA) and tree-building methods did not find a single well-supported and concordant species-level topology. We then uncover evidence of widespread incomplete lineage sorting and both ancient and contemporary gene flow and demonstrate how these factors contribute to conflicting evolutionary histories. Our complete taxonomic sampling allowed us to further identify a novel case of mitochondrial capture between two allopatric species, suggesting a potential historical (but since lost) hybrid zone as islands were successively colonized. Taken together, these results highlight how increased genomic and taxon sampling can reveal complex evolutionary patterns in rapid island radiations.
https://doi.org/10.5061/dryad.cfxpnvxg8
Description of the data and file structure
Whole genome resequencing data for all taxonomic diversity within Todiramphus kingfishers. Raw reads are available on GenBank (BioProject PRJNA1174201).
This dataset has:
One zipped datasets file. This zipped file contains a folder structure of the four types of molecular markers analyzed and output files. UCE and BUSCO folders have the locus-specific alignments in a subfolder with IQtree output, which can be used to produce input files for ASTRAL.
- UCE: 90% complete matrix. The concatenated subfolder contains the concatenated matrix (todi116_90per.phylip) and a character set file for a partitioned analysis. The gene-trees subfolder contains the 4,845 FASTA-formatted locus-specific alignments (N0.25_missing0.7_alignments.zip), the IQtree gene trees (N0.25_missing0.7_iqtree.zip), and information about the parsimony informativeness of each locus (N0.25_missing0.7_PIsites.csv). The final input file for ASTRAL is final_N0.25_M0.7_ParInfor100.trees. The SNaQ subfolder contains the code, input CF matrix, and output for SNaQ.
- BUSCO: The concatenated subfolder contains the 8,012 loci in concatenated format (Todi109_Busco8012_filtered.phy) with a character set file (todi109-Busco8012-partitions.nex). The gene-trees subfolder contains 4,178 FASTA-formatted locus-specific BUSCO alignments (busco_ambig_dropp0.5_parsInfor21_alignments.zip). Gene trees for loci with fewer than 21 parsimony-informative sites were not inferred, but these data can be extracted using the character sets file from the concatenated matrix. The subfolder also contains the IQtree-inferred gene trees and output files (busco_ambig_dropp0.5_parsInfor21_iqtree.zip) and information about the parsimony informativeness of each locus (BUSCO_X0.5_parsInfor20_PIsites.csv). The final tree files from ASTRAL are Busco468-ParInfor100_Astral.tre (100 parsimony-informative site threshold) and Busco4155_ParInFor20_Astral.tre (20 parsimony-informative site threshold). The input files are iqtree_Busco468-ParInfor100.trees and iqtree_Busco4155-ParInfor20.trees, respectively.
- SNP: VCF files of filtered SNP data (with [todi119] and without [todi115] outgroups), R code to produce distance files as input for Splitstree (convert-to-distance-matrix.R), and the distance matrices themselves (todi115.nomissing.g5mac3dp3BA2.5kb.txt). The input alignment file for IQtree is todi119.nomissing.g5mac3dp3BA2.5kb-alignment.VarsitesOnly.phy. The alignment file for SVDquartets (with charpartitions and commands) is todi119.nomissing.g5mac3dp3BA2.5kb.SVD1000.nex. Pairwise distance files for Splitstree inputs are todi92.nomissing.g5mac3dp3BA2.5kb.txt (no outgroups) and todi115.nomissing.g5mac3dp3BA2.5kb.txt (with outgroups). The input and output files from Dsuite are also available in the Dsuite subfolder.
- mtDNA: concatenated matrix of mitochondrial genes and character sets file.
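The parsimony-informative (PI) site thresholds mentioned above (20 and 100 PI sites) can be illustrated with a minimal Python sketch. This is not the code used to produce the PIsites.csv files: the function names `count_pi_sites` and `loci_passing`, the characters treated as missing, and the inclusive threshold are all assumptions made for illustration. A column is counted as parsimony informative when at least two character states each occur in at least two sequences.

```python
from collections import Counter


def count_pi_sites(seqs, missing="-?N"):
    """Count parsimony-informative columns in equal-length aligned sequences.

    `missing` lists characters to ignore (an assumption; adjust it for
    amino acid alignments, where N is a real residue and X is ambiguous).
    """
    if not seqs:
        return 0
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    ignore = set(missing.upper())
    pi_sites = 0
    for column in zip(*seqs):
        counts = Counter(c for c in (ch.upper() for ch in column) if c not in ignore)
        # Informative: at least two states, each present in at least two sequences.
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            pi_sites += 1
    return pi_sites


def loci_passing(pi_by_locus, min_pi):
    """Return the sorted locus names with at least `min_pi` PI sites (inclusive, assumed)."""
    return sorted(name for name, n in pi_by_locus.items() if n >= min_pi)


if __name__ == "__main__":
    toy_alignment = ["AAGT", "AAGT", "ACGA", "ACTA"]  # hypothetical data
    print(count_pi_sites(toy_alignment))
    print(loci_passing({"locus-1": 120, "locus-2": 99, "locus-3": 100}, 100))
```

The locus names and counts in the usage lines are hypothetical; the per-locus counts for the real data are in the PIsites.csv files.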
Four Python scripts/files and one Ruby script. These were used in the paper.
- drop_fasta_N.py: used to drop individual samples from individual UCE loci based on a missing-data threshold specified as a flag. See the top of the script for details.
- drop_fasta_X.py: used to drop individual samples from individual BUSCO loci based on a missing-data threshold specified as a flag. See the top of the script for details.
- MetaeukToGff3.py: converts the header information in the MetaEuk output file to GFF3 format. It was used to create the BUSCO dataset from coordinates rather than by the traditional pulling of BUSCOs (see supplementary methods).
- pull-coordinates.txt: UNIX commands with instructions on how to pull BUSCO loci out of genomes that are all aligned to the SAME reference genome. This could easily be made into a Python script but is currently a set of instructions. The code runs BUSCO on the reference genome, then takes the coordinates from the output (using MetaeukToGff3.py) to pull data into locus-specific files.
- convert_vcf_to_nexus.rb: a Ruby script that converts a VCF file to NEXUS format.
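For readers who want a feel for the kind of filtering that drop_fasta_N.py and drop_fasta_X.py perform, here is a minimal Python sketch that drops sequences from a FASTA alignment when their fraction of a missing-data character is too high. It is not the authors' script: the function names, the `max_fraction` parameter, and the choice to keep sequences that sit exactly at the threshold are assumptions. Consult the top of the actual scripts for their real flags and behavior.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def drop_by_missing(records, max_fraction, missing_char="N"):
    """Keep records whose fraction of `missing_char` is at most `max_fraction`.

    The inclusive comparison is an assumption of this sketch.
    Empty sequences are dropped.
    """
    kept = []
    for name, seq in records:
        if not seq:
            continue
        fraction = seq.upper().count(missing_char.upper()) / len(seq)
        if fraction <= max_fraction:
            kept.append((name, seq))
    return kept


if __name__ == "__main__":
    # Hypothetical three-sample locus; sample "b" is mostly missing data.
    toy = ">a\nACGT\n>b\nNNNA\n>c\nAN\nGT\n"
    for name, seq in drop_by_missing(parse_fasta(toy), 0.25):
        print(f">{name}\n{seq}")
```

Passing `missing_char="X"` gives the analogous filter for alignments in which X marks missing or ambiguous positions.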
One zipped folder with ten tree files summarized in Figure 2 (trees-from-fig-2.zip)
Includes:
- 01_UCE_IQTree_90per.tre: UCE 90% complete, partitioned IQtree-inferred phylogeny from Figs. 1, 2, and S10. Names with "_clean" mean that they were put through the toepad cleaning pipeline (see supplemental methods in Appendix 2).
- 02_UCE_RAxML_90per.tre: UCE 90% complete RAxML phylogeny. Fig S11.
- 03A_UCE_SVDquartets_todi117A_100bs.tre: 90% complete UCE SVDquartets topology of the subset focusing on complete sampling of the Oceanic Clade. The results of this were combined with subset B in Figure 2. See Fig. S9.
- 03B_UCE_SVDquartets_todi117B_100bs.tre: 90% complete UCE SVDquartets topology of the subset focusing on complete sampling of the Australasian Clade. The results of this were combined with subset A in Figure 2. See Fig. S9.
- 04_UCE_ASTRAL_N0.25_M0.7_ParInfor100.tre: 90% complete UCE ASTRAL tree with parsimony informative filters. Fig. S12.
- 05_BUSCO_IQtree_concatenated_all_partitions.tre: BUSCO partitioned IQtree-inferred phylogeny from Fig. S17.
- 06_BUSCO_ASTRAL_X0.5_ParInfor100.tre: BUSCO ASTRAL topology with parsimony informative filters. Fig. S16.
- 07_SNP_IQtree.tre: IQtree-inferred topology from the SNP dataset. Fig. S15.
- 08_SNP_SVDquartets.tre: SVDquartets topology from the SNP dataset. Fig. S14.
- 09_mtDNA_IQtree.tre: IQtree-inferred mitochondrial topology. Figs. 4, S20.
Please note that the gene trees for the BUSCO and UCE datasets, as well as the SNaQ networks, are in the datasets.zip folder structure.
Code/software
Tables can be viewed in Excel. Python, UNIX, etc. are used to run the scripts; the scripts and other text files can be viewed in any text editor. Tree files can be viewed in FigTree.
Access information
Other publicly accessible locations of the data:
- Raw reads are available on GenBank (BioProject PRJNA1174201).
