Data from: Phylogenomics supports island contribution to metapopulation dynamics in a predominantly continental bird species

Data files

Sep 17, 2025 version files (146.32 KB total)

- Dataset.zip: 133.83 KB
- README.md: 12.49 KB

Abstract

Aim:

Islands have recently been recognized as potential sources of biodiversity, challenging the traditional view that their small population sizes and low genetic diversity limit such roles. This raises the question of how insular genetic variation becomes incorporated into continental populations, contrary to expectations of unidirectional colonization. Here, we investigate whether and how island-derived genetic variation has influenced a continental population through population establishment and gene flow in a bird species where frequent trans-ocean dispersal is expected.

Location:

Continental East Asia (Russian Far East), Japanese Archipelago

Taxon:

Swinhoe's Rail (Coturnicops exquisitus)

Materials and Methods:

We apply integrative phylogenomics to reconstruct the spatiotemporal history of the species. Colonization sequences and gene flow are inferred by comparing four different phylogenetic reconstruction methods, using mitochondrial sequences obtained by Sanger sequencing and genome-wide data obtained by genotyping by sequencing (MIG-seq). We assess a history of colonization and gene flow based on summary statistics, demographic trajectory inference by Stairway Plot2, demographic modeling by fastsimcoal2, and species distribution modeling.

Results:

Analyses collectively supported asymmetric gene flow from the island to the continental population, following divergence around the Middle Pleistocene. Post-divergence, the island maintained a large and stable population, while the continental population underwent a severe bottleneck, suggesting a significant evolutionary role of the island for the continental population. Additionally, evidence of recent re-establishment of the island by continental individuals indicates dynamic exchange and persistence within a continent-island metapopulation.

Main conclusions:

The maintenance of insular genetic variation within a dynamic continent-island metapopulation may have enabled the island to act as a genetic and demographic reservoir for the continental population. Thus, continent-island metapopulation dynamics may be a key evolutionary pathway through which island populations contribute to continental genetic diversity.

This README.txt file was generated on 2024-05-22.

Title of Dataset : Phylogenomics reveals a gene flow history of reverse colonization

Brief Summary

This dataset is the minimum required to reproduce the analyses and visualizations for the manuscript by Aoki et al., entitled "Phylogenomics supports island contribution to metapopulation dynamics in a predominantly continental bird species." All associated data and scripts were created by Daisuke AOKI. The FASTQ files of raw sequence reads, generated by the MIG-seq protocol and retrieved from the MiSeq, were deposited in the Sequence Read Archive (SRA) through DDBJ (see Table S1 in the main manuscript for the SRA accession numbers). Adapter and primer sequences were pre-trimmed from the raw FASTQ files by a pre-defined script in the MiSeq data output step. All details of the experiment and analyses are described in the main text and supplementary materials.

Only the minimum dataset is provided here. It is sufficient to run the analyses in full, provided that the bioinformatics environment is properly prepared; intermediate datasets are generated by the scripts included with the dataset. Specifically, the conda environment was created with Miniconda3 on Ubuntu in WSL2 on Windows 10, and an RStudio Server was set up on it. The RStudio Server can access the conda environment and hence any software packages installed in it. Analyses that take a relatively long time were run as batch jobs on a cluster PC: the code for the batch runs was written in bash through R scripts, and the job queues were submitted to the cluster PC separately. Most parameter settings are given in the scripts listed below; the remainder are given in the manuscript or supplementary materials. Other detailed information is provided below in this section.

Author Information

A. Principal Investigator Contact Information

Name: Daisuke AOKI
Institution: Department of Wildlife Biology, Forestry and Forest Products Research Institute
Address: Tsukuba, Ibaraki, Japan 305-8687
Email1: aokid@ffpri.affrc.go.jp
Email2: aokidaisuke1109@gmail.com

B. Associate or Co-investigator Contact Information

Name: Haruko ANDO
Institution: Biodiversity Division, National Institute for Environmental Studies
Address: Tsukuba, Japan, 305-8506

Date of data collection:

From 2017-05-22 to 2019-08-09 for the blood and feather sample materials. From 2020-09-01 to 2020-09-30 for the MIG-seq data.

Geographic location of data collection

Across Japan (Tomakomai and Kushiro of Hokkaido Pref., Ibaraki Pref. (Kanto region), Aomori Pref.) and Russia (Amur and Baikal region).

Information about funding sources that supported the collection of the data:

The Japan Society for the Promotion of Science (JSPS), grant nos. 22K20670 and 23H02243

Description of the data and file structure

File List

Dataset.zip
|-data
| |-data.df.csv
| |-ref.df.csv
| |-primer.tsv
| |-adapter.tsv
| |-adapter_primer.csv
| |-adapter_primer.fa
| |-metadata_data.df.csv
| |-metadata_ref.df.csv
|-functions
| |-cluster_functions.R
| |-general_functions.R
| |-ngs_functions.R
| |-catg_formatter.R
| |-lddecay_blocksizefinder.R
| |-pastclim_retriever.R
|-code
| |-data_prep.R
| |-qc_trimmomatic.R
| |-slice_bams.R
| |-sra_manipulation.R
| |-reference_mapping.R
| |-mapping_qc.R
| |-angsd_pre.R
| |-angsd_intersect.R
| |-angsd_ngsrelate.R
| |-angsd_global.R
| |-angsd_raxml_minind0.5.R
| |-angsd_raxml_minind0.8.R
| |-angsd_treemix.R
| |-angsd_snapp.R
| |-angsd_sfs.R
| |-angsd_sfs_pop4.R
| |-ngsdist_neighbornet.R
| |-angsd_abbababa.R
| |-mitochondrial.R
| |-diversity_indices.R
| |-stairwayplot.R
| |-fastsimcoal2.R
| |-sdm_swinhoe.R
| |-figures_1.R
| |-figures_2.R
| |-figures_3_indices.R
| |-figures_3_demog.R
| |-figures_supqc.R
| |-figures_abbababa.R
| |-readcheck.R
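
After downloading, it can be useful to confirm that the archive unpacks into the three folders listed above. The following is a minimal Python sketch and is not part of the dataset, whose own scripts are written in R. The helper names `missing_dirs` and `check_archive` are hypothetical, and the sketch assumes only that the `data`, `functions`, and `code` folders appear somewhere in the archive's member paths.

```python
import zipfile

# Folder names taken from the file list above. Whether the archive wraps them
# in an extra top-level folder is unknown, so any path depth is accepted.
EXPECTED_DIRS = ("data", "functions", "code")


def missing_dirs(names, expected=EXPECTED_DIRS):
    """Return the expected folders that no archive member path passes through."""
    seen = set()
    for name in names:
        # Keep only the directory components, e.g. "code" in "code/figures_1.R".
        seen.update(part for part in name.split("/")[:-1] if part)
    return [d for d in expected if d not in seen]


def check_archive(path="Dataset.zip"):
    """Open the archive and report which expected folders are absent."""
    with zipfile.ZipFile(path) as zf:
        return missing_dirs(zf.namelist())
```

Calling `check_archive()` in the download directory returns an empty list when all three folders are present.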

Relationships between files

"data": includes dataset information including sample details that are used in "code".
- "data.df.csv" is the base dataframe containing sample and sequencing information of MIG-seq
- "ref.df.csv" is the base dataframe containing reference sequence information
- "primer.tsv", "adapter.tsv", "adapter_primer.tsv", and "adapter_primer.fa" contain the primer and adapter sequences used to pre-check the obtained reads for contamination by these sequences.
- "metadata_data.df.csv" is the metadata for "data.df.csv", describing each column.
- "metadata_ref.df.csv" is the metadata for "ref.df.csv", describing each column.
"functions": contains several scripts needed to run the analyses described in "code". These functions are loaded at the beginning of each "code" script or of "data_prep.R".
- "general_functions.R" contains functions for general use
- "cluster_functions.R" contains functions to convert scripts to analyses at a cluster PC
- "ngs_functions.R" contains utilities for bioinformatics
- "lddecay_blocksizefinder.R" is to fit a linkage disequilibrium model to the sites obtained by ANGSD for TreeMix (angsd_treemix.R) to determine the block size for the TreeMix analysis.
- "catg_formatter.R" is to create a CATG input file for RAxML-ng.
- "pastclim_retriever.R" contains a function needed to retrieve a previously downloaded pastclim dataset through an older version of the package.
"code" includes 31 scripts that were used to conduct analyses.
- "data_prep.R" is sourced in many other scripts to load a data frame that includes sample details.
- "qc_trimmomatic.R" is to pre-check the raw reads, trim and filter the raw reads of FASTQ files based on read quality, and conduct a post-trimming quality check.
- "sra_manipulation.R" is to conduct a similar procedure to "read_manipulation.R" but on the database reads downloaded from SRA. These include both short and long reads, which need to be handled differently. The obtained information is used to create "ref.df.csv".
- "slice_bams.R" is to slice the large BAM files of mapped reads generated from SRA reads, restricting them to the sites that intersect with the MIG-seq dataset.
- "reference_mapping.R" is to map trimmed and filtered reads against a reference genome and merge the resulting BAM files into a single BAM file for each sample.
- "mapping_qc.R" is to conduct quality checks on the reference-mapped reads of each sample, including the number of reads mapped, the proportion of missing data, and so on. These analyses are done using BCFtools with different depth values.
- "angsd_pre.R" is to conduct a preanalysis run of ANGSD to calculate the proportion of missingness, which was used to determine samples to be removed from the downstream analyses.
- "angsd_intersect.R" is to conduct ANGSD for different sampling regions separately to determine the intersecting sites among regions with filtering parameters, such as setMinDepthInd and minInd. These sites are used in the downstream analyses.
- "angsd_ngsrelate.R" is to conduct ANGSD and ngsRelate to detect any sample pairs that have high relatedness. One sample from each such pair (i.e., two Kanto samples) was removed from the following analyses.
- "angsd_global.R" is to conduct ANGSD genotype likelihood analyses on the samples collected from all the regions, and conduct PCA and ADMIXTURE to examine the population genetics structure.
- "angsd_raxml_minind0.5.R" is to conduct ANGSD genotype likelihood analyses on the samples used for RAxML tree reconstruction; the obtained genotype probabilities are used to generate a RAxML input. RAxML runs are also conducted through this script. This script uses functions/catg_formatter.R to create the input. For this script, ANGSD is run with the minimum number of individuals set to half of the total number of samples.
- "angsd_raxml_minind0.8.R" is mostly similar to the script "angsd_raxml_minind0.5.R" but uses a different value for the minimum number of individuals (80% of the total number of samples).
- "angsd_treemix.R" is to conduct ANGSD genotype likelihood analyses on the samples used for TreeMix. The output genotype likelihood file is used to create a TreeMix input file by using glactools; the TreeMix run and its model evaluation are also done using this script.
- "angsd_snapp.R" is to conduct ANGSD genotype likelihood analyses to hard call SNPs using the postcutoff option. The called SNPs are used to create an input file for SNAPP. The SNAPP analysis is conducted externally in the BEAST software and visualized using the FigTree software.
- "angsd_sfs.R" is to conduct ANGSD genotype likelihood analyses on the samples collected from the two genetic groups; folded SFSs are calculated on the intersecting sites (on both linked and unlinked sites). 1d and 2d SFSs are produced, which are used as the input of "diversity_indices.R", "stairwayplot.R", and "fastsimcoal2.R".
- "angsd_sfs_pop4.R" is to compute SFSs for four regional populations on unlinked sites and to compute Fst using the 2d-SFS. The intersecting sites from "angsd_intersect.R" are used.
- "ngsdist_neighbornet.R" is to estimate pairwise genetic distance for each pair of samples by using ngsDist based on genotype likelihoods. This script is also used to plot a NeighborNet constructed based on the pairwise genetic distance.
- "angsd_abbababa.R" is to compute the D-statistic using the doAbbababa2 program of ANGSD.
- "mitochondrial.R" is to create input files for BEAST phylogenetic analyses and the NETWORK analysis.
- "diversity_indices.R" is to summarize diversity indices and fit a linear model.
- "stairwayplot.R" is to prepare input files for the Stairway Plot 2 analysis and conduct the analysis.
- "fastsimcoal2.R" is to prepare input files for the fastsimcoal2 analysis; the results are also compared in this script.
- "sdm_swinhoe.R" is to construct species distribution models of rails using R. SDM plots are generated by this script too.
- "figures_1.R" is to plot a part of figure 1 and its related supplementary figures.
- "figures_2.R" is to plot a part of figure 2 and its related supplementary figures.
- "figures_3_indices.R" is to plot a part of figure 3 (indices) and its related supplementary figures.
- "figures_3_demog.R" is to plot a part of figure 3 (demographic inferences) and its related supplementary figures.
- "figures_supqc.R" is to plot a part of supplementary figures for quality check.
- "figures_abbababa.R" is to plot a supplementary figure of the D-statistic analysis.
- "read_check.R" is to summarize the number of reads used for each analysis.
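
Several of the scripts above pass folded site frequency spectra (SFSs) between steps. As a minimal illustration of what folding means, here is a Python sketch; it is not the authors' R code, and the function name `fold_sfs` and the example counts are made up. Without knowledge of ancestral states, the class with i derived alleles cannot be told apart from the class with n - i, so the two counts are summed.

```python
def fold_sfs(unfolded):
    """Fold a 1d SFS whose entries are site counts for allele counts 0..n."""
    n = len(unfolded) - 1  # number of sampled chromosomes
    folded = []
    for i in range(n // 2 + 1):
        j = n - i
        # Classes i and n - i are merged; when n is even, the middle class
        # (i == n - i) is counted only once.
        folded.append(unfolded[i] if i == j else unfolded[i] + unfolded[j])
    return folded


# Illustrative counts for n = 4 chromosomes (allele counts 0, 1, 2, 3, 4).
print(fold_sfs([10, 5, 3, 2, 1]))  # [11, 7, 3]
```

The folded spectrum has n // 2 + 1 entries, indexed by minor allele count.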

Analyses need to be conducted in an appropriate conda environment in which Miniconda3 and the required packages are installed. All scripts are written in R and were run on an RStudio Server built on the conda environment, so that an R script can submit a shell script to run analyses that depend on the conda environment. Scripts that load "cluster_functions.R" were used to create job queues submitted to the supercomputer of AFFRIT, MAFF, Japan, for efficient computation. All sequence data are deposited in DDBJ under the accession numbers listed in Table S1. Database reads used in the analyses are also listed in Table S1.

The script "sdm_swinhoe.R" requires a CSV file that contains occurrence records of the Swinhoe's Rail. Although this occurrence dataset contains previously published records, some records come from proceedings of annual meetings that are not fully public. Therefore, it is available only upon direct request; alternatively, the records are visualized in Fig. 4b and Fig. S20. These data need to be downloaded to conduct the complete analyses. See the main text and supplementary materials for details.
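
The workflow described above, in which a script writes a shell script that runs a tool inside the conda environment, can be sketched as follows. This is a hedged Python illustration rather than the dataset's own R code: the function name `make_job_script`, the environment name `migseq`, and the example file names are assumptions, and scheduler-specific directives are omitted because the cluster's job scheduler is not described here. The sketch relies on `conda run -n <env>`, which executes a command inside a named environment.

```python
import shlex


def make_job_script(command, env_name, job_name="job"):
    """Return the text of a bash script that runs `command` in a conda environment."""
    # Quote every argument so that spaces or shell metacharacters stay literal.
    quoted = " ".join(shlex.quote(arg) for arg in command)
    lines = [
        "#!/bin/bash",
        f"# job: {job_name}",
        "set -euo pipefail",  # stop at the first failing command
        f"conda run -n {shlex.quote(env_name)} {quoted}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical example: the environment name and file names are placeholders.
script = make_job_script(
    ["angsd", "-bam", "bam.filelist", "-out", "out"],
    env_name="migseq",
    job_name="angsd_example",
)
print(script)
```

The returned text can be written to a file and handed to whatever submission command the cluster provides.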

Licenses/restrictions placed on the data:

This work is licensed under a CC0 1.0 Universal (CC0 1.0) Public Domain Dedication license.

The associated manuscript has been published in the Journal of Biogeography under DOI 10.1111/jbi.70038 (https://onlinelibrary.wiley.com/doi/10.1111/jbi.70038).

Raw sequence reads downloaded from the Sequence Read Archive were used in the scripts. See the supplementary file of the preprint/manuscript for accession numbers and details.

Recommended citation for this dataset:

Aoki et al. (2024), Phylogenomics reveals an island as a genetic reservoir of a continental population. Dryad, Dataset.