Demographic history and adaptive evolution of Indo-Pacific bottlenose dolphins (Tursiops aduncus) in Western Australia
Data files
Oct 21, 2024 version, files total 2.65 GB
- Datasets.zip (2.62 GB)
- filtering_steps_in_the_command_line.html (635.14 KB)
- Part1_PopStruct.zip (18.36 KB)
- Part2_Demographic_modelling.zip (8.02 KB)
- Part3_LocalAdaptation.zip (31.04 MB)
- README.md (18.21 KB)
- RsessionInfo.txt (13 KB)
Abstract
Demographic processes can substantially affect a species’ response to changing ecological conditions, necessitating the combined consideration of genetic responses to environmental variables and neutral genetic variation. Using a seascape genomics approach combined with population demographic modelling, we explored the interplay of demographic and environmental factors that shaped the current population structure in Indo-Pacific bottlenose dolphins (Tursiops aduncus) along most of the Western Australian coastline. We combined large-scale environmental data gathered via remote sensing with RADseq genomic data from 133 individuals at 19 sampling sites. Using population genetic and outlier detection analyses, we identified three distinct genetic clusters, coinciding with tropical, subtropical and temperate provincial bioregions. In contrast to previous studies, our demographic models indicated that populations occupying the paleo-shoreline split into two demographically independent lineages before the last glacial maximum (LGM). A subsequent split after the LGM gave rise to the Shark Bay population, thereby creating the three currently observed clusters. Although multi-locus heterozygosity declined from north to south, dolphins from the southernmost cluster inhabiting temperate waters had higher heterozygosity in potentially adaptive loci, compared to dolphins from subtropical and tropical waters. These findings suggest ongoing adaptation to cold temperate waters in the southernmost cluster, possibly linked to distinct selective pressures between the different bioregions. Our study demonstrated that in the marine realm, without apparent physical boundaries, only a combined approach can fully elucidate the intricate environmental and genetic interactions shaping the evolutionary trajectory of marine mammals.
Short explanation of scripts and datasets analysed in: “Neutral and adaptive genetic differentiation patterns suggest a complex history of Indo-Pacific bottlenose dolphins (Tursiops aduncus) in West-Australian waters”, 2024, 10.1111/mec.17555
Svenja M. Marfurt, Delphine B. H. Chabanne, Samuel Wittwer, Livia Gerber, Simon J. Allen, Manuela R. Bizzozzero, Krista Nicholson, Michael Krützen
For questions regarding Parts 1 & 3 please contact svenja.marfurt@iea.uzh.ch
For questions regarding Part 2 please contact samuel.wittwer@iea.uzh.ch
This Dryad repository contains the objects listed below, in the folder structure provided, with a short explanation of each object.
Datasets and folders in italics are created at some point when running all analysis parts. They are provided to facilitate replication of only specific parts of the analyses.
RsessionInfo.txt
The R-session info with details of R-package versions used
filtering_steps_in_the_command_line.html
An HTML file containing commands for read filtering, alignment and variant calling, plus additional information
Datasets:
- BioOracle_data:
A folder containing all environmental layers downloaded using the sdmpredictors R package (e.g., BO_bathymean_lonlat.zip, BO22_chlomean_ss_lonlat.zip, …). For a detailed description of these layers, please refer to the BioOracle website: https://www.bio-oracle.org/documentation.php. The layers can be loaded into R directly by specifying options(sdmpredictors_datadir = "/Datasets/BioOracle_data"). Alternatively, these layers can be downloaded from the BioOracle website from within R following the instructions detailed in EnvData.R.
- genotype.lfmm_imputed.snmf:
can be created when running PopDemHistory_TursiopsAduncusAustralia.R. A folder containing ancestry coefficient computations (sNMF) for K = 3 after imputing missing data using the LEA package in R, plus an additional “masked” analysis folder. The masked analysis is conducted by masking a portion of the data to assess the stability of the results or to cross-validate the sNMF model. Each analysis (K = 3 and the masked analysis) has its own folder, and within each folder, 10 runs have been performed to evaluate the consistency of the results across replicates. Each run contains the following files, with X referring to the run number and Y referring to the K specified:
- genotype_rX.Y.G: Genotype matrix used for the run
- genotype_rX.Y.Q: Ancestry coefficients (Q-matrix) produced by sNMF for the run.
- genotype_rX.Y.snmfClass: Classification of samples based on the inferred ancestry coefficients for the run. The snmfClass object is generated by the LEA package in R as part of the sNMF analysis. This object stores key information about the sNMF model and results for a particular run.
- genotype.snmf:
can be created when running PopDemHistory_TursiopsAduncusAustralia.R. A folder containing ancestry coefficient computations (sNMF) using the LEA package in R for K values ranging from 1 to 20 (as specified in PopDemHistory_TursiopsAduncusAustralia.R), plus an additional “masked” analysis folder. The masked analysis is conducted by masking a portion of the data to assess the stability of the results or to cross-validate the sNMF model.
Each K value (including the masked analysis) has its own folder, and within each folder, 10 runs (as specified in PopDemHistory_TursiopsAduncusAustralia.R) have been performed to evaluate the consistency of the results across replicates. Each run contains the following files, with X referring to the run number and Y referring to the K specified:
- genotype_rX.Y.G: Genotype matrix used for the run
- genotype_rX.Y.Q: Ancestry coefficients (Q-matrix) produced by sNMF for the run.
- genotype_rX.Y.snmfClass: Classification of samples based on the inferred ancestry coefficients for the run. The snmfClass object is generated by the LEA package in R as part of the sNMF analysis. This object stores key information about the sNMF model and results for a particular run.
- Maps:
A folder containing all necessary files to plot detailed maps of the Australian continent. The primary map file is AUS_complete.shp with all corresponding auxiliary files. These files can be used in any GIS software or loaded into R using spatial packages like sf or rgdal for visualization. Refer to EnvData.R for R-code to plot maps using these files.
- c3_i133_popmap.txt:
A tab-delimited file containing the sample ID and the corresponding genetic cluster at K = 3 (north, central, south or admixed, determined using sNMF, ADMIXTURE & PCA), including admixed individuals from Coral Bay (n = 133)
- c3_i126_noCBA_popmap.txt:
A tab-delimited file containing the sample ID and the corresponding genetic cluster at K = 3 (north, central or south, determined using sNMF, ADMIXTURE & PCA), excluding admixed individuals (n = 7) from Coral Bay (CBA) (n = 126)
- genotype.geno:
SNP data for all individuals (n=133) in geno format
- genotype.lfmm:
SNP data for all individuals (n=133) in lfmm format
- genotype.lfmm_imputed.geno:
SNP data for all individuals (n=133) with missing genotypes imputed from the best snmf run (created using snmf in the Part1_PopStruct_PopDemHistory_TursiopsAduncusAustralia.R script) in geno format
- genotype.lfmm_imputed.lfmm:
SNP data for all individuals (n=133) with missing genotypes imputed from the best snmf run (created using snmf in the Part1_PopStruct_PopDemHistory_TursiopsAduncusAustralia.R script) in lfmm format
- genotype.lfmm_imputed.snmfProject:
The corresponding imputed snmfProject (created using snmf in the Part1_PopStruct_PopDemHistory_TursiopsAduncusAustralia.R script)
- genotype.snmfProject:
The corresponding initial snmfProject (created using snmf in the Part1_PopStruct_PopDemHistory_TursiopsAduncusAustralia.R script) to find the best K
- marmap_coord_110;-38;129;-12_res_1.csv:
The bathymetry data from Western Australia used by the marmap package to calculate shortest in-water distances (downloaded using the marmap package in Part1_PopStruct_PopDemHistory_TursiopsAduncusAustralia.R)
- noCBA_inds.txt:
A list of all sample IDs except the 7 admixed Coral Bay (CBA) individuals (n = 126)
- p18i126.vcf:
The VCF containing SNP info of 126 individuals from 18 sampling sites (excluding the sampling site Coral Bay, n = 7) for 12,356 SNPs
- p19i133.vcf:
The VCF containing SNP info of all 133 individuals analysed in this study for 12,356 SNPs
- p19i133.gds:
SNP info of all 133 individuals analysed in this study for 12,356 SNPs in gds format
- p19i133.het:
Heterozygosity information of all 133 individuals analysed in this study, generated using vcftools --het
- sampleinfo.csv:
A csv file containing metadata for each sample including sampling site (short code and full sampling site name), longitude, latitude and provincial bioregion
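For readers who want to inspect the genotype files outside R, the following is a minimal sketch of how the geno and lfmm layouts relate. It assumes the standard LEA conventions (one line per individual; geno uses one digit per SNP with no separators, lfmm uses whitespace-separated values; 9 codes a missing genotype). The helper names and the toy data are hypothetical and are not taken from the repository.

```python
MISSING = 9  # assumed LEA missing-genotype code in .geno and .lfmm files


def parse_geno(text):
    """Parse geno-formatted text: one line per individual, one digit per SNP."""
    return [[int(ch) for ch in line.strip()]
            for line in text.splitlines() if line.strip()]


def parse_lfmm(text):
    """Parse lfmm-formatted text: one line per individual, whitespace-separated."""
    return [[int(tok) for tok in line.split()]
            for line in text.splitlines() if line.strip()]


def to_lfmm(matrix):
    """Serialise a genotype matrix (list of rows) as lfmm-formatted text."""
    return "\n".join(" ".join(str(g) for g in row) for row in matrix)


def missing_rate(matrix):
    """Fraction of genotype calls coded as missing."""
    total = sum(len(row) for row in matrix)
    missing = sum(row.count(MISSING) for row in matrix)
    return missing / total


# Toy example (hypothetical): 3 individuals x 4 SNPs.
geno_text = "0129\n1100\n2901\n"
matrix = parse_geno(geno_text)
lfmm_text = to_lfmm(matrix)
```

With a real file, the text would come from something like `open("genotype.geno").read()`; the imputed files should then show a missing rate of zero if imputation filled every gap.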
Part1_PopStruct
- scr: folder containing R-scripts used for the Population genomic structure analyses
- EnvData.R:
R-script used to plot sampling sites and extract all environmental variables at sampling sites (“Environmental variable extraction” part of the manuscript)
- PopDemHistory_TursiopsAduncusAustralia.R:
R-script used to conduct all population genomic structure analyses, including dimension-reduction and clustering-based methods as well as population genetic statistics, IBD and heterozygosity decline (“Population genetic structure” part of the manuscript)
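As a quick check on the heterozygosity input, the sketch below derives per-individual observed heterozygosity from vcftools --het output. It assumes p19i133.het follows the usual vcftools layout (tab-delimited columns INDV, O(HOM), E(HOM), N_SITES, F); the function name, sample IDs and numbers are hypothetical, and this is not the authors' R code.

```python
import csv
import io


def observed_heterozygosity(het_text):
    """Return {sample_id: observed heterozygosity} from vcftools --het text.

    Observed heterozygosity is taken as the share of genotyped sites
    that are not homozygous: (N_SITES - O(HOM)) / N_SITES.
    """
    reader = csv.DictReader(io.StringIO(het_text), delimiter="\t")
    result = {}
    for row in reader:
        n_sites = float(row["N_SITES"])
        n_hom = float(row["O(HOM)"])
        result[row["INDV"]] = (n_sites - n_hom) / n_sites
    return result


# Hypothetical two-sample example in the assumed vcftools layout.
example = (
    "INDV\tO(HOM)\tE(HOM)\tN_SITES\tF\n"
    "ind_A\t9000\t8800.0\t12000\t0.06250\n"
    "ind_B\t9600\t8800.0\t12000\t0.25000\n"
)
het = observed_heterozygosity(example)
```

Joining these values to the cluster labels in c3_i133_popmap.txt would then allow per-cluster summaries, should a reader wish to explore the north-to-south trend independently.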
Part2_Demographic_modelling
- dadianalysis.py:
Python automation script for running dadi analyses in conjunction with Portik's dadi_pipeline https://github.com/dportik/dadi_pipeline (Optimize_functions)
- models.py:
Python script containing models for testing two-population and three-population scenarios
- PopDemR_functions.R:
R-functions used in PopDemR_script.R
- PopDemR_gather_dadiresults.R:
R-script used to combine all model results
Part3_LocalAdaptation
- dat
- dolphin_genome_summary.txt:
Genome summary (chromosome, scaffold RefSeq and GenBank names) for the Tursiops truncatus genome; NCBI GenBank GCA_011762595.1
- genotype_p19i133maf0.05_k3imputed.lfmm:
SNP data for the MAF-filtered SNP dataset (7179 SNPs) for all individuals (n=133) with missing genotypes imputed with K = 3 in lfmm format
- genotype_p19i133maf0.05_k3imputed.lfmm.bed:
SNP data for the MAF-filtered SNP dataset (7179 SNPs) for all individuals (n=133) with missing genotypes imputed with K = 3 in bed format for PCAdapt
- p19i133env_variables.csv:
A csv file containing the values for all environmental variables for each sample analysed in this study (extracted using the Part1_PopStruct/EnvData.R script) used for the GEA methods
- p19i133_maf0.05_IndLabelEdited.vcf:
The VCF containing SNP info of all 133 individuals analysed in this study for 7179 SNPs after applying a MAF threshold of 0.05 for the outlier scans; the row-names contain the full IDs including the sampling site and ID number
- proteins_769_839090.csv:
Protein list for the Tursiops truncatus genome (NCBI GenBank GCA_011762595.1)
- scr: folder containing R-scripts used for the Outlier detection (“Outlier detection, Annotating candidate loci & Investigating potentially adaptive divergence”- Part of the Manuscript)
- individual_based_TursiopsAduncus_rda.R:
R-Script containing the code to run RDA for outlier detection
- rdadapt.R:
rdadapt function to detect outlier loci based on their contribution to an RDA model, from https://landscape-genomics.github.io/rdadapt/reference/index.html
- TursiopsAduncus_lfmm.R:
R-Script containing the code to run multivariate lfmm for outlier detection
- TursiopsAduncus_outflank.R:
R-Script containing the code to run Outflank for outlier detection
- TursiopsAduncus_overlap.R:
R-Script containing the code to combine all outlier detection methods to obtain a dataset of candidate outliers for further investigation and annotation, as well as plotting candidate SNPs
- TursiopsAduncus_pcadapt.R:
R-Script containing the code to run PCAdapt for outlier detection
- res:
- pot_adaptive_2methods_within_genes_28snps.txt:
a txt file containing the snp.ids of outliers detected in 2 or more outlier tests that are located within a gene.
- dolphin_GEA_detected_in_2methodspRDA.txt:
a txt file containing the snp.ids of outliers detected in 2 or more outlier tests.
- dolphin_GEA_detected_in_onemethod.txt:
a txt file containing the snp.ids of all outliers detected in any of the outlier tests.
- pcadapt:
folder containing the results obtained when running TursiopsAduncus_pcadapt.R.
  - PCA_K1-8_ScreenPlot.pdf:
    Plot visualizing likely choice of K.
  - GenomicInflationFactor_K1-8.csv:
    A csv containing the genomic inflation factor (gif) values for all potential Ks (K=1 up until K=8).
  - NbSignifOutliers_K1-8.csv:
    A csv containing the number of significant outliers calculated using PCAdapt for all potential Ks (K=1 up until K=8) for different q-value thresholds.
  - K1 to K8:
    Each folder (K1 to K8) contains summarized results and diagnostic plots from PCAdapt specifying K values from 1 to 8. (Note: Y in the file structure below is a placeholder for the specified K value, ranging from 1 to 8.)
    - QQplot_KY.pdf:
      A visualization of the QQplot under K = Y.
    - StatDistribution_KY.pdf:
      A visualization of the chi² distribution under K = Y.
    - All_Results_Tests_KY.csv:
      A csv containing the test results for each of the loci in the SNP dataset under K = Y.
- outflank:
folder containing the results obtained when running TursiopsAduncus_outflank.R.
  - OutFlank_AllResults.csv:
    A csv containing the test results for each of the loci in the SNP dataset.
  - SignifOutliers.csv:
    A csv containing the number of significant outliers calculated using Outflank under the specified significance threshold.
- ib_rda:
folder containing the results obtained when running individual_based_TursiopsAduncus_rda.R.
  - ib_pRDA_AllResults.csv:
    A csv containing the test results for each of the loci in the SNP dataset running partial RDA (correcting for population structure).
  - pRDA_Outliers_indbased.csv:
    A csv containing the test results only for the significant outlier loci obtained running partial RDA (correcting for population structure).
  - RDA_AllResults_unconstrained_indbased.csv:
    A csv containing the test results for each of the loci in the SNP dataset running an unconstrained RDA (not correcting for population structure), details/rationale constrained vs. unconstrained: https://github.com/Capblancq/RDA-landscape-genomics.
  - RDA_unconstrainedOutliers_indbased.csv:
    A csv containing the test results only for the significant outlier loci obtained running unconstrained RDA (not correcting for population structure), details/rationale constrained vs. unconstrained: https://github.com/Capblancq/RDA-landscape-genomics.
- lfmm:
folder containing the results obtained when running TursiopsAduncus_lfmm.R.
  - GenomicInflationFactor_K1-8_multivariate.csv:
    A csv containing the genomic inflation factor (gif) values for all potential Ks (K=1 up until K=8).
  - NbSignifAssociations_K1-8_q0.05_multivariate.csv:
    A csv containing the number of significant outliers calculated using multivariate lfmm for all potential Ks (K=1 up until K=8) for a q-value of 0.05.
  - envAll:
    folder containing the results of running lfmm in multivariate mode for each K (K=1 up until K=8).
    - K1 to K8:
      Each folder (K1 to K8) contains summarized results and diagnostic plots from multivariate LFMM specifying K values from 1 to 8. (Note: Y in the file structure below is a placeholder for the specified K value, ranging from 1 to 8.)
      - AssociationsNb_envAll_KY_q.csv:
        A csv containing the number of significant outliers obtained when running multivariate lfmm specifying K = Y for different q-value thresholds.
      - CandidatesOrdered_envAllY_q*.csv:
        A csv containing the test results for the list of significant outlier loci obtained running multivariate lfmm specifying K = Y and the corresponding q-value threshold.
      - LFMM_AllResults_envAll_KY.csv:
        A csv containing the test results for each of the loci in the entire SNP dataset running multivariate lfmm specifying K = Y.
      - ManhattanPlot_envAll_KY_q*.pdf:
        A pdf depicting the Manhattan plot of all loci in the dataset specifying K = Y and the corresponding q-value threshold.
      - PvalueDistribution_envAll_KY.pdf:
        A pdf depicting the P-value distribution for K = Y.
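The res folder distinguishes outliers found by any method from those found by two or more methods. The sketch below illustrates that kind of overlap count in general terms. It is not the logic of TursiopsAduncus_overlap.R, which may differ; the function name, method labels and SNP identifiers are hypothetical.

```python
from collections import Counter


def detected_in_at_least(outliers_by_method, min_methods=2):
    """Return sorted SNP ids flagged by at least `min_methods` methods.

    `outliers_by_method` maps a method label to an iterable of SNP ids;
    duplicates within one method are counted once.
    """
    counts = Counter()
    for snps in outliers_by_method.values():
        counts.update(set(snps))
    return sorted(snp for snp, n in counts.items() if n >= min_methods)


# Hypothetical outlier lists, one per detection method.
outliers = {
    "pcadapt": ["snp_1", "snp_2", "snp_3"],
    "outflank": ["snp_2"],
    "pRDA": ["snp_2", "snp_3", "snp_4"],
    "lfmm": ["snp_5"],
}
shared = detected_in_at_least(outliers, min_methods=2)
any_method = detected_in_at_least(outliers, min_methods=1)
```

Requiring agreement between methods trades sensitivity for a lower false-positive rate, which is one common reason to keep both the "one method" and "two methods" lists.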
Contents:
This Dryad entry contains all the code and analysis steps associated with the manuscript: Demographic history and adaptive evolution of Indo-Pacific bottlenose dolphins (Tursiops aduncus) in Western Australia. The data and code are organized into several sections to facilitate replication and further exploration of the analyses conducted in this study.
- Code:
Part 1: General Population Structure
R-Scripts and commands used to perform population genetic analyses, including Principal Component Analysis (PCA), sNMF, DAPC, tests for Isolation By Distance (IBD) using depth-restricted in-water distances, and heterozygosity decline
Part 2: Demographic Modelling
Scripts used for demographic history modelling using dadi
Part 3: Outlier Tests/Local Adaptation
R-scripts focusing on different outlier scan methods to detect potential locally adaptive markers, including population-differentiation and genome-environment-association analyses
- Datasets:
Metadata:
Detailed metadata for all 133 individuals included in the study, including sampling locations, environmental data, and other relevant information
Data Frames:
Pre-processed and filtered datasets including intermediate files used to replicate each of the analyses
- Raw Data:
The corresponding raw read fastq files for all individuals analyzed in this study are available on the NCBI Short Read Archive (SRA) under BioProject ID PRJNA966102 (ddRAD data produced in Wittwer et al. 2023, https://doi.org/10.1111/mec.16984) and BioProject PRJNA1150531 (ddRAD data produced for this study). For further information, refer to the ESM published along with the manuscript.
Usage Notes:
The provided code is organized by analysis type and includes comments to guide users through each step of the analysis. Users should ensure that all dependencies (e.g., software, packages) are correctly installed in their environment to replicate the analyses. The R session info is provided as a .txt file with the package versions used.
The datasets are provided in CSV, tab-delimited and VCF formats, as well as additional formats (.geno, .lfmm, .bed, .gds), for compatibility with different software and packages.
The HTML file outlining the bioinformatic filtering steps (filtering_steps_in_the_command_line.html) provides a detailed protocol for preprocessing raw sequencing data, but it should be carefully adjusted to the species under study, the genomic data type and the research question(s).
For specific questions regarding Part 1: Population genomic structure & Part 3 Outlier Tests/Local Adaptation, please contact Svenja Marfurt: svenja.marfurt@iea.uzh.ch
For specific questions regarding Part 2: Demographic modelling, please contact Samuel Wittwer: samuel.wittwer@iea.uzh.ch