Population genetics and invasion history of the European Starling across Aotearoa New Zealand
Data files
Oct 22, 2024 version files 75.36 MB
-
Metadata_NZ_AU_UK_BE_ReplicatesSibRemoved2.csv
11.46 KB
-
README.md
2.58 KB
-
starling_noduprel_qual_miss_filt.recode.vcf
75.34 MB
Abstract
The expansion of human settlements over the past few centuries is responsible for an unprecedented number of invasive species introductions globally. An important component of biological invasion management is understanding how introduction history and post-introduction processes have jointly shaped present-day distributions and patterns of population structure, diversity, and adaptation. One example of a successful invader is the European starling (Sturnus vulgaris), which was intentionally introduced to numerous countries in the 19th century, including Aotearoa New Zealand, where it has become firmly established. We used reduced-representation sequencing to characterise the genetic population structure of the European starling in New Zealand, and compared the population structure to that present in sampling locations in the native range and invasive Australian range. We found that population structure and genetic diversity patterns suggested restricted gene flow from the majority of New Zealand to the northmost sampling location (Auckland). We also profiled genetic bottlenecks and shared outlier genomic regions, which supported historical accounts of translocations between both Australian subpopulations and New Zealand, and provided evidence of which documented translocation events were more likely to have been successful. Using these results as well as historic demographic patterns, we demonstrate how genomic analysis complements even well-documented invasion histories to better understand invasion processes, with direct implication for understanding contemporary gene flow and informing invasion management.
https://doi.org/10.5061/dryad.6djh9w1bd
Description of the data and file structure
Data Files:
starling_noduprel_qual_miss_filt.recode.vcf - final filtered genetic data file.
Metadata_NZ_AU_UK_BE_ReplicatesSibRemoved2.csv - metadata file containing coordinate and location information for all samples in the final data file.
Scripts:
01_sequencecleaning_code.pdf: Trimming data from all the the sequence data batches.
02_SNPcalling_code.pdf: Calling SNPs using BCFtools.
03_SNPfiltering_code.pdf: Code for the SNP filtering methods.
04_BioinformaticSexing_code.pdf: Profiling the sex of individuals using sex chromosome heterozygosity levels.
05_Popgen_Selection_code.pdf: Baypass analysis and plots for assessing signals of selection within invasive lineages.
06_SFS_code.7z: Code for the folded site frequency spectrum fitering and plots.
07_jaccard_diversity_code.ipynb: Genetic pairwise dissimilarity using Jaccard’s.
08_MMRR_code.R: Isolation by distance and environment analysis, performed using MMRR r package.
09_PSMC_code.ipynb: demographic analysis using PSMC on whole genome resequencing data.
10_Stairwayplot2_code.pdf: demographic analysis using StairwayPlot2 on the site frequency spectrum data taken from the reduced representation sequencing data.
Files and variables
File: Metadata_NZ_AU_UK_BE_ReplicatesSibRemoved2.csv
Description:
Variables
- id:Individuals ID matching VCF file ID
- pop: Sampling locations for individuals.
- pop2: Population location groupings for individuals used in the manuscript.
- Con: Continent code for the population.
- lat: Latitude of capture site.
- lon: Latitude of capture site.
- loc: More detailed discription of capture/colleciton/source location.
File: starling_noduprel_qual_miss_filt.recode.vcf
Description:
Final genetic variant file after filtering.
Access information:
The raw DArT-seq data have been deposited under BioProject accession no. PRJNA1164936 (MRL samples) and PRJNA1168657 (AUK, WHA, WEL, and CAN samples) in the NCBI BioProject database (https://www.ncbi.nlm.nih.gov/bioproject/). The whole genome resequencing data of the three MRL individuals used in the psmc analysis, along with an additional three individuals from MRL individuals that were not presented in any analysis in this manuscript, have been deposited under BioProject accession no. PRJNA1165315.
Sample Collection
A total of 106 starling specimen samples were obtained from various contributors within New Zealand from five geographically distinct locations between May 2022 and October 2023. Sampling covered three locations in the North Island, specifically in the Auckland region (AUK: n=18), the Manawatū-Whanganui region (WHA: n=12), the Wellington region (WEL: n=40) and two in the South Island in the Marlborough region (MRL: n=15) and Canterbury region (CAN: n=21). In addition to the newly obtained samples, we also incorporated sequence data from the native European range (Antwerp, Belgium; ANT: n=15, Newcastle, United Kingdom; NWC: n=15, Monks Wood, United Kingdom; MKW: n=15), as well as two locations from within the invasive Australian range (Orange; ORG: n=15, McLaren Vale; MLV: n=15) from a previously published Diversity Arrays Technology Pty Ltd sequencing (DArT-seq) dataset.
DNA Extraction and Sequencing
Extracted DNA from the newly collected New Zealand samples was sent to Diversity Arrays for sequencing. Sequencing was performed on an Illumina Hiseq2500/Novaseq6000.
Raw Sequence Processing
The previously published raw DArT-seq data, along with the MRL samples (January 2023 sequencing batch) were demultiplexed using stacks v2.2 process_radtags, while also discarding low quality reads (-q), reads with uncalled bases (-c), and rescuing barcodes and RAD-Tag cut sites (-r). It was not necessary to perform this step on the remainder of the new raw sequence data because DArT performed in-house demultiplexing using a proprietary bioinformatic pipeline.
For all the data, we used fastp v0.23.2 to remove adapter sequences and in the same step filtered reads for a minimum Phred quality score of 22 (-q 22) and a minimum length of 40 (-l 40). Both batches of sequence data produced as part of this study were additionally length trimmed to reduce the read length of the newer sequence data to match the base length of the older sequence data (-b 69).
Mapping, Variant Calling, and Filtering
We used the program bwa v0.7.17 to index the reference genome S. vulgaris vAU1.0 and align the trimmed DArT reads using the bwa aln function (-B 5 to trim the first 5 base pairs of each read), which is optimised for single-end short reads. This was then followed by the bwa samse function for producing the SAM formatted output files containing the alignments and their respective base qualities. Alignments were then sorted and indexed using samtools v1.16.1, and single nucleotide polymorphisms (SNPs) were subsequently called and annotated using bcftools v1.16 with the mpileup (-a "DP,AD,SP", --ignore-RG) and call (-mv, -f GQ) functions.
We removed known technical replicates and identified relatives from the data. vcftools v0.1.15 was used to remove indels (--remove-indels), and quality filter for a minimum site quality score of 30 (--minQ30), minimum genotype quality score of 20 (--minGQ 20), and minimum and maximum depth of coverage of 5 (--minDP 5) and 100 (--maxDP 100). Then, to account for batch effects that may impact the sequenced loci, we kept only SNPs present in at least 50% of the individuals in each sampling location. We ran one final filtering step to ensure appropriate levels of missingness and rare alleles using the following parameters: maximum missingness per site of 30% (--max-missing 0.7), minor allele count of 5 (--mac 5), and a minimum and maximum allele per locus of 2 (--min-alleles 2 --max-alleles 2), resulting in a dataset containing 19,174 SNPs and 141 individuals.