Skip to main content

Historical museum samples enable the examination of divergent and parallel evolution during invasion

Cite this dataset

Stuart, Katarina et al. (2022). Historical museum samples enable the examination of divergent and parallel evolution during invasion [Dataset]. Dryad.


During the Anthropocene, Earth has experienced unprecedented habitat loss, native species decline, and global climate change. Concurrently, greater globalisation is facilitating species movement, increasing the likelihood of alien species establishment and propagation. There is a great need to understand what influences a species’ ability to persist or perish within a new or changing environment. Examining genes that may be associated with a species’ invasion success or persistence informs invasive species management, assists with native species preservation, and sheds light on important evolutionary mechanisms that occur in novel environments. This approach can be aided by coupling spatial and temporal investigations of evolutionary processes. Here we use the common starling, Sturnus vulgaris, to identify parallel and divergent evolutionary change between contemporary native and invasive range samples and their common ancestral population. To do this, we use reduced-representation sequencing of native samples collected recently in north-western Europe and invasive samples from Australia, together with museum specimens sampled in the UK during the mid-19th Century.  We found evidence of parallel selection on both continents, possibly resulting from common global selective forces such as exposure to pollutants. We also identified divergent selection in these populations, which might be related to adaptive changes in response to the novel environment encountered in the introduced Australian range. Interestingly, signatures of selection are equally as common within both invasive and native range contemporary samples. Our results demonstrate the value of including historical samples in genetic studies of invasion and highlight the ongoing and occasionally parallel role of adaptation in both native and invasive ranges.


DArTseq of 85 sturnus vulgaris samples.

Excerpt from manuscript pertaining to variant calling and filtering for attached files.

Variant calling:

We used the STACKS v2.2 pipeline to process the DArTseq raw data. We used the process_radtags function to clean the tags; discarding reads of low quality (-q), removing reads with uncalled bases (-c), and rescuing barcodes and radtags (-r). We used the Burrows-Wheeler aligner (BWA) v0.7.15 aln function to align the read data to the reference genome S. vulgaris vAU1.0. Using FastQC, we identified base sequence bias in the adapter region, and so the first five bases were trimmed (-B 5) during alignment. The reads were then processed through BWA samse and SAMTOOLS v1.10, before SNP variants were called through STACKS gstacks (default parameters) and then populations (parameter information below).

bwaaln_filter_allsample_rr_nofamily.recode.vcf file:

We generated a ‘population genetics’ variant file by running STACKS populations, filtering for a minimum per-population site call rate of 50% (-r 0.5), a minimum populations per-site of 2 (-p 2), a minimum loci log likelihood value of -15 (--lnl_lim -15), with one random SNP per tag retained (--write_random_snp). We used VCFTOOLS v0.1.16 to filter the following parameters: maximum missingness per site of 10% (-max-missing 0.9), minor allele frequency of 2.5% (MAF; --maf 0.025), minimum loci depth of 2 (--minDP 2), minimum genotype quality score of 15 (--minGQ 15) and site Hardy-Weinberg Equilibrium exact test minimum p value 0.001 (--hwe 0.001). We chose a high threshold for missingness to not bias the population genetics analysis against the historical samples, which had much higher levels of missingness than the contemporary samples. MAF filtering helps remove misreads, and HWE filtering removed highly non-neutral loci, both of which are important for capturing neutral population substructure. After filtering, we calculated individual relatedness, and closely related individuals were removed so that there was only one representative from each cluster in the final data. This resulted in a population genetic variant file of 3,840 SNPs used in the subsequent section ‘Population structure analysis’.

bwaaln_allsample_selection_histSNPs_maf025.recode.vcf file:

We generated a ‘selection’ variant file by using STACKS populations to align the raw reads for all samples (with --lnl_lim -15 --write_random_snp flags) and then used VCFTOOLS to filter out only SNPs present in at least 50% of the historical individuals (i.e. in at least 5 historical individuals), with additional quality filtering (--minGQ 15 --minDP 2), resulting in 12,219 SNP sites. Only these sites were then retained to filter the original populations variant file, along with a MAF minimum of 2.5% to remove possible sequencing errors. This produced a data set that retained only SNPs sequenced in at least half the historical individuals, which would be necessary for selection analysis.

Usage notes

This upload currently contains the raw scripts and processed genetic VCF files.

More fully annotated and cleaned scripts, along with some script vignettes will be available on GitHub (

The raw genetic data files have been deposited under BioProject accession number PRJNA781785 in the NCBI BioProject database (