Genome-wide differentiation by geography not species in taxonomically complex eyebrights (Euphrasia)
Data files
Dec 27, 2024 version files 7.48 MB
- AllTaxaDataset.vcf.gz (5.23 MB)
- DiploidDataset.vcf.gz (835.65 KB)
- README.md (3.03 KB)
- unlink_AllTaxaDataset.vcf.gz (1.13 MB)
- unlink_DiploidDataset.vcf.gz (278.27 KB)
Abstract
Most studies investigating the genomic nature of species differences anticipate monophyletic species with genome-wide differentiation. However, this may not be the case at the earliest stages of speciation where reproductive isolation is weak and homogenising gene flow blurs species boundaries. We investigate genomic differences between species in a postglacial radiation of eyebrights (Euphrasia), a taxonomically complex plant group with variation in ploidy and mating system. We use genotyping-by-sequencing and spatially-aware clustering methods to investigate genetic structure across 378 populations from 18 British and Irish Euphrasia species. We find only northern Scottish populations of the selfing heathland specialist E. micrantha demonstrate genome-wide divergence from other species. Instead of genetic clusters corresponding to species, all other clusters align with geographic regions, such as a genetic cluster on Shetland that includes ten tetraploid species. Recent divergence and extensive gene flow between putative species is supported by a lack of species-specific SNPs or clear outlier loci. We anticipate a similar lack of association between genomic clusters and species identities may occur in other recent postglacial groups. Where new species emerge this is associated with a transition in mating system or novel ecological preferences.
README: Genome-wide differentiation by geography not species in taxonomically complex eyebrights (Euphrasia)
https://doi.org/10.5061/dryad.xd2547dt7
Description of the data and file structure
Data Description and Analysis Details
This paper includes two main datasets: the “All Taxa Dataset” and the “Diploid Dataset”, generated using specific filtering parameters as described below.
- Main Datasets:
- AllTaxaDataset.vcf.gz: Contains 19,666 SNPs for the All Taxa Dataset, with an average of 19% missing data per site.
- DiploidDataset.vcf.gz: Contains 26,278 SNPs for the Diploid Dataset, with an average of 11% missing data per site. A total of 768 individuals from 356 populations were retained for downstream analyses.
- SupTab1_SampleInfo.xlsx: Contains all the annotation information for each sample.
- Datasets for fastStructure:
To meet the requirements of fastStructure, we applied additional filtering to remove all linked SNPs:
- unlink_DiploidDataset.vcf.gz: Contains 6,016 SNPs for the Diploid Dataset.
- unlink_AllTaxaDataset.vcf.gz: Contains 5,220 SNPs for the All Taxa Dataset.
- Analyses and R Scripts:
The majority of the analyses were performed in R. The following analyses are included, with corresponding R scripts provided:
- AMOVA (Analysis of Molecular Variance)
- PCA (Principal Component Analysis)
- DAPC (Discriminant Analysis of Principal Components)
- fastStructure plots and geographic maps for the pie plots
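The SNP counts and per-site missingness quoted for each VCF can be checked independently before running any analysis. The following is a minimal sketch in Python (standard library only), not one of the authors' scripts. The function name `vcf_site_summary` is hypothetical, and the sketch assumes each sample column begins with the GT subfield and that a genotype is missing when any of its alleles is `.`. The README does not state how its missingness figures were computed, so small differences are possible.

```python
import gzip


def vcf_site_summary(path):
    """Return (number of variant records, mean per-site missing fraction).

    Hypothetical helper: assumes GT is the first colon-separated subfield
    of every sample column, and treats a genotype as missing if any of
    its alleles is ".".
    """
    n_sites = 0            # all variant records
    n_scored = 0           # records that have at least one sample column
    missing_sum = 0.0      # sum of per-site missing fractions
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip meta-information, header, and blank lines
            n_sites += 1
            fields = line.rstrip("\n").split("\t")
            samples = fields[9:]  # sample columns follow the 9 fixed columns
            if not samples:
                continue
            n_missing = 0
            for sample in samples:
                gt = sample.split(":")[0].replace("|", "/")
                if "." in gt.split("/"):
                    n_missing += 1
            n_scored += 1
            missing_sum += n_missing / len(samples)
    mean_missing = missing_sum / n_scored if n_scored else 0.0
    return n_sites, mean_missing
```

Calling `vcf_site_summary("AllTaxaDataset.vcf.gz")` on a local copy of the file would return a record count and a missing fraction that can be compared against the values listed above.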
Files and variables
File: DiploidDataset.vcf.gz
Description: Contains 26,278 SNPs for the Diploid Dataset, with an average of 11% missing data per site.
File: unlink_AllTaxaDataset.vcf.gz
Description: Contains 5,220 SNPs for the All Taxa Dataset.
File: unlink_DiploidDataset.vcf.gz
Description: Contains 6,016 SNPs for the Diploid Dataset.
File: AllTaxaDataset.vcf.gz
Description: Contains 19,666 SNPs for the All Taxa Dataset, with an average of 19% missing data per site.
Supplemental information file: SupTab1_SampleInfo.xlsx
Description: Contains two sheets: Samples_Info and Keyfile. Samples_Info contains the annotation information for each individual sample that is required for the analyses. Keyfile links the samples to the raw data (available from NCBI, see Access information below) and is used when running the TASSEL-GBS pipeline.
Code/software
R script: 1_AMOVA.R
Description: Analysis of Molecular Variance
R script: 2_PCA.R
Description: Principal Component Analysis
R script: 3_DAPC.R
Description: Discriminant Analysis of Principal Components
R script: 4_Structure&Maps.R
Description: fastStructure plots and geographic maps for the pie plots
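The R scripts listed above are not reproduced here. As a rough illustration of the kind of input that PCA and DAPC typically work from, the sketch below converts VCF genotype fields into counts of non-reference alleles. It is written in Python rather than R, it is not the authors' code, and the helper names `gt_to_dosage` and `dosage_row` are hypothetical. It assumes GT is the first subfield of each sample column and makes no assumption about ploidy beyond counting whatever alleles are listed.

```python
def gt_to_dosage(sample_field):
    """Count non-reference alleles in one VCF sample field.

    Returns None when any allele is missing ("." or empty).
    Hypothetical helper for illustration only.
    """
    gt = sample_field.split(":")[0].replace("|", "/")
    alleles = gt.split("/")
    if any(allele in (".", "") for allele in alleles):
        return None
    return sum(1 for allele in alleles if allele != "0")


def dosage_row(vcf_line):
    """Convert one VCF data line into a list of per-sample dosages."""
    fields = vcf_line.rstrip("\n").split("\t")
    return [gt_to_dosage(sample) for sample in fields[9:]]
```

Applying `dosage_row` to every data line of a VCF gives a sites-by-samples matrix. Missing entries (`None`) would still need to be imputed or filtered before an ordination, and how the authors handled them is defined by their own scripts.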
Access information
Data were derived from the following sources:
- Raw data are available from NCBI under BioProject accession PRJNA1125641.