Recent origin of a range-restricted species with subsequent introgression in its widespread congener in the Phyteuma spicatum group (Campanulaceae)
Data files
Nov 24, 2024 version files 40.27 MB
-
dryad_files.zip
40.26 MB
-
README.md
11.34 KB
Abstract
Understanding the causes of restricted geographic distributions is of major interest to evolutionary and conservation biologists. Inferring historical factors has often relied on ad hoc interpretations of genetic data, and hypothesis testing within a statistical framework under different demographic scenarios remains underutilized. Using coalescent modeling on RAD-sequencing data we (i) test hypotheses about the origin of Phyteuma gallicum (Campanulaceae), a range-restricted endemic of central France sympatric with its widespread congener P. spicatum, and (ii) date its origin, irrespective of its mode of origin, to test the hypothesis that the restricted range is due to a recent time of origin. The best-supported model of origin is one of a dichotomous split of P. gallicum, confirmed as a distinct species, and the Central European P. nigrum with subsequent gene flow between P. gallicum and P. spicatum. The split of Ph. gallicum and P. nigrum is estimated at 45–55,000 years ago. Coalescent modeling on genomic data not only clarified the mode of origin (dichotomous speciation instead of hybridogenic origin) but identified recency of speciation as a sufficient, though likely not the sole, factor to explain the restricted distribution range. Coalescent modeling strongly improves our understanding of the evolution of range-restricted species that are frequently of conservation concern, as is the case for P. gallicum.
https://doi.org/10.5061/dryad.8gtht76x2 10.5061/dryad.8gtht76x2
This dataset contains the files used to reconstruct the phylogenetic structure between P. gallicum and its closest relatives P. nigrum, P. spicatum, and P. pyrenaicum, and any potentially reticulate relations between them using Splitstree, PCA (Principal Components Analysis), and sNMF and the data files used to test the best coalescent model and the divergence time using Fastsimcoal2.
Description of the data and file structure
The files for each step are placed in separate folders numerated by their order. For example, 01_artifical_reference_mapping_snp_calling contains the files of the first step.
1. Artificial reference, mapping, and SNP calling
First, the optimal parameters for de novo assembly are tested using a subset of samples with the bash scripts optimization.sh and optimization_pop.sh. The first script tests the diagonal of the parameter space (e.g., M=n=1, M=n=2, etc.) up to a value of seven, then the second script tests the n value for the best M parameter setting (e.g., M=4, n=1; M=4, n=2, etc.), also up to a value of seven.
Then the wrapper script denovo+filtering5.sh is used to:
- filter the catalog of loci (catalog.fa.gz) produced during the de novo assembly of the subset of samples by using STACKS2 populations, a shell command, and a custom Python script (filter_catalog.py),
- blast against a set of databases and filter out loci that do not match a plant (blast_result_parse.py, which produces a file called blacklist, and blacklist_filter.py which uses said blacklist file and the catalog of loci fasta file),
- run a Python pipeline script that assembles the artificial reference (using samtools, picard tools, and bowtie2), then maps the sequences of all sample files against that reference (again bowtie2), does some postprocessing of the sequencing files (sort using picard tools, add readgroup using picard tools and realigns the reads using GATK3.8), runs refmap.pl to call SNPs, produces a new whitelist of loci with at most ten SNPs using command line commands, and finally filters the SNPs using populations.
2. Preparation of model-testing datasets
The dataset used to analyse the relations between the four species (descriptive dataset) is the resulting populations.snps.vcf file from the final populations filtering run of the SNP calling. The reduced dataset used for the coalescent modelling was further filtered as follows:
- only loci that have a coverage of at least 6 are filtered out using vcftools:
vcftools --vcf populations.snps.vcf --minDP 6 --out cov6_spicgrp --recode
, - then STACKS2 populations is run with a population map consisting of only samples from the three species to be analysed (i.e., P. gallicum, P. nigrum and P. spicatum) and only retaining loci that are present in at least two of the samples in each sampling site (-p 15 -r 0.5):
populations --in-vcf cov6_spicgrp.recode.vcf --out-path cov6_GNS -p 15 -r 0.5 --vcf --threads 1 --popmap popmap_spicGroup_GNS
, - because you cannot assign minor/major allele for biallelic loci with exactly 50% allele frequency, those loci are removed using vcftools:
vcftools --vcf cov6_spicgrp.recode.p.snps.vcf --max-maf 0.499 --out spicgrp_cov6_no50MAF --recode
,
3. Running descriptive analyses
The descriptive analyses, the PCA, neighbor-net, and sNMF analyses were performed on the complete dataset including all four species.
The PCA was run using a custom R script available here: https://github.com/DennisLarsson/plot_multi-panel_PCA, after having converted the .vcf file to .stru format using PGDSpider.
The neighbor-net was prepared in splitstree after converting the vcf file into a nex file using PGDspider (the vcf format was first converted to phylip, then from phylip to nexus to ensure proper formatting).
The sNMF analysis was run using the R package LEA in a custom R script available here: https://github.com/DennisLarsson/sNMF_TESS3_in_R. The results were then plotted by first averaging the Q-matrix for each sampling location using a custom R script called calcAverageQ.R, then using a modified version of the R script piechartMap.R. Both scripts can be found here: https://github.com/DennisLarsson/PieChartMap
4. Preparing the SFS summary statistics files
The coalescent modeling using FastSimCoal2 uses the summary statistics SFS for each deme and these were produced using SFS-scripts by marqueda (https://github.com/marqueda/SFS-scripts). The wrapper shell script vcf2sfs_commands.sh runs the following to prepare the SFS files:
- first the names of the samples need to be renamed in order to be properly read by the subsequent scripts, this is done using the custom script renameVCF.py, which also produces a new popmap with the new names,
- after this has been done SFS-scripts script sampleKgenotypesPerPop.py is run to downsample the samples two (diploid) pseudoindividuals,
- then the SFS-scripts script vcf2sfs.py is used to calculate the SFS for each combination of deme to produce the 2D SFS files,
- finally, the SFS-scripts script foldSFS.py is used to fold the unfolded SFS.
5. Calculate the divergence time interval for sampling
Next, the interval to be used as a sampling range for the estimation of the parameter determining the divergence time of the common ancestor is calculated. This is done by calculating the maximum and minimum genetic distance between P. nigrum and P. spicatum. Before this can be done introgressed individuals must be excluded since they will lead to underestimation of the distance between the two taxa. These samples were excluded based on the results of the descriptive analyses (PCA, neighbor-net, and sNMF) and result in the new population map file popmap_spicNig_indv.
- First the shell script sort_vcffiles.sh found in extract_locusID_scripts was run to sort the loci IDs chronologically for both the full dataset and the further filtered modeling dataset,
- then the custom script extractIDvcf.py also in extract_locusID_scripts is run on the vcf files for both the full and the modeling dataset to produce a file (output.txt) with the loci that remained after the further filtering done in the modeling dataset, but with the same loci ID as in the full dataset (loci IDs are renamed during each run of populations):
python3 extractIDvcf.py populations.sorted.snps.vcf cov6_spicgrp.sorted.recode.p.snps.vcf
, - after some manual removal of duplicates, a command is run to produce a whitelist file that can be used by STACKS2 populations:
cut -f 3 output_dups_removed.txt | cut -d":" -f 1 > wl_GNS_filtered.txt
. - Next a run of populations is done to produce a vcf and a phylip file (containing full sequence information per locus) with only P. nigrum and P. spicatum samples and with only the loci from the modeling dataset (where they still vary):
populations -P 05ref_map -O fullsite_spicNig -M popmap_spicNig_indv -W wl_GNS_filtered.txt --vcf --phylip-var-all
Now Splitstree needs to be run to produce a distance matrix, but first, we need to prepare the files and calculate the best substitution model that needs to be tested:
- First the phylip file populations.all.phylip output by populations needs to be reformatted in order to be properly read: 1) remove the last line of the phylip file, 2) change sample names in the file (for example: copy the first section into a separate document, add tab before first few nucleotides using replace function, then copy into a spreadsheet, make new names [i.e. copy names from popmap, use replace function to remove redundant info, MAX 10 letters per sample name], 3) then paste into name column of the phylip file and replace the old names, 4) copy the section back into the phylip file, 5) ensure that spaces are added if sample names aren’t exactly 10 letters.
- Now jmodel test is run to test for the best substitution model using the following command:
java -jar ~/Programs/jmodeltest-2.1.10/jModelTest.jar -d populations.all.phylip -g 4 -f -s 11 -AIC
. The GTR+G model is the best (see the output file jmodeltest_results_ML_best).
Now the phylip file is converted to nexus format using PGDspider and then loaded into Splitstree (nigspic_fullsite.nex) where the substitution model is set to GTR+G. Once the distance matrix is produced (nigspic_fullsite_GTR_distances.txt) it is copied into a spreadsheet where the genetic distance is calculated into the number of generations (distance.ods).
6. The coalescent modeling
Testing of the different divergence scenarios and time of divergence of P. gallicum was done by constructing three 3-deme models and ten 4-deme models to cover each plausible scenario. These models can be found in the folder models under 06_coalescent_modeling.
These models were then run on a slurm cluster using the scripts fsc2_sbatch_prep.sh, fsc2_sbatch.sh and fsc2_submit.sh. Each model was run using the following command:
fsc2702 -t ${model}.tpl -n 100000 -m -e ${model}.est -M -L 60 -c 1 -B 1 -q --foldedSFS --removeZeroSFS --nosingleton
Once the models have finished running, the results are downloaded, and the script extract_results.sh is used to find the best run of each model, with raw results in the file bestLH. All of the results, including AIC scores and ratio to observed likelihood are calculated in the results.ods file.
7. 95% confidence interval (CI)
The parameters of the best run of the best model are then used to simulate 100 replicates, meaning the exact same parameters are used to simulate and generate new SFS data which is then used as observed data to test the confidence of the model.
To do this, the script 100SFS_script.sh in the folders for the 3-deme and the 4-deme models are run (found in 3-deme_95_CI/prep_scripts and 4-deme_95_CI\prep_scripts, respectively) to generate the simulated data. Then model tests are run using the same settings as before (100 runs, 60 optimization cycles and 100000 iterations each) for each of the 100 pseudoreplicates on the slurm cluster using the scripts fsc2_sbatch.sh, fsc2_sbatch_prep_1_20.sh, fsc2_sbatch_prep_21_40.sh, fsc2_sbatch_prep_41_60.sh, fsc2_sbatch_prep_61_80.sh, fsc2_sbatch_prep_81_100.sh and fsc2_submit.sh, found in 3-deme_95_CI/3-deme_bootstrap and 4-deme_95_CI/4-deme_bootstrap for the best 3-deme and 4-deme model.
Next, the script extract_results.sh in the folders 07_95_CI_bootstrap is used to extract the best run of each of the 100 pseudoreplicates, giving us the 100 best runs across all pseudoreplicate runs. These are then used to calculate the 95% CI using the R script calc_95_ci.R and presented in the files 3-deme_bootstrap_results_3_4_2022.ods and 4-deme_bootstrap_results_3_4_2022.ods, for each deme model test.
Leaf material along with voucher specimens from several sampling sites for each study taxon were sampled in the field and stored in silica-gel. DNA was extracted using the Invisorb Spin Plant Mini Kit (Invitec Molecular, Berlin, Germany) following the manufacturer’s protocol with one modification. The genomic extracts were cleaned with NucleoSpin gDNA Clean‑up. RAD libraries were prepared using a custom protocol.
The reads were demultiplexed and quality filtered using BamIndexDecoder from illumina2bam 1.03 and process_radtags from Stacks2 2.41. assembly was done using denovo_map.pl script from Stacks2. We retained polymorphic RADtags with a maximum of 60% missing data across individuals and a maximum of ten SNPs per locus using a combination of populations from Stacks2 and a custom script. Additionally, we retained only loci identified as belonging to a spermatophyte by blasting the RADtags against the BLAST databases.
We further mapped the raw reads of all individuals against an artificial reference including the filtered RADtags using Bowtie2 2.3.4.1. The SAM files were converted to BAM files, sorted by reference coordinates, and read groups were added using Picard 2.20.1. Realignments around indels were done using Genome Analysis Toolkit 3.8 (GATK). Final genotypes were called using ref_map.pl from Stacks2.
The called SNPs were filtered into two datasets, one, for exploratory purposes, comprising all investigated taxa including P. pyrenaicum and P. × adulterinum (hereafter the “complete dataset”) and a second one, for coalescent modeling, comprising only samples from P. gallicum, P. nigrum and P. spicatum (the “reduced dataset”). The complete dataset was filtered as follows: for each locus with at most ten SNPs, a single random SNP with at most 50% missing data, a maximum observed heterozygosity of 50% and that was variable in more than one individual (i.e., not a singleton) was extracted using populations from Stacks2. The reduced dataset was further filtered by only retaining SNPs from loci with a coverage of at least six using vcftools 0.1.15 and by requiring that all SNPs must be present in at least two individuals per sampling site. Finally, loci that had an allele frequency of exactly 0.5:0.5 over the entire dataset were removed since their minor allele frequency cannot be determined.
A Neighbor-net was calculated using the Hasegawa-Kishino-Yano substitution model with empirical frequencies in Splitstree4 4.16. We also made an ordination of the samples by calculating principal components (PC) for both datasets using the PCA (Principal Component Analysis) function in the R (3.6.3) package adegenet 2.1.3. Finally, we analyzed the genetic structure between the taxa using sNMF (Frichot et al 2014) as part of the R package LEA 2.6.0.
As a summary statistic for coalescent modeling we used Site Frequency Spectra (SFS) for each deme (gene pool) as identified with sNMF and PC. To avoid potential biases of the SFS from missing data, we first down-sampled the 6-8 haploid genotypes (corresponding to 3-4 diploid individuals) of each sampling site by randomly sampling four (present) haploid genotypes from across all individuals for each locus within the respective sampling site, thus ensuring that no missing data is present in the dataset. Both the subsampling and the calculation of the folded 2D SFS were done using SFS-scripts. The coalescent modeling was further done using FastSimCoal 2.7 (Excoffier et al 2021).
In order to derive an informed interval to be used as a sampling range for the estimation of the parameter determining the divergence time of the common ancestor (TDIVANC), we calculated the maximum and minimum genetic distance between P. nigrum and P. spicatum. The best substitution model was selected using jModelTest 2.1.10 and Splitstree 4.16 was used to calculate a distance matrix, which was then converted to generations using the mutation rate of Arabidopsis thaliana (5.9E-09 substitutions/site/generation).
In order to get a 95% confidence interval (CI) of the estimated age parameters (time of divergence of the ancestor of all taxa and time of origin of P. gallicum), for the best-fitting model, we used the parametric bootstrapping approach of Ye et al (2020). Briefly, we used FastSimCoal 2.7 to simulate 100 pseudoreplicate SFS under the parameters of the best-fitting run. Each of these pseudoreplicate SFS was then used as observed data for another 100 replicate analyses using the same settings as for the original data (100,000 coalescent simulations and 60 optimization cycles). The best-fitting replicate was then selected for each simulated SFS and used to calculate the mean parameter estimates and the 95% CIs.