Data from: Beyond population size: Whole-genome data reveal bottleneck legacies in the peninsular Italian wolf
Data files
Aug 31, 2024 version files 4.70 GB
-
Codes_DRYAD.zip
31.94 KB
-
Data_DRYAD.zip
4.70 GB
-
README.md
3.60 KB
Abstract
Preserving genetic diversity and adaptive potential while avoiding inbreeding depression is crucial for the long-term conservation of natural populations. Despite demographic increases, traces of past bottleneck events at the genomic level should be carefully considered for population management. From this perspective, the peninsular Italian wolf is a paradigmatic case. After being on the brink of extinction in the late 1960s, peninsular Italian wolves rebounded and recolonized most of the peninsula aided by conservation measures, including habitat and legal protection. Notwithstanding their demographic recovery, a comprehensive understanding of the genomic consequences of the historical bottleneck in Italian wolves is still lacking. To fill this gap, we sequenced whole genomes of thirteen individuals sampled in the core historical range of the species in Central Italy to conduct population genomic analyses, including a comparison with wolves from two highly-inbred wolf populations (i.e., Scandinavia and Isle Royale). We found that peninsular Italian wolves, despite their recent recovery, still exhibit relatively low genetic diversity, a small effective population size, signatures of inbreeding, and a non-negligible genetic load. Our findings indicate that the peninsular Italian wolf population is still susceptible to bottleneck legacies, which could lead to local inbreeding depression in case of population reduction or fragmentations. This study emphasizes the importance of considering key genetic parameters to design appropriate long-term conservation management plans.
README: Beyond population size: whole-genome data reveal bottleneck legacies in the peninsular Italian wolf.
https://doi.org/10.1093/jhered/esae041
Description
This dataset consists of two folders: one containing data ('Data_DRYAD') and the other one containing codes ('Codes_DRYAD').
The 'Data_DRYAD' folder includes autosome SNP genotypes in VCF format, compressed with bgzip, from newly sequenced whole genomes of Italian wolves (WIT), Scandinavian wolves (WSC), and Isle Royale wolves (WUS). After the population acronym (WIT, WSC or WUS) each file name is followed by 'CF31' or 'wolfAlign': 'CF31' means that the reads have been mapped on the dog reference genome (CanFam3.1; Lindblad-Toh et al. 2005); 'wolfAlign' means that the reads have been mapped on the wolf reference genome (Gopalakrishnan et al. 2017). The raw sequences for WSC and WUS are available on NCBI (WSC: BioProject PRJEB20635; WUS: BioProject PRJNA512209). The raw reads for WIT are going to be uploaded on NCBI soon.
The 'Codes_DRYAD' folder contains scripts in text (.txt) files, numbered to reflect the pipeline and workflow used to produce the results of this research. Each script file includes a description and explanation of the codes used, and mentions the required tools.
Folders & files
- 'Data_DRYAD' folder: This folder contains all the genomes that have been processed and genotyped in VCF format, compressed using bgzip through
bcftools view
.- WIT.CF31.SNP.noXY.vcf.gz -> Italian wolves autosome SNP genotypes mapped on dog reference genome (CanFam3.1);
- WIT.wolfAlign.SNP.noXY.vcf.gz -> Italian wolves autosome SNP genotypes mapped on wolf reference genome (Gopalakrishnan et al. 2017);
- WSC.CF31.SNP.noXY.vcf.gz -> Scandinavian wolves autosome SNP genotypes mapped on dog reference genome (CanFam3.1);
- WSC.wolfAlign.SNP.noXY.vcf.gz -> Italian wolves autosome SNP genotypes mapped on wolf reference genome (Gopalakrishnan et al. 2017);
- WUS.CF31.SNP.noXY.vcf.gz -> Isle Royale wolves autosome SNP genotypes mapped on dog reference genome (CanFam3.1);
- WUS.wolfAlign.SNP.noXY.vcf.gz -> Isle Royale wolves autosome SNP genotypes mapped on wolf reference genome (Gopalakrishnan et al. 2017).
- 'Codes_DRYAD' folder: This folder contains all the scripts used to process the data. Each kind of analyses is associated with a specific and unique file.
- 01_mapping -> Commands to perform raw reads mapping on a reference genome.
- 02_genotyping_HF -> Commands to perform genotype calling and hard filtering on genotypes.
- 03_intersect_pops -> Commands to intersect population genotypes, as we did genotype calling separately for each population.
- 04_admixture_analyses -> Commands to perform PCA and ADMIXTURE analyses.
- 05_remove_admixed -> Commands to remove admixed individuals from the dataset (if you want to perform the following analyses excluding admixed individuals).
- 06_genomic_variability_analyses -> Commands to perform genomic variability analyses (heterozygosity and nucleotide diversity).
- 07_demographic_analyses -> Commands to perform recent historical and contemporary effective population size estimations.
- 08_inbreeding_analyses -> Commands to perform runs of homozygosity (ROH) estimation, to derive number of ROHs (NROH), sum of ROHs (SROH), fraction of ROHs (FROH), and to estimate ROHs coalescence time.
- 09_genetic_load_analyses -> Commands to perform SNPs polarization and genetic load estimation, discriminate realized and masked load, and define properly deleterious variants.
Methods
2.1 Sample collection & DNA extraction
Tissue samples were collected from 13 peninsular Italian wolves between 2007 and 2012 from found-death individuals in the Central Apennines, where historical strongholds of wolves in Italy are located (Zimen and Boitani, 1975). The tissue samples were stored in ethanol at -20°C and subsequently processed in the Conservation Genomics Research Unit at the Fondazione Edmund Mach (FEM). Small fragments of tissue of around 25 mg were extracted using the DNeasy Blood and Tissue Kit (Qiagen) with overnight digestion at 56°C. The elution was performed at the GLOBE Institute (University of Copenhagen) using two washes of 50 μL of AE buffer, with 10 minutes of incubation at 37°C. Until the elution, samples were stored at -20°C inside the DNeasy Mini spin columns.
2.2 Library preparation, amplification & whole genome sequencing
Extracts were fragmented in the Covaris LE220 plus Focused-ultrasonicator with the parameters set for 350-bp fragment length. The extracts were diluted to obtain 100 ng concentration and BGI libraries for Italian wolves were constructed following previously optimized protocols (Carøe et al., 2018; Mak et al., 2017) and using 10 μM adaptors. Libraries were purified using MinElute columns using PE buffer (Qiagen) and eluted in 60 μL of EB buffer. The PCR mixture for the peninsular Italian wolf libraries consisted of: 20 μL of purified library, 0.2 μM of forward and reverse BGI primers, 2.5 U/μL PfuTurbo Cx HotStart DNA Polymerase, 10 μL of Buffer 10X, 0.08 mg/mL BSA, 0.5 mM of dNTPs (25 μM) and 61.2 μL AccuGene molecular biology water (Lonza, Basel, CH). The amplification of peninsular Italian wolf samples was performed as follows: initial denaturation at 95°C for 2 minutes followed by 10 to 12 cycles of 95°C for 30 seconds, 60°C for 30 seconds, and 72°C for 110 seconds, and a final elongation step at 72°C for 10 minutes. Peninsular Italian wolf samples were sequenced on ⅛ of a lane each on MGIseq2000 PE150 and DNBSEQ PE150, respectively. Four out of 13 samples have been previously published by (Ciucani et al., 2023) (Supplementary Table 1) at a low coverage (ca 3.6x). We resequenced these samples aiming to reach a higher coverage for the purpose of this study.
2.3 Dataset
In addition to the 13 peninsular Italian wolves (WIT) genomes sequenced here, we also retrieved genomic data from public databases (NCBI GenBank) (Supplementary Table 1). Therefore, the final dataset compiled for this study comprised 101 modern samples, including 13 WIT (Canis lupus italicus), 10 Scandinavian wolves (Canis lupus) (WSC) (Kardos et al., 2017), 11 North American wolves from Isle Royale (Canis lupus) (WUS) (Robinson et al., 2019), and 67 modern domestic dogs (Canis lupus familiaris) belonging to 67 different breeds of medium or large size (DOG) (Supplementary Table 1). The 10 Scandinavian wolves were a random subsample of those available (n=96). One African wolf (Canis lupaster) and one Golden Jackal (Canis aureus) were used as outgroups (OUT) for the genetic load analyses.
2.4 Quality control & alignment
We applied a quality control procedure on the sequencing reads, using FastQC (Andrews, 2010) to check for possible issues such as low quality scores and anomalous GC content, and we used multiQC (Ewels et al., 2016) to visualize them. The reads were mapped both onto the wolf reference genome (Gopalakrishnan et al., 2017) and onto the dog reference genome (CanFam3.1(Lindblad-Toh et al., 2005)), as the “admixture analyses” and the “genetic load analyses” required genomic regions from dogs and outgroups to be also mapped to the dog reference genome. To perform mapping, we set and ran the automated PALEOMIX BAM pipeline (Schubert et al., 2014): first, it indexed each read prefix using SAMtools ‘faidx’ (Danecek et al., 2021); then it removed the specified BGI adapters using AdapterRemoval (Lindgreen, 2012); the mapping was done using BWA ‘mem’ algorithm that is suggested for modern samples (Li and Durbin, 2009), setting the minimum mapping quality to 0 to retain all the reads in this step; to conclude PCR duplicates were removed using Picard MarkDuplicates (https://broadinstitute.github.io/picard/). After that, we used SAMtools (Danecek et al., 2021) to remove non primary alignment reads (samtools view -F 256).
2.5 Genotype processing
We used GATK v 4.3.0.0 and referred to GATK Best Practice Workflow to call high quality genotypes (Van Der Auwera et al., 2013). Then we applied two additional GATK tools for ‘hard filtering’ our genotypes: VariantFiltration (QD < 2.0, FS > 60.0, MQ < 40.0, MQRankSum < −12.5 and ReadPosRankSum < −8.0; settings taken from alternative protocol 2 in GATK Best Practices) to mark the filters, and SelectVariants to apply them. Finally, we applied other filters using VCFtools (Danecek et al., 2011) to keep only biallelic SNPs (flags --remove-indels --max-alleles 2 --min-alleles 2) and to filter for minor allele frequency (MAF), missingness, minimum quality and minimum and minimum average depth (flags --maf 0.05 --max-missing 0.9 --minQ 30 --minDP 5 --min-meanDP 5). We used those filters on different datasets according to the assumptions of the downstream analyses (Supplementary Table 2). As “admixture analyses” rely on the assumption that SNPs are not in physical linkage, we performed LD pruning using PLINK v 1.90b6.21 (Chang et al., 2015), setting a window size of 10 kb, a step size of 5 bp and an r2 threshold of 0.5 (flag --indep-pairwise 10 5 0.5). Moreover, to avoid violating the assumptions of random mating when carrying out most population genomic analyses , we used NgsRelate2 to identify and eventually remove closely related individuals (Hanghøj et al., 2019) applying thresholds of KING-robust kinship ≥ 0.20, R0 ≤ 0.1, and R1 ≥ 0.5 (Waples et al., 2019).
2.6 Admixture analyses
We explored patterns of genetic differentiation among samples through a preliminary non-model Principal Components Analysis (PCA), calculating principal components with PLINK v 1.90b6.21 (flag --pca). The percentage of variance explained was calculated from the ‘.eigenval’ output, and the first two principal components (PCs) were used for plotting in R v 4.2.1 using ‘ggplot2’ (Wickham, 2016). Then, we used ADMIXTURE v1.3.0 that uses a maximum likelihood approach (Alexander et al., 2009) to estimate the proportions of a given number of ancestries (K) for each individual. We assumed K values from 2 to 6 and set the --cv flag to calculate cross-validation errors (CV) for each K. For each K, we ran 15 independent iterations with different starting seeds and chose the iterations with the highest likelihood. The best K was then chosen based on the lowest CV value among the best iterations. If individuals potentially admixed with dogs were identified, we re-ran the analysis excluding them. Admixed individuals were further investigated within each specific wolf population through a supervised ADMIXTURE analysis (flag --supervised) to confirm their status after reapplying the filters for ‘admixture analyses’. In case some of these individuals were confirmed as admixed (i.e., hybrids or introgressed with dog), we conducted the downstream analyses by both keeping and removing them to highlight potential differences. As Stefanović et al. (Stefanović et al., 2024) demonstrated pervasive jackal-dog hybridization across the Canis aureus range, we also checked dog ancestry in the chosen outgroups (OUT) applying an additional supervised ADMIXTURE analysis.
2.7. Genomic variability analyses
To compare the patterns of genomic variation among the three wolf populations, we estimated the observed heterozygosity (Ho) and nucleotide diversity (π). We used ANGSD (Korneliussen et al., 2014) to estimate the heterozygosity of each sample, by calculating the folded site frequency spectrum (SFS) on autosomes. We estimated genotype likelihoods using ANGSD’s GATK (-GL 1) model (doSaf 1), removing bases with quality score lower than 20 (-minQ 20) and reads with mapping quality lower than 30 (-minmapq 30). The dog reference genome was used both as reference and as ancestral (-ref and -anc options). The genome-wide SFS was estimated using the realSFS utility tool provided in ANGSD and subsequently the final heterozygosity was calculated as the ratio of heterozygous sites/total sites. Since most of the individual genomes did not exhibit a high depth of coverage (> 20x), and a recent study has pointed out that low sequencing depth can bias the specific downstream analyses we are interested in (e.g. inbreeding and Ne estimates; (Kardos and Waples, 2024)), we tested the possible correlation between individual heterozygosity and read depth using R v 4.2.1 (Pearson’s correlation test). Subsequently, we tested for possible discrepancies between heterozygosity estimated from genotype likelihood and heterozygosity estimated from variant calling procedures. Nucleotide diversity (π) was estimated on variant sites only, using VCFtools (Danecek et al., 2011). We applied a sliding window approach with a window size of 100 kb (--window-pi 100000), including all the samples for each population.
2.8 Demographic analyses
To understand the demographic trend in the three wolf populations over time, we estimated recent historical Ne in the three wolf populations using the linkage disequilibrium (LD) method as implemented in GONE (Santiago et al., 2020) GONE exploits the genetic distances among SNPs to estimate Ne in more recent generations (relying on the information associated with loosely linked SNPs) and historical Ne (relying on the information associated with tightly linked SNPs), providing reliable estimates of Ne up to 200 generations ago. In GONE, input parameters were set to their default values, except the average recombination rate that was set to 1.3459 CentiMorgans per Megabase, as obtained from (Auton et al., 2013) (CanFam 3.1). We considered generation length in wolves to be 3-4 years (Gray et al., 2009; Mech et al., 2016) and ran the analysis on a maximum of 50,000 SNPs per chromosome (maximum value accepted by GONE) with no additional MAF filtering. Analyses in GONE were repeated 20 times to obtain empirical confidence intervals. Although this method may not fully capture the uncertainty introduced by the sampling process (as, for instance, methods based on jackknifing), especially given our reduced sample size, it is able to provide an indication of the uncertainty around our Ne estimates in GONE (Santiago et al., 2020). Point estimates were summarized using the geometric mean across the values obtained in each replicate. We also used the software currentNe, which is more accurate than GONE for contemporary Ne (i.e. last 2-3 generations) even with small sample sizes (Santiago et al., 2024). CurrentNe produces confidence intervals for Ne based on artificial neural networks without the need for iterating the analysis. We subsampled the datasets for WIT, WUS and WSC to 50,000 random SNPs prior to the analyses in current Ne, and used the Ne estimate obtained only based on LD between chromosomes. To understand how the pedigree of the sampled individuals would affect the Ne estimation, we carried out the analyses by either including or excluding highly-related individuals, under the expectation that in a random sample, the frequency of related individuals is a determinant of the genetic drift signal in the population and therefore, the exclusion of putative relatives from the analyses can upwardly bias the Ne estimates (Waples, 2024; Waples and Anderson, 2017).
2.9 Inbreeding analyses
We identified runs of homozygosity (ROH), indicative of putative identity-by-descent chromosome segments, in WIT, WSC, and WUS whole-genome data using the window-based approach implemented in PLINK v 1.90b6.21 (--homozyg). We employed a sliding window of 100 kb (--homozyg-kb 100), a threshold that has found favor in population genetics studies of non-model species (Hasselgren et al., 2021; Robinson et al., 2019; Sánchez‐Barreiro et al., 2021). For the remaining parameters, we kept PLINK default values. A minimum of 100 SNPs (--homozyg-snp 100) at a minimum density of 1 SNP per 50 kb was required to call a ROH (--homozyg-density 50). To account for genotyping errors, we allowed up to 1 heterozygous site per 1000 kb window within called ROHs (--homozyg-window-het 1), as per (Ceballos et al., 2018a), and 5 missing calls per 1000 kb window within called ROHs (--homozyg-window-missing 5). A length of 1000 kb between two SNPs was required in order for them to be considered in two different ROHs (--homozyg-gap 1000). We calculated the total length of all ROHs for each individual (SROH) and plotted it on the ‘x’ axis against the number of ROHs for each individual (NROH) on the ‘y’ axis. We estimated the inbreeding coefficient based on ROH (FROH) as the ratio of SROH to the total length of the autosomal genome covered by SNP positions (herein 2,222,501,653 bp) for each individual. We plotted SROH accounting for different ROH sizes (‘short’: 100 kb < ROH < 1 Mb; ‘intermediate’: 1 Mb < ROH < 10 Mb; ‘long’: ROH > 10 Mb), to highlight the FROH components. Finally, we calculated ROHs coalescence time aiming to define the time at which the ROHs were likely formed. To do so, we used the formula L = 100/2t cM (Thompson, 2013) where L is the length of the ROH, cM is the recombination rate and t is the time of coalescence in generations. If demographic analyses revealed some bottleneck, we checked the impact of the ROHs that have been formed during and immediately after the bottleneck (5 generations later) on the population SROH.
2.10 Genetic load analyses
We used publicly available short-read data from two outgroups (OUT), Canis lupaster (African wolf, SRA accession: SAMN10199001) and Canis aureus (Golden Jackal, SRA accession: SAMN10180427) to polarize SNPs, defining the 'ancestral’ and ‘derived’ states for each variant. To do so, we used est-sfs (Keightley and Jackson, 2018), a tool implementing a maximum likelihood method to infer the unfolded site frequency spectrum (the uSFS) and the ancestral state probabilities for our OUT data. Using a custom script, we obtained the est-sfs input file and extracted the ancestral state with the highest probability for each variant from the output file. If two bases had the same probability of being ancestral, the script randomly picked one of the two. We used the domestic dog genome annotation gtf, cDS and protein files (CanFam3.1; Ensembl release 104 https://may2021.archive.ensembl.org/Canis_lupus_familiaris/Info/Index) to build custom snpEff v.4.3.183 (Cingolani et al., 2012) and SIFT4G v.6.084 (Vaser et al., 2016) databases with default settings. We then annotated and predicted the effects of variants using the aforementioned tools. Following annotation, we retained variants where the derived state matched the alternative allele in the dog reference genome. This ensured that the derived alleles had the deleterious effects indicated for the alternative alleles of the dog reference genome. We used classified putatively deleterious variants into three categories: (i) Low-impact (LOW) variants likely to be not deleterious (i.e., synonymous), (ii) Moderate-impact (MODERATE) variants likely to modify protein effectiveness (i.e., nonsynonymous), and (iii) High-impact (HIGH) variants likely to disrupt protein function (i.e. loss of function LoF, stop codons, splice donor variant and splice acceptor, or start codon lost) (56). We used SIFT score to discriminate MODERATE in tolerated nonsynonymous (SIFT score ≥0.05) (MOD-TOL) and putatively deleterious nonsynonymous (SIFT score <0.05) (MOD-DEL) variants. Then we used a custom script to estimate individual allele frequencies and genotype counts of LOW, MOD-TOL, MOD-DEL and HIGH impact variant derived alleles, for each wolf in the dataset. In particular, the count of heterozygous genotypes represented the ‘masked’ load, which quantifies the potential loss of fitness due to (partially) recessive deleterious mutations that may become expressed in future generations. The count of homozygous genotypes for the derived alleles represented the ‘realized’ load, which reduces fitness in the current generation; the sum of ‘masked’ and ‘realized’ load represented the ‘total’ load (Bertorelle et al., 2022).
he ‘total’ load.