Surviving the squeeze: Genomic analysis of a successful invasion by European common wall lizards (Podarcis muralis) in North America (Ohio, USA)
Data files
Apr 02, 2026 version files 9.48 GB
-
Analysis_Scripts_Data_2.zip
24.97 MB
-
LD.filtered.SNPs.2.pruned.vcf.gz
355.61 MB
-
Podarcis.filtered.final.SNPs.vcf.gz
9.10 GB
-
README.md
11.43 KB
Abstract
Invasive species that undergo a founder event may experience a decline in genetic diversity yet still establish successful populations despite the potential loss of adaptive variation and an increase in deleterious variation. This widely observed phenomenon is termed the genetic paradox of invasive species. A possible example is the common wall lizard (Podarcis muralis) in Cincinnati, Ohio, USA a successful invasive reptile founded by the introduction of a small number of individuals from Italy. In this study, we evaluated whether this introduction is an example of the genetic paradox of invasive species. Specifically, we examined the origin, demographic history, population structure and genomic diversity of the invasive common wall lizard using whole genome sequences of 35 individuals from the native and introduced locations across multiple time points. Phylogenetic analyses confirm that the invasive lizards in Ohio are likely derived from a population of Podarcis muralis in Northern Italy. We then show that: 1) Analyses of short-term effective population size (Ne) and measures of inbreeding (FROH) demonstrate that introduced lizards in Cincinnati went through a short-term bottleneck but rapidly increased in population size thus potentially minimizing losses in genetic diversity; 2) Direct comparisons of estimates of neutral genetic diversity (PCA, Hobs, FST, and π) between source and introduced populations show that while populations in Cincinnati represent a subset of native genetic variation they show minimal losses of genetic diversity; 3) Comparisons of deleterious mutation load between native and introduced populations show only small increases in load in introduced populations; 4) Tests for selection based on outlier analyses detect potential positive selection in regions of the genome of introduced individuals that contain protein coding genes that could represent cases of rapid adaptation to a new environment. Overall, we do not find evidence for a genetically depauperate introduced population with high levels of genetic load in Cincinnati, which are key criterion for the genetic paradox of invasive species. Rather we find rapid population growth likely minimized genetic diversity loss, possibly enabling genetic adaptation to novel aspects of the environment in Cincinnati. This demographic resilience likely facilitated the retention of standing genetic variation and minimized the accumulation of genetic load, allowing the population to remain viable and increase in size. These findings highlight how the interplay between demography, genetic variation, and ecological compatibility can facilitate the success of introduced species, even in the face of strong genetic constraints. Further our study underscores the importance of integrating genomic, demographic, and ecological perspectives to fully understand invasion success.
The following scripts & data provide information regarding three datasets
Dataset A - Podarcis.filtered.final.SNPs.vcf.gz - GATK called variant call file that has been compressed
Dataset B - LD.filtered.SNPs.2.pruned.vcf.gz - LD filtered version of Dataset A, GATK called variant call file that has been compressed
Dataset 3 - multisample_filter2.all-sites.vcf.gz - Combined all-sites variant call file that has been compressed - available here: https://buckeyemailosu-my.sharepoint.com/:u:/g/personal/bode_52_buckeyemail_osu_edu/IQDKq4Qm6Ym9TZ5rQsnyrEo6AUG4MIT5OhogJQMVxUjmXu8?e=p2prQk
Raw reads available on NCBI's Short Read Archive Database under BioProject Accession No PRJNA1284928
The reference genome used was: GCF_004329235.1_PodMur_1.0_genomic.fna found on NCBI's GenBank
Files for analysis and Scripts used are bundled within a single .zip file called "Analysis_Scripts_Data_2.zip".
Below is description of the contents of each folder within archive.
Descriptions
The below variables can be found in Datasets A, B, and C in the header of each file:
- CHROM
- Chromosome or scaffold identifier from the reference genome
- POS
- Genomic position of the variant (1-based coordinate)
- ID
- Variant identifier (e.g., dbSNP ID) or “.” if not assigned
- REF
- Reference allele at the given position
- ALT
- Alternate (variant) allele(s)
- QUAL
- Phred-scaled quality score of the variant call
- FILTER
- Indicates whether the variant passed filtering (PASS) or which filter(s) it failed
- INFO
- Summary statistics and annotations describing the variant across all samples
- FORMAT
- Defines the format of genotype data for each sample
- GT
- Genotype (e.g., 0/0 = homozygous reference, 0/1 = heterozygous, 1/1 = homozygous alternate)
- AD
- Number of reads supporting REF and ALT alleles
- DP
- Read depth for the individual at this position
- GQ
- Genotype quality (confidence in genotype call)
Samples and Sample Designation/Population Below
| Italy 2009 | Cincinnati 2007 | Cincinnati 2022 | Columbus 2021 |
|---|---|---|---|
| Pm8410 | Pm6047 | Pm0001 | Pm203_Apr1 |
| Pm0004 | Pm6048 | Pm8401 | Pm204_Apr1 |
| Pm0006 | Pm6049 | Pm8402 | Pm205_Feb1 |
| Pm005_Apr1 | Pm6050_Tor07 | Pm8403 | Pm210 |
| Pm8411 | Pm6051_Tor07 | Pm8404 | Pm0211 |
| Pm8417 | Pm6593 | Pm8405 | |
| Pm8419 | Pm6594 | Pm8406 | |
| PmIta3 | Pm6595 | Pm8407 | |
| Pm8418 | Pm6952 | Pm8408 | |
| Pm6982 | Pm8409 |
The above samples can be found in all below files for analysis
Usage Notes
Datasets A, B, and C can be viewed directly in a text editor or processed using tools like bcftools or VCFtools found in a computers shell system or by clustering computer system. Common operations include filtering variants by quality or allele frequency, extracting specific fields (e.g., chromosome, position, genotype), or converting to other formats such as TSV or PLINK for downstream analysis. VCF files can also be used as input for population genetics analyses, functional annotation with tools like SnpEff, or variant prioritization workflows. Below scripts use a combination of Linux/Unix commands, python scripts, and R/RStudio. Scripts ending in .slurm or .sh are Linux/Unix scripts and will need to be run on a compatible software (ie. MacOS) or computer cluster. Scripts ending in .py are python scripts and will need python installed to run, use of JupyterNotebook and Anaconda is suggested. Scripts ending in .R are scripts written in R and require R and/or Rstudio to run. All scripts under R Code require RStudio to visualize plots and varying packages for plot customization.
Table of contents (main folders)
Data -
Scripts -
R_Code_ -
data/
- SNeP - folder of SNeP outputs of Ne.
- plain text files (.txt) that are viewable in Notepad, excel, etc. .NeAll is the standard output for the program SNeP. Each row typically represents a distance bin between SNPs and includes the following variables: GenAgo (number of generations ago), Ne (estimated effective population size at that time point), dist (average physical distance between SNP pairs), r2 (mean linkage disequilibrium value for the bin), r2SD (standard deviation of r² values), and items (number of SNP pairs used in the calculation for that bin).
- Het - folder of VCFTools outputs for Het
- plain text files (.Het) or tab-deleminated text (.log) that can be viewed in Notepad, BBEdit, R. Each row corresponds to an individual and includes the following variables: FID (family ID), IID (individual ID), O(HOM) (observed number of homozygous sites), E(HOM) (expected number of homozygous sites under Hardy–Weinberg equilibrium), N(NM) (number of non-missing genotyped sites), and F (inbreeding coefficient, calculated as the deviation of observed from expected homozygosity).
- qc_flagstat.zip - folder of QC analysis from FastQC (raw) and Fastp (trimmed).
- plain text files (.txt) as well as .html and .json files that can be viewed in Notepad, the internet and any image viewer. FastQC reports contain information on sequence quality scores, GC content, duplication levels per individual. Fastp reports contain information on filtering statistics, read quality before/after trimming per individual.
- Lof mutations - folder of data used to calculate Lof allele frequency
- plain text files (.frq) that can be viewed in Notepad or excel. As some files do not include headers, columns are interpreted as follows based on data structure: Column 1 (chromosome or scaffold ID, e.g., NC_041312.1), Column 2 (genomic position in base pairs), Column 3 (reference allele count), Column 4 (alternate allele count), Column 5 (reference allele frequency), and Column 6 (alternate allele frequency).
- ROH - folder of plink ROH outputs
- plain text files (.hom) that are readable in excel or Notepad. Each row represents one ROH segment and includes the following variables: FID (family ID), IID (individual ID), PHE (phenotype, often -9 if missing), CHR (chromosome), SNP1/SNP2 (start and end SNP identifiers), POS1/POS2 (start and end genomic positions in base pairs), KB (segment length in kilobases), NSNP (number of SNPs in the segment), DENSITY (average SNP spacing, kb per SNP), and PHOM/PHET (proportion of homozygous and heterozygous calls within the segment, respectively).
- multisample_filter2_10kb_pi.txt.zip - compressed file of Nuc Div calc
- Compressed txt file (.txt) that can be viewed in excel, Notepad, R, etc. Each row corresponds to a genomic window and includes: pop (population identifier), chromosome (chromosome or scaffold ID), window_pos_1 and window_pos_2 (start and end positions of the window in base pairs), avg_pi (average nucleotide diversity within the window), no_sites (number of sites analyzed in the window), count_diffs (total number of pairwise nucleotide differences observed), count_comparisons (number of pairwise comparisons made), and count_missing (number of missing sites within the window).
- FST_10kb_pairwise_mean.txt summary of FST calcs for mean per population comparison
- Plain text file (.txt) that can be viewable in R, Notepad, etc. Each row represents a pairwise population comparison and includes: pop1 and pop2 (the two populations being compared), comparison (label or identifier for the pairwise comparison), mean_fst (average FST value across all windows), and n_windows (number of genomic windows included in the calculation).
- outliers - folder of data used in FST outlier analysis
- Contains excel and (.tsv) files that can be viewed in excel, R, or Notepad
- For the TableS3_outlier functions each row typically represents a genomic window and includes: chromosome/chr (chromosome or scaffold ID), window_pos_1 and window_pos_2 (start and end positions of the window in base pairs), avg_wc_fst (average Weir and Cockerham FST for the window), no_snps (number of SNPs in the window), GeneID (gene identifier overlapping or nearest the window), gene_name (common gene name, if available), GO_ID (Gene Ontology identifier), name (GO term name), namespace (GO category, e.g., biological process, molecular function, cellular component), and gene_start/gene_end (genomic coordinates of the annotated gene).
- For TableS4_GO_results each row represents an enriched term and includes: ID (GO term identifier), Description (GO term name), GeneRatio (proportion of input genes associated with the term), BgRatio (proportion of background genes associated with the term), RichFactor (ratio of GeneRatio to BgRatio), FoldEnrichment (degree of overrepresentation relative to background), zScore (standardized enrichment score), pvalue (raw statistical significance), p.adjust (multiple testing–corrected p-value), qvalue (false discovery rate), geneID (list of genes contributing to the enrichment signal), and Count (number of genes associated with the term).
- Contains excel and (.tsv) files that can be viewed in excel, R, or Notepad
scripts/
- 01_trimgalore_template.sh - trimming genome data
- 02_alignmnet_template.slurm - mapping genome data
- 03_haplotypecaller_template.slurm - SNP calling
- 04_Podarcis_genotyping.slum - Genotyping
- 05_Podarcis_filtering.slurm - Filtering
- 06_bqrs_template.slurm - quality score check
- 07_haplotypecaller_template.slurm - SNP calling
- 08_Podarcis_genotyping.slurm - Genotyping
- 09_LD.filter.sh - LD filtering for variant-only vcf file
- 10_align_samples_array.sh - mapping genome data for all-sites vcf
- 11_call_allsites.sh - SNP calling for all-sites VCF and combination of all-sites VCF with variants
- 12_filter1.sh - All-sites filtering for missingness
- 13_filter2.script - All-sites filtering for depth
- busco_step1.sh - Script to extract Busco scores
- Busco_phylogenomics.py - Script to run Busco phylogenomics python script
- 15_SNeP_script.sh - Script to calculate ld-based effective population size (Ne) estimates
- Podarcis_PCA_Code.R - Rscript to run PCA on VCF
- Roh_step1.sh - Script to calculate ROH using Plink
- Roh_step2.py - Script to calculate Froh per population
- 18_Nuc_Div.sh - Script to calculate genome wide estimates of Nuc Diversiy using all-sites vcf
- 19_Fst_calc.sh - Script to calculate population estimates of Fst using all-sites vcf
- 20_Het.sh - Script to calculate heterozygosity per population using VCFTools
- 21_Genetic_Load_calcs.sh - SNPeff script to annotate LoF mutations
- outliers.R - script to find FST outliers
R_Code_/
- Plot_SNeP.R - script to plot Ne outputs from SNeP
- Plot ROH.R - script that plots FROH mean per population ROH length by genome length
- Hetero.R - script that plots heterozygosity outputs
- Genetic Load Plots.Rmd - script that plots Lof frequency
- Plot_Nuc_Div copy.R - script that plots Nuc Div
- Plot_Config.R - script that plots manhattan figure of FST outliers
