Surviving the squeeze: Genomic analysis of a successful invasion by European common wall lizards (Podarcis muralis) in North America (Ohio, USA)

Bode, Emily 1 ; Mason, Andrew1; Bolton, Peri1; Petren, Ken2; Gangloff, Eric3; Gibbs, Lisle1

Research facility: The Ohio State University

Published Apr 02, 2026 on Dryad. https://doi.org/10.5061/dryad.4b8gthtq6

Data files

Apr 02, 2026 version files 9.48 GB

Analysis_Scripts_Data_2.zip

24.97 MB
LD.filtered.SNPs.2.pruned.vcf.gz

355.61 MB
Podarcis.filtered.final.SNPs.vcf.gz

9.10 GB
README.md

11.43 KB

Abstract

Invasive species that undergo a founder event may experience a decline in genetic diversity yet still establish successful populations despite the potential loss of adaptive variation and an increase in deleterious variation. This widely observed phenomenon is termed the genetic paradox of invasive species. A possible example is the common wall lizard (Podarcis muralis) in Cincinnati, Ohio, USA a successful invasive reptile founded by the introduction of a small number of individuals from Italy. In this study, we evaluated whether this introduction is an example of the genetic paradox of invasive species. Specifically, we examined the origin, demographic history, population structure and genomic diversity of the invasive common wall lizard using whole genome sequences of 35 individuals from the native and introduced locations across multiple time points. Phylogenetic analyses confirm that the invasive lizards in Ohio are likely derived from a population of Podarcis muralis in Northern Italy. We then show that: 1) Analyses of short-term effective population size (N_e) and measures of inbreeding (F_ROH) demonstrate that introduced lizards in Cincinnati went through a short-term bottleneck but rapidly increased in population size thus potentially minimizing losses in genetic diversity; 2) Direct comparisons of estimates of neutral genetic diversity (PCA, H_obs, F_ST, and π) between source and introduced populations show that while populations in Cincinnati represent a subset of native genetic variation they show minimal losses of genetic diversity; 3) Comparisons of deleterious mutation load between native and introduced populations show only small increases in load in introduced populations; 4) Tests for selection based on outlier analyses detect potential positive selection in regions of the genome of introduced individuals that contain protein coding genes that could represent cases of rapid adaptation to a new environment. Overall, we do not find evidence for a genetically depauperate introduced population with high levels of genetic load in Cincinnati, which are key criterion for the genetic paradox of invasive species. Rather we find rapid population growth likely minimized genetic diversity loss, possibly enabling genetic adaptation to novel aspects of the environment in Cincinnati. This demographic resilience likely facilitated the retention of standing genetic variation and minimized the accumulation of genetic load, allowing the population to remain viable and increase in size. These findings highlight how the interplay between demography, genetic variation, and ecological compatibility can facilitate the success of introduced species, even in the face of strong genetic constraints. Further our study underscores the importance of integrating genomic, demographic, and ecological perspectives to fully understand invasion success.

The following scripts & data provide information regarding three datasets

Dataset A - Podarcis.filtered.final.SNPs.vcf.gz - GATK called variant call file that has been compressed

Dataset B - LD.filtered.SNPs.2.pruned.vcf.gz - LD filtered version of Dataset A, GATK called variant call file that has been compressed

Dataset 3 - multisample_filter2.all-sites.vcf.gz - Combined all-sites variant call file that has been compressed - available here: https://buckeyemailosu-my.sharepoint.com/:u:/g/personal/bode_52_buckeyemail_osu_edu/IQDKq4Qm6Ym9TZ5rQsnyrEo6AUG4MIT5OhogJQMVxUjmXu8?e=p2prQk

Raw reads available on NCBI's Short Read Archive Database under BioProject Accession No PRJNA1284928
The reference genome used was: GCF_004329235.1_PodMur_1.0_genomic.fna found on NCBI's GenBank

Files for analysis and Scripts used are bundled within a single .zip file called "Analysis_Scripts_Data_2.zip".
Below is description of the contents of each folder within archive.

Descriptions

The below variables can be found in Datasets A, B, and C in the header of each file:

CHROM
- Chromosome or scaffold identifier from the reference genome
POS
- Genomic position of the variant (1-based coordinate)
ID
- Variant identifier (e.g., dbSNP ID) or “.” if not assigned
REF
- Reference allele at the given position
ALT
- Alternate (variant) allele(s)
QUAL
- Phred-scaled quality score of the variant call
FILTER
- Indicates whether the variant passed filtering (PASS) or which filter(s) it failed
INFO
- Summary statistics and annotations describing the variant across all samples
FORMAT
- Defines the format of genotype data for each sample
GT
- Genotype (e.g., 0/0 = homozygous reference, 0/1 = heterozygous, 1/1 = homozygous alternate)
AD
- Number of reads supporting REF and ALT alleles
DP
- Read depth for the individual at this position
GQ
- Genotype quality (confidence in genotype call)

Samples and Sample Designation/Population Below

Italy 2009	Cincinnati 2007	Cincinnati 2022	Columbus 2021
Pm8410	Pm6047	Pm0001	Pm203_Apr1
Pm0004	Pm6048	Pm8401	Pm204_Apr1
Pm0006	Pm6049	Pm8402	Pm205_Feb1
Pm005_Apr1	Pm6050_Tor07	Pm8403	Pm210
Pm8411	Pm6051_Tor07	Pm8404	Pm0211
Pm8417	Pm6593	Pm8405
Pm8419	Pm6594	Pm8406
PmIta3	Pm6595	Pm8407
Pm8418	Pm6952	Pm8408
	Pm6982	Pm8409

The above samples can be found in all below files for analysis

Usage Notes

Datasets A, B, and C can be viewed directly in a text editor or processed using tools like bcftools or VCFtools found in a computers shell system or by clustering computer system. Common operations include filtering variants by quality or allele frequency, extracting specific fields (e.g., chromosome, position, genotype), or converting to other formats such as TSV or PLINK for downstream analysis. VCF files can also be used as input for population genetics analyses, functional annotation with tools like SnpEff, or variant prioritization workflows. Below scripts use a combination of Linux/Unix commands, python scripts, and R/RStudio. Scripts ending in .slurm or .sh are Linux/Unix scripts and will need to be run on a compatible software (ie. MacOS) or computer cluster. Scripts ending in .py are python scripts and will need python installed to run, use of JupyterNotebook and Anaconda is suggested. Scripts ending in .R are scripts written in R and require R and/or Rstudio to run. All scripts under R Code require RStudio to visualize plots and varying packages for plot customization.

Table of contents (main folders)

Data -
Scripts -
R_Code_ -

data/

SNeP - folder of SNeP outputs of Ne.
- plain text files (.txt) that are viewable in Notepad, excel, etc. .NeAll is the standard output for the program SNeP. Each row typically represents a distance bin between SNPs and includes the following variables: GenAgo (number of generations ago), Ne (estimated effective population size at that time point), dist (average physical distance between SNP pairs), r2 (mean linkage disequilibrium value for the bin), r2SD (standard deviation of r² values), and items (number of SNP pairs used in the calculation for that bin).
Het - folder of VCFTools outputs for Het
- plain text files (.Het) or tab-deleminated text (.log) that can be viewed in Notepad, BBEdit, R. Each row corresponds to an individual and includes the following variables: FID (family ID), IID (individual ID), O(HOM) (observed number of homozygous sites), E(HOM) (expected number of homozygous sites under Hardy–Weinberg equilibrium), N(NM) (number of non-missing genotyped sites), and F (inbreeding coefficient, calculated as the deviation of observed from expected homozygosity).
qc_flagstat.zip - folder of QC analysis from FastQC (raw) and Fastp (trimmed).
- plain text files (.txt) as well as .html and .json files that can be viewed in Notepad, the internet and any image viewer. FastQC reports contain information on sequence quality scores, GC content, duplication levels per individual. Fastp reports contain information on filtering statistics, read quality before/after trimming per individual.
Lof mutations - folder of data used to calculate Lof allele frequency
- plain text files (.frq) that can be viewed in Notepad or excel. As some files do not include headers, columns are interpreted as follows based on data structure: Column 1 (chromosome or scaffold ID, e.g., NC_041312.1), Column 2 (genomic position in base pairs), Column 3 (reference allele count), Column 4 (alternate allele count), Column 5 (reference allele frequency), and Column 6 (alternate allele frequency).
ROH - folder of plink ROH outputs
- plain text files (.hom) that are readable in excel or Notepad. Each row represents one ROH segment and includes the following variables: FID (family ID), IID (individual ID), PHE (phenotype, often -9 if missing), CHR (chromosome), SNP1/SNP2 (start and end SNP identifiers), POS1/POS2 (start and end genomic positions in base pairs), KB (segment length in kilobases), NSNP (number of SNPs in the segment), DENSITY (average SNP spacing, kb per SNP), and PHOM/PHET (proportion of homozygous and heterozygous calls within the segment, respectively).
multisample_filter2_10kb_pi.txt.zip - compressed file of Nuc Div calc
- Compressed txt file (.txt) that can be viewed in excel, Notepad, R, etc. Each row corresponds to a genomic window and includes: pop (population identifier), chromosome (chromosome or scaffold ID), window_pos_1 and window_pos_2 (start and end positions of the window in base pairs), avg_pi (average nucleotide diversity within the window), no_sites (number of sites analyzed in the window), count_diffs (total number of pairwise nucleotide differences observed), count_comparisons (number of pairwise comparisons made), and count_missing (number of missing sites within the window).
FST_10kb_pairwise_mean.txt summary of FST calcs for mean per population comparison
- Plain text file (.txt) that can be viewable in R, Notepad, etc. Each row represents a pairwise population comparison and includes: pop1 and pop2 (the two populations being compared), comparison (label or identifier for the pairwise comparison), mean_fst (average FST value across all windows), and n_windows (number of genomic windows included in the calculation).
outliers - folder of data used in FST outlier analysis
- Contains excel and (.tsv) files that can be viewed in excel, R, or Notepad
  - For the TableS3_outlier functions each row typically represents a genomic window and includes: chromosome/chr (chromosome or scaffold ID), window_pos_1 and window_pos_2 (start and end positions of the window in base pairs), avg_wc_fst (average Weir and Cockerham FST for the window), no_snps (number of SNPs in the window), GeneID (gene identifier overlapping or nearest the window), gene_name (common gene name, if available), GO_ID (Gene Ontology identifier), name (GO term name), namespace (GO category, e.g., biological process, molecular function, cellular component), and gene_start/gene_end (genomic coordinates of the annotated gene).
  - For TableS4_GO_results each row represents an enriched term and includes: ID (GO term identifier), Description (GO term name), GeneRatio (proportion of input genes associated with the term), BgRatio (proportion of background genes associated with the term), RichFactor (ratio of GeneRatio to BgRatio), FoldEnrichment (degree of overrepresentation relative to background), zScore (standardized enrichment score), pvalue (raw statistical significance), p.adjust (multiple testing–corrected p-value), qvalue (false discovery rate), geneID (list of genes contributing to the enrichment signal), and Count (number of genes associated with the term).

scripts/

01_trimgalore_template.sh - trimming genome data
02_alignmnet_template.slurm - mapping genome data
03_haplotypecaller_template.slurm - SNP calling
04_Podarcis_genotyping.slum - Genotyping
05_Podarcis_filtering.slurm - Filtering
06_bqrs_template.slurm - quality score check
07_haplotypecaller_template.slurm - SNP calling
08_Podarcis_genotyping.slurm - Genotyping
09_LD.filter.sh - LD filtering for variant-only vcf file
10_align_samples_array.sh - mapping genome data for all-sites vcf
11_call_allsites.sh - SNP calling for all-sites VCF and combination of all-sites VCF with variants
12_filter1.sh - All-sites filtering for missingness
13_filter2.script - All-sites filtering for depth
busco_step1.sh - Script to extract Busco scores
Busco_phylogenomics.py - Script to run Busco phylogenomics python script
15_SNeP_script.sh - Script to calculate ld-based effective population size (Ne) estimates
Podarcis_PCA_Code.R - Rscript to run PCA on VCF
Roh_step1.sh - Script to calculate ROH using Plink
Roh_step2.py - Script to calculate Froh per population
18_Nuc_Div.sh - Script to calculate genome wide estimates of Nuc Diversiy using all-sites vcf
19_Fst_calc.sh - Script to calculate population estimates of Fst using all-sites vcf
20_Het.sh - Script to calculate heterozygosity per population using VCFTools
21_Genetic_Load_calcs.sh - SNPeff script to annotate LoF mutations
outliers.R - script to find FST outliers

R_Code_/

Plot_SNeP.R - script to plot Ne outputs from SNeP
Plot ROH.R - script that plots FROH mean per population ROH length by genome length
Hetero.R - script that plots heterozygosity outputs
Genetic Load Plots.Rmd - script that plots Lof frequency
Plot_Nuc_Div copy.R - script that plots Nuc Div
Plot_Config.R - script that plots manhattan figure of FST outliers