Data from: Using serially collected specimens to investigate the potential population genetic consequences of reported declines in eastern woodland salamanders
Data files
Nov 19, 2025 version files 7.90 MB
- 0_annotate_trans.py (2.12 KB)
- 1_IDexon_multi_genelist_same_names.py (2.14 KB)
- 1_IDexon_single_genelist.py (1.74 KB)
- 2.1_splice_ref_exons.py (3.06 KB)
- 3_blast_Transcripts_to_Ref.py (1.61 KB)
- 4_filter_blastout_splice_CDS.py (8.98 KB)
- 6.1_splice_immune_transcripts.py (1.60 KB)
- bootstrap_genetic_diversity_pleth.py (10.02 KB)
- bootstrap_populations.py (11.98 KB)
- calc_ref_aln_length.py (272 B)
- calc_summary_stats_bootstrap.py (12.04 KB)
- concatenate_fasta.py (3.02 KB)
- create_sp_consensus.py (3.33 KB)
- fill_gaps_consensus_pleth_ref.py (1.61 KB)
- fill_gaps_consensus.py (1.36 KB)
- generate_consensus_align_from_phased_pleth_v1.py (5.18 KB)
- generate_consensus_align_from_phased_pleth_v2.py (5.20 KB)
- generate_consensus_align_from_phased.py (5.31 KB)
- README.md (12.62 KB)
- recommended_coverage_summary_table.txt (104.54 KB)
- remove_fails.py (1.16 KB)
- Salamander-baits-all_Recommended_15791.fas (1.99 MB)
- Salamander-baits-all-filtration.txt (2.78 MB)
- Salamander-baits-all.fas (2.05 MB)
- Salamander-input-seq.fas (872.27 KB)
- santity_check.py (1.07 KB)
Abstract
This dataset includes all data and scripts used to investigate demographic shifts in Plethodon salamanders. We designed exonic target loci using custom scripts, generated sequence data via exon-based target capture, and called SNPs using SECAPR v.1.1.15. Here, we provide VCF files from our variant calling pipeline. We calculated summary statistics using the populations module in STACKS and estimated Tajima's D using Python libraries. We provide all scripts used to design exonic capture probes, process genomic data, and calculate summary statistics.
This repository contains data and analysis files for a comprehensive genetic diversity study of six Plethodon salamander species across multiple temporal and geographic sampling strategies.
Project Overview
This study examines genetic diversity patterns in Plethodon salamanders using targeted sequence capture data. The research compares genetic diversity estimates across:
- 6 species: P. cinereus, P. cylindraceus, P. glutinosus, P. montanus, P. welleri, P. yonahlossee
- 2 localities: IGG (Indian Grave Gap) and SG (Skull's Gap)
- 3 sample types/timepoints: FFS (formalin-fixed specimens from the 1960s and 1970s), frozen blood (1980s), and fresh liver (2018/2019)
- 141 total samples across all combinations
Directory Structure
/Final_Baits/
Contains bait design files and coverage statistics for the targeted sequence capture approach.
Key Files:
- recommended_coverage_summary_table.txt: Coverage statistics for all 1,200+ target loci
- Columns: Target name, Bait coverage %, coverage within 100nt/300nt/500nt windows
- Six target sets representing different gene categories:
- Set 1: Immune genes (ILF2, MHC, TLR-2, chemokine receptors)
- Set 2: Phylogenetically informative genes (RAG1, SLC8A3, RHO, etc.)
- Set 3: Developmental genes (OTX2, KLF9, ALDH1A3, dlx2, etc.)
- Set 4: Vision genes (RHO, ARR3, GNAT1/2, GRK1/7, etc.)
- Set 5: Ultraconserved elements (UCEs) - 80 conserved genomic regions
- Set 6: Additional coding genes (~600 genes from various functional categories)
Bait Design Sequence Files:
- Salamander-baits-all.fas: Complete set of all bait sequences in FASTA format
- Salamander-baits-all_Recommended_15791.fas: Final recommended bait set (15,791 baits) after filtering and optimization
- Salamander-baits-all-filtration.txt: Documentation of filtering criteria and statistics applied to bait selection
- Salamander-input-seq.fas: Input sequences used for bait design (target loci sequences)
Bait Design Processing Scripts:
- 0_annotate_trans.py: Initial annotation of transcript sequences
- 1_IDexon_single_genelist.py: Identifies exons for single-gene target lists
- 1_IDexon_multi_genelist_same_names.py: Identifies exons for multi-gene lists with shared nomenclature
- 2.1_splice_ref_exons.py: Splices reference exon sequences for bait design
- 3_blast_Transcripts_to_Ref.py: BLAST alignment of transcripts to reference sequences
- 4_filter_blastout_splice_CDS.py: Filters BLAST output and splices coding sequences
- 6.1_splice_immune_transcripts.py: Specialized processing for immune gene transcripts
/scripts/
Analysis and processing scripts organized by function:
Consensus Sequence Generation:
- generate_consensus_align_from_phased_pleth_v1.py: First version of consensus alignment generation from phased data
- generate_consensus_align_from_phased_pleth_v2.py: Updated version of consensus alignment generation with improved algorithms
- generate_consensus_align_from_phased.py: Consensus alignment generation from phased sequences
- create_sp_consensus.py: Creates species-level consensus sequences
- fill_gaps_consensus.py: Fills gaps in consensus sequences using reference data
- fill_gaps_consensus_pleth_ref.py: Plethodon-specific gap filling using reference sequences
Sequence Processing Utilities:
- concatenate_fasta.py: Concatenates multiple FASTA files into a single output
- calc_ref_aln_length.py: Calculates reference alignment lengths for quality control
- remove_fails.py: Removes failed sequences or low-quality samples from datasets
- santity_check.py: Quality control and validation checks for processed sequences
Bootstrap Analysis (/Bootstrap_genetic_diversity_Pleth/)
- bootstrap_populations.py: Bootstrap resampling for genetic diversity estimation
- calc_summary_stats_bootstrap.py: Statistical calculations for bootstrap results
- Additional supporting scripts for population genetic analyses
Summary Statistics (/Generate_summary_statistics/)
- Scripts for calculating diversity metrics (π, heterozygosity, Fis, etc.)
- Population-level and species-level statistical summaries
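As a point of reference for the metrics listed above, the inbreeding coefficient relates observed and expected heterozygosity as Fis = 1 - Ho/He. The sketch below illustrates this for a single biallelic site. It is not the code used in this repository: the function name and the genotype encoding (0, 1, or 2 copies of the alternate allele) are assumptions made for the example, and STACKS or VCFtools may apply different estimators.

```python
from typing import Sequence, Tuple


def het_and_fis(genotypes: Sequence[int]) -> Tuple[float, float, float]:
    """Return (observed het, expected het, Fis) for one biallelic site.

    genotypes: alternate-allele counts per diploid individual (0, 1 or 2).
    Expected heterozygosity is the simple 2pq estimate, without a
    small-sample correction.
    """
    n = len(genotypes)
    if n == 0:
        raise ValueError("need at least one genotype")
    p = sum(genotypes) / (2 * n)  # alternate-allele frequency
    obs_het = sum(1 for g in genotypes if g == 1) / n
    exp_het = 2 * p * (1 - p)
    if exp_het == 0:
        # Monomorphic site: Fis is undefined.
        return obs_het, exp_het, float("nan")
    return obs_het, exp_het, 1 - obs_het / exp_het
```

A population with no heterozygotes at a polymorphic site gives Fis = 1, while genotype counts matching Hardy-Weinberg proportions give Fis = 0.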
VCF Processing (/vcf_processing/)
- VCF file manipulation and quality filtering scripts
- Format conversion utilities
/summary_statistics/
Contains all genetic diversity summary statistics files:
File Naming Convention:
{dataset}_{snp_type}_{site_type}_{num_loci}_{num_samples}sumstats_combined_all_samples.txt
Dataset Types:
- allreads: All sequencing reads included
- downsamp: Downsampled to equal coverage depth
SNP Types:
- allsnps: All SNPs per locus
- singlesnps: One SNP per locus (to reduce linkage)
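One way to derive a one-SNP-per-locus file is to keep a single record per CHROM value, assuming each target locus is its own reference sequence in the VCF. The sketch below keeps the first SNP encountered per locus. How the singlesnps files in this dataset were actually thinned (first, random, or otherwise) is not stated here, so this is only an illustration, and the function name is hypothetical.

```python
from typing import Iterable, Iterator


def first_snp_per_locus(vcf_lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines unchanged and only the first record per CHROM."""
    seen = set()
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep meta-information and column header lines
            continue
        if not line.strip():
            continue  # skip blank lines
        chrom = line.split("\t", 1)[0]  # locus name is the first column
        if chrom not in seen:
            seen.add(chrom)
            yield line
```

Because the function works on any iterable of lines, it can be applied to an open file handle and its output written directly with `writelines`.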
Site Types:
- allsites: All targeted sites
- coding: Coding regions only
- noncoding: Non-coding regions only
Summary Statistics Columns:
- Population: Species_Locality_Timepoint combination
- Species: Salamander species name
- Locality: IGG or SG
- Replicate: Sample type (Blood/FFS/Fresh)
- Sites: Number of loci analyzed
- Variant_Sites: Sites with variation
- Polymorphic_Sites: Polymorphic sites within the population
- %Polymorphic_Loci: Percentage of polymorphic loci
- Num_Indv: Number of individuals
- Pi: Nucleotide diversity (π)
- Obs_Het: Observed heterozygosity
- Exp_Het: Expected heterozygosity
- Fis: Inbreeding coefficient
- Various variance and standard error estimates
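A minimal sketch of loading one of these tables with pandas and comparing sample types within a species and locality. It assumes the files are tab-delimited and use the column names listed above. The rows and the Population label format in the example are made-up placeholders that only make the snippet runnable; they are not results from this dataset.

```python
import io

import pandas as pd

# Placeholder rows (NOT real results); column names follow the list above.
example = io.StringIO(
    "Population\tSpecies\tLocality\tReplicate\tNum_Indv\tPi\n"
    "cinereus_IGG_FFS\tcinereus\tIGG\tFFS\t5\t0.10\n"
    "cinereus_IGG_Fresh\tcinereus\tIGG\tFresh\t6\t0.20\n"
    "cinereus_SG_Fresh\tcinereus\tSG\tFresh\t4\t0.30\n"
)

# For a real file, pass its path instead of the StringIO object.
stats = pd.read_csv(example, sep="\t")

# One row per species/locality, one column per sample type.
pi_by_type = stats.pivot_table(
    index=["Species", "Locality"], columns="Replicate", values="Pi"
)
print(pi_by_type)
```

Combinations that were not sampled appear as missing values in the pivoted table, which makes gaps in the sampling design easy to spot.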
/vcfs/
Variant Call Format (VCF) files containing SNP data:
File Naming:
{dataset}_{snp_type}_{site_type}_{num_loci}_{num_samples}.vcf
Key Files:
- allreads_allsnps_allsites_31049_141.vcf: Complete dataset (31,049 loci, 141 samples)
- downsamp_*: Coverage-standardized versions
- Coding vs. non-coding region subsets
- Single SNP per locus versions for population genetic analyses
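The naming convention above can be parsed programmatically, for example to select files by dataset or site type. The following is a small sketch; the function and pattern names are hypothetical, and it assumes that none of the first three fields contains an underscore.

```python
import re

# Pattern for {dataset}_{snp_type}_{site_type}_{num_loci}_{num_samples}.vcf
VCF_NAME = re.compile(
    r"^(?P<dataset>[^_]+)_(?P<snp_type>[^_]+)_(?P<site_type>[^_]+)"
    r"_(?P<num_loci>\d+)_(?P<num_samples>\d+)\.vcf$"
)


def parse_vcf_name(filename: str) -> dict:
    """Split a VCF file name into its naming-convention fields."""
    match = VCF_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected VCF file name: {filename}")
    fields = dict(match.groupdict())
    # Convert the two numeric fields from text to integers.
    fields["num_loci"] = int(fields["num_loci"])
    fields["num_samples"] = int(fields["num_samples"])
    return fields


print(parse_vcf_name("allreads_allsnps_allsites_31049_141.vcf"))
```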
Sample Information
Species Abbreviations:
- cinereus: Plethodon cinereus (red-backed salamander)
- cylindraceus: Plethodon cylindraceus (white-spotted slimy salamander)
- glutinosus: Plethodon glutinosus (northern slimy salamander)
- montanus: Plethodon montanus (mountain salamander)
- welleri: Plethodon welleri (Weller's salamander)
- yonahlossee: Plethodon yonahlossee (Yonahlossee salamander)
Locality Codes:
- IGG: Indian Grave Gap
- SG: Skull's Gap
Sample Types:
- Blood: Recently collected blood samples (DNA extracted from fresh blood)
- FFS: Formalin-fixed specimens (museum specimens preserved in formalin)
- Fresh: Fresh tissue samples (recently collected tissue samples)
Key Results Files
- populations.tsv: Master sample metadata linking sample IDs to species, locality, and timepoint
- combined_sumstats_*.txt: Aggregated results across all analyses
- *_bootstrap.txt: Bootstrap confidence intervals for diversity estimates
- *_summary.txt: Final summary statistics and comparisons
Analysis Pipeline
- Raw sequence processing: Quality filtering and adapter removal
- Target enrichment: Alignment to reference target sequences
- Variant calling: SNP identification with SECAPR v.1.1.15, with variants stored in VCF format
- Quality filtering: Coverage and genotype quality thresholds
- Population genetics: Diversity calculation using VCFtools/custom scripts
- Bootstrap analysis: Resampling for confidence intervals
- Comparative analysis: Temporal and geographic comparisons
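The bootstrap step in the pipeline above can be illustrated with a percentile confidence interval: individuals are resampled with replacement, the statistic is recomputed for each replicate, and the interval is read from the quantiles of the replicate values. This is a generic sketch rather than the logic of bootstrap_populations.py; the function name, the default number of replicates, and the use of a per-individual statistic are all assumptions.

```python
import numpy as np


def bootstrap_ci(values, stat=np.mean, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for `stat` over per-individual values.

    Returns (point estimate, lower bound, upper bound).
    """
    values = np.asarray(values, dtype=float)
    if values.size == 0:
        raise ValueError("values must not be empty")
    rng = np.random.default_rng(seed)
    # Each row of `idx` is one bootstrap replicate of individual indices.
    idx = rng.integers(0, values.size, size=(n_boot, values.size))
    replicates = np.array([stat(values[row]) for row in idx])
    lower, upper = np.quantile(replicates, [alpha / 2, 1 - alpha / 2])
    return float(stat(values)), float(lower), float(upper)
```

Fixing the seed makes the interval reproducible between runs. With small per-population sample sizes, percentile intervals can be narrow or unstable, so they are best read alongside Num_Indv.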
Data Specifications
- Sequencing platform: Illumina paired-end sequencing
- Target enrichment: Custom bait design (1,200+ loci)
- Coverage: Variable across samples (standardized in downsampled analyses)
- Reference: Custom reference sequences for each target locus
- Quality thresholds: Minimum coverage and genotype quality filters applied
Code/Software
Required Software for Data Analysis
Core Bioinformatics Tools:
- VCFtools (v0.1.16+): Primary tool for VCF file manipulation and population genetic calculations
- Used for: SNP filtering, summary statistics calculation, format conversion
- Installation: conda install -c bioconda vcftools (or build from source)
- Key functions: --site-pi, --het, --hardy, --freq
- BCFtools (v1.9+): VCF file processing and variant calling
- Used for: VCF manipulation, filtering, and format conversion
- Installation: conda install -c bioconda bcftools
- Python (v3.7+) with required packages:
- pandas (v1.3+): Data manipulation and analysis
- numpy (v1.20+): Numerical computations
- subprocess: System command execution
- glob: File pattern matching
- argparse: Command-line argument parsing
Data Viewing and Analysis:
- R (v4.0+) for statistical analysis and visualization:
- ggplot2: Data visualization
- dplyr: Data manipulation
- readr: File I/O
- tidyr: Data tidying
- Text Editor/Spreadsheet Software:
- Any text editor for viewing .txt and .tsv files
- Excel, LibreOffice Calc, or R/Python for .txt summary statistics
- Command line tools: less, head, tail for large files
Software Workflow
1. VCF File Analysis:
# View VCF structure
bcftools view -h input.vcf | head -50
# Calculate population statistics
vcftools --vcf input.vcf --keep population_list.txt --site-pi --out results
2. Summary Statistics Processing:
# Python scripts for bootstrap analysis
python bootstrap_populations.py --vcf input.vcf --populations populations.tsv
python calc_summary_stats_bootstrap.py --input bootstrap_results/
3. Data Visualization:
# R scripts for plotting results
library(ggplot2)
library(dplyr)
data <- read.table("summary_statistics.txt", header=TRUE, sep="\t")
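For the VCFtools step in the workflow above, `--site-pi` with `--out results` writes a tab-delimited `results.sites.pi` file with the columns CHROM, POS, and PI. The sketch below averages per-site values within each locus. The function name and the demo values are placeholders, and the result is a mean over the sites present in the VCF only; a per-base π would also need the number of invariant sites.

```python
import io

import pandas as pd


def mean_pi_per_locus(sites_pi) -> pd.Series:
    """Average per-site PI within each CHROM (locus) of a .sites.pi table."""
    table = pd.read_csv(sites_pi, sep="\t")
    return table.groupby("CHROM")["PI"].mean()


# Tiny placeholder table in the .sites.pi layout (values are made up).
demo = io.StringIO(
    "CHROM\tPOS\tPI\n"
    "locus1\t10\t0.2\n"
    "locus1\t25\t0.4\n"
    "locus2\t7\t0.1\n"
)
per_locus = mean_pi_per_locus(demo)
print(per_locus)
```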
Included Scripts
Bootstrap Analysis Scripts:
- bootstrap_populations.py: Performs bootstrap resampling of individuals within populations
- Input: VCF files, population metadata
- Output: Bootstrap replicate summary statistics
- Dependencies: Python 3.7+, pandas, subprocess
- calc_summary_stats_bootstrap.py: Calculates confidence intervals from bootstrap results
- Input: Bootstrap replicate files
- Output: Mean estimates with confidence intervals
- Dependencies: Python 3.7+, pandas, numpy
- bootstrap_genetic_diversity_pleth.py: Creates keep-files, subsamples a VCF by species/replicate/locality, runs summary statistics, and outputs files sorted by species and data type
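A keep-file for VCFtools is a plain text file with one sample ID per line, passed with `--keep`. The sketch below shows one way to write a keep-file per species/locality/sample-type group from a metadata table. It is not taken from the scripts above: the function name, the output file suffix, and the column names (`sample_id`, `species`, `locality`, `sample_type`) are hypothetical and would need to match the actual header of populations.tsv.

```python
import os

import pandas as pd


def write_keep_files(meta: pd.DataFrame, out_dir: str) -> dict:
    """Write one VCFtools keep-file (one sample ID per line) per group.

    Returns a mapping from group label to the path of its keep-file.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = {}
    # Column names below are assumed, not taken from populations.tsv.
    for keys, group in meta.groupby(["species", "locality", "sample_type"]):
        name = "_".join(keys)
        path = os.path.join(out_dir, f"{name}.keep.txt")
        with open(path, "w") as handle:
            handle.write("\n".join(group["sample_id"]) + "\n")
        paths[name] = path
    return paths
```

Each resulting file can then be supplied to `vcftools --vcf input.vcf --keep <file>` to subsample the VCF to that group before computing statistics.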
Alignment Processing Scripts:
- clean_alignments.py: Quality filtering and cleanup of sequence alignments
- concatenate_alignments.py: Combines individual locus alignments for phylogenetic analysis
VCF Processing Scripts:
- Various Python scripts for VCF manipulation, filtering, and format conversion
- Custom functions for population-specific analyses
Installation Instructions
Using Conda (Recommended):
# Create environment with required tools
conda create -n plethodon_analysis -c conda-forge -c bioconda python=3.8 vcftools bcftools pandas numpy
conda activate plethodon_analysis
# Install R packages
conda install -c conda-forge r-base r-ggplot2 r-dplyr r-readr r-tidyr
Alternative Installation:
- VCFtools: Download from https://vcftools.github.io/
- BCFtools: Download from http://samtools.github.io/bcftools/
- Python packages: pip install pandas numpy
- R packages: install.packages(c("ggplot2", "dplyr", "readr", "tidyr"))
File Processing Workflow
- VCF Analysis: Use VCFtools to calculate basic population genetics statistics
- Bootstrap Resampling: Run Python bootstrap scripts for confidence intervals
- Data Aggregation: Combine results across populations and datasets
- Visualization: Use R or Python plotting libraries for figure generation
- Statistical Testing: Employ R for comparative statistical analyses
Citation and Contact
This dataset supports research on temporal genetic diversity patterns in Appalachian salamanders. Please contact the authors for data usage and collaboration inquiries.
File Format Notes
- VCF files: Standard Variant Call Format (v4.2)
- Summary statistics: Tab-delimited text files
- Sample metadata: Tab-delimited with header row
- Coverage tables: Tab-delimited with percentage values
- All coordinate systems are 1-indexed unless otherwise specified
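Since positions are 1-indexed while Python sequences are 0-indexed, looking up a base from a VCF position requires subtracting one. A minimal sketch, with a hypothetical helper name:

```python
def base_at(sequence: str, pos_1based: int) -> str:
    """Return the base at a 1-indexed position of a reference sequence."""
    if not 1 <= pos_1based <= len(sequence):
        raise IndexError(f"position {pos_1based} outside 1..{len(sequence)}")
    return sequence[pos_1based - 1]  # Python strings are 0-indexed


print(base_at("ACGT", 1))
```

The explicit range check avoids Python's negative-index wraparound, which would otherwise silently return the last base for position 0.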
