Population genetics, trait mapping, and fungal pathogen surveillance using untargeted sequencing in timber rattlesnakes (Crotalus horridus)
Data files
Nov 11, 2025 version files 739.47 MB
-
bracken_files.zip
1.30 MB
-
bracken_run_array_bacteria.sh
1.39 KB
-
bracken_run_array_fungus.sh
1.40 KB
-
GWAS_output.lrt0-2.gz
56.07 MB
-
kraken_run_array_bacteria_horridus.sh
1.18 KB
-
kraken_run_array_fungus_horridus.sh
1.24 KB
-
lifted_tigris_to_adamanteus.gtf_polished
681.78 MB
-
liftoff.sh
1.20 KB
-
pairwise_bray_distances.csv
306.47 KB
-
README.md
6.99 KB
Abstract
Timber rattlesnakes (Crotalus horridus) face escalating threats in the Northeastern Appalachians, including habitat fragmentation, human encroachment, and the fungal pathogen Ophidiomyces ophiodiicola. Using untargeted sequencing of DNA extracted from scale clips, we generated both host whole-genome and metagenomic data for 97 snakes from eight populations. Analysis of the snake genomes shows the populations surveyed exhibit relatively low levels of inbreeding and are genetically distinct, but the degree of separation correlates only weakly with geographic distance. A genome-wide association analysis identified a locus associated with black-to-yellow color variation that contains an aldehyde dehydrogenase gene (ALDH4A1) related to genes involved in hair color differences in humans. Metagenomic analysis showed that O. ophiodiicola read counts were generally higher in snakes exhibiting clinical signs of Snake Fungal Disease, but some visually asymptomatic snakes had high pathogen loads. Together, these findings highlight the dual utility of untargeted sequencing for population genetics and pathogen surveillance, providing a foundation for future studies of adaptation, disease dynamics, and conservation in this declining species.
Dataset DOI: 10.5061/dryad.m37pvmdg9
Description of the data and file structure
This repository contains sequencing output files and reference annotations in the study titled: “Population genetics, trait mapping, and fungal pathogen surveillance using untargeted sequencing in timber rattlesnakes (Crotalus horridus).” It comprises data from scale clipping samples of C. horridus collected at multiple locations throughout the Northeastern United States. A total of 111 bracken output files are included. Additionally, we have included liftoff output GTF used in our GWAS analyses and raw ANGD-assoc GWAS data. The data mainly consist of Bracken output tables generated after running Kraken2 + Bracken with default settings for metagenomic taxonomic classification. These tables show microbial taxa's relative abundance and read assignments at the genus level for each snake sample.
The dataset also features a liftover GTF annotation from Crotalus tigris to Crotalus adamanteus, created to support future genomic or gene-mapping comparisons for GWAS analyses performed in the paper, and identifying the overlapping genes.
Overall, this dataset enables replication and reuse in metagenomic analysis of microbiota and pathogens (including Ophidiomyces ophiodiicola) and future comparative annotation using viper references.
Files and variables
File: lifted_tigris_to_adamanteus.gtf_polished
Description: The file is a cross-species annotation file aligning C. tigris gene coordinates to the C. adamanteus reference assembly.
File: GWAS_output.lrt0-2.gz
Description: Compressed raw output file containing results from genome-wide association analyses performed using the ANGSD-assoc module of ANGSD. Each record reports likelihood ratio test (LRT) statistics, p-values, and associated summary metrics for all tested loci before post-processing or multiple-testing correction. These results represent the unfiltered association output directly generated from the -doAsso step.
File: bracken_files.zip
Description: Zipped directory containing 111 .bracken files generated by Kraken2 + Bracken. Each file corresponds to a unique C. horridus sample. Files include taxonomic assignments at the genus level, estimated read counts, and relative abundances.
File: pairwise_bray_distances.csv
Description: This was generated from the following code:
bray_dist <- vegdist(otu_rel, method = "bray")
bray_mat <- as.matrix(bray_dist)
The first line uses vegdist() to compute pairwise community dissimilarities between samples in otu_rel (relative abundances of the identified taxa), and the second converts that distance object into a full square matrix for easier inspection or downstream analysis.
Code/software
File: liftoff.sh
The script to generate the liftoff runs on an HPC scheduler (SLURM), producing a GTF of lifted features and a GTF listing features that failed to map. This was specifically: Lift over gene annotations from Crotalus tigris to the Crotalus adamanteus reference using Liftoff (genome-to-genome annotation transfer).
GWAS_output.lrt0-2.gz:
The analyses examined the relationships between genotype likelihoods and the animal’s color morph phenotype using an additive genetic model (model 1). We chose this model because additive genetic variance chiefly influences the genetic basis of complex traits like color morphs (Hill et al., 2008). Initially, genotype likelihoods were computed in ANGSD and then refined with BEAGLE to enhance genotype accuracy before testing for associations. The output includes unfiltered likelihood ratio test (LRT) statistics and p-values for all loci considered, prior to any filtering or correction for multiple testing. To address multiple comparisons, a Bonferroni correction established a genome-wide significance threshold based on a family-wise error rate of α = 0.05 and the total loci count (n = 1,159,481). This resulted in a significance cutoff of p < 4.31 × 10⁻⁸.
kraken_run_array_bacteria_horridus.sh, kraken_run_array_fungus_horridus.sh, bracken_run_array_fungus.sh, bracken_run_array_bacteria.sh
These scripts were used for running Kraken2 and then Bracken against a database containing standard bacteria, fungi, and viral sequences. We also included the Ophidiomyces ophiodiicola and human genomes (this is to remove contamination).
References:
Hill, W. G., Goddard, M. E., & Visscher, P. M. (2008). Data and theory point to mainly additive genetic variance for complex traits. PLoS Genetics, 4(2), e1000008. https://doi.org/[10.1371/journal.pgen.1000008](http://dx.doi.org/10.1371/journal.pgen.1000008)
Korneliussen, T. S., Albrechtsen, A., & Nielsen, R. (2014). ANGSD: Analysis of Next Generation Sequencing Data. BMC Bioinformatics, 15(1), 356. https://doi.org/[10.1186/s12859-014-0356-4](http://dx.doi.org/10.1186/s12859-014-0356-4)
Lu, J., Breitwieser, F. P., Thielen, P., & Salzberg, S. L. (2017). Bracken: estimating species abundance in metagenomics data. PeerJ. Computer Science, 3(e104), e104. https://doi.org/[10.7717/peerj-cs.104](http://dx.doi.org/10.7717/peerj-cs.104)
Shumate, A., & Salzberg, S. L. (2021). Liftoff: accurate mapping of gene annotations. Bioinformatics (Oxford, England), 37(12), 1639–1643.
vegan: R package for community ecologists: popular ordination methods, ecological null models & diversity analysis. (n.d.). Github. Retrieved June 2, 2025, from https://github.com/vegandevs/vegan
Wood, D. E., Lu, J., & Langmead, B. (2019). Improved metagenomic analysis with Kraken 2. Genome Biology, 20(1), 257. https://doi.org/[10.1186/s13059-019-1891-0](http://dx.doi.org/10.1186/s13059-019-1891-0)
Access information
Other publicly accessible locations of the data: Additional data will be included in the supplemental data included in our manuscript. These will contain the final outputs from all tools used in the publication.
Data was derived from the following sources: Most data was generated using the ANGSD pipeline and related tools. We used Linux and SLURM for processing the data for job scheduling on the Unity HPC.
ANGSD software website: https://www.popgen.dk/angsd/index.php/ANGSD
HPC used: https://docs.unity.rc.umass.edu
