Data from: Ranavirus epizootics and gut bacteriome dysbiosis in tadpoles: evidence for the Anna Karenina Principle?
Data files
Mar 31, 2026 version, files total 4.20 MB
- 01_AlphaDiversity_gg2_privatefiltered.R (29.88 KB)
- 02_BetaDiversity_taxa_gg2_privatefiltered.R (45.90 KB)
- 03_DiffAbund_LinDA_Taxa_gg2_privatefiltered.R (28.09 KB)
- 04_Picrust2_GetKegg.R (1.12 KB)
- 05_DiffAbund_LinDA_Pathways_PrivateFiltered.R (15.66 KB)
- 06_Blacksmith_Specific_Analyses.R (23.27 KB)
- 07_MiscPlots.R (21.90 KB)
- Analysis_PrivateFiltered_10.28.24.Rproj (267 B)
- gg2_taxonomy.qza (995.63 KB)
- KEGG_table.txt (474.97 KB)
- metadata_for_R.csv (38.06 KB)
- metadata_for_R.txt (38.06 KB)
- rarefied_table-privatefiltered.qza (870.64 KB)
- README.md (15.82 KB)
- rooted-tree.qza (730.03 KB)
- table-privatefiltered.qza (873.82 KB)
Abstract
Host-associated microbial communities (microbiomes) play critical roles in animal health and disease, yet their responses to pathogens under natural conditions remain poorly understood. We investigated gut bacterial community (bacteriome) dynamics in wood frog (Rana sylvatica [Lithobates sylvaticus]) tadpoles during natural ranavirus outbreaks to understand how pathogen-induced disturbances shape microbiome diversity, composition, and function. Using 16S rRNA sequencing, we compared the bacteriomes of tadpoles in ponds experiencing ranavirus-induced die-offs with those from unaffected reference ponds before, during, and after mortality events. Ranavirus infection significantly altered gut bacteriome composition and increased microbiome variability (dispersion), consistent with the Anna Karenina Principle. Tadpoles with high infection intensities exhibited reduced bacterial diversity and pronounced shifts in community structure, characterized by enrichment of specific taxa such as Cetobacterium and Turicibacter, which have been linked previously to antiviral immunity and gut health. Predicted functional analyses revealed shifts toward carbohydrate metabolism pathways during die-offs, suggesting microbial adaptation to altered host physiology under infection stress. Notably, bacteriome disruptions were detectable even before die-offs occurred, highlighting potential early-warning microbiome indicators of infection. In one recovering population post-epizootic, we observed partial recovery of the bacteriome, indicating potential microbial resilience. Our findings demonstrate that ranavirus epizootics profoundly disrupt gut microbiomes in wild amphibian populations while simultaneously eliciting potentially adaptive microbial responses. 
These insights underscore the complex interplay between host immunity, microbiome dynamics, and environmental conditions during disease outbreaks, highlighting opportunities for microbiome-based interventions to support amphibian conservation.
Dataset DOI: 10.5061/dryad.ngf1vhj5h
Description of the data and file structure
This dataset contains R scripts, metadata, and QIIME2 artifacts used for microbiome analyses of tadpole gut microbiota in relation to ranavirus die-off events. All R scripts are designed to be run within the included R project file and rely on the provided data files.
About QIIME2 artifact files (.qza)
Several files in this dataset are QIIME2 artifacts (file extension .qza). QIIME2 (Quantitative Insights Into Microbial Ecology 2) is a free, open-source platform for microbiome bioinformatics (https://qiime2.org). QIIME2 artifacts are standard ZIP-formatted archive files that package microbiome data along with metadata about how the data were generated (called "provenance"). They can be accessed in the following ways:
- QIIME2 software (recommended): Install QIIME2 for free from https://docs.qiime2.org/. Once installed, artifacts can be loaded and manipulated using QIIME2 commands or imported into R using the qiime2R package (as done in the provided R scripts).
- QIIME2 View (no installation required): Upload .qza files to https://view.qiime2.org to visualize and explore their contents directly in a web browser.
- Direct access via any ZIP extraction tool: Because .qza files are ZIP archives, they can be opened with any standard ZIP utility (e.g., 7-Zip, WinZip, or the built-in archive tools in Windows, macOS, or Linux). After extracting, navigate to the data/ folder inside the extracted archive to find the underlying data files. The data format varies by artifact type (see descriptions for each file below).
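The ZIP-based access route can also be scripted. The following is a minimal Python sketch using only the standard library; it is not part of the provided R scripts. The function names are hypothetical, and the code assumes the data/ folder sits either at the top of the archive or one folder below it.

```python
import zipfile


def list_qza_data_files(qza_path):
    """Return the names of files stored under a data/ folder in a .qza archive."""
    members = []
    with zipfile.ZipFile(qza_path) as archive:
        for name in archive.namelist():
            parts = name.split("/")
            # Assumption: data/ is at the archive root or one folder below it.
            # parts[-1] is empty for directory entries, which are skipped.
            if parts[-1] and "data" in parts[:-1][:2]:
                members.append(name)
    return members


def read_qza_member(qza_path, member):
    """Read one text file (e.g., a .tsv or .nwk file) from the archive."""
    with zipfile.ZipFile(qza_path) as archive:
        return archive.read(member).decode("utf-8")
```

For example, calling list_qza_data_files("gg2_taxonomy.qza") should return the path of the taxonomy file inside the archive, which can then be passed to read_qza_member. BIOM files are binary and need one of the BIOM tools described below rather than read_qza_member.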
Files and variables
01_AlphaDiversity_gg2_privatefiltered.R
This script performs alpha diversity analyses. It loads QIIME2 outputs into a phyloseq object in R and calculates three alpha diversity metrics: Faith's phylogenetic diversity (a measure of the total branch length of species present in a sample), observed ASV richness (the count of distinct bacterial sequence variants), and Shannon diversity (a combined measure of richness and evenness). Linear mixed-effects models are used to compare diversity between die-off and healthy ponds, and to examine the effect of viral load on diversity within die-off ponds.
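To make two of these metrics concrete, here is a minimal Python sketch of observed richness and Shannon diversity for a single sample. It is an illustration only, not the code in the R script. The natural logarithm is assumed for Shannon diversity, and the function names are hypothetical.

```python
import math


def observed_richness(counts):
    """Number of ASVs with at least one read in the sample."""
    return sum(1 for c in counts if c > 0)


def shannon_diversity(counts):
    """Shannon index H = -sum(p_i * ln(p_i)) over ASVs present in the sample."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    # Assumption: natural logarithm; other bases only rescale the index.
    return -sum(p * math.log(p) for p in proportions)
```

A sample with two equally abundant ASVs has richness 2 and a Shannon index of ln(2); a sample dominated by one ASV has the same richness but a lower Shannon index, which is how evenness enters the metric.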
02_BetaDiversity_taxa_gg2_privatefiltered.R
This script performs beta diversity analyses, which measure how different the microbial communities are between pairs of samples. It computes distance matrices using Bray-Curtis dissimilarity, weighted UniFrac (which accounts for both phylogenetic relatedness and abundance of bacteria), and unweighted UniFrac (which accounts for phylogenetic relatedness only). Statistical tests include PERMANOVA (to test whether groups of samples differ in their community composition) and beta-dispersion analysis (to test whether groups differ in their variability). NMDS ordination plots are generated to visualize community differences.
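As an illustration of what a pairwise dissimilarity measures, the sketch below computes Bray-Curtis dissimilarity between two samples in plain Python. It is not taken from the R script, and the function name is hypothetical. The two count vectors are assumed to list the same ASVs in the same order.

```python
def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i).

    Returns 0.0 for identical samples and 1.0 for samples sharing no ASVs.
    """
    numerator = sum(abs(a - b) for a, b in zip(counts_a, counts_b))
    denominator = sum(a + b for a, b in zip(counts_a, counts_b))
    # Two empty samples are treated as identical (an assumption of this sketch).
    return numerator / denominator if denominator else 0.0
```

Applying this function to every pair of samples yields a distance matrix of the kind that PERMANOVA, beta-dispersion tests, and NMDS take as input.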
03_DiffAbund_LinDA_Taxa_gg2_privatefiltered.R
This script performs differential abundance analyses of bacterial taxa. It identifies specific bacterial genera whose relative abundances differ significantly between die-off and healthy ponds, or between tadpoles with different viral loads. Differential abundance testing is performed using LinDA (Linear models for Differential Abundance analysis), a method designed for compositional microbiome data.
04_Picrust2_GetKegg.R
This script processes the outputs from PICRUSt2 (Phylogenetic Investigation of Communities by Reconstruction of Unobserved States) to generate KEGG pathway abundances. PICRUSt2 predicts the functional potential of a microbial community based on 16S rRNA gene sequences. This script converts predicted gene family (KO, KEGG Orthology) abundances into higher-level KEGG metabolic pathway abundances.
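The general idea of rolling gene families up into pathways can be sketched as follows. This Python example is an assumption about the simplest possible approach (summing each KO's abundance into every pathway it belongs to) and may differ from what script 04 does. The function name and the identifiers in the mapping are hypothetical.

```python
from collections import defaultdict


def ko_to_pathway_abundance(ko_abundance, ko_to_pathways):
    """Sum KO abundances into pathway abundances for one sample.

    ko_abundance: dict mapping KO identifier -> predicted abundance
    ko_to_pathways: dict mapping KO identifier -> list of pathway identifiers
    """
    totals = defaultdict(float)
    for ko, abundance in ko_abundance.items():
        # A KO that belongs to several pathways contributes to each of them;
        # KOs without a pathway assignment are ignored.
        for pathway in ko_to_pathways.get(ko, []):
            totals[pathway] += abundance
    return dict(totals)
```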
05_DiffAbund_LinDA_Pathways_PrivateFiltered.R
This script performs differential abundance analyses of KEGG metabolic pathways. Using the pathway abundance table generated by script 04, it identifies specific metabolic pathways that are predicted to differ in abundance between die-off and healthy ponds, and between tadpoles with different viral loads, using the LinDA method.
06_Blacksmith_Specific_Analyses.R
This script performs analyses specific to the single pond ("Blacksmith") that was sampled again after recovering from a ranavirus die-off. It includes differential abundance analysis of bacterial taxa across time points (before, during, and after the die-off), differential abundance of predicted metabolic pathways, alpha diversity analysis, and beta diversity analysis.
07_MiscPlots.R
This script makes various miscellaneous plots, such as taxa bar plots showing the relative abundance of bacterial phyla across samples.
Analysis_PrivateFiltered_10.28.24.Rproj
This is the R project file for RStudio. Opening this file in RStudio sets the working directory to the folder containing the data files, which allows all R scripts to locate data files using relative paths. RStudio is free and open-source software available from https://posit.co/downloads/.
metadata_for_R.csv
Tab-delimited file (despite the .csv extension) with data from pond surveys and individual tadpoles. Contains 251 rows (samples) and 33 columns.
Variables
- SampleID: A unique name assigned to each sample in the dataset (format: number_Snumber, e.g., "1_S2")
- sample_alias: A unique numbered alias assigned to each sample in the dataset
- catalog_number: Catalog number of each sample for the Yale Peabody Collection
- ctrl_sample: Whether or not a sample is a control sample (0 = no; 1 = yes)
- pond: Abbreviation for the pond that a sample was collected from (e.g., "a10", "bs", "bo", "lo", "la", "sh", "wf", "aa")
- collection_date: The date that a sample was collected (format: M/DD/YYYY)
- sampling_session: The sampling session for each pond that a sample was collected from; categorical variable with values "first", "second", and "third" for first, second, and third sampling session
- phase: Redundant with sampling_session, but excluding the post-recovery sampling session from a single pond (bs)
- species: Species abbreviation; RASY = Rana sylvatica (wood frog)
- tl_mm: Total length of each individual tadpole (in mm)
- svl_mm: Snout-to-vent length of each individual tadpole (in mm)
- stage_gosner: Gosner developmental stage of each individual tadpole (integer scale from 1-46, where higher values indicate more advanced development)
- rasy_SQ_avg: Average starting quantity of ranavirus DNA from quantitative PCR (qPCR), averaged across duplicate wells (copies/reaction)
- virus_per_ngDNA: Viral load of an individual tadpole, calculated as viral copies per ng of total extracted DNA in the sample (copies/ng DNA)
- log_virus_per_ngDNA: log10(x+1)-transformed viral load of an individual tadpole
- infection_status: Whether or not a tadpole was found to be infected with ranavirus (0 = not infected; 1 = infected; NA = not tested)
- year: The year that a sample was collected
- doy: The numeric day of the year that a sample was collected (1-365)
- start_of_dieoff: Whether or not a specific date was the start of a ranavirus die-off event, defined as the time when the detection of >= 5 amphibian carcasses first occurred (0 = no; 1 = yes)
- dieoff_status: For ponds that experienced a die-off: whether samples are from before ("pre"), during ("mid"), or after ("post") a die-off; blank for ponds that never experienced a die-off
- dieoff_date: Date that die-off started for each pond that experienced one (format: M/DD/YYYY); blank for ponds that did not experience a die-off
- days_from_dieoff: For ponds that experienced a die-off: the number of days before (negative values) or after (positive values) a die-off that a sample was collected; blank for ponds that did not experience a die-off
- post_dieoff: Whether or not samples were collected after a die-off event (0 = no; 1 = yes)
- dieoff_pond: Whether or not a specific pond experienced a ranavirus die-off (0 = no; 1 = yes)
- conductivity: Conductivity of the pond water at sampling (uS/cm)
- tds: Total dissolved solids in the pond at sampling (ppm)
- salinity: Salinity of the pond at sampling (ppm)
- ph: pH of the pond water at sampling (standard pH units, 0-14 scale)
- temp_c: Temperature of the pond water at sampling (degrees Celsius)
- oxygen_percent: Dissolved oxygen saturation of the pond water at sampling (%)
- oxygen_mgl: Dissolved oxygen concentration in the pond water at sampling (mg/L)
- depth_cm: Depth of the pond at deepest point (z-max) at sampling (cm)
- carcasses_detected_1_0: Whether or not amphibian carcasses were detected at a sampling time point (0 = no; 1 = yes)
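Because the file is tab-delimited despite its .csv extension, the delimiter has to be set explicitly when reading it outside the provided scripts. The following Python sketch shows one way to do that, and how two derived variables (log_virus_per_ngDNA and days_from_dieoff) relate to their source columns as described above. The function names are hypothetical and the code is illustrative only.

```python
import csv
import math
from datetime import datetime


def read_metadata(handle):
    """Parse the tab-delimited metadata into a list of dicts, one per sample."""
    return list(csv.DictReader(handle, delimiter="\t"))


def log_viral_load(virus_per_ng_dna):
    """log10(x + 1) transform, as described for log_virus_per_ngDNA."""
    return math.log10(virus_per_ng_dna + 1)


def days_from_dieoff(collection_date, dieoff_date):
    """Days between collection and die-off start (negative = before die-off).

    Both dates are strings in the M/DD/YYYY format used in the metadata.
    """
    date_format = "%m/%d/%Y"
    collected = datetime.strptime(collection_date, date_format)
    dieoff = datetime.strptime(dieoff_date, date_format)
    return (collected - dieoff).days
```

A typical call would open the file with open("metadata_for_R.csv", newline="") and pass the handle to read_metadata. All values are returned as strings, so numeric columns need converting, and blank fields (for example, dieoff_date in ponds without a die-off) should be checked before use.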
metadata_for_R.txt
Tab-delimited file with data from pond surveys and individual tadpoles; contains the same variables as metadata_for_R.csv described above.
gg2_taxonomy.qza
QIIME2 artifact containing taxonomy classifications for amplicon sequence variants (ASVs). ASVs are unique DNA sequences identified from 16S rRNA gene amplicon sequencing, each representing a distinct bacterial type. Taxonomy was assigned to each ASV sequence using a Naive Bayes classifier trained against the GreenGenes2 2022.10 reference database.
To access the data directly: Extract the .qza file as a ZIP archive and navigate to the data/ folder. Inside, you will find a tab-separated file (taxonomy.tsv) with the following columns:
- Feature ID: A unique hash identifier for each ASV
- Taxon: The taxonomic classification string for the ASV, formatted as a semicolon-separated hierarchy (e.g., "d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;...") using abbreviations: d__ = domain, p__ = phylum, c__ = class, o__ = order, f__ = family, g__ = genus, s__ = species
- Confidence: The confidence score (0 to 1) of the taxonomic assignment from the Naive Bayes classifier
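The Taxon string can be split into named ranks with a few lines of code. The Python sketch below is a minimal example based on the rank abbreviations listed above. The function name is hypothetical, and the code assumes that a rank with an empty name (e.g., "g__") means the ASV was not classified at that level.

```python
RANK_PREFIXES = {
    "d": "domain", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def parse_taxon(taxon):
    """Split a semicolon-separated taxonomy string into a rank -> name dict."""
    ranks = {}
    for field in taxon.split(";"):
        prefix, separator, name = field.strip().partition("__")
        # Skip unknown prefixes and ranks left empty (unclassified levels).
        if separator and prefix in RANK_PREFIXES and name:
            ranks[RANK_PREFIXES[prefix]] = name
    return ranks
```

For the example string above, parse_taxon returns a dictionary with the keys "domain", "phylum", and "class"; ranks that are missing or empty are omitted.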
KEGG_table.txt
Tab-delimited table of KEGG (Kyoto Encyclopedia of Genes and Genomes) metabolic pathway abundances per sample, predicted by PICRUSt2. KEGG pathways represent sets of genes that work together to carry out specific metabolic functions. Contains 266 pathways (rows) across 240 samples (columns).
Variables
- pathway: KEGG pathway identifier (e.g., "ko00564"). Each identifier corresponds to a specific metabolic pathway in the KEGG database (https://www.genome.jp/kegg/pathway.html). Pathway descriptions can be looked up by searching the identifier at https://www.genome.jp/kegg/.
- Sample columns (e.g., "X1_S2", "X10_S206", ...): Each remaining column represents a single sample, named to match the SampleID in the metadata files (with an "X" prefix added by R). Values represent the predicted relative abundance of each pathway in that sample, expressed as predicted gene copy counts contributed to each pathway.
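When joining this table to the metadata outside R, the "X" prefix has to be removed from the column names first. A minimal Python sketch, with a hypothetical function name, assuming the prefix appears only before names that begin with a digit:

```python
import re


def r_column_to_sample_id(column_name):
    """Remove the "X" that R prepends to column names starting with a digit."""
    # Only strip an "X" that is directly followed by a digit, so other
    # column names (such as "pathway") are left unchanged.
    return re.sub(r"^X(?=\d)", "", column_name)
```

With this helper, "X1_S2" maps back to the SampleID "1_S2".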
rarefied_table-privatefiltered.qza
QIIME2 artifact containing the rarefied ASV (amplicon sequence variant) feature table. Rarefaction is a normalization step in which each sample is randomly subsampled to the same sequencing depth to allow fair comparisons of diversity across samples. This table has been rarefied to 10,000 sequences per sample. Private filtering means that ASVs found in only a single sample have been removed.
To access the data directly: Extract the .qza file as a ZIP archive and navigate to the data/ folder. Inside, you will find a BIOM-format file (feature-table.biom). BIOM (Biological Observation Matrix) is a standard format for representing biological sample-by-observation tables. It can be opened with:
- The free biom-format Python package (https://biom-format.org): use biom convert to convert to a tab-separated text file
- The biomformat R/Bioconductor package
- The qiime2R R package (as used in the provided scripts)
The table contains:
- Rows: ASV identifiers (unique hash strings representing distinct bacterial DNA sequences)
- Columns: Sample identifiers matching the SampleID column in the metadata
- Values: Integer read counts (number of sequences assigned to each ASV in each sample), all summing to 10,000 per sample after rarefaction
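The rarefaction step described above can be illustrated with a short Python sketch that subsamples one sample's reads without replacement to a fixed depth. This is a conceptual example, not the QIIME2 implementation. The function name and the seed parameter are hypothetical.

```python
import random
from collections import Counter


def rarefy(counts, depth, seed=None):
    """Randomly subsample one sample's reads to a fixed depth.

    counts: dict mapping ASV identifier -> integer read count
    Returns a dict of subsampled counts, or None if the sample has fewer
    reads than the requested depth (such samples are typically dropped).
    """
    if sum(counts.values()) < depth:
        return None
    rng = random.Random(seed)
    # Expand counts into one entry per read, then draw without replacement.
    reads = [asv for asv, n in counts.items() for _ in range(n)]
    return dict(Counter(rng.sample(reads, depth)))
```

Every rarefied sample sums to exactly the chosen depth (10,000 in this dataset), and ASVs with few reads may drop out of a sample by chance.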
rooted-tree.qza
QIIME2 artifact containing a rooted phylogenetic tree describing the evolutionary relationships among all ASVs in the dataset. This tree is required for calculating phylogenetic diversity metrics such as Faith's phylogenetic diversity and UniFrac distances, which measure community diversity while accounting for the evolutionary relatedness of bacteria.
To access the data directly: Extract the .qza file as a ZIP archive and navigate to the data/ folder. Inside, you will find a Newick-format tree file (tree.nwk). Newick is a standard plain-text format for representing phylogenetic trees. It can be opened with:
- Any text editor (the tree is stored as a nested parenthetical string)
- Free phylogenetic tree viewers such as FigTree (http://tree.bio.ed.ac.uk/software/figtree/), iTOL (https://itol.embl.de), or the ape R package
- The qiime2R R package (as used in the provided scripts)
The tree contains:
- Tip labels: ASV identifiers (matching the row identifiers in the feature tables)
- Branch lengths: Substitutions per site, representing evolutionary distance between ASVs
table-privatefiltered.qza
QIIME2 artifact containing the unrarefied (raw count) ASV feature table. Unlike the rarefied table, this table retains the original sequencing depth of each sample and is used for differential abundance analyses, which apply their own internal normalization. Private filtering means that ASVs found in only a single sample have been removed.
To access the data directly: Extract the .qza file as a ZIP archive and navigate to the data/ folder. Inside, you will find a BIOM-format file (feature-table.biom), which can be accessed using the same tools described for rarefied_table-privatefiltered.qza above.
The table contains:
- Rows: ASV identifiers (unique hash strings representing distinct bacterial DNA sequences)
- Columns: Sample identifiers matching the SampleID column in the metadata
- Values: Integer read counts (number of sequences assigned to each ASV in each sample); totals vary by sample because this table has not been rarefied
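The private-filtering rule described above (removing ASVs found in only a single sample) can be sketched in Python as follows. This illustrates the rule only; it is not the command used to produce the table, and the function name and table layout are hypothetical.

```python
def drop_private_asvs(table):
    """Keep only ASVs detected (count > 0) in at least two samples.

    table: dict mapping ASV identifier -> dict of sample identifier -> count
    """
    return {
        asv: sample_counts
        for asv, sample_counts in table.items()
        if sum(1 for count in sample_counts.values() if count > 0) >= 2
    }
```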
Code/Software
R code to replicate the analysis is included; further instructions are given inline in the provided scripts. All statistical analyses were performed in R version 4.3.2 (R Core Team 2023). RStudio (free, https://posit.co/downloads/) is recommended for running the scripts. Open the .Rproj file in RStudio to set the correct working directory, then run the scripts in numerical order.
Key R packages used include: phyloseq for microbiome data handling, qiime2R for importing QIIME2 artifacts into R, vegan for ecological statistics, lme4/lmerTest for linear mixed-effects models, MicrobiomeStat for LinDA differential abundance analysis, and ggplot2/patchwork for visualization. All packages are freely available from CRAN or Bioconductor.
Access information
Other publicly accessible locations of the data:
Raw sequence data were uploaded to the NCBI Sequence Read Archive under the Bioproject Accession Number PRJNA1241318.
