Data from: Isolation by distance promotes gut microbial strain divergence in wild mouse populations
Data files
Mar 25, 2026 version, total size 1.06 GB
- README.md (6.93 KB)
- RodentMAGs.tar.gz (1.06 GB)
Abstract
Bacterial species within the mammalian gut microbiota exhibit considerable strain diversity associated with both geography and host genetic ancestry. However, because geography and ancestry are typically confounded, disentangling their contributions to the divergence of gut bacterial strains has remained challenging. Here, we show that isolation by distance (IBD) promotes gut bacterial strain divergence within host species independently of host ancestry. Joint profiling of gut bacterial and mitochondrial genomes from wild-living populations of deer mice (Peromyscus maniculatus) sampled across the United States revealed significant IBD in 27 predominant gut bacterial species, including Muribaculaceae and Lachnospiraceae spp., but limited evidence for co-inheritance of gut bacterial genomes with mitochondria during the diversification of mouse populations. Spore-forming gut bacterial species exhibited reduced IBD, suggesting that adaptations facilitating bacterial dispersal can lessen geographic structuring of strain diversity. In contrast to conspecific hosts sampled at the same field site, hosts of different rodent genera sampled in sympatry with deer mice harbored divergent strains within shared gut bacterial species. These results indicate that geographic distance mediates the early stages of gut bacterial strain divergence between conspecific hosts, whereas effects of host ancestry on strain-level microbiota composition emerge over longer evolutionary timescales, such as those separating host genera.
Dataset DOI: 10.5061/dryad.95x69p8z4
Description of the data and file structure
Metagenome-assembled genomes from "Isolation by distance promotes strain diversification in the wild mouse gut microbiota."
Files and variables
File: RodentMAGs.tar.gz
Description:
This is a compressed archive file that, when opened, contains the RodentMAGs directory.
Within the RodentMAGs directory, the all_mags directory contains files with the .fa extension. Each .fa file represents a metagenome-assembled genome (MAG) associated with the manuscript titled "Isolation by distance promotes gut microbial strain divergence in wild mouse populations."
For each .fa file, the prefix preceding the first underscore denotes the sample from which the MAG was derived.
Sample and MAG metadata are provided in TableS1.txt and TableS3.txt, respectively, and are located in the RodentMAGs directory.
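The naming convention described above can be sketched in Python as follows. This is a minimal illustration, not code distributed with the dataset: the function name and the example file name are hypothetical, and only the rule "text before the first underscore is the sample" comes from this README.

```python
from pathlib import Path


def sample_from_mag_filename(path):
    """Return the sample prefix: the text before the first underscore in the file name."""
    # Use only the file name, because the parent directory (all_mags) contains an underscore.
    name = Path(path).name
    prefix, sep, _ = name.partition("_")
    if not sep:
        raise ValueError(f"no underscore in MAG file name: {name}")
    return prefix


# Hypothetical file name, for illustration only.
example = "RodentMAGs/all_mags/SAMPLE01_bin.12.fa"
print(sample_from_mag_filename(example))
```

The returned prefix can then be matched against the Sample column of TableS1.txt.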
TableS1.txt - Metadata for rodent fecal samples sequenced with Illumina NovaSeq.
column descriptions:
- Sample - Unique identifier for each host fecal sample.
- Latitude - Latitude of the sample collection location.
- Longitude - Longitude of the sample collection location.
- site - The National Ecological Observatory Network (NEON) field site abbreviation where the sample was collected.
- host_species - Host species from which the fecal sample was collected.
- host_individual - NEON identifier for the sampled host individual.
- NeonBiorepository_catalogNumber - NEON biorepository identifier for the fecal sample.
- NEON_sampleID - NEON sample identifier for the fecal sample.
- Collection_date - Date the fecal sample was collected.
- number_host_reads - Number of sequencing reads identified as host-derived in the metagenomic dataset.
- %Dups - Percentage of duplicate sequencing reads relative to total reads.
- %GC - Percentage of guanine (G) and cytosine (C) bases among all sequenced bases.
- Median_Read_Length - Median read length after quality control and processing.
- M_PE_Seq_reads - Number of paired-end sequencing reads (in millions) obtained for each sample.
- percent_PE_reads_mapping_to_SGB_reference_set_sensitive_local - Percentage of total cleaned paired-end reads mapping to the MAG reference set using Bowtie2 with the --sensitive-local option enabled.
- percent_PE_reads_mapping_to_SGB_reference_set_very_sensitive_local - Percentage of total cleaned paired-end reads mapping to the MAG reference set using Bowtie2 with the --very-sensitive-local option enabled.
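Because the Latitude and Longitude columns give the collection location of each sample, pairwise geographic distances between samples can be derived from them. The sketch below uses the haversine (great-circle) formula on a spherical Earth. This is an illustrative choice only; this README does not state how distances were computed in the manuscript, and the function name, Earth radius constant, and example coordinates are assumptions.

```python
import math

# Mean Earth radius in kilometers; an assumption of the spherical model used here.
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot for antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


# Hypothetical coordinates, not taken from TableS1.txt.
print(haversine_km(40.0, -105.0, 35.0, -84.0))
```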
TableS3.txt - Metadata and quality reports for metagenome-assembled genomes.
- MAG_ID - Unique identifier for each MAG.
- mean_coverage - Mean sequencing depth across the MAG when reads from all samples are mapped to the concatenated MAG reference set. Values of NA indicate that the MAG was not included in the reference set.
- mean_breadth - Mean proportion of the MAG covered by sequencing reads when reads from all samples are mapped to the concatenated MAG reference set. Values of NA indicate that the MAG was not included in the reference set.
- SGB_reference_strain - Binary TRUE/FALSE indicating whether the MAG was included as a reference in the concatenated MAG reference set.
- host_origin - Host species from which the MAG was derived.
- The following columns describe taxonomic classification of each MAG. MAGs that could not be classified contain NA values across all taxonomic columns. If higher ranks are assigned but genus or species are NA, classification was inconclusive at those ranks.
- domain - Taxonomic domain of the MAG.
- phylum - Taxonomic phylum of the MAG.
- class - Taxonomic class of the MAG.
- order - Taxonomic order of the MAG.
- family - Taxonomic family of the MAG.
- genus - Taxonomic genus of the MAG.
- species - Taxonomic species of the MAG.
- CheckM2 was used to assess the quality of each MAG. The columns below are associated with the CheckM2 output.
- Completeness - Estimated genome completeness.
- Contamination - Estimated genome contamination.
- Completeness_Model_Used - Model used by CheckM2 to estimate completeness.
- Translation_Table_Used - Genetic code used during gene prediction.
- Coding_Density - Proportion of nucleotides predicted to be a part of a coding sequence.
- Contig_N50 - Contig length such that 50 % of the genome is contained in contigs of this length or longer.
- Average_Gene_Length - Mean gene length (in nucleotides) of predicted genes.
- Genome_Size - Total size of the genome (in nucleotides).
- GC_Content - Percentage of guanine (G) and cytosine (C) bases among all sequenced bases as calculated by CheckM2.
- Total_Coding_Sequences - Total number of predicted coding sequences.
- Total_Contigs - Number of contiguous sequences (contigs) making up the MAG.
- Max_Contig_Length - Length (in nucleotides) of the largest contig contained in the MAG.
- Additional_Notes - Any notes provided by CheckM2.
- GUNC was used to check for chimerism and contamination in our MAGs. Below are the columns associated with the GUNC output. Many of these columns mirror columns provided by CheckM2 and will display slightly different results because of differing methodology, although all input data were the same.
- n_genes_called - Number of genes detected by Prodigal.
- n_genes_mapped - Number of n_genes_called mapped by DIAMOND to the GUNC reference database.
- n_contigs - Number of contigs making up the MAG.
- taxonomic_level - Lowest taxonomic rank GUNC could assign to the MAG.
- proportion_genes_retained_in_major_clades - Proportion of n_genes_mapped retained after excluding minor clades. A major clade is any clade to which more than 2% of mapped genes are assigned; minor clades account for less than 2% of mapped genes.
- genes_retained_index - Proportion of called genes retained in major clades. Calculated via n_genes_mapped/n_genes_called * proportion_genes_retained_in_major_clades.
- clade_separation_score - Metric describing the degree to which different parts of a MAG are assigned to different taxonomic clades.
- contamination_portion - Estimated fraction of the genome originating from non-dominant (contaminant) clades.
- n_effective_surplus_clades - Estimated number of major clades detected minus 1 (an uncontaminated MAG would have 1 clade present).
- mean_hit_identity - Mean sequence identity of genes coming from major clades to the GUNC reference database.
- reference_representation_score - Composite score describing how well a genome is represented in the GUNC reference database, calculated via genes_retained_index * mean_hit_identity.
- pass.GUNC - Binary TRUE/FALSE column describing whether a MAG passed GUNC's chimerism screening (pass threshold: clade_separation_score <= 0.45).
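The derived GUNC columns described above follow simple arithmetic, which can be sketched in Python as shown here. The formulas and the 0.45 pass threshold are taken from the column descriptions in this README; the function names and example values are hypothetical and are not part of GUNC itself or of this dataset.

```python
def genes_retained_index(n_genes_called, n_genes_mapped, proportion_retained):
    """n_genes_mapped / n_genes_called * proportion_genes_retained_in_major_clades."""
    if n_genes_called <= 0:
        raise ValueError("n_genes_called must be positive")
    return n_genes_mapped / n_genes_called * proportion_retained


def reference_representation_score(retained_index, mean_hit_identity):
    """genes_retained_index * mean_hit_identity."""
    return retained_index * mean_hit_identity


def passes_gunc(clade_separation_score, threshold=0.45):
    """A MAG passes when clade_separation_score <= threshold (0.45 per this README)."""
    return clade_separation_score <= threshold


# Hypothetical values, not taken from TableS3.txt.
gri = genes_retained_index(n_genes_called=2000, n_genes_mapped=1800,
                           proportion_retained=0.95)
rrs = reference_representation_score(gri, mean_hit_identity=0.90)
print(gri, rrs, passes_gunc(0.12))
```

Recomputing these values from the raw columns of TableS3.txt is one way to sanity-check a parsed copy of the table, allowing for small rounding differences.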
Access information
Data was derived from the following sources:
- Metagenomic shotgun sequencing of rodent fecal samples.
