Data from: Tetranucleotide frequencies and ratios of frequencies
Data files
Jan 03, 2025 version files 788.05 MB
- README.md (2.78 KB)
- Supplemental_File_S2.csv (788.05 MB)
Abstract
Microbiomes are constrained by physicochemical conditions, nutrient regimes, and community interactions across diverse environments, yet genomic signatures of this adaptation remain unclear. Metagenome sequencing is a powerful technique for analyzing genomic content in the context of natural environments and for establishing concepts of microbial ecological trends. Here, we developed a data discovery tool, a tetranucleotide-informed metagenome stability diagram, that is publicly available in the Integrated Microbial Genomes and Microbiomes (IMG/M) platform for metagenome-ecosystem analyses. We analyzed the tetranucleotide frequencies from quality-filtered and unassembled sequence data of over 12,000 metagenomes to assess ecosystem-specific microbial community composition and function. We found that tetranucleotide frequencies can differentiate communities across various natural environments, and that specific functional and metabolic trends can be observed in this structuring. Our tool places metagenomes sampled from diverse environments into clusters and along gradients of tetranucleotide frequency similarity, suggesting microbiome community compositions specific to gradient conditions. Within the resulting metagenome clusters, we identify protein-coding gene identifiers that are most differentiated between ecosystem classifications. We plan for annual updates to the metagenome stability diagram in IMG/M with new data, allowing for refinement of the ecosystem classifications delineated here. This framework has the potential to inform future studies on microbiome engineering, bioremediation, and the prediction of microbial community responses to environmental change.
README: Data from: Tetranucleotide frequencies and ratios of frequencies
https://doi.org/10.5061/dryad.tb2rbp0c8
Description of the data and file structure
Publicly available metagenomes sequenced at the JGI and added to the IMG database before April 10, 2024 (the date of collection) were considered for this analysis (Chen et al. 2023). These collection criteria yielded 15,208 metagenome datasets labeled “Metagenome Analysis” as their GOLD Analysis Project Type (Mukherjee et al. 2024). Metagenomes with a GOLD Ecosystem classification of “Engineered” or “Host-associated”, or a GOLD Ecosystem Type classification of “Nest”, were removed because their non-natural or host-restricted environments would alter the interpretation of tetranucleotide frequency plots as reflecting physicochemical pressures. After filtering, 12,063 metagenomes remained, with GOLD Ecosystem Categories and Types.
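The exclusion rules above can be illustrated with a minimal Python sketch. The record layout and the field names `ecosystem` and `ecosystem_type` are assumptions made for illustration, not the actual GOLD or IMG export schema, and the example records are made up.

```python
# Hypothetical field names and made-up records; not the GOLD/IMG export schema.
EXCLUDED_ECOSYSTEMS = {"Engineered", "Host-associated"}
EXCLUDED_ECOSYSTEM_TYPES = {"Nest"}


def is_natural_environment(record):
    """Return True if a metagenome record passes both exclusion rules."""
    if record.get("ecosystem") in EXCLUDED_ECOSYSTEMS:
        return False
    if record.get("ecosystem_type") in EXCLUDED_ECOSYSTEM_TYPES:
        return False
    return True


def filter_metagenomes(records):
    """Keep only records from natural, non-host-restricted environments."""
    return [r for r in records if is_natural_environment(r)]


# Illustrative records only; the classifications are invented for the demo.
example_records = [
    {"taxon_id": "A", "ecosystem": "Environmental", "ecosystem_type": "Soil"},
    {"taxon_id": "B", "ecosystem": "Engineered", "ecosystem_type": "Bioreactor"},
    {"taxon_id": "C", "ecosystem": "Host-associated", "ecosystem_type": "Digestive system"},
    {"taxon_id": "D", "ecosystem": "Environmental", "ecosystem_type": "Nest"},
]

kept_ids = [r["taxon_id"] for r in filter_metagenomes(example_records)]
print(kept_ids)  # ['A']
```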
From the included metagenome datasets, sequencing reads filtered following the standard operating procedure of the DOE-JGI Metagenome Annotation Pipeline (Huntemann et al. 2016) were analyzed for tetranucleotide counts with KMC (version 3.1.1) (Kokot, Długosz, and Deorowicz 2017). Tetranucleotide counts were converted into frequencies (i.e., each 4-mer count divided by the total count of all 4-mers) and ratios of frequencies (i.e., each 4-mer frequency divided by the frequency of each other 4-mer).
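The conversion from counts to frequencies and ratios can be sketched in plain Python. This is a small stand-in for illustration and is not the KMC workflow used for the dataset. It assumes that each 4-mer is counted together with its reverse complement, that frequencies are expressed as percentages, and that each pair is reported once in alphabetical order; the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the alphabetically smaller of a k-mer and its reverse complement."""
    return min(kmer, kmer.translate(COMPLEMENT)[::-1])


def count_tetranucleotides(reads):
    """Count canonical 4-mers over all reads, skipping windows with non-ACGT bases."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - 3):
            kmer = read[i:i + 4]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def frequencies(counts):
    """Convert counts to percentage frequencies (count / total count * 100)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no tetranucleotides were counted")
    return {kmer: 100.0 * n / total for kmer, n in counts.items()}


def ratios(freqs):
    """Ratio of frequencies for every unordered pair, keyed as 'A/B' with A < B."""
    return {f"{a}/{b}": freqs[a] / freqs[b] for a, b in combinations(sorted(freqs), 2)}


# Two toy reads; real input would be millions of quality-filtered reads.
toy_counts = count_tetranucleotides(["ATCGATCG", "TTAATTAA"])
toy_freqs = frequencies(toy_counts)
toy_ratios = ratios(toy_freqs)
print(toy_counts["ATCG"], toy_counts["TTAA"])  # 3 2
print(toy_ratios["ATCG/TTAA"])                 # 1.5
```

Only 4-mers that were actually observed appear in the result, so the ratio step never divides by zero in this sketch; a real dataset would need an explicit rule for absent 4-mers.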
Files and variables
File: Supplemental_File_S2.csv
Description: JGI IMG metagenome Taxon IDs and metagenome tetranucleotide frequencies.
Variables
- Taxon_ID: JGI IMG metagenome Taxon ID
- Tetranucleotide: Tetranucleotide frequency. For each JGI IMG metagenome Taxon ID, tetranucleotide frequencies are given for all 136 unique tetranucleotides (e.g., the value for ATCG is the count of ATCG tetranucleotides within the metagenome sequencing reads divided by the total count of all tetranucleotides, then multiplied by 100 to convert it to a percentage frequency).
- Tetranucleotide ratios: Ratio of tetranucleotide frequencies. For each JGI IMG metagenome Taxon ID, the unique ratios of tetranucleotide frequency pairings are given (e.g., the value for ATCG/TTAA is the frequency of ATCG divided by the frequency of TTAA).
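The figure of 136 unique tetranucleotides is what one obtains when each of the 256 possible 4-mers is merged with its reverse complement. The README does not state this explicitly, so the sketch below treats it as an assumption and simply checks the arithmetic. The final count of pairings likewise assumes that each pair of tetranucleotides appears once, which the README does not confirm.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Reverse complement of a DNA k-mer."""
    return kmer.translate(COMPLEMENT)[::-1]


# All 4^4 possible tetranucleotides.
all_kmers = ["".join(bases) for bases in product("ACGT", repeat=4)]

# 4-mers that equal their own reverse complement (e.g. TTAA).
palindromes = [k for k in all_kmers if k == reverse_complement(k)]

# One representative per reverse-complement pair.
canonical_kmers = sorted({min(k, reverse_complement(k)) for k in all_kmers})

# Unordered pairs of canonical 4-mers (assumed layout of the ratio columns).
n_pairs = len(canonical_kmers) * (len(canonical_kmers) - 1) // 2

print(len(all_kmers))        # 256
print(len(palindromes))      # 16
print(len(canonical_kmers))  # 136
print(n_pairs)               # 9180
```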
Code/software
Any text editor or spreadsheet program that can handle large files can open and view the data.
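Because the file is large, reading it one row at a time may be more practical than loading it whole. The following is a minimal sketch using Python's standard `csv` module. It assumes a header row with a `Taxon_ID` column, one column per tetranucleotide, and ratio columns whose names contain a `/` (such as `ATCG/TTAA`); the actual header layout should be checked against the file. The function name and the demo file contents are hypothetical.

```python
import csv
import os
import tempfile


def iter_frequency_rows(path, id_column="Taxon_ID"):
    """Yield (taxon_id, {tetranucleotide: frequency}) one row at a time.

    Columns whose names contain '/' are treated as ratio columns and skipped
    (an assumption about the header layout, not a documented guarantee).
    """
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            freqs = {
                name: float(value)
                for name, value in row.items()
                if name and name != id_column and "/" not in name and value
            }
            yield row[id_column], freqs


# Tiny made-up file standing in for Supplemental_File_S2.csv.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", newline="") as out:
        out.write("Taxon_ID,ATCG,TTAA,ATCG/TTAA\n")
        out.write("1,0.30,0.20,1.5\n")
    demo_rows = list(iter_frequency_rows(demo_path))

print(demo_rows)  # [('1', {'ATCG': 0.3, 'TTAA': 0.2})]
```

A generator keeps memory use roughly constant per row, at the cost of a single sequential pass over the file.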
Access information
Data were derived from quality-filtered, unassembled JGI IMG metagenome sequencing reads.
Methods
For the included metagenome datasets, sequencing reads filtered following the standard operating procedure of the DOE-JGI Metagenome Annotation Pipeline (Huntemann et al. 2016) were analyzed for tetranucleotide counts with KMC (version 3.1.1) (Kokot, Długosz, and Deorowicz 2017). Tetranucleotide counts were converted into frequencies (i.e., each 4-mer count divided by the total count of all 4-mers) and ratios of frequencies (i.e., each 4-mer frequency divided by the frequency of each other 4-mer).