Data from Readsynth: short-read simulation for consideration of composition-biases in reduced metagenome sequencing approaches
Data files
Apr 12, 2024 version files 13.07 GB
Abstract
Background
The application of reduced metagenomic sequencing approaches holds promise as a middle ground between targeted amplicon sequencing and whole metagenome sequencing approaches but has not been widely adopted as a technique. A major barrier to adoption is the lack of read simulation software built to handle characteristic features of these novel approaches. Reduced metagenomic sequencing (RMS) produces unique patterns of fragmentation per genome that are sensitive to restriction enzyme choice, and the non-uniform size selection of these fragments may introduce novel challenges to taxonomic assignment as well as relative abundance estimates.
Results
Through the development and application of simulation software, readsynth, we compare simulated metagenomic sequencing libraries with existing RMS data to assess the influence of multiple library preparation and sequencing steps on downstream analytical results. Based on read depth per position, readsynth achieved 0.79 Pearson’s correlation and 0.94 Spearman’s correlation to these benchmarks. Application of a novel estimation approach, fixed length taxonomic ratios, improved quantification accuracy of simulated human gut microbial communities when compared to estimates of mean or median coverage.
Conclusions
We investigate the possible strengths and weaknesses of applying the RMS technique to profiling microbial communities via simulations with readsynth. The choice of restriction enzymes and size selection steps in library prep are non-trivial decisions that bias downstream profiling and quantification. The simulations investigated in this study illustrate the possible limits of preparing metagenomic libraries with a reduced representation sequencing approach, but also allow for the development of strategies for producing and handling the sequence data produced by this promising application.
README: readsynth_analysis
https://doi.org/10.5061/dryad.nzs7h44zk
The dataset contained here provides the necessary raw sequence data to perform analyses for the simulation software readsynth.
The dataset includes the genomes and databases necessary to reproduce the steps in the GitHub repository readsynth_analysis and corresponds to that repository's "raw_data" directory.
Description of the data and file structure
The genome directory "raw_data" is broken into the following subdirectories (further descriptions below):
.
├── helius
│   └── all_2084
│       ├── genomes
│       └── genomes_combined
├── kraken_dbs
│   ├── k2_pluspfp_20220607
│   ├── snipen_bei_db
│   │   └── library
│   │       └── added
│   └── sun_atcc_db
│       └── library
│           └── added
├── liu_RMS
│   └── mock_community_estimate
│       ├── 10M_bracken_profile
│       │   └── genomes
│       └── SRR5298272_genomes_no_human_combined
├── snipen_RMS
│   ├── mock_community_combined
│   └── mock_community_ref_genomes
└── sun_2bRADM
    ├── 2brad_msa
    └── mock_community_ref_genomes
        └── manually_renamed_atcc_msa_1002
Each top-level directory has been compressed as a tar.gz archive and must be extracted using:
tar xzf <directory>
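To extract every archive in one pass, a loop over the downloaded files is enough. The sketch below assumes the archives are named after their top-level directory with a .tar.gz suffix and sit together in one download directory; the function name extract_archives and the default target directory raw_data are illustrative choices, not part of the dataset.

```shell
# Extract every *.tar.gz archive in a download directory into a target directory.
# Assumption: archives are named <directory>.tar.gz; adjust the glob if they differ.
extract_archives() {
    local src_dir="$1"                # directory holding the downloaded archives
    local dest_dir="${2:-raw_data}"   # hypothetical target, mirroring the repository's raw_data
    local archive
    mkdir -p "$dest_dir" || return 1
    for archive in "$src_dir"/*.tar.gz; do
        # When nothing matches, the glob stays literal; skip it
        [ -e "$archive" ] || continue
        echo "Extracting $archive" >&2
        tar xzf "$archive" -C "$dest_dir" || return 1
    done
}

# Example call (paths are placeholders):
# extract_archives ~/Downloads/readsynth_dryad raw_data
```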
A brief overview of the contents of each directory:
- helius - reference genomes derived from the HELIUS study; Deschasaux et al. 2018 (DOI: 10.1038/s41591-018-0160-1)
  - "all_2084/helius_all.csv" shows the source community structure; the taxa successfully retrieved after downloading the relevant representatives from GenBank are listed in "summary.txt"
  - "assembly_summary_genbank.txt" contains the collection of GenBank species available at the time of the study
  - "all_2084/genomes" and "all_2084/genomes_combined" contain the gzipped fasta sequence files and the index files needed by bwa-mem
- kraken_dbs - kraken2 databases used for benchmarking readsynth simulations of corresponding data
  - k2_pluspfp is a pre-built database downloaded from https://benlangmead.github.io/aws-indexes/k2 (it is not present in the current dataset; see the notes below to add it)
  - snipen_bei_db and sun_atcc_db are custom-built kraken2 databases using the mock community members from the publications by Snipen et al. 2022 (DOI: 10.1186/s40168-021-01019-8) and Sun et al. 2022 (DOI: 10.1186/s13059-021-02576-9)
- liu_RMS - reference genomes derived from Liu et al. 2017 (DOI: 10.7717/peerj.3837)
  - kraken2/bracken profiling of the SRA fastq reads was used to estimate community members
  - "run_download_genomes.sh" was used to retrieve raw data from NCBI SRA
  - the 10M_bracken_profile directory contains the kraken2/bracken profile of the data used to produce the set of reference genomes found in the SRR5298272_genomes_no_human_combined directory
- snipen_RMS - reference genomes from Snipen et al. 2022 (DOI: 10.1186/s40168-021-01019-8)
  - the mock_community_ref_genomes and mock_community_combined directories contain the individual and combined genomes from the Snipen et al. 2022 mock communities, respectively
- sun_2bRADM - reference genomes from Sun et al. 2022 (DOI: 10.1186/s13059-021-02576-9)
  - "run_download_genomes.sh" was used to retrieve the raw sequence data found in the 2brad_msa directory
  - Genome sequences were obtained from ATCC, and corresponding sequence maps "" were used to construct the kraken2 database.
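The custom databases listed above have the library/added layout that kraken2-build produces when sequences are added to a library. For orientation, the standard kraken2 custom-build sequence can be sketched as follows. This is a generic outline, not the authors' build script: the function name build_custom_db and the example paths are hypothetical, and it assumes each FASTA header carries either an NCBI accession that kraken2 can map or an explicit kraken:taxid|<id> tag.

```shell
# Build a custom kraken2 database from a set of FASTA files (generic sketch).
# Assumption: kraken2-build is on PATH and the FASTA headers can be mapped to taxa.
build_custom_db() {
    local db="$1"   # database directory to create or extend
    shift
    local fasta
    [ "$#" -gt 0 ] || { echo "usage: build_custom_db <db> <fasta>..." >&2; return 1; }
    # 1. Fetch the NCBI taxonomy into <db>/taxonomy
    kraken2-build --download-taxonomy --db "$db" || return 1
    # 2. Register each reference genome; files are placed under <db>/library/added
    for fasta in "$@"; do
        kraken2-build --add-to-library "$fasta" --db "$db" || return 1
    done
    # 3. Build the database index files
    kraken2-build --build --db "$db"
}

# Example call (paths are placeholders):
# build_custom_db my_mock_db mock_community_ref_genomes/*.fasta
```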
The taxonomy subdirectory has been removed from the k2_pluspfp kraken2 database directories because of size limitations; however, it should be possible to rebuild the taxonomy for the databases using:
kraken2-build --download-taxonomy --use-ftp --db <database>
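If several database directories need their taxonomy restored, the same command can be applied to each one that lacks a taxonomy subdirectory. The sketch below assumes the databases sit side by side under one parent directory such as kraken_dbs; the function name rebuild_missing_taxonomy is illustrative. Note that the download fetches the NCBI taxonomy as it exists at the time of the download.

```shell
# Run the taxonomy download for every database directory without a taxonomy/ subdirectory.
# Assumption: kraken2-build is on PATH and each subdirectory of <root> is one database.
rebuild_missing_taxonomy() {
    local root="$1" db
    for db in "$root"/*/; do
        db="${db%/}"                         # drop the trailing slash left by the glob
        [ -d "$db" ] || continue             # glob matched nothing
        [ -d "$db/taxonomy" ] && continue    # taxonomy already present
        echo "Downloading taxonomy for $db" >&2
        kraken2-build --download-taxonomy --use-ftp --db "$db" || return 1
    done
}

# Example call (path is a placeholder):
# rebuild_missing_taxonomy raw_data/kraken_dbs
```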
Sharing/Access information
Data was derived from the following sources:
- Liu
  - SRR5298272 (https://www.ncbi.nlm.nih.gov/sra/?term=SRR5298272)
  - SRR5298274 (https://www.ncbi.nlm.nih.gov/sra/?term=SRR5298274)
  - SRR5360684 (https://www.ncbi.nlm.nih.gov/sra/?term=SRR5360684)
- Snipen
  - SRR10199716 (https://www.ncbi.nlm.nih.gov/sra/?term=SRR10199716)
  - SRR10199724 (https://www.ncbi.nlm.nih.gov/sra/?term=SRR10199724)
  - SRR10199725 (https://www.ncbi.nlm.nih.gov/sra/?term=SRR10199725)
- Sun
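The SRA runs listed above can be fetched as FASTQ with the SRA Toolkit. The following is a minimal sketch assuming fasterq-dump is installed and on PATH. It is not the "run_download_genomes.sh" script from the dataset, and the function name download_runs and the output directory are hypothetical.

```shell
# Download one or more SRA runs as FASTQ files into an output directory.
# Assumption: fasterq-dump (SRA Toolkit) is on PATH.
download_runs() {
    local outdir="$1"   # directory that receives the FASTQ files
    shift
    local acc
    [ "$#" -gt 0 ] || { echo "usage: download_runs <outdir> <accession>..." >&2; return 1; }
    mkdir -p "$outdir" || return 1
    for acc in "$@"; do
        echo "Fetching $acc" >&2
        fasterq-dump "$acc" --outdir "$outdir" || return 1
    done
}

# Example call using the Liu accessions listed above (output directory is a placeholder):
# download_runs sra_fastq SRR5298272 SRR5298274 SRR5360684
```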
Code/Software
See readsynth_analysis for further code necessary for analysis.
Methods
Sequence data were collected and aggregated from publicly available NCBI SRA databases for raw sequence data (https://www.ncbi.nlm.nih.gov/sra) and NCBI RefSeq databases for reference genome assemblies (https://www.ncbi.nlm.nih.gov/refseq/).
Downloaded reference genomes were concatenated with the command-line "cat" command and indexed with the "bwa index" command.
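The concatenate-then-index step can be sketched as below. The function name combine_and_index and the example file names are hypothetical. The sketch relies on the fact that gzipped files joined with cat form a valid multi-member gzip stream, and it assumes the output file is not among the inputs.

```shell
# Concatenate reference genome files into one combined file, then index it with bwa.
# Assumption: inputs are all gzipped FASTA (or all plain FASTA), not a mix of both.
combine_and_index() {
    local combined="$1"   # output path for the combined reference
    shift
    [ "$#" -gt 0 ] || { echo "usage: combine_and_index <combined> <fasta>..." >&2; return 1; }
    # Join the input files in the order given
    cat "$@" > "$combined" || return 1
    # Index only when bwa is available; otherwise leave the combined file in place
    if command -v bwa >/dev/null 2>&1; then
        bwa index "$combined"
    else
        echo "bwa not found on PATH; skipping indexing of $combined" >&2
    fi
}

# Example call (file names are placeholders):
# combine_and_index genomes_combined/combined.fna.gz genomes/*.fna.gz
```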