Genotyping by sequencing for estimating relative abundances of diatom taxa in mock communities
Data files
Jan 29, 2025 version files 1.55 GB
- Metareference_database.fa (22 MB)
- README.md (4.85 KB)
- STAR_mapping.bam (1.52 GB)
- stats.csv (7.06 MB)
Abstract
Background Diatoms are present in all waters and are highly sensitive to pollution gradients. Therefore, they are ideal bioindicators for water quality assessment. Current indices used in these applications are based on identifying diatom species and counting their abundances using traditional light microscopy. Several molecular techniques have been developed to help automate different steps of this process, but obtaining reliable estimates of diatom community composition and species abundance remains challenging.
Results Here, we evaluated a recently developed quantification method based on Genotyping by Sequencing (GBS) for the first time in diatoms to estimate the relative abundances within a species complex. For this purpose, a reference database comprising thousands of genomic DNA clusters was generated from cultures of Nitzschia palea. The sequencing reads from calibration and mock samples were mapped against this database for parallel quantification. We sequenced 25 mock diatom communities containing up to five taxa per sample in different abundances. Taxon abundances in these communities were also quantified by a diatom expert using manual counting of cells on light microscopic slides. The relative abundances of strains across mock samples were over- or under-estimated by the manual counting method, and a majority of mock samples had stronger correlations using GBS. Moreover, one previously recognized putative hybrid had the largest number of false positive detections, demonstrating the limitation of the manual counting method when morphologically similar and/or phylogenetically close taxa are analyzed.
Conclusions Our results suggest that GBS is a reliable method to estimate the relative abundances of the N. palea taxa analyzed in this study and outperformed traditional light microscopy in terms of accuracy. GBS provides increased taxonomic resolution compared to currently available quantitative molecular approaches, and it is more scalable in the number of species that can be analyzed in a single run. Hence, this is a significant step forward in developing automated, high-throughput molecular methods specifically designed for the quantification of [diatom] communities for freshwater quality assessments.
README: Genotyping by Sequencing for estimating relative abundances of diatom taxa in mock communities
https://doi.org/10.5061/dryad.gqnk98srr
Description of the data and file structure
We acquired six N. palea strains from the Thonon Culture Collection, France (TCC) [53]. The cultures were grown in WC medium [54] at 19 °C on a 16 h light/8 h dark cycle. We routinely examined the live cultures under a Zeiss Axio Imager M2 microscope and transferred the cells every 1–2 weeks based on the observed growth rates of individual cultures. Several harvests from each strain were collected during exponential growth. These cells were concentrated in 2 ml tubes and counted using a Zeiss Axio Imager M2 microscope under brightfield (DIC) and a Neubauer counting chamber (Carl Roth, Germany). Three replicates were counted for each cell suspension (Additional file 1: Table S4), and we harvested additional cultures until a minimum of 6 million cells per strain had been collected. Suspensions with concentrations outside the recommended range for counting with a Neubauer counting chamber (i.e., 250,000–2.5 million cells/ml) were either diluted in dH2O or concentrated by centrifugation. Finally, we prepared 25 mock samples by mixing, from each suspension, the volume that contained the number of cells required for the GBS setup.
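The volume arithmetic behind the mock mixtures can be sketched as follows. This is a minimal illustration, not the authors' script: the function names and the example counts are hypothetical, and only the Neubauer counting range is taken from the text above.

```python
# Recommended counting range for a Neubauer chamber, as stated in the text (cells/ml).
NEUBAUER_MIN = 250_000
NEUBAUER_MAX = 2_500_000


def mean_concentration(replicate_counts):
    """Average the replicate concentration estimates (cells/ml)."""
    return sum(replicate_counts) / len(replicate_counts)


def in_counting_range(cells_per_ml):
    """True if the suspension can be counted without diluting or concentrating."""
    return NEUBAUER_MIN <= cells_per_ml <= NEUBAUER_MAX


def volume_for_cells(required_cells, cells_per_ml):
    """Volume (ml) of suspension that contains the required number of cells."""
    if cells_per_ml <= 0:
        raise ValueError("concentration must be positive")
    return required_cells / cells_per_ml


# Hypothetical example values, not taken from the dataset.
concentration = mean_concentration([900_000, 1_000_000, 1_100_000])
volume_ml = volume_for_cells(500_000, concentration)
```

With these made-up numbers, the three replicates average to 1 million cells/ml, so 0.5 ml of the suspension would supply 500,000 cells.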
Code/software
Data processing followed these main steps: (1) read demultiplexing, adapter removal using AdapterRemoval [56], and merging of paired-end reads using NGmerge [57] with a minimum overlap of 20 bp and a maximum of 10% mismatches (otherwise the reads were joined); (2) meta-reference creation by dereplicating (minuniquesize = 5) and clustering (95% identity) merged monoclonal reads using VSEARCH [58], and filtering out non-Eukaryota and Fungi clusters using BLASTN with a minimum alignment length of 40 bp and an e-value of 1e−20; (3) mapping reads from all samples to the meta-reference using STAR [59], allowing multi-mapping; (4) removing duplicate reads and reads with low alignment scores (< 0.8); and (5) removing homologous clusters between strains from the meta-reference (see below), calculating a calibration key from samples with equal diatom proportions, and estimating the relative abundances of the mock mixture samples. Homologous clusters were removed from the meta-reference if (1) more reads mapped to a non-target monoculture cluster (non-target reads > target reads), (2) an insufficient number of reads mapped to a target monoculture cluster (target reads < 8), and (3) the ratio of non-target to target reads of a cluster exceeded the threshold (non-target/target > 1/15).
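The three removal criteria for homologous clusters can be sketched as a single predicate. This is an illustration under stated assumptions rather than the published pipeline: it follows the inequalities given in parentheses above, it assumes that meeting any one criterion is enough to remove a cluster (the text does not say this explicitly), and the function and parameter names are hypothetical.

```python
def should_remove_cluster(target_reads, non_target_reads,
                          min_target_reads=8, max_ratio=1 / 15):
    """Return True if a meta-reference cluster fails any of the three criteria.

    Assumption: a single failed criterion is sufficient for removal.
    """
    # (1) More reads from non-target monocultures than from the target.
    if non_target_reads > target_reads:
        return True
    # (2) Too few reads from the target monoculture.
    if target_reads < min_target_reads:
        return True
    # (3) Non-target/target ratio above the threshold.
    # target_reads is at least min_target_reads here, so the division is safe
    # as long as min_target_reads is positive.
    if non_target_reads / target_reads > max_ratio:
        return True
    return False


# Hypothetical read counts for four clusters: (target, non-target).
examples = [(100, 0), (100, 7), (5, 0), (10, 20)]
decisions = [should_remove_cluster(t, n) for t, n in examples]
```

Under this reading, criterion (3) is the strictest for well-covered clusters: a cluster with 100 target reads is already dropped at 7 non-target reads.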
Access information
Other publicly accessible locations of the data:
- PRJNA868318
Data was derived from the following sources:
- Thonon Culture Collection, France (TCC)
Files and descriptions
Metareference_database.fa: Filtered and clustered monoculture reads for six strains (jenamono1-6) (Steps 2 and 5 in Code/software)
STAR_mapping.bam: Mapping file (Steps 3 and 4 in Code/software)
stats.csv: Final results file in tabular format showing the number of reads mapped to each monoculture cluster (jenamono1-6: monoculture samples; standaard1-5: calibration samples; ratio_1-25: mock samples)
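One way to turn per-strain read counts like those in stats.csv into relative abundances is sketched below. The exact calibration formula is not given in this README, so the approach here is an assumption: a per-strain factor is computed from a sample with equal diatom proportions and then used to correct the mock-sample counts. The strain labels, read counts, and function names are hypothetical, and the sketch works on in-memory dictionaries because the column layout of stats.csv is not described here.

```python
def read_shares(reads_per_strain):
    """Convert per-strain read counts into fractions that sum to 1."""
    total = sum(reads_per_strain.values())
    if total == 0:
        raise ValueError("no reads mapped")
    return {strain: n / total for strain, n in reads_per_strain.items()}


def calibration_key(calibration_reads):
    """Per-strain factor: observed read share divided by the expected share.

    Assumes the calibration sample contains all strains in equal proportions,
    so the expected share of each strain is 1 / number of strains.
    """
    shares = read_shares(calibration_reads)
    expected = 1 / len(shares)
    return {strain: share / expected for strain, share in shares.items()}


def estimate_abundances(mock_reads, key):
    """Correct mock-sample read counts with the key, then renormalise."""
    # A strain with a key of 0 (no reads in the calibration sample)
    # cannot be corrected and would raise ZeroDivisionError here.
    corrected = {strain: n / key[strain] for strain, n in mock_reads.items()}
    return read_shares(corrected)


# Hypothetical counts: strain_A yields twice as many reads per cell as strain_B.
key = calibration_key({"strain_A": 200, "strain_B": 100})
estimate = estimate_abundances({"strain_A": 400, "strain_B": 100}, key)
```

In this toy case the raw read shares would be 80% and 20%, while the corrected estimate is about two thirds strain_A and one third strain_B.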
Methods
Genomic DNA was extracted manually from all samples using a modified CTAB extraction procedure. The GBS protocol and sequencing followed Wagemaker et al. (2020) with minor modifications. In brief, extracted genomic DNA from the 36 samples was digested with two restriction enzymes (PacI and NsiI), and two indexed adapters were ligated to the digested DNA fragments. Each adapter incorporated a three-base-pair unique molecular identifier (UMI) region to identify PCR duplicates within each library. The libraries were pooled and split into three aliquots to further prevent PCR bias. These aliquots were purified using QIAquick (QIAGEN), size selected for >150 bp fragments using AMPure XP beads (Beckman Coulter), and nick repaired using DNA polymerase I to improve PCR efficiency. The cleaned libraries were amplified (16 PCR cycles) using KAPA HiFi HotStart ReadyMix (Roche). The PCR reactions were combined, concentrated using QIAquick, size selected again for >150 bp fragments using AMPure XP beads, and quantified using the KAPA Library Quantification Kit for Illumina platforms (Roche). The final libraries were spiked with 10% PhiX DNA. Sequencing was performed by Novogene (Hong Kong) on an Illumina (USA) NovaSeq 6000 platform, generating 2x150 bp paired-end reads.
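The role of the UMIs in identifying PCR duplicates can be illustrated with a small sketch. It is not the tool used in the pipeline, and the duplicate definition is an assumption: two reads are treated as PCR duplicates when they map to the same cluster at the same position and carry the same pair of UMIs. The record fields and example reads are hypothetical.

```python
def remove_pcr_duplicates(reads):
    """Keep the first read for each (cluster, position, UMI pair) combination."""
    seen = set()
    unique = []
    for read in reads:
        # Assumed duplicate key: same mapping location and same two UMIs.
        key = (read["cluster"], read["position"], read["umi_fwd"], read["umi_rev"])
        if key not in seen:
            seen.add(key)
            unique.append(read)
    return unique


# Hypothetical mapped reads; the second is a duplicate of the first.
reads = [
    {"cluster": "c1", "position": 1, "umi_fwd": "ACG", "umi_rev": "TTA"},
    {"cluster": "c1", "position": 1, "umi_fwd": "ACG", "umi_rev": "TTA"},
    {"cluster": "c1", "position": 1, "umi_fwd": "ACG", "umi_rev": "GGC"},
    {"cluster": "c2", "position": 1, "umi_fwd": "ACG", "umi_rev": "TTA"},
]
deduplicated = remove_pcr_duplicates(reads)
```

With three-base UMIs on both adapters there are only 64 × 64 possible UMI pairs, so at very high coverage some independent fragments could collide and be collapsed by a scheme like this one.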
For light microscopy (LM) counts, we transferred the aliquots from each mock sample to glass tubes and oxidized these samples with hydrogen peroxide on a heat block for 30 min at 90 °C following the Handboek Hydrobiologie (Van Dam and Mertens, 2010). The oxidized samples were washed twice with distilled water (centrifugation for 5 min at 4000 g) and resuspended in 100 µl distilled water. Two slides were prepared per mock sample using Naphrax® as the mountant, and the slide with the better spread was selected for light microscopy analysis on a Zeiss Axioskop 40 using phase contrast at a magnification of 1000x (NA 1.30). For each mock sample, 200 valves were measured and identified on the selected slide.
Usage notes
Integrative Genomics Viewer (IGV) for BAM and FASTA files.
Excel for CSV files.