A novel method to assess the integrity of frozen archival DNA samples: Alpha-diversity ratios of short and long-read 16S rRNA gene sequences
Data files
Aug 09, 2024 version files 19.21 MB
Abstract
Archival DNA samples collected and analysed for a range of research and applied questions have accumulated in the laboratories of universities, government agencies, and commercial service providers for decades. These DNA archives represent a valuable, yet largely untapped repository of genomic information. With lowering costs of, and increasing access to, high-throughput sequencing, we predict an increase in retrospective research to explore the wealth of information that resides in these archival samples. However, for this to occur, we need confidence in the integrity of the DNA samples, often stored under sub-optimal conditions and their fitness of purpose for downstream genomic analysis. Here, we borrow from a well-established concept in ancient DNA to evaluate sample integrity, defined as loss of information content in recovered amplicons, of frozen DNA samples and based on the ratio of ⍺-diversity of short and long-read 16S rRNA gene sequences. The 16S rRNA variable region of eighty-seven stored DNA samples, extracted from soil, collected from western and southern agricultural regions of Australia between 2001 to 2021, were sequenced using both PacBio full-length reads (V1-V9, 1.5 kbp) and Illumina short-reads (V3-V4, 200-450 bp). When ⍺-diversity ratios were calculated between the long and short reads to assess DNA degradation, the ratio of ⍺-diversity did not decrease in older samples versus younger samples. We suggest this as a novel method to confirm the integrity of DNA before embarking on large-scale diversity profiling projects using archival DNA.
README: A novel method to assess the integrity of frozen archival DNA samples: Alpha-diversity ratios of short and long-read 16S rRNA gene sequences
https://doi.org/10.5061/dryad.v9s4mw73t
We utilized DNA extracted from various agricultural soils that were stored at -20°C in a gene bank freezer room over 20 years by the South Australian Research and Development Institute (SARDI). This DNA was collected through the PREDICTA® B DNA-based soil disease testing service for broadacre farming (PREDICTA® B). We selected 87 soil DNA extracts from three Australian states (regions), spanning 10 distinct time bins between 2001 and 2020. Our primary concern was the potential DNA degradation in the oldest samples. Therefore, we included samples from the first four years (2001-2004) and selected samples more sporadically from subsequent years (2005 onwards). Alpha-diversity ratios, using Shannon's diversity index, were calculated to determine if there was a decrease in species richness between the long-read and short-read sequencing methods.
Description of the data and file structure
Quantitative Insights Into Microbial Ecology 2 (QIIME2 v. 2018.11) software was used to analyse the trimmed and reoriented sequences. The taxonomy assignment generated from this analysis can be found in the following files:
- lvl2_Illumina_PacBio_export_from_QIIME2.csv (Phyla level)
- level-7_PacBio_illumina.csv (Species-level)
All other analyses were conducted using R (version 4.2.0) on unrarefied sequences with the following packages:
- Vegan (2.6-4)
- ggplot2 (3.4.2)
- phyloseq (1.40.0)
- indicspecies (1.7.14)
The biome file (table-with-taxonomy.biom), mapping file (Manifest_All_prelim_study_forfeturetable.csv), and tree file (tree.nwk) were imported into a phyloseq object as described in the provided documentation (PCoA_plot_alpha_diversity_illumina_PacBio_Fig2.txt). Beta and alpha diversity were calculated as described in PCoA_plot_alpha_diversity_illumina_PacBio_Fig2. Relative abundance graphs were created using ggplot2 as described in relative_abundance_graph.txt. Indicator species analysis was performed in R with relative abundance data (lvl2_Illumina_PacBio_RAtrans.csv) according to the script in Indisp_anal.txt.
File Descriptions
The following files contain scripts and data used to create figures and analyze the sequencing data:
1. Indisp_anal.txt: Script for indicator species analysis.
2. level-7_PacBio_illumina.xlsx: File exported from QIIME2 level 7 (species) taxa assignment.
3. lvl2_Illumina_PacBio_export_from_QIIME2.xlsx: File exported from QIIME2 level 2 (phyla) taxa assignment.
4. Manifest_All_prelim_study_forfeturetable.xlsx: Metadata used in phyloseq.
5. PCoA_plot_alpha_diversity_illumina_PacBio_Fig2.txt: Script for beta and alpha diversity analysis. (Available on Zenodo)
6. relative_abundance_graph.txt: Script for creating relative abundance graphs. (Available on Zenodo)
7. table-with-taxonomy.biom: File containing sequence data for phyloseq.
8. tree.nwk: Tree file for phyloseq.
Sharing/Access information
All raw sequencing data is available through the NCBI Sequence Read Archive (SRA) under BioProject identifiers PRJNA1051317.
Methods
Data analysis The Pacific Biosciences Nextflow pipeline (https://github.com/PacificBiosciences/pb-16S-nf) was followed for initial data processing. Raw reads were processed, including demultiplexing by “q2-demux” in QIIME2, and quality control was assessed with q2-cutadapt. Quantitative Insights Into Microbial Ecology 2 (QIIME2 v. 2018.11) software was used to analyse the trimmed reorientated sequences (Bolyen et al., 2019). The DADA2 denoising option (Callahan et al., 2016) was selected to pick up the representative reads for generating an amplicon sequence variants (ASVs) table. ASVs generated from DADA2 were classified using the Naive Bayes classifier and SILVA reference database version 138.1 (Quast et al., 2013). For analysis between the platforms the feature table of each platform was merged, as were the representative sequences post-DADA2 denoising with QIIME2 before building the phylogenetic tree and assigning taxonomy.
Taxonomic diversity analysis All analysis was conducted with R (4.2.0) on unrarefied sequences. Alpha diversity was calculated using Vegan (2.6-4) (Oksanen et al., 2022), and plots were generated using ggplot2 (3.4.2). Beta diversity was calculated using phyloseq (1.40.0) (McMurdie & Holmes, 2013) based on Bray Curtis or unweighted UniFrac distances and displayed as PCoA ordination plots. The phylogenetic tree used for the calculation of the unweighted and weighted UniFrac distances was generated using Multiple Alignment using Fast Fourier Transform (MAFFT) (Kazutaka & Standley, 2013) to align sequences and infer a phylogenetic tree using FastTree (Price et al., 2010) in QIIME2. The tree was imported into phyloseq using ape (5.6-2). To determine which Phyla were more abundant with either PacBio or Illumina Platforms, an indicator species analysis using multipatt was conducted with indicspecies (1.7.14) in R using 999 permutations (De Cáceres & Legendre, 2009). This used the method of Dufrêne and Legendre (1997) calculated the IndVal index between the species and each site group and then tested for the group corresponding to the highest association value. The statistical significance of this relationship was tested using a permutation test.