Data from: Barcoding 100K specimens in a single nanopore run
Data files (May 30, 2024 version, 16.07 GB total)
autotrim_taxfilter_dominants.R - 7.73 KB
BOLD_UniqueBINs_sintax.fasta - 440.71 MB
final_contig_consensus.R - 1.62 KB
generate_histograms.R - 9.47 KB
get_parameters.R - 2.74 KB
ONT.sh - 58.58 KB
pairwise_divergence.py - 3.86 KB
parameters_100K.xlsx - 2.86 MB
parameters_10K.xlsx - 239.08 KB
parameters_2K.xlsx - 71.95 KB
primary_contig_consensus.R - 1.76 KB
Raw_Reads_100K.fastq.gz - 14.11 GB
Raw_Reads_10K.fastq.gz - 853.77 MB
Raw_Reads_2K.fastq.gz - 642.94 MB
README.md - 8.10 KB
taxonomy_100K.xlsx - 15.52 MB
taxonomy_10K.xlsx - 1.41 MB
taxonomy_2K.xlsx - 375.19 KB
Abstract
It is a global priority to better manage the biosphere, but action must be informed by comprehensive data on the abundance and distribution of species. The acquisition of such information is currently constrained by high costs. DNA barcoding can speed the registration of unknown animal species, the most diverse kingdom of eukaryotes, as the BIN system automates their recognition. However, inexpensive protocols are critical as the census of all animal species is likely to require the analysis of a billion or more specimens. Barcoding involves DNA extraction followed by PCR and sequencing with the last step dominating costs until 2017. By enabling the sequencing of highly multiplexed samples, the Sequel platforms from Pacific BioSciences slashed costs by 90%, but these instruments are only deployed in core facilities because of their expense. Sequencers from Oxford Nanopore Technologies provide an escape from high capital and service costs, but their low sequence fidelity has, until recently, constrained adoption. However, the improved performance of its latest flow cells (R10.4.1) can erase this barrier. This study demonstrates that a MinION flow cell can characterize an amplicon pool derived from 100,000 specimens while a Flongle flow cell can process one derived from several thousand. At $0.01 per specimen, DNA sequencing is now the least expensive step in the barcode workflow.
https://doi.org/10.5061/dryad.41ns1rnp1
The following files are used for all bioinformatic processes described in the manuscript “Barcode 100K specimens: in a single nanopore run” by Hebert et al., 2024.
Description of the data and file structure
Raw_Reads_2K.fastq.gz - This file contains the raw, base-called ONT reads for the 2K dataset. This can be used as the raw data input for ONT.sh.
Raw_Reads_10K.fastq.gz - This file contains the raw, base-called ONT reads for the 10K dataset. This can be used as the raw data input for ONT.sh.
Raw_Reads_100K.fastq.gz - This file contains the raw, base-called ONT reads for the 100K dataset. This can be used as the raw data input for ONT.sh.
parameters_2K.xlsx - This is the parameters file for the 2K dataset (for details, see below). It contains bioinformatic run parameters and the UMI map for this dataset.
parameters_10K.xlsx - This is the parameters file for the 10K dataset (for details, see below). It contains bioinformatic run parameters and the UMI map for this dataset.
parameters_100K.xlsx - This is the parameters file for the 100K dataset (for details, see below). It contains bioinformatic run parameters and the UMI map for this dataset.
taxonomy_2K.xlsx - This is the BOLD data sheet for the 2K dataset.
taxonomy_10K.xlsx - This is the BOLD data sheet for the 10K dataset.
taxonomy_100K.xlsx - This is the BOLD data sheet for the 100K dataset.
BOLD_UniqueBINs_sintax.fasta - This file contains all of the reference sequences used in this study, formatted for use with SINTAX (see the header-parsing sketch below).
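For orientation, SINTAX-formatted references carry their taxonomy inside the FASTA header as an "id;tax=..." annotation. The short sketch below shows one way such a header can be parsed in Python; the example header and the rank prefixes are illustrative assumptions, not values taken from BOLD_UniqueBINs_sintax.fasta itself.

```python
#!/usr/bin/env python3
"""Parse a SINTAX-style FASTA header into a record id and a rank->name map.
The example header below is made up; the exact rank prefixes used in the
reference library may differ."""

def parse_sintax_header(header):
    """Split '>id;tax=p:...,c:...,...;' into (id, {rank_prefix: name})."""
    seq_id, _, tax = header.lstrip(">").partition(";tax=")
    ranks = {}
    for field in tax.rstrip(";").split(","):
        if ":" in field:
            prefix, name = field.split(":", 1)
            ranks[prefix] = name
    return seq_id, ranks

# Example with a hypothetical header in the general SINTAX layout:
print(parse_sintax_header(">BOLD:AAA0001;tax=p:Arthropoda,c:Insecta,o:Diptera;"))
```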
The following working directory structure is recommended. The paths in each script are currently set to the directory structure used in our lab and must be modified to reflect your own folder names and directory structure (a minimal setup sketch follows this list).
DATA_INPUT - This is the main working directory in which all input data is placed before triggering the master script (ONT.sh)
DATA_OUTPUT - This is the output folder in which completed runs can be found.
SCRIPTS - This folder contains the master script (ONT.sh).
SCRIPTS > SUBSCRIPTS - This subfolder within SCRIPTS contains all dependency R scripts.
REF - This folder contains all SINTAX formatted reference libraries (FASTA files).
PRIMERS - This folder should contain a FASTA file called PrimersDB.fasta, which contains all forward and reverse primer sequences in 5’ -> 3’ orientation.
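As a convenience, the recommended layout can be created in one step. The Python sketch below is only illustrative; the root path (~/ONT_barcoding) is a hypothetical placeholder to replace with your own location.

```python
#!/usr/bin/env python3
"""Create the recommended working-directory layout described above.
Folder names follow this README; the root path is a hypothetical placeholder."""
from pathlib import Path

ROOT = Path("~/ONT_barcoding").expanduser()  # hypothetical root; change to suit your system

for folder in ("DATA_INPUT", "DATA_OUTPUT", "SCRIPTS/SUBSCRIPTS", "REF", "PRIMERS"):
    (ROOT / folder).mkdir(parents=True, exist_ok=True)
    print("created", ROOT / folder)
```

Remember that ONT.sh and its R subscripts reference hard-coded paths, so they still need to be edited to point at whichever folders you create.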
The output of ONT.sh is a single folder named after the user-specified run name. Within are the following files/folders:
METADATA_FILES - This folder contains the input parameters file and the taxonomy file (if provided), as well as the corresponding text files generated by ONT.sh.
terminalhistory.log - This is a text file that contains all terminal standard output and standard errors. It is used for troubleshooting if necessary.
LibraryName - This folder contains most of the output data, and will be named after the user-specified library/plate name. If native barcodes were used, there will be a separate folder for each library/plate. The following data will be found within each LibraryName folder:
all.fastq - This is the raw input data resulting from base calling. It is useful if the user wants to repeat analysis without having to repeat base calling.
LibraryName_AllContigs.fasta - This FASTA file contains all contig consensus sequences.
LibraryName_DominantContigs.fasta - This FASTA file contains all dominant contig consensus sequences.
LibraryName_TaxonomicAssignments_AllContigs.tsv - This tab-separated text file (it can be opened in Excel) contains one row per contig, with fields such as read count, length, and taxonomic ID.
LibraryName_TaxonomicAssignments_DominantContigs.tsv - This text file has the same structure as the one above, but contains only the dominant contigs rather than all contigs (see the sketch after this list).
LibraryName_SummaryHistograms - This PDF contains histograms useful for assessing the quality and performance of the run.
LibraryName_readcounts.txt - This text file is produced by ONT.sh and is used to generate one of the histograms above. It lists read counts after each step in the bioinformatic process.
Contig_Component_Reads - This folder contains one FASTA file per contig, each comprising all raw reads that make up that contig. These files are useful for assessing the composition of each contig.
Contig_Consensus_Sequences - This folder contains one FASTA file for each sample that produced at least one contig. Each file contains the consensus sequence(s) of each contig for that sample.
Original_Reads - This folder contains one FASTA file for each sample that yielded at least one read, comprising all pre-clustering reads associated with that sample. These files are useful for checking whether samples that failed to yield a contig still had a few reads that fell below the user-specified minimum read threshold.
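To get a quick feel for a run, the TSV outputs can be loaded directly into R or Python. The pandas sketch below, referenced in the DominantContigs entry above, is only illustrative: the column names it checks for are assumptions, so inspect the header row of your own file for the real field names.

```python
#!/usr/bin/env python3
"""Peek at a *_TaxonomicAssignments_DominantContigs.tsv output file.
The column names checked below are illustrative assumptions."""
import pandas as pd

df = pd.read_csv("LibraryName_TaxonomicAssignments_DominantContigs.tsv", sep="\t")

print(f"{len(df)} dominant contigs")
print("columns:", list(df.columns))  # check the real field names first

# Example summaries, only if the assumed columns are actually present:
for col in ("read_count", "length"):
    if col in df.columns:
        print(df[col].describe())
```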
Code/Software
The following lists all scripts with a description of their use:
ONT.sh - This is the master script, called via ‘bash ONT.sh’. It requires two inputs and accepts one optional file:
parameters_XXXXX.xlsx (REQUIRED) - This file allows the user to set analysis parameters, specify primers, specify reference libraries, and provide a UMI map. Detailed instructions are included within.
raw data (REQUIRED) - This can be in one of three formats: POD5 file(s) from the sequencer, FASTQ file(s), or FASTA file(s).
taxonomy_XXXXX.xlsx (OPTIONAL) - This file can be downloaded from BOLD by searching for all specimens in your sequencing run and downloading the corresponding BOLD datasheet. Be sure to include the default lab progress tab as well as the Taxonomy tab.
All input files should be placed into the working directory, at which point the script can be called from the terminal via ‘bash ONT.sh’. The user will then follow the prompts. All dependency scripts are listed below:
get_parameters.R - This is a dependency R script used by ONT.sh. It extracts relevant run information from the input parameters file.
primary_contig_consensus.R - This is a dependency R script used by ONT.sh. It generates consensus sequences for the primary contigs.
final_contig_consensus.R - This is a dependency R script used by ONT.sh. It generates consensus sequences for the final contigs.
autotrim_taxfilter_dominants.R - This is a dependency R script used by ONT.sh. It trims the final contig sequences to the precise barcode region, assigns taxonomy, and selects dominant contigs.
generate_histograms.R - This is a dependency R script used by ONT.sh. It generates summary histograms that can be used to quickly assess the quality of the run.
The Python script below is not involved in the main bioinformatic pipeline, but was used in this study to assess sequence congruence between ONT- and Sequel-generated barcode sequences:
pairwise_divergence.py - This Python script takes as input a single FASTA file containing pairs of sequences to be compared. The specimen name is taken as any text up to the first “ | ” character in the FASTA header. Optionally, sequences may have a suffix following a “.” character, in which case only the text before the “.” is treated as the specimen name. The script aligns each sequence pair and compares the nucleotides at each position, counting any discordances. The discordances are summed and compared to the original sequence lengths to calculate a percent pairwise divergence. This is done for each sequence pair, and the output is a table of divergence values for each sample pair.
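For readers who want the logic spelled out, here is a minimal re-implementation of that procedure. It is not the authors' pairwise_divergence.py: the Needleman-Wunsch scoring values and the use of the mean original sequence length as the denominator are assumptions, and the real script may differ in both respects.

```python
#!/usr/bin/env python3
"""Sketch of the pairwise-divergence calculation described above.
This is a re-implementation for illustration, not the authors' script.
The alignment scoring scheme and the divergence denominator are assumptions."""
import sys
from collections import defaultdict

def read_fasta(path):
    """Yield (header, sequence) tuples from a FASTA file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            elif line:
                seq.append(line.upper())
    if header is not None:
        yield header, "".join(seq)

def specimen_name(header):
    """Specimen name = text before the first '|', minus any '.' suffix."""
    name = header.split("|", 1)[0].strip()
    return name.split(".", 1)[0]

def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Plain Needleman-Wunsch global alignment (scoring values are assumed)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Traceback to recover the two gapped rows of the alignment.
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

def percent_divergence(seq1, seq2):
    """Count discordant alignment columns (mismatches and gaps) and express
    them relative to the mean original sequence length (an assumption)."""
    row1, row2 = global_align(seq1, seq2)
    discordances = sum(1 for x, y in zip(row1, row2) if x != y)
    return 100.0 * discordances / ((len(seq1) + len(seq2)) / 2.0)

if __name__ == "__main__":
    pairs = defaultdict(list)
    for header, seq in read_fasta(sys.argv[1]):
        pairs[specimen_name(header)].append(seq)

    print("specimen\tpercent_divergence")
    for name, seqs in pairs.items():
        if len(seqs) == 2:  # only complete pairs are compared
            print(f"{name}\t{percent_divergence(*seqs):.2f}")
```

Running it on a FASTA of paired sequences (e.g. ‘python sketch.py pairs.fasta’) prints one tab-separated divergence value per specimen.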