Simulated ONT libraries for genome assembly experimental design

Cite this dataset

Fierst, Janna; Sutton, John (2021). Simulated ONT libraries for genome assembly experimental design [Dataset]. Dryad. https://doi.org/10.5061/dryad.3r2280gd2

Abstract

Background: High-quality reference genome sequences are the core of modern genomics. Oxford Nanopore Technologies (ONT) produces inexpensive DNA sequences in excess of 100,000 nucleotides, but high error rates make sequence assembly and analysis a non-trivial problem as genome size and complexity increase. Robust experimental design is necessary for ONT genome sequencing and assembly, but few studies have attempted to address eukaryotic organisms. Here, we present novel results using a combination of simulated and empirical ONT and DNA libraries and identify best practices for ONT sequencing and assembly. We simulate ONT and Illumina DNA sequence reads for Escherichia coli, Caenorhabditis elegans, Arabidopsis thaliana, and Drosophila melanogaster and assemble them with Canu, Flye, and MaSuRCA to quantify the influence of sequencing coverage and assembly approach. We show the broad applicability of our methods using real ONT data generated for four strains of the nematodes Caenorhabditis remanei and C. latens.

Results:

  • ONT libraries have a unique error structure and high sequence depth is necessary to assemble contiguous genome sequences. 
  • As sequence depth increases, errors accumulate and assembly statistics plateau.
  • High-quality assembled sequences require high molecular weight DNA extractions that increase sequence read length and computational protocols that reduce error through pre-assembly correction, read selection and post-assembly ‘polishing.’
  • Our robust experimental design results in highly contiguous and accurate genome assemblies for the four strains of C. remanei and C. latens.
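
The depth and contiguity terms used in these results can be made concrete with two standard summary quantities. The following is a minimal sketch, not the authors' pipeline: the function names are hypothetical, N50 is assumed here as a representative contiguity statistic (this record does not name the statistics used), and the read and contig lengths are toy values.

```python
from typing import Iterable, List


def coverage_depth(read_lengths: Iterable[int], genome_size: int) -> float:
    """Mean sequencing depth: total sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(read_lengths) / genome_size


def n50(lengths: Iterable[int]) -> int:
    """Length L such that sequences of length >= L contain at least half of all bases."""
    ordered: List[int] = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


if __name__ == "__main__":
    # Toy values only (assumptions for illustration, not taken from the dataset).
    toy_reads = [1000, 2000, 3000]
    toy_contigs = [2, 3, 4, 5, 6]
    print(coverage_depth(toy_reads, genome_size=600))
    print(n50(toy_contigs))
```

Computing both values for each simulated library and its assembly is one way to see where added depth stops improving contiguity.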

Conclusions: ONT sequencing is inexpensive and accessible but the technology’s error structure requires informed experimental design. Our quantitative results will be helpful for a broad array of researchers seeking guidance for de novo assembly projects.

Methods

Data were simulated from reference genomes and ONT flowcell data with the NanoSim program and assembled with either Canu or MaSuRCA.

Funding

National Science Foundation, Award: 1921585

National Science Foundation, Award: 1941854