For any genome-based research, a robust genome assembly is required. De novo assembly strategies have evolved with changes in DNA sequencing technologies and have been through at least three phases: i) short-read only, ii) short- and long-read hybrid, and iii) long-read only assemblies. Each phase has its own error model. We hypothesized that hidden scaffolding errors in short-read assemblies and erroneous long-read contigs degrade the quality of short- and long-read hybrid assemblies. We assembled the genome of T. borchgrevinki from data generated during each of the three phases and assessed the quality problems we encountered. We developed strategies such as k-mer-assembled region replacement, parameter optimization, and long-read sampling to address the error models. We demonstrated that a k-mer-based strategy improved short-read assemblies as measured by BUSCO, while mate-pair libraries introduced hidden scaffolding errors and perturbed BUSCO scores. Further, we found that although hybrid assemblies can generate higher contiguity, they tend to suffer from lower quality. In addition, we found that long-read-only assemblies can be optimized for contiguity by sub-sampling length-restricted raw reads. Our results indicate that long-read contig assembly is the current best choice and that assemblies from phase I and phase II were of lower quality.

Illumina-based short-read-only de novo genome assembly built with k-mer size 51 using Meraculous (v2.2.2.5, Chapman et al. 2011), named k51. The file name is k51.fasta

Illumina-based short-read-only de novo genome assembly built with k-mer size 61 using Meraculous (v2.2.2.5, Chapman et al. 2011), named k61. The file name is k61.fasta

Illumina-based short-read-only de novo genome assembly built with k-mer size 71 using Meraculous (v2.2.2.5, Chapman et al. 2011), named k71. The file name is k71.fasta

Illumina-based short-read-only de novo genome assembly built with k-mer size 81 using Meraculous (v2.2.2.5, Chapman et al. 2011), named k81. The file name is k81.fasta

Illumina-based short-read-only de novo genome assembly built with k-mer size 91 using Meraculous (v2.2.2.5, Chapman et al. 2011), named k91. The file name is k91.fasta

k71 de novo genome assembly corrected at the BUSCO gene level using custom Python scripts (https://bitbucket.org/CatchenLab/scripts_contig_replacement_repo/src/master/), named cork71. The file name is cork71.fasta

Hybrid assembly in which the gaps between and within scaffolds of the k71 de novo genome assembly were filled with Canu-corrected (Koren et al. 2017) Nanopore long reads using PBjelly (PBSUITE v15.4; English et al. 2012), named filk71. The file name is filk71.fasta

Low-coverage, Nanopore-based long-read-only de novo contig-level genome assembly generated with the wtdbg2 (v2.3, Ruan and Li 2019) assembler, named corNpor. The file name is corNpor.fasta

k71 de novo genome assembly and corNpor contig-level assembly merged with quickmerge in four different ways. The assemblies corNpor and k71 were aligned to each other using the nucmer program from the MUMMER package (v3.1, Kurtz et al. 2004). For the alignments, corNpor was used as the “reference” and k71 as the “query”. Alignments arising from repeats and duplicates were filtered out with the MUMMER delta-filter program by varying the minimum alignment identity (-i) and minimum alignment length (-l) parameters: a) -i 95 -l 0 (default), b) -i 95 -l 1000, c) -i 95 -l 5000, and d) -i 95 -l 10000. After filtering the alignments, we merged the reference corNpor and the query cork71 using quickmerge (v0.3, Chakraborty et al. 2016) with parameters -hco 5.0 -c 1.5 -l 803500 -ml 5000, and four independent hybrid assemblies were obtained. These quickmerge-based hybrid assemblies were named mergedA, mergedB, mergedC, and mergedD, after their respective delta-filter values.
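The alignment, filtering, and merging steps described above can be sketched as a small Python helper that only builds the command lines and does not run them. This is an illustration under stated assumptions, not the script used to produce the files: the function name, the alignment prefix "aln", the intermediate .delta file names, and the use of the quickmerge -p output prefix are hypothetical, and the query file defaults to cork71.fasta as named in the merging step. The -i, -l, -hco, -c, and -ml values are the ones listed above.

```python
import shlex

# Minimum alignment length (-l) for each delta-filter setting (a-d) listed above.
DELTA_FILTER_MIN_LEN = {"A": 0, "B": 1000, "C": 5000, "D": 10000}


def merge_commands(reference="corNpor.fasta", query="cork71.fasta", prefix="aln"):
    """Return shell command strings: one nucmer run, then filter + merge per setting.

    The prefix and intermediate file names are assumptions for illustration.
    """
    delta = f"{prefix}.delta"
    # nucmer aligns the query against the reference and writes <prefix>.delta.
    commands = [shlex.join(["nucmer", "-p", prefix, reference, query])]
    for label, min_len in DELTA_FILTER_MIN_LEN.items():
        filtered = f"{prefix}.{label}.delta"
        # delta-filter writes the filtered alignments to stdout.
        commands.append(
            shlex.join(["delta-filter", "-i", "95", "-l", str(min_len), delta])
            + " > " + shlex.quote(filtered)
        )
        # The -p output prefix is an assumption; how the output file is named
        # may differ between quickmerge versions.
        commands.append(shlex.join([
            "quickmerge", "-d", filtered, "-q", query, "-r", reference,
            "-hco", "5.0", "-c", "1.5", "-l", "803500", "-ml", "5000",
            "-p", f"merged{label}",
        ]))
    return commands


if __name__ == "__main__":
    for command in merge_commands():
        print(command)
```

Printing the commands rather than executing them makes it easy to check each parameter set before running the tools.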

The quickmerge-based hybrid assembly files are mergedA.fasta, mergedB.fasta, mergedC.fasta, and mergedD.fasta
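To compare the contiguity of the assembly files listed here, one common summary is N50: the sequence length at which the sequences of that length or longer contain at least half of the assembled bases. The sketch below is a minimal example and is not part of the dataset; the function names are hypothetical, and the choice of N50 as the contiguity metric is an assumption, not something this README specifies.

```python
import gzip


def fasta_lengths(path):
    """Return the length of every sequence in a FASTA file (plain or .gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    lengths = []
    current = None  # length of the record being read, None before the first header
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the N50 of a list of sequence lengths (0 for an empty list)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0
```

For example, `n50(fasta_lengths("mergedA.fasta"))` would give the N50 of one hybrid assembly, and the same call works for the gzipped long-read assemblies.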

Updates since the last version:

Uncorrected long-read-only assembly built with raw PacBio data using the WTDBG2 assembler, named WTDBG2r* (file is wtdbg2r.fa.gz)

Uncorrected long-read-only assembly built with 70 Gb of subsampled PacBio data (generated by sampling reads with minimum and maximum lengths of 10 and 40 kb, respectively) using the WTDBG2 assembler, named WTDBG2Sr* (file is wtdbg2sr.fa.gz)

Polished long-read-only assembly built with 70 Gb of subsampled PacBio data (generated by sampling reads with minimum and maximum lengths of 10 and 40 kb, respectively) using the WTDBG2 assembler, named WTDBG2Sra (file is wtdbg2sra.fa.gz)
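The length-restricted subsampling mentioned for the WTDBG2Sr* and WTDBG2Sra assemblies can be illustrated with a short Python sketch. The exact sampling procedure is not described here, so everything beyond the 10 kb and 40 kb bounds and the 70 Gb target is an assumption: the function name is hypothetical, the bounds are treated as inclusive, 70 Gb is read as 70 billion bases, and reads are drawn in random order until the target is reached.

```python
import random


def subsample_reads(read_lengths, min_len=10_000, max_len=40_000,
                    target_bases=70_000_000_000, seed=0):
    """Pick read indices within [min_len, max_len] until target_bases is reached.

    Returns (chosen_indices, total_bases). If the eligible reads hold fewer than
    target_bases in total, all eligible reads are returned.
    """
    # Keep only reads whose length falls inside the allowed window.
    eligible = [i for i, n in enumerate(read_lengths) if min_len <= n <= max_len]
    # Random order is an assumption; a fixed seed keeps the sample reproducible.
    random.Random(seed).shuffle(eligible)
    chosen, total = [], 0
    for i in eligible:
        if total >= target_bases:
            break
        chosen.append(i)
        total += read_lengths[i]
    return chosen, total
```

Working on read lengths alone keeps the sketch independent of any sequence file format; the chosen indices would then be used to extract the corresponding reads.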

Evaluating Illumina-, Nanopore-, and PacBio-based genome assembly strategies with the bald notothen, Trematomus borchgrevinki
