Dryad

ModEst - Precise estimation of genome size from NGS data

Cite this dataset

Schell, Tilman; Pfenninger, Markus; Schönnenbeck, Philipp (2022). ModEst - Precise estimation of genome size from NGS data [Dataset]. Dryad. https://doi.org/10.5061/dryad.dr7sqvb0j

Abstract

Accurate genome size estimates are important parameters for both theoretical and practical biodiversity genomics. We present a fast, easy-to-implement and precise method to estimate genome size from the number of bases sequenced and the mean sequencing depth. To estimate the latter, we take advantage of the fact that the Poisson distribution parameter lambda can be estimated precisely from truncated data, restricted to the part of the sequencing depth distribution that represents the true underlying distribution. Simulations showed that reasonable genome size estimates can be obtained even from low-coverage (10X), highly discontinuous genome drafts. Comparison of estimates from a wide range of taxa and sequencing strategies with flow-cytometry estimates of the same individuals showed a very good fit and suggested that both methods yield comparable, interchangeable results.
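The idea in the abstract can be illustrated with a minimal sketch. It is not the ModEst implementation: the function names (`truncated_mean`, `estimate_lambda`, `genome_size`), the depth histogram format and the choice of truncation window `lo..hi` are assumptions made for illustration. The sketch fits lambda by matching the mean of a Poisson distribution truncated to the window against the observed mean inside that window (the maximum-likelihood condition for a truncated Poisson), then divides the number of sequenced bases by the fitted lambda.

```python
import math

def truncated_mean(lam, lo, hi):
    """Mean of a Poisson(lam) variable restricted to depths lo..hi (inclusive)."""
    depths = range(lo, hi + 1)
    # Unnormalised log-weights lam^k / k!; the exp(-lam) factor cancels out.
    logw = [k * math.log(lam) - math.lgamma(k + 1) for k in depths]
    m = max(logw)  # subtract the maximum to avoid underflow
    w = [math.exp(x - m) for x in logw]
    return sum(k * wk for k, wk in zip(depths, w)) / sum(w)

def estimate_lambda(hist, lo, hi, iters=200):
    """Fit lambda from a depth histogram {depth: n_positions}, using only lo..hi.

    The truncated mean increases monotonically with lambda, so the value
    matching the observed mean is found by bisection on log(lambda).
    """
    kept = {k: n for k, n in hist.items() if lo <= k <= hi}
    total = sum(kept.values())
    sample_mean = sum(k * n for k, n in kept.items()) / total
    a, b = math.log(1e-6), math.log(1e6)  # assumed search bounds for lambda
    for _ in range(iters):
        mid = 0.5 * (a + b)
        if truncated_mean(math.exp(mid), lo, hi) < sample_mean:
            a = mid
        else:
            b = mid
    return math.exp(0.5 * (a + b))

def genome_size(total_bases, lam):
    """Genome size = number of bases sequenced / mean sequencing depth."""
    return total_bases / lam
```

Because only depths inside the window enter the fit, positions with atypical depth (for example collapsed repeats or contamination) outside the window do not influence the estimate. How the window should be chosen is not specified here and is left to the user in this sketch.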

Methods

To illustrate the influence of factors such as sequencing depth, genome size, repeat content and repeat distribution on the different genome size estimation methods, we simulated five different genomes based on real examples. The latest genome assemblies and annotations of Saccharomyces cerevisiae, Caenorhabditis elegans, Arabidopsis thaliana, Drosophila melanogaster and Scophthalmus maximus were used to obtain distributions of the size of, and distance between, annotated repeat regions. Simulated genomes of the same size as these five genome assemblies were then created using a custom Python tool, available at https://github.com/Croxa/Simulate-Genome. Regions annotated as repeat regions (rr) were filled with random repeat units of up to 10 bp in length, and high-complexity regions with random nucleotides. For simplicity, we simulated each genome as a single chromosome. A mean GC content of 0.5 was applied to both categories.
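The simulation scheme described above can be sketched as follows. This is not the Simulate-Genome tool itself: the function names, the `("rr" | "hc", length)` segment format and the choice to tandemly repeat a single random unit per repeat region are assumptions made for illustration. In the described workflow, segment lengths would be drawn from the size and distance distributions obtained from the real annotations; here they are passed in directly.

```python
import random

def random_seq(rng, length, gc=0.5):
    """Random nucleotides with the given expected GC content."""
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))

def repeat_region(rng, length, max_unit=10, gc=0.5):
    """Fill a region by tandemly repeating one random unit of 1..max_unit bp.

    Assumption: one unit per region; GC content is 0.5 only in expectation.
    """
    unit = random_seq(rng, rng.randint(1, max_unit), gc)
    copies = length // len(unit) + 1
    return (unit * copies)[:length]

def simulate_chromosome(segments, seed=0, gc=0.5):
    """Build a single chromosome from (kind, length) segments.

    kind is "rr" for a repeat region or "hc" for a high-complexity region.
    """
    rng = random.Random(seed)
    parts = []
    for kind, length in segments:
        if kind == "rr":
            parts.append(repeat_region(rng, length, gc=gc))
        else:
            parts.append(random_seq(rng, length, gc))
    return "".join(parts)
```

For example, `simulate_chromosome([("hc", 1000), ("rr", 200), ("hc", 500)], seed=1)` returns a 1,700 bp sequence in which positions 1000 to 1199 form a tandem repeat. Fixing the seed makes the simulated genome reproducible.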