A high-quality genome assembly for Dillenia turbinata (Dilleniales)
Data files
May 19, 2023 version files 789.37 MB
Abstract
Objectives:
Dillenia turbinata (Dilleniaceae) is a member of the order Dilleniales, an enigmatic clade of critical importance for understanding the diversification history of flowering plants but for which genome sequences are not available. We have produced and annotated a chromosome-scale whole genome assembly for D. turbinata through the resources of the 10KP (10,000 Plants) Genomes Project. The genome assembly and associated data provided here will serve as a useful resource for comparative and evolutionary genomics research across the flowering plants.
Data description:
The D. turbinata genome was assembled from Oxford Nanopore Technology (ONT) and whole-genome shotgun (WGS) sequences, and scaffolded into chromosome-scale pseudomolecules using Hi-C data. The genome assembly is 723,739,077 base pairs in length with a BUSCO completeness score of 97%. Twenty-eight scaffolds contain more than 99% of the assembly. The repeat-masked genome sequence is annotated with 36,967 protein-coding gene models (93% BUSCO completeness) supported by transcriptome/protein evidence and/or Pfam domain content. These measures indicate high contiguity and completeness for the D. turbinata genome.
Methods
Genome assembly and annotation
Raw nanopore reads in fastq format were assembled with Canu v2.2 (Koren et al. 2017) using an estimated genome of 900Mb to guide coverage parameters during the read correction, trimming, and assembly steps of the pipeline. The resulting primary assembly was polished with the WGS reads using NextPolish v1.3.1 (Hu et al. 2020), and duplicated constructs were removed by Purge Haplotigs (Roach et al. 2018). The set of deduplicated contigs was scaffolded on the basis of Hi-C reads using the Juicer pipeline (Durand et al. 2016) and 3d-dna tools (Dudchenko et al. 2017) with default parameters. Genome annotation was performed using the MAKER-P pipeline (Campbell et al. 2014) supplied with coding DNA sequences (CDS) from a Trinity (Grabherr et al. 2011) assembly of the Dillenia transcriptome reads, proteomes from four publicly available eudicot genomes —Arabidopsis thaliana, Aquilegia coerulea, Nelumbo nucifera, and Vitis vinifera, and a custom repeat library of transposable elements (TEs) produced by RepeatModeler2 (Flynn et al. 2020) for the Dillenia genome. Gene models were predicted from the masked assemblies using the SNAP (Korf 2004) and AUGUSTUS (Stanke and Waack 2003) ab initio gene predictors after three rounds of training on interim high-quality (AED <= 0.25; length >= 50 amino acids) and BUSCO gene models, respectively. All evidence-based and ab initio gene predictions that were supported by the presence of functional protein domains were retained in the final genome annotation.
Usage notes
The included data files may be opened with MS Word (Detailed Methods.docx), MS Ecel (Dillenia.genome.assembly.stats.xlsx), standard image viewer software (Dillenia.BUSCO.summaries.png), and standard text editor programs (Dillenia.genome.fasta and Dillenia.maker.predict.36967.final.gff), .