De novo genome assembly of Tectona grandis (Teak) with 2993 scaffolds
Data files
Aug 14, 2020 version files 317.62 MB
Teak_Draft_genome.fasta
317.62 MB
Abstract
Teak (Tectona grandis L. f.) is a prized benchmark tropical hardwood valued for its durability, strength and visual appeal. Natural teak populations harbour a variety of characteristics that determine their economic, ecological and environmental importance. Sequencing the whole nuclear genome of teak provides a platform for functional analyses and for the development of genomic tools in applied tree improvement. A draft genome of 317 Mb was assembled at 151× coverage, and 36,172 protein-coding genes were annotated. About 11.18% of the genome was repetitive. Microsatellites, or simple sequence repeats (SSRs), are among the most informative markers for genotyping, genetics and applied breeding. We identified 182,712 SSRs at the whole-genome level, of which 170,574 were perfect SSRs; 16,252 perfect SSRs showed in silico polymorphism across six genotypes, suggesting their promising use in genetic conservation and tree improvement programmes. The genomic SSR markers developed in this study have high potential for advancing the conservation and management of teak genetic resources. Phylogenetic studies confirmed the taxonomic position of the genus Tectona within the family Lamiaceae. Divergence time estimation placed the origin of the genus Tectona in the Miocene, around 21.4508 million years ago.
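To make the notion of a perfect SSR concrete, the following is a minimal sketch of how uninterrupted tandem repeats could be located in a scaffold sequence. It is not the SSR-mining tool used in this study; the function names and the minimum repeat counts in MIN_REPEATS are hypothetical choices made for illustration only.

```python
import re

# Minimum number of repeat units per motif length.
# Hypothetical thresholds for illustration; not taken from the study.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit."""
    n = len(motif)
    return all(
        motif != motif[:d] * (n // d)
        for d in range(1, n)
        if n % d == 0
    )


def find_perfect_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs.

    A perfect SSR here is an uninterrupted tandem repeat of a 1-6 bp motif.
    Matches are non-overlapping per motif length, so repeats that share
    bases with a neighbouring repeat may be missed in this simple sketch.
    """
    seq = seq.upper()
    hits = []
    for k, n in sorted(min_repeats.items()):
        # One captured unit followed by at least n - 1 further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):  # skip e.g. "ATAT" reported as a 4-mer
                hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)


example = "GGC" + "AT" * 7 + "CCT" + "AAG" * 5 + "T"
print(find_perfect_ssrs(example))  # [(3, 'AT', 7), (20, 'AAG', 5)]
```

Comparing the repeat count reported at the same locus in different accessions is one simple way to flag a candidate in silico polymorphism.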
Whole-genome sequencing (WGS) was performed on the Illumina HiSeq 2000 platform and an Oxford Nanopore Technologies MinION device by Genotypic Technology, Bengaluru, India, in accordance with standard protocols. Accession 2 was selected for generating a high-quality reference genome assembly. Accessions 1, 3, 4, 5 and 6 were subjected to low-coverage genome sequencing to identify polymorphic SSRs. For accession 2, one paired-end (PE) library (150 bp × 2) with a size of 300–700 bp, two mate-pair (MP) libraries (2–4 kb and 4–6 kb fragments) and one nanopore library from genomic DNA (2 μg) were prepared for sequencing. On the Illumina HiSeq 2000 platform, one lane of the flow cell was used for each sequencing library. Nanopore sequencing was performed using R9.4 flow cells on a MinION Mk 1B device (Oxford Nanopore) with the MinKNOW software (versions 1.0.5–1.5.12), and base calling was performed using Albacore 1.1.0 (Oxford Nanopore). Template reads were exported as FASTA using poretools version 0.6. For the other five accessions (1, 3, 4, 5 and 6), one PE library each, with a size of 300–700 bp, was sequenced at ∼15× coverage on the Illumina HiSeq 2000 platform. The raw data were deposited in the Genome database of GenBank (Project ID: PRJNA374940).
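The coverage figures quoted for these libraries follow from simple arithmetic: sequencing depth is the total number of sequenced bases divided by the genome size. The sketch below illustrates this relation. It uses the 317 Mb assembled size as a stand-in for the true genome size, which is an assumption, and the function names are hypothetical.

```python
import math


def read_pairs_for_depth(genome_size_bp, depth, read_len_bp=150):
    """Read pairs needed so that total sequenced bases = depth x genome size."""
    return math.ceil(genome_size_bp * depth / (2 * read_len_bp))


def depth_from_bases(total_bases, genome_size_bp):
    """Average sequencing depth implied by a given number of sequenced bases."""
    return total_bases / genome_size_bp


# Assumption: the 317 Mb assembly size approximates the genome size.
GENOME_SIZE_BP = 317_000_000

# Read pairs (150 bp x 2) corresponding to roughly 15x depth.
print(read_pairs_for_depth(GENOME_SIZE_BP, 15))  # 15850000
```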
The Illumina PE raw reads were quality-checked using FastQC and then processed with an in-house (Genotypic Technology, Bengaluru, India) ABLT script to remove low-quality bases and adapters. The MP reads were processed using the Platanus [24] internal trimmer to remove adapters and low-quality regions towards the 3′ end. The processed PE reads, together with the MP and nanopore reads, were used for contig generation with the MaSuRCA v3.2.2 de novo assembler [25]. The following parameters were set for the MaSuRCA assembly: GRAPH_KMER_SIZE = auto, LIMIT_JUMP_COVERAGE = 300, JF_SIZE = 38000000000, DO_HOMOPOLYMER_TRIM = 1. The assembled contigs were scaffolded using SSPACE v2.0.5 [26] with the processed PE and MP reads, followed by gap filling using GapCloser v1.12 [27]. The genome size was estimated automatically during the read computing stage, which used both the Illumina and nanopore reads. Similarly, the low-depth Illumina reads generated for the other five accessions of teak were assembled using accession 2 as the reference. The sequence data were uploaded to the Genome database of GenBank (Project ID: PRJNA421422).
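As an illustration of how the reported settings might sit in an assembler configuration, the sketch below renders a configuration file in the DATA/PARAMETERS layout that MaSuRCA configuration files generally follow. Only the four parameter values come from the description above. The file names, two-letter library prefixes, insert-size means and standard deviations are hypothetical placeholders, and the helper function is not part of MaSuRCA.

```python
def masurca_config(pe, jumps, nanopore, params):
    """Render a MaSuRCA-style configuration file as text.

    pe and each entry of jumps are tuples of
    (prefix, mean_insert_bp, stdev_bp, forward_reads, reverse_reads).
    """
    lines = ["DATA"]
    lines.append("PE= %s %d %d %s %s" % pe)
    for jump in jumps:
        lines.append("JUMP= %s %d %d %s %s" % jump)
    lines.append("NANOPORE=%s" % nanopore)
    lines.append("END")
    lines.append("")
    lines.append("PARAMETERS")
    for key, value in params.items():
        lines.append("%s = %s" % (key, value))
    lines.append("END")
    return "\n".join(lines) + "\n"


# Parameter values as reported for the teak assembly.
REPORTED_PARAMS = {
    "GRAPH_KMER_SIZE": "auto",
    "LIMIT_JUMP_COVERAGE": 300,
    "JF_SIZE": 38000000000,
    "DO_HOMOPOLYMER_TRIM": 1,
}

# Hypothetical library descriptions: paths, prefixes and insert-size
# statistics are placeholders, not values from the study.
config_text = masurca_config(
    pe=("pe", 500, 100, "PE_R1.fastq", "PE_R2.fastq"),
    jumps=[
        ("ma", 3000, 500, "MP_2-4kb_R1.fastq", "MP_2-4kb_R2.fastq"),
        ("mb", 5000, 500, "MP_4-6kb_R1.fastq", "MP_4-6kb_R2.fastq"),
    ],
    nanopore="nanopore_reads.fasta",
    params=REPORTED_PARAMS,
)
print(config_text)
```

Keeping the reported parameters in one dictionary makes it easy to see which settings are documented and which parts of the configuration would have to be supplied by anyone repeating the assembly.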
For a functional overview of the draft genome, the assembled scaffolds were converted to FASTA-formatted sequences and hard-masked with the RepeatMasker tool (RepeatMasker Open-3.0; www.repeatmasker.org (10 November 2017, date last accessed)). Repeats of Arabidopsis thaliana were used as the reference for genome masking. Gene prediction was carried out using the Augustus 3.0.2 [28] programme, and the predicted proteins were searched against the UniProt non-redundant plant protein database (Taxonomy = Viridiplantae) using the BlastX algorithm with an e-value cutoff of 1e-10 for gene ontology and annotation. Pathway annotation was performed by mapping the sequences obtained from Blast2GO to the contents of the KEGG Automatic Annotation Server (http://www.genome.jp/kegg/kaas/ (10 November 2017, date last accessed)).
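The e-value threshold used in the similarity search can be illustrated with a short filter over BLAST results. The sketch below assumes the hits are available in the standard 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score); the study does not state which output format was used, and the function name and example identifiers are hypothetical.

```python
def best_hits(lines, max_evalue=1e-10):
    """Keep, for each query, the highest-scoring hit that passes the cutoff.

    Returns a dict mapping query id -> (subject id, e-value, bit score).
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # hit is weaker than the chosen threshold
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Made-up example rows; identifiers and scores are not real data.
rows = [
    "g1\tP1\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-50\t200",
    "g1\tP2\t95.0\t100\t5\t0\t1\t300\t1\t100\t1e-60\t250",
    "g2\tP3\t40.0\t50\t30\t0\t1\t150\t1\t50\t1e-5\t45",
]
print(best_hits(rows))  # {'g1': ('P2', 1e-60, 250.0)}
```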