Genome assembly of Olea europaea subsp. cuspidata
Data files
Feb 20, 2022 version files 1.53 GB
Abstract
Background: The olive complex, which comprises six subspecies, is highly valuable for global trade, human health, and food safety. However, only one subspecies (Olea europaea subsp. europaea, OE) and its wild form (Olea europaea subsp. europaea var. sylvestris, OS) have reference genomes, which hinders our understanding of the evolution of this species.
Results: Using a hybrid approach that combines Illumina, Nanopore, and Hi-C technologies, we obtained the best reference genome assembly to date among wild olive subspecies for the African olive, Olea europaea subsp. cuspidata (OC), with contig and scaffold N50 values of 3.83 Mb and 38.04 Mb, respectively. An assessment of protein-coding gene completeness showed that the OC assembly is highly complete, at a level similar to the previously reported OE assembly and much higher than that of OS. The divergence time between OC and the last common ancestor of OE and OS was estimated to be 4.21 Mya (95% CI: 1.43 - 7.31 Mya). The pathways enriched among positively selected genes in OC are related to the metabolism of cofactors and vitamins, indicating the potential medical and economic value of OC for further use and research. Gene origination analyses revealed a substantial burst (19.5%) of gene transposition events in the common ancestor of the olive subspecies, suggesting that olive speciation was important in shaping the evolution of new genes in OC.
Conclusions: In this study, we constructed a de novo assembly and protein-coding gene set for Olea europaea subsp. cuspidata (OC), which may facilitate the medical and breeding uses of this widely distributed close relative of the cultivated olive.
Methods
DNA and RNA sequencing was based on a single OC individual sampled from Kunming Arboretum (N 25°9′13″, E 102°45′9″), Yunnan Academy of Forestry and Grassland, Yunnan province, China. The specimen was identified by Yong-Kang Sima. A voucher specimen (Wu20056) was deposited in the Herbarium of Yunnan Academy of Forestry and Grassland. Standard preparation procedures before sequencing, including DNA and RNA extraction and Hi-C library construction, followed the requirements of the specific sequencers. In total, five tissues (leaves, roots, twigs, bark, and fruits) were used for RNA-Seq on the Illumina platform. For DNA sequencing, ~50x coverage of short reads (300 bp PE) and ~70x coverage of Nanopore long reads were obtained from the DNBSEQ-T7 and PromethION platforms, respectively. The raw reads were filtered using the fastp preprocessor. To achieve a chromosome-level assembly, we further generated ~130 Gb of paired-end Hi-C reads (150 bp) on the DNBSEQ-T7 platform (MGI). We karyotyped OC to determine the number of chromosomes using cultivated roots, whose actively dividing meristems are suitable for obtaining clear chromosome images. Cells were treated with nitrous oxide to obtain sufficient cells at mitotic metaphase for staining with DAPI and the telomere repeat sequence (TTTAGGG)6.
The basecalling output from the PromethION platform was processed using Guppy. Only reads with a mean quality score >7 were retained and further corrected using the NextDenovo software with parameters "reads_cutoff:2k, seed_cutoff:18k" (https://github.com/Nextomics/NextDenovo). The assembly process comprised a correction module (NextCorrect) and an assembly module (NextGraph), both run with default parameters. Subsequently, the NextPolish software was used to polish the genome four times with short reads and three times with long reads (sgs_options = -max_depth 100). The paired-end Hi-C reads were filtered with fastp to remove adapters and low-quality reads (retaining reads with Phred score > 15 and fewer than 5 Ns). The clean reads and the draft genome were analyzed using LACHESIS with parameters "CLUSTER_MIN_RE_SITES = 100; CLUSTER_MAX_LINK_DENSITY = 2.5; CLUSTER_NONINFORMATIVE_RATIO = 1.4".
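The mean-quality cutoff applied to the long reads can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names are hypothetical, and it assumes that a read's mean quality is derived by averaging per-base error probabilities and converting back to the Phred scale, rather than by taking the arithmetic mean of the Phred values. It also assumes Phred+33 encoded FASTQ quality strings.

```python
import math


def decode_phred33(quality_string):
    """Convert a FASTQ quality string (assumed Phred+33) to integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def mean_qscore(phred_scores):
    """Mean read quality via averaged error probabilities (an assumption).

    Each Phred score Q corresponds to an error probability 10^(-Q/10).
    The probabilities are averaged and converted back to the Phred scale.
    """
    if not phred_scores:
        return 0.0
    mean_error = sum(10 ** (-q / 10) for q in phred_scores) / len(phred_scores)
    return -10 * math.log10(mean_error)


def keep_long_read(phred_scores, min_mean_q=7.0):
    """Return True if the read passes the mean quality > 7 cutoff."""
    return mean_qscore(phred_scores) > min_mean_q


if __name__ == "__main__":
    read_quality = decode_phred33("++++5555++++")  # hypothetical quality string
    print(round(mean_qscore(read_quality), 2), keep_long_read(read_quality))
```

Under this assumption a few very low-quality bases pull the read's score down more strongly than an arithmetic mean of Phred values would, so a read can fail the cutoff even when most of its bases are good.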
RepeatMasker was used for repeat annotation with the parameters recommended in its manual. To aid gene annotation, a total of ~75 Gb of clean paired-end RNA-sequencing (RNA-Seq) reads from five tissues (leaves, roots, twigs, bark, and fruits) were generated on the Illumina HiSeq platform. All libraries were de novo assembled separately and subsequently merged following the Trans-ABySS manual pipeline. Structural annotation of protein-coding and non-coding genes was conducted using the MAKER pipeline, incorporating transcriptome mapping, de novo gene prediction, and homology-based prediction.
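For readers who want to check the contig and scaffold N50 values quoted in the abstract against the deposited sequences, the statistic can be sketched as follows. This is a generic illustration of the usual N50 definition (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length), not the tool used by the authors. The function name and the toy lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest and accumulated until
    the running total reaches at least half of the summed length. The
    length of the sequence that crosses that threshold is the N50.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


if __name__ == "__main__":
    toy_contigs = [80, 70, 50, 40, 30, 20]  # hypothetical lengths
    print(n50(toy_contigs))
```

In practice the lengths would be read from the assembly FASTA file, once for contigs and once for scaffolds.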