Ocimum basilicum 'Perrie' genome assembly and annotation
Data files
May 08, 2026 version files (2.71 GB total)
- Perrie_v1.0_CDs.fa (104.69 MB)
- Perrie_v1.0_chromosomes.fa (2.20 GB)
- Perrie_v1.0_chromosomes.stats (669 B)
- Perrie_v1.0_gene_functional_description.txt (22.04 MB)
- Perrie_v1.0_gene_models_utr.gff (185.72 MB)
- Perrie_v1.0_gene_models.gff (165.68 MB)
- Perrie_v1.0_proteins.fa (37.66 MB)
- README.md (4.18 KB)
Abstract
Basil, Ocimum basilicum L., is a widely cultivated aromatic herb, prized for its culinary and medicinal uses, predominantly owing to its unique aroma, primarily determined by eugenol for Genovese cultivars or methyl chavicol for Thai cultivars. To date, a comprehensive basil reference genome has been lacking, with only a fragmented draft available. To fill this gap, we employed PacBio HiFi and Hi-C sequencing to construct a homeolog-phased chromosome-level genome for basil. The tetraploid basil genome was assembled into 26 pseudomolecules and further categorized into subgenomes. Lamiaceae-related genomic comparison data. We utilized a bi-parental population derived from a Genovese × Thai cross to map quantitative trait loci (QTL) for the aroma chemotype. We discovered a single QTL governing the eugenol/methyl chavicol ratio, which encompassed a genomic region with 95 genes, including 15 genes encoding a shikimate O-hydroxycoumaroyltransferase (HCT/CST) enzyme. Of these, only ObHCT1 exhibited significantly higher expression in the Genovese cultivar and showed trichome-specific expression. ObHCT1 was functionally confirmed as a genuine HCT enzyme using an in vitro assay. The high-quality, contiguous basil reference genome is now publicly accessible at BasilBase, a valuable resource for the scientific community. Combined with insights into cell-type-specific gene expression, it promises to elucidate specialized metabolite biosynthesis pathways at the cellular level.
https://doi.org/10.5061/dryad.gxd2547vf
High molecular weight genomic DNA from the ‘Perrie’ cultivar was sequenced using PacBio HiFi technology at the Icahn School of Medicine at Mt. Sinai (New York, NY, USA). Circular consensus calling was performed at the same facility, and the resulting sequences were assembled with hifiasm version 0.16.1-r375 (Cheng et al., 2021) using default parameters. Hi-C proximity ligation library construction, using the Proximo Hi-C Plant kit, was performed by Phase Genomics (Seattle, WA, USA). The contigs were ordered and oriented using the Hi-C data as input to 3D-DNA version 180922 (Dudchenko et al., 2017), and the results were manually curated with Juicebox (Durand et al., 2016). Pilon version 1.23 (Walker et al., 2014) was used with Illumina data from the draft genome (Gonda et al., 2020) to polish the genome assembly. SubPhaser v1.2.6 (Jia et al., 2022) was used with default parameters to assign scaffolds to subgenome A and subgenome B. Contaminant sequences in the assembly were identified by a local BLASTn (Altschul et al., 1990; Ellinghaus et al., 2008) search against the NCBI nr database (Sayers et al., 2021) and BlobTools v1.1.1 (Laetsch & Blaxter, 2017).
LTR_retriever (Ou & Jiang, 2018) was used with outputs from LTRharvest (Ellinghaus et al., 2008) and LTR_FINDER (Xu & Wang, 2007) to identify long terminal repeat retrotransposons (LTRs) in the ‘Perrie’ genome sequence. The LTR library was then used to hard mask the genome, and RepeatModeler version open-1.0.11 (Smit et al., 2008-2015) was used to identify additional repetitive elements in the remaining unmasked segments of the genome. Protein-coding sequences were excluded from the repeat library using blastx v2.8.1+ (Altschul et al., 1990; Ellinghaus et al., 2008) results in conjunction with the ProtExcluder.pl script from the ProtExcluder v1.2 package (Campbell et al., 2014). The libraries from RepeatModeler and LTR_retriever were then combined and used with RepeatMasker v4.0.9 (Smit et al., 2013–2015) to produce the final masked version of the genome.
To predict gene models, ‘Perrie’ RNA-seq reads (2×150 bp paired-end reads; Gonda et al., 2020) derived from leaves (3 replicates), flowers (3 replicates), stems (1 sample), and roots (1 sample) were mapped to the genome assembly with HISAT2 version 2.1.0 (Kim et al., 2019a). BAM files were supplied as input to BRAKER version 2.1.2 (Brůna et al., 2021) for gene prediction. Resulting gene models were removed if they had no close match (e-value > 0.001), a low total expression value (FPKM < 0.01), and no predicted protein domain based on InterProScan results. Gene models matching repetitive elements were also removed. BLAST searches against the SwissProt and TrEMBL databases (UniProt Consortium, 2023) and InterProScan v5.46-81.0 (Jones et al., 2014) were used to assign putative functions to gene models.
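The gene model filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three criteria are read as jointly required for removal (the wording "and" suggests this, but it is not confirmed here), a model with no BLAST hit at all is treated as having no close match, and the `GeneModel` record, its field names, and the example values are hypothetical rather than taken from the actual pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GeneModel:
    """Hypothetical per-gene summary; field names are illustrative only."""
    gene_id: str
    best_evalue: Optional[float]  # best BLAST e-value, None if no hit
    fpkm: float                   # total expression across samples
    has_interpro_domain: bool     # any InterProScan domain predicted
    matches_repeat: bool = False  # hit against a repetitive element


def should_remove(g: GeneModel, max_evalue: float = 1e-3,
                  min_fpkm: float = 0.01) -> bool:
    """Return True if the model fails the filter.

    Assumption: a model is dropped when it lacks ALL three lines of
    support (homology, expression, protein domain), or when it matches
    a repetitive element.
    """
    no_close_match = g.best_evalue is None or g.best_evalue > max_evalue
    low_expression = g.fpkm < min_fpkm
    no_domain = not g.has_interpro_domain
    unsupported = no_close_match and low_expression and no_domain
    return g.matches_repeat or unsupported


# Made-up example records, not real 'Perrie' gene models.
models = [
    GeneModel("g1", 1e-50, 12.0, True),                      # well supported
    GeneModel("g2", None, 0.0, False),                       # no support
    GeneModel("g3", None, 5.0, False),                       # expressed only
    GeneModel("g4", 1e-30, 8.0, True, matches_repeat=True),  # repeat match
]
kept = [g.gene_id for g in models if not should_remove(g)]
print(kept)
```

Under this reading, a single line of evidence is enough to retain a model, which keeps lineage-specific but expressed genes in the annotation.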
Description of the data and file structure
Perrie_v1.0_CDs.fa: A fasta file containing coding sequences resulting from gene prediction and filtering as described above.
Perrie_v1.0_chromosomes.fa: A fasta file containing the resulting chromosome sequences generated from assembly and correction.
Perrie_v1.0_chromosomes.stats: This file contains the assembly metrics that describe the total length, N50, and assessments of the genome assembly.
Perrie_v1.0_gene_functional_description.txt: A file containing functional predictions for gene models derived from BLAST matches and InterProScan results described in methods.
Perrie_v1.0_gene_models.gff: A gff file giving the positions of gene model features in the genome assembly.
Perrie_v1.0_gene_models_utr.gff: A gff file of genomic features containing gene features and untranslated regions of gene models.
Perrie_v1.0_proteins.fa: A fasta file containing protein sequences predicted by Braker.
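As a quick sanity check on the files listed above, the following Python sketch computes sequence lengths and N50 from a FASTA file and counts feature types in a GFF file. It uses only the standard library and runs on small in-memory examples; the helper names (`fasta_lengths`, `n50`, `count_gff_features`) and the toy records are illustrative assumptions, and the GFF files are assumed to follow the usual nine-column tab-separated layout.

```python
import io
from collections import Counter


def fasta_lengths(handle):
    """Map each sequence name to its length, reading line by line."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the name.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def n50(lengths):
    """Length L such that sequences of length >= L cover half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def count_gff_features(handle):
    """Count entries per feature type (column 3), skipping comments."""
    counts = Counter()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9:
            counts[fields[2]] += 1
    return counts


# Toy inputs standing in for the real files.
toy_fasta = io.StringIO(">chr1 demo\nACGTACGT\nACGT\n>chr2\nACGTAC\n>chr3\nAC\n")
toy_gff = io.StringIO(
    "##gff-version 3\n"
    "chr1\tdemo\tgene\t1\t12\t.\t+\t.\tID=g1\n"
    "chr1\tdemo\tmRNA\t1\t12\t.\t+\t.\tID=g1.t1;Parent=g1\n"
    "chr1\tdemo\tCDS\t1\t6\t.\t+\t0\tParent=g1.t1\n"
    "chr1\tdemo\tCDS\t9\t12\t.\t+\t0\tParent=g1.t1\n"
)

lengths = fasta_lengths(toy_fasta)
assembly_n50 = n50(lengths.values())
feature_counts = count_gff_features(toy_gff)
print(lengths, assembly_n50, dict(feature_counts))
```

To run the same checks on the dataset, replace the toy handles with `open("Perrie_v1.0_chromosomes.fa")` and `open("Perrie_v1.0_gene_models.gff")`. Both readers stream the file line by line, so the 2.20 GB chromosome file never has to be held in memory at once, and the resulting N50 can be compared with the value in Perrie_v1.0_chromosomes.stats.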
Sharing/Access information
Links to other publicly accessible locations of the data:
