The genome and transcriptome of Musa itinerans unveil novel candidate genes for fusaric acid resistance.
Data files
May 26, 2026 version (180.74 MB total)
I_genome_chr_v5.fasta.gz (152.47 MB)
I_v5_annotation.gff3.gz (13.54 MB)
I_v5_EVM.cds.fasta.gz (14.73 MB)
README.md (738 B)
Abstract
Fusarium wilt, caused by Fusarium oxysporum f. sp. cubense and especially by its tropical race 4 (Foc-TR4), causes severe losses in banana production worldwide. A critical toxin produced by Foc-TR4 is fusaric acid (FA), yet few studies have investigated potential sources of resistance to it. We identified the FA-resistant wild banana Musa itinerans var. formosana and assembled its chromosome-level genome. Our fully factorial transcriptome analyses showed that genes up-regulated in the susceptible M. acuminata AAA Cavendish ‘Pei-Chiao’ are associated with the downstream response after FA impact, while those up-regulated in the resistant M. itinerans are involved in the membrane system and endocytosis, potentially reflecting an upstream reaction that repairs FA damage to the membrane. Through a novel analysis that takes the plasticity of gene expression into account, we identified a strong candidate gene, eEF1A, the only gene consistently up-regulated in the resistant species M. itinerans upon FA treatment at all time points. Although generally regarded as a housekeeping gene, eEF1A shows an FA-resistance effect when over-expressed in tobacco leaves. The results unveil novel candidate genes for FA resistance that have not been reported in banana species and suggest a new direction for future engineering of banana varieties.
Dataset DOI: 10.5061/dryad.5tb2rbpk7
Description of the data and file structure
The dataset contains the genome assembly and annotation of Musa itinerans var. formosana accession D11.
Files and variables
File: I_v5_annotation.gff3.gz
Description: Gene annotation file (GFF3) for M. itinerans var. formosana D11.
File: I_genome_chr_v5.fasta.gz
Description: Genome assembly FASTA file for M. itinerans var. formosana D11.
File: I_v5_EVM.cds.fasta.gz
Description: CDS FASTA file containing the coding sequence of each annotated gene.
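As a quick sanity check after download, the compressed annotation and CDS files can be summarized without decompressing them to disk. The following is a minimal sketch using only the Python standard library; the function names are illustrative and not part of the dataset, and the feature types actually present in the GFF3 file are not documented here, so the counts should be inspected rather than assumed.

```python
import gzip
from collections import Counter


def count_gff3_features(path):
    """Count feature types (column 3) in a gzipped GFF3 file."""
    counts = Counter()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("##FASTA"):
                break  # an embedded sequence section follows; no more features
            if line.startswith("#") or not line.strip():
                continue  # skip directives, comments, and blank lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) == 9:  # GFF3 feature lines have nine columns
                counts[fields[2]] += 1
    return counts


def fasta_lengths(path):
    """Return {record_id: sequence_length} for a gzipped FASTA file."""
    lengths = {}
    name = None
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # The record ID is the header text up to the first whitespace.
                name = (line[1:].split() or [""])[0]
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths
```

For example, `count_gff3_features("I_v5_annotation.gff3.gz")` returns a counter of feature types, and `len(fasta_lengths("I_v5_EVM.cds.fasta.gz"))` gives the number of CDS records, which can be compared against the number of annotated genes.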
The Musa itinerans var. formosana sample (accession D11) was collected from Hsinchu, Taiwan. All experiments in this study were conducted using D11 plantlets clonally propagated through tissue culture. To assemble the genome, we combined Illumina short reads, Oxford Nanopore Technologies (ONT) long reads, and Hi-C (high-throughput chromosome conformation capture) data. For short-read sequencing, genomic DNA was isolated from plantlet leaves using the cetyltrimethylammonium bromide (CTAB) method. DNA concentration and quality were measured with a Qubit™ 3 Fluorometer and a Thermo Scientific™ NanoDrop™. Libraries were constructed using Ultra II DNA Library Kits (NEB #E7645) and sequenced on an Illumina HiSeq X Ten platform with 150 bp paired-end reads and a 300-500 bp insert size. For long-read sequencing, high molecular weight DNA was extracted following Mayjonade et al. (2016). The Nanopore library was constructed using the 1D-ligation library prep kit (SQK-LSK109, Oxford Nanopore) and sequenced on the MinION platform with six flow cells (R9.4). Nanopore reads were base-called with Guppy v3.1.5 (Oxford Nanopore Technologies, https://github.com/nanoporetech), and adapters were removed with Porechop v0.2.4 (Wick et al., 2017). The Hi-C library was prepared with the Proximo Hi-C Kit (Phase Genomics, Seattle, WA) and sequenced as 150 bp paired-end Illumina reads.
Genome size was estimated from the 22-mer distribution computed with Jellyfish v2.2.10 (Marçais and Kingsford, 2011). We used Canu v1.8 (Koren et al., 2017) to error-correct the Nanopore reads, with the genome size parameter set to 500 Mb. Reads shorter than 1 kb were removed. SMARTdenovo v1.0.0 (Liu et al., 2021) was used to build a de novo assembly with a k-mer size of 22. The assembly was polished five times with Illumina data using Pilon v1.23 (Walker et al., 2014). We aligned Hi-C reads to the polished assembly using Juicer v1.5 (Durand et al., 2016) and anchored the Nanopore contigs into chromosomes with 3D-DNA (Dudchenko et al., 2017). Assembly quality and completeness were assessed with BUSCO v5.4.4 (Benchmarking Universal Single-Copy Orthologs) (Manni et al., 2021) using the embryophyta lineage dataset (embryophyta_odb10.2020-09-10). MUMmer (Marçais et al., 2018) was used to align our assembly against previously published Musa genomes, including M. itinerans “ASM164941v1” (Wu et al., 2016), M. acuminata “DH-Pahang” (Belser et al., 2021), and M. balbisiana “DH-PKW” (Wang et al., 2019). Merqury (Rhie et al., 2020) was also used to compare the assemblies, with a k-mer size of 19.
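The text does not say how the 22-mer distribution was converted into a genome size. One common approach, shown here only as a hedged sketch and not necessarily the authors' procedure, divides the total number of k-mers by the depth of the main coverage peak. The input is assumed to be a two-column histogram (multiplicity, number of distinct k-mers) such as the one written by `jellyfish histo`; the `min_depth` cutoff of 5 is an arbitrary illustrative value for discarding error k-mers.

```python
def estimate_genome_size(histo_path, min_depth=5):
    """Estimate genome size from a k-mer histogram.

    Each line holds two whitespace-separated integers: a k-mer multiplicity
    and the number of distinct k-mers observed that many times.
    Returns (peak_depth, estimated_genome_size_in_bp).
    """
    histogram = []
    with open(histo_path) as handle:
        for line in handle:
            parts = line.split()
            if len(parts) == 2:
                histogram.append((int(parts[0]), int(parts[1])))

    # Drop low-multiplicity bins, which are dominated by sequencing errors.
    # min_depth is an assumption and should be chosen from the histogram shape.
    usable = [(depth, count) for depth, count in histogram if depth >= min_depth]
    if not usable:
        raise ValueError("no histogram bins at or above min_depth")

    # The tallest remaining bin approximates the k-mer coverage depth.
    peak_depth = max(usable, key=lambda item: item[1])[0]

    # Total k-mer instances divided by coverage depth gives the genome size.
    total_kmers = sum(depth * count for depth, count in usable)
    return peak_depth, total_kmers // peak_depth
```

In a heterozygous genome the tallest bin may be the heterozygous peak rather than the homozygous one, in which case this simple rule overestimates the size, so the histogram should be plotted and checked before trusting the number.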
To annotate repetitive elements, we used RepeatModeler v1.0.11 (github.com/Dfam-consortium/RepeatModeler) to detect repeat sequences de novo in the assembled genome. Repetitive sequences and transposons were then identified and classified with RepeatMasker v4.1.1 (www.repeatmasker.org).
Ab initio, transcriptome-based, and homology-based gene structure predictions were used to identify protein-coding genes. For ab initio gene prediction, AUGUSTUS v3.3.3 (Stanke et al., 2008) was run with the "--species=rice" parameter. For transcriptome-based gene prediction, we used two lines of evidence: transcript assemblies and predicted open reading frames (ORFs). RNA-seq data from three tissues (roots, pseudostems, and leaves) were first mapped to the genome using HISAT2 v2.2.0 (Kim et al., 2019). The aligned RNA-seq reads were assembled into transcripts with StringTie v2.1.2 (Kovaka et al., 2019). The StringTie transcripts were passed to TransDecoder v5.5.0 (github.com/TransDecoder) to predict ORFs of more than 100 amino acids, integrating UniProt and Pfam domain search results into the coding region selection. For homology-based gene prediction, protein sequences of M. itinerans (Wu et al., 2016), M. schizocarpa (Belser et al., 2018), M. acuminata (GCF_000313855.2) (D’Hont et al., 2012), M. balbisiana (GCA_004837865.1) (Wang et al., 2019), Arabidopsis thaliana (GCF_000001735.4), Zea mays (GCF_902167145.1), and Oryza sativa (GCA_001433935.1) were downloaded from the Banana Genome Hub (Droc et al., 2022) and the NCBI database. These protein sequences were aligned to the reference genome with GenomeThreader v1.7.1 (Gremme et al., 2005).
All predictions were integrated with EVidenceModeler v1.1.1 (EVM) (Haas et al., 2008) to generate weighted consensus gene structures (evidence weights: AUGUSTUS, 1; StringTie, 10; TransDecoder, 5; GenomeThreader, 5). The weights follow the EVM recommendation that transcript-based evidence be weighted more heavily than other evidence and that homology-based evidence be weighted at least as heavily as ab initio predictions. Based on the annotated genes, synteny with other species was compared using MCscan from the JCVI utility libraries (Tang et al., 2024).
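EVM reads its evidence weights from a three-column, tab-separated file (evidence class, source label, weight). The sketch below writes such a table using the four weights given above. The weights come from the text; the evidence classes assigned to each tool and the source labels are assumptions for illustration, since the labels must match the source column of whatever GFF3 evidence files are actually supplied to EVM.

```python
# Weights are from the text. Evidence classes and source labels are
# assumptions: labels must match column 2 of the evidence GFF3 files.
EVIDENCE_WEIGHTS = [
    ("ABINITIO_PREDICTION", "AUGUSTUS", 1),
    ("TRANSCRIPT", "StringTie", 10),
    ("OTHER_PREDICTION", "TransDecoder", 5),
    ("PROTEIN", "GenomeThreader", 5),
]


def format_weights(rows):
    """Render (class, source, weight) rows as tab-separated text lines."""
    return "".join(
        f"{evidence_class}\t{source}\t{weight}\n"
        for evidence_class, source, weight in rows
    )


def write_weights(path, rows=EVIDENCE_WEIGHTS):
    """Write the weights table to a file (the file name is up to the user)."""
    with open(path, "w") as handle:
        handle.write(format_weights(rows))
```

Calling `write_weights("weights.txt")` produces a four-line file; the file name is hypothetical, and the classes should be adjusted to however each evidence track was formatted for EVM.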
To functionally annotate each protein-coding gene, we searched the predicted proteins against public databases using multiple tools: KAAS (KEGG Automatic Annotation Server) (Moriya et al., 2007), BLASTP v2.9.0+ (Camacho et al., 2009), the online orthology-based functional annotation tool eggNOG-mapper v2.1.9 (Cantalapiedra et al., 2021), the online plant protein annotation tool Mercator4 v5.0 (You et al., 2019), and the online functional annotation tool TRAPID (Bucchini et al., 2021).
