Deep-learning-based annotation of 230 superasterid genomes reveals a harmonized dataset of 91,366 NLRs
Data files
Version of Mar 07, 2025; files total 109.07 GB
- Apiales.zip (6.18 GB)
- Aquifoliales.zip (627.99 MB)
- Asterales.zip (18.12 GB)
- Boraginales.zip (963.38 MB)
- Caryophyllales.zip (10.32 GB)
- compiled_CDS.fasta.gz (3.50 GB)
- compiled_proteomes.fasta.gz (2.14 GB)
- Cornales.zip (2.29 GB)
- Data_S1.csv (43.27 KB)
- Data_S2.csv (27.66 KB)
- Dilleniales.zip (268.81 MB)
- Dipsacales.zip (4.19 GB)
- Ericales.zip (12.48 GB)
- Garryales.zip (361.21 MB)
- Gentianales.zip (5.41 GB)
- Lamiales.zip (19.03 GB)
- README.md (5.62 KB)
- Santalales.zip (862.23 MB)
- Solanales.zip (22.32 GB)
Abstract
Plant nucleotide-binding leucine-rich repeat receptors (NLRs) are intracellular immune receptors crucial for pathogen recognition and immune responses. Despite their importance, NLRs are difficult to annotate and are frequently missed by standard annotation pipelines. To address the variability in NLR annotation accuracy across pipelines, we performed a harmonized de novo annotation of 230 high-quality superasterid genomes using the deep-learning-based software Helixer (Holst et al. 2023), which yielded 10,124,265 annotated protein sequences. We then used NLRtracker, which relies on InterProScan for domain identification, to detect NLR and NLR-associated sequences (Kourelis et al. 2021, Blum et al. 2025). Using the NLR definition from the RefPlantNLR dataset, we identified 91,366 NLRs, with counts ranging from 12 and 19 in the parasitic plants Cuscuta campestris and Orobanche coerulescens, respectively, to 2,804 in Solanum tuberosum (potato). Beyond the NLR annotation, we provide the genome annotations generated by Helixer, including proteomes, coding nucleotide sequences (CDS), and GFF files. This dataset offers a resource for standardized comparative genomics and evolutionary studies across superasterids.
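As a rough illustration of how the compiled FASTA files listed above might be read, the following Python sketch streams a gzip-compressed FASTA file and yields one record at a time, so the multi-gigabyte files never have to be loaded into memory. The function names are hypothetical, and nothing is assumed about the header format beyond the leading '>'.

```python
import gzip
from typing import Iterator, Tuple


def iter_fasta_gz(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a gzip-compressed FASTA file.

    The header is returned without the leading '>'; wrapped sequence
    lines are joined into a single string.
    """
    header = None
    chunks = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None and line:
                chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)


def count_records(path: str) -> int:
    """Count the FASTA records in a gzip-compressed file."""
    return sum(1 for _ in iter_fasta_gz(path))
```

For example, count_records("compiled_proteomes.fasta.gz") would return the number of protein records in the compiled proteome file, assuming it has been downloaded to the working directory.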
Methods
Helixer v0.3.2 (Stiehler et al. 2020; Holst et al. 2023) was run via Singularity on each genome FASTA file with the option '--lineage land_plant', which applies the default land-plant model (land_plant_v0.3_a_0080.h5). Coding DNA sequences (CDS) and protein FASTA files were extracted from the output GFF files using GffRead v0.12.7 (Pertea and Pertea 2020) with the '-x' and '-y' options, respectively. The extracted protein sequences were then analyzed using NLRtracker (Kourelis et al. 2021), which integrates InterProScan v5.65-97.0 (Jones et al. 2014).
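The annotation and extraction steps described above can be sketched as command lines. The Python snippet below only assembles and prints the commands; it does not run them. Only '--lineage land_plant', '-x', and '-y' come from the description above. The container image name (helixer.sif), the GPU flag '--nv', all file paths, and the remaining arguments ('--fasta-path', '--gff-output-path', '-g') are assumptions based on the tools' standard usage, not the exact invocation used for this dataset. The NLRtracker step is omitted because its invocation is not described here.

```python
import shlex
from pathlib import Path
from typing import List


def helixer_command(genome_fasta: Path, gff_out: Path,
                    image: str = "helixer.sif") -> List[str]:
    """Build a Helixer call wrapped in `singularity exec`.

    `image` is a hypothetical local container file, and `--nv`
    (GPU passthrough) is an assumption about the runtime setup.
    """
    return [
        "singularity", "exec", "--nv", image,
        "Helixer.py",
        "--fasta-path", str(genome_fasta),
        "--lineage", "land_plant",  # selects the default land-plant model
        "--gff-output-path", str(gff_out),
    ]


def gffread_command(genome_fasta: Path, gff: Path,
                    cds_out: Path, protein_out: Path) -> List[str]:
    """Build a GffRead call that writes CDS (-x) and protein (-y) FASTA."""
    return [
        "gffread",
        "-g", str(genome_fasta),  # genome sequence the GFF refers to
        "-x", str(cds_out),       # spliced CDS per transcript
        "-y", str(protein_out),   # translated protein per transcript
        str(gff),
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    genome = Path("genome.fasta")
    gff = Path("genome.helixer.gff")
    commands = [
        helixer_command(genome, gff),
        gffread_command(genome, gff, Path("genome.cds.fasta"),
                        Path("genome.proteins.fasta")),
    ]
    for cmd in commands:
        print(shlex.join(cmd))
```

Building each command as a list of arguments rather than a single string keeps paths with spaces intact if the list is later passed to subprocess.run.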
BUSCO scores were generated using BUSCO v5.5.0 (Manni et al. 2021) with the options '-m protein --lineage_dataset viridiplantae_odb10'.
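For readers who want to tabulate BUSCO results themselves, the sketch below parses the one-line score summary that BUSCO v5 writes to its short summary file into numeric fields. The function name is hypothetical, and the example values in the usage note are invented for illustration; they are not results from this dataset.

```python
import re
from typing import Dict

# Matches the BUSCO v5 one-line summary, e.g.
# C:98.1%[S:95.0%,D:3.1%],F:0.9%,M:1.0%,n:425
_BUSCO_LINE = re.compile(
    r"C:(?P<complete>[\d.]+)%"
    r"\[S:(?P<single>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,"
    r"M:(?P<missing>[\d.]+)%,"
    r"n:(?P<n>\d+)"
)


def parse_busco_summary(line: str) -> Dict[str, float]:
    """Parse a BUSCO summary line into percentages and the marker count."""
    match = _BUSCO_LINE.search(line)
    if match is None:
        raise ValueError(f"not a BUSCO summary line: {line!r}")
    scores = {key: float(value) for key, value in match.groupdict().items()}
    scores["n"] = int(match.group("n"))  # total number of BUSCO markers
    return scores
```

Calling parse_busco_summary("C:98.1%[S:95.0%,D:3.1%],F:0.9%,M:1.0%,n:425") on this invented line returns a dictionary with keys complete, single, duplicated, fragmented, missing, and n.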
