Diverse database and machine learning model to narrow the generalization gap in RNA structure prediction
Data files
Jan 29, 2026 version files 368.20 MB
- archiveII.json (3.05 MB)
- efold_train.json (349.51 MB)
- human_mRNA.json (11.90 MB)
- lncRNA_nonFiltered.json (174.95 KB)
- PDB.json (91.87 KB)
- pri_miRNA.json (3.30 MB)
- README.md (6.50 KB)
- viral_fragments.json (176.39 KB)
Abstract
This dataset contains RNA secondary structure data used for training and testing eFold, a deep learning model for RNA secondary structure prediction. The dataset comprises three main components: (1) experimentally determined secondary structure models for 1,098 pri-miRNAs and 1,456 human mRNA regions derived from DMS-MaP-seq chemical probing experiments, representing the original contribution of this work; (2) a curated pre-training dataset combining subsets of bpRNA (base-pair RNA database) and RNAstralign databases, filtered to remove redundant sequences and ArchiveII sequences as described in the associated publication; and (3) benchmark test sets for evaluating model performance on long and diverse RNA structures.
The dataset includes sequence files in FASTA format and corresponding secondary structure annotations in dot-bracket notation. Structure models represent experimentally validated folding patterns with reactivity data from chemical probing assays. The pri-miRNA structures range from 200 nucleotides in length and include precursor hairpins with flanking regions, while mRNA structures range from 200 nucleotides to 1 kb and focus on functionally important regions, including 3' untranslated regions.
This dataset enables researchers to: (1) train and benchmark machine learning models for RNA structure prediction, particularly for long and complex RNAs that have been traditionally difficult to predict; (2) investigate RNA structural features in pri-miRNAs and mRNA regulatory regions; (3) compare performance of computational methods against experimentally determined structures; and (4) develop improved algorithms that incorporate diverse RNA families beyond the short non-coding RNAs that dominate existing training sets.
All data are freely available without restrictions. No human subjects data or personally identifiable information is included. RNA sequences are derived from publicly available reference genomes and databases.
Dataset DOI: 10.5061/dryad.79cnp5j95
Description of the data and file structure
Each data file is a JSON file, structured as follows:
sequence_name:
    sequence: AAGUGAAG.. # string of nucleotides
    structure: [[171, 317], [351, 403], ...] # list of base pairs
    shape: [0.4519, 1.0903, 0.5035, 0.1382,...] # list of normalized SHAPE reactivities (when available)
    dms: [1.0, 0.7283, -1000.0, -1000.0, ...] # list of normalized DMS reactivities (when available). Since DMS only reacts with A and C, all reactivities for G and U are set to -1000.
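The record layout above can be turned into more familiar representations with a few lines of Python. The following is a minimal sketch, not part of the dataset's own tooling: the function names are hypothetical, the base-pair indices are assumed to be 0-based (pass `one_indexed=True` if they turn out to be 1-based), and the structure is assumed to be free of pseudoknots, since plain parentheses cannot encode crossing pairs.

```python
MISSING = -1000.0  # sentinel stored for G/U positions in the DMS track


def pairs_to_dot_bracket(length, pairs, one_indexed=False):
    """Render a list of [i, j] base pairs as a dot-bracket string.

    Assumption: pairs are nested (no pseudoknots) and index into the
    sequence; 0-based unless one_indexed is True.
    """
    chars = ["."] * length
    offset = 1 if one_indexed else 0
    for i, j in pairs:
        left, right = sorted((i - offset, j - offset))
        chars[left] = "("
        chars[right] = ")"
    return "".join(chars)


def mask_missing(reactivities, sentinel=MISSING):
    """Replace the sentinel with None so it is never averaged as a real value."""
    return [None if r == sentinel else r for r in reactivities]


# Toy record in the same layout as the JSON files (values are made up).
record = {
    "sequence": "GGGAAACCC",
    "structure": [[0, 8], [1, 7], [2, 6]],
    "dms": [-1000.0, -1000.0, -1000.0, 0.9, 1.0, 0.8, 0.1, 0.05, 0.0],
}

db = pairs_to_dot_bracket(len(record["sequence"]), record["structure"])
dms = mask_missing(record["dms"])
print(db)
print(dms)
```

Masking the sentinel before computing any statistic matters because a raw -1000 would dominate a mean or a correlation over the reactivity track.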
Files and variables
File: pri_miRNA.json
Description: Original contribution. See the Methods section of https://www.biorxiv.org/content/10.1101/2024.01.24.577093v4.
File: human_mRNA.json
Description: Original contribution. See the Methods section of https://www.biorxiv.org/content/10.1101/2024.01.24.577093v4.
File: lncRNA_nonFiltered.json
Description: The long non-coding RNA (lncRNA) dataset was sourced from Bugnon et al. (11). The only modification was to cut sequences exceeding 2,000 nucleotides in length, using the same method as for the viral structures. We did not apply the last filtering step of the cutting process, as we found it made almost no difference to the test results of any algorithm or model. To generate a sufficiently large dataset, we implemented a segmentation process that divided RNA sequences into smaller sub-structures. This step was crucial for expanding our training data. Segmentation was not necessary for the ArchiveII dataset, which already contained an ample number of diverse sequences and provides a comprehensive representation of RNA structures without further subdivision.
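The segmentation procedure is not spelled out in this description. One plausible reading, offered only as a sketch, is to split a long structure at positions that no base pair spans, so that each piece is enclosed by its own outermost pair. The helper names below are hypothetical, indices are assumed to be 0-based, and the sketch does not reproduce the refolding or filtering steps of the original pipeline.

```python
def closed_domains(pairs):
    """Return (start, end) spans of outermost closed modules.

    A span starts at the 5' partner of a pair that is not enclosed by any
    earlier pair and ends at its 3' partner. Crossing pairs, if any,
    extend the current span instead of opening a new one.
    """
    domains = []
    for i, j in sorted(tuple(sorted(p)) for p in pairs):
        if not domains or i > domains[-1][1]:
            domains.append([i, j])      # new module, outside the previous one
        elif j > domains[-1][1]:
            domains[-1][1] = j          # crossing pair: widen the module
    return [tuple(d) for d in domains]


def extract_fragment(sequence, pairs, start, end):
    """Cut sequence[start..end] and re-index the pairs it fully contains."""
    sub_pairs = [
        [i - start, j - start]
        for i, j in (sorted(p) for p in pairs)
        if start <= i and j <= end
    ]
    return sequence[start:end + 1], sub_pairs


# Toy example: two hairpins separated by a two-nucleotide linker.
sequence = "GGGAAACCCAAGCAAAAAGC"
pairs = [[0, 8], [1, 7], [2, 6], [11, 19], [12, 18]]

domains = closed_domains(pairs)
fragment = extract_fragment(sequence, pairs, *domains[1])
print(domains)
print(fragment)
```

Each fragment keeps only pairs that fall entirely inside it, so the re-indexed structure stays self-consistent.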
File: viral_fragments.json
Description: Original contribution. The viral structures dataset was created by segmenting long viral RNA sequences into smaller, independent modules. We utilized the HIV structure from Watts et al. (38), the SARS-CoV structure from Lan et al. (26), the Hepatitis virus structure from Mauger et al. (39), and the Alphavirus structure from Kutchko et al. (40). For each of these structures, we identified modular structures characterized by fully closed loops and high agreement between the structure and chemical probing data (AUROC > 0.8). After segmenting the chemical probing signal (DMS or SHAPE), RNAstructure was rerun with each fragment. Fragments were retained if the new structure aligned with the corresponding segment of the chemical probing signal (AUROC > 0.8 and F1 > 0.8).
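The two agreement scores used as cutoffs above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: unpaired positions are treated as the positive class and reactivity as the score, positions carrying the -1000 sentinel are skipped, and F1 is computed over sets of base pairs (which two structures the original pipeline compared for F1 is an assumption here). The function names are hypothetical.

```python
def auroc(labels, scores):
    """Probability that a positive outranks a negative (ties count 0.5).

    Quadratic in the number of positions; adequate for a sketch on
    sequences of a few thousand nucleotides.
    """
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def structure_auroc(pairs, reactivities, sentinel=-1000.0):
    """AUROC of reactivity as a predictor of 'unpaired' (assumed convention)."""
    paired = {idx for pair in pairs for idx in pair}
    labels, scores = [], []
    for position, value in enumerate(reactivities):
        if value == sentinel:           # no measurement at this position
            continue
        labels.append(position not in paired)
        scores.append(value)
    return auroc(labels, scores)


def pair_f1(predicted, reference):
    """F1 score between two base-pair lists, ignoring pair orientation."""
    pred = {tuple(sorted(p)) for p in predicted}
    ref = {tuple(sorted(p)) for p in reference}
    true_pos = len(pred & ref)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)


# Toy hairpin: loop positions are reactive, stem positions are not.
pairs = [[0, 8], [1, 7], [2, 6]]
dms = [-1000.0, -1000.0, -1000.0, 0.9, 1.0, 0.8, 0.1, 0.05, 0.0]

score = structure_auroc(pairs, dms)
f1 = pair_f1([[0, 8], [1, 7]], pairs)
keep = score > 0.8 and f1 > 0.8
print(score, f1, keep)
```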
File: archiveII.json
Description: The ArchiveII dataset was downloaded from the Mathews lab website (https://rna.urmc.rochester.edu/publications.html). We removed only the 5 sequences longer than 2,000 nucleotides, resulting in a test set of 3,370 sequences.
File: PDB.json
Description: In selecting entries from the Protein Data Bank (PDB), we focused on those classified under "Polymer Composition" as RNA, and with a "Number of Assemblies", "Number of Distinct Molecular Entities", and "Total Number of Polymer Instances" all set at 1. This approach yielded 355 entries, which were then converted from tertiary to secondary structures using the RNApdbee webserver, applying the default settings: 3DNA/DSSR as the conversion software and the hybrid algorithm method.
File: efold_train.json
Description: We combined the public databases bpRNA (13) and Ribonanza (17) into a pre-training dataset. We applied the following filtering steps. First, we removed duplicate sequences within the databases. We kept only sequences made of the canonical bases ACGU, after converting T bases to U. We filtered out sequences shorter than 10 nucleotides. We removed entries that have a sequence but no structure. We removed sequences shared between datasets. For Ribonanza specifically, we filtered out low-read and low signal-to-noise (S/N) data: the cutoff was set to more than 500 reads and an S/N ratio, a quality indicator provided by the Ribonanza dataset, greater than 1. The Ribonanza data includes structures predicted with EternaFold. To ensure that each EternaFold-predicted structure matched the signal, we computed the AUROC between the structure and one of the chemical probing signals for each sequence. We used DMS by default, and SHAPE if DMS was filtered out. If the AUROC was below 0.8, the structure was filtered out. We removed redundant sequences within the dataset using BLAST, with the primers masked. If two sequences had over 80% matches over a stretch of more than 112 nucleotides, we kept only the best-covered sequence. Note that 112 nucleotides corresponds to 80% of the most represented length minus the primers.
The last dataset used for pretraining is a synthetic dataset. We gathered sequences from RNACentral by sampling uniformly from all RNA clans, then added a balanced proportion of mRNA and viral sequences, and finally predicted the structure of each sequence using RNAstructure Fold. We combined all training data and removed any sequence similar to the test sets with a BLAST analysis using the same parameters as before. The final dataset, called “efold_train”, contains 306,557 sequences of up to 1,024 nucleotides in length.
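The sequence-level filters listed for the pre-training data (T to U conversion, canonical bases only, minimum length of 10, structure required, exact duplicates removed) can be sketched as below. This is an illustrative reading, not the authors' code: the function name is hypothetical, a missing or null `structure` is assumed to mean "no structure", and the read-depth, S/N, AUROC and BLAST-based redundancy steps are not covered.

```python
CANONICAL = set("ACGU")


def clean_records(records, min_len=10):
    """Apply the simple per-sequence filters to a {name: record} mapping."""
    seen = set()
    kept = {}
    for name, rec in records.items():
        seq = rec.get("sequence", "").upper().replace("T", "U")
        if len(seq) < min_len:
            continue                    # too short
        if set(seq) - CANONICAL:
            continue                    # contains N or other non-canonical bases
        if rec.get("structure") is None:
            continue                    # sequence without a structure
        if seq in seen:
            continue                    # exact duplicate of an earlier entry
        seen.add(seq)
        kept[name] = {**rec, "sequence": seq}
    return kept


# Toy input (made-up names and sequences).
raw = {
    "ok": {"sequence": "GGGAAATTTCCC", "structure": [[0, 11]]},
    "dup": {"sequence": "GGGAAAUUUCCC", "structure": [[0, 11]]},
    "short": {"sequence": "GGGCCC", "structure": []},
    "ambiguous": {"sequence": "GGGAAANNNCCC", "structure": []},
    "no_structure": {"sequence": "GGGAAACCCAAA", "structure": None},
}

cleaned = clean_records(raw)
print(sorted(cleaned))
```

An empty pair list is kept on purpose, since it can describe a fully unpaired sequence rather than a missing structure.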
Code/software
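#!/usr/bin/env python3
import json

# Replace the placeholder path with the path to one of the JSON data files.
with open('/path/to/dataset.json', 'r') as f:
    data = json.load(f)  # dict mapping each sequence_name to its record

# Prints the whole file; output is very large for efold_train.json.
print(data)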
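Printing an entire file is rarely practical for the larger datasets, so a short summary is often more useful. The sketch below is an assumption-laden illustration: `summarize` is a hypothetical helper, and it assumes each file is a mapping from sequence name to a record with a `sequence` key and optional `structure`, `shape` and `dms` keys, as described in the file-structure section.

```python
import json
from collections import Counter


def summarize(data):
    """Count sequences, report the length range, and tally populated fields."""
    lengths = [len(rec["sequence"]) for rec in data.values()]
    fields = Counter(
        key
        for rec in data.values()
        for key, value in rec.items()
        if value is not None            # treat null as "not available"
    )
    return {
        "n_sequences": len(data),
        "min_length": min(lengths, default=0),
        "max_length": max(lengths, default=0),
        "fields": dict(fields),
    }


# With a real file (the path is a placeholder):
#     with open("/path/to/dataset.json") as f:
#         print(summarize(json.load(f)))

# Toy data in the same layout, so the example runs without the files.
example = {
    "seq_a": {"sequence": "GGGAAACCC", "structure": [[0, 8]], "dms": None},
    "seq_b": {"sequence": "GGCGAAAGCC", "structure": [[0, 9]], "shape": [0.1] * 10},
}
summary = summarize(example)
print(json.dumps(summary, indent=2))
```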
Access information
Data was derived from the following sources:
- https://rna.urmc.rochester.edu/publications.html
- https://www.rcsb.org/
- L. A. Bugnon, A. A. Edera, S. Prochetto, M. Gerard, J. Raad, E. Fenoy, M. Rubiolo, U. Chorostecki, T. Gabaldón, F. Ariel, L. E. Di Persia, D. H. Milone, G. Stegmayer, Secondary structure prediction of long noncoding RNA: review and experimental comparison of existing approaches. Briefings in Bioinformatics 23, bbac205 (2022).
