Dryad

Diverse database and machine learning model to narrow the generalization gap in RNA structure prediction

Data files

Jan 29, 2026 version files 368.20 MB

Abstract

This dataset contains RNA secondary structure data used for training and testing eFold, a deep learning model for RNA secondary structure prediction. The dataset comprises three main components: (1) experimentally determined secondary structure models for 1,098 pri-miRNAs and 1,456 human mRNA regions derived from DMS-MaP-seq chemical probing experiments, representing the original contribution of this work; (2) a curated pre-training dataset combining subsets of bpRNA (base-pair RNA database) and RNAstralign databases, filtered to remove redundant sequences and ArchiveII sequences as described in the associated publication; and (3) benchmark test sets for evaluating model performance on long and diverse RNA structures.

The dataset includes sequence files in FASTA format and corresponding secondary structure annotations in dot-bracket notation. Structure models represent experimentally validated folding patterns with reactivity data from chemical probing assays. The pri-miRNA structures start at 200 nucleotides in length and include precursor hairpins with flanking regions, while the mRNA structures range from 200 nucleotides to 1 kb and focus on functionally important regions, including 3' untranslated regions.
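As an illustration of the dot-bracket annotations mentioned above, the following minimal Python sketch converts a dot-bracket string into a list of base-pair index tuples. The function name `dot_bracket_to_pairs` and the example sequence are hypothetical and not part of the dataset. The sketch assumes `.` marks an unpaired nucleotide and that matching brackets mark paired positions; support for `[]`, `{}`, and `<>` as pseudoknot brackets is an assumption about the notation, not a statement about what these files contain.

```python
def dot_bracket_to_pairs(structure: str) -> list[tuple[int, int]]:
    """Return sorted 0-based (i, j) base pairs encoded by a dot-bracket string.

    Hypothetical helper for illustration; not taken from the dataset or eFold.
    """
    openers = {"(": ")", "[": "]", "{": "}", "<": ">"}
    closers = {close: open_ for open_, close in openers.items()}
    # One stack per bracket type so crossing (pseudoknotted) pairs can be tracked.
    stacks: dict[str, list[int]] = {open_: [] for open_ in openers}
    pairs: list[tuple[int, int]] = []

    for i, ch in enumerate(structure):
        if ch in openers:
            stacks[ch].append(i)
        elif ch in closers:
            stack = stacks[closers[ch]]
            if not stack:
                raise ValueError(f"unmatched '{ch}' at position {i}")
            pairs.append((stack.pop(), i))
        elif ch != ".":
            raise ValueError(f"unexpected character '{ch}' at position {i}")

    for open_, stack in stacks.items():
        if stack:
            raise ValueError(f"unmatched '{open_}' at position {stack[-1]}")

    return sorted(pairs)


# Toy example (made-up sequence, not from the dataset).
sequence = "GGGAAACCC"
structure = "(((...)))"
for i, j in dot_bracket_to_pairs(structure):
    print(f"{sequence[i]}{i} pairs with {sequence[j]}{j}")
```

For the toy hairpin above, the helper returns the three pairs (0, 8), (1, 7), and (2, 6). Real files should be checked so that each structure string has the same length as its sequence before pairs are extracted.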

This dataset enables researchers to: (1) train and benchmark machine learning models for RNA structure prediction, particularly for long and complex RNAs that have been traditionally difficult to predict; (2) investigate RNA structural features in pri-miRNAs and mRNA regulatory regions; (3) compare performance of computational methods against experimentally determined structures; and (4) develop improved algorithms that incorporate diverse RNA families beyond the short non-coding RNAs that dominate existing training sets.
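For use case (3), one common way to compare a predicted structure against an experimentally determined one is an F1 score over base pairs. The sketch below is a minimal illustration under that assumption. The metric choice, the function names `base_pairs` and `base_pair_f1`, and the example structures are hypothetical and are not taken from the associated publication; only `(` and `)` are handled.

```python
def base_pairs(structure: str) -> set[tuple[int, int]]:
    """Collect 0-based (i, j) pairs from a dot-bracket string using '(' and ')'."""
    stack: list[int] = []
    pairs: set[tuple[int, int]] = set()
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.add((stack.pop(), i))
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs


def base_pair_f1(reference: str, predicted: str) -> float:
    """F1 score of predicted base pairs against reference base pairs."""
    if len(reference) != len(predicted):
        raise ValueError("structures must have the same length")
    ref, pred = base_pairs(reference), base_pairs(predicted)
    if not ref and not pred:
        return 1.0  # convention assumed here: two fully unpaired structures agree
    true_positives = len(ref & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Toy example (made-up structures, not from the dataset).
reference = "(((...)))"
predicted = "((.....))"
print(round(base_pair_f1(reference, predicted), 3))  # 0.8
```

In the toy example the prediction recovers two of the three reference pairs with no false positives, so precision is 1.0, recall is 2/3, and F1 is 0.8.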

All data are freely available without restrictions. No human subjects data or personally identifiable information is included. RNA sequences are derived from publicly available reference genomes and databases.