HG002 DNA (PacBio and ONT) and UHRR RNA (ONT) base modification data for minimod
Data files
Jul 14, 2025 version files (770.64 GB)
- hg002_prom_PGXXSX240041_aligned.bam (191.08 GB)
- hg002_prom_PGXXSX240041_unaligned.bam (211.53 GB)
- hg002_revio_RGBX240039_aligned.bam (35.89 GB)
- hg002_revio_RGBX240039_unaligned.bam (298.47 GB)
- README.md (2.98 KB)
- uhrr_prom_PNXRXX240011_aligned.bam (15.13 GB)
- uhrr_prom_PNXRXX240011_unaligned.bam (18.55 GB)
Abstract
Recent advances in third-generation sequencing technologies have enabled the detection of various DNA and RNA base modifications in addition to standard nucleotide sequences. Both major vendors in this space—Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio)—now include base modification information in their sequencing outputs using MM/ML tags embedded in unaligned BAM files. Each vendor also provides dedicated tools for extracting and analysing these tags, such as ONT’s modkit and PacBio’s pb-cpg-tools.
This work presents Minimod, a new vendor-agnostic tool designed to extract and analyse any type of base modification from sequencing data generated by any platform that supports MM/ML tags. Minimod is a free, open-source application written in C and is available on GitHub and Zenodo (see related works). This dataset provides the supporting data used to evaluate Minimod’s performance.
Dataset DOI: 10.5061/dryad.c59zw3rm7
Description of the data and file structure
The dataset includes three samples:
- HG002 DNA sequenced on an ONT PromethION using R10.4.1 chemistry
- HG002 DNA sequenced on a PacBio Revio
- UHRR RNA sequenced on an ONT PromethION using RNA004 chemistry
Files and variables
Both unaligned and aligned BAM files containing MM/ML tags are available for each sample as follows:
- HG002 DNA sequenced on an ONT PromethION
  - hg002_prom_PGXXSX240041_unaligned.bam (also contains move table tags)
  - hg002_prom_PGXXSX240041_aligned.bam
- HG002 DNA sequenced on a PacBio Revio
  - hg002_revio_RGBX240039_unaligned.bam (also contains kinetic data tags)
  - hg002_revio_RGBX240039_aligned.bam
- UHRR RNA sequenced on an ONT PromethION
  - uhrr_prom_PNXRXX240011_unaligned.bam
  - uhrr_prom_PNXRXX240011_aligned.bam
Aligned BAM files were generated by aligning the unaligned BAM files using minimap2 (version 2.28) with the following commands:
# HG002 DNA on ONT PromethION
samtools fastq -TMM,ML hg002_prom_PGXXSX240041_unaligned.bam | minimap2 -ax map-ont -Y -y --secondary=no hg38noAlt.fa - | samtools sort - -o hg002_prom_PGXXSX240041_aligned.bam
# HG002 DNA on PacBio Revio
samtools fastq -TMM,ML hg002_revio_RGBX240039_unaligned.bam | minimap2 -ax map-hifi -y -Y hg38noAlt.fa --secondary=no - | samtools sort - -o hg002_revio_RGBX240039_aligned.bam
# UHRR RNA on ONT PromethION
samtools fastq -TMM,ML uhrr_prom_PNXRXX240011_unaligned.bam | minimap2 -ax map-ont -Y -y --secondary=no -uf gencode.v40.transcripts.fa - | samtools sort -o uhrr_prom_PNXRXX240011_aligned.bam
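As an optional sanity check after alignment, the sketch below counts how many SAM records still carry both MM and ML tags. The function name count_mod_tagged is hypothetical and is not part of samtools, minimap2, or minimod; it only assumes standard SAM text on standard input, where optional tags begin at column 12.

```shell
# Count SAM records on stdin that carry both MM and ML tags.
# Prints "<records with both tags> <total records>".
# count_mod_tagged is a hypothetical helper, not a samtools or minimod command.
count_mod_tagged() {
    awk -F'\t' '
        /^@/ { next }                          # skip SAM header lines
        {
            total++
            has_mm = 0; has_ml = 0
            for (i = 12; i <= NF; i++) {       # optional tags start at column 12
                if ($i ~ /^MM:Z:/) has_mm = 1
                if ($i ~ /^ML:B:/) has_ml = 1
            }
            if (has_mm && has_ml) tagged++
        }
        END { print tagged + 0, total + 0 }'
}
```

For example, `samtools view hg002_prom_PGXXSX240041_aligned.bam | head -n 1000 | count_mod_tagged` would report the counts for the first 1000 records. The two numbers need not match, since some reads may carry no modification calls.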
For the nanopore data, the unaligned BAM files were generated by basecalling the raw signal data (available through the ENA accessions listed under Access information) with the following commands:
# DNA
buttery-eel -g /dorado/bin --port 5000 --use_tcp --device cuda:all --call_mods --config dna_r10.4.1_e8.2_400bps_5khz_modbases_5hmc_5mc_cg_hac.cfg -i reads.blow5 -o unaligned.sam && samtools view unaligned.sam -o unaligned.bam
# RNA
slow5-dorado basecaller rna004_130bps_sup@v5.0.0 -x cuda:all reads.blow5 --modified-bases m6A_DRACH > unaligned.bam
The software versions used:
- buttery-eel: 0.6.0
- slow5-dorado: 0.8.3
- samtools: 1.21
Code/software
The files can be opened with any software that reads BAM files with MM/ML tags, such as samtools, minimod, or modkit.
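When inspecting the tags by hand, it helps to know that the SAM tags specification stores each ML value as an integer N from 0 to 255, representing a modification probability in the interval from N/256 to (N+1)/256. The sketch below converts a comma-separated list of ML values to probabilities. The helper name ml_to_prob is hypothetical, and reporting the interval midpoint is an assumption made here for illustration; individual tools may apply a different convention.

```shell
# Convert a comma-separated list of ML values (0-255) to probabilities.
# Each value N covers the interval [N/256, (N+1)/256); this prints the
# midpoint (N + 0.5) / 256, one value per line. The midpoint is an assumed
# convention for illustration. ml_to_prob is a hypothetical helper name.
ml_to_prob() {
    printf '%s\n' "$1" | tr ',' '\n' |
        awk '{ printf "%.4f\n", ($1 + 0.5) / 256 }'
}
```

For example, `ml_to_prob 250,12` prints one probability per line for the two values as they would appear after the `ML:B:C,` prefix in the output of `samtools view`.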
Access information
Raw nanopore signal data in BLOW5 format is available at ENA (HG002 sample: ERR12997168 and UHRR sample: ERR12997170).