Data from: Pangenomes reveal extensive structural variation in a suboscine passerine bird, the Pearly-vented Tody-Tyrant (Hemitriccus margaritaceiventer)

Lopez, Kelsie

Research facility: Harvard University

Published Apr 03, 2026 on Dryad. https://doi.org/10.5061/dryad.7wm37pw5q

Data files

Apr 03, 2026 version files (2.48 GB total):

- pggb_graphs.tar.gz (2.34 GB)
- pggb_variation_overlaps_final.tab.gz (138.07 MB)
- README.md (8.56 KB)

Abstract

Structural variants (SVs) are major drivers of evolutionary processes such as adaptation and speciation, yet their complexity and dynamics in wild populations remain largely unexplored. Avian diversity is highest in the Neotropics, primarily due to the suboscine passerine radiation; however, despite this diversity, genomic resources and studies of SVs in suboscines are scarce compared to their sister clade, the oscine passerines (“songbirds”). Here, we used long-read and chromatin conformation capture sequencing to assemble a high-quality scaffolded reference genome and construct a population-scale pangenome from 5 individuals of the Pearly-vented Tody-Tyrant (Hemitriccus margaritaceiventer), a suboscine bird with plumage variation across its distribution in South American dry forests. Our pangenome graph reveals extensive structural variation, with the chromosomal distribution of SVs strongly predicted by simple and low-complexity repeats, highlighting how specific repeat architecture may influence genome evolution. We discovered intraspecific copy number variation in multigene families, with the most complex instance including beta-keratin genes. Lastly, we identified a 306 kb inversion spanning several melanin pigmentation-associated genes (e.g. MREG, MLPH, RAB17), making it a potential candidate SV for known intraspecific plumage variation. Our study establishes a population-scale pangenome resource for a suboscine bird, enabling characterization of the genome-wide abundance, diversity, and distribution of SVs within this species.

Overview

This repository contains data files related to the pangenome analysis of Hemitriccus margaritaceiventer genomes. The data includes:

PGGB Pangenome Graph Files (pggb_graphs.tar.gz): The pangenome graphs are stored in GFA v1 (Graphical Fragment Assembly) format and compressed with gzip (*.gfa.gz). Each GFA file encodes a bidirected sequence graph for a single scaffold (e.g. scaffold_20.pan.fa.gz.gfaffix.unchop.Ygs.view.gfa.gz).

S lines (segments) define graph nodes and their DNA sequence.
- Example:
  S 19 TTTGCCCCCAGTCTGACTCCCAGTTTGCCCCTCAGTTTGGCCCCAGTTGCCCTCCCAGTTTGCCCCCAATTTCACTCCCCATTT
  Here, 19 is the segment (node) identifier and the third field is the nucleotide sequence assigned to that node.
L lines (links) define edges between segments, including orientation.
- Example:
  L 19 + 21 + 0M
  This links the end of segment 19 in the forward orientation (+) to the start of segment 21 in the forward orientation (+), with the overlap descriptor 0M (no shared bases; end-to-end adjacency as produced by PGGB/gfaffix).
Segment identifiers (e.g. 17, 18, 19, …) are local to each scaffold graph and correspond to chopped pieces of the underlying multiple sequence alignment; they can be traversed along graph paths to reconstruct haplotypes and reference‑like sequences.
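
As a minimal sketch of how the S and L records described above can be read without specialised tools, the Python snippet below collects segments and links from GFA v1 text. It assumes fields are tab-separated, as the GFA v1 specification requires (the examples above are displayed with spaces). The function names `parse_gfa` and `parse_gfa_gz` and the short sequences in the example are illustrative and are not taken from the dataset.

```python
import gzip
import io

def parse_gfa(handle):
    """Collect segments and links from an open GFA v1 text handle."""
    segments = {}  # segment id -> nucleotide sequence
    links = []     # (from_id, from_orient, to_id, to_orient, overlap)
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            segments[fields[1]] = fields[2]
        elif fields[0] == "L":
            links.append(tuple(fields[1:6]))
        # Other record types (header, paths, ...) are ignored in this sketch.
    return segments, links

def parse_gfa_gz(path):
    """Convenience wrapper for a gzip-compressed GFA file (*.gfa.gz)."""
    with gzip.open(path, "rt") as handle:
        return parse_gfa(handle)

# Tiny in-memory example with invented sequences.
example = "S\t19\tTTTGCC\nS\t21\tACGT\nL\t19\t+\t21\t+\t0M\n"
segments, links = parse_gfa(io.StringIO(example))
print(len(segments), "segments;", len(links), "links")
```

For a real scaffold graph, `parse_gfa_gz` would be called with the path to one of the extracted `*.gfa.gz` files. Holding every segment in memory may be impractical for the largest scaffolds, in which case odgi or vg are the better choice.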

There is one graph per scaffold (31 in total, drawn from scaffold_1–scaffold_34). We excluded the scaffolds corresponding to sex chromosomes, as well as scaffold 30. A community here is the group of contigs from all pseudo-haplotypes that best corresponds to a given H. margaritaceiventer reference scaffold. Unlike all other scaffolds, which were reproducibly assigned, scaffold 30 did not form a distinct community and was instead variably grouped with different autosomal communities across samples, so we excluded it because of this ambiguous placement.

Final PGGB Variation File (pggb_variation_overlaps_final.tab.gz): A gzipped, tab-delimited, Variant Call Format (VCF)-like file containing variant information decomposed from the pangenome graphs (see Supplementary Methods in the paper for variant identification and classification details).

Usage notes and recommended software

All file types in this repository can be opened and processed using free and open-source software.

GFA graph files (*.gfa.gz)
- Can be visualized and/or processed with:
  - vg (Graph Genome Toolkit): https://github.com/vgteam/vg
  - odgi: https://github.com/pangenome/odgi
  - Bandage / BandageNG for graphical exploration of assembly graphs: https://rrwick.github.io/Bandage/
- Example to inspect graph statistics with odgi:
```
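# Decompress to a plain GFA file, then build the odgi graph.
zcat scaffold_20.pan.fa.gz.gfaffix.unchop.Ygs.view.gfa.gz > scaffold_20.gfa
odgi build -g scaffold_20.gfa -o scaffold_20.og
# -S prints a summary (sequence length, nodes, edges, paths).
odgi stats -i scaffold_20.og -S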
```

Processed Variant Information (`pggb_variation_overlaps_final.tab.gz`)

Column Descriptions:

chrom: Chromosome identifier for the variant position (PanSN-spec naming, with the "HemMar#1#" prefix before the scaffold name; see Methods).
bedStart: Start position of the variant on the chromosome (0-based).
bedEnd: End position of the variant on the chromosome (1-based, so that bedStart and bedEnd together form a BED-style half-open interval).
type: Variant type as reported by bcftools: one of SNP, MNP, INDEL, or OTHER.
overlap: Genomic region(s) overlapped by the variant: one of intergenic, cds, intron, or none. The overlap is based on coordinates from the annotated H. margaritaceiventer reference genome.
repeat: Comma-separated list of the specific RepeatMasker repeat names overlapped by the variant (otherwise none). The overlap is based on coordinates from the annotated H. margaritaceiventer reference genome.
repeat_family: Comma-separated list of the repeat families of the repeats overlapped by the variant (DNA, Simple_repeat, LTR, Low_complexity, LINE, SINE, Unknown, Satellite, Other, or none).
subtype: Specific subtype of the variant: SNP, SV, SVINS, SVDEL, INDEL, DEL, or INS, and/or a Complex form of these types.
ref: Reference allele nucleotide(s) at the variant position; based on the H. margaritaceiventer reference genome.
alt: Alternative allele nucleotide(s) at the variant position; based on the H. margaritaceiventer reference genome.
aa: Ancestral allele nucleotide(s) at the variant position; based on the two pseudo-haplotype genomes of the outgroup Pyrocephalus rubinus.
inv: Whether an inversion is present, as identified by PGGB only and not by the other inversion detection programs (see Methods).
polarized: Whether the variant is polarized. False if aa == "."; True otherwise, in which case aa contains the allele nucleotide(s) of the outgroup ancestral allele.
base_allele_len: Length of the reference allele if aa == ".", otherwise length of the aa allele.
alt_len_max: Longest non-base allele length.
alt_len_min: Shortest non-base allele length.
allele_count: Total Allele Count out of 10 total pseudo-haplotype assemblies. Missing data reduces this count.
HMRG_DAC: Derived Allele Count out of 10 total pseudo-haplotype assemblies. Missing data reduces this count.
HMRG_AN: Number of alleles out of 10 total pseudo-haplotype samples. Missing data reduces this count.
HMRG_MISS: Missing genotype counts. Count of missing ('.') genotype alleles out of 10 total pseudo-haplotype assemblies.
HMRG_6371: Genotype for HMRG_6371: 1|1, 1|0, 0|1, or 0|0, with '.' in place of any missing allele.
HMRG_6386: Genotype for HMRG_6386: 1|1, 1|0, 0|1, or 0|0, with '.' in place of any missing allele.
HMRG_6388: Genotype for HMRG_6388: 1|1, 1|0, 0|1, or 0|0, with '.' in place of any missing allele.
HMRG_6431: Genotype for HMRG_6431: 1|1, 1|0, 0|1, or 0|0, with '.' in place of any missing allele.
HMRG_6433: Genotype for HMRG_6433: 1|1, 1|0, 0|1, or 0|0, with '.' in place of any missing allele.
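
To illustrate how the columns above fit together, here is a minimal Python sketch that reads rows from the table, splits the PanSN chromosome name, computes the interval length, and tallies alleles across the five genotype columns. It rests on several assumptions that should be checked against the actual file: that the first line is a header holding the column names exactly as listed above, that genotype alleles are separated by `|`, and that a bare `.` genotype means both alleles are missing. The helper names (`split_pansn`, `tally_alleles`, `read_variants`) and the example record are hypothetical.

```python
import csv
import io

# Sample columns as listed in the column descriptions above.
SAMPLES = ["HMRG_6371", "HMRG_6386", "HMRG_6388", "HMRG_6431", "HMRG_6433"]

def split_pansn(chrom):
    """Split a PanSN name 'sample#haplotype#contig' into its three parts."""
    sample, haplotype, contig = chrom.split("#", 2)
    return sample, haplotype, contig

def tally_alleles(row):
    """Count '0', '1' and '.' alleles across the five genotype columns."""
    counts = {"0": 0, "1": 0, ".": 0}
    for sample in SAMPLES:
        alleles = row[sample].split("|")
        if alleles == ["."]:  # assumption: bare '.' means both alleles missing
            alleles = [".", "."]
        for allele in alleles:
            counts[allele] = counts.get(allele, 0) + 1
    return counts

def read_variants(handle):
    """Yield one dict per variant from a tab-delimited handle with a header row."""
    yield from csv.DictReader(handle, delimiter="\t")

# Invented excerpt: a header with a subset of the columns plus one made-up record.
header = ["chrom", "bedStart", "bedEnd", "subtype"] + SAMPLES
record = ["HemMar#1#scaffold_20", "100", "150", "SVDEL",
          "1|1", "1|0", "0|0", ".|1", "0|0"]
example = "\t".join(header) + "\n" + "\t".join(record) + "\n"

rows = list(read_variants(io.StringIO(example)))
row = rows[0]
name_parts = split_pansn(row["chrom"])
span = int(row["bedEnd"]) - int(row["bedStart"])  # half-open BED interval length
counts = tally_alleles(row)
print(name_parts, span, counts)
```

For the real table, the same generator could be fed from `gzip.open("pggb_variation_overlaps_final.tab.gz", "rt", newline="")`. Under the descriptions above, the count of `.` alleles would be expected to match HMRG_MISS, and the remaining alleles to match HMRG_AN, but this has not been verified against the file here.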

Methods:

Genomes for the Pearly-vented Tody-Tyrant (Hemitriccus margaritaceiventer) were assembled using Hifiasm (https://github.com/chhylp123/hifiasm). For pangenome graph construction, we used the PanGenome Graph Builder (PGGB, https://github.com/pangenome/pggb; Garrison et al., 2024) via the nf-core/pangenome Nextflow pipeline (https://nf-co.re/pangenome/1.0.0/). Contigs were first assigned to chromosomes using wfmash mapping, with split-mapping for unmapped contigs. PGGB graph induction and normalization used wfmash, seqwish, smoothxg, and gfaffix. Variant decomposition from the resulting graphs was performed using vg deconstruct (v1.40.0, https://github.com/vgteam/vg), with filtering for large variants using vcfbub (https://github.com/pangenome/vcfbub) and conversion to primitive alleles using vcfwave (Garrison et al., 2022). Variant files were merged and processed with bcftools, and additional annotation and summary tables of variants were generated using BEDtools and custom Python scripts. Full details of all software, versions, and parameter settings are described in the supplementary methods and the GitHub repository (https://github.com/kelsiealopez/Hemitriccus-Pangenome).

Garrison, E., Guarracino, A., Heumos, S., Villani, F., Bao, Z., Tattini, L., Hagmann, J., Vorbrugg, S., Marco-Sola, S., Kubica, C., Ashbrook, D. G., Thorell, K., Rusholme-Pilcher, R. L., Liti, G., Rudbeck, E., Golicz, A. A., Nahnsen, S., Yang, Z., Mwaniki, M. N., … Prins, P. (2024). Building pangenome graphs. Nature Methods, 21(11), 2008–2012. https://doi.org/10.1038/s41592-024-02430-3

Garrison, E., Kronenberg, Z. N., Dawson, E. T., Pedersen, B. S., & Prins, P. (2022). A spectrum of free software tools for processing the VCF variant call format: Vcflib, bio-vcf, cyvcf2, hts-nim and slivar. PLOS Computational Biology, 18(5), e1009123. https://doi.org/10.1371/journal.pcbi.1009123