Data from: Multispecies pangenomes reveal a pervasive influence of population size on structural variation

Edwards, Scott 1 ; Sackton, Timothy 1 ; Khost, Danielle 1 ; Fang, Bohao 1

Published Jul 21, 2025; Updated Dec 29, 2025 on Dryad. https://doi.org/10.5061/dryad.8pk0p2p01

Abstract

Structural variants (SVs) are widespread in vertebrate genomes, yet their evolutionary dynamics remain poorly understood. Using 45 long-read de novo genome assemblies and pangenome tools, we analyze SVs within three closely related species of North American jays (Aphelocoma, scrub-jays) displaying a 60-fold range in effective population size. We find rapid evolution of genome architecture, including ~100 Mb variation in genome size driven by dynamic satellite landscapes with unexpectedly long (> 10 kb) repeat units and widespread variation in gene content, influencing gene expression. SVs exhibit slightly deleterious dynamics modulated by variant length and population size, with strong evidence of adaptive fixation only in large populations. Our results demonstrate how population size shapes the distribution of SVs and the importance of pangenomes to characterizing genomic diversity.

Description of the data and file structure

Files and variables

File: RepeatMasker_analysis.tar.gz:

Description: This file contains two files related to the analysis of RepeatMasker outputs:

all_haps_repmask_nornd_cat_CS_CY.bed.gz

Description: This file contains a streamlined version of the output of RepeatMasker for each haplotype in the data set, including outgroups. The file is in bed format. The RepeatMasker outfile was converted to bed format by the rmsk2bed command of bedops.

The file contains 6 columns: reference contig of haplotype; start coordinate of repeat; end coordinate of repeat; the type of repeat; strand; and category of repeat. The contig names begin with a two-letter code indicating the species: AA=Aphelocoma californica; AC=A. coerulescens; AI=A. insularis; AW=A. woodhouseii; CS=Cyanocitta stelleri; CY=Cyanocorax yucatanicus. The number following the species code is the individual ID of each bird. The remainder of the contig name indicates which haplotype the contig is on (hap1 or hap2) and then the contig name. The components of each contig name are in accordance with the PanSpec naming convention of the Human Pangenome Project.
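As a rough illustration of how the six-column bed file can be summarized, the sketch below sums the base pairs covered by each repeat category per species, taking the species from the two-letter code at the start of each contig name. It assumes tab-delimited rows with no header; the contig names in the toy rows are made up for illustration and are not taken from the data.

```python
from collections import defaultdict

def sum_repeat_bp(bed_lines):
    """Sum repeat-covered bp per (species code, repeat category)."""
    totals = defaultdict(int)
    for line in bed_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        contig, start, end, _rep_type, _strand, category = line.split("\t")[:6]
        species = contig[:2]  # two-letter species code, e.g. "AW"
        totals[(species, category)] += int(end) - int(start)  # bed intervals are half-open
    return dict(totals)

# Toy rows only: hypothetical contig names and repeat labels.
example = [
    "AW_000000#1#ctg1\t0\t100\tL2\t+\tLINE",
    "AW_000000#1#ctg1\t150\t200\tCR1\t-\tLINE",
    "AC_000000#2#ctg9\t10\t40\tsat1\t+\tSatellite",
]
print(sum_repeat_bp(example))
```

For the real file, the same function can be fed the lines of `gzip.open(path, "rt")`. Overlapping annotations, if any, would be counted twice by this simple sum.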

all_haps_repmask_nornd_sums.txt: This is a distillation of the above file and summarizes the total amount of each repeat type on each haplotype in the study. There are 2820 rows, corresponding to different haplotype/repeat combinations. There are 8 columns as follows:

hap	the haplotype ID
repeat_type	the specific repeat type identified by RepeatMasker
hapabund	the total base pairs covered by that repeat type on the haplotype
species	the species
maj_rep_type	the major category of repeat type. This was used in Fig. 2B of the paper
indiv	the individual bird
short_hap	the haplotype number of the individual
hap_assem_len	the length of the haplotype assembly

File: pggb_cleaned_final.vcf.gz

Description: This is a VCF file containing all the single nucleotide polymorphisms and structural variants retrieved from the entire data set. It was made by projecting the PGGB pangenome graph onto VCF format using vg deconstruct v1.40.0 as described in the associated manuscript. The variants of four species are in the file: AC=A. coerulescens; AI=A. insularis; AW=A. woodhouseii; and CY=Cyanocorax yucatanicus. There are 54 columns, as follows: #CHROM, AW reference chromosome and contig; POS, location in reference contig; ID, variant name (path in pangenome graph); REF, reference allele; ALT, variant allele; QUAL, variant quality; FILTER, filter applied (usually ., indicating none); INFO, standard metadata for VCF variants, including allele count, allele frequency, allele length, number of missing genotypes, type of variant, etc.; FORMAT, GT=genotype. The remaining 45 columns list each individual sample in the study for the four focal species: AC_1603_72852, AC_1603_72872, AC_1713_89780, AC_1873_10093, AC_1873_10702, AC_1873_10717, AC_1873_10752, AC_1873_10831, AC_1873_20908, AC_1873_20930, AC_1873_20934, AC_1873_20946, AC_1873_20954, AC_1873_20970, AI_1363_74536, AI_1363_74537, AI_1363_74539, AI_1363_74544, AI_1363_74563, AI_1363_74565, AI_1603_79174, AI_1603_79203, AI_1603_79218, AI_1603_79232, AI_1603_79302, AI_1603_79315, AI_1703_64680, AI_1833_00513, AI_1833_00687, AW_365326, AW_365327, AW_365335, AW_365336, AW_365337, AW_365338, AW_365339, AW_366487, AW_366488, AW_366490, AW_366493, AW_366494, AW_366497, AW_366498, AW_366499, CY_8788.
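The column layout described above can be read with a few lines of standard-library Python. The sketch below splits one VCF record into its fixed fields and per-sample genotypes, and flags a record as a structural variant when any ALT allele differs from REF by at least 50 bp. The 50 bp cutoff is a common convention used here as an assumption, not necessarily the threshold applied in the paper, and the sketch assumes ALT alleles are written as explicit sequences rather than symbolic alleles. The sample names and the record in the example are made up.

```python
def parse_vcf_record(line, sample_names, sv_min_len=50):
    """Parse one VCF data line; sv_min_len is an assumed SV length cutoff."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt = fields[:5]
    alts = alt.split(",")
    gt_index = fields[8].split(":").index("GT")  # position of GT in FORMAT
    genotypes = {}
    for name, entry in zip(sample_names, fields[9:]):
        gt = entry.split(":")[gt_index]
        alleles = gt.replace("|", "/").split("/")  # accept phased or unphased
        genotypes[name] = [None if a == "." else int(a) for a in alleles]
    max_len_diff = max(abs(len(a) - len(ref)) for a in alts)
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "max_len_diff": max_len_diff,
        "is_sv": max_len_diff >= sv_min_len,
        "genotypes": genotypes,
    }

# Toy record: a 60 bp insertion with hypothetical sample names s1..s3.
toy_line = "\t".join(
    ["ctg1", "100", ">1>4", "A", "A" + "T" * 60, "60", ".", "AC=2", "GT", "0|1", ".", "1"]
)
record = parse_vcf_record(toy_line, ["s1", "s2", "s3"])
print(record["is_sv"], record["genotypes"])
```

Header lines (those starting with `#`) should be skipped before calling the function; the sample names can be taken from the `#CHROM` header line.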

File: sj_pggb_graphs.tar.gz

Description: This tar file contains the main pangenome graphs made by the Pangenome Graph Builder (PGGB) and associated tables of graph depth. Each graph is in gfa format. Each graph was made using a combination of wfmash and pggb, as described in the associated paper.

There are a total of 135 files, as follows: 46 files are pggb pangenome graphs of each chromosome and major scaffolds for all AW, AC and AI individuals in the study, including the CY outgroup. .gfa files are in gfa pangenome graph format. 88 files are tab-delimited files (*.tsv). These files record the pangenome graph depth of each pggb .gfa file, in 1kb sliding windows or in 100kb sliding windows, as described in the associated publication. Finally, the file README.CHROMOSOME_COMMUNITIES_key.txt contains a key linking the 31 graphs and communities with chromosomal placements to specific chromosomes as described in the paper. The remaining graphs and communities are not placed on any chromosomes.
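For a quick look at a graph without specialized tools, the sketch below tallies the basic record types of a GFA version 1 file: segments (S lines), links (L lines) and paths (P lines). It is a minimal summary under the assumption that the files follow GFA 1 conventions; the toy graph and its path name are made up.

```python
def summarize_gfa(lines):
    """Count segments, links and paths in GFA v1 text and total segment bp."""
    n_segments = n_links = total_bp = 0
    paths = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        tag = fields[0]
        if tag == "S":
            n_segments += 1
            if fields[2] != "*":  # "*" means the sequence is not stored
                total_bp += len(fields[2])
        elif tag == "L":
            n_links += 1
        elif tag == "P":
            paths.append(fields[1])
    return {
        "segments": n_segments,
        "links": n_links,
        "total_bp": total_bp,
        "paths": paths,
    }

# Toy graph with a hypothetical path name.
toy_gfa = [
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "S\t2\tGG",
    "L\t1\t+\t2\t+\t0M",
    "P\tAW_000000#1#ctg1\t1+,2+\t*",
]
print(summarize_gfa(toy_gfa))
```

Graphs that store no P lines simply yield an empty path list. For large graphs it is better to stream the file line by line (for example from `open(path)`) than to load it into memory.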

File: pggb_no_ref_graphs.tar.gz

Description: This file contains 946 pggb graphs (*.gfa, see above) that do not contain any sequence from the AW reference genome. These graphs mostly contain complex repetitive DNA, but are still useful. The size of each of these files is listed in the file 'list_pggb_no_ref_graphs.txt'.

File: unplaced_pggb_graphs.tar.gz

Description: This file contains 18 compressed pggb graphs (*.gfa files) for communities that contained some AW reference sequence but could not be placed on any chromosomes of the A. coerulescens assembly that we used to assign chromosomes. These communities nonetheless contain important information. For example, many of the sequences for the 18-kb unit satellite ('sj_sat#circ30-18193') found across the three species are found in the file allbird_community.24_unplaced.final.gfa in this collection. Additionally, the MHC class II locus is found in allbird_community.41_unplaced.final.gfa. These graphs, as well as the non-reference and chromosomally placed graphs, can be visualized and analyzed using ODGI (https://academic.oup.com/bioinformatics/article/38/13/3319/6585331).

File: list_unplaced_pggb_graphs.txt

Description: This file lists the 18 unplaced communities in the above file along with their size in bytes.

File: initial_communityPartition.paf.gz

Description: The file initial_communityPartition.paf.gz contains the partitioning information for each contig into communities for input into pggb. This file was made with wfmash as described in the associated paper.

File: minigraph_bedfiles_vcfs.tar.gz

Description: This tar file contains 5 files:

sj-92b.call.bed.gz - this is a bed file of structural variant calls made by minigraph. It has 144,279 rows, the first 14 of which are information about the bed file and the column names. The remaining rows are structural variants in bed format. The first three columns are in standard bed format, indicating the contig coordinates. The remaining columns are 1) metadata associated with each variant; and 2) variant calls for each haplotype. The file includes individuals from the AW, AC and AI species as well as a CS outgroup (Cyanocitta stelleri). A total of 92 haplotypes are included in the file and in the associated VCF file and pangenome graph.

sj-92b.call.bed.gz.tbi - this is an index of the above bed file.

sj-92b.call.vcf.gz - this file contains the same information as sj-92b.call.bed.gz, but is in standard VCF format.

sj-92b.call.vcf.gz.tbi - this is an index for the above VCF file.

sj-92b.gfa.gz - this is the minigraph pangenome graph of structural variants. It is in gfa graph format.

File: aphWoo1_kallistoExpression.tsv.gz

Description: A tab-delimited file (aphWoo1_kallistoExpression.tsv) containing estimates of transcript abundance from Kallisto for each sample for which gene expression data was obtained. There are 3,395,120 rows, each one listing a given transcript. There are 9 columns: target_id, TOGA annotation; length, length of the TOGA annotation; eff_length, transcript length given uncertainty of start sites, as described; est_counts, the number of RNA-seq reads aligning to a given gene; tpm, transcripts per million; samp_id, sample name; tissue, what tissue is indicated; spp, the species (AW or AC); and geneID, gene name from the annotation file.
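Because the table is in long format (one row per transcript per sample), a common first step is to collapse it to gene-level values. The sketch below sums transcript tpm within each gene and sample, then averages across samples for each gene, tissue and species. It assumes the nine column names listed above appear as a header row; the toy values, including the tissue label, are made up.

```python
import csv
import io
from collections import defaultdict

def mean_gene_tpm(rows):
    """rows: dicts keyed by the column names described in this README."""
    per_sample = defaultdict(float)
    for row in rows:
        key = (row["geneID"], row["tissue"], row["spp"], row["samp_id"])
        per_sample[key] += float(row["tpm"])  # sum transcripts within a gene
    grouped = defaultdict(list)
    for (gene, tissue, spp, _samp), tpm in per_sample.items():
        grouped[(gene, tissue, spp)].append(tpm)
    # Average the per-sample gene totals across samples.
    return {key: sum(vals) / len(vals) for key, vals in grouped.items()}

# Toy table: two transcripts of one gene in two hypothetical samples.
toy = io.StringIO(
    "target_id\tlength\teff_length\test_counts\ttpm\tsamp_id\ttissue\tspp\tgeneID\n"
    "t1\t1000\t900\t10\t2.0\ts1\tliver\tAW\tG1\n"
    "t2\t1200\t1100\t5\t1.0\ts1\tliver\tAW\tG1\n"
    "t1\t1000\t900\t20\t4.0\ts2\tliver\tAW\tG1\n"
    "t2\t1200\t1100\t0\t0.0\ts2\tliver\tAW\tG1\n"
)
result = mean_gene_tpm(csv.DictReader(toy, delimiter="\t"))
print(result)
```

With more than three million rows, the real file is best streamed through `csv.DictReader` from `gzip.open(path, "rt")` rather than read into memory at once.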

File: sj_annotations.tar.gz

Description: This tar file contains three files: 1) aphWoo1_mergedTOGA_geneNames.gtf, a file in gtf format. This file contains the gene annotations for the AW reference, as explained in the main manuscript, including the gene names used in the pangene analysis, for comparison of the two. 2) aphWoo1_mergedTOGA_geneNames.tsv. This file contains the translation table for gene IDs used in TOGA, transcript IDs, and gene names used in pangene. Each of the 20,863 rows has a variable number of columns depending on how many transcript IDs there are; and 3) aphWoo.v1_shortcontignames_nornd.fa.out.bed. This file is the bed format version of the RepeatMasker outfile, using the file AW_365336_combined_repeats_v2.fasta.gz as the library on the AW reference assembly. It has 705,990 rows and six columns: the reference scaffold, bed coordinates in columns 2 and 3, the repeat family, the strand, and the repeat type.

File: all_sj_haps_RepeatMasker_bedfiles.tar.gz

Description: This tar file contains bed files of the complete outputs by RepeatMasker on all haplotypes in the data set. The bed files were produced by the rmsk2bed command of bedops and contain all columns of that output. This is the same information as in the file all_haps_repmask_nornd_cat_CS_CY.bed.gz, except that these bed files contain all columns of the RepeatMasker output, as opposed to the six main ones occurring in all_haps_repmask_nornd_cat_CS_CY.bed.gz.

File: AW_365336_combined_repeats_v2.fasta.gz

Description: This is the fasta file containing the repeat library used to annotate the main AW reference and all the haplotypes. It is in fasta format and is used as input into the RepeatMasker program. It contains the satellites found by Satellite Repeat Finder, which were substituted in for the satellites found by RepeatMasker.

File: satellite_analysis.tar.gz

Description: This file contains four files, as follows:

File: concat_srf_satellite_len_files_table.txt: This text file has 8 columns and 11770 rows. Each row is a separate combination of satellite and individual, including 298 different satellites identified by Satellite Repeat Finder.

The file is a concatenation of files for each of the 44 individuals analyzed. The 8 columns are as follows:

column	description
satellite	satellite ID
tot_bp_reads	total base pairs of satellite in fastq reads as measured by Satellite Repeat Finder
without_long	proportion of total reads comprised by satellite, not including the longest (and possibly spurious) satellites
with_long	proportion of total reads comprised by satellite, including the longest (and possibly spurious) satellites
indiv	individual bird ID
unit_length	unit length of satellite
species	species
length_category	category of length, for plotting

File: sj_sats_genomes_CONCAT.len

Description: This file contains information on satellite abundances as measured by Satellite Repeat Finder in primary assemblies of each of 45 individuals. There are 10732 rows corresponding to different combinations of individuals and 298 distinct satellites. The 7 columns correspond to the following variables:

satellite	satellite ID
tot_bp_assembly	total base pairs of satellite in primary assembly
divergence	average sequence divergence of satellite repeat units
without_long	proportion of assembly comprised by satellite, not including the longest (and possibly spurious) satellites
with_long	proportion of assembly comprised by satellite, including the longest (and possibly spurious) satellites
indiv	individual bird
species	species

File: top_28_sat_list_from_heatmap.txt

Description: This file contains a list of 28 of the most common satellites in the data set, in one column. It is used to parse the above files to include only these 28 most common satellites.

File: final_satellite_summary_reads_assemblies_dryad.txt

Description: This is the final summary file produced by integrating and summarizing the above three files. The R script available on Dryad will produce this file. There are 7 columns, with names corresponding to those in the previous files. The one column not found in the previous files is pri_assem_len, the length in base pairs of the primary assembly of a given individual.

File: base_composition_analysis_tables.tar.gz

Description: There are 8 files in this archive, all of which describe the base composition of different components of the genome of individual assembled haplotypes in the data set (see table below). All files are bed or bed-like files, with the first column designating the contig (with species and haplotype information); the second and third columns indicating the start and end positions of the region queried; and the last four columns indicating the number of As, Gs, Ts, and Cs in the region, as measured by seqtk comp. All files have seven columns, except for cat_all_haps_seqtk_basecomp.txt.gz, which has 6, because this table records base composition across whole contigs, which do not require start and end coordinates, just the length of each contig. These files are simplified versions of the output of seqtk comp, with seven columns removed for simplicity and storage constraints. See https://github.com/lh3/seqtk/issues/47 for full explanation of the seqtk comp output.

file	subgenome	cols	rows
cat_all_haps_seqtk_basecomp.txt.gz	all contigs, whole genome	6	117137
cat_all_haps_all_sats_seqtk_comp.bed.gz	all regions annotated as satellite	7	5492802
cat_all_haps_non_sat_bed_seqtk_comp.bed.gz	all regions not annotated as satellite	7	4600215
cat_all_haps_sj_circ1_2268_bed_seqtk_comp.bed.gz	all regions annotated as satellite sj_circ1_2268	7	85242
cat_all_haps_sj_sat_18kb_seqtk_comp.bed.gz	all regions annotated as satellite sj_sat_circ30_18193	7	28254
cat_all_haps_sj_sat_circ2_2119_seqtk_comp.bed.gz	all regions annotated as satellite sj_sat_circ2_2119	7	13287
cat_all_haps_sj_sat_circ4_2216_seqtk_comp.bed.gz	all regions annotated as satellite sj_sat_circ4_2216	7	53737
cat_all_haps_sj_sat_circ7-2218_seqtk_comp.bed.gz	all regions annotated as satellite sj_sat_circ7-2218	7	118010
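As an example of using these tables, the sketch below computes GC content from the last four columns of a row. The default column order follows the description above (A, G, T, C); because seqtk comp itself reports counts in the order A, C, G, T, the order is exposed as a parameter and should be checked against the files before use. The toy row is made up.

```python
def gc_fraction(row, base_order=("A", "G", "T", "C")):
    """GC fraction from the last four tab-separated columns of a row.

    base_order is an assumption about which base each column holds.
    Returns None for regions with no counted bases.
    """
    fields = row.rstrip("\n").split("\t")
    counts = dict(zip(base_order, (int(x) for x in fields[-4:])))
    total = sum(counts.values())
    if total == 0:
        return None
    return (counts["G"] + counts["C"]) / total

# Toy seven-column row with a hypothetical contig name.
toy_row = "AW_000000#1#ctg1\t0\t100\t30\t20\t30\t20"
print(gc_fraction(toy_row))                                   # order as described here
print(gc_fraction(toy_row, base_order=("A", "C", "G", "T")))  # native seqtk comp order
```

Reading from the end of the row means the same function works for the six-column whole-contig table and the seven-column region tables.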

File: all_sj_haps_sj_sat_circ30_18193_mafft.fa.gz

Description: This is the fasta alignment of 27,083 18-kb satellite repeats described in the associated paper. As explained in the paper, the individual satellite units were extracted from the haplotype assemblies using minimap2 and seqtk subseq. The satellite units were then aligned using mafft.

File: bpp_files.tar.gz

Description: This file contains the input data for analysis with Bayesian Phylogenetics and Phylogeography (bpp). It consists of 2489 alignments of highly mappable regions as defined by dipcall. This file can be used as input for the program bpp to estimate parameters of the species tree of jays as described in the paper.

File: pangene-1.1-bin-sj.tar.bz2

Description: This folder contains multiple folders and subfolders needed to set up a local, interactive html viewer of our pangene results, using the script pangene-1.1-bin-sj/00run.sh. You can also view this interactive html at https://pangene.bioinweb.org/. You can select a view centered on a specific gene and examine a visualization of the pangenome graph around that gene. We also include two pangene gene graphs, which can be used to examine copy number variants using command-line tools. These are the files used in the associated paper and are described as follows:

File: sj-94-a2.pg.gfa.gz

Description: This file, in the folder pangene-1.1-bin-sj/sj, contains the pangene gene graph describing copy number variants within and among scrub-jay species. This file does not contain any poorly annotated "LOC" genes.

File: sj-94nl-a2.pg.gfa.gz

Description: This file, also in the pangene-1.1-bin-sj/sj folder, contains the pangene gene graph describing copy number variants within and among scrub-jay species. This file contains some poorly annotated "LOC" genes.

File: pangene_analysis.tar.gz

Description: This file contains four files, as follows:

sj-94nl-a2.pg.CNV.Rtable: This text file presents the presence-absence variation of 14112 genes across 94 scrub-jay haplotypes, and was produced by post-processing of files produced by pangene. It has 95 columns and 14112 rows. The first column is the gene name and the remaining 94 columns are two haplotypes each from 47 birds, in the species Aphelocoma californica (AA), A. woodhouseii (AW), A. insularis (AI), A. coerulescens (AC), Cyanocorax yucatanicus (CY) and Cyanocitta stelleri (CS). This table includes only well-annotated genes (no "LOC" annotations).
sj-94-a2.pg.CNV.Rtable: This file is the same as the above, but contains 15733 genes, including any "LOC" genes (less well-annotated).
Z_linked_gene_list_684.txt: This file contains a list of 684 genes found on the Z chromosome of Aphelocoma woodhouseii reference assembly. It is used to filter the pangene tables above to remove sex-linked genes, yielding only autosomal genes. It was produced using R scripts archived at Zenodo (DOI: 10.5281/zenodo.16053688).
pangene_auto_no_haps_poly_stats_final_table.txt: This table contains various filters used to count copy-number variants. It has 13515 rows, corresponding to the 13515 autosomal genes, and 14 columns, the first one being the gene ID. The remaining 13 columns are as follows, using the species designations above:

sumAI	sum of gene copies in AI
sumAC	sum of gene copies in AC
sumAW	sum of gene copies in AW
polyAI	number of CNVs in AI, counting only variants appearing in at least two haplotypes
polyAC	number of CNVs in AC, counting only variants appearing in at least two haplotypes
polyAW	number of CNVs in AW, counting only variants appearing in at least two haplotypes
AC_0_1	Logical, whether the gene displays only a single gene (1) or deletions (0), in AC haplotypes
AI_0_1	Logical, whether the gene displays only a single gene (1) or deletions (0), in AI haplotypes
AW_0_1	Logical, whether the gene displays only a single gene (1) or deletions (0), in AW haplotypes
num_copy_num_states_AI	Number of distinct haplotypes in terms of gene copy numbers in AI
num_copy_num_states_AC	Number of distinct haplotypes in terms of gene copy numbers in AC
num_copy_num_states_AW	Number of distinct haplotypes in terms of gene copy numbers in AW
num_sp_poly_poly	Number of species (1-3) polymorphic for copy number of a given gene, counting only variants appearing in at least two haplotypes
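The filtering step for which Z_linked_gene_list_684.txt is provided can be sketched as follows: rows of a pangene table whose gene name appears in the Z-linked list are dropped, leaving autosomal genes. This is an illustration only, assuming the gene name is the first field of each row and that the list holds one gene per line; the archived R scripts remain the reference for what was actually done. Gene names in the example are made up.

```python
def drop_z_linked(table_rows, z_gene_lines):
    """Keep rows whose first field (gene name) is not in the Z-linked list."""
    z_genes = {g.strip() for g in z_gene_lines if g.strip()}
    return [row for row in table_rows if row[0] not in z_genes]

# Toy table: gene name followed by per-haplotype copy numbers (hypothetical).
toy_table = [
    ["geneA", "1", "1", "2"],
    ["geneZ", "1", "0", "1"],
    ["geneB", "1", "1", "1"],
]
toy_z_list = ["geneZ\n", "geneY\n", "\n"]
autosomal = drop_z_linked(toy_table, toy_z_list)
print([row[0] for row in autosomal])
```

Rows can be produced from the real tables by splitting each line on whitespace, keeping any header line aside before filtering.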

File: KING.tar.gz

Description: This file contains three files listing the pairwise relatedness of each individual within each of species AC, AI and AW. Each file has six columns, as follows: #IID1, the first individual of the dyad; IID2, the second individual of the dyad; NSNP, the number of SNPs used in the calculation of relatedness; HETHET, the proportion of sites at which both individuals are heterozygous; IBS0, the proportion of sites with 0 alleles shared; and KINSHIP, the estimate of relatedness between the two individuals.
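To interpret the KINSHIP column, the sketch below reads the six-column table and assigns each dyad a relationship class using the kinship cutoffs given in the KING documentation (above 0.354 for duplicates or monozygotic twins, then 0.177, 0.0884 and 0.0442 as the lower bounds for first-, second- and third-degree relatives). Whether these cutoffs were applied in the paper is not stated here, so treat them as an assumption. The individual names and values in the example are made up.

```python
def classify_kinship(kinship):
    """Relationship class from a KING kinship coefficient (KING manual cutoffs)."""
    if kinship > 0.354:
        return "duplicate/MZ twin"
    if kinship > 0.177:
        return "1st-degree"
    if kinship > 0.0884:
        return "2nd-degree"
    if kinship > 0.0442:
        return "3rd-degree"
    return "unrelated"

def read_king_table(lines):
    """Return (IID1, IID2, kinship, class) for each dyad in the table."""
    pairs = []
    for line in lines:
        if not line.strip() or line.startswith("#"):  # skip blanks and the #IID1 header
            continue
        iid1, iid2, _nsnp, _hethet, _ibs0, kinship = line.split()
        pairs.append((iid1, iid2, float(kinship), classify_kinship(float(kinship))))
    return pairs

# Toy table with hypothetical individuals and values.
toy_king = [
    "#IID1\tIID2\tNSNP\tHETHET\tIBS0\tKINSHIP",
    "bird_A\tbird_B\t100000\t0.12\t0.01\t0.25",
    "bird_A\tbird_C\t100000\t0.08\t0.05\t0.01",
]
for pair in read_king_table(toy_king):
    print(pair)
```

Negative kinship estimates, which KING can report for unrelated pairs, fall into the "unrelated" class here.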

Code/software

To set up the interactive pangene viewer, run the script 00run.sh. You will need a web browser, such as Firefox or Safari, to view the resulting html. The README file can be found in the unzipped path: pangene-1.1-bin-sj/doc/README.md

You can view the *.gfa files in a GFA viewer such as Bandage (Wick et al. 2015).

Wick, R. R., Schultz, M. B., Zobel, J., & Holt, K. E. (2015). Bandage: interactive visualization of de novo genome assemblies. Bioinformatics, 31(20), 3350-3352. doi:10.1093/bioinformatics/btv383

Access information

Other publicly accessible locations of the data:

The original sequence data can be found at NCBI under Umbrella BioProject PRJNA1206191, the Scrub-Jay (Aphelocoma) Pangenome Project. Within this project, BioProject PRJNA1204306 contains the PacBio HiFi reads (with samples SAMN46016487 - SAMN46016457), whereas BioProjects PRJNA1204814 - PRJNA1204903 contain the haplotype assemblies. All scripts can be found at https://github.com/harvardinformatics/scrub-jay-genomics and https://github.com/fangbohao/scrub-jay-pangenome.
The pangene gene graphs can be viewed interactively at https://pangene.bioinweb.org/.
Associated scripts used to generate some of the data files here can be found at DOI:10.5281/zenodo.16053688 (https://zenodo.org/records/16635922)

Data was derived from the following sources:

Museum of Comparative Zoology was the source for tissues of A. woodhouseii.

Forty-four genomes from three species of North American scrub jays (Aphelocoma insularis, A. woodhouseii and A. coerulescens) and one outgroup (Yucatán Jay, Cyanocorax yucatanicus) were sequenced using PacBio HiFi technology. The sequence reads were assembled into primary assemblies and two haplotype assemblies using hifiasm (Cheng et al. 2021). We used various pangenome tools, including the Pangenome Graph Builder (PGGB; Garrison et al. 2024) and minigraph (Li et al. 2020) to detect and characterize structural variants, including inversions, within and between species. We used RepeatModeler2 and RepeatMasker to annotate repetitive elements (Smit et al. 2015, Flynn et al. 2020). We conducted demographic analysis with PSMC (Li et al. 2011), bpp (Rannala et al. 2017) and other programs. We used Panacus to estimate growth curves for the pangenome graphs (Parmigiani et al. 2024), and fastDFE (Sendrowski et al. 2024) and anavar (Barton et al. 2018) to estimate the distribution of selection coefficients. We used Pangene to estimate pangene graphs within and between species (Li et al. 2024).

Barton HJ, Zeng K. 2018. New Methods for Inferring the Distribution of Fitness Effects for INDELs and SNPs. Mol Biol Evol: 35:1536-1546.

Cheng H, Concepcion GT, Feng X, Zhang H, Li H. 2021. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods: 18:170-175.

Flynn JM, Hubley R, Goubert C, Rosen J, Clark AG, Feschotte C, Smit AF. 2020. RepeatModeler2 for automated genomic discovery of transposable element families. Proc Natl Acad Sci U S A: 117:9451-9457.

Garrison E, Guarracino A, Heumos S, Villani F, Bao Z, Tattini L, Hagmann J, Vorbrugg S, Marco-Sola S, Kubica C, et al. 2024. Building pangenome graphs. Nat Methods: 21:2008-2012.

Li H, Durbin R. 2011. Inference of human population history from individual whole-genome sequences. Nature: 475:493-496.

Li H, Feng X, Chu C. 2020. The design and construction of reference pangenome graphs with minigraph. Genome Biol: 21:265.

Li H, Marin M, Farhat MR. 2024. Exploring gene content with pangene graphs. Bioinformatics: 40.

Parmigiani L, Garrison E, Stoye J, Marschall T, Doerr D. 2024. Panacus: fast and exact pangenome growth and core size estimation. Bioinformatics: 40.

Rannala B, Yang Z. 2017. Efficient Bayesian species tree inference under the multispecies coalescent. Systematic Biology: 66:823-842.

Sendrowski J, Bataillon T. 2024. fastDFE: Fast and Flexible Inference of the Distribution of Fitness Effects. Molecular Biology and Evolution: 41:msae070.

Smit AF, Hubley R, Green P. 2015. RepeatMasker Open-4.0. <http://www.repeatmasker.org>.
