Data from: Multispecies pangenomes reveal a pervasive influence of population size on structural variation
Data files
Jul 21, 2025 version files 60.79 GB
- all_haps_repmask_nornd_cat_CS_CY.bed.gz (80.69 MB)
- all_sj_haps_RepeatMasker_bedfiles.tar.gz (1.74 GB)
- all_sj_haps_sj_sat_circ30_18193_mafft.fa.gz (25.34 MB)
- aphWoo1_kallistoExpression.tsv.gz (58.46 MB)
- AW_365336_combined_repeats_RepMask.fasta.gz (642.77 KB)
- bpp_files.tar.gz (3.13 MB)
- initial_communityPartition.paf.gz (185.16 MB)
- KING.tar.gz (5.59 KB)
- minigraph_bedfiles_vcfs.tar.gz (569.21 MB)
- pangene-1.1-bin-sj.tar.bz2 (36 MB)
- pggb_cleaned_final.vcf.gz (9.08 GB)
- README.md (11.64 KB)
- sj_annotations.tar.gz (28.57 MB)
- sj_pggb_graphs.tar.gz (48.98 GB)
- sj-94-a2.pg.gfa.gz (7.07 MB)
- sj-94nl-a2.pg.gfa.gz (6.06 MB)
Aug 07, 2025 version files 60.93 GB
- all_sj_haps_RepeatMasker_bedfiles.tar.gz (1.74 GB)
- all_sj_haps_sj_sat_circ30_18193_mafft.fa.gz (25.34 MB)
- aphWoo1_kallistoExpression.tsv.gz (58.46 MB)
- AW_365336_combined_repeats_RepMask.fasta.gz (642.77 KB)
- base_composition_analysis_tables.tar.gz (153.48 MB)
- bpp_files.tar.gz (3.13 MB)
- initial_communityPartition.paf.gz (185.16 MB)
- KING.tar.gz (5.59 KB)
- minigraph_bedfiles_vcfs.tar.gz (569.21 MB)
- pangene_analysis.tar.gz (412.42 KB)
- pangene-1.1-bin-sj.tar.bz2 (36 MB)
- pggb_cleaned_final.vcf.gz (9.08 GB)
- README.md (24.35 KB)
- RepeatMasker_analysis.tar.gz (80.62 MB)
- satellite_analysis.tar.gz (671.53 KB)
- sj_annotations.tar.gz (28.57 MB)
- sj_pggb_graphs.tar.gz (48.98 GB)
Abstract
Structural variants (SVs) are widespread in vertebrate genomes, yet their evolutionary dynamics remain poorly understood. Using 45 long-read de novo genome assemblies and pangenome tools, we analyze SVs within three closely related species of North American jays (Aphelocoma, scrub-jays) displaying a 60-fold range in effective population size. We find rapid evolution of genome architecture, including ~100 Mb variation in genome size driven by dynamic satellite landscapes with unexpectedly long (> 10 kb) repeat units and widespread variation in gene content, influencing gene expression. SVs exhibit slightly deleterious dynamics modulated by variant length and population size, with strong evidence of adaptive fixation only in large populations. Our results demonstrate how population size shapes the distribution of SVs and the importance of pangenomes to characterizing genomic diversity.
https://doi.org/10.5061/dryad.8pk0p2p01
Description of the data and file structure
Files and variables
File: RepeatMasker_analysis.tar.gz
Description: This archive contains two files related to the analysis of RepeatMasker outputs:
- all_haps_repmask_nornd_cat_CS_CY.bed.gz
Description: This file contains a streamlined version of the RepeatMasker output for each haplotype in the data set, including outgroups. The file is in bed format; the RepeatMasker output file was converted to bed format with the rmsk2bed command of bedops.
The file contains 6 columns: reference contig of haplotype; start coordinate of repeat; end coordinate of repeat; the type of repeat; strand; and category of repeat. The contig names begin with a two-letter code indicating the species: AA=Aphelocoma californica; AC=A. coerulescens; AI=A. insularis; AW=A. woodhouseii; CS=Cyanocitta stelleri; CY=Cyanocorax yucatanicus. The number following the species code is the individual ID of each bird. The remainder of the contig name indicates which haplotype the contig is on (hap1 or hap2) and then the contig name. The components of each contig name follow the PanSN naming convention (PanSN-spec) used by the Human Pangenome Project.
- all_haps_repmask_nornd_sums.txt: This is a distillation of the above file that summarizes the total amount of each repeat type on each haplotype in the study. There are 2820 rows, corresponding to different haplotype/repeat combinations. There are 8 columns, as follows:
column | description |
---|---|
hap | the haplotype ID |
repeat_type | the specific repeat type identified by RepeatMasker |
hapabund | the total base pairs covered by that repeat type on the haplotype |
species | the species |
maj_rep_type | the major category of repeat type. This was used in Fig. 2B of the paper |
indiv | the individual bird |
short_hap | the haplotype number of the individual |
hap_assem_len | the length of the haplotype assembly |
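The relationship between the 6-column bed file and the per-haplotype summary can be sketched as below. This is only an illustration, not the authors' script: it assumes that contig names are `#`-delimited as `sample#haplotype#contig`, and the example records and repeat names are invented.

```python
from collections import defaultdict


def haplotype_of(contig, sep="#"):
    # ASSUMPTION: contig names look like sample<sep>haplotype<sep>contig.
    sample, hap, _rest = contig.split(sep, 2)
    return f"{sample}{sep}{hap}"


def sum_repeat_bp(bed_lines, sep="#"):
    """Total base pairs per (haplotype, repeat type) from 6-column bed lines."""
    totals = defaultdict(int)
    for line in bed_lines:
        if not line.strip():
            continue
        contig, start, end, repeat_type = line.rstrip("\n").split("\t")[:4]
        # bed intervals are half-open, so length is end - start.
        totals[(haplotype_of(contig, sep), repeat_type)] += int(end) - int(start)
    return dict(totals)


# Hypothetical records, invented for illustration only.
example_bed = [
    "AW_365336#hap1#ctg1\t0\t100\trepA\t+\tLINE\n",
    "AW_365336#hap1#ctg2\t50\t100\trepA\t-\tLINE\n",
    "AW_365336#hap2#ctg1\t10\t40\trepB\t+\tSatellite\n",
]
print(sum_repeat_bp(example_bed))
```

For the real file, the lines could be streamed with `gzip.open(path, "rt")`. Overlapping annotations would be counted twice by this sketch.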
File: pggb_cleaned_final.vcf.gz
Description: This VCF file contains all the single nucleotide polymorphisms and structural variants retrieved from the entire data set. It was made by projecting the PGGB pangenome graph onto VCF format using vg deconstruct v1.40.0, as described in the associated manuscript. Variants from four species are included: AC=A. coerulescens; AI=A. insularis; AW=A. woodhouseii; and CY=Cyanocorax yucatanicus. There are 54 columns. The first nine are: #CHROM, AW reference chromosome and contig; POS, location in reference contig; ID, variant name (path in pangenome graph); REF, reference allele; ALT, variant allele; QUAL, variant quality; FILTER, filter applied (usually ., indicating none); INFO, standard metadata for VCF variants, including allele count, allele frequency, allele length, number of missing genotypes, type of variant, etc.; FORMAT, GT=genotype. The remaining 45 columns list each individual sample in the study for the four focal species: AC_1603_72852, AC_1603_72872, AC_1713_89780, AC_1873_10093, AC_1873_10702, AC_1873_10717, AC_1873_10752, AC_1873_10831, AC_1873_20908, AC_1873_20930, AC_1873_20934, AC_1873_20946, AC_1873_20954, AC_1873_20970, AI_1363_74536, AI_1363_74537, AI_1363_74539, AI_1363_74544, AI_1363_74563, AI_1363_74565, AI_1603_79174, AI_1603_79203, AI_1603_79218, AI_1603_79232, AI_1603_79302, AI_1603_79315, AI_1703_64680, AI_1833_00513, AI_1833_00687, AW_365326, AW_365327, AW_365335, AW_365336, AW_365337, AW_365338, AW_365339, AW_366487, AW_366488, AW_366490, AW_366493, AW_366494, AW_366497, AW_366498, AW_366499, CY_8788.
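Because the sample columns carry the species code as a prefix, the VCF can be split by species from the header line alone. The sketch below shows one way to do that, together with a rough size-based variant classifier. The 50-bp cutoff for calling a variant "structural" is an assumed convention, not something stated here, and equal-length rearrangements such as inversions would not be caught by it.

```python
def species_columns(header_line):
    """Map species code (first two letters of sample name) to column indices."""
    fields = header_line.rstrip("\n").split("\t")
    groups = {}
    for idx, name in enumerate(fields[9:], start=9):  # samples start at column 10
        groups.setdefault(name[:2], []).append(idx)
    return groups


def variant_class(ref, alt, sv_min_len=50):
    """Rough classification from REF/ALT lengths (sv_min_len is an ASSUMPTION)."""
    alts = alt.split(",")
    if len(ref) == 1 and all(len(a) == 1 for a in alts):
        return "SNP"
    max_diff = max(abs(len(a) - len(ref)) for a in alts)
    return "SV" if max_diff >= sv_min_len else "small_indel_or_MNP"


# Shortened header using sample names from the list above.
header = (
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t"
    "AC_1603_72852\tAI_1363_74536\tAW_365326\tAW_365327\tCY_8788\n"
)
print(species_columns(header))
print(variant_class("A", "G"), variant_class("A", "A" + "T" * 60))
```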
File: sj_pggb_graphs.tar.gz
Description: This tar file contains the main pangenome graphs made by the Pangenome Graph Builder (PGGB) and associated tables of graph depth. Each graph is in gfa format and was made using a combination of wfmash and pggb, as described in the associated paper.
There are a total of 135 files, as follows: 46 files (.gfa, in gfa pangenome graph format) are pggb pangenome graphs of each chromosome and major scaffold for all AW, AC and AI individuals in the study, including the CY outgroup. 88 files are tab-delimited files (*.tsv) that record the pangenome graph depth of each pggb .gfa file, in 1kb sliding windows or in 100kb sliding windows, as described in the associated publication. Finally, the file README.CHROMOSOME_COMMUNITIES_key.txt contains a key linking the 31 graphs and communities with chromosomal placements to specific chromosomes, as described in the paper. The remaining graphs and communities are not placed on any chromosome.
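As a quick sanity check after unpacking, a graph can be summarized by counting its GFA record types. The following is a minimal sketch that assumes GFA version 1 records (H, S, L and P lines); the toy graph and its path name are invented.

```python
from collections import Counter


def summarize_gfa(lines):
    """Count GFA record types and total bases stored on segment (S) lines."""
    counts = Counter()
    segment_bp = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        counts[fields[0]] += 1
        # S lines are: S <name> <sequence> [tags]; "*" means sequence not stored.
        if fields[0] == "S" and len(fields) > 2 and fields[2] != "*":
            segment_bp += len(fields[2])
    return counts, segment_bp


# Toy graph, invented for illustration only.
toy_gfa = [
    "H\tVN:Z:1.0\n",
    "S\t1\tACGT\n",
    "S\t2\tGG\n",
    "L\t1\t+\t2\t+\t0M\n",
    "P\tsample#1#ctg1\t1+,2+\t*\n",
]
counts, segment_bp = summarize_gfa(toy_gfa)
print(dict(counts), segment_bp)
```

The full graphs are large, so streaming line by line (for example from `gzip.open(path, "rt")`) avoids loading a whole graph into memory.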
File: initial_communityPartition.paf.gz
Description: This file contains the partitioning information for each contig into communities for input into pggb. It was made with wfmash, as described in the associated paper.
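PAF is a tab-delimited format whose first 12 columns are fixed, so the mappings can be read with a few lines of code. This sketch parses one record and adds a rough identity proxy (matching bases divided by alignment block length); the example record is invented, and any optional tags after column 12 are ignored.

```python
PAF_FIELDS = (
    "qname", "qlen", "qstart", "qend", "strand",
    "tname", "tlen", "tstart", "tend",
    "matches", "block_len", "mapq",
)
INT_FIELDS = {
    "qlen", "qstart", "qend", "tlen", "tstart", "tend",
    "matches", "block_len", "mapq",
}


def parse_paf_line(line):
    """Parse the 12 mandatory PAF columns into a dict."""
    values = line.rstrip("\n").split("\t")[:12]
    record = dict(zip(PAF_FIELDS, values))
    for key in INT_FIELDS:
        record[key] = int(record[key])
    # Rough identity proxy; not a substitute for a gap-aware estimate.
    block = record["block_len"]
    record["identity"] = record["matches"] / block if block else 0.0
    return record


# Hypothetical record, invented for illustration only.
example_paf = "query1\t1000\t0\t500\t+\ttarget1\t2000\t100\t600\t450\t500\t60\n"
print(parse_paf_line(example_paf))
```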
File: minigraph_bedfiles_vcfs.tar.gz
Description: This tar file contains 5 files:
- sj-92b.call.bed.gz: a bed file of structural variant calls made by minigraph. It has 144,279 rows, the first 14 of which contain information about the bed file and the column names. The remaining rows are structural variants in bed format. The first three columns are in standard bed format, indicating the contig coordinates. The remaining columns are 1) metadata associated with each variant; and 2) variant calls for each haplotype. The file includes individuals from the AW, AC and AI species as well as a CS outgroup (Cyanocitta stelleri). A total of 92 haplotypes are included in this file and in the associated VCF file and pangenome graph.
- sj-92b.call.bed.gz.tbi: an index of the above bed file.
- sj-92b.call.vcf.gz: contains the same information as sj-92b.call.bed.gz, but in standard VCF format.
- sj-92b.call.vcf.gz.tbi: an index of the above VCF file.
- sj-92b.gfa.gz: the minigraph pangenome graph of structural variants, in gfa graph format.
File: aphWoo1_kallistoExpression.tsv.gz
Description: A tab-delimited file (aphWoo1_kallistoExpression.tsv) containing estimates of transcript abundance from Kallisto for each sample for which gene expression data were obtained. There are 3,395,120 rows, each one listing a given transcript. There are 9 columns: target_id, TOGA annotation; length, length of the TOGA annotation; eff_length, transcript length given uncertainty of start sites, as described; est_counts, the number of RNA-seq reads aligning to a given gene; tpm, transcripts per million; samp_id, sample name; tissue, the tissue sampled; spp, the species (AW or AC); and geneID, gene name from the annotation file.
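Since each row is a transcript in one sample, gene-level abundance can be obtained by summing tpm over the transcripts of a gene within each sample. The sketch below assumes the file has a header row with exactly the nine column names listed above; the sample, tissue and gene values are invented.

```python
import csv
import io
from collections import defaultdict


def gene_tpm_per_sample(handle):
    """Sum transcript-level tpm to gene level for each (geneID, samp_id) pair."""
    totals = defaultdict(float)
    # ASSUMPTION: the first line is a header with the documented column names.
    for row in csv.DictReader(handle, delimiter="\t"):
        totals[(row["geneID"], row["samp_id"])] += float(row["tpm"])
    return dict(totals)


# Hypothetical rows, invented for illustration only.
example_tsv = io.StringIO(
    "target_id\tlength\teff_length\test_counts\ttpm\tsamp_id\ttissue\tspp\tgeneID\n"
    "tx1\t1000\t800\t10\t1.5\ts1\tliver\tAW\tgeneA\n"
    "tx2\t1200\t1000\t20\t2.5\ts1\tliver\tAW\tgeneA\n"
    "tx1\t1000\t800\t5\t0.5\ts2\tliver\tAC\tgeneA\n"
)
gene_tpm = gene_tpm_per_sample(example_tsv)
print(gene_tpm)
```

For the real file, the handle could come from `gzip.open(path, "rt", newline="")`.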
File: sj_annotations.tar.gz
Description: This tar file contains two files: 1) aphWoo1_geneAnnot.gtf, a gtf file containing the gene annotations for the AW reference, as explained in the main manuscript; and 2) aphWoo1_repeatAnnot.gff, an annotation of the main AW reference for repetitive DNA, as detected by RepeatModeler2 and RepeatMasker.
File: all_sj_haps_RepeatMasker_bedfiles.tar.gz
Description: This tar file contains bed files of the complete outputs by RepeatMasker on all haplotypes in the data set. The bed files were produced by the rmsk2bed command of bedops and contain all columns of that output. This is the same information as in the file all_haps_repmask_nornd_cat_CS_CY.bed.gz, except that these bed files contain all columns of the RepeatMasker output, as opposed to the six main ones occurring in all_haps_repmask_nornd_cat_CS_CY.bed.gz.
File: AW_365336_combined_repeats_RepMask.fasta.gz
Description: This fasta file contains the repeat library used to annotate the main AW reference and all the haplotypes. It is used as input to the RepeatMasker program.
File: satellite_analysis.tar.gz
Description: This file contains four files, as follows:
- File: concat_srf_satellite_len_files_table.txt
Description: This text file has 8 columns and 11770 rows. Each row is a separate combination of satellite and individual, including 298 different satellites identified by Satellite Repeat Finder. The file is a concatenation of files for each of the 44 individuals analyzed. The 8 columns are as follows:
column | description |
---|---|
satellite | satellite ID |
tot_bp_reads | total base pairs of satellite in fastq reads as measured by satellite repeat finder |
without_long | proportion of total reads comprised by satellite, not including the longest (and possibly spurious) satellites |
with_long | proportion of total reads comprised by satellite, including the longest (and possibly spurious) satellites |
indiv | individual bird ID |
unit_length | unit length of satellite |
species | species |
length_category | category of length, for plotting |
- File: sj_sats_genomes_CONCAT.len
Description: This file contains information on satellite abundances as measured by Satellite Repeat Finder in primary assemblies of each of 45 individuals. There are 10732 rows corresponding to different combinations of individuals and 298 distinct satellites. The 7 columns correspond to the following variables:
column | description |
---|---|
satellite | satellite ID |
tot_bp_assembly | total base pairs of satellite in primary assembly |
divergence | average sequence divergence of satellite repeat units |
without_long | proportion of assembly comprised by satellite, not including the longest (and possibly spurious) satellites |
with_long | proportion of assembly comprised by satellite, including the longest (and possibly spurious) satellites |
indiv | individual bird |
species | species |
- File: top_28_sat_list_from_heatmap.txt
Description: This file contains a list of 28 of the most common satellites in the data set, in one column. It is used to parse the above files to include only these 28 most common satellites.
- File: final_satellite_summary_reads_assemblies_dryad.txt
Description: This is the final summary file produced by integrating and summarizing the above three files. The R script available on Dryad will produce this file. There are 7 columns, with names corresponding to those in the previous files. The one column not found in the previous files is pri_assem_len, the length in base pairs of the primary assembly of a given individual.
File: base_composition_analysis_tables.tar.gz
Description: This archive contains 8 files, all of which describe the base composition of different components of the genome of individual assembled haplotypes in the data set (see table below). All files are bed or bed-like files: the first column designates the contig (with species and haplotype information); the second and third columns give the start and end positions of the region queried; and the last four columns give the number of As, Gs, Ts, and Cs in the region, as measured by seqtk comp. All files have seven columns except cat_all_haps_seqtk_basecomp.txt.gz, which has 6 because it records base composition across whole contigs, which require only the length of each contig rather than start and end coordinates. These files are simplified versions of the seqtk comp output, with seven columns removed for simplicity and storage constraints. See https://github.com/lh3/seqtk/issues/47 for a full explanation of the seqtk comp output.
file | subgenome | cols | rows |
---|---|---|---|
cat_all_haps_seqtk_basecomp.txt.gz | all contigs, whole genome | 6 | 117137 |
cat_all_haps_all_sats_seqtk_comp.bed.gz | all regions annotated as satellite | 7 | 5492802 |
cat_all_haps_non_sat_bed_seqtk_comp.bed.gz | all regions not annotated as satellite | 7 | 4600215 |
cat_all_haps_sj_circ1_2268_bed_seqtk_comp.bed.gz | all regions annotated as satellite sj_circ1_2268 | 7 | 85242 |
cat_all_haps_sj_sat_18kb_seqtk_comp.bed.gz | all regions annotated as satellite sj_sat_circ30_18193 | 7 | 28254 |
cat_all_haps_sj_sat_circ2_2119_seqtk_comp.bed.gz | all regions annotated as satellite sj_sat_circ2_2119 | 7 | 13287 |
cat_all_haps_sj_sat_circ4_2216_seqtk_comp.bed.gz | all regions annotated as satellite sj_sat_circ4_2216 | 7 | 53737 |
cat_all_haps_sj_sat_circ7-2218_seqtk_comp.bed.gz | all regions annotated as satellite sj_sat_circ7-2218 | 7 | 118010 |
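GC content can be derived from the last four columns of any of these tables. The sketch below takes the base order as a parameter because it should be verified against the files: the default `"ACGT"` is the order in which seqtk comp reports counts, and is an assumption about these simplified tables. The example lines are invented.

```python
def gc_fraction(count_fields, order="ACGT"):
    """GC fraction from four base counts given in the stated order."""
    counts = dict(zip(order, (int(x) for x in count_fields)))
    total = sum(counts.values())
    return (counts["G"] + counts["C"]) / total if total else float("nan")


def region_gc(line, order="ACGT"):
    """Return (contig, GC fraction); works for the 6- and 7-column tables."""
    fields = line.rstrip("\n").split("\t")
    # ASSUMPTION: the base counts are always the last four columns.
    return fields[0], gc_fraction(fields[-4:], order)


# Hypothetical lines, invented for illustration only.
seven_col = "sample#hap1#ctg1\t0\t100\t10\t20\t30\t40\n"
six_col = "sample#hap1#ctg1\t100\t10\t20\t30\t40\n"
print(region_gc(seven_col), region_gc(six_col))
```

Passing a different `order` string (for example `"AGTC"`) changes which columns are treated as G and C, so it is worth checking one small region against its sequence before trusting the result.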
File: all_sj_haps_sj_sat_circ30_18193_mafft.fa.gz
Description: This is the fasta alignment of 27,083 18-kb satellite repeats described in the associated paper. As explained in the paper, the individual satellite units were extracted from the haplotype assemblies using minimap2 and seqtk subseq. The satellite units were then aligned using mafft.
File: bpp_files.tar.gz
Description: This file contains the input data for analysis with Bayesian Phylogenetics and Phylogeography (bpp). It consists of 2489 alignments of highly mappable regions as defined by dipcall, and can be used as input for bpp to estimate parameters of the species tree of jays, as described in the paper.
File: pangene-1.1-bin-sj.tar.bz2
Description: This folder contains multiple folders and subfolders needed to set up a local, interactive html viewer of our pangene results, using the script pangene-1.1-bin-sj/00run.sh. You can also view this interactive html at https://pangene.bioinweb.org/. You can select a view centered on a specific gene and examine a visualization of the pangenome graph around that gene. We also include two pangene gene graphs, which can be used to examine copy number variants using command-line tools. These are the files used in the associated paper and are described as follows:
- File: sj-94-a2.pg.gfa.gz
Description: This file, in the folder pangene-1.1-bin-sj/sj, contains the pangene gene graph describing copy number variants within and among scrub-jay species. This file contains some poorly annotated "LOC" genes.
- File: sj-94nl-a2.pg.gfa.gz
Description: This file, also in the pangene-1.1-bin-sj/sj folder, contains the pangene gene graph describing copy number variants within and among scrub-jay species. This file does not contain any poorly annotated "LOC" genes.
File: pangene_analysis.tar.gz
Description: This file contains four files, as follows:
- sj-94nl-a2.pg.CNV.Rtable: This text file presents the presence-absence variation of 14112 genes across 94 scrub-jay haplotypes, and was produced by post-processing of files produced by pangene. It has 95 columns and 14112 rows. The first column is the gene name and the remaining 94 columns are two haplotypes each from 47 birds, in the species Aphelocoma californica (AA), A. woodhouseii (AW), A. insularis (AI), A. coerulescens (AC), Cyanocorax yucatanicus (CY) and Cyanocitta stelleri (CS). This table includes only well-annotated genes (no "LOC" annotations).
- sj-94-a2.pg.CNV.Rtable: This file is the same as the above, but contains 15733 genes, including any "LOC" genes (less well-annotated).
- Z_linked_gene_list_684.txt: This file contains a list of 684 genes found on the Z chromosome of Aphelocoma woodhouseii reference assembly. It is used to filter the pangene tables above to remove sex-linked genes, yielding only autosomal genes. It was produced using R scripts archived at Zenodo (DOI: 10.5281/zenodo.16053688).
- pangene_auto_no_haps_poly_stats_final_table.txt: This table contains various filters used to count copy-number variants. It has 13515 rows, corresponding to the 13515 autosomal genes, and 14 columns, the first one being the gene ID. The remaining 13 columns are as follows, using the species designations above:
column | description |
---|---|
sumAI | sum of gene copies in AI |
sumAC | sum of gene copies in AC |
sumAW | sum of gene copies in AW |
polyAI | number of CNVs in AI, counting only variants appearing in at least two haplotypes |
polyAC | number of CNVs in AC, counting only variants appearing in at least two haplotypes |
polyAW | number of CNVs in AW, counting only variants appearing in at least two haplotypes |
AC_0_1 | Logical, whether the gene displays only a single gene (1) or deletions (0), in AC haplotypes |
AI_0_1 | Logical, whether the gene displays only a single gene (1) or deletions (0), in AI haplotypes |
AW_0_1 | Logical, whether the gene displays only a single gene (1) or deletions (0), in AW haplotypes |
num_copy_num_states_AI | Number of distinct haplotypes in terms of gene copy numbers in AI |
num_copy_num_states_AC | Number of distinct haplotypes in terms of gene copy numbers in AC |
num_copy_num_states_AW | Number of distinct haplotypes in terms of gene copy numbers in AW |
num_sp_poly_poly | Number of species (1-3) polymorphic for copy number of a given gene, counting only variants appearing in at least two haplotypes |
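The Z-linked gene list is described above as a filter for the pangene tables. A minimal sketch of that filtering step follows, written in Python rather than the R used by the authors. It assumes the tables are whitespace-delimited with one header line and the gene name in the first field; the gene names and counts are invented.

```python
def read_gene_set(lines):
    """One gene name per line, as in the Z-linked gene list."""
    return {line.strip() for line in lines if line.strip()}


def drop_genes(table_lines, excluded):
    """Return (header, rows) with rows for excluded genes removed."""
    rows = [line.rstrip("\n") for line in table_lines if line.strip()]
    header, body = rows[0], rows[1:]
    # ASSUMPTION: the first whitespace-delimited field of each row is the gene name.
    kept = [row for row in body if row.split()[0] not in excluded]
    return header, kept


def copy_number_states(row):
    """Distinct per-haplotype copy numbers observed for one gene."""
    return {int(value) for value in row.split()[1:]}


# Hypothetical table and gene list, invented for illustration only.
example_table = [
    "gene hapA hapB hapC\n",
    "geneX 1 1 1\n",
    "geneZ 1 0 1\n",
    "geneY 1 2 1\n",
]
z_linked = read_gene_set(["geneZ\n"])
header, autosomal = drop_genes(example_table, z_linked)
print([row.split()[0] for row in autosomal])
```

A gene with more than one value in `copy_number_states(row)` varies in copy number among the haplotypes in that row.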
File: KING.tar.gz
Description: This file contains three files listing the pairwise relatedness of each individual within each of species AC, AI and AW. Each file has six columns, as follows: #IID1, the first individual of the dyad; IID2, the second individual of the dyad; NSNP, the number of SNPs used in the calculation of relatedness; HETHET, the proportion of sites at which both individuals are heterozygous; IBS0, the proportion of sites with 0 alleles shared; and KINSHIP, the estimate of relatedness between the two individuals.
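The KINSHIP column can be binned into relationship classes. The sketch below uses the cutoffs commonly quoted in the KING documentation (0.354, 0.177, 0.0884 and 0.0442); applying them to these jay populations is an assumption of this example and not a procedure described here. The individual IDs and values are invented.

```python
def relationship(kinship):
    """Bin a kinship coefficient using the conventional KING cutoffs."""
    if kinship > 0.354:
        return "duplicate/MZ twin"
    if kinship > 0.177:
        return "1st-degree"
    if kinship > 0.0884:
        return "2nd-degree"
    if kinship > 0.0442:
        return "3rd-degree"
    return "unrelated"


def classify_pairs(lines):
    """Yield (IID1, IID2, class) from a six-column relatedness table."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the '#IID1 ...' header
        iid1, iid2, _nsnp, _hethet, _ibs0, kinship = line.split()[:6]
        yield iid1, iid2, relationship(float(kinship))


# Hypothetical table, invented for illustration only.
example_kin = [
    "#IID1\tIID2\tNSNP\tHETHET\tIBS0\tKINSHIP\n",
    "birdA\tbirdB\t1000\t0.10\t0.01\t0.25\n",
    "birdA\tbirdC\t1000\t0.10\t0.05\t0.01\n",
]
pairs = list(classify_pairs(example_kin))
print(pairs)
```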
Version changes
01-Aug-2025: Several files were cleaned up and made less redundant, such as the pangene gene graphs (*.gfa), which were in two separate tar files. SVE also added several data tables pertaining to base composition, satellite DNA analysis, RepeatMasker analysis and pangene gene graphs. These files include basic data tables as well as summaries behind some of the main results reported in the associated paper. The scripts used to generate these files are archived at Zenodo (DOI: 10.5281/zenodo.16053688; see below).
Code/software
To set up the interactive pangene viewer, run the script 00run.sh. You will also need a web browser, such as Firefox or Safari, to view the HTML. The README file for the viewer can be found in the unzipped path: pangene-1.1-bin-sj/doc/README.md
You can view the *.gfa files in a GFA viewer such as Bandage (Wick et al. 2015).
Wick, R. R., Schultz, M. B., Zobel, J., & Holt, K. E. (2015). Bandage: interactive visualization of de novo genome assemblies. Bioinformatics, 31(20), 3350-3352. doi:10.1093/bioinformatics/btv383
Access information
Other publicly accessible locations of the data:
- The original sequence data can be found at NCBI under Umbrella BioProject PRJNA1206191, the Scrub-Jay (Aphelocoma) Pangenome Project. Within this project, BioProject PRJNA1204306 contains the PacBio HiFi reads (with samples SAMN46016487 - SAMN46016457), whereas BioProjects PRJNA1204814 - PRJNA1204903 contain the haplotype assemblies. All scripts can be found at https://github.com/harvardinformatics/scrub-jay-genomics and https://github.com/fangbohao/scrub-jay-pangenome.
- The pangene gene graphs can be viewed interactively at https://pangene.bioinweb.org/.
- Associated scripts used to generate some of the data files here can be found at DOI:10.5281/zenodo.16053688 (https://zenodo.org/records/16635922)
Data was derived from the following sources:
- Museum of Comparative Zoology was the source for tissues of A. woodhouseii.
Forty-four genomes from three species of North American scrub-jays (Aphelocoma insularis, A. woodhouseii and A. coerulescens) and one outgroup (Yucatán Jay, Cyanocorax yucatanicus) were sequenced using PacBio HiFi technology. The sequence reads were assembled into primary assemblies and two haplotype assemblies using hifiasm (Cheng et al. 2021). We used various pangenome tools, including the Pangenome Graph Builder (PGGB; Garrison et al. 2024) and minigraph (Li et al. 2020), to detect and characterize structural variants, including inversions, within and between species. We used RepeatModeler2 and RepeatMasker to annotate repetitive elements (Smit et al. 2015; Flynn et al. 2020). We conducted demographic analysis with PSMC (Li et al. 2011), bpp (Rannala et al. 2017) and other programs. We used Panacus to estimate growth curves for the pangenome graphs (Parmigiani et al. 2024), and fastDFE (Sendrowski et al. 2024) and anavar (Barton et al. 2018) to estimate the distribution of selection coefficients. We used Pangene to estimate pangene graphs within and between species (Li et al. 2024).
Barton HJ, Zeng K. 2018. New Methods for Inferring the Distribution of Fitness Effects for INDELs and SNPs. Mol Biol Evol: 35:1536-1546.
Cheng H, Concepcion GT, Feng X, Zhang H, Li H. 2021. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods: 18:170-175.
Flynn JM, Hubley R, Goubert C, Rosen J, Clark AG, Feschotte C, Smit AF. 2020. RepeatModeler2 for automated genomic discovery of transposable element families. Proc Natl Acad Sci U S A: 117:9451-9457.
Garrison E, Guarracino A, Heumos S, Villani F, Bao Z, Tattini L, Hagmann J, Vorbrugg S, Marco-Sola S, Kubica C, et al. 2024. Building pangenome graphs. Nat Methods: 21:2008-2012.
Li H, Durbin R. 2011. Inference of human population history from individual whole-genome sequences. Nature: 475:493-496.
Li H, Feng X, Chu C. 2020. The design and construction of reference pangenome graphs with minigraph. Genome Biol: 21:265.
Li H, Marin M, Farhat MR. 2024. Exploring gene content with pangene graphs. Bioinformatics: 40.
Parmigiani L, Garrison E, Stoye J, Marschall T, Doerr D. 2024. Panacus: fast and exact pangenome growth and core size estimation. Bioinformatics: 40.
Rannala B, Yang Z. 2017. Efficient Bayesian species tree inference under the multispecies coalescent. Systematic Biology: 66:823-842.
Sendrowski J, Bataillon T. 2024. fastDFE: Fast and Flexible Inference of the Distribution of Fitness Effects. Molecular Biology and Evolution: 41:msae070.
Smit AF, Hubley R, Green P. 2015. RepeatMasker Open-4.0. <http://www.repeatmasker.org>.