Fitness consequences of structural variation inferred from a House Finch pangenome
Data files
Nov 07, 2024 version files 64.89 GB
- pggb_cleaned_final_noAT.vcf.gz (1.21 GB)
- README.md (3.47 KB)
- VCF_INFO_Housefinch_PGGB_Pangenome.txt.zip (610.09 MB)
- VGP_prim_SUPER_1.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (9.62 GB)
- VGP_prim_SUPER_10.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (1.31 GB)
- VGP_prim_SUPER_11.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (1.18 GB)
- VGP_prim_SUPER_12.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (1.08 GB)
- VGP_prim_SUPER_13.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (1.04 GB)
- VGP_prim_SUPER_14.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (1.05 GB)
- VGP_prim_SUPER_15.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (748.89 MB)
- VGP_prim_SUPER_16.pan.fa.gz.7b8a423.4030258.cc4afae.smooth.fix.gfa (2.67 GB)
- VGP_prim_SUPER_17.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (731.27 MB)
- VGP_prim_SUPER_18.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (768.13 MB)
- VGP_prim_SUPER_19.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (655.82 MB)
- VGP_prim_SUPER_2.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (7.44 GB)
- VGP_prim_SUPER_20.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (742.05 MB)
- VGP_prim_SUPER_21.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (546.30 MB)
- VGP_prim_SUPER_22.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (519.75 MB)
- VGP_prim_SUPER_23.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (481.95 MB)
- VGP_prim_SUPER_24.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (442.03 MB)
- VGP_prim_SUPER_25.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (409.40 MB)
- VGP_prim_SUPER_26.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (365.97 MB)
- VGP_prim_SUPER_27.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (404.70 MB)
- VGP_prim_SUPER_28.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (366.58 MB)
- VGP_prim_SUPER_29.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (377.03 MB)
- VGP_prim_SUPER_3.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (6.35 GB)
- VGP_prim_SUPER_30.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (356.38 MB)
- VGP_prim_SUPER_31.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (296.86 MB)
- VGP_prim_SUPER_32.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (150.16 MB)
- VGP_prim_SUPER_33.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (245.50 MB)
- VGP_prim_SUPER_34.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (165.12 MB)
- VGP_prim_SUPER_35.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (75.77 MB)
- VGP_prim_SUPER_36.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (95.86 MB)
- VGP_prim_SUPER_37.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (174.51 MB)
- VGP_prim_SUPER_38.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (47.39 MB)
- VGP_prim_SUPER_39.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (44.46 MB)
- VGP_prim_SUPER_4.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (5.05 GB)
- VGP_prim_SUPER_5.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (4.38 GB)
- VGP_prim_SUPER_6.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (3.50 GB)
- VGP_prim_SUPER_7.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (2.41 GB)
- VGP_prim_SUPER_8.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (1.57 GB)
- VGP_prim_SUPER_9.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (1.50 GB)
- VGP_prim_SUPER_W.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (50 MB)
- VGP_prim_SUPER_Z.pan.fa.gz.dac1d73.c2fac19.754753f.smooth.final.gfa (3.66 GB)
Abstract
Genomic structural variants (SVs) play a crucial role in adaptive evolution, yet their average fitness effects and characterization with pangenome tools are understudied in wild animal populations. We constructed a pangenome for House Finches, a model for studies of host-pathogen coevolution, using long-read sequence data on 16 individuals (32 de novo-assembled haplotypes) and one outgroup. We identified 643,207 SVs larger than 50 base pairs, mostly (60%) involving repetitive elements, with reduced SV diversity in the eastern US as a result of its introduction by humans. The distribution of fitness effects of genome-wide SVs was estimated using maximum likelihood approaches and showed SVs in both coding and non-coding regions to be on average more deleterious than smaller indels or single nucleotide polymorphisms. Our reference-free pangenome facilitated the discovery of a 10-million-year-old, 11-megabase-long pericentric inversion, whose genotype frequency increased steadily over the 25 years since House Finches were first exposed to the bacterial pathogen Mycoplasma gallisepticum and which showed signatures of balancing selection, capturing genes related to immunity and telomerase activity. We also observed shorter telomeres in populations with a greater number of years of exposure to Mycoplasma. Our study illustrates the utility of applying pangenome methods to wild animal populations, helps estimate the fitness effects of genome-wide SVs, and advances our understanding of adaptive evolution through structural variation.
Overview
This repository contains data files related to the pangenome analysis of House Finch genomes. The data includes:
- PGGB Pangenome Graph Files (.gfa): Pangenome graphs generated per chromosome using the PanGenome Graph Builder (PGGB).
- VCF File (pggb_cleaned_final_noAT.vcf.gz): A Variant Call Format (VCF) file containing variant information decomposed from the pangenome graphs (see methods in the paper).
- Processed Variant Information (vcf_info_38chrs): A data file containing detailed variant annotations and metrics.
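Several of the .gfa files are multiple gigabytes, so a streaming pass is a reasonable way to get a first look at one. The sketch below assumes the files follow the GFA version 1 layout, in which each line is tab-delimited and its first field is a record type (for example H for header, S for segment, L for link, P for path). The function name `tally_gfa_records` and the toy records are illustrative and do not come from this repository.

```python
from collections import Counter


def tally_gfa_records(lines):
    """Count GFA records by type (the first tab-delimited field of each line).

    `lines` can be any iterable of strings, including an open file handle,
    so a large graph is never loaded into memory at once.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        record_type = line.split("\t", 1)[0]
        counts[record_type] += 1
    return counts


# Toy GFA v1 records (hypothetical, not taken from the House Finch graphs).
example = [
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "S\t2\tTT",
    "L\t1\t+\t2\t+\t0M",
    "P\texample_path\t1+,2+\t*",
]
counts = tally_gfa_records(example)
print(dict(counts))
```

To apply the same function to one of the graph files, pass an open handle instead of a list, for example `with open(path) as fh: tally_gfa_records(fh)`, where `path` points to a downloaded .gfa file.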
Processed Variant Information (vcf_info_38chrs)
Column Descriptions
- CHROM: Chromosome identifier where the variant is located.
- POS: Position of the variant on the chromosome (1-based coordinate).
- REF: Reference allele nucleotide(s) at the variant position.
- ALT: Alternative allele nucleotide(s) at the variant position.
- N_allele: Total number of alleles at the variant site (including REF and all ALT alleles).
- N_bp_ref: Length (in base pairs) of the reference allele (REF).
- N_bp_diff: Maximum difference in length (in base pairs) between the longest and shortest alleles at the variant site.
- type: General classification of the variant based on allele lengths and differences:
  - SNP: Single Nucleotide Polymorphism (length REF = length ALT = 1 bp).
  - MNP: Multiple Nucleotide Polymorphism (length REF = length ALT, between 1 and 50 bp).
  - INDEL: Insertion or Deletion less than 50 bp (size difference between alleles < 50 bp).
  - SV: Structural Variant (size difference ≥ 50 bp).
  - SV_complex: Complex Structural Variant not fitting standard criteria.
- N_bp: Length of the variant (in base pairs). For SNPs, N_bp = 1; for other variants, N_bp = N_bp_diff.
- invariants_HF: Logical flag indicating whether the site is invariant among House Finch haplotypes (TRUE if invariant).
- ancestry_polarized: Logical flag indicating whether the variant could be polarized using the outgroup genotype (Common Rosefinch haplotype 1). Polarization was limited to sites where the Common Rosefinch genotype was homozygous or had one allele missing (genotypes 0/0, 1/1, 0/., or 1/.).
- DAC_west: Derived Allele Count in the western House Finch population group.
- DAC_east: Derived Allele Count in the eastern House Finch population group.
- DAC_all: Derived Allele Count across all House Finch populations.
- Geno_RF_hap1: Genotype of the outgroup (Common Rosefinch haplotype 1) used for ancestral allele determination. “.” indicates missing data.
- MC_west: Missing genotype counts in the western population group.
- MC_east: Missing genotype counts in the eastern population group.
- MC_all: Missing genotype counts across all populations.
- MC_ratio: Ratio of missing genotypes across all populations (MC_all / total number of haplotypes).
- allele_count_West: Number of unique alleles observed in the western population group.
- allele_count_East: Number of unique alleles observed in the eastern population group.
- allele_count_All: Number of unique alleles observed across all populations.
- subtype: Detailed classification of the variant subtype based on size and ancestral allele (see Supporting Methods of the paper).
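The length rules for the type column and the genotype rule for ancestry_polarized can be sketched in a few lines. This is an illustration of the definitions listed above, not the authors' pipeline. The function names are hypothetical, the MNP upper bound is assumed to be exclusive (allele length below 50 bp), the ancestral allele is assumed to be the allele carried by the outgroup, and cases the column descriptions leave undefined (SV_complex, or equal-length alleles of 50 bp or more) return None rather than a guess.

```python
def classify_variant(ref, alts):
    """Assign SNP / MNP / INDEL / SV from allele lengths.

    Returns None for cases the column descriptions do not define
    (for example SV_complex); those need the paper's Supporting Methods.
    """
    lengths = [len(ref)] + [len(alt) for alt in alts]
    diff = max(lengths) - min(lengths)  # corresponds to N_bp_diff
    if diff == 0:
        if lengths[0] == 1:
            return "SNP"
        if lengths[0] < 50:  # assumption: 50 bp bound is exclusive
            return "MNP"
        return None  # equal-length alleles >= 50 bp: not defined here
    if diff < 50:
        return "INDEL"
    return "SV"


# Outgroup genotypes listed as usable for polarization, mapped to the
# allele index assumed to be ancestral (the allele the outgroup carries).
POLARIZABLE = {"0/0": 0, "1/1": 1, "0/.": 0, "1/.": 1}


def ancestral_allele_index(outgroup_gt):
    """Return the assumed ancestral allele index, or None if not polarizable."""
    return POLARIZABLE.get(outgroup_gt)


print(classify_variant("A", ["T"]))
print(classify_variant("A", ["A" * 60]))
print(ancestral_allele_index("1/."))
print(ancestral_allele_index("0/1"))
```

Returning None for undefined cases keeps the sketch from silently mislabeling sites whose classification depends on criteria given only in the paper.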
The House Finch pangenome was constructed using de novo genome assemblies from 16 samples and a Common Rosefinch outgroup. A chromosome-level reference from the Vertebrate Genomes Project aided in establishing stable genomic coordinates for pangenomics analysis. The pangenome includes 35 haplotype assemblies and was constructed using the PanGenome Graph Builder (PGGB), which allows for a detailed representation of genomic variants across autosomes.