Data for: Pangenome analysis reveals local adaptation to climate driven by introgression in oak species
Data files
Apr 15, 2025 version files 23.81 GB
- Code.zip (52.54 KB)
- README.md (3.98 KB)
- VCF_data.tar (23.81 GB)
Abstract
The genetic basis of local adaptation has been extensively studied in natural populations. However, a comprehensive genome-wide perspective on the contribution of structural variants (SVs) and adaptive introgression to local adaptation remains limited. In this study, we performed de novo assembly and annotation of 22 representative accessions of Quercus variabilis, identifying a total of 543,372 SVs. These SVs play crucial roles in shaping genomic structure and influencing gene expression. By analysing range-wide genomic data, we identified both SNPs and SVs associated with local adaptation in Q. variabilis and Q. acutissima. Notably, SV-outliers exhibit selection signals that did not overlap with SNP-outliers, indicating that SNP-based analyses may not detect the same candidate genes associated with SV-outliers. Remarkably, 29−37% of candidate SNPs were located in a 250 kb region on chromosome 9, referred to as Chr9-ERF. This region contains eight duplicated ethylene-responsive factor (ERF) genes, which may have contributed to local adaptation of Q. variabilis and Q. acutissima. We also found that a considerable number of candidate SNPs were shared between Q. variabilis and Q. acutissima in the Chr9-ERF region, suggesting a pattern of repeated selection. We further demonstrated that advantageous variants in this region were introgressed from western populations of Q. acutissima into Q. variabilis, providing compelling evidence that introgression facilitates local adaptation. This study offers a valuable genomic resource for future studies on oak species and highlights the importance of pan-genome analysis in understanding mechanisms driving adaptation and evolution.
Description of the data and file structure
There are two archive files here: Code.zip contains all the code for this analysis, and VCF_data.tar contains the VCF files of SNP and SV calls used in this analysis.
File: Code.zip
All scripts are in the folder “src”.
Part 1: pan-genome analyses
1.1. assembly.sh
1) For the reference individual
2) For the other pan-genome individuals
1.2. repeat elements.sh
1) For the reference individual
2) For the other pan-genome individuals
1.3. annotations.sh
1.4. Pangenome_graph.sh
1) RUN PBSV for SV calling
2) RUN SVIM for SV calling
3) RUN Sniffles for SV calling
4) RUN cuteSV for SV calling
5) RUN SVIMasm for SV calling
6) RUN Assemblytics for SV calling
7) RUN merge SVs for SV calling
8) RUN winnowmap for SNP calling
9) RUN deepvariant for indel calling
10) Construct the Pan-genome Graph
1.5 Collinearity Analyses.sh
1) Collinearity among Q. variabilis genome assemblies
2) Variation in synteny diversity across the genome
3) Synteny relationship index (SRI) between each pair of Q. variabilis genomes.
1.6 SV calling.sh
1) Genotype SVs in each of the 257 resequenced individuals
2) Merge VCFs
3) Filter low-quality SVs
Part 2: Population genetic analyses
2.1 SNP_calling_filter.sh
1) Build Indexes
2) Filter the raw data
3) Mapping
4) SNP calling using GATK
5) SNP filtering
2.2 population_structure.sh
1) Prepare input
2) PCAngsd
3) NGSadmix
4) NJ tree
2.3 summary_statistics.sh
1) Calculate the number of valid sites
2) Calculate tP (pairwise nucleotide diversity) and Tajima's D
3) Calculate Fay & Wu's H
4) Calculate ZnS
5) Calculate Fst
6) Calculate dxy
7) Calculate RND
2.4 demography.sh
1) Perform fsc26
2.5 local_adaptation.sh
1) GF
2) LFMM2
3) RDA
4) pcadapt
2.6 selective_sweep.sh
Perform XP-CLR
2.7 co-directional_SNPs.sh
Identification of co-directional and anti-directional SNPs among the 323 shared outlier SNPs
2.8 introgression.sh
1) Conduct ML tree
2) Perform ELAI
3) Calculate fdm
4) Calculate df
File: VCF_data.tar
This archive contains the VCF files with SNP and SV calls for a total of 257 individuals across three oak species.
Dataset Overview:
VCF_data.tar
    SNPs_data
        Chr*a.mac.recode.vcf.gz
    SVs_data
        Qv_sv_minQ.mis.recode.vcf.gz
The Chr*a.mac.recode.vcf.gz files contain the SNP calls for each chromosome, generated by GATK. They comprise 21,695,563 SNPs.
The Qv_sv_minQ.mis.recode.vcf.gz file contains the genome-wide SV calls, generated by vg. It comprises 20,072,814 SVs.
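As a quick sanity check on the counts above, the records in each compressed VCF can be tallied directly. The following is a minimal sketch using only the Python standard library. The function names `count_vcf_records` and `count_records_per_file` are hypothetical and are not part of Code.zip. It assumes the files are gzip- or bgzip-compressed text, which Python's `gzip` module can read.

```python
import glob
import gzip


def count_vcf_records(path):
    """Count variant records (non-header, non-empty lines) in a gzip-compressed VCF."""
    n = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Header and metadata lines start with '#'; everything else is a record.
            if line.strip() and not line.startswith("#"):
                n += 1
    return n


def count_records_per_file(pattern):
    """Return {file path: record count} for every file matching a glob pattern."""
    return {path: count_vcf_records(path) for path in sorted(glob.glob(pattern))}
```

Assuming the archive extracts to the layout shown in the overview, `count_records_per_file("SNPs_data/Chr*a.mac.recode.vcf.gz")` would give per-chromosome counts whose sum can be compared with the SNP total stated above. Streaming files of this size line by line in Python is slow, so this is only meant as an illustration.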
File Details:
- File Name: Chr*a.mac.recode.vcf / Qv_sv_minQ.mis.recode.vcf
- File Format: VCF (Variant Call Format) Version 4.2
- Date: October 7, 2024
- Source Software: GATK / vg
Data Description:
The VCF file includes the following columns standard to the format:
- #CHROM: Chromosome number
- POS: Position of the SNP/SV on the chromosome
- ID: Identifier of the SNP/SV
- REF: Reference base
- ALT: Alternate base(s)
- QUAL: Quality score of the SNP/SV
- FILTER: Filter status
- INFO: Additional information (e.g., allele frequency, number of samples)
- FORMAT: Data format
- Sample columns: One per individual, containing genotype information
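The column layout described above can be illustrated with a small parser for a single tab-separated data line. This is a minimal sketch, not the authors' code: `parse_vcf_line` and the sample names in the usage note are hypothetical, and the sketch assumes the nine fixed columns are followed by one column per individual, as in standard VCF 4.2.

```python
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def parse_vcf_line(line, sample_names):
    """Split one VCF data line into its fixed columns and per-sample fields.

    sample_names is the list of sample IDs taken from the '#CHROM' header line.
    """
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(FIXED_COLUMNS, fields[:9]))
    record["POS"] = int(record["POS"])        # 1-based position on the chromosome
    record["ALT"] = record["ALT"].split(",")  # several alternate alleles are comma-separated
    # Everything after the ninth column is one genotype field per individual.
    record["samples"] = dict(zip(sample_names, fields[9:]))
    return record
```

For example, with hypothetical sample names `["sample_A", "sample_B"]`, the line `"Chr1\t100\t.\tA\tG\t50\tPASS\tNS=2\tGT:DP\t0/1:12\t1/1:8"` yields a record whose `samples` entry maps `sample_A` to `0/1:12`.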
Specific Fields in INFO
- NS: Number of samples with data
- AF: Allele frequency
- DP: Total depth of reads
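The INFO column packs these fields into a single semicolon-separated string of `key=value` pairs. A minimal sketch of unpacking it is shown below. The helper name `parse_info` is hypothetical, and values are kept as strings because their types depend on the header definitions in each file.

```python
def parse_info(info):
    """Convert a VCF INFO string such as 'NS=257;AF=0.12;DP=3021' into a dict.

    Entries without '=' are flags and are stored as True.
    A missing INFO value ('.') gives an empty dict.
    """
    result = {}
    if info in ("", "."):
        return result
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        result[key] = value if sep else True
    return result
```

A caller who needs numbers can convert selectively, for instance `float(parse_info(info)["AF"])` for a biallelic site. Multi-allelic sites carry comma-separated AF values and would need to be split first.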
Specific Fields in FORMAT
- GT: Genotype
- AD: Allele depth
- DP: Read depth
- GQ: Genotype quality
- GL: Genotype likelihood
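Each sample column follows the key order given in the FORMAT column, with values separated by colons. The sketch below pairs the two and derives an alternate-allele count from GT. Both function names are hypothetical, and the example assumes standard VCF 4.2 conventions (`/` or `|` as allele separators, `.` for missing values, trailing fields optionally dropped).

```python
def parse_sample(format_field, sample_field):
    """Pair FORMAT keys (e.g. 'GT:AD:DP:GQ') with one sample's values."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # VCF allows trailing fields to be omitted; pad them as missing.
    values += ["."] * (len(keys) - len(values))
    return dict(zip(keys, values))


def alt_allele_count(gt):
    """Count non-reference alleles in a GT string; return None if any allele is missing."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None
    return sum(1 for allele in alleles if allele != "0")
```

Combined with a line parser, `alt_allele_count(parse_sample(fmt, field)["GT"])` gives a 0/1/2 genotype coding for a diploid biallelic site, which is the form that many downstream population-genetic tools expect.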
Access information
Other publicly accessible locations of the data:
- All sequencing data from this study have been deposited at the NCBI Sequence Read Archive under BioSample accession numbers SAMN38056469-SAMN38056477 and SAMN38043932-SAMN38044038.