Data from: Demographic history inferred from an inversion-rich spruce bark beetle genome
Data files
Feb 25, 2026 version files 4.95 GB
- BED_files.zip (768.43 KB)
- Fastsimcoal_files.zip (735.13 KB)
- PSMC_pipeline (22.33 KB)
- Raw_data_processing.zip (20.07 KB)
- README.md (9.59 KB)
- SFS_pipeline (11.05 KB)
- VCF.tar (4.95 GB)
Abstract
The demographic history of species inferred from whole-genome data provides quantitative insights into key biological parameters such as population size changes and divergence times. Reliable estimates often require data that have not been affected by selection. Extensive research, however, indicates that many species harbour multiple polymorphic chromosomal inversions, which often evolve under different selective pressures. Consequently, inversions can influence genome-wide patterns of variation and subsequent evolutionary inferences. In this study, we used genome-wide data from over 300 spruce bark beetle (Ips typographus) individuals from 23 populations across Europe to reconstruct their demographic history and to investigate the impact of a complex polymorphic inversion landscape (covering approximately 28% of the beetle genome) on demographic inference. We used two complementary methods, Pairwise Sequentially Markovian Coalescent (PSMC) and Site Frequency Spectrum (SFS)-based modelling, and revealed a Late Pleistocene divergence (~79 kya) between populations from the southern and northern parts of the species’ European range, and a long-term effective population size of ~250,000. The southern group underwent significant population expansion after this divergence event, whereas the northern group expanded during the Holocene (~7 kya). Recent population size estimates suggest that the southern group is twice as large as the northern group. Neglecting the presence of chromosomal inversions did not significantly affect the model selection procedure and resulted in relatively small biases in the estimated demographic parameters. This study provides information on the historical population dynamics of the spruce bark beetle and improves our understanding of the influence of a complex genomic architecture on the inference of evolutionary history.
Dataset DOI: 10.5061/dryad.3ffbg7b01
Description of the data and file structure
Associated publication:
Demographic history inferred from an inversion-rich spruce bark beetle genome
Raw sequencing data availability:
Raw FASTQ files are deposited in the NCBI Sequence Read Archive (SRA) and are available under BioProject ID PRJNA1013983.
General information
This repository contains data files and analysis pipelines used for population genomic and demographic analyses of Ips typographus. The materials include workflows for raw sequencing data processing, variant calling, site frequency spectrum (SFS) construction, and demographic inference using both an SFS-based approach (fastsimcoal2) and the Pairwise Sequentially Markovian Coalescent (PSMC).
Processed variant call data (VCF format) and scripts are provided to facilitate transparency, reproducibility, and reuse of the analyses presented in the associated manuscript.
Reference genome information
Sequencing reads were mapped to the Ips typographus reference genome assembly:
Assembly accession: GCA_016097725.1
Genome size: 236.8 Mb
Analyses were restricted to the 25 autosomal contigs listed in ALL_PSMC_Contigs.txt (part of BED_files.zip, described below).
Repository file inventory
This repository contains the following files and folders:
Raw_data_processing.zip
VCF.tar
BED_files.zip
SFS_pipeline/
Fastsimcoal_files.zip
PSMC_pipeline/
File/folder content description
Raw_data_processing.zip
Pipeline and scripts for parallelised preprocessing of raw sequencing data prior to downstream analyses. Each subfolder contains its own README file. The workflows are designed for execution in high-performance computing (HPC) environments. All files within the subfolders can be viewed and edited with any plain text editor (e.g. Notepad++, nano).
Directory contents:
- Raw_data_processing_pipeline - main wrapper pipeline coordinating successive preprocessing steps
- parallel_trimming/ - parallel adapter trimming and quality filtering of raw sequencing reads
- parallel_sorting/ - parallel sorting and indexing of alignment files
- parallel_duplicate_removal/ - parallel removal or marking of PCR duplicates
- parallel_coverage/ - calculation of genome-wide or regional sequencing coverage
- parallel_snp_calling/ - parallel SNP calling from processed alignment files
- parallel_genotyping/ - genotyping of variants across individuals
- parallel_combine_variants/ - merging and combining variant call files into final datasets
Input: Raw sequencing reads (FASTQ files from NCBI SRA).
Output: Processed BAM & VCF files.
Software requirements: FastQC, Trimmomatic, Bowtie2, samtools/bcftools, Picard, GATK.
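Before launching the preprocessing workflows, it can help to confirm that the listed tools are reachable on the current system. The snippet below is a minimal sketch of such a check. The executable names are assumptions, not taken from the pipeline scripts: local installations may expose Trimmomatic or Picard as Java .jar files rather than as commands, in which case the mapping needs to be adapted.

```python
import shutil

# Assumed command names for the listed software (hypothetical; adjust to the
# local installation, e.g. HPC modules or conda environments).
ASSUMED_EXECUTABLES = {
    "FastQC": "fastqc",
    "Trimmomatic": "trimmomatic",
    "Bowtie2": "bowtie2",
    "samtools": "samtools",
    "bcftools": "bcftools",
    "Picard": "picard",
    "GATK": "gatk",
}


def find_tools(executables=ASSUMED_EXECUTABLES):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(exe) for tool, exe in executables.items()}


if __name__ == "__main__":
    for tool, path in find_tools().items():
        print(f"{tool:12s} {path or 'NOT FOUND'}")
```

A tool reported as NOT FOUND is not necessarily missing; it may only be installed under a different name.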
VCF.tar
Compressed Variant Call Format (VCF) file and the corresponding tabix index file (.tbi extension) used for population genomic and demographic analyses. The VCF file follows the VCF v4.2 specification. It contains variation for the whole-genome dataset and can be further filtered with the BED files to obtain the data subsets used in the publication for downstream analyses. The file is the result of FASTQ processing, read mapping and variant calling, and was used for SFS construction and population genetic analyses. The .tbi index allows software tools to rapidly access specific genomic regions within the compressed VCF file without reading the entire file sequentially. Once decompressed, the VCF is plain text and can be viewed in any plain text editor (e.g. Notepad++, nano). For further bioinformatic analyses we recommend the following software: bedtools, bcftools, GATK, easySFS.
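For a quick look at the dataset without decompressing it, the header can be read directly from the compressed file. The following is a minimal sketch, assuming the VCF inside VCF.tar is bgzip-compressed (which Python's gzip module can read as a stream) and has already been extracted. It relies only on the VCF v4.2 column layout.

```python
import gzip


def read_vcf_samples(vcf_gz_path):
    """Return (number of '##' meta lines, sample names) from a compressed VCF."""
    meta_lines = 0
    with gzip.open(vcf_gz_path, "rt") as handle:
        for line in handle:
            if line.startswith("##"):
                meta_lines += 1
            elif line.startswith("#CHROM"):
                columns = line.rstrip("\n").split("\t")
                # VCF v4.2: 8 fixed columns, then FORMAT, then one column per sample.
                return meta_lines, columns[9:]
            else:
                break  # reached data records without seeing a header line
    raise ValueError("no #CHROM header line found")
```

Calling `read_vcf_samples("path/to/extracted.vcf.gz")` (a placeholder path, since the file name inside the archive is not stated here) returns the sample names, which can then be matched to the population groups. Only the header is read, so the call returns quickly even for a multi-gigabyte file.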
BED_files.zip
This archive contains files defining the genomic regions used for filtering, masking, and partitioning the genome, in particular for including or excluding inversions. The files can be viewed and edited with any plain text editor (e.g. Notepad++, nano). For further bioinformatic analyses we recommend the following software: bedtools, bcftools, GATK.
Contents:
- ALL_PSMC_Contigs.txt - plain text file, list of contigs included in the PSMC analyses
- PSMC_INVERSIONS.bed - plain text file in BED format, genomic coordinates of the inversion regions used for filtering
- ALL_25_PSMC_contigs.bed - plain text file in BED format, the 25 contigs used for the whole-genome dataset
- ALL_25_PSMC_contigs_No_inversions.bed - plain text file in BED format, genomic coordinates of the 25 contigs with inversions removed
- ALL_25_PSMC_contigs_No_inversions_no_genes_no_repeats.bed - plain text file in BED format, genomic coordinates of the 25 contigs excluding inversions, genes, and repeats
- ALL_25_PSMC_contigs_Inversions_no_genes_no_repeats.bed - plain text file in BED format, genomic coordinates of the 25 contigs including only inverted regions, excluding genes and repetitive elements
- Small_colinear.bed - plain text file in BED format, genomic coordinates of the regions referred to as small colinear segments in the associated publication
- Small_colinear_sorted.bed - plain text file in BED format, sorted version of Small_colinear.bed
- ALL_25_PSMC_contigs_SMALL_no_genes_no_repeats.bed - plain text file in BED format, genomic coordinates of the small colinear segments, excluding genes and repeats
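The partition files are related by interval arithmetic: judging from the descriptions above, a "no inversions" partition corresponds to the contig intervals minus the inversion intervals. The sketch below illustrates that operation in plain Python. It is not the procedure used to produce the files (a tool such as bedtools would be the usual choice), and the assumed relationship between the files has not been verified against their contents.

```python
def read_bed(path):
    """Read the first three BED columns as (contig, start, end) tuples."""
    intervals = []
    with open(path) as handle:
        for line in handle:
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            contig, start, end = line.split()[:3]
            intervals.append((contig, int(start), int(end)))
    return intervals


def subtract(keep, remove):
    """Return the parts of `keep` intervals not covered by any `remove` interval."""
    by_contig = {}
    for contig, start, end in remove:
        by_contig.setdefault(contig, []).append((start, end))
    result = []
    for contig, start, end in keep:
        cursor = start  # left edge of the region not yet emitted or removed
        for r_start, r_end in sorted(by_contig.get(contig, [])):
            if r_end <= cursor:
                continue  # removal lies entirely left of the cursor
            if r_start >= end:
                break  # removal lies entirely right of this interval
            if r_start > cursor:
                result.append((contig, cursor, r_start))
            cursor = max(cursor, r_end)
        if cursor < end:
            result.append((contig, cursor, end))
    return result


def total_length(intervals):
    """Sum of interval lengths in bp (assumes non-overlapping intervals)."""
    return sum(end - start for _, start, end in intervals)
```

With the archive extracted, `subtract(read_bed("ALL_25_PSMC_contigs.bed"), read_bed("PSMC_INVERSIONS.bed"))` would give a candidate non-inverted partition, and `total_length` on each file gives the number of base pairs it covers.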
SFS_pipeline
Pipeline for constructing observed site frequency spectra (SFS) from the data. The resulting spectra are plain text files with the .obs extension. The files can be viewed and edited with any plain text editor (e.g. Notepad++, nano). For further bioinformatic analyses we recommend the following software: bedtools, bcftools, GATK, easySFS.
Observed SFS (.obs) files represent joint minor allele frequency spectra where:
- rows and columns represent allele frequency bins across analysed populations
- each matrix cell represents the number of SNPs observed in a given frequency class
Units: number of SNPs per frequency bin.
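To make the matrix structure concrete, the sketch below parses a two-population joint SFS into a list of rows. It assumes the usual fastsimcoal2 text layout with a single observation (a first line such as "1 observations", a line of column labels, then one labelled row of counts per line). The files in this repository have not been checked against this layout, and the example values in the usage note are invented.

```python
def parse_joint_sfs(text):
    """Parse a joint SFS (single observation) into (row_labels, col_labels, matrix)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # lines[0] is the "<n> observations" line; lines[1] holds the column labels.
    col_labels = lines[1].split()
    row_labels, matrix = [], []
    for line in lines[2:]:
        fields = line.split()
        counts = [float(value) for value in fields[1:]]
        if len(counts) != len(col_labels):
            raise ValueError(f"row {fields[0]} has {len(counts)} cells, "
                             f"expected {len(col_labels)}")
        row_labels.append(fields[0])
        matrix.append(counts)
    return row_labels, col_labels, matrix


def count_polymorphic(matrix):
    """Sum all cells except [0][0], which by convention holds monomorphic sites."""
    return sum(sum(row) for row in matrix) - matrix[0][0]
```

For example, parsing the text "1 observations\n\td0_0\td0_1\nd1_0\t100\t5\nd1_1\t4\t3\n" yields a 2 x 2 matrix in which the cell in row d1_1 and column d0_1 (value 3) counts the SNPs whose minor allele was observed once in each population.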
Two population groups were defined for demographic inference:
- Northern population (Nor) – individuals originating from Sweden, Norway and Finland
- Southern population (Sou) – individuals originating from Italy, Austria, Germany and Czech Republic
Fastsimcoal_files.zip
Input files for demographic inference using fastsimcoal2, organized in folders by genomic partition. They were used to test demographic models under different genomic partitions in order to assess the effects of inversions and genome subsets. The files can be viewed and edited with any plain text editor (e.g. Notepad++, nano). For further bioinformatic analyses we recommend the following software: fastsimcoal2.
Top-level directories and contents:
- whole-genome/ - contains files used for full-genome analyses
- inversions-only/ - contains files used for analyses restricted to inverted regions
- no-inversions/ - contains files used for analyses excluding inversions
- no-inversions-small/ - contains files used for analyses of a smaller subset of non-inverted regions
Each folder contains:
- Observed site frequency spectrum files (in two different file formats) required by fastsimcoal2 to run a given demographic model. Naming convention for the files (for example NorSouC25INV_jointMAFpop1_0.obs and NorSouC25INV_MSFS.obs):
  - Nor - northern population
  - Sou - southern population
  - C25 - dataset restricted to the 25 contigs used in demographic analyses
  - INV - genomic partition including only inversion regions
- Demographic model directories corresponding to different demographic scenarios: IM/, IMDE/, ISO/, ISODE/, SC/, SCDE/.
  - IM - isolation with constant migration model
  - IMDE - isolation with constant migration model and a single, instant demographic event (expansion or contraction)
  - ISO - isolation model
  - ISODE - isolation model and a single, instant demographic event (expansion or contraction)
  - SC - secondary contact model
  - SCDE - secondary contact model and a single, instant demographic event (expansion or contraction)
- Each model subdirectory contains the steering files required by fastsimcoal2 to run simulations:
  - .tpl - plain text file with the template defining the demographic model
  - .est - plain text file with starting parameters for the model
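When scripting over these folders, the naming convention can be decoded programmatically. The sketch below is an illustration built only from the two example file names and the model list given above. The regular expression is an assumption, and partition tags other than INV are not documented here, so the tag is returned as an opaque string.

```python
import re

# Model codes and meanings as listed in this README.
MODELS = {
    "IM": "isolation with constant migration",
    "IMDE": "isolation with constant migration and a single, instant demographic event",
    "ISO": "isolation",
    "ISODE": "isolation and a single, instant demographic event",
    "SC": "secondary contact",
    "SCDE": "secondary contact and a single, instant demographic event",
}

# Assumed pattern: population codes, contig set (C + number), optional partition
# tag, then the SFS format suffix. Derived from the two example names only.
OBS_NAME = re.compile(
    r"^(?P<pops>(?:Nor|Sou)+)(?P<contigs>C\d+)(?P<partition>[A-Za-z0-9]*)"
    r"_(?P<fmt>jointMAFpop\d+_\d+|MSFS)\.obs$"
)


def parse_obs_name(filename):
    """Split an observed-SFS file name into its components, or return None."""
    match = OBS_NAME.match(filename)
    if match is None:
        return None
    return {
        "populations": re.findall(r"Nor|Sou", match.group("pops")),
        "contig_set": match.group("contigs"),
        "partition": match.group("partition"),
        "sfs_format": match.group("fmt"),
    }
```

For NorSouC25INV_jointMAFpop1_0.obs this returns the populations Nor and Sou, the contig set C25, the partition tag INV, and the format jointMAFpop1_0. Names that do not fit the assumed pattern return None instead of being guessed at.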
PSMC_pipeline
Description:
Pipeline for demographic inference using the Pairwise Sequentially Markovian Coalescent (PSMC). The files can be viewed and edited with any plain text editor (e.g. Notepad++, nano). For further bioinformatic analyses we recommend the following software: bedtools, bcftools, GATK, psmc.
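PSMC reports times and population sizes in scaled units, which are converted to years and effective population sizes using a mutation rate and a generation time. The sketch below shows the standard conversion as used by the plotting script shipped with psmc: N0 = theta0 / (4 * mu * s), time = 2 * N0 * t_k * g, and Ne = N0 * lambda_k. The mutation rate, generation time, and the bin size of 100 bp are function parameters here. The values used for Ips typographus are not stated in this README and have to be taken from the associated publication or the pipeline scripts.

```python
def scale_psmc(theta0, intervals, mu, generation_time, bin_size=100):
    """Convert scaled PSMC output to (years before present, Ne) pairs.

    theta0          -- scaled mutation rate per bin from the PSMC output
    intervals       -- iterable of (t_k, lambda_k) pairs in scaled units
    mu              -- mutation rate per site per generation (user-supplied)
    generation_time -- years per generation (user-supplied)
    bin_size        -- bp per bin in the PSMC input (100 is a common default)
    """
    n0 = theta0 / (4.0 * mu * bin_size)  # reference effective population size
    return [
        (2.0 * n0 * t_k * generation_time, n0 * lambda_k)
        for t_k, lambda_k in intervals
    ]
```

As an arithmetic illustration with made-up numbers: theta0 = 0.04, mu = 1e-9 and a bin size of 100 give N0 = 100,000, so a scaled time of 0.5 with lambda = 2 and a one-year generation time corresponds to about 100,000 years ago and Ne of about 200,000.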
Notes on reuse
- File paths in scripts require adaptation to local directory structures.
Contact
For questions regarding the dataset or analysis pipelines, please contact the corresponding author of the associated publication.
Access information
Publicly accessible locations of the raw sequencing data:
- National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA): BioProject ID PRJNA1013983
