Calling structural variants with confidence from short-read data in wild bird populations
Cite this dataset
David, Gabriel (2024). Calling structural variants with confidence from short-read data in wild bird populations [Dataset]. Dryad. https://doi.org/10.5061/dryad.6q573n647
Abstract
Comprehensive characterisation of structural variation in natural populations has only become feasible in the last decade. To investigate the population genomic nature of structural variation (SV), reproducible and high-confidence SV callsets are first required. We created a population-scale reference of the genome-wide landscape of structural variation across 33 Nordic house sparrow (Passer domesticus) individuals. To produce a consensus callset across all samples using short-read data, we compare heuristic-based quality filtering and visual curation (Samplot/PlotCritic and Samplot-ML) approaches. We demonstrate that curation of SVs is important for reducing putative false positives and that the time invested in this step outweighs the potential costs of analysing short-read discovered SV datasets that include many potential false positives. We find that even a lenient manual curation strategy (e.g. applied by a single curator) can reduce the proportion of putative false positives by up to 80%, thus enriching the proportion of high-confidence variants. Crucially, in applying a lenient manual curation strategy with a single curator, nearly all (>99%) variants rejected as putative false positives were also classified as such by a more stringent curation strategy using three additional curators. Furthermore, variants rejected by manual curation failed to reflect the expected population structure from SNPs, whereas variants passing curation did. Combining heuristic-based quality filtering with rapid manual curation of structural variants in short-read data can therefore become a time- and cost-effective first step for functional and population genomic studies requiring high-confidence SV callsets.
README: Calling structural variants with confidence from short-read data in wild bird populations
Included files:
- sparrow_all.smoove.square.anno.vcf.gz = VCF file of raw structural variant calls analysed in the study
- Scripts and walk-throughs for generating the above VCF file, downstream filtering, plotting in Samplot, and creating a PlotCritic curation project
- passerDomesticusAnnotatedRepeats.gff = TE annotation generated in this study
Description of the data and file structure
We called larger (>20 bp) structural variants (deletions, duplications, and inversions) from the aligned .bam files using LUMPY (Layer et al. 2014) and genotyped the resulting calls with SVTyper (Chiang et al. 2015), both run via the smoove pipeline (Pedersen et al. 2020). The resulting VCF file of raw structural variant calls analysed in the study is included here:
- sparrow_all.smoove.square.anno.vcf.gz
The file can be opened in a Unix or Linux command-line terminal. Commands such as "zcat" (or "less", where it is set up to read gzipped input) can display the gzipped VCF file. Features can be queried using, e.g., BCFtools.
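As a minimal sketch of how the file could be inspected without BCFtools, the following Python snippet tallies records by SV type. It assumes each record carries an SVTYPE key in the INFO column, as is usual for LUMPY/smoove output. The function names are hypothetical and are not part of the deposited scripts.

```python
import gzip
from collections import Counter


def count_sv_types(lines):
    """Tally VCF records by their SVTYPE INFO value."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        info = line.rstrip("\n").split("\t")[7]  # INFO is the 8th VCF column
        svtype = "NA"  # label used when a record has no SVTYPE key
        for field in info.split(";"):
            if field.startswith("SVTYPE="):
                svtype = field.split("=", 1)[1]
                break
        counts[svtype] += 1
    return counts


def count_sv_types_in_file(path):
    """Open a gzipped VCF in text mode and tally its SV types."""
    with gzip.open(path, "rt") as handle:
        return count_sv_types(handle)
```

For example, count_sv_types_in_file("sparrow_all.smoove.square.anno.vcf.gz") would return a Counter keyed by SV type.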
A Snakemake pipeline for generating the .bam files from raw Illumina reads is provided as:
- Snakefile_BWA_GATK
Example scripts on how to generate and filter the above VCF-file are provided in:
- Smoove_commands_to_generate_VCF_of_raw_SVs_and_example_filtering.txt
Example scripts on how to generate the Samplot images and PlotCritic curation project from a filtered VCF file are provided in:
- Samplot_and_PlotCritic_commands.txt
A script that selects only SVs with a specific number of individuals per genotype class, randomly samples individuals from each class, and generates a Samplot command to plot each image is provided in:
- gen_samplot.py
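The selection logic described above can be illustrated with a short sketch. This is not the deposited gen_samplot.py, only an approximation under stated assumptions: "a specific number of individuals per genotype class" is interpreted here as at least per_class individuals in each of the hom-ref, het and hom-alt classes, and the BAM and image paths (bams/<sample>.bam, img/) are hypothetical. The samplot plot options used (-c, -s, -e, -t, -o, -n, -b) are the chromosome, start, end, SV type, output file, sample titles and BAM files.

```python
import random


def samples_by_genotype(genotypes):
    """Group sample names into hom-ref, het and hom-alt classes.

    `genotypes` maps sample name -> GT string such as "0/1" or "1|0".
    """
    classes = {"0/0": [], "0/1": [], "1/1": []}
    for sample, gt in genotypes.items():
        key = "/".join(sorted(gt.replace("|", "/").split("/")))
        if key in classes:  # missing genotypes such as "./." are ignored
            classes[key].append(sample)
    return classes


def pick_samples(genotypes, per_class=2, seed=None):
    """Randomly pick per_class samples from each genotype class.

    Returns None when any class has fewer than per_class samples,
    i.e. the SV is skipped (assumed interpretation, see lead-in).
    """
    rng = random.Random(seed)
    picked = []
    for members in samples_by_genotype(genotypes).values():
        if len(members) < per_class:
            return None
        picked.extend(rng.sample(sorted(members), per_class))
    return picked


def samplot_command(chrom, start, end, svtype, samples,
                    bam_dir="bams", out_dir="img"):
    """Build a `samplot plot` argument list for one SV (paths are hypothetical)."""
    bams = [f"{bam_dir}/{s}.bam" for s in samples]
    out = f"{out_dir}/{svtype}_{chrom}_{start}_{end}.png"
    return (["samplot", "plot", "-c", chrom, "-s", str(start), "-e", str(end),
             "-t", svtype, "-o", out, "-n"] + list(samples) + ["-b"] + bams)
```

The returned list can be joined with spaces to write a shell script of plotting commands, or passed to subprocess.run directly.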
Sharing/Access information
The raw Illumina reads and assembled reference genome from this article are published and available at NCBI, Bioproject number PRJNA255814 (P. domesticus reference accession number SAMN02929199).
Methods
The raw Illumina reads and assembled reference genome from this article are also published and available at NCBI, Bioproject number PRJNA255814 (Passer domesticus reference accession number SAMN02929199). Trimmed reads were aligned with BWA-MEM (bwa v. 0.7.17) to the short-read reference genome assembly for Passer domesticus (Elgvin et al. 2017; NCBI: GCA_001700915.1_Passer_domesticus-1.0), and then sorted and indexed with Samtools (samtools v. 1.9). All unplaced scaffolds were removed, so only mapped chromosomal regions were included in downstream analyses.
We called larger (>20 bp) structural variants (deletions, duplications, and inversions) from the aligned .bam files using LUMPY (Layer et al. 2014) and genotyped the resulting calls with SVTyper (Chiang et al. 2015), both run via the smoove pipeline (Pedersen et al. 2020). The resulting VCF file of raw structural variant calls analysed in the study is included in the following file: sparrow_all.smoove.square.anno.vcf.gz
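The size and type criteria mentioned above (deletions, duplications and inversions larger than 20 bp) could be checked per record roughly as follows. This is a sketch and not the filtering used in the study, which is documented in the deposited command files. It assumes the INFO column carries SVTYPE plus an integer SVLEN or an END coordinate, and all function names are hypothetical.

```python
def parse_info(info):
    """Turn a VCF INFO string into a dict; flag-only keys map to True."""
    tags = {}
    for field in info.split(";"):
        key, sep, value = field.partition("=")
        tags[key] = value if sep else True
    return tags


def sv_length(pos, tags):
    """Return SV length in bp from SVLEN, else END - POS, else None."""
    if "SVLEN" in tags:
        # SVLEN is negative for deletions, so take the absolute value
        return abs(int(tags["SVLEN"].split(",")[0]))
    if "END" in tags:
        return int(tags["END"]) - pos
    return None


def passes_size_and_type(line, min_len=20, svtypes=("DEL", "DUP", "INV")):
    """True if a VCF record is a DEL/DUP/INV longer than min_len bp."""
    fields = line.rstrip("\n").split("\t")
    tags = parse_info(fields[7])  # INFO is the 8th VCF column
    length = sv_length(int(fields[1]), tags)
    return (tags.get("SVTYPE") in svtypes
            and length is not None
            and length > min_len)
```

Records without a usable length, such as breakends (SVTYPE=BND), are rejected by this check.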
Repetitive elements were identified using the Earl Grey TE annotation pipeline (version 1.2) (Baril et al. 2021, 2022), configured with RepBase (version 23.08) and Dfam (version 3.4) repeat libraries (Hubley et al. 2016; Jurka et al. 2005). Briefly, Earl Grey first annotated known repeats using the Aves repeat library. Following this, Earl Grey identified and refined novel TEs using an automated and iterative implementation of the “BLAST, Extract, Extend” process (Platt et al. 2016). Following the final TE annotation (passerDomesticusAnnotatedRepeats.gff), overlapping and fragmented annotations were resolved by Earl Grey before the final TE quantification.