SNP datasets and genomes used to benchmark the SNPLift program

Normandeau, Eric

Research facility: Université Laval

Published Jun 12, 2023 on Dryad. https://doi.org/10.5061/dryad.h9w0vt4nx

Data files

Jun 12, 2023 version files 18.58 GB

README.md (628 B)

SNPLift_benchmark_data.tgz (18.58 GB)

Abstract

Motivation: The advent of high-throughput sequencing technologies and the availability of reference genomes have provided an unprecedented opportunity to discover and genotype millions of genetic variants in hundreds or even thousands of samples. Variant calling, the identification of genetic variants from raw sequencing data, is a time-consuming and computationally expensive process. Reference genomes are currently evolving rapidly, and new versions are released more and more frequently. To take advantage of a new or improved reference genome, raw read alignment, genotype calling, and filtering must typically all be redone. This is a costly and time-consuming operation that is not always possible when projects are under time constraints.

Results: Here, we present SNPLift, a bioinformatic pipeline that can quickly transfer SNP coordinates from one version of a genome to another, making it possible to rapidly leverage the resources represented by new reference genomes. We tested SNPLift on nine SNP datasets in VCF format from different species (Homo sapiens, Arabidopsis thaliana, Coregonus clupeaformis, Medicago truncatula, Oryza sativa, Salvelinus namaycush, Solanum lycopersicum, Zea mays, and Glycine max). Depending on the species, we accurately lifted between 82.64% and 99.39% of the variants very quickly and efficiently, reducing the required computing power by multiple orders of magnitude compared to a complete re-analysis using the new reference genome. SNPLift provides an accurate, parallelized, efficient, and fast solution for updating genome positions, for example for variant calls, based on new reference genomes.

Availability and implementation: SNPLift is available at https://github.com/enormandeau/snplift with its documentation and installation procedure. The repository also contains a script that runs an automated test on a small dataset composed of 190,443 SNPs in chromosome 1 of Medicago truncatula. SNPLift uses only common tools that are easy to install and works under Linux and macOS.
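
The proportion of lifted variants reported above can be approximated for any pair of VCF files by counting variant records before and after the transfer. The following is a minimal sketch only, not part of SNPLift: the function names and the file names in the usage note are hypothetical, and it assumes that a lifted VCF holds one record per successfully transferred SNP. The published percentages count variants that were accurately lifted, which may involve checks beyond a simple record count.

```python
import gzip


def count_vcf_records(path):
    """Count variant records (non-empty, non-header lines) in a plain or gzipped VCF."""
    # Hypothetical helper: VCF header lines start with '#', everything else is a record.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.strip() and not line.startswith("#"))


def percent_lifted(original_vcf, lifted_vcf):
    """Return the percentage of records in original_vcf that are present in lifted_vcf.

    Assumption: the lifted file contains only records that were transferred,
    so the ratio of record counts approximates the lift-over rate.
    """
    total = count_vcf_records(original_vcf)
    if total == 0:
        raise ValueError("original VCF contains no variant records")
    return 100.0 * count_vcf_records(lifted_vcf) / total
```

For example, `percent_lifted("original.vcf", "lifted.vcf")` would return the share of records retained for one hypothetical dataset; both file names are placeholders rather than files known to be in the archive.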
