SNP datasets and genomes used to benchmark the SNPLift program
Data files
Jun 12, 2023 version files (18.58 GB)
- README.md (628 B)
- SNPLift_benchmark_data.tgz (18.58 GB)
Abstract
Motivation: The advent of high-throughput sequencing technologies and the availability of reference genomes have provided an unprecedented opportunity to discover and genotype millions of genetic variants in hundreds or even thousands of samples. Variant calling, the identification of genetic variants from raw sequencing data, is a time-consuming and computationally expensive process. Reference genomes are currently evolving very rapidly, and new versions are released more and more frequently. To take advantage of a new or improved reference genome, raw read alignment, genotype calling, and filtering must typically all be redone. This is a costly and time-consuming operation that is not always possible when projects are under time constraints.
Results: Here, we present SNPLift, a bioinformatic pipeline that can quickly transfer SNP coordinates from one version of a genome to another, making it possible to rapidly leverage the resources represented by new reference genomes. We tested SNPLift on nine SNP datasets in VCF format from different species (Homo sapiens, Arabidopsis thaliana, Coregonus clupeaformis, Medicago truncatula, Oryza sativa, Salvelinus namaycush, Solanum lycopersicum, Zea mays, and Glycine max). Depending on the species, we accurately lifted between 82.64% and 99.39% of the variants very quickly and efficiently, reducing the required computing power by multiple orders of magnitude compared to a complete re-analysis using the new reference genome. SNPLift provides an accurate, parallelized, efficient, and fast solution to update genome positions, for example for variant calls, based on new reference genomes.
Availability and implementation: SNPLift is available at https://github.com/enormandeau/snplift with its documentation and installation procedure. The repository also contains a script that runs an automated test on a small dataset composed of 190,443 SNPs on chromosome 1 of Medicago truncatula. SNPLift uses only common tools that are easy to install, and it works under Linux and macOS.
Methods
The dataset covers nine species. For each species, it provides two versions of the reference genome and one VCF file. The VCF contains SNPs whose positions refer to the older of the two genome versions.
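Since each VCF stores SNP positions relative to the older genome version, a quick way to inspect one is to read the chromosome and position columns of its data lines. The following is a minimal sketch, not part of SNPLift: the function names are hypothetical, and it assumes only the standard tab-separated VCF layout in which the first two columns are CHROM and POS and header lines start with `#`.

```python
import gzip
from collections import Counter
from typing import Iterator, TextIO, Tuple


def open_vcf(path: str) -> TextIO:
    """Open a plain or gzip-compressed VCF file as text (hypothetical helper)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def iter_snp_positions(handle: TextIO) -> Iterator[Tuple[str, int]]:
    """Yield (chromosome, position) for each data line of a VCF."""
    for line in handle:
        # Skip meta-information lines (##...), the header line (#CHROM...),
        # and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 1 is CHROM, column 2 is the 1-based POS.
        yield fields[0], int(fields[1])


def count_snps_per_chromosome(handle: TextIO) -> Counter:
    """Count how many variant records fall on each chromosome."""
    return Counter(chrom for chrom, _ in iter_snp_positions(handle))
```

For example, `count_snps_per_chromosome(open_vcf(path))` summarizes a VCF at a path of your choosing; the file names inside the archive are not listed here, so the path is left to the reader.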
Usage notes
The authors of SNPLift used the data from these nine species to perform benchmarks, as described in the project's GitHub repository (https://github.com/enormandeau/snplift). This repository in turn points to the publication where the benchmark results are presented.
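Before extracting the 18.58 GB archive, it can help to list its members. The sketch below uses Python's standard `tarfile` module; the helper name is hypothetical, and the internal layout of the archive is not documented here, so nothing is assumed about the member names it returns.

```python
import tarfile
from typing import List


def list_archive_members(archive_path: str) -> List[str]:
    """Return the member names of a gzip-compressed tar archive."""
    # "r:gz" opens the archive for reading with gzip decompression.
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()
```

Calling `list_archive_members("SNPLift_benchmark_data.tgz")` reads through the compressed archive without writing any files to disk, which may still take some time given its size.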