Big cat vcf files from: Exceedingly low genetic diversity in snow leopards due to persistently small population size

Published Oct 28, 2025 on Dryad. https://doi.org/10.5061/dryad.1vhhmgr78

Abstract

Snow leopards (Panthera uncia) serve as an umbrella species whose conservation benefits their high-elevation Asian habitat. Their numbers are believed to be in decline due to numerous anthropogenic threats; however, their conservation is hindered by numerous knowledge gaps. In particular, the dearth of genetic data, unique among all big cat species, hinders a full understanding of their population structure, historical population size, and current levels of genetic diversity. Here, we use whole-genome sequencing data for 41 snow leopards (37 newly sequenced) to offer new insights into these unresolved aspects of snow leopard biology. Among our samples, we find evidence of a primary genetic divide between the northern and southern parts of the range around the Dzungarian Basin, as previously identified using landscape models and fecal microsatellite markers, and a secondary divide south of Kyrgyzstan around the Taklamakan Desert. Most noteworthy, we find that snow leopards have the lowest genetic diversity of any big cat species, likely due to a persistently small population size throughout their evolutionary history rather than recent inbreeding. We also find that snow leopards have significantly less highly deleterious homozygous load compared to numerous Panthera species, suggesting effective purging during their evolutionary history at small population sizes. Without a large population size or ample standing genetic variation to help buffer them from any forthcoming anthropogenic challenges, snow leopard persistence may be more tenuous than currently appreciated.

Dataset DOI: 10.5061/dryad.1vhhmgr78

Description of the data and file structure

SNPs were called in our focal species, snow leopards (Panthera uncia), as well as in all other big cat species using publicly available data. The same pipeline was used for every species so that results can be compared among species. What we share here are the final filtered vcf files used in our analyses.

For snow leopards, SNPs were also called by a service provider, Gencove Inc., so there are two vcf files for this species.

Sample data for snow leopard vcfs can be found in Supplementary Table 1.

Sample data for other big cat vcfs can be found in Supplementary Table 3.

Details regarding which reference genomes were used for each vcf can be found in Supplementary Table 4.

Files and variables

File: Puma_n6_map100_filtered_pass_noIndel_biall.recode.vcf.gz

Description: Genome-wide SNP calls for six puma samples mapped to the puma reference genome after filtering steps listed in the methods.

File: Jaguar_n3_dwnSample_map100_filtered_pass_noIndel_biall.recode.vcf.gz

Description: Genome-wide SNP calls for three jaguar samples mapped to the jaguar reference genome after filtering steps listed in the methods.

File: AsianLeop_n4_ALLcontigs_sorted_map100_filtered_pass_noIndel_biall.recode.vcf.gz

Description: Genome-wide SNP calls for four Asian leopard samples mapped to the leopard reference genome after filtering steps listed in the methods.

File: AfricanLeop_n5_ALLcontigs_sorted_map100_filtered_pass_noIndel_biall.recode.vcf.gz

Description: Genome-wide SNP calls for five African leopard samples mapped to the leopard reference genome after filtering steps listed in the methods.

File: Cheetah_n7_map100_filtered_pass_noIndel_biall.recode.vcf.gz

Description: Genome-wide SNP calls for seven cheetah samples mapped to the cheetah reference genome after filtering steps listed in the methods.

File: Lion_n4_map100_filtered_pass_noIndel_biall.recode.vcf.gz

Description: Genome-wide SNP calls for four lion samples mapped to the lion reference genome after filtering steps listed in the methods.

File: Amur_n5_map100_filtered_pass_noIndel_biall.recode.vcf.gz

Description: Genome-wide SNP calls for five Amur tiger samples mapped to the tiger reference genome after filtering steps listed in the methods.

File: Bengal_n5_map100_filtered_pass_noIndel_biall.recode.vcf.gz

Description: Genome-wide SNP calls for five Bengal tiger samples mapped to the tiger reference genome after filtering steps listed in the methods.

File: Sumatran_n5_map100_filtered_pass_noIndel_biall.recode.vcf.gz

Description: Genome-wide SNP calls for five Sumatran tiger samples mapped to the tiger reference genome after filtering steps listed in the methods.

File: Malayan_n5_map100_filtered_pass_noIndel_biall.recode.vcf.gz

Description: Genome-wide SNP calls for five Malayan tiger samples mapped to the tiger reference genome after filtering steps listed in the methods.

File: SL_n28_allScaffolds_sorted_wHeader_map100_filtered_pass_noIndel_biall.recode.vcf.gz

Description: Genome-wide SNP calls for 38 snow leopard samples mapped to the snow leopard reference genome after filtering steps listed in the methods. These SNP calls were done using the exact same methods that were used in all the other big cat species.

File: CombinedAnnotations_wHeaderD_filtered_pass_1nonref_90maxmis_IndRename_no10x_1nonref.recode.vcf.gz

Description: Genome-wide SNP calls for 38 snow leopard samples mapped to the snow leopard reference genome. These SNP calls were done by a service provider, Gencove Inc., and were filtered by us as described in the methods.
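
To confirm how many samples a downloaded file actually contains, the sample IDs can be read from the VCF header. The following is a minimal sketch using only the Python standard library; the function name vcf_sample_names is ours and is not part of the dataset or of any pipeline described here.

```python
import gzip


def vcf_sample_names(path):
    """Return the sample IDs listed on the #CHROM header line of a gzipped VCF."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # The first nine columns are fixed VCF fields; sample IDs follow.
                return line.rstrip("\n").split("\t")[9:]
            if not line.startswith("#"):
                break  # reached the records without seeing a #CHROM line
    raise ValueError(f"no #CHROM header line found in {path}")
```

For example, len(vcf_sample_names("Puma_n6_map100_filtered_pass_noIndel_biall.recode.vcf.gz")) can be compared against the sample count given in the file description and in the Supplementary Tables.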

Calling SNPs in all big cat species:

We included all other species in the genus Panthera (leopard, lion, tiger, jaguar) as well as cheetah and puma. Accession numbers of publicly available WGS data and reference genomes used in this analysis are listed in Supplementary Tables 3 and 4, respectively. We mapped all FASTQ data to the corresponding reference genome using BWA-MEM. We sorted and indexed the resulting BAM files using SAMtools. We added read groups and marked duplicates using Picard. We calculated depth of coverage and breadth of coverage for each sample using SAMtools. We then used GATK HaplotypeCaller to generate a GVCF file for each sample and GATK CombineGVCFs to combine the GVCF files for every species (or subgroup in the case of tigers and leopards). Lastly, we used GATK GenotypeGVCFs to create a final VCF file for every group.
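
The order of the steps above can be sketched as a small Python helper that only builds the command lines and does not run anything. This is an illustration, not the exact commands we ran: the file names, the read-group values (including RGPL=ILLUMINA), and the use of a "picard" wrapper executable are assumptions, and the coverage calculations are omitted.

```python
def per_sample_commands(sample, ref, fq1, fq2):
    """Build per-sample commands as argument lists (nothing is executed)."""
    sam = f"{sample}.sam"
    sorted_bam = f"{sample}.sorted.bam"
    rg_bam = f"{sample}.rg.bam"
    dedup_bam = f"{sample}.dedup.bam"
    return [
        # Map paired-end reads to the species' reference genome.
        ["bwa", "mem", "-o", sam, ref, fq1, fq2],
        # Coordinate-sort the alignments.
        ["samtools", "sort", "-o", sorted_bam, sam],
        # Read-group values below are placeholders (assumption).
        ["picard", "AddOrReplaceReadGroups", f"I={sorted_bam}", f"O={rg_bam}",
         f"RGID={sample}", f"RGLB={sample}", "RGPL=ILLUMINA",
         f"RGPU={sample}", f"RGSM={sample}"],
        ["picard", "MarkDuplicates", f"I={rg_bam}", f"O={dedup_bam}",
         f"M={sample}.dup_metrics.txt"],
        ["samtools", "index", dedup_bam],
        # One GVCF per sample.
        ["gatk", "HaplotypeCaller", "-R", ref, "-I", dedup_bam,
         "-O", f"{sample}.g.vcf.gz", "-ERC", "GVCF"],
    ]


def group_commands(group, ref, samples):
    """Build the commands that combine per-sample GVCFs and genotype one group."""
    combined = f"{group}.combined.g.vcf.gz"
    combine = ["gatk", "CombineGVCFs", "-R", ref]
    for sample in samples:
        combine += ["--variant", f"{sample}.g.vcf.gz"]
    combine += ["-O", combined]
    genotype = ["gatk", "GenotypeGVCFs", "-R", ref, "-V", combined,
                "-O", f"{group}.vcf.gz"]
    return [combine, genotype]
```

Each returned list could be passed to subprocess.run(cmd, check=True) once the paths have been adapted to a local setup.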

In order to limit the need for excessive computing power, we downsampled some samples with over 30X coverage using SAMtools view as indicated in Supplementary Table 3. The leopard samples required a larger amount of computing power than the other species to call haplotypes, so we split these samples into one BAM file per contig using BAMtools split and ran each contig through GATK HaplotypeCaller and CombineGVCFs in parallel. We also found it necessary to split the jaguar and snow leopard data into intervals to combine GVCF files using GATK GenomicsDBImport followed by CombineGVCFs. In both cases, we concatenated resultant VCF files, one per contig or interval, using BCFtools concat.
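
As a rough illustration of the downsampling step, the fraction of reads to keep can be derived from the observed and desired depth and passed to SAMtools view through its -s option, which takes the form SEED.FRACTION. The target depth used in the example below is an assumption; the fractions actually applied are those indicated in Supplementary Table 3.

```python
def subsample_fraction(current_depth, target_depth):
    """Fraction of reads to keep so that mean depth drops to about target_depth."""
    if current_depth <= 0 or target_depth <= 0:
        raise ValueError("depths must be positive")
    return min(1.0, target_depth / current_depth)


def downsample_command(in_bam, out_bam, fraction, seed=1):
    """Build a 'samtools view -s' command that keeps roughly `fraction` of reads."""
    rounded = round(fraction, 4)
    if not 0 < rounded < 1:
        raise ValueError("fraction must be strictly between 0 and 1")
    # For -s, the integer part is the random seed and the decimals are the
    # fraction of reads to keep, e.g. 1.5000 means seed 1, keep 50%.
    decimals = f"{rounded:.4f}".split(".")[1]
    return ["samtools", "view", "-b", "-s", f"{seed}.{decimals}",
            "-o", out_bam, in_bam]
```

For a hypothetical sample sequenced at 60X and a hypothetical target of 30X, subsample_fraction(60, 30) gives 0.5, and downsample_command("in.bam", "out.bam", 0.5) passes "1.5000" to -s.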

We assessed the mappability of each reference genome using GenMap v1.3.0, first indexing the genome and then calculating mappability with flags ‘-K 30’ and ‘-E 2’. We then filtered each VCF to only include nucleotides with a mappability score of one. To do this, we sorted both VCF files and mappability BED files using BEDtools and retained only SNPs falling in regions with a mappability score of one in each VCF using BEDtools intersect. We then filtered each VCF using GATK VariantFiltration with the following flag: ‘--filter-expression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"’. Then, we filtered VCF files to remove indels and non-biallelic SNPs using VCFtools flags ‘--remove-indels --min-alleles 2 --max-alleles 2’.
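
The logic of the hard-filter expression above can be written out as a small Python predicate. This is only a sketch for readers who want to re-check individual sites; it is not how GATK evaluates the expression internally, and it assumes that an annotation missing from a site does not cause that site to fail.

```python
def fails_hard_filter(info):
    """Return True if a site's annotations meet any hard-filter condition.

    `info` maps annotation names to floats. Annotations that are absent
    are skipped (an assumption about how missing values are treated).
    """
    checks = (
        ("QD", lambda v: v < 2.0),
        ("FS", lambda v: v > 60.0),
        ("MQ", lambda v: v < 40.0),
        ("MQRankSum", lambda v: v < -12.5),
        ("ReadPosRankSum", lambda v: v < -8.0),
    )
    # The conditions are joined by OR: one failing annotation is enough.
    return any(name in info and is_bad(info[name]) for name, is_bad in checks)
```

For example, a site with QD = 25.0 and FS = 75.0 is flagged because FS exceeds 60.0, even though QD is acceptable.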

Data for all snow leopard samples were mapped to the snow leopard reference genome and SNPs were called and filtered in the same way described above.

Variant filtering of snow leopard SNPs called by service provider, Gencove Inc.:

In addition to the variant calls described above, SNPs were also called for our snow leopard dataset by a service provider, Gencove Inc. These calls were first filtered based on mappability across the genome. We indexed the reference genome and generated mappability scores using GenMap v1.3.0 with flags ‘-K 30’ and ‘-E 2’. GenMap mappability scores represent the uniqueness of k-mers (k-mer size given by flag -K) for each position in the genome while allowing for a certain number of mismatches (given by the flag -E) where a mappability score of one at a position indicates that the k-mer at that position occurs only once with up to E mismatches. We sorted the reference genome and mappability BED file using BEDtools and removed SNPs falling in a region of the genome with a mappability score less than one using BEDtools intersect with the flags ‘-v’, ‘-header’, and ‘-sorted’ (leaving 63,199,070 sites). We then used VCFtools to remove indels using the option ‘--remove-indels’ (leaving 51,199,725 SNPs). We removed individuals with less than 60% of the genome represented (samples U02, U12, and U20) using the VCFtools option ‘--remove-ind’. Most SNPs were removed at this step because sample U02 had over 1.3 million unique singletons (Supplementary Table 1) and over 44 million unique doubletons, likely due to some unknown contamination. We then used VCFtools to remove sites that no longer had any variation using flag ‘--non-ref-ac-any 1’ (leaving 2,379,069 SNPs), SNPs that did not fall on putative autosomes (leaving 2,213,174 SNPs), and non-biallelic SNPs using the flags ‘--min-alleles 2 --max-alleles 2’ (leaving 2,146,294 SNPs). We also split scaffold 22 at base pair 67,650,000 into chromosomes E1 and F1, respectively, to remedy the misassembly described in Armstrong et al. 
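
The mappability score described above can be illustrated with a toy calculation. The sketch below is not GenMap: it assumes exact matches only (the equivalent of E = 0), considers only the forward strand, and takes the score at a position to be one divided by the number of times the k-mer starting there occurs in the sequence.

```python
from collections import Counter


def exact_mappability(sequence, k):
    """Toy per-position mappability: 1 / (occurrences of the k-mer starting here)."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    # A score of 1.0 means the k-mer starting at that position is unique.
    return [1.0 / counts[kmer] for kmer in kmers]
```

With the sequence "ACGACGT" and k = 3, the k-mer "ACG" occurs twice, so its two start positions score 0.5 while the other positions score 1.0. Only positions scoring exactly one would be retained under the filter described above.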
Next, with BAM files as input, we used GATK v4.1.4.1 to add the following annotations to our filtered VCF file using VariantAnnotator: QD (QualByDepth - variant confidence normalized by unfiltered depth of variant samples), FS (FisherStrand - strand bias estimated using Fisher’s exact test), MQ (RMSMappingQuality - root mean square of the mapping quality of reads across all samples), MQRankSum (MappingQualityRankSumTest - rank sum test for mapping qualities of reference versus alternate reads), and ReadPosRankSum (ReadPosRankSumTest - rank sum test for relative positioning of reference versus alternate alleles within reads). After adding these annotations, we used GATK VariantFiltration to filter SNPs using the following flag: ‘--filter-expression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"’. We then used VCFtools to remove SNPs that did not pass the GATK VariantFiltration step using option ‘--remove-filtered-all’ (leaving 2,065,453 SNPs). Lastly, we removed SNPs missing data in more than 10% of the individuals using flag ‘--max-missing 0.9’ (leaving a final SNP set of 1,591,978). We calculated the number of singletons and private doubletons using the VCFtools flag ‘--singletons’ and removed one of the captive samples (10x_SDzoo) due to an excess of singletons (Supplementary Table 1), which we suspected to be due to the 10X library prep method unique to this sample.
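
To make the effect of each filtering step easier to see, the site counts reported above can be turned into the number of sites removed per step. The counts are taken directly from the text; the step labels are shorthand of ours.

```python
# (label, sites remaining after the step), in the order reported above
steps = [
    ("mappability score of one", 63_199_070),
    ("indels removed", 51_199_725),
    ("low-coverage individuals and invariant sites removed", 2_379_069),
    ("putative autosomes only", 2_213_174),
    ("biallelic SNPs only", 2_146_294),
    ("passed GATK VariantFiltration", 2_065_453),
    ("at most 10% missing data", 1_591_978),
]


def removed_per_step(steps):
    """Return (label, number of sites removed) for every step after the first."""
    return [
        (label, before - after)
        for (_, before), (label, after) in zip(steps, steps[1:])
    ]


for label, removed in removed_per_step(steps):
    print(f"{label}: {removed:,} removed")
```

The largest drop, 48,820,656 sites, comes from removing the three low-coverage individuals and the sites left without variation, which is consistent with the explanation given above for sample U02.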