Assessing kinship detection: Single nucleotide polymorphism array density and estimator comparison in white-tailed deer
Data files
Dec 22, 2025 version files (495.22 MB total)
- Kinship_Detection.zip (495.21 MB)
- README.md (10.84 KB)
Abstract
Single nucleotide polymorphism (SNP) arrays have become increasingly popular due to their affordability, commercial availability, statistical power, and reproducibility. These arrays are being developed commercially for a wide range of species in various density formats. In this study, we evaluated the ability of commercially available medium-density (72,732 SNPs) and high-density (702,183 SNPs) SNP arrays for white-tailed deer (Odocoileus virginianus) to accurately identify known genetically related individuals within a wild population. We also assessed the impact of SNP filtering thresholds on relatedness analyses and compared the performance of four common relatedness software programs (KING, COLONY, Sequoia, and COANCESTRY) on these known related pairs. Our analysis revealed that the medium-density array exhibited greater tolerance to filtering and lower sensitivity to bioinformatic pipelines, so it offers a favorable balance between cost, computational time, and statistical power for analyses such as population structure. Additionally, we found that reducing missing data, specifically by using a subset of 600 loci with no missing data, combined with the relatedness estimator Sequoia (which allows the inclusion of life history data), yielded the most computationally efficient and accurate results. These findings offer valuable insights into the optimal SNP array size, appropriate filtering thresholds, and the most effective genetic relatedness methods for wildlife population studies.
Dataset DOI: 10.5061/dryad.m63xsj4fm
Description of the data and file structure
Kinship_Detection.zip: Genotyping data from white-tailed deer samples generated using the Axiom OVSNP600 and Axiom OVSNP60 genotyping arrays (Thermo Fisher Scientific). Genotype data are provided in Variant Call Format (VCF) files. Input files for the software programs COLONY (Jones and Wang 2010), COANCESTRY (Wang 2011), and Sequoia (Huisman 2017), all prepared using R, are included. Additionally, metadata are available for all individuals, including information on known genetically related pairs.
We have submitted our raw data (raw, unfiltered genomic data in the folder “VCF”), information on known related individuals (Known_Genetic_Relationships), metadata for all individuals in our analysis (MNDNR_SamplesForSNPsProjec_Modified4publicrelease), and all input files for the relatedness software COLONY, COANCESTRY, and Sequoia. All input files and data files are to be used as is; no insertion of NAs or other preprocessing is needed. All associated code is available on GitHub: https://github.com/AlecJChristensen/kinship-detection-snp-density-2025.
COANCESTRY Input Files: Contains input files for running the software COANCESTRY. Formatted for compatibility with COANCESTRY’s input requirements using the 600-loci, medium-density, and high-density datasets.
Subfolders include:
- 600-loci: Contains the input file to run COANCESTRY for the 600-loci dataset “600loci.coancestryinput.file.”
- Medium-Density: Contains the input file to run COANCESTRY for the medium-density dataset “60kR_6.3.24.”
- High-Density: Contains the input file to run COANCESTRY for the high-density dataset “600K15.20.numeric.noquote_7.12.24.”
COLONY Input Files: Contains input files for running the software COLONY. Formatted for compatibility with COLONY’s input requirements using the 600-loci, medium-density, and high-density datasets.
Subfolders include:
- 600-Loci: Includes an input file for each of COLONY’s required fields: error rate (error.rate.07), females (female.adults.input.plusMN164234), males (male.adults.input), offspring (colony.offspring.input.minusMN164234), and one randomly chosen pair with a known maternal relationship (Known_maternalsibs.input.subsetof1).
- Medium-Density: Includes an input file for each of COLONY’s required fields: error rate (colony.error.SNP.input.0175), females (female.adults.input.plusMN164234), males (male.adults.input), offspring (colony.offspring.input.minusMN164234), and one randomly chosen pair with a known maternal relationship (Known_maternalsibs.input.subsetof1).
- High-Density: Includes an input file for each of COLONY’s required fields: error rate (error.rate0175.600k15.20), females (femaleadult600k15.20), males (maleadult600k15.20), offspring (offspring600k15.20), and one randomly chosen pair with a known maternal relationship (Known_maternalsibs.input.subsetof1).
Sequoia Input Files: Contains input files for running the R package Sequoia. Formatted for compatibility with Sequoia’s input requirements using the 600-loci, medium-density, and high-density datasets.
Subfolders include:
- 600-Loci: Contains the input ped file for Sequoia, “Random600.pilot.ped”.
- Medium-Density
  - Sequoia ped files for the medium-density dataset. These include the filtered and non-filtered ped files that were used in the Sequoia analyses. Please see the manuscript for details: 60QC_filtered15.10.ped; 60QC_filtered15.15.ped; 60QC_filtered15.20.ped; 60QC_filtered20.10.ped; 60QC_filtered20.15.ped; 60QC_filtered20.20.ped; 60QC_filtered25.10.ped; 60QC_filtered25.15.ped; 60QC_filtered25.20.ped; OvSNP60LTped.ped
  - CSV files for Sequoia life history input. Please see the manuscript for details. Each CSV contains the same column headers: “ID”, “Sex”, “BirthYear”, “BY.Min”, “BY.Max”, “Year.last”. Please see the Sequoia user guide for detailed explanations of each of these headers. LH60_15.10_4.16.24; LH60_15.15_4.16.24; LH60_15.20_4.16.24; LH60_20.10_4.16.24; LH60_20.15_4.16.24; LH60_20.20_4.16.24; LH60_25.10_4.16.24; LH60_25.15_4.16.24; LH60_25.20_4.16.24; LH60_RealMaster
- High-Density
  - Sequoia ped files for the high-density dataset. These include the filtered and non-filtered ped files that were used in the Sequoia analyses. Please see the manuscript for details: 600QC_filtered15.10.ped; 600QC_filtered15.15.ped; 600QC_filtered15.20.ped; 600QC_filtered20.10.ped; 600QC_filtered20.15.ped; 600QC_filtered20.20.ped; 600QC_filtered25.10.ped; 600QC_filtered25.15.ped; 600QC_filtered25.20.ped; OvSNP600QCped.ped
  - CSV files for Sequoia life history input. Please see the manuscript for details. Each CSV contains the same column headers: “ID”, “Sex”, “BirthYear”, “BY.Min”, “BY.Max”, “Year.last”. Please see the Sequoia user guide for detailed explanations of each of these headers. LH600_15.10_4.16.24; LH600_15.15_4.16.24; LH600_15.20_4.16.24; LH600_20.10_4.16.24; LH600_20.15_4.16.24; LH600_20.20_4.16.24; LH600_25.10_4.16.24; LH600_25.15_4.16.24; LH600_25.20_4.16.24; LH600_RealMaster
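Before passing a life history file to Sequoia, it can help to confirm that its header matches the six columns listed above. The following is a minimal Python sketch, not part of the authors' R workflow; the function name `check_life_history_header` and the example row are hypothetical, and the column capitalisation is copied from this README rather than checked against the files themselves.

```python
import csv
import io

# Column headers as listed in this README (capitalisation copied verbatim;
# the actual files may differ, which is exactly what this check would reveal).
EXPECTED_COLUMNS = ["ID", "Sex", "BirthYear", "BY.Min", "BY.Max", "Year.last"]


def check_life_history_header(handle):
    """Return (missing, unexpected) column names for a life history CSV."""
    header = next(csv.reader(handle), [])
    header = [name.strip() for name in header]
    missing = [c for c in EXPECTED_COLUMNS if c not in header]
    unexpected = [c for c in header if c not in EXPECTED_COLUMNS]
    return missing, unexpected


# Hypothetical one-row example; the ID and values are invented for illustration.
example = io.StringIO(
    "ID,Sex,BirthYear,BY.Min,BY.Max,Year.last\n"
    "DEER_A,1,2018,NA,NA,2021\n"
)
missing, unexpected = check_life_history_header(example)
print("missing:", missing, "unexpected:", unexpected)
```

To check a real file, open it with `open(path, newline="")` and pass the handle to the same function.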
VCF: Contains the Variant Call Format (VCF) files, and the associated output from the genotyping provider, for the 600-loci, medium-density, and high-density SNP datasets.
- 600-Loci: Random600.pilot is the VCF for the 600-loci dataset.
- Medium-Density
  - 1670010002_LowThersholds_LowSNPQCTXT.txt: Text version of call format in character format from North American Genomics (Decatur, GA). This file was not used for any analyses but is provided for readers who want to see all associated files.
  - 1670010002_LowThersholds_LowSNPQCTXT.log: Summary output from North American Genomics (Decatur, GA). This file was not used for any analyses but is provided for readers who want to see all associated files.
  - 1670010002_LowThreshold_LowSNPQCVCF.vcf: Variant call file of the genomic output from North American Genomics (Decatur, GA). This is the raw VCF that we used for analyses.
  - 1670010002_LowThreshold_LowSNPQCVCF.log: Summary output from North American Genomics (Decatur, GA). This file was not used for any analyses but is provided for readers who want to see all associated files.
  - 1670010002_LowThresholds_LowSNPQC_AxiomAnalysisQCSummary.pdf: Data quality metrics from North American Genomics (Decatur, GA). This file was not used for any analyses but is provided for readers who want to see all associated files.
- High-Density
  - 1670010001_LowThresholds_LowSNPQC_AxiomAnalysisQCSummary.pdf: Analysis summary from North American Genomics (Decatur, GA). This file was not used for any analyses but is provided for readers who want to see all associated files.
  - 1670010001_LowThresholds_LowSNPQCTXT.txt: Text version of call format in character format from North American Genomics. This file was not used for any analyses but is provided for readers who want to see all associated files.
  - 1670010001_LowThresholds_LowSNPQCTXT.log: Summary output from North American Genomics. This file was not used for any analyses but is provided for readers who want to see all associated files.
  - 1670010001_LowThresholds_LowSNPQCVCF.vcf: Variant call file of the genomic output from North American Genomics (Decatur, GA). This is the raw VCF that we used for analyses.
  - 1670010001_LowThresholds_LowSNPQCVCF.log: Analysis summary from North American Genomics (Decatur, GA). This file was not used for any analyses but is provided for readers who want to see all associated files.
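Because the VCF files are raw and unfiltered, and the abstract highlights a subset of loci with no missing data, a reader may want to compute per-locus call rates before any relatedness analysis. The sketch below is an illustration in plain Python, not the authors' pipeline (their R code is on GitHub). It assumes the files follow the standard VCF layout, with nine fixed columns followed by one column per sample and GT as the first FORMAT subfield; the function names and the two example SNP records are hypothetical.

```python
import io


def locus_call_rates(handle):
    """Yield (chrom, pos, variant_id, call_rate) for each VCF data line."""
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information lines, the column header, blanks
        fields = line.rstrip("\n").split("\t")
        chrom, pos, variant_id = fields[0], fields[1], fields[2]
        samples = fields[9:]  # sample columns follow the 9 fixed VCF columns
        if not samples:
            continue
        # GT is the first colon-separated subfield; "." marks a missing allele.
        # A genotype with any missing allele is counted as not called.
        called = sum(
            1 for s in samples
            if "." not in s.split(":")[0].replace("|", "/").split("/")
        )
        yield chrom, pos, variant_id, called / len(samples)


def complete_loci(handle):
    """Return IDs of loci genotyped in every sample (call rate of 1.0)."""
    return [vid for _, _, vid, rate in locus_call_rates(handle) if rate == 1.0]


# Hypothetical two-sample, two-locus VCF used only to exercise the functions.
example = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n"
    "1\t100\tsnp_a\tA\tG\t.\t.\t.\tGT\t0/1\t1/1\n"
    "1\t200\tsnp_b\tC\tT\t.\t.\t.\tGT\t./.\t0/0\n"
)
print(complete_loci(example))
```

Reading line by line keeps memory use low, which matters for the high-density file. How the 600 loci in this dataset were actually chosen is described in the manuscript, not reproduced here.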
Excel Files
Genetically_Known_Pairs.csv: Contains pairs of individuals with known genetic relationships used for validation.
- Headers include “Parent”, the adult female of the offspring listed under the headers “Fetus 1”, “Fetus 2”, and “Fetus 3”. NA indicates that there is no additional offspring.
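Each row of this file holds one adult female and up to three fetuses, so validation usually requires expanding rows into explicit pairs. The following Python sketch shows one way to do that under the layout described above. The function name `expand_known_pairs` and the example IDs are hypothetical, and treating an empty cell the same as NA is an assumption.

```python
import csv
import io
from itertools import combinations

FETUS_COLUMNS = ["Fetus 1", "Fetus 2", "Fetus 3"]


def expand_known_pairs(handle):
    """Return (mother_offspring, littermates) lists of ID pairs."""
    mother_offspring, littermates = [], []
    for row in csv.DictReader(handle):
        dam = row["Parent"].strip()
        # Treat "NA" or an empty cell as "no additional offspring".
        fetuses = [
            row[col].strip() for col in FETUS_COLUMNS
            if row.get(col) and row[col].strip() not in ("", "NA")
        ]
        mother_offspring.extend((dam, f) for f in fetuses)
        # Fetuses from the same female share a mother (maternal siblings).
        littermates.extend(combinations(fetuses, 2))
    return mother_offspring, littermates


# Hypothetical rows; the IDs are invented and do not come from the dataset.
example = io.StringIO(
    "Parent,Fetus 1,Fetus 2,Fetus 3\n"
    "DOE_1,FET_1A,FET_1B,NA\n"
    "DOE_2,FET_2A,NA,NA\n"
)
pairs, sibs = expand_known_pairs(example)
print(pairs)
print(sibs)
```

The two lists can then be compared against the pairs reported by each relatedness program.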
MNDNR_SamplesForSNPSProjec_Modified4publicrelease.csv: Master metadata file for all individuals.
Headers include:
- WHP_sample_id: Official sample ID, the same ID used when the sample was sent to the lab for analysis. Each ID is unique.
- Year: Calendar year the animal was sampled. Options include: 2016, 2017, 2018, 2019, 2020, 2021, 2022, or 2023.
- Sample_category: How was the sample collected? Options include: Targeted or Opportunistic.
- sample_acquisition: How was the sample acquired? Options include: Agency culled, Found dead, Hunter harvested, Reported sick, Shooting permit, or Vehicle killed (the individuals in this study were a subset of all samples in the study system; not all options might be displayed).
- date_harvested: When was the deer harvested? This column may be blank if the deer was found dead or reported sick and the harvest date is not known.
- date_collected: When was the sample collected? This column should always have a value in it, regardless of the sample_acquisition type.
- permit_area: Deer Permit Area where the deer was harvested. There should be no blanks in this column. This is a three-digit code; the DNR can provide a shapefile if needed.
- Sex: Sex of deer. Options include: Male, Female, or Unknown.
- Age: Age of deer. Options include: Adult, Yearling, Fawn, Fetus, or Unknown.
- Known_Age: What is the known age of this deer? Yearlings are considered 1.5 years old, fawns are considered 0.5 years old, and adults are aged with cementum annuli when teeth are available for analysis. Fetuses were taken from culled deer (prior to birth).
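Several metadata columns are documented with a fixed set of options, which makes a quick consistency check straightforward. The sketch below is a hedged Python illustration. The allowed values and column names are copied from the list above, the function name `find_unexpected_values` and the example rows are hypothetical, and matching is case-sensitive, so any difference in capitalisation in the real file would be flagged.

```python
import csv
import io

# Allowed values as documented in this README.
ALLOWED = {
    "Sample_category": {"Targeted", "Opportunistic"},
    "sample_acquisition": {
        "Agency culled", "Found dead", "Hunter harvested",
        "Reported sick", "Shooting permit", "Vehicle killed",
    },
    "Sex": {"Male", "Female", "Unknown"},
    "Age": {"Adult", "Yearling", "Fawn", "Fetus", "Unknown"},
}


def find_unexpected_values(handle):
    """Return (line_number, column, value) for each undocumented value."""
    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(csv.DictReader(handle), start=2):
        for column, allowed in ALLOWED.items():
            value = (row.get(column) or "").strip()
            if value not in allowed:
                problems.append((line_number, column, value))
    return problems


# Hypothetical rows; "Roadkill" is deliberately not a documented option.
example = io.StringIO(
    "WHP_sample_id,Year,Sample_category,sample_acquisition,Sex,Age\n"
    "ID_001,2019,Targeted,Agency culled,Female,Adult\n"
    "ID_002,2020,Opportunistic,Roadkill,Male,Fawn\n"
)
print(find_unexpected_values(example))
```

A missing column is reported as an empty value on every row, so the same function also catches header mismatches.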
Code/software
Analyses were conducted using COLONY (Jones and Wang 2010), COANCESTRY (Wang 2011), Sequoia (Huisman 2017), and KING (Manichaikul et al. 2010), with all data processing and visualization performed in R v4.4.2 (R Core Team 2024). All associated code is available on GitHub: https://github.com/AlecJChristensen/kinship-detection-snp-density-2025
