Data from: Reducing cryptic relatedness in genomic datasets via a central node exclusion algorithm

Fonseca, Pablo A.S.1; Leal, Thiago P.1; Santos, Fernanda Caroline1; Gouveia, Mateus Henrique1; Id-Lahoucine, Samir2; Rosse, Izinara C.1; Ventura, Ricardo V.2; Bruneli, Frank Angelo T.3; Machado, Marco Antônio3; Peixoto, Maria Gabriela C.D.1; Tarazona-Santos, Eduardo1; Carvalho, Maria Raquel S.1

Published Dec 07, 2017 on Dryad. https://doi.org/10.5061/dryad.k8b8n

Data files

Dec 07, 2017 version files: 1.90 GB total

Gwas_simulations_All_Animals.zip (375.55 MB)
Gwas_simulations_centralityG.zip (372 MB)
Gwas_simulations_CentralityIBD.zip (369.14 MB)
Gwas_simulations_threshold_Gmatrix.zip (362.50 MB)
Gwas_simulations_thresholdIBD.zip (361.28 MB)
Supplementary_file1.R (2.06 KB)
Tabular_Gmatrix_All_animals.txt (19.88 MB)
Tabular_kinship_AllAnimals.txt (39.05 MB)

Abstract

Cryptic relatedness is a confounding factor in genetic diversity and genetic association studies. Developing strategies to reduce cryptic relatedness in a sample is a crucial step for downstream genetic analyses. The present study applies a node selection algorithm, based on network degree centrality, and evaluates its applicability and its impact on the assessment of genetic diversity and population stratification. A total of 1,036 Guzerá (Bos indicus) females were genotyped using the Illumina Bovine SNP50 v2 BeadChip. Four strategies were compared. The first and second strategies consist of an iterative exclusion of the most related individuals based on the PLINK kinship coefficient (φij) and VanRaden’s φij, respectively. The third and fourth strategies were based on a node selection algorithm. The fourth strategy, Network G matrix, preserved the largest number of individuals, with better diversity and representation of the initial sample. The most probable number of populations inferred was directly affected by the kinship metric. Network G matrix was the best strategy for reducing relatedness because it produced a larger sample with more distantly related individuals, a distribution in the MDS plots more similar to that of the full dataset, and a better representation of the population structure. Resampling strategies using VanRaden’s φij as the relationship metric were better at inferring the relationships among individuals. Moreover, the resampling strategies directly impact the genomic inflation values in genome-wide association studies. The node selection algorithm also allows a better selection of the most central individuals to be removed, providing a more representative sample.
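
The node selection idea in the abstract can be pictured as a graph problem: individuals are nodes, pairs whose kinship exceeds a cutoff are edges, and the most connected node is removed repeatedly until no related pair remains. The Python sketch below is one plausible reading of that idea, not the authors' implementation. The input layout (`pairs`), the `threshold` parameter, the function name `central_node_exclusion`, the toy kinship values, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import defaultdict


def central_node_exclusion(pairs, threshold):
    """Iteratively drop the individual with the most above-threshold relatives.

    pairs: iterable of (id_a, id_b, kinship) tuples (hypothetical input layout).
    threshold: kinship value above which two individuals count as related
               (an illustrative parameter, not a value taken from the study).
    Returns the set of removed individuals.
    """
    # Build an undirected graph with one edge per related pair.
    neighbours = defaultdict(set)
    for a, b, phi in pairs:
        if a != b and phi > threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    removed = set()
    while True:
        # Degree centrality here is the count of remaining related partners.
        degrees = {n: len(adj) for n, adj in neighbours.items() if adj}
        if not degrees:
            break
        # Highest degree first; ties go to the smallest identifier (assumption).
        worst = min(degrees, key=lambda n: (-degrees[n], n))
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
        removed.add(worst)
    return removed


# Toy data: "A" is related to three others, so removing it clears every edge.
toy_pairs = [
    ("A", "B", 0.25), ("A", "C", 0.25), ("A", "D", 0.125),
    ("B", "C", 0.02), ("D", "E", 0.01),
]
removed = central_node_exclusion(toy_pairs, threshold=0.1)
print(sorted(removed))
```

Removing the most connected individual first tends to break many related pairs at once, which is why a degree-based rule can keep more individuals than dropping one member of each related pair in turn.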

Gwas_simulations_All_Animals

GWAS simulation results obtained for the full data set (1,036 animals). The file contains the folders with the simulations performed using both heritability values (h2=0.2 and h2=0.5). Each folder holds the GWAS simulation results for every replicate, a file with all the results merged, and the simulated QTLs.

Gwas_simulations_centralityG

GWAS simulation results obtained for the Network G matrix sample (286 animals). The file contains the folders with the simulations performed using both heritability values (h2=0.2 and h2=0.5). Each folder holds the GWAS simulation results for every replicate, a file with all the results merged, and the simulated QTLs.

Gwas_simulations_CentralityIBD

GWAS simulation results obtained for the Network IBD sample (210 animals). The file contains the folders with the simulations performed using both heritability values (h2=0.2 and h2=0.5). Each folder holds the GWAS simulation results for every replicate, a file with all the results merged, and the simulated QTLs.

Gwas_simulations_threshold_Gmatrix

GWAS simulation results obtained for the Threshold G matrix sample (286 animals). The file contains the folders with the simulations performed using both heritability values (h2=0.2 and h2=0.5). Each folder holds the GWAS simulation results for every replicate, a file with all the results merged, and the simulated QTLs.

Gwas_simulations_thresholdIBD

GWAS simulation results obtained for the Threshold IBD sample (203 animals). The file contains the folders with the simulations performed using both heritability values (h2=0.2 and h2=0.5). Each folder holds the GWAS simulation results for every replicate, a file with all the results merged, and the simulated QTLs.
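
The abstract notes that the resampling strategy affects genomic inflation in GWAS. A common summary of inflation is the factor lambda: the median observed 1-df chi-square statistic divided by its expected median under the null. The Python sketch below computes lambda from a plain list of p-values. It makes no claim about how the archived result files are laid out or how the authors computed their values; the function name `genomic_inflation` and the example p-values are hypothetical.

```python
from statistics import NormalDist, median


def genomic_inflation(p_values):
    """Return lambda: median observed 1-df chi-square over its null median.

    p_values must lie strictly between 0 and 1.
    """
    std_normal = NormalDist()
    # A two-sided p-value p corresponds to the 1-df chi-square statistic z**2,
    # where z is the standard-normal quantile at 1 - p/2.
    chi_sq = [std_normal.inv_cdf(1.0 - p / 2.0) ** 2 for p in p_values]
    # Null median of a 1-df chi-square, obtained the same way from p = 0.5
    # (approximately 0.455).
    expected_median = std_normal.inv_cdf(0.75) ** 2
    return median(chi_sq) / expected_median


# Evenly spread p-values mimic the null, so lambda should be close to 1.
null_like = [i / 100 for i in range(1, 100)]
lambda_null = genomic_inflation(null_like)

# Halving every p-value mimics inflated test statistics, so lambda exceeds 1.
lambda_inflated = genomic_inflation([p / 2 for p in null_like])
print(lambda_null, lambda_inflated)
```

Values of lambda well above 1 are usually read as a sign of unaccounted relatedness or stratification, which is the kind of effect the five simulation sets allow one to compare.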

Tabular_kinship_AllAnimals

This file contains the kinship values between each pair of individuals present in the All animals sample. These kinship values were used in each resampling approach applied in the present study.

Tabular_Gmatrix_All_animals

This file contains the G matrix (VanRaden, 2009) values between each pair of individuals present in the All animals sample. These values were used in each resampling approach applied in the present study.

Supplementary_file1

This file contains the R script used to perform the threshold resampling applied in this study.
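
The R script itself is not reproduced here. As a rough illustration of what a threshold-based iterative exclusion can look like, the Python sketch below repeatedly takes the most related pair above a cutoff and drops one of its members until no pair exceeds the cutoff. It is not a translation of Supplementary_file1.R. The input layout, the `threshold` value, the function name `threshold_exclusion`, and the rule for choosing which member to drop (the one with the larger summed kinship to the individuals still kept) are assumptions made for the example.

```python
def threshold_exclusion(pairs, threshold):
    """Drop one member of the most related pair until no pair exceeds threshold.

    pairs: iterable of (id_a, id_b, kinship) tuples (hypothetical layout).
    Returns the sorted list of individuals that are kept.
    """
    kin = {frozenset((a, b)): phi for a, b, phi in pairs if a != b}
    kept = {ind for pair in kin for ind in pair}

    def load(ind):
        # Summed kinship between ind and every individual still kept.
        return sum(v for p, v in kin.items() if ind in p and p <= kept)

    while True:
        # Pairs above the threshold whose members are both still kept.
        related = {p: v for p, v in kin.items() if v > threshold and p <= kept}
        if not related:
            break
        # Most related pair first; sorted ids keep ties deterministic.
        top = max(related, key=lambda p: (related[p], sorted(p)))
        a, b = sorted(top)
        # Assumed rule: drop the member more related to the rest of the sample.
        kept.discard(a if load(a) >= load(b) else b)
    return sorted(kept)


toy_pairs = [
    ("A", "B", 0.30), ("A", "C", 0.20),
    ("B", "C", 0.05), ("C", "D", 0.02),
]
kept = threshold_exclusion(toy_pairs, threshold=0.1)
print(kept)
```

In this toy case the pair A and B is the most related, and A is dropped because it is also related to C, which leaves no pair above the cutoff.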