The estimation of the inbreeding coefficient (F) is essential for the study of inbreeding depression (ID) or for the management of populations under conservation. Several methods have been proposed to estimate the realized F using genetic markers, but it remains unclear which one should be used. Here we used whole-genome sequence data for 245 individuals from a Holstein cattle pedigree to empirically evaluate which estimators best capture homozygosity at variants causing ID, such as rare deleterious alleles or loci presenting heterozygote advantage and segregating at intermediate frequency. Estimators relying on the correlation between uniting gametes (F_UNI) or on the genomic relationships (F_GRM) presented the highest correlations with these variants. However, homozygosity at rare alleles remained poorly captured. A second group of estimators relying on excess homozygosity (F_HOM), homozygous-by-descent segments (F_HBD), runs-of-homozygosity (F_ROH) or on the known genealogy (F_PED) was better at capturing whole-genome homozygosity, which reflects the consequences of inbreeding on all variants, and homozygosity at young alleles with low to moderate frequencies. The results indicate that F_UNI and F_GRM might present a stronger association with ID. However, the situation might be different when recessive deleterious alleles reach higher frequencies, such as in populations with a small effective population size. For locus-specific inbreeding measures or at low marker density, the ranking of the methods can also change as F_HBD makes better use of the information from neighbouring markers. Finally, we confirmed that genomic measures are in general superior to pedigree-based estimates. In particular, F_PED was uncorrelated with locus-specific homozygosity.

DNA samples were extracted from whole blood or semen using standard protocols. Sequencing was done on Illumina HiSeq 2000 instruments, with libraries prepared using a PCR-free method and 550 bp insert sizes (DAMONA pedigree). Paired-end sequencing with a read length of 2 x 100 base pairs was applied.

The whole-genome sequence data was analyzed according to GATK Best Practices v3.4. Reads (FASTQ files) were aligned to the reference genome (Bos taurus UMD 3.1) with BWA MEM (version 0.7.9a-r786; Li 2013) using the default settings. PCR duplicates were detected in the sorted BAM files using sambamba (v0.4.6), and Picard tools and bedtools were used to generate library statistics and coverage information.

The obtained BAM files were then realigned around indels and recalibrated for base quality with the Genome Analysis Toolkit (GATK 2.7.4; DePristo et al. 2011). Lists of known SNPs used for recalibration were obtained from dbSNP release 138. Variant calling was performed with GATK HaplotypeCaller in N+1 mode. For calibration of variant quality, a set of trusted SNPs and indels was used. For SNPs, the set consisted of SNPs from the BovineHD (Illumina) and Axiom Genome-Wide BOS 1 (Affymetrix) commercial genotyping arrays. For indels, we selected a subset of indels identified in the DAMONA pedigree that behaved like true Mendelian variants: no parent-offspring incompatibilities (e.g. opposite homozygotes), no deviation from Hardy-Weinberg proportions (p > 0.05) and no deviation from expected genotypic proportions in the offspring of heterozygous parents (p > 0.05). In addition, we computed the probability of observing no parent-offspring inconsistency if parental alleles were drawn at random, and retained only indels with a probability below 1e-12 (to make sure that the absence of parent-offspring incompatibilities was not due to chance).
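The last filter can be illustrated with a short Python sketch. The exact model used for this computation is not described here, so the code below is only one possible reading and rests on loudly stated assumptions: each parent genotype is drawn independently under Hardy-Weinberg proportions, an inconsistency means opposite homozygotes, and the function name and arguments are hypothetical.

```python
from typing import Iterable


def prob_no_inconsistency(offspring_genotypes: Iterable[int], alt_freq: float) -> float:
    """Probability that no parent-offspring pair shows opposite homozygotes
    when the parent genotype is drawn at random.

    ASSUMPTION (not stated in the methods): parents are drawn independently
    under Hardy-Weinberg proportions with ALT allele frequency `alt_freq`.
    Genotypes are coded as the number of ALT alleles (0, 1 or 2), with one
    entry per parent-offspring pair.
    """
    p = alt_freq
    q = 1.0 - p
    prob = 1.0
    for g in offspring_genotypes:
        if g == 0:
            prob *= 1.0 - p * p  # a random parent must not be ALT/ALT
        elif g == 2:
            prob *= 1.0 - q * q  # a random parent must not be REF/REF
        # heterozygous offspring are compatible with any parent genotype
    return prob


# Made-up example: 10 pairs whose offspring are all REF/REF, ALT frequency 0.5
prob = prob_no_inconsistency([0] * 10, 0.5)
keep_indel = prob < 1e-12  # threshold quoted in the text
print(prob, keep_indel)
```

Under this reading, an indel passes only when enough informative (homozygous offspring) pairs exist for a clean record to be very unlikely by chance.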

The data is a subset of a pedigree of 743 sequenced Holstein cattle. The full data set has been used, for instance, in the following study:

https://genome.cshlp.org/content/early/2020/06/26/gr.256172.119.short?rss=1

The original VCF files are also available at https://www.ebi.ac.uk/ena/browser/view/PRJEB38336, under the name BPWG.vcf.gz.

A readme file describing the files and their format is included in the folder.

A total of 13,037,955 bi-allelic SNPs were selected for the associated study. From the sequenced pedigree, we retained individuals with a sequencing coverage > 15x (266).
We also removed one outlier with extreme whole-genome heterozygosity.

The data was divided into two subsets:
1) genotype_parents.txt: a set of 145 sequenced parents of the individuals in the second set
2) genotype_targets.txt: a set of 100 sequenced targets with both parents sequenced and without sequenced offspring
These 245 sequenced individuals thus form trios.

Format, per line:
1) chromosome
2) position in bp
3) REF allele
4) ALT allele
5 and more) one genotype per individual, coded as the number of alternate alleles: 0 for REF/REF, 1 for REF/ALT, 2 for ALT/ALT, 9 for missing
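
The format above can be read with a few lines of Python. This is a minimal sketch that assumes the columns are whitespace-delimited (the delimiter is not stated here); the names SnpRecord, parse_genotype_line and alt_allele_frequency are hypothetical, and the example line is made up rather than taken from the data.

```python
from typing import List, NamedTuple, Optional

MISSING = 9  # missing-genotype code given in the format description


class SnpRecord(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    genotypes: List[int]  # one entry per individual: 0, 1, 2 or 9


def parse_genotype_line(line: str) -> SnpRecord:
    """Parse one line: chromosome, position, REF, ALT, then genotypes."""
    fields = line.split()  # ASSUMPTION: whitespace-delimited columns
    genotypes = [int(g) for g in fields[4:]]
    return SnpRecord(fields[0], int(fields[1]), fields[2], fields[3], genotypes)


def alt_allele_frequency(record: SnpRecord) -> Optional[float]:
    """ALT allele frequency over called genotypes (None if all are missing)."""
    called = [g for g in record.genotypes if g != MISSING]
    if not called:
        return None
    return sum(called) / (2 * len(called))


example = "1 12345 A G 0 1 2 9 1"  # made-up line, for illustration only
rec = parse_genotype_line(example)
print(rec.chrom, rec.pos, alt_allele_frequency(rec))
```

Reading the files line by line in this way avoids loading all 13 million SNPs into memory at once.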

Genotype files at different marker densities:
1) typ_parents_50K.txt: the genotypes of the selected markers from the 50K genotyping array for the 145 parents
2) typ_targets_50K.txt: the genotypes of the selected markers from the 50K genotyping array for the 100 targets
3) typ_parents_LD.txt: the genotypes of the selected markers from the low-density genotyping array for the 145 parents

Format, per line:
1) chromosome
2) position in bp
3) position in bp (repeated, as a surrogate for the marker name)
4) REF allele
5) ALT allele
6 and more) one genotype per individual, coded as the number of alternate alleles: 0 for REF/REF, 1 for REF/ALT, 2 for ALT/ALT, 9 for missing
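
As an illustration of this second layout, the Python sketch below skips the five leading columns and computes, for each individual, the proportion of homozygous genotypes among called markers. It assumes whitespace-delimited columns, the function names are hypothetical, and the three example lines are made up. The value returned is the raw homozygous proportion, not any of the F estimators discussed in the abstract.

```python
from typing import Iterable, List

MISSING = 9  # missing-genotype code given in the format description


def array_genotypes(line: str) -> List[int]:
    """Return the genotype columns of one array-format line.

    Columns 1-5 are chromosome, position, position again (used as the
    marker name), REF and ALT, so genotypes start at the sixth column.
    """
    return [int(g) for g in line.split()[5:]]  # ASSUMPTION: whitespace-delimited


def homozygosity_per_individual(lines: Iterable[str]) -> List[float]:
    """Proportion of homozygous genotypes (0 or 2) among called markers."""
    hom: List[int] = []
    called: List[int] = []
    for line in lines:
        if not line.strip():
            continue  # skip empty lines
        genos = array_genotypes(line)
        if not hom:  # size the counters from the first data line
            hom = [0] * len(genos)
            called = [0] * len(genos)
        for i, g in enumerate(genos):
            if g == MISSING:
                continue
            called[i] += 1
            if g in (0, 2):
                hom[i] += 1
    return [h / c if c else float("nan") for h, c in zip(hom, called)]


example_lines = [  # made-up markers for three individuals
    "1 100 100 A G 0 1 9",
    "1 200 200 C T 2 1 0",
    "1 300 300 G A 0 0 9",
]
result = homozygosity_per_individual(example_lines)
print(result)
```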

The files parents_ids.txt and targets_ids.txt contain information on the individuals from the genotype files, in three columns:
1) position in genotype file
2) pedigree id of the individual (coded from 1 to 266)
3) inbreeding coefficient estimated with the entire pedigree
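
The id files can be used to link genotype columns to individuals, as in the Python sketch below. Two things are assumptions rather than documented facts: that the columns are whitespace-delimited, and that the position in the first column is 1-based and counts genotype columns only. The names SampleInfo, parse_ids and genotype_column are hypothetical, and the example rows are made up.

```python
from typing import Dict, Iterable, NamedTuple


class SampleInfo(NamedTuple):
    pedigree_id: int
    f_ped: float  # inbreeding coefficient estimated with the entire pedigree


def parse_ids(lines: Iterable[str]) -> Dict[int, SampleInfo]:
    """Map position in the genotype file to (pedigree id, pedigree-based F)."""
    samples: Dict[int, SampleInfo] = {}
    for line in lines:
        fields = line.split()  # ASSUMPTION: whitespace-delimited columns
        if len(fields) < 3:
            continue  # skip empty or incomplete lines
        samples[int(fields[0])] = SampleInfo(int(fields[1]), float(fields[2]))
    return samples


def genotype_column(position: int, leading_columns: int = 4) -> int:
    """0-based field index of an individual in a split genotype line.

    ASSUMPTION: `position` is 1-based. Use leading_columns=4 for the
    sequence-level files and leading_columns=5 for the array-density files.
    """
    return leading_columns + position - 1


example_ids = ["1 17 0.031", "2 204 0.052"]  # made-up rows
samples = parse_ids(example_ids)
print(samples[2].pedigree_id, genotype_column(2))
```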

The file samples_relationships.txt provides the pedigree ids of the parents when these are sequenced, with three columns:
1) pedigree id of the individual
2) pedigree id of the sire, when sequenced with 15x or more
3) pedigree id of the dam, when sequenced with 15x or more
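
Once a target has been linked to its sire and dam, the trio structure allows a simple Mendelian consistency check at each SNP. The Python sketch below is an illustration only: the function names are hypothetical, and the ids are kept as strings because the code used for a non-sequenced parent is not documented here.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

MISSING = 9  # missing-genotype code used in the genotype files


def parse_relationships(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map individual id to (sire id, dam id), kept as strings.

    ASSUMPTION: whitespace-delimited columns. The value used for a parent
    that was not sequenced is not interpreted here.
    """
    trios: Dict[str, Tuple[str, str]] = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 3:
            trios[fields[0]] = (fields[1], fields[2])
    return trios


def gametes(genotype: int) -> Set[int]:
    """Number of ALT alleles (0 or 1) a parent with this genotype can transmit."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def trio_is_consistent(sire: int, dam: int, child: int) -> Optional[bool]:
    """True if the child genotype can arise from one gamete of each parent.

    Returns None when any of the three genotypes is missing.
    """
    if MISSING in (sire, dam, child):
        return None
    return any(s + d == child for s in gametes(sire) for d in gametes(dam))


trios = parse_relationships(["150 12 87"])  # made-up row
print(trios["150"], trio_is_consistent(0, 2, 1), trio_is_consistent(0, 0, 1))
```

Opposite homozygotes between a parent and its offspring are a special case of the inconsistencies this check detects.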

Annotation file annotation_VQSR4_VEP_S_NS.txt:
1) chromosome
2) position in bp
3) SNP id
4) REF allele
5) ALT allele
6) Annotation where M = missense; S = synonymous; I = intergenic or intronic; O = other; U = unknown (based on snpEff)
7) Ancestral allele where R = ancestral is REF; A = ancestral is ALT; U = ancestral is unknown (based on information from Rocha et al., 2014)
8) SNP observed in Belgian Blue sample: 0 = no, 1 = yes
9) SNP observed in TAAF cattle sample: 0 = no, 1 = yes
10) Variant class: 0 = other; 1 = synonymous; 2 = missense (based on Variant Effect Predictor)
11) Impact from Variant Effect Predictor: 1 = MODIFIER; 2 = LOW; 3 = MODERATE; 4 = HIGH
12) SIFT prediction from Variant Effect Predictor: 1 = tolerated; 2 = tolerated_low_confidence; 3 = deleterious; 4 = deleterious_low_confidence
13) SIFT Score from Variant Effect Predictor (2 if missing)
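
The coded columns of the annotation file can be decoded with lookup tables, as in the Python sketch below. It assumes whitespace-delimited columns with no embedded spaces in the SNP id, and it maps any code not listed above (for example a SIFT prediction for a variant without one) to None, which is a choice made here rather than documented behaviour. All names and the example line are hypothetical.

```python
from typing import Dict

ANNOTATION = {"M": "missense", "S": "synonymous",
              "I": "intergenic or intronic", "O": "other", "U": "unknown"}
VEP_CLASS = {0: "other", 1: "synonymous", 2: "missense"}
VEP_IMPACT = {1: "MODIFIER", 2: "LOW", 3: "MODERATE", 4: "HIGH"}
SIFT_PREDICTION = {1: "tolerated", 2: "tolerated_low_confidence",
                   3: "deleterious", 4: "deleterious_low_confidence"}
SIFT_SCORE_MISSING = 2.0  # column 13 is set to 2 when the score is missing


def parse_annotation_line(line: str) -> Dict[str, object]:
    """Decode the 13 columns of one annotation line into labelled values."""
    f = line.split()  # ASSUMPTION: whitespace-delimited columns
    ref, alt = f[3], f[4]
    sift_score = float(f[12])
    return {
        "chrom": f[0],
        "pos": int(f[1]),
        "snp_id": f[2],
        "ref": ref,
        "alt": alt,
        "annotation": ANNOTATION.get(f[5]),
        # R = ancestral is REF, A = ancestral is ALT, U = unknown
        "ancestral_allele": ref if f[6] == "R" else alt if f[6] == "A" else None,
        "in_belgian_blue": f[7] == "1",
        "in_taaf": f[8] == "1",
        "vep_class": VEP_CLASS.get(int(f[9])),
        "vep_impact": VEP_IMPACT.get(int(f[10])),
        "sift_prediction": SIFT_PREDICTION.get(int(f[11])),  # None if code not listed
        "sift_score": None if sift_score == SIFT_SCORE_MISSING else sift_score,
    }


example = "1 12345 snp_example A G M R 1 0 2 3 3 0.01"  # made-up line
ann = parse_annotation_line(example)
print(ann["annotation"], ann["vep_impact"], ann["sift_prediction"], ann["sift_score"])
```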

An evaluation of inbreeding measures using a whole genome sequenced cattle pedigree
