Data from: Genome-wide scans reveal selection signatures and cross-population variation in South African and European beef cattle breeds
Data files
May 09, 2024 version files 36.41 MB
- CattleData.map
- CattleData.ped
- README.md
Abstract
In genetics and evolutionary biology, selection signatures are specific patterns in the genome that are associated with the process of natural selection. These signatures provide insights into how evolutionary forces have shaped a population over time. In this study, a total of 96 samples were collected on several farms from four cattle breeds, namely South African indigenous Nguni (n = 28) and Bonsmara (n = 21), Scottish Angus (n = 22), and Swedish Simmental (n = 25). Genotyped samples were subjected to quality control, and a total of 105,675 SNPs from 78 individuals remained for further analysis. Genomic signatures of positive selection within each breed were identified using the integrated haplotype score (iHS) method, and cross-population comparisons using the cross-population extended haplotype homozygosity (XP-EHH), relative extended haplotype homozygosity (Rsb), and fixation index (Fst) methods were used to assess the genetic differences between breeds. The results from the iHS method revealed selection signatures in two genomic regions for Bonsmara, six for Simmental, four for Nguni, and one for Angus cattle. Ten regions were found to be under selection, with BTA 12 being shared between Nguni and Bonsmara. Comparisons across populations using the Rsb and Fst methods performed better and revealed the most specific genomic regions that varied in selection between breeds. Gene annotation analyses linked candidate genes to several quantitative trait loci (QTL). For example, in Simmental cattle the FAM110B gene was linked to carcass weight and body conformation score. Bonsmara showed fewer candidate genes, such as CDK8 and FLT1, whereas Angus had none on BTA 18. Potential genes identified in Nguni included CRB1, PLAG2GA, and VASH2, with CDK8 shared by Bonsmara and Nguni on BTA 12.
Further cross-population analyses revealed candidate genes associated with certain traits, including PLCXD3, FAM149B1, and GRIK2 for Bonsmara versus Nguni, and SLIT2 and TSPAN9 for Simmental versus Angus. The study also highlighted genes related to meat quality, reproduction, health, illnesses, fertility, and body conformation score. A gene interaction analysis with the STRING database revealed a network of 63 candidate genes, demonstrating the structure of genetic connections and some biological processes. The study found that iHS performed well in population analysis, with Nguni cattle exhibiting the highest number of signatures across the genome, and significant signatures were also seen in comparisons between Nguni and Bonsmara using the Fst and Rsb methods. Furthermore, the study found that a larger number of genes were associated with various traits, including sperm count and inseminations per conception, susceptibility to bovine respiratory disease, and ease of calving. This genomic analysis underlined the significance of the genetic basis for breed-specific traits. This understanding has the potential to drastically improve selective breeding and increase desirable traits in cattle herds.
README: Cattle Genotyped Data
https://doi.org/10.5061/dryad.qjq2bvqq3
Populations and Samples
Ninety-six semen samples (Nguni: n = 28, Bonsmara: n = 21, Angus: n = 22, and Simmental: n = 25) were randomly collected from four provinces of South Africa: Mpumalanga (25.4294°S, 29.3306°E), Northwest (25.6318°S, 27.7829°E), Gauteng (26.2708°S, 28.1123°E), and Limpopo (23.0353°S, 29.6583°E). Following sampling, traits of the fresh bull semen, such as sperm motility and progressive motility, were individually recorded, and the samples were then cryopreserved at the Agricultural Research Council (Animal Production Germplasm Conservation & Reproductive Biotechnologies) in Irene, South Africa.
DNA extraction, genotyping, and quality control
Genomic DNA was extracted from semen samples using a NucleoMag® pathogen extraction kit according to the manufacturer's protocol. DNA was quantified using a Qubit® 2.0 and genotyped using the Illumina BovineSNP 150K BeadChip following the manufacturer's protocol. The SNP data generated were visualized and interpreted using Genome Studio 2.0 software. PLINK v1.07 [47] was used to filter the dataset with the following criteria: (1) remove SNPs with a call frequency below 90%, (2) remove individuals with more than 5% missing genotypes (--mind 0.05), (3) remove SNPs with more than 5% missing genotypes (--geno 0.05), (4) remove SNPs with a minor allele frequency below 5% (--maf 0.05), and (5) remove SNPs deviating from Hardy-Weinberg equilibrium (--hwe 0.00001, i.e. p < 0.00001). Subsequently, SNPs in high linkage disequilibrium were filtered out using the "--indep-pairwise" option with an r2 threshold of 0.2, and related individuals were identified using the "--genome" option, which computes genome-wide identity-by-descent (IBD) estimates, and removed with "--remove related.txt". A total of 23 samples failing to meet the criteria were excluded from the study. This resulted in a final data set of Nguni: n = 21, Bonsmara: n = 17, Angus: n = 20, and Simmental: n = 20.
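The missingness and minor allele frequency thresholds described above can be illustrated on a toy genotype matrix. The following is a minimal Python sketch, not the PLINK implementation: the function names, the 0/1/2/None genotype coding, and the order of the filters are assumptions made for illustration, and the HWE, LD pruning, and relatedness steps are not covered.

```python
from typing import List, Optional, Tuple

Genotype = Optional[int]  # 0, 1, 2 = copies of one allele; None = missing call


def missing_rate(calls: List[Genotype]) -> float:
    """Fraction of missing calls in a row (individual) or column (SNP)."""
    return sum(g is None for g in calls) / len(calls)


def minor_allele_freq(calls: List[Genotype]) -> float:
    """Minor allele frequency of one SNP, ignoring missing calls."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return 0.0
    p = sum(observed) / (2 * len(observed))
    return min(p, 1 - p)


def qc_filter(matrix: List[List[Genotype]],
              mind: float = 0.05,
              geno: float = 0.05,
              maf: float = 0.05) -> Tuple[List[int], List[int]]:
    """Return (kept individual indices, kept SNP indices).

    Individuals are filtered first, then SNPs are evaluated on the
    remaining individuals (an assumed order for this sketch).
    """
    kept_ind = [i for i, row in enumerate(matrix) if missing_rate(row) <= mind]
    kept_snp = []
    if not kept_ind:
        return kept_ind, kept_snp
    for j in range(len(matrix[0])):
        column = [matrix[i][j] for i in kept_ind]
        if missing_rate(column) <= geno and minor_allele_freq(column) >= maf:
            kept_snp.append(j)
    return kept_ind, kept_snp


# Toy data (hypothetical): 4 individuals x 3 SNPs.
toy = [
    [0, 1, 0],
    [1, 2, 0],
    [2, 1, 0],
    [None, None, None],  # fails the individual missingness filter
]
print(qc_filter(toy))
```

In this toy example the fourth individual is dropped for missingness and the third SNP is dropped because it is monomorphic among the remaining animals.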
Description of the data and file structure
Genotype data from cattle semen samples, converted to PLINK MAP and PED format. PLINK is a command-line program written in C/C++; all commands involve typing plink at the command prompt followed by a number of options (all starting with --option) to specify the data files and methods to be used (https://zzz.bwh.harvard.edu/plink/data.shtml). The PLINK PED file is a white-space (space or tab) delimited file whose first six columns are Family ID, Individual ID, Paternal ID, Maternal ID, Sex (1 = male; 2 = female; other = unknown), and Phenotype, followed by the genotype columns. The IDs are alphanumeric: the combination of family and individual ID should uniquely identify an individual, in this case a single animal. Because the samples are semen, every individual is a bull (male).
Each line of the MAP file describes a single marker and must contain exactly 4 columns: chromosome (1-22, X, Y or 0 if unplaced), SNP identifier, genetic distance (morgans), and base-pair position (bp units) (https://zzz.bwh.harvard.edu/plink/data.shtml). In this dataset the sex chromosomes have been excluded, so only chromosomes 1 to 29 are present, corresponding to the 29 bovine autosomes.
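The two file layouts described above can be read with a few lines of Python. This is a minimal sketch under the assumption that each PED line carries two allele columns per marker after the six leading fields. The function names and the example lines are hypothetical and are not taken from CattleData.map or CattleData.ped.

```python
from typing import Dict, List, Tuple


def parse_map_line(line: str) -> Dict[str, object]:
    """Parse one MAP line: chromosome, SNP ID, genetic distance, bp position."""
    chrom, snp_id, genetic_distance, bp = line.split()
    return {
        "chrom": chrom,
        "snp": snp_id,
        "genetic_distance": float(genetic_distance),
        "bp": int(bp),
    }


def parse_ped_line(line: str) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Parse one PED line into its six leading fields and genotype pairs."""
    fields = line.split()
    meta_keys = ["fid", "iid", "pat", "mat", "sex", "pheno"]
    meta = dict(zip(meta_keys, fields[:6]))
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("genotype columns must come in allele pairs")
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return meta, genotypes


# Hypothetical example lines, for illustration only.
marker = parse_map_line("12 snp_example_1 0 1234567")
meta, genos = parse_ped_line("NGU bull_01 0 0 1 -9 A G C C")
print(marker["chrom"], marker["bp"], meta["fid"], meta["sex"], genos)
```

The number of genotype pairs on each PED line should equal the number of lines in the MAP file, which is a quick consistency check when loading the two files together.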
Family ID abbreviation in the file
- NGU = Nguni Breed
- ANG = Angus Breed
- BON = Bonsmara Breed
- SIM = Simmental Breed
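As a quick check after download, the Family ID codes listed above can be used to tally individuals per breed. The sketch below assumes the first column of each PED line holds exactly one of these codes. The function name and the example lines are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

BREED_BY_FID = {
    "NGU": "Nguni",
    "ANG": "Angus",
    "BON": "Bonsmara",
    "SIM": "Simmental",
}


def count_breeds(ped_lines: Iterable[str]) -> Dict[str, int]:
    """Count individuals per breed from the first (Family ID) column."""
    counts: Counter = Counter()
    for line in ped_lines:
        if not line.strip():
            continue  # skip blank lines
        fid = line.split(None, 1)[0]
        counts[BREED_BY_FID.get(fid, "unknown")] += 1
    return dict(counts)


# Hypothetical lines for illustration; for the real file, an open handle
# such as open("CattleData.ped") could be passed instead.
example = [
    "NGU n1 0 0 1 -9 A A",
    "NGU n2 0 0 1 -9 A G",
    "SIM s1 0 0 1 -9 G G",
]
print(count_breeds(example))
```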
Code/Software
Integrated Haplotype Score (iHS) analysis
The integrated haplotype score (iHS) was used to measure the extent of extended haplotype homozygosity (EHH) around a selected allele compared with the alternative allele at the same locus. Positive iHS values indicate selection for the derived allele, while negative values suggest selection for the ancestral allele. Signatures were identified within each of the four cattle populations. To prepare input files for the analysis, quality-controlled binary files were generated for each population using PLINK v1.9. The resulting binary files were used to create variant call format (.vcf) files, which were sorted by base-pair position using bcftools and then phased with BEAGLE v5. The iHS scores were calculated in R v4.2.2 using the R package rehh with the function "ihh2ihs". Candidate regions were identified at an MAF of <0.05, and regions under significant selection were identified at p < 0.0001. The outputs were visualized in R v4.2.2 for all populations using Manhattan plots, and comma-separated value (.csv) files were created with the significant regions and chromosomes that passed a threshold of -log10(p-value) > 4 per population.
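The link between the p < 0.0001 cut-off and the threshold of 4 on the -log10 scale can be made concrete with a small Python sketch. It is not the rehh code: it assumes that unstandardized scores are standardized within allele-frequency bins and that standardized iHS is approximately standard normal, which is our understanding of the usual procedure. The bin width of 0.025 and the function names are assumptions for illustration.

```python
import math
from collections import defaultdict
from statistics import mean, stdev
from typing import List


def standardize_by_bin(scores: List[float], freqs: List[float],
                       bin_width: float = 0.025) -> List[float]:
    """Standardize unstandardized iHS within allele-frequency bins."""
    bins = defaultdict(list)
    for idx, f in enumerate(freqs):
        bins[int(f / bin_width)].append(idx)
    out = [float("nan")] * len(scores)
    for members in bins.values():
        if len(members) < 2:
            continue  # cannot estimate a standard deviation from one SNP
        values = [scores[i] for i in members]
        mu, sd = mean(values), stdev(values)
        if sd > 0:
            for i in members:
                out[i] = (scores[i] - mu) / sd
    return out


def neg_log10_p(ihs: float) -> float:
    """Two-sided -log10 p-value, assuming iHS follows N(0, 1)."""
    p = math.erfc(abs(ihs) / math.sqrt(2))  # equals 2 * (1 - Phi(|iHS|))
    return -math.log10(p)


# Toy values (hypothetical): two frequency bins with two SNPs each.
scores = [1.0, 3.0, 10.0, 14.0]
freqs = [0.11, 0.12, 0.41, 0.42]
print([round(z, 3) for z in standardize_by_bin(scores, freqs)])
print(round(neg_log10_p(4.0), 2), neg_log10_p(4.0) > 4)
```

Under the normal assumption, a SNP passes the threshold of 4 only when its standardized |iHS| is roughly 3.9 or larger.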
XP-EHH and Rsb analysis
Evidence of positive selection was further investigated by comparing pairs of populations using the XP-EHH and Rsb methods. Regions with selection signatures were identified using pairwise comparisons between populations (Nguni versus Bonsmara and Angus versus Simmental). XP-EHH is an extension of iHS and was used to measure the differences in EHH between populations. It can be used to identify loci that have undergone selection in one population but not the other. By comparing the EHH scores of two populations, the analysis considers SNPs that are monomorphic in one population and polymorphic in the other.
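To make the idea of comparing EHH between populations concrete, the sketch below computes allele-specific EHH on toy phased haplotypes. It is an illustration only: the haplotype strings and the function name are hypothetical, and the cross-population statistics used in practice are built on integrated, site-specific homozygosity rather than on this single-distance value.

```python
from collections import Counter
from math import comb
from typing import List


def ehh(haplotypes: List[str], core: int, pos: int, core_allele: str) -> float:
    """EHH between marker index `core` and `pos` for carriers of `core_allele`.

    EHH is the probability that two randomly chosen carrier haplotypes
    are identical at every marker from `core` to `pos` inclusive.
    """
    lo, hi = sorted((core, pos))
    carriers = [h[lo:hi + 1] for h in haplotypes if h[core] == core_allele]
    n = len(carriers)
    if n < 2:
        return float("nan")
    groups = Counter(carriers)
    return sum(comb(k, 2) for k in groups.values()) / comb(n, 2)


# Toy phased haplotypes (hypothetical), one string per chromosome copy.
pop_a = ["1000", "1000", "1000", "1001"]  # long shared haplotype
pop_b = ["1000", "1010", "1101", "1011"]  # haplotypes break down quickly

for pos in range(4):
    print(pos, ehh(pop_a, 0, pos, "1"), ehh(pop_b, 0, pos, "1"))
```

In this toy case EHH stays high with distance in the first population and decays quickly in the second, which is the kind of contrast a cross-population statistic is designed to detect.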
Rsb compares the EHH of selected alleles between two populations and can also be used to identify loci that have undergone recent positive selection within one population compared to another. The statistic is the standardized log ratio of the integrated site-specific EHH of the two populations. The R package rehh was used to determine the genomic regions under selection for XP-EHH and Rsb using the functions "ies2xpehh" and "ines2rsb", respectively. Candidate regions were identified at an MAF of <0.05, and significant regions were identified at p < 0.001. The outputs were visualized using Manhattan plots, and .csv files were created with the significant regions and chromosomes (p < 0.001).
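Both cross-population statistics can be sketched as a standardized log ratio of integrated homozygosity values. The Python example below is a hedged illustration, not the rehh implementation: it assumes, as we understand the rehh documentation, that XP-EHH is centred on the mean and Rsb on the median of the log ratios, and it takes the integrated values as given inputs. All names and numbers are hypothetical.

```python
import math
from statistics import mean, median, stdev
from typing import List


def log_ratios(pop1: List[float], pop2: List[float]) -> List[float]:
    """Per-SNP natural log ratio of integrated homozygosity, pop1 over pop2."""
    return [math.log(a / b) for a, b in zip(pop1, pop2)]


def xpehh(ies_pop1: List[float], ies_pop2: List[float]) -> List[float]:
    """Standardize the log ratios using their mean and standard deviation."""
    lr = log_ratios(ies_pop1, ies_pop2)
    mu, sd = mean(lr), stdev(lr)
    return [(x - mu) / sd for x in lr]


def rsb(ines_pop1: List[float], ines_pop2: List[float]) -> List[float]:
    """Standardize the log ratios using their median and standard deviation."""
    lr = log_ratios(ines_pop1, ines_pop2)
    med, sd = median(lr), stdev(lr)
    return [(x - med) / sd for x in lr]


# Hypothetical integrated values for four SNPs in two populations.
pop1 = [2.0, 4.0, 8.0, 64.0]
pop2 = [2.0, 2.0, 2.0, 2.0]
print([round(v, 3) for v in xpehh(pop1, pop2)])
print([round(v, 3) for v in rsb(pop1, pop2)])
```

Large positive values point to longer haplotypes in the first population and large negative values to longer haplotypes in the second. The two statistics differ here only because the toy log ratios are skewed, so their mean and median differ.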
Fst analysis
Fst was used to measure population differentiation due to genetic structure. High Fst values suggest genetic differences between populations. While not a direct measure of selection, high Fst values can indicate regions of the genome that have been under selective pressure. The Fst computation was performed in PLINK v1.9 using the --fst command, while the qqman package in R was used to visualize the results as Manhattan plots.
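The intuition behind Fst can be shown with a simple per-SNP calculation from allele frequencies. The sketch below uses the basic heterozygosity-based definition, Fst = (H_T - H_S) / H_T, with two equally weighted populations. This is a deliberately simplified stand-in: to our understanding PLINK's --fst uses the Weir and Cockerham estimator, which also corrects for sample size, so values will not match PLINK output exactly. The function name and frequencies are hypothetical.

```python
def wright_fst(p1: float, p2: float) -> float:
    """Fst for one biallelic SNP from the allele frequencies of two populations."""
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)                      # pooled expected heterozygosity
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population value
    return (h_t - h_s) / h_t if h_t > 0 else float("nan")


# Hypothetical allele frequencies.
print(round(wright_fst(0.9, 0.1), 2))  # strongly differentiated SNP
print(round(wright_fst(0.5, 0.5), 2))  # identical frequencies, no differentiation
```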
Gene Functional Annotation
Gene annotation was carried out on the genomic regions identified as positive signatures of selection by all methods used (iHS, XP-EHH, Rsb, and Fst). Genes were annotated against the cattle genome assembly ARS-UCD1.2 using BioMart, a tool in Ensembl. Furthermore, ShinyGO v0.77 (http://bioinformatics.sdstate.edu/go/) was used to carry out functional and pathway analysis, including Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, of the identified genes. The STRING database was used to compute the interaction network between genes.
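Annotating a candidate region amounts to finding the genes whose coordinates overlap it on the same chromosome. The following Python sketch shows that overlap logic only; it does not query BioMart, and the coordinates, region names, and gene names are hypothetical placeholders rather than values from ARS-UCD1.2.

```python
from typing import Dict, List, NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str


def genes_in_regions(regions: List[Interval],
                     genes: List[Interval]) -> Dict[str, List[str]]:
    """Map each region name to the genes overlapping it on the same chromosome."""
    hits: Dict[str, List[str]] = {}
    for region in regions:
        hits[region.name] = [
            g.name
            for g in genes
            if g.chrom == region.chrom
            and g.start <= region.end
            and g.end >= region.start
        ]
    return hits


# Hypothetical coordinates and names, for illustration only.
regions = [Interval("12", 1_000, 5_000, "region_1")]
genes = [
    Interval("12", 4_500, 6_000, "GENE_A"),  # overlaps region_1
    Interval("12", 7_000, 8_000, "GENE_B"),  # same chromosome, no overlap
    Interval("18", 2_000, 3_000, "GENE_C"),  # different chromosome
]
print(genes_in_regions(regions, genes))
```

The resulting gene list per region is the kind of input that can then be passed to enrichment or interaction tools.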