Data from: The polygenic strategies of host-specific and general virulence of Botrytis cinerea across diverse eudicot hosts
Data files
May 09, 2025 version files 34.39 MB
-
Github_Eudicot_GWAS.zip
34.38 MB
-
README.md
12.50 KB
Abstract
Diverse qualitative and quantitative genetic architectures can successfully enable fungal virulence and host range. To model the quantitative genetic architecture of a generalist pathogen with an extensive host range, we conducted a genome-wide association study (GWAS) of the lesion area of Botrytis cinerea across eight hosts. This revealed that virulence, as defined by the lesion area, could be partitioned into a component common to all hosts and host-specific components. All traits showed that a large proportion of the Botrytis genome likely contributes to fungal lesion development on leaves with small effect sizes. The candidate genes are evenly spread across the core chromosomes with no indication of bipartite genomic architecture. The GWAS-identified polymorphisms and genes show that B. cinerea relies on genetic variants across hundreds of genes for growing on diverse hosts, with most genes influencing relatively few hosts. When pathogen genes were associated with multiple hosts, they were associated with unrelated rather than related host species. Comparative genomics further suggested that the GWAS-identified genes are largely syntenic with other specialist Botrytis species and not unique to B. cinerea. Overall, as shown in A. thaliana, B. cinerea’s generalist behavior is derived from the sum of the genome-wide genetic variation acting within gene networks that differentially coordinate the interaction with diverse hosts.
https://doi.org/10.5061/dryad.j6q573npm
Description of the data and file structure
Corresponding publication: Dryad Repository
Author: Celine Caseys, Dept. of Plant Sciences, University of California, Davis
Project Overview
This repository contains data and code used in the genome-wide association study (GWAS) of Botrytis cinerea virulence across eight eudicot crop species. The project investigates host-specific and general virulence strategies through polygenic analyses, including Bayesian Sparse Linear Mixed Models (BSLMM), multivariate MASH analysis, SNP annotation, and comparative genomics.
Notebook files (.html, .pdf, or .Rmd) provide detailed documentation of the datasets, analytical pipelines, and scripts.
Main Archive
File: Github_Eudicot_GWAS.zip
Description:
Contains an RStudio project (Github_Eudicot_GWAS.Rproj) which sets the appropriate working directory. Some folders may appear empty depending on your operating system or RStudio version.
Files and variables
1. BSLMM of lesion area across 85 plant genotypes
The input files for GEMMA are in the PLINK file format (https://www.cog-genomics.org/plink/1.9/input#ped).
Subfolder:
- Gemma
Associated files:
-
bslmm_Eudicot98.sh :
Bash script to run GEMMA on Linux.
PLINK Files:
-
Cor_MAF20NA10.bed
Binary SNP matrix. This is an input file for GEMMA. Because it is a binary file, it is not human-readable when opened in a text editor.
-
Cor_MAF20NA10.bim
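SNP locations accompanying the binary SNP matrix (.bed). This is an input file for GEMMA.
Column 1: Chromosome, Column 2: SNP identifier, Column 3: empty, Column 4: SNP position, Columns 5 and 6: the two alleles of the SNP (allele 1 and allele 2 in the PLINK .bim format), as read by GEMMA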
-
Cor_MAF20NA10.fam
This is an input file for GEMMA. It contains the phenotypes in columns 7 and following. Columns 1-6 hold the strain information, following the PLINK .fam format (https://www.cog-genomics.org/plink/2.0/formats#fam):
Column 1: Family ID (‘FID’)
Column 2: Individual ID (‘IID’; cannot be ‘0’)
Column 3: Individual ID of father (‘0’ if father isn’t in dataset)
Column 4: Individual ID of mother (‘0’ if mother isn’t in dataset)
Column 5: Sex code (‘1’ = male, ‘2’ = female, ‘0’ = unknown)
Column 6: Phenotype value (‘1’ = control, ‘2’ = case, ‘-9’/‘0’/non-numeric = missing data if case/control)
-
Eudicot_GWAS_revision.fam
This is an input file for GEMMA. It contains the phenotypes in columns 7 and following, here the general virulence and residual datasets. The names and order of the traits are listed in the file Revisions_traitList.txt. Columns 1-6 hold the strain information, following the PLINK .fam format (https://www.cog-genomics.org/plink/2.0/formats#fam):
Column 1: Family ID (‘FID’)
Column 2: Individual ID (‘IID’; cannot be ‘0’)
Column 3: Individual ID of father (‘0’ if father isn’t in dataset)
Column 4: Individual ID of mother (‘0’ if mother isn’t in dataset)
Column 5: Sex code (‘1’ = male, ‘2’ = female, ‘0’ = unknown)
Column 6: Phenotype value (‘1’ = control, ‘2’ = case, ‘-9’/‘0’/non-numeric = missing data if case/control)
-
Cor_MAF20NA10.map
Non-binary, contains the SNP locations
Column 1: Chromosome, Column 2: SNP identifier, Column 3: empty, Column 4: SNP position.
-
Cor_MAF20NA10.ped
Non-binary SNP matrix, for all strains across all SNPs.
-
Cor_MAF20NA10_kmat1_pheno1.cXX.txt
Kinship matrix from GEMMA (centered matrix)
-
Cor_MAF20NA10_kmat2_pheno1.sXX.txt
Kinship matrix from GEMMA (standardized matrix)
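The column layouts described above can be read with a few lines of code. The following is a minimal Python sketch, not part of the repository (whose scripts are in R and Bash). It assumes whitespace-delimited files, treats `NA` and `-9` as missing phenotypes (an assumption), and takes columns 5 and 6 of the .bim file to be alleles, as in the PLINK specification. The function and field names are hypothetical.

```python
from typing import List, NamedTuple, Optional

# Tokens treated as missing phenotype values (an assumption for this sketch).
MISSING_TOKENS = {"NA", "-9"}


class FamRecord(NamedTuple):
    family_id: str
    individual_id: str
    father_id: str
    mother_id: str
    sex: str
    phenotype: str                      # column 6 of the .fam file, kept as text
    traits: List[Optional[float]]       # columns 7 and following


class BimRecord(NamedTuple):
    chromosome: str
    snp_id: str
    genetic_distance: str               # column 3, described as empty in this dataset
    position: int
    allele1: str
    allele2: str


def parse_trait(token: str) -> Optional[float]:
    """Convert one phenotype token to a float, or None when missing."""
    return None if token in MISSING_TOKENS else float(token)


def parse_fam_line(line: str) -> FamRecord:
    """Split one .fam line into the six strain columns and the trait columns."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError("a .fam line needs at least six columns")
    traits = [parse_trait(token) for token in fields[6:]]
    return FamRecord(*fields[:6], traits)


def parse_bim_line(line: str) -> BimRecord:
    """Split one .bim line into its six columns."""
    chrom, snp_id, distance, position, allele1, allele2 = line.split()
    return BimRecord(chrom, snp_id, distance, int(position), allele1, allele2)
```

Applied line by line to `Cor_MAF20NA10.fam`, the `traits` list would follow the order given in the trait list files described in section 2.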
R Code:
-
Prep_file-rand.R
Create the randomized datasets for permutation-based null hypothesis testing. The output is ready to be copied into the .fam file and run through GEMMA. The script randomizes the phenotypes across the 96 strains while the SNP information remains associated with each strain.
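The permutation idea can be illustrated as follows. This is a hedged Python sketch, not a translation of Prep_file-rand.R: it assumes each .fam row is a list of string fields and shuffles whole phenotype rows (columns 7 and following) among strains, so that columns 1-6 and the genotype order stay fixed. Whether the original script shuffles traits jointly or independently is not stated in this README; joint shuffling is an assumption here.

```python
import random
from typing import List


def permute_phenotypes(fam_rows: List[List[str]], seed: int) -> List[List[str]]:
    """Return a copy of the .fam rows with phenotype columns shuffled across strains.

    Columns 1-6 (strain information) keep their original order, so each
    strain stays aligned with its SNP data in the .bed file, while the
    phenotype columns (7 and following) are reassigned at random.
    """
    rng = random.Random(seed)            # seeded for reproducible permutations
    phenotypes = [row[6:] for row in fam_rows]
    rng.shuffle(phenotypes)
    return [row[:6] + pheno for row, pheno in zip(fam_rows, phenotypes)]
```

Running the function with different seeds yields independent randomized datasets, each of which can be written back as a .fam file and passed to GEMMA to build a null distribution.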
2. Analysis of BSLMM hyperparameters
Associated files:
-
all_hyperParameters.txt
hyperparam: The BSLMM hyperparameter, one of h (heritability), PVE (proportion of phenotypic variance explained), rho (approximation to the proportion of genetic variance explained by variants with major effects), PGE (proportion of genetic variance explained by sparse effects), pi (proportion of variants with non-zero effects), or n.gamma (number of variants with major effect).
The mean, median, and 2.5% and 97.5% quantiles of each hyperparameter are reported for each of the 20 BSLMM runs (rep 1-20) for each of the 85 genotypes (coded in GenoN) across the 8 host species (coded in Species). Trait1 is the combination of the host species and genotype.
-
I93_Eudic7_traitList.txt
Num_ID: Numerical identifier for each trait. Numbers 2-8 are the mean lesion area for each host species. Numbers 9-98 are the mean lesion area for each genotype.
Name: species name or accession identifier.
-
Revisions_traitList.txt
Num_ID: Numerical identifier for each trait.
Name: Trait description as either model corrected emmean or Residuals (Res).
R Code:
-
HyperParam_extract.R
Import, analyze and plot the BSLMM hyperparameters.
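The summary statistics stored in all_hyperParameters.txt (mean, median, 2.5% and 97.5% quantiles) can be computed from posterior samples as sketched below. This is an illustrative Python example using only the standard library, not the code of HyperParam_extract.R. The function name and the dictionary keys are hypothetical, and the quantiles use linear interpolation, which may differ slightly from the method used by the authors.

```python
import statistics
from typing import Dict, Sequence


def summarize_posterior(samples: Sequence[float]) -> Dict[str, float]:
    """Summarize posterior samples of one hyperparameter from one BSLMM run.

    statistics.quantiles with n=40 returns 39 cut points at steps of 2.5%,
    so the first cut point is the 2.5% quantile and the last is the 97.5%
    quantile.
    """
    cuts = statistics.quantiles(samples, n=40, method="inclusive")
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "q2.5": cuts[0],
        "q97.5": cuts[-1],
    }
```

Calling `summarize_posterior` once per hyperparameter, run, and trait would produce one row of the kind described for all_hyperParameters.txt.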
3. MASH Analysis
R Code:
-
Mash_datAnalysis.R
Import the results from the BSLMM and run the multivariate adaptive shrinkage (MASH) pipeline to determine the significance of SNPs. Return the local false sign rate (LFSR).
4. Analysis of significant SNPs
Associated files:
-
Signlsfr_85_Geno_SNPs_binom_new.txt
SNP_ID is the combination of the chromosome number and the SNP position.
All other columns are the binary matrix of significant (1) or non-significant (0) association of the trait (see the column name) to the SNP.
-
AllSNP_CountBySpecie_Table.txt
Num_ID: Numerical row ID
snp_id: the combination of the chromosome number and the SNP position.
All other columns are the binary matrix of significant (1) or non-significant (0) association of the species (see the column name) with the SNP.
N_species: the number of species significantly associated with a SNP.
-
SNP_genomicLoc_plot_perc.txt
Overview of the percentage of variance explained by each parameter.
Cat: Category as (3' UTR, 5' UTR, CDS functional, CDS synonymous, Intergenic, Intron, Intron with splice).
Perc: Percentage of variance explained by the parameter (in %).
Species: Each of the 8 species or for the whole genome (WG) as control.
-
AllSNP_CountBySpecies_1.txt
N_Species: Value from 1 to 7, the total number of species a gene was associated with.
Species: Each of the 8 species.
Counts: For each species, the number of genes associated with 1 species (unique to that species) up to 7 species (shared genes).
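The binary significance matrices and the N_species count described above can be derived from LFSR values as sketched here. This Python example is an illustration only. The LFSR threshold is passed as a parameter because this README does not state the cutoff used, and the SNP identifiers and species keys in any call are hypothetical.

```python
from typing import Dict

LfsrTable = Dict[str, Dict[str, float]]      # snp_id -> species -> LFSR
BinaryTable = Dict[str, Dict[str, int]]      # snp_id -> species -> 0 or 1


def significance_matrix(lfsr: LfsrTable, threshold: float) -> BinaryTable:
    """Code each SNP-by-species association as 1 (LFSR below threshold) or 0."""
    return {
        snp: {species: int(value < threshold) for species, value in row.items()}
        for snp, row in lfsr.items()
    }


def count_species(binary: BinaryTable) -> Dict[str, int]:
    """Count, for each SNP, how many species are significantly associated."""
    return {snp: sum(row.values()) for snp, row in binary.items()}
```

For example, a SNP with LFSR values below the chosen threshold on two hosts and above it on all others would receive an N_species of 2.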
R Code:
-
SlidingWindows_BSLMM.R
Contains the function that calculates the effect of SNPs in sliding windows along the genome.
-
SNP_sliding_window.R
Import, analyze and plot the sliding window analysis.
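A sliding-window summary of SNP effects can be sketched as follows. This is a minimal Python illustration of the general technique, not the function in SlidingWindows_BSLMM.R. The window size, step, and coordinate conventions (1-based, inclusive) are assumptions, and effect sizes are summed as given without taking absolute values.

```python
from typing import List, Tuple

Window = Tuple[int, int, int, float]   # (start, end, center, summed effect)


def sliding_window_sums(
    snps: List[Tuple[int, float]],     # (position, effect size) on one chromosome
    chrom_length: int,
    window: int,
    step: int,
) -> List[Window]:
    """Sum SNP effect sizes in windows of fixed width along one chromosome.

    A window without any SNP gets a sum of 0.
    """
    results = []
    start = 1
    while start <= chrom_length:
        end = min(start + window - 1, chrom_length)
        total = sum(effect for position, effect in snps if start <= position <= end)
        results.append((start, end, (start + end) // 2, total))
        start += step
    return results
```

Setting `step` smaller than `window` gives overlapping windows; setting them equal tiles the chromosome without overlap. The linear scan per window is simple rather than fast, which is acceptable for a sketch.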
5. Analysis of significant genes
Associated files:
-
All271k_SNPs_Archive.txt
snp_id: the combination of the chromosome number and the SNP position.
seqname: the combination of the chromosome number and the SNP position and nucleotide for the alleles.
chromosome: chromosome number
start: SNP location
Distance: Distance of the SNP to the nearest gene. 0 means the SNP is located inside a gene
Gene final: Identifier of the closest gene to a SNP.
Type final: Functional annotation of the SNP.
G500: Association of a SNP to a gene if within 500bp of the start/stop of the gene.
G1000: Association of a SNP to a gene if within 1000bp of the start/stop of the gene.
G1500: Association of a SNP to a gene if within 1500bp of the start/stop of the gene.
gene_ID: Association of a SNP to a gene only if within a gene.
gene_region: as CDS, intron, 3' UTR, 5' UTR.
-
Gene_Binary_venn_Jan2024.txt
Matrix of 7 species (columns) across significant genes (rows).
-
Annotations_plot.txt
Process: Annotations for each of the top 10 categories
Num: Count of genes in each of the top 10 categories
Cat: Categories as GO process, Interpro annotations, GO Component, Annotation categories
-
ALLSW_SummEff.txt
File with the sliding windows summary of the effect sizes.
Start_time: Position of the beginning of the window
end_time: Position of the end of the window
center: Position of the center of the window
summary: Sum of the effect sizes of the SNPs within the window. 0 means there were no SNPs to summarize.
chr: chromosome number
Index: Continuous index across chromosomes to allow plotting.
SumEff: Sum of the effect sizes of the SNPs within the window. NA means there were no SNPs to summarize.
Species: Each of the 8 species
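The Distance, G500, G1000, and G1500 columns of All271k_SNPs_Archive.txt describe how SNPs are linked to nearby genes. The Python sketch below illustrates that kind of assignment under stated assumptions: genes are given as hypothetical (gene_id, start, end) tuples on a single chromosome, coordinates are inclusive, and ties between equally distant genes go to the first gene listed. It is not the authors' annotation code.

```python
from typing import List, Optional, Tuple

Gene = Tuple[str, int, int]   # (gene_id, start, end), hypothetical layout


def distance_to_gene(position: int, gene: Gene) -> int:
    """Distance in bp from a SNP to a gene; 0 when the SNP lies inside it."""
    _, start, end = gene
    if position < start:
        return start - position
    if position > end:
        return position - end
    return 0


def nearest_gene(position: int, genes: List[Gene]) -> Tuple[str, int]:
    """Return the identifier of the closest gene and its distance to the SNP."""
    best = min(genes, key=lambda gene: distance_to_gene(position, gene))
    return best[0], distance_to_gene(position, best)


def assign_within(position: int, genes: List[Gene], flank: int) -> Optional[str]:
    """Assign the SNP to its nearest gene only if within `flank` bp of it."""
    gene_id, distance = nearest_gene(position, genes)
    return gene_id if distance <= flank else None
```

Calling `assign_within` with `flank` set to 500, 1000, or 1500 would mirror the three window sizes, and `flank=0` would correspond to assigning a SNP only when it lies within a gene.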
R Code:
-
SNP_genomic _features.R
Import, analyse and plot SNP data about their location within genomic features.
-
Genes_Clustering.R
Import data and perform hierarchical clustering.
6. Co-expression networks analysis
Associated files:
-
col0_allreads.csv
Co-expression data for A. thaliana Col-0 from Zhang et al. 2017
Isolate: Each of the 97 Botrytis strains
HostGenotype : Col-0
All other columns are the matrix of Botrytis gene expression.
-
Mash_AllGenes_500bp.txt
All candidate genes with significant SNPs (identified by BSLMM+MASH) located within the gene or within 500 bp of the start or stop of the CDS
R Code:
-
Pipeline_GeneNetwork.R
Import data, filter based on presence in the gene expression dataset, calculate co-expression matrix and generate the gene co-expression network.
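The steps named for Pipeline_GeneNetwork.R (filter, correlate, build a network) can be sketched in Python as below. This is an illustration under assumptions, not the authors' pipeline: it uses Pearson correlation and an absolute-correlation cutoff supplied by the caller, neither of which is specified in this README, and the gene identifiers in any call are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def coexpression_edges(
    expression: Dict[str, List[float]],   # gene -> expression across isolates
    candidates: List[str],                # candidate genes from the GWAS
    threshold: float,                     # minimum |r| for an edge (assumed cutoff)
) -> List[Tuple[str, str, float]]:
    """Keep candidates present in the expression data and link correlated pairs."""
    present = [gene for gene in candidates if gene in expression]
    edges = []
    for gene_a, gene_b in combinations(present, 2):
        r = pearson(expression[gene_a], expression[gene_b])
        if abs(r) >= threshold:
            edges.append((gene_a, gene_b, r))
    return edges
```

The resulting edge list can be loaded into any graph library or network viewer; genes absent from the expression dataset are dropped before any correlation is computed.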
7. Comparative genomics via orthologs analysis
Subfolder:
- Orthologs
Associated files:
-
Botrytis_12_ortho_R.xlsx
Cluster & Cluster_ID: Cluster of orthologous genes
representative function: functional annotation for the orthogroup
Botrytis_cinerea_B05.10: B. cinerea genes associated with the orthogroup
Botrytis_aclada: B. aclada genes associated with the orthogroup
Botrytis_deweyae: B. deweyae genes associated with the orthogroup
Botrytis_fragariae: B. fragariae genes associated with the orthogroup
Botrytis_porri: B. porri genes associated with the orthogroup
Botrytis_sinoallii: B. sinoallii genes associated with the orthogroup
Botrytis_hyacinthi: B. hyacinthi genes associated with the orthogroup
-
Genes_GWS_Sp_SNP.txt
Genes_GWAS: B. cinerea gene IDs
Gene: Gene IDs as coded in the orthogroups
Columns C-J: Matrix of significant (1) or non-significant (0) association with a species (see the column name)
N_species: The number of species a gene was significantly associated with.
CDS: Number of significant SNPs of that gene located in the CDS.
Intron: Number of significant SNPs of that gene located in introns.
five_prime_UTR: Number of significant SNPs of that gene located in the 5' UTR region.
three_prime_UTR: Number of significant SNPs of that gene located in the 3' UTR region.
Within_500bp: Number of significant SNPs of that gene located in the promoter region (500 bp downstream or upstream of the gene).
Total_SNP: Total number of SNPs significantly associated with that gene.
R Code:
-
Orthologs_analysis.R
Import, analyze and plot the orthologs dataset.
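One question the ortholog table can answer is which candidate genes lack orthologs in the other Botrytis species. The Python sketch below shows one way to compute this, assuming each orthogroup is represented as a mapping from species column name to a list of gene IDs. This representation, the function name, and the gene IDs in any call are hypothetical, and the code is not taken from Orthologs_analysis.R.

```python
from typing import Dict, List

Orthogroup = Dict[str, List[str]]   # species column name -> gene IDs in the group


def genes_without_orthologs(
    candidates: List[str],
    orthogroups: List[Orthogroup],
    reference: str,
    others: List[str],
) -> List[str]:
    """List candidate genes whose orthogroup has no member in any other species.

    Candidates that belong to no orthogroup at all are also returned.
    """
    group_of = {}
    for group in orthogroups:
        for gene in group.get(reference, []):
            group_of[gene] = group
    unmatched = []
    for gene in candidates:
        group = group_of.get(gene)
        if group is None or not any(group.get(species) for species in others):
            unmatched.append(gene)
    return unmatched
```

Dividing the length of the returned list by the number of candidates gives the proportion of candidate genes without a detected ortholog in the compared species.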
The .RData file is a binary file used by R to save the entire workspace, including variables, data frames, functions, and other R objects.
It allows users to reload saved objects exactly as they were during the last session using load(".RData") in R.
Signlsfr_90Geno_SNPs:
This file is a SNP presence-absence matrix across plant genotypes, with each row representing a SNP and columns indicating its presence or absence in individual samples. It’s used to assess genetic variation across species.
Annotation_plot.R
This R script reads gene annotation data, splits it by category, creates bar plots for each subset (Annotations, GO Component, InterPro, and GO Process), and arranges them into a single 2×2 panel figure.
Code/software
The code was written for and runs with the following R version:
R version 3.6.0 (2019-04-26) -- "Planting of a Tree"
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Access information
Data was derived from the following sources:
To begin querying how the host range and virulence are determined in Botrytis cinerea, we used previous measurements of disease caused by a population of 96 diverse Botrytis isolates across an array of eudicots including tomato, sunflower, lettuce, chicory, endive, turnip, Arabidopsis, and soybean (Caseys et al., 2021). In this work, we use these phenotypic measurements to identify the Botrytis genes that may shape host susceptibility. Combining the phenotypic measurements with the genomic variation in the Botrytis population, we mapped and analyzed the genetic architecture of host preferences across Botrytis isolates using a genome-wide association study (GWAS). Given the quantitative nature of Botrytis virulence, a Bayesian sparse linear mixed model (BSLMM) using the Markov chain Monte Carlo algorithm implemented in GEMMA was run for lesion area on each plant genotype. To incorporate the phenotypic information potentially provided by lesion area measured on up to 12 plant genotypes per host species, we implemented a multivariate approach to the BSLMM using multivariate adaptive shrinkage (MASH). To estimate the proportion of the candidate genes that might have recently evolved within B. cinerea, we performed comparative genomics analysis with 7 Botrytis species. B. cinerea B05.10 genes orthologous to B. fragariae, B. aclada, B. deweyae, B. porri, B. hyacinthi, and B. sinoallii were called by OrthoMCL.
